Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues
Galit Shmueli (徐茉莉), National Tsing Hua University
Stu Hunter Research Conference, Waterloo CA, March 2016
What is Behavioral Big Data (BBD)?
• Special type of Big Data
• Behavioral: people’s actions, interactions, self-reported opinions, thoughts, feelings
• Human and social aspects: intentions, deception, emotion, reciprocation, herding, …
• When aware of data collection -> modified behavior (legal risks, embarrassment, unwanted solicitation)
BBD vs. Medical Big Data
Medical Big Data:
• Physical measurements
• Data collection timing often set by medical system
• Clinical trials: awareness & vested interest
BBD:
• People’s daily actions, interactions, self-reported feelings, opinions, thoughts (UGC)
• Data generation timing often chosen by user
• Experiments: users often unaware; goal not always in user’s interest
BBD on Citizens and Customers
Governments: security, law enforcement, traffic (cameras, sensors)
Financial institutions: fraud, loans (IT systems, cameras)
Telecoms: fraud, infrastructure, marketing (IT systems, mobile)
Retail chains: marketing, operations, merchandising (POS systems, video, social, mobile)
Insurance: set Usage-Based Insurance premiums (telematics info)
Data Collection Technologies:
• Cameras
• Sensors
• IT systems (POS, calls, …)
• GPS
• Things
• Internet
• Mobile
• Social
BBD on Employees
Service providers: quality control, employee performance
Electronic Performance Monitoring (EPM) systems, web surfing, e-mails sent and received, telephone use, video, location (taxis)
BBD on Citizens, Customers, Employees: Internet!
• BBD now also available to small companies & organizations
• Online platforms have BBD (e-commerce, gaming, search, social networks, …)
• Voluntarily entered by users: personal details, photos, comments, messages, search terms, bids in auctions, likes, payment information, connections with “friends”
• Passive footprints: duration on the website, pages browsed, sequence, referring website, Internet browser, operating system, location, IP address
• BBD now available to individuals: Quantified Self (and apps)
Two important points
• More and more human and social activities are moving online
• Most companies that have BBD were not created for the purpose of generating BBD
Why should industrial statisticians care about BBD?
Technology is advancing in two directions:
• Fully automated (algorithmic) solutions
• Micro-level recording of human and social behavior
Industrial statisticians are (and should be) involved in designing both!
Research using BBD
Duncan Watts, Microsoft Research:
1. Social science problems are almost always more difficult than they seem
2. The data required to address many problems of interest to social scientists remain difficult to assemble
3. Thorough exploration of complex social problems often requires the complementary application of multiple research traditions
Academic Research Qs using BBD
Research about human and social behavior:
• examine new phenomena
• re-examine old phenomena with better data
Research Communities
Researchers with social science + technical backgrounds:
• Information Systems
• Marketing
• Computational Social Science
Examples of BBD Studies in Top Journals
Consumption in Virtual Worlds (Hinz et al., Info Sys Research, 2015)
“The idea that conspicuous consumption can increase social status, as a form of social capital, has been broadly accepted, yet researchers have not been able to test this effect empirically.”
• age-old sociology question with new BBD
• BBD from two virtual world websites (gaming with social network)
Social Influence in Social News Websites (Muchnik et al., Science, 2014)
“The recent availability of population-scale data sets on rating behavior and social communication enable novel investigations of social influence...”
• Existing question in new context: study social influence bias in rating behavior
• BBD from a social news aggregation website where users contribute news articles, discuss them, and rate comments
Online Consumer Ratings of Physicians (Gao et al., Information Systems Research, 2014)
“examine how closely the online ratings reflect patients’ opinion about physician quality at large.”
• new phenomenon of online ratings of service providers
• BBD on direct measures of both the offline population’s perception of physician quality, and consumer-generated online reviews
Impact of Teachers on Student Outcomes using Education and Tax BBD (Chetty et al., Amer Econ Review, 2014)
• long-term impact of teachers on student outcomes has been of interest in economic policy: old question with new BBD
• combined BBD from administrative school district records and federal income tax records
Emotional Contagion in Social Networks (Kramer et al., Proc of the National Academy of Sciences, 2014)
• Can emotional states be transferred to others via emotional contagion?
• BBD from large-scale experiment run by FB, manipulating users’ exposure level to emotional expressions in their Facebook News Feed
Anonymous Browsing in Online Dating Websites (Bapna et al., Management Science, 2016)
“Online dating platforms offer new capabilities, such as extensive search, big data–based recommendations, and varying levels of anonymity, whose parallels do not exist in the physical world...”
• new questions about human behavior due to new technologies
• BBD from large-scale experiment, partnered with large dating website in N. America, testing the effect of anonymous browsing on matching
ONE WAY MIRRORS IN ONLINE DATING: A Randomized Field Experiment
Ravi Bapna, University of Minnesota
Jui Ramaprasad, McGill University
Galit Shmueli, National Tsing Hua University
Akhmed Umyarov, University of Minnesota
Online Dating
46% of the single population in the US uses online dating to find a partner (Gelles 2011)
[Figure: Online dating website. Non-anonymous browsing (default): a profile visit shows up under “Recent visitor”. Anonymous browsing: after a profile visit, “Recent visitor” shows NONE.]
Research Question (in simple words)
How does anonymous browsing affect user behavior?
… and matching?
Formal Research Question
What is the relative causal effect of social inhibitions on search preferences vs. social inhibitions on contact initiation in dating markets?
Given known gender asymmetries, how does this effect differ for men vs. women?
Randomized Field Experiment on Large Online Dating Website
50,000 users receive gift of anonymous browsing
Results
Users treated with anonymity:
• become disinhibited: view more profiles, view more same-sex and interracial mates
• get fewer matches: lose ability to leave a weak signal (especially harmful for women!)
Role of anonymity and importance of WEAK SIGNAL in online platforms
BBD-based Research Questions
In Academia
Causal Qs are most popular
• Methodological challenges:
  • scalability of stat models
  • small-sample stat inference
  • self-selection
Predictive Qs (quite rare)
• How to use results beyond application-specific? 6 uses of predictive analytics for theory building [Shmueli & Koppius, 2011]
In Industry
Purpose: evaluate or improve products, service, operations, etc.
• Netflix Prize: movie recommender system
• Yahoo!, LinkedIn: personalized news content to increase user engagement/clicks [Agarwal & Chen 2016]
• Target: pregnancy prediction
• Amazon: pricing, etc.
• Government: campaign targeting
Getting BBD for Research
1. Open Data, Publicly Available Data
Data.gov, Twitter, Kaggle (UCI MR), API and web scraping
2. Partnering with a Company
• Both parties interested in research question
• Data purchase
• Personal connections
• Partnership between school and organization (CMU Living Analytics Research Lab)
3. Crowdsourcing
AMT (Amazon Mechanical Turk): replacing student subjects
• Experiment subjects
• Survey respondents
• Cleaning and tagging data
“easy access to a large, stable, and diverse subject pool, the low cost of doing experiments, and faster iteration between developing theory and executing experiments” [Mason and Suri, 2012]
Using BBD for Research: Human Subjects
Institutional Review Board (IRB), “ethics committee”
University-level committee designated to approve, monitor, and review biomedical and behavioral research involving humans.
• performs benefit-risk analysis for proposed study
• guidelines: Beneficence, Justice, and Respect for persons
• HHS proposes new IRB exemption criteria for publicly available data (or even buying it)
• Council for Big Data, Ethics & Society’s letter: “these criteria for exclusion focus on the status of the dataset … not the content of the dataset nor what will be done with the dataset, which are more accurate criteria for determining the risk profile of the proposed research”
Ethics: Beyond IRB
Facebook experiment [Kramer et al. 2014]:
• No IRB
“[The work] was consistent with Facebook’s Data Use Policy, to which all users agree prior to creating an account on Facebook, constituting informed consent for this research.”
• PNAS editorial Expression of Concern
• Varied response from public, academia, press, ethicists, corporates [Adar 2015]
Big Behavioral Experiments
Big Behavioral Experiments: Issues
Compare to industrial environment
1. Fast-Changing Environment
Multiple A/B tests run every day (overlaps)
Users keep evolving
2. Multiplicity and Scaling
Computational advertising and content recommendation
3M’s [Agarwal & Chen 2016]:
• Multi-response (clicks, shares, likes, …)
• Multi-context (mobile, email, ...)
• Multiple objective (engagement, revenue, ...)
3. Spill-Over Effects
• Treatment can affect control group (social networks)
• Challenge of randomization on a social network (Fienberg, 2015): even if treatment and control members are sufficiently far apart to avoid spill-over effects, analysis still must account for dependence among units.
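One way to picture the spill-over problem is to randomize whole groups of connected users rather than individuals. The sketch below is an illustration only, not a method from the talk: it assigns each connected component of a small, hypothetical friendship graph to a single arm, so that no friendship crosses arms. The user names, the `assign_by_component` helper, and the seed are all assumptions made for the example.

```python
import random
from collections import deque

def connected_components(nodes, edges):
    """Return a list of sets, one per connected component of an undirected graph."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components

def assign_by_component(nodes, edges, seed=0):
    """Randomize at the component level: all connected users share one arm."""
    rng = random.Random(seed)
    assignment = {}
    for component in connected_components(nodes, edges):
        arm = rng.choice(["treatment", "control"])
        for node in component:
            assignment[node] = arm
    return assignment

# Hypothetical toy network: a-b-c are connected, d-e are connected, f is isolated.
users = ["a", "b", "c", "d", "e", "f"]
friendships = [("a", "b"), ("b", "c"), ("d", "e")]
assignment = assign_by_component(users, friendships)
print(assignment)
```

Real social networks often form one giant component, so practical designs would need some way of cutting the graph into clusters first. As the Fienberg point above notes, even a clean separation does not remove the need to model dependence among units in the analysis.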
Big Behavioral Experiments: Issues (continued)
4. Knowledge of Allocation and Gift Effect
• Like clinical trials: allocation knowledge can affect outcome
• Online users discover their allocation via online forums
• Blinding and placebo?
• “Gift” or preferential treatment can affect outcome
• Bapna et al. (2016) compared effect at end of manipulation time and right after, to determine gift effect
5. Ethical and Moral Issues
Ease of running a large scale experiment quickly and at low cost
• danger of harming many people quickly
• small scale pilot study?
AMT: Fair treatment & payment to workers
Observational BBD: Issues
Ethical and Moral Issues
• Privacy (Netflix)
• Data protection and reproducible research
• Conflict of interest company-vs-users (study conclusions lead to operational actions that trade off the company’s interest with user well-being)
• AMT: payment to workers
Methodological Issues
1. Self-selection Bias
Users choose treatment
• Scaling of PSM (propensity score matching) to big data?
2. Simpson’s Paradox
Causal direction reverses when data are disaggregated
• Does a dataset have a paradox?
3. Contamination by Experiments
4. Data Size & Dimension
Need very large + rich data to answer predictive Qs [Junque de Fortuny et al. 2014]
A Tree-Based Approach for Addressing Self-selection in Impact Studies with Big Data
Inbal Yahav, Bar Ilan University, Israel
Galit Shmueli, National Tsing Hua U, Taiwan
Deepa Mani, Indian School of Business, India
Self Selection: The Challenge
• Large impact studies of an intervention
• Individuals/firms choose which group to join
How to identify and adjust for self-selection?
Current Methods: Challenges with Big Data
1. Matching leads to severe data loss
2. Suffer from “data dredging”
3. Do not identify variables that drive the selection
4. Assume constant intervention effect
5. Sequential nature is computationally costly
6. Requires user to specify form of selection model
Our Tree-Based Approach: Use a data mining algorithm in a novel way
• Flexible non-parametric selection model
• Automated detection of unbalanced variables
• Easy to interpret, transparent, visual
• Applicable to binary, polytomous, continuous intervention
• Useful in Big Data context
• Identify heterogeneous effects
Example: Impact of training on financial gains
Experiment: USA govt program randomly assigned eligible candidates to training program
• Goal: increase future earnings
• Results (LaLonde, 1986):
  ✓ Groups statistically equal in terms of demographic & pre-train earnings
  ✓ Average Training Effect = $1794 (p<0.004)
Tree reveals… High-School Matters!
Tree split: High school degree (no / yes)

                      LaLonde’s naïve approach   Tree approach:           Tree approach:
                      (experiment)               HS dropout (n=348)       HS degree (n=97)
Not trained (n=260)   $4,554                     $4,495                   $4,855
Trained (n=185)       $6,349                     $5,649                   $8,047
Training effect       $1,794 (p=0.004)           $1,154 (p=0.063)         $3,192 (p=0.015)

Overall (tree approach): $1,598 (p=0.017)
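The subgroup figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below recomputes each subgroup's training effect as trained minus not-trained mean earnings, then combines the two effects weighted by subgroup size. That the slide's "Overall" figure was computed as a size-weighted average is an assumption of this sketch; the numbers happen to agree with it.

```python
# Mean earnings (USD) by high-school status, copied from the slide.
subgroups = {
    "HS dropout": {"n": 348, "not_trained": 4495, "trained": 5649},
    "HS degree": {"n": 97, "not_trained": 4855, "trained": 8047},
}

# Training effect within each subgroup: trained minus not trained.
effects = {
    name: group["trained"] - group["not_trained"]
    for name, group in subgroups.items()
}

# Combine subgroup effects, weighting each by its subgroup size (assumed scheme).
total_n = sum(group["n"] for group in subgroups.values())
overall = sum(effects[name] * group["n"] for name, group in subgroups.items()) / total_n

for name, effect in effects.items():
    print(f"{name}: ${effect:,}")
print(f"Size-weighted overall effect: ${overall:,.0f}")
```

The large gap between the two subgroup effects is what the slide highlights as a heterogeneous effect that the single pooled estimate hides.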
The Forest or the Trees? Tackling Simpson’s Paradox with Classification & Regression Trees
Galit Shmueli, National Tsing Hua University, Taiwan
Inbal Yahav-Shenberger, Bar-Ilan University, Israel
Simpson’s Paradox
The direction of a cause’s effect appears reversed when comparing the aggregate with the disaggregated sample (or population)
Simpson's Paradox is the reversal of an association between two variables after a third variable (a confounding factor) is taken into account. - Schield (1999)
The phenomenon whereby an event B increases the probability of A in a given population p and, at the same time, decreases the probability of A in every subpopulation of p. - Pearl (2009)
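A small numeric example can make these definitions concrete. The sketch below uses illustrative counts (not data from the talk) for a binary treatment X, a binary outcome, and a two-level confounder Z. It compares the treated-minus-control difference in outcome rates in the aggregate with the same difference inside each stratum, and reports a reversal when the aggregate sign is opposite to the sign in every stratum. The function names are hypothetical.

```python
# counts[(stratum, arm)] = (number with the outcome, group size); illustrative only
counts = {
    ("z0", "treated"): (81, 87),
    ("z0", "control"): (234, 270),
    ("z1", "treated"): (192, 263),
    ("z1", "control"): (55, 80),
}

def rate(cells):
    """Pooled outcome rate over a list of (events, size) cells."""
    events = sum(e for e, _ in cells)
    total = sum(n for _, n in cells)
    return events / total

def difference(counts, strata):
    """Treated-minus-control outcome rate, pooled over the given strata."""
    treated = rate([counts[(z, "treated")] for z in strata])
    control = rate([counts[(z, "control")] for z in strata])
    return treated - control

def shows_reversal(counts):
    """True if the aggregate difference has the opposite sign in every stratum."""
    strata = sorted({z for z, _ in counts})
    aggregate = difference(counts, strata)
    within = [difference(counts, [z]) for z in strata]
    return all(d * aggregate < 0 for d in within)

all_strata = sorted({z for z, _ in counts})
print("aggregate:", round(difference(counts, all_strata), 3))
for z in all_strata:
    print(z, round(difference(counts, [z]), 3))
print("reversal:", shows_reversal(counts))
```

This check looks only at the signs of point estimates. It says nothing about sampling error, which is the gap the significance-based approach later in the talk is meant to address.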
Goal: Does a dataset exhibit SP?
“There is no statistical criterion that would warn the investigator against drawing the wrong conclusion or would indicate which table represents the correct answer” - Pearl, 2009
“If Cornfield’s minimum effect size is not reached, [you] can assume no causality” - Schield, 1999
Cornfield et al.’s Criterion
A = cause, E = effect, C = confounder
$P(E \mid C) - P(E \mid C') > P(E \mid A) - P(E \mid A')$
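The criterion can be evaluated directly from a table of counts. The sketch below reads the criterion as requiring the confounder's effect size, P(E|C) − P(E|C′), to exceed the cause's effect size, P(E|A) − P(E|A′). The direction of the inequality and the use of absolute values (so the check works whatever the sign of the association) are assumptions of this sketch, as are the illustrative counts.

```python
# Each cell: (cause A present, confounder C present, count with effect E, group size).
# Illustrative counts only, not data from the talk.
cells = [
    (True, True, 81, 87),
    (False, True, 234, 270),
    (True, False, 192, 263),
    (False, False, 55, 80),
]

def conditional_rate(cells, index, value):
    """P(E | variable at position `index` equals `value`), pooled over the other variable."""
    events = sum(c[2] for c in cells if c[index] == value)
    total = sum(c[3] for c in cells if c[index] == value)
    return events / total

# Effect size of the cause: P(E|A) - P(E|A')
cause_effect = conditional_rate(cells, 0, True) - conditional_rate(cells, 0, False)
# Effect size of the confounder: P(E|C) - P(E|C')
confounder_effect = conditional_rate(cells, 1, True) - conditional_rate(cells, 1, False)

# Assumed reading: the confounder must have the larger effect size (in magnitude).
cornfield_met = abs(confounder_effect) > abs(cause_effect)

print(f"P(E|A) - P(E|A') = {cause_effect:.3f}")
print(f"P(E|C) - P(E|C') = {confounder_effect:.3f}")
print("criterion met:", cornfield_met)
```

Meeting the criterion indicates only that a reversal is possible for this confounder, not that one occurs or that it is statistically significant.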
Five potential tree structures: single causal variable (X) and single confounding variable (Z)
Which might exhibit Simpson’s Paradox?
Simpson’s Paradox on a Tree
#1 If cause -> effect, then cause should appear in tree
#2 If Z is confounding, then Z should appear in tree
Cornfield’s criterion + sampling error: Conditional Inference Trees
Seatbelts and Injuries (Agresti 2012)
Does use of seat-belts (X) reduce chance of injury (Y)?
Z = passenger gender and accident location
n = 68,694 passengers involved in accidents in Maine
Potential paradox (by location)
How about logistic regression?
[Figure: % injuries]
Simpson’s Paradox in Big Data
Large n, high-dimensional Z
Multiple potential confounders (Z)
The Challenge
Statistical significance of Simpson’s paradox ≠ significance threshold of tree splits in CI tree
[Figure: CI tree vs. full tree]
Solution: X-Terminal Tree
Paradox Detection in Big Data: X-Terminal Trees
Grow tree only until X-splits
Impact assessment of new online passport initiative in India
Survey commissioned by Govt of India in 2006
>9500 individuals who used passport services
• Representative sample of 13 Passport Offices
• Equal number of offline and online users, matched by geography and demographics
Various outcomes of interest, such as police bribing
Y = police bribe (0/1)
X = online/offline
Z = {demographics; survey Qs}
Bribes by online/offline, filtered by upper Z factors
[Figure: tree annotations: split p = .32; paradox p = 0.003; paradox p = 0.16; no paradox]
Kidney Allocation in USA (104,000 patients, 19 confounders)
Is the kidney allocation system racist?
Y = waiting time (days)
X = patient race
Z = {patient demog, health, bio}
Type 4 tree, but no significant Simpson’s paradox detected!
Large Scale Surveys
Online surveys: cheap, easy, fast
Large pool of available “workers”
Supplement experimental/observational studies
Data Quality
• duplicate responses
• insincere responses
require different approaches at large scale
Paradata: data on how the survey was accessed/answered
• timestamps of opening invitation email, when survey was accessed
• duration for answering each question
• [Survey of Adult Skills by the OECD]
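The data-quality points above can be illustrated with a simple paradata screen. The sketch below is a minimal illustration on hypothetical responses, not a procedure from the talk: it flags respondents whose median time per question falls below a threshold (possible insincere responses) and respondents whose full answer pattern repeats an earlier one (possible duplicates). The record layout, the 2-second threshold, and both helper functions are assumptions.

```python
from statistics import median

# Hypothetical responses: answers on a 1-5 scale plus seconds spent per question.
responses = [
    {"id": "r1", "answers": (3, 4, 2, 5), "seconds": [12.0, 9.5, 14.2, 11.0]},
    {"id": "r2", "answers": (1, 1, 1, 1), "seconds": [0.8, 0.6, 0.9, 0.7]},
    {"id": "r3", "answers": (3, 4, 2, 5), "seconds": [10.1, 8.7, 13.0, 12.4]},
    {"id": "r4", "answers": (2, 5, 4, 3), "seconds": [7.3, 6.1, 9.9, 8.0]},
]

MIN_MEDIAN_SECONDS = 2.0  # hypothetical cut-off; would need tuning per survey

def flag_speeders(responses, min_seconds):
    """IDs of respondents whose median time per question is below the cut-off."""
    return [r["id"] for r in responses if median(r["seconds"]) < min_seconds]

def flag_repeated_patterns(responses):
    """IDs of respondents whose answer pattern matches an earlier respondent's."""
    seen, flagged = set(), []
    for r in responses:
        if r["answers"] in seen:
            flagged.append(r["id"])
        seen.add(r["answers"])
    return flagged

speeders = flag_speeders(responses, MIN_MEDIAN_SECONDS)
repeats = flag_repeated_patterns(responses)
print("possible insincere responses:", speeders)
print("possible duplicate responses:", repeats)
```

Identical answer patterns do not prove duplication, especially on short surveys, so flags like these would be candidates for review rather than automatic exclusions.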
Large Scale Surveys (continued)
Methodological Issue: Generalization
Sampling and non-sampling errors
“The central issue is whether conditional effects in the sample (the study population) may be transported to desired target populations. Success depends on compatibility of causal structures in study and target populations, and will require subject matter considerations in each concrete case.” [Keiding and Louis, 2016]
• Statistical generalization & scientific generalization [Kenett & Shmueli, 2014]
Methodical Analysis Cycle of BBD
Inspired by Lifecycle view [Kenett, 2014], and stat thinking building blocks [Hoerl et al. 2014]
1. understand company context and BBD
2. set up the research question
3. determine experimental design
4. obtain IRB approval (if needed)
5. possibly: pilot experiment
6. communicate design with company; assure feasibility
7. company deploys experiment and collects the data
8. company shares the data with the researchers
9. researchers analyze the data and arrive at conclusions
10. researchers share the insights and conclusions with company and research community
11. company operationalizes the insights to improve their business
12. company deploys impact study
Summary
BBD = lots of behavioral data
Who has it? How is it analyzed? For what purpose?
Technical Challenges
• Data access
• Analysis scalability
• Quick-changing environment
Methodological Challenges
• Selection bias
• Generalization
• “Control” group contaminated by other experiments
• Spill-over effects
• Lack of methodical lifecycle
Legal, Ethical, Moral Challenges
• Privacy violation (Netflix; networks)
• Risks to human subjects
• Company vs. Researcher Objectives
• Gains of company at expense of individuals, communities, societies, & science
Why should industrial statisticians care about BBD?
Technology is advancing in two directions:
• Fully automated (algorithmic) solutions
• Micro-level recording of human and social behavior
Contemplation
• Threats to privacy, society, governance, human thought, and human interaction
• Generalization for company ≠ scientific generalization
• Personalization efforts -> de-personalization
• “Law of unintended consequences”: labeling “student at risk”, “potential criminal”
• Speed of research, excitement of new abilities, no time for contemplation
The Circle, run out of a sprawling California campus, links users’ personal emails, social media, banking, and purchasing with their universal operating system, resulting in one online identity and a new age of civility and transparency.
TheWayForward
Convergence of Social Sciences and Engineering
Things eventually collect BBD (intentionally or not)
Analytics
Humanity
Responsibility
Galit Shmueli
Institute of Service Science
Center for Service Innovation & Analytics
College of Technology Management
National Tsing Hua University