Research Using Behavioral Big Data (BBD): Methodological, Practical, Ethical & Moral Issues
IEEE BigData Congress, Taipei Satellite Session, May 2016
Galit Shmueli 徐茉莉, Institute of Service Science
What is Behavioral Big Data (BBD)?
• Special type of Big Data
• Behavioral: people’s actions, interactions, self-reported opinions, thoughts, feelings
• Human and social aspects: intentions, deception, emotion, reciprocation, herding, …
• Awareness of data collection -> modified behavior (legal risks, embarrassment, unwanted solicitation)
BBD vs. Medical Big Data
Medical Big Data:
• Physical measurements
• Data collection timing often set by medical system
• Clinical trials: awareness & vested interest
BBD:
• People’s daily actions, interactions, self-reported feelings, opinions, thoughts (user-generated content, UGC)
• Data generation timing often chosen by user
• Experiments: users often unaware; goal not always in user’s interest
BBD on Citizens and Customers
Governments: security, law enforcement, traffic (cameras, sensors)
Financial institutions: fraud, loans (IT systems, cameras)
Telecoms: fraud, infrastructure, marketing (IT systems, mobile)
Retail chains: marketing, operations, merchandising (POS systems, video, social, mobile)
Insurance: set Usage-Based Insurance premiums (telematics info)
Data Collection Technologies:
• Cameras
• Sensors
• IT systems (POS, calls, …)
• GPS
• Things
• Internet
• Mobile
• Social
BBD on Employees
Service providers: quality control, employee performance
Electronic Performance Monitoring (EPM) systems, web surfing, e-mails sent and received, telephone use, video, location (taxis)
BBD on Citizens, Customers, Employees: Internet!
• BBD now also available to small companies & organizations
• Online platforms have BBD (e-commerce, gaming, search, social networks, …)
• Voluntarily entered by users: personal details, photos, comments, messages, search terms, bids in auctions, likes, payment information, connections with “friends”
• Passive footprints: duration on the website, pages browsed, sequence, referring website, Internet browser, operating system, location, IP address
• BBD now available to individuals: Quantified Self (and apps)
Two important points
• More and more human and social activities are moving online
• Most companies that have BBD were not created for the purpose of generating BBD
Why should data science researchers care about BBD?
Technology is advancing in two directions:
• Fully automated (algorithmic) solutions
• Micro-level recording of human and social behavior
Because they are (and should be) involved in designing both!
Research Using BBD
Duncan Watts, Microsoft Research:
1. Social science problems are almost always more difficult than they seem
2. The data required to address many problems of interest to social scientists remain difficult to assemble
3. Thorough exploration of complex social problems often requires the complementary application of multiple research traditions
Academic Research Qs Using BBD
Research about human and social behavior:
• examine new phenomena
• re-examine old phenomena with better data
Research Communities
Researchers with social science + technical backgrounds:
• Information Systems
• Marketing
• Computational Social Science
Examples of BBD Studies in Top Journals

Consumption in Virtual Worlds (Hinz et al., Information Systems Research, 2015)
“The idea that conspicuous consumption can increase social status, as a form of social capital, has been broadly accepted, yet researchers have not been able to test this effect empirically.”
• Age-old sociology question with new BBD
• BBD from two virtual world websites (gaming with social network)

Social Influence in Social News Websites (Muchnik et al., Science, 2014)
“The recent availability of population-scale data sets on rating behavior and social communication enable novel investigations of social influence...”
• Existing question in new context: study social influence bias in rating behavior
• BBD from a social news aggregation website where users contribute news articles, discuss them, and rate comments

Online Consumer Ratings of Physicians (Gao et al., Information Systems Research, 2014)
“examine how closely the online ratings reflect patients’ opinion about physician quality at large.”
• New phenomenon of online ratings of service providers
• BBD on direct measures of both the offline population’s perception of physician quality and consumer-generated online reviews

Impact of Teachers on Student Outcomes Using Education and Tax BBD (Chetty et al., American Economic Review, 2014)
• Long-term impact of teachers on student outcomes has been of interest in economic policy: old question with new BBD
• Combined BBD from administrative school district records and federal income tax records

Emotional Contagion in Social Networks (Kramer et al., Proceedings of the National Academy of Sciences, 2014)
• Can emotional states be transferred to others via emotional contagion?
• BBD from large-scale experiment run by Facebook, manipulating users’ exposure level to emotional expressions in their Facebook News Feed

Anonymous Browsing in Online Dating Websites (Bapna et al., Management Science, 2016)
“Online dating platforms offer new capabilities, such as extensive search, big data–based recommendations, and varying levels of anonymity, whose parallels do not exist in the physical world...”
• New questions about human behavior due to new technologies
• BBD from large-scale experiment, partnered with large dating website in North America, testing the effect of anonymous browsing on matching
ONE-WAY MIRRORS IN ONLINE DATING: A Randomized Field Experiment
Ravi Bapna, University of Minnesota
Jui Ramaprasad, McGill University
Galit Shmueli, National Tsing Hua University
Akhmed Umyarov, University of Minnesota
Online Dating
46% of the single population in the US uses online dating to find a partner (Gelles 2011)
Online Dating Website
• Non-anonymous browsing (default): a profile visit shows the visitor under “Recent visitor”
• Anonymous browsing: a profile visit leaves “Recent visitor: NONE”
Research Question (in simple words)
How does anonymous browsing affect user behavior?
… and matching?
Formal Research Question
• What is the relative causal effect of social inhibitions on search preferences vs. social inhibitions on contact initiation in dating markets?
• Given known gender asymmetries, how does this effect differ for men vs. women?
Randomized Field Experiment on Large Online Dating Website
50,000 users receive gift of anonymous browsing
Results
Users treated with anonymity:
• become disinhibited: view more profiles, view more same-sex and interracial mates
• get fewer matches: lose ability to leave a weak signal (especially harmful for women!)
Role of anonymity and importance of WEAK SIGNAL in online platforms
BBD-based Research Questions

In Academia
Causal Qs are most popular
• Methodological challenges:
  • scalability of statistical models
  • small-sample statistical inference
  • self-selection
Predictive Qs (quite rare)
• How to use results beyond application-specific? 6 uses of predictive analytics for theory building [Shmueli & Koppius, 2011]

In Industry
Purpose: evaluate or improve products, service, operations, etc.
• Netflix Prize: movie recommender system
• Yahoo!, LinkedIn: personalized news content to increase user engagement/clicks [Agarwal & Chen 2016]
• Target: pregnancy prediction
• Amazon: pricing, etc.
• Government: campaign targeting
Getting BBD for Research

1. Open Data, Publicly Available Data
Data.gov, Twitter, Kaggle (UCIMR), API and web scraping

2. Partnering with a Company
• Both parties interested in research question
• Data purchase
• Personal connections
• Partnership between school and organization (CMU Living Analytics Research Lab)

3. Crowdsourcing: Amazon Mechanical Turk (AMT)
Replacing student subjects:
• Experiment subjects
• Survey respondents
• Cleaning and tagging data
“easy access to a large, stable, and diverse subject pool, the low cost of doing experiments, and faster iteration between developing theory and executing experiments” [Mason and Suri, 2012]
Using BBD for Research: Human Subjects

Institutional Review Board (IRB), “ethics committee”
University-level committee designated to approve, monitor, and review biomedical and behavioral research involving humans.
• Performs benefit-risk analysis for proposed study
• Guidelines: Beneficence, Justice, and Respect for persons
• HHS proposes new IRB exemption criteria for publicly available data (or even buying it)
• Council for Big Data, Ethics & Society’s letter: “these criteria for exclusion focus on the status of the dataset … not the content of the dataset nor what will be done with the dataset, which are more accurate criteria for determining the risk profile of the proposed research”
Ethics: Beyond IRB
Facebook experiment [Kramer et al. 2014]:
• No IRB
“[The work] was consistent with Facebook’s Data Use Policy, to which all users agree prior to creating an account on Facebook, constituting informed consent for this research.”
• PNAS editorial Expression of Concern
• Varied response from public, academia, press, ethicists, corporates [Adar 2015]
Big Behavioral Experiments

Big Behavioral Experiments: Issues
Compare to industrial environment

1. Fast-Changing Environment
• Multiple A/B tests run every day (overlaps)
• Users keep evolving

2. Multiplicity and Scaling
Computational advertising and content recommendation
3M’s [Agarwal & Chen 2016]:
• Multi-response (clicks, shares, likes, …)
• Multi-context (mobile, email, ...)
• Multiple objective (engagement, revenue, ...)

3. Spill-Over Effects
• Treatment can affect control group (social networks)
• Challenge of randomization on a social network (Fienberg, 2015): even if treatment and control members are sufficiently far away to avoid spill-over effects, analysis still must account for dependence among units.
4. Knowledge of Allocation and Gift Effect
• Like clinical trials: allocation knowledge can affect outcome
• Online users discover their allocation via online forums
• Blinding and placebo?
• “Gift” or preferential treatment can affect outcome
• Bapna et al. (2016) compared effect at end of manipulation time and right after, to determine gift effect

5. Ethical and Moral Issues
Ease of running a large-scale experiment quickly and at low cost
• Danger of harming many people quickly
• Small-scale pilot study?
AMT: fair treatment & payment to workers
Observational BBD: Issues

Ethical and Moral Issues
• Privacy (Netflix)
• Data protection and reproducible research
• Conflict of interest company-vs-users (study conclusions lead to operational actions that trade off the company’s interest with user well-being)
• AMT: payment to workers

Methodological Issues
1. Self-selection Bias
Users choose treatment
• Scaling of propensity score matching (PSM) to big data?
2. Simpson’s Paradox
Causal direction reverses when data are disaggregated
• Does a dataset have a paradox?
3. Contamination by Experiments
4. Data Size & Dimension
Need very large + rich data to answer predictive Qs [Junque de Fortuny et al. 2014]
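The question "Does a dataset have a paradox?" can be illustrated with a small check. The sketch below is only one simple way to frame it, assuming a binary treatment, a binary outcome, and a single grouping variable; the function names and the counts are illustrative and do not come from the talk.

```python
def rate_diff(t_success, t_n, c_success, c_n):
    """Treated success rate minus control success rate."""
    return t_success / t_n - c_success / c_n


def has_simpsons_paradox(strata):
    """Return True if every subgroup shows one direction of effect
    while the pooled data show the opposite direction.

    strata maps a subgroup name to the counts
    (treated_successes, treated_n, control_successes, control_n).
    """
    sub_diffs = [rate_diff(*counts) for counts in strata.values()]
    # Pool the four count columns across subgroups.
    totals = [sum(col) for col in zip(*strata.values())]
    agg_diff = rate_diff(*totals)
    all_pos = all(d > 0 for d in sub_diffs)
    all_neg = all(d < 0 for d in sub_diffs)
    return (all_pos and agg_diff < 0) or (all_neg and agg_diff > 0)


# Illustrative counts only (hypothetical user segments).
strata = {
    "segment_a": (81, 87, 234, 270),
    "segment_b": (192, 263, 55, 80),
}
print(has_simpsons_paradox(strata))
```

In this example the treated group does better within each segment but worse once the segments are pooled, because treatment is concentrated in the segment with the lower baseline success rate.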
A Tree-Based Approach for Addressing Self-selection in Impact Studies with Big Data
Inbal Yahav, Bar Ilan University, Israel
Galit Shmueli, National Tsing Hua University, Taiwan
Deepa Mani, Indian School of Business, India
Self Selection: The Challenge
• Large impact studies of an intervention
• Individuals/firms choose which group to join
How to identify and adjust for self-selection?
Current Methods: Challenges with Big Data
1. Matching leads to severe data loss
2. Suffer from “data dredging”
3. Do not identify variables that drive the selection
4. Assume constant intervention effect
5. Sequential nature is computationally costly
6. Require user to specify form of selection model
Our Tree-Based Approach: Use a data mining algorithm in a novel way
• Flexible non-parametric selection model
• Automated detection of unbalanced variables
• Easy to interpret, transparent, visual
• Applicable to binary, polytomous, continuous intervention
• Useful in Big Data context
• Identify heterogeneous effects
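The general idea of using a tree as a selection model can be sketched with a single split. This is not the authors' algorithm, only a minimal illustration under stated assumptions: covariates are binary, the intervention indicator is the tree's target, and the split is chosen by Gini impurity. All names and data below are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_selection_split(rows, covariates, treat="treated"):
    """Pick the binary covariate that best separates treated from
    untreated units, i.e. the most unbalanced pre-intervention variable."""
    best = None
    for cov in covariates:
        left = [r[treat] for r in rows if r[cov] == 0]
        right = [r[treat] for r in rows if r[cov] == 1]
        if not left or not right:
            continue  # covariate does not split the data
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or impurity < best[1]:
            best = (cov, impurity)
    return best[0] if best else None


def mean(values):
    return sum(values) / len(values)


def effect_by_leaf(rows, cov, treat="treated", outcome="outcome"):
    """Treated-minus-untreated mean outcome within each leaf of the split."""
    effects = {}
    for value in (0, 1):
        leaf = [r for r in rows if r[cov] == value]
        treated = [r[outcome] for r in leaf if r[treat] == 1]
        control = [r[outcome] for r in leaf if r[treat] == 0]
        if treated and control:
            effects[value] = mean(treated) - mean(control)
    return effects


# Hypothetical toy data: hs_degree drives selection, urban is balanced.
rows = [
    {"hs_degree": 1, "urban": 1, "treated": 1, "outcome": 10},
    {"hs_degree": 1, "urban": 1, "treated": 1, "outcome": 12},
    {"hs_degree": 1, "urban": 0, "treated": 1, "outcome": 11},
    {"hs_degree": 1, "urban": 1, "treated": 0, "outcome": 6},
    {"hs_degree": 0, "urban": 0, "treated": 1, "outcome": 5},
    {"hs_degree": 0, "urban": 1, "treated": 0, "outcome": 4},
    {"hs_degree": 0, "urban": 0, "treated": 0, "outcome": 3},
    {"hs_degree": 0, "urban": 0, "treated": 0, "outcome": 5},
]
split = best_selection_split(rows, ["hs_degree", "urban"])
print(split, effect_by_leaf(rows, split))
```

A full tree would keep splitting and would handle non-binary covariates and interventions; the single split is enough to show how the selection variable and the per-leaf (heterogeneous) effects fall out of the same structure.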
Example: Impact of training on financial gains
Experiment: USA govt program randomly assigned eligible candidates to training program
• Goal: increase future earnings
• Results (LaLonde, 1986):
  ✓ Groups statistically equal in terms of demographics & pre-training earnings
  ✓ Average Training Effect = $1794 (p<0.004)
Tree reveals… High-School Matters!
Tree split: High school degree (no / yes)

                       LaLonde’s naïve approach   Tree: HS dropout   Tree: HS degree
                       (experiment)               (no, n=348)        (yes, n=97)
Not trained (n=260)    $4,554                     $4,495             $4,855
Trained (n=185)        $6,349                     $5,649             $8,047
Training effect        $1,794 (p=0.004)           $1,154 (p=0.063)   $3,192 (p=0.015)

Tree approach overall: $1,598 (p=0.017)
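The figures on this slide can be checked with simple arithmetic. The sketch below recomputes each training effect as trained minus not-trained mean earnings, using only the numbers shown. It also computes a size-weighted average of the two leaf effects; that this is how the overall figure was obtained is an assumption, although the result matches the reported $1598.

```python
# Mean earnings and group sizes as shown on the slide.
groups = {
    "all (naive)": {"n": 445, "not_trained": 4554, "trained": 6349},
    "HS dropout": {"n": 348, "not_trained": 4495, "trained": 5649},
    "HS degree": {"n": 97, "not_trained": 4855, "trained": 8047},
}

# Training effect = mean earnings of trained minus not trained.
effects = {name: g["trained"] - g["not_trained"] for name, g in groups.items()}

# Assumed aggregation: weight each leaf effect by the leaf's size.
leaves = ("HS dropout", "HS degree")
weighted_overall = sum(groups[k]["n"] * effects[k] for k in leaves) / sum(
    groups[k]["n"] for k in leaves
)

print(effects)
print(round(weighted_overall))
```

The naive difference computed from the rounded means is $1795, one dollar off the reported $1794, which is presumably a rounding effect in the displayed means.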
Large Scale Surveys

Online surveys: cheap, easy, fast
• Large pool of available “workers”
• Supplement experimental/observational studies

Data Quality
• Duplicate responses
• Insincere responses
These require different approaches at large scale.

Paradata: data on how the survey was accessed/answered
• Timestamps of opening invitation email, when survey was accessed
• Duration for answering each question
• [Survey of Adult Skills by the OECD]
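Paradata such as per-question durations can support automated screening for the data quality problems listed above. The following is a minimal sketch, assuming each response carries a respondent ID, a tuple of answers, and per-question durations in seconds; the field names, the flag labels, and the 2-second threshold are all hypothetical choices, not taken from the talk.

```python
from collections import Counter
from statistics import median


def flag_responses(responses, min_median_seconds=2.0):
    """Return (respondent_id, reasons) pairs, one per response.

    Flags: repeated respondent IDs, a median per-question duration
    below the threshold, and identical answers to every question.
    """
    id_counts = Counter(r["respondent_id"] for r in responses)
    flagged = []
    for r in responses:
        reasons = []
        if id_counts[r["respondent_id"]] > 1:
            reasons.append("duplicate_id")
        if median(r["durations"]) < min_median_seconds:
            reasons.append("too_fast")
        if len(r["answers"]) > 1 and len(set(r["answers"])) == 1:
            reasons.append("identical_answers")
        flagged.append((r["respondent_id"], reasons))
    return flagged


# Hypothetical responses for illustration.
responses = [
    {"respondent_id": "w1", "answers": (3, 4, 2, 5), "durations": [6.1, 4.8, 7.3, 5.0]},
    {"respondent_id": "w2", "answers": (3, 3, 3, 3), "durations": [0.9, 1.1, 0.8, 1.0]},
    {"respondent_id": "w1", "answers": (3, 4, 2, 5), "durations": [5.5, 4.1, 6.0, 4.9]},
]
for respondent_id, reasons in flag_responses(responses):
    print(respondent_id, reasons)
```

Flags like these only mark candidates for review; whether a fast or uniform response is actually insincere still needs a judgment that depends on the survey.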
Methodological Issue: Generalization
Sampling and non-sampling errors
“The central issue is whether conditional effects in the sample (the study population) may be transported to desired target populations. Success depends on compatibility of causal structures in study and target populations, and will require subject matter considerations in each concrete case.” [Keiding and Louis, 2016]
• Statistical generalization & scientific generalization [Kenett & Shmueli, 2014]
Methodical Analysis Cycle of BBD
Inspired by the life cycle view [Kenett, 2014] and statistical thinking building blocks [Hoerl et al. 2014]
1. Understand company context and BBD
2. Set up the research question
3. Determine experimental design
4. Obtain IRB approval (if needed)
5. Possibly: pilot experiment
6. Communicate design with company; assure feasibility
7. Company deploys experiment and collects the data
8. Company shares the data with the researchers
9. Researchers analyze the data and arrive at conclusions
10. Researchers share the insights and conclusions with company and research community
11. Company operationalizes the insights to improve their business
12. Company deploys impact study
Summary
BBD = lots of behavioral data
• Who has it?
• How is it analyzed?
• For what purpose?

Technical Challenges
• Data access
• Analysis scalability
• Quick-changing environment

Methodological Challenges
• Selection bias
• Generalization
• “Control” group contaminated by other experiments
• Spill-over effects
• Lack of methodical life cycle

Legal, Ethical, Moral Challenges
• Privacy violation (Netflix; networks)
• Risks to human subjects
• Company vs. researcher objectives
• Gains of company at expense of individuals, communities, societies, & science
Why should data science researchers care about BBD?
Technology is advancing in two directions:
• Fully automated (algorithmic) solutions
• Micro-level recording of human and social behavior
Contemplation
• Threats to privacy, society, governance, human thought, and human interaction
• Generalization for company ≠ scientific generalization
• Personalization efforts -> de-personalization
• “Law of unintended consequences”: labeling “student at risk”, “potential criminal”
• Speed of research, excitement of new abilities, no time for contemplation

The Circle, run out of a sprawling California campus, links users’ personal emails, social media, banking, and purchasing with their universal operating system, resulting in one online identity and a new age of civility and transparency.
The Way Forward
• Convergence of Social Sciences and Engineering
• Things eventually collect BBD (intentionally or not)
Analytics
Humanity
Responsibility