analyzing behavioral big data: methodological, practical, ethical & moral issues

Post on 14-Feb-2017


Analyzing Behavioral Big Data: Methodological, Practical, Ethical & Moral Issues

Galit Shmueli (徐茉莉), National Tsing Hua University

Stu Hunter Research Conference, Waterloo CA, March 2016

What is Behavioral Big Data (BBD)

• Special type of Big Data
• Behavioral: people’s actions, interactions, self-reported opinions, thoughts, feelings
• Human and social aspects: intentions, deception, emotion, reciprocation, herding, …
• When aware of data collection -> modified behavior (legal risks, embarrassment, unwanted solicitation)

BBD vs. Medical Big Data

Medical Big Data:
• Physical measurements
• Data collection timing often set by medical system
• Clinical trials: awareness & vested interest

BBD:
• People’s daily actions, interactions, self-reported feelings, opinions, thoughts (UGC)
• Data generation timing often chosen by user
• Experiments: users often unaware; goal not always in user’s interest

BBD on Citizens and Customers

Governments: security, law enforcement, traffic (cameras, sensors)

Financial institutions: fraud, loans (IT systems, cameras)

Telecoms: fraud, infrastructure, marketing (IT systems, mobile)

Retail chains: marketing, operations, merchandising (POS systems, video, social, mobile)

Insurance: set Usage-Based Insurance premiums (telematics info)

Data Collection Technologies:
• Cameras
• Sensors
• IT systems (POS, calls, …)
• GPS
• Things
• Internet
• Mobile
• Social

BBD on Employees

Service providers: quality control, employee performance

Electronic Performance Monitoring (EPM) systems, web surfing, e-mails sent and received, telephone use, video, location (taxis)

BBD on Citizens, Customers, Employees: Internet!

• BBD now also available to small companies & organizations
• Online platforms have BBD (e-commerce, gaming, search, social networks…)
• Voluntarily entered by users: personal details, photos, comments, messages, search terms, bids in auctions, likes, payment information, connections with “friends”
• Passive footprints: duration on the website, pages browsed, sequence, referring website, Internet browser, operating system, location, IP address
• BBD now available to individuals: Quantified Self (and apps)

Two important points

More and more human and social activities are moving online

Most companies that have BBD were not created for the purpose of generating BBD

Why should industrial statisticians care about BBD?

Technology is advancing in two directions:
• Fully automated (algorithmic) solutions
• Micro-level recording of human and social behavior

Industrial statisticians are (and should be) involved in designing both!

Research using BBD

Duncan Watts, Microsoft Research:
1. Social science problems are almost always more difficult than they seem
2. The data required to address many problems of interest to social scientists remain difficult to assemble
3. Thorough exploration of complex social problems often requires the complementary application of multiple research traditions

Academic Research Qs using BBD

Research about human and social behavior:
• examine new phenomena
• re-examine old phenomena with better data

Research Communities

Researchers with social science + technical backgrounds:
• Information Systems
• Marketing
• Computational Social Science

Examples of BBD Studies in Top Journals

Consumption in Virtual Worlds (Hinz et al., Info Sys Research, 2015)
“The idea that conspicuous consumption can increase social status, as a form of social capital, has been broadly accepted, yet researchers have not been able to test this effect empirically.”
• age-old sociology question with new BBD data
• BBD from two virtual world websites (gaming with social network)

Social Influence in Social News Websites (Muchnik et al., Science, 2014)
“The recent availability of population-scale data sets on rating behavior and social communication enable novel investigations of social influence...”
• Existing question in new context: study social influence bias in rating behavior
• BBD from a social news aggregation website where users contribute news articles, discuss them, and rate comments

Online Consumer Ratings of Physicians (Gao et al., Information Systems Research, 2014)
“examine how closely the online ratings reflect patients’ opinion about physician quality at large.”
• new phenomenon of online ratings of service providers
• BBD on direct measures of both the offline population’s perception of physician quality, and consumer generated online reviews

Impact of Teachers on Student Outcomes using Education and Tax BBD (Chetty et al., Amer Econ Review, 2014)
• long-term impact of teachers on student outcomes has been of interest in economic policy: old question with new BBD data
• combined BBD from administrative school district records and federal income tax records

Emotional Contagion in Social Networks (Kramer et al., Proc of the National Academy of Sciences, 2014)
• Can emotional states be transferred to others via emotional contagion?
• BBD from large-scale experiment run by FB, manipulating users’ exposure level to emotional expressions in their Facebook News Feed

Anonymous Browsing in Online Dating Websites (Bapna et al., Management Science, 2016)
“Online dating platforms offer new capabilities, such as extensive search, big data–based recommendations, and varying levels of anonymity, whose parallels do not exist in the physical world...”
• new questions about human behavior due to new technologies
• BBD from large-scale experiment, partnered with large dating website in N America, testing the effect of anonymous browsing on matching

ONE WAY MIRRORS IN ONLINE DATING: A Randomized Field Experiment

Ravi Bapna, University of Minnesota
Jui Ramaprasad, McGill University
Galit Shmueli, National Tsing Hua University
Akhmed Umyarov, University of Minnesota

Online Dating

46% of the single population in the US uses online dating to find a partner (Gelles 2011)

Online Dating Website

Non-anonymous Browsing (Default): Profile visit -> Recent visitor: [visitor shown]

Anonymous Browsing: Profile visit -> Recent visitor: NONE

Research Question (in simple words)

How does anonymous browsing affect user behavior?

… and matching?

Formal Research Question

what is the relative causal effect of social inhibitions on search preferences vs. social inhibitions of contact initiation in dating markets?

given known gender asymmetries, how does this effect differ for men vs. women?

Randomized Field Experiment on Large Online Dating Website

50,000 users receive gift of anonymous browsing

Results

Users treated with anonymity

• become disinhibited: view more profiles, view more same-sex and interracial mates
• get fewer matches: lose ability to leave a weak signal (especially harmful for women!)

Role of anonymity and importance of WEAK SIGNAL in online platforms

BBD-based Research Questions

In Academia
Causal Qs are most popular
• Methodological challenges:
  • scalability of stat models
  • small-sample stat inference
  • self-selection
Predictive Qs (quite rare)
• How to use results beyond application-specific? 6 uses of predictive analytics for theory building [Shmueli & Koppius, 2011]

In Industry
Purpose: evaluate or improve products, service, operations, etc.
• Netflix Prize: movie recommender system
• Yahoo!, LinkedIn: personalized news content to increase user engagement/clicks [Agarwal & Chen 2016]
• Target: pregnancy prediction
• Amazon: pricing, etc.
• Government: campaign targeting

Getting BBD for Research

1. Open Data, Publicly Available Data
• Data.gov
• Twitter
• Kaggle (UCIMR)
• API and web scraping

2. Partnering with a Company
• Both parties interested in research question
• Data purchase
• Personal connections
• Partnership between school and organization (CMU Living Analytics Research Lab)

3. Crowdsourcing: AMT
Replacing student subjects
• Experiment subjects
• Survey respondents
• Cleaning and tagging data

“easy access to a large, stable, and diverse subject pool, the low cost of doing experiments, and faster iteration between developing theory and executing experiments” [Mason and Suri, 2012]

Using BBD for Research: Human Subjects

Institutional Review Board (IRB), or “ethics committee”
University-level committee designated to approve, monitor, and review biomedical and behavioral research involving humans.
• performs benefit-risk analysis for proposed study
• guidelines: Beneficence, Justice, and Respect for persons

• HHS proposes new IRB exemption criteria for publicly available data (or even buying it)
• Council for Big Data, Ethics & Society’s letter: “these criteria for exclusion focus on the status of the dataset… not the content of the dataset nor what will be done with the dataset, which are more accurate criteria for determining the risk profile of the proposed research”

Ethics: Beyond IRB

Facebook experiment [Kramer et al. 2014]:
• No IRB

“[The work] was consistent with Facebook’s Data Use Policy, to which all users agree prior to creating an account on Facebook, constituting informed consent for this research.”

• PNAS editorial Expression of Concern
• Varied response from public, academia, press, ethicists, corporates [Adar 2015]

Big Behavioral Experiments

Big Behavioral Experiments: Issues
Compare to industrial environment

1. Fast-Changing Environment
• Multiple A/B tests run every day (overlaps)
• Users keep evolving

2. Multiplicity and Scaling
Computational advertising and content recommendation 3M’s [Agarwal & Chen 2016]:
• Multi-response (clicks, shares, likes, …)
• Multi-context (mobile, email, ...)
• Multiple objective (engagement, revenue, ...)

3. Spill-Over Effects
• Treatment can affect control group (social networks)
• Challenge of randomization on a social network (Fienberg, 2015): even if treatment and control members are sufficiently far away to avoid spill-over effects, analysis still must account for dependence among units.

Big Behavioral Experiments: Issues
Compare to industrial environment

4. Knowledge of Allocation and Gift Effect
• Like clinical trials: allocation knowledge can affect outcome
• Online users discover their allocation via online forums
• Blinding and placebo?
• “Gift” or preferential treatment can affect outcome
• Bapna et al. (2016) compared effect at end of manipulation time and right after, to determine gift effect

5. Ethical and Moral Issues
Ease of running a large scale experiment quickly and at low cost
• danger of harming many people quickly
• small scale pilot study?
AMT: Fair treatment & payment to workers

Observational BBD: Issues

Ethical and Moral Issues
• Privacy (Netflix)
• Data protection and reproducible research
• Conflict of interest company-vs-users (Study conclusions lead to operational actions that trade-off the company’s interest with user well-being)
• AMT: payment to workers

Methodological Issues

1. Self-selection Bias
Users choose treatment
• Scaling of PSM to big data?

2. Simpson’s Paradox
Causal direction reverses when data are disaggregated
• Does a dataset have a paradox?

3. Contamination by Experiments

4. Data Size & Dimension
Need very large + rich data to answer predictive Qs [Junque de Fortuny et al. 2014]

A Tree-Based Approach for Addressing Self-selection in Impact Studies with Big Data

Inbal Yahav, Bar Ilan University, Israel
Galit Shmueli, National Tsing Hua U, Taiwan
Deepa Mani, Indian School of Business, India

Self Selection: The Challenge

• Large impact studies of an intervention
• Individuals/firms choose which group to join

How to identify and adjust for self-selection?

Current Methods: Challenges with Big Data

1. Matching leads to severe data loss
2. Suffer from “data dredging”
3. Do not identify variables that drive the selection
4. Assume constant intervention effect
5. Sequential nature is computationally costly
6. Requires user to specify form of selection model

Our Tree-Based Approach: Use a data mining algorithm in a novel way

• Flexible non-parametric selection model
• Automated detection of unbalanced variables
• Easy to interpret, transparent, visual
• Applicable to binary, polytomous, continuous intervention
• Useful in Big Data context
• Identify heterogeneous effects

Example: Impact of training on financial gains

Experiment: USA govt program randomly assigned eligible candidates to training program
• Goal: increase future earnings
• Results (LaLonde, 1986):
  ✓ Groups statistically equal in terms of demographic & pre-train earnings
  ✓ Average Training Effect = $1794 (p<0.004)

Tree reveals… High School Matters!

                       LaLonde’s naïve approach   Tree approach:       Tree approach:
                       (experiment)               HS dropout (n=348)   HS degree (n=97)
Not trained (n=260)    $4,554                     $4,495               $4,855
Trained (n=185)        $6,349                     $5,649               $8,047
Training effect        $1,794 (p=0.004)           $1,154 (p=0.063)     $3,192 (p=0.015)

Tree approach, overall training effect: $1,598 (p=0.017)

Tree split: High school degree (no / yes)
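The overall tree-based effect of $1,598 appears to equal the sample-size-weighted average of the two subgroup effects ($1,154 for the 348 HS dropouts, $3,192 for the 97 HS graduates). That reading is an inference from the numbers on the slide, not something the slide states. A minimal Python sketch of the arithmetic, with hypothetical variable and function names:

```python
# Subgroup sizes and training effects as reported on the slide.
subgroups = {
    "HS dropout": {"n": 348, "effect": 1154},
    "HS degree": {"n": 97, "effect": 3192},
}


def weighted_average_effect(groups):
    """Average the subgroup effects, weighting each by its sample size."""
    total_n = sum(g["n"] for g in groups.values())
    return sum(g["n"] * g["effect"] for g in groups.values()) / total_n


overall = weighted_average_effect(subgroups)
print(f"Overall tree-based training effect: ${overall:,.0f}")
```

The weighting matters because the HS-dropout leaf holds most of the sample, so the overall figure sits much closer to $1,154 than to $3,192.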

The Forest or the Trees? Tackling Simpson’s Paradox with Classification & Regression Trees

Galit Shmueli, National Tsing Hua University, Taiwan
Inbal Yahav-Shenberger, Bar-Ilan University, Israel

Simpson’s Paradox

The direction of a cause on an effect appears reversed when examining aggregate vs. disaggregate of a sample (or population)

Simpson’s Paradox is the reversal of an association between two variables after a third variable (a confounding factor) is taken into account. - Schield (1999)

The phenomenon whereby an event B increases the probability of A in a given population p and, at the same time, decreases the probability of A in every subpopulation of p. - Pearl (2009)
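The definitions above can be checked descriptively on a small contingency table. The sketch below uses illustrative counts that are not from the talk, and hypothetical function names. It compares the direction of the X-Y association in the aggregate with its direction inside each stratum of Z.

```python
# Illustrative counts only (not from the talk).
# Stratum of confounder Z -> value of cause X -> (events, group size).
counts = {
    "Z=a": {"X=1": (81, 87), "X=0": (234, 270)},
    "Z=b": {"X=1": (192, 263), "X=0": (55, 80)},
}


def rate(events, n):
    return events / n


def effect_sign(r1, r0):
    """+1 if X=1 has the higher event rate, -1 if lower, 0 if equal."""
    return (r1 > r0) - (r1 < r0)


def aggregate(table, x):
    """Pool events and group sizes for one value of X across all strata."""
    events = sum(cells[x][0] for cells in table.values())
    n = sum(cells[x][1] for cells in table.values())
    return events, n


def exhibits_reversal(table):
    """True if every stratum shows the opposite direction to the aggregate."""
    agg = effect_sign(rate(*aggregate(table, "X=1")), rate(*aggregate(table, "X=0")))
    strata = [effect_sign(rate(*c["X=1"]), rate(*c["X=0"])) for c in table.values()]
    return agg != 0 and all(s == -agg for s in strata)


print(exhibits_reversal(counts))
```

This check only compares observed rates and ignores sampling error, so it says nothing about whether a reversal is statistically significant.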

Goal: Does a dataset exhibit SP?

“There is no statistical criterion that would warn the investigator against drawing the wrong conclusion or would indicate which table represents the correct answer” - Pearl, 2009

“If Cornfield’s minimum effect size is not reached, [you] can assume no causality” - Schield, 1999

Cornfield et al.’s Criterion

E = effect, A = cause, C = confounder

Compare the confounder’s effect size, P(E|C) - P(E|C’), with the cause’s effect size, P(E|A) - P(E|A’)
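One common reading of the criterion, treated here as an assumption rather than as the slide's exact statement, is that a confounder C can only explain away an observed association between A and E if its effect size is at least as large as that of A. The sketch below computes both risk differences from hypothetical counts. The function name and the use of absolute values are illustrative choices.

```python
def risk_difference(events_with, n_with, events_without, n_without):
    """P(E | factor present) - P(E | factor absent), from raw counts."""
    return events_with / n_with - events_without / n_without


# Hypothetical counts (not from the talk).
cause_effect = risk_difference(273, 350, 289, 350)       # P(E|A) - P(E|A')
confounder_effect = risk_difference(315, 357, 247, 343)  # P(E|C) - P(E|C')

# Assumed reading: the confounder is a candidate explanation only if its
# effect size is at least as large as the observed effect of the cause.
confounder_could_explain = abs(confounder_effect) >= abs(cause_effect)

print(f"cause effect size:      {cause_effect:+.3f}")
print(f"confounder effect size: {confounder_effect:+.3f}")
print(f"confounder large enough: {confounder_could_explain}")
```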

Five potential tree structures: single causal variable (X) and single confounding variable (Z)

Which might exhibit Simpson’s Paradox?

Simpson’s Paradox on a Tree

#1 If cause -> effect, then cause should appear in tree

#2 If Z is confounding, then Z should appear in tree

Cornfield’s criterion + sampling error: Conditional Inference Trees

Seatbelts and Injuries (Agresti 2012)

Does use of seat-belts (X) reduce chance of injury (Y)?
Z = Passenger gender and accident location

n = 68,694 passengers involved in accidents in Maine

Potential Paradox (by location)

How about logistic regression?

[Figure: % Injuries]

Simpson’s Paradox in Big Data: Large n, High-dimensional Z

Multiple Potential Confounders (Z)

The Challenge: statistical significance of Simpson’s paradox ≠ significance threshold of tree splits in CI tree

[Figure panels: CI Tree, Full Tree]

Solution: X-Terminal Tree

Paradox Detection in Big Data: X-Terminal Trees

Grow tree only until X-splits

Impact assessment of new online passport Initiative in India

Survey commissioned by Govt of India in 2006
>9500 individuals who used passport services
• Representative sample of 13 Passport Offices
• Equal number of offline and online users, matched by geography and demographics
Various outcomes of interest, such as Police bribing

Y = police bribe (0/1)
X = online/offline
Z = {demographics; survey Qs}

Bribes by online/offline filtered by upper Z factors

[Tree annotations: Split p=.32; Paradox p=0.003; Paradox p=0.16; No paradox]

Kidney Allocation in USA (104,000 patients, 19 confounders)

Is the kidney allocation system racist?

Y = waiting time (days)
X = patient race
Z = {patient demog, health, bio}

Type 4 tree, but no significant Simpson’s paradox detected!

Large Scale Surveys

Online surveys: cheap, easy, fast
Large pool of available “workers”
Supplement experimental/observational studies

Data Quality
• duplicate responses
• insincere responses
require different approaches at large scale

Paradata: data on how the survey was accessed/answered
• timestamps of opening invitation email, when survey was accessed
• Duration for answering each question
• [Survey of Adult Skills by the OECD]
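One possible use of paradata, assumed here rather than stated on the slide, is screening for implausibly fast completions as a rough flag for insincere responses. The records, field names, and 60-second threshold below are all hypothetical.

```python
from datetime import datetime

# Hypothetical paradata: one record per respondent, ISO 8601 timestamps.
records = [
    {"id": "r1", "opened": "2016-03-01T10:00:00", "submitted": "2016-03-01T10:12:30"},
    {"id": "r2", "opened": "2016-03-01T11:00:00", "submitted": "2016-03-01T11:00:40"},
]

MIN_SECONDS = 60  # assumed threshold; a real study would need to justify it


def duration_seconds(record):
    """Seconds between opening the survey and submitting it."""
    start = datetime.fromisoformat(record["opened"])
    end = datetime.fromisoformat(record["submitted"])
    return (end - start).total_seconds()


def flag_speeders(recs, min_seconds=MIN_SECONDS):
    """Return the ids of respondents who finished faster than the threshold."""
    return [r["id"] for r in recs if duration_seconds(r) < min_seconds]


print(flag_speeders(records))
```

A flag like this is only a starting point for review: fast completion does not prove insincerity, and slow completion does not rule it out.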

Large Scale Surveys

Methodological Issue: Generalization
Sampling and non-sampling errors

“The central issue is whether conditional effects in the sample (the study population) may be transported to desired target populations. Success depends on compatibility of causal structures in study and target populations, and will require subject matter considerations in each concrete case.” [Keiding and Louis, 2016]

• Statistical generalization & scientific generalization [Kenett & Shmueli, 2014]

Methodical Analysis Cycle of BBD
Inspired by Lifecycle view [Kenett, 2014], and stat thinking building blocks [Hoerl et al. 2014]

1. understand company context and BBD
2. set up the research question
3. determine experimental design
4. obtain IRB approval (if needed)
5. possibly: pilot experiment
6. communicate design with company; assure feasibility
7. company deploys experiment and collects the data
8. company shares the data with the researchers
9. researchers analyze the data and arrive at conclusions
10. researchers share the insights and conclusions with company and research community
11. company operationalizes the insights to improve their business
12. company deploys impact study

Summary

BBD = lots of behavioral data
• Who has it?
• How is it analyzed?
• For what purpose?

Technical Challenges
• Data access
• Analysis scalability
• Quick-changing environment

Methodological Challenges
• Selection bias
• Generalization
• “Control” group contaminated by other experiments
• Spill-over effects
• Lack of methodical lifecycle

Legal, Ethical, Moral Challenges
• Privacy violation (Netflix; networks)
• Risks to human subjects
• Company vs. Researcher Objectives
• Gains of company at expense of individuals, communities, societies, & science

Why should industrial statisticians care about BBD?

Technology is advancing in two directions:
• Fully automated (algorithmic) solutions
• Micro-level recording of human and social behavior

Contemplation

Threats to privacy, society, governance, human thought, and human interaction

Generalization for company ≠ scientific generalization

Personalization efforts -> de-personalization

“Law of unintended consequences”
• Labeling “student at risk”, “potential criminal”

Speed of research, excitement of new abilities, no time for contemplation

The Circle, run out of a sprawling California campus, links users’ personal emails, social media, banking, and purchasing with their universal operating system, resulting in one online identity and a new age of civility and transparency.

The Way Forward

Convergence of Social Sciences and Engineering

Things eventually collect BBD (intentionally or not)

Analytics

Humanity

Responsibility

Galit Shmueli
Institute of Service Science
Center for Service Innovation & Analytics
College of Technology Management
National Tsing Hua University, Taiwan
galit.shmueli@iss.nthu.edu.tw
