Predictive Analytics in Information Systems Research (TSWIM 2015 Keynote)

To Explain or To Predict? Predictive Analytics in IS Research
3rd Taiwan Summer Workshop on Information Management, July 2015
Galit Shmuéli



Galit Shmueli (徐茉莉), www.galitshmueli.com

❶ 1994-2000 Israel Institute of Technology: MSc + PhD, Statistics

❷ 2000-2002 Carnegie Mellon Univ.: Visiting Assistant Prof., Dept. of Statistics

❸ 2002-2012 Univ. of Maryland College Park: Assistant then Associate Prof. of Statistics & Management Science, R H Smith School of Business

2008-2014 Rigsum Institute (Bhutan): Co-Director, Rigsum Research Lab

❹ 2011-2014 Indian School of Business: SRITNE Chaired Prof. of Data Analytics, Associate Prof. of Statistics & Info Systems

2014-… NTHU, Institute of Service Science: Director, Center for Service Innovation & Analytics

Research in Data Analytics

www.galitshmueli.com

• Statistical strategy
• ‘Entrepreneurial’ statistical & data mining modeling (new conditions & environments)
• Business analytics

In progress…

www.iss.nthu.edu.tw

Road Map

• Definitions
• Explanatory-dominated MIS
• Explanatory modeling ≠ predictive modeling: Why?
• Different modeling paths
• Explanatory power vs. predictive power
• How do I use this?

Definitions

Explanatory modeling: Theory-based, statistical testing of causal hypotheses

Explanatory power: Strength of relationship in statistical model

Definitions

Predictive modeling: Empirical method for predicting new observations

Predictive power: Ability to accurately predict new observations

Explain / Predict / Describe

Matching Game

Social Sciences (MIS included)

Machine learning

Statistics

Statistical modeling in MIS research

Purpose: test causal theory (“explain”)

Association-based statistical models

Prediction nearly absent

Explanatory modeling à-la MIS

1. Start with a causal theory
2. Generate causal hypotheses on constructs
3. Operationalize constructs → Measurable variables
4. Fit statistical model
5. Statistical inference → Causal conclusions

In MIS,

data analysis is mainly used for testing causal theory.

“If it explains, it predicts”

“Empirical prediction alone is un-scientific”

Some statisticians share this view:

The two goals in analyzing data... I prefer to describe as “management” and “science”. Management seeks profit... Science seeks truth.

Parzen, Statistical Science, 2001

Prediction in top research journals in Information Systems

Predictive goal?
Predictive modeling?
Predictive assessment?

1990-2006: 52 “predictive” articles among 1,072 in top Information Systems journals

Why Predict? for Scientific Research

• generate new theory
• develop measures
• compare theories
• improve theory
• assess relevance
• evaluate predictability

Shmueli & Koppius, “Predictive Analytics in IS Research”, MIS Quarterly, 2011

“A good explanatory model will also predict well”

“You must understand the underlying causes in order to predict”

Philosophy of Science

“Explanation and prediction have the same logical structure”

Hempel & Oppenheim, 1948

“It becomes pertinent to investigate the possibilities of predictive procedures autonomous of those used for explanation”

Helmer & Rescher, 1959

“Theories of social and human behavior address themselves to two distinct goals of science: (1) prediction and (2) understanding”

Dubin, Theory Building, 1969

Why statistical explanatory modeling differs from predictive modeling

Explanatory Model: Test/quantify causal effect for “average” record in population

Predictive Model: Predict new individual observations

Different Scientific Goals

Different generalization

Theory vs. its manifestation


Notation

Theoretical constructs: $\mathcal{X}, \mathcal{Y}$

Causal theoretical model: $\mathcal{Y} = \mathcal{F}(\mathcal{X})$

Measurable variables: $X, Y$

Statistical model: $E(Y) = f(X)$

Four aspects

1. Theory – Data

2. Causation – Association

3. Retrospective – Prospective

4. Bias – Variance

$\mathcal{Y} = \mathcal{F}(\mathcal{X})$

$E(Y) = f(X)$
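The bias-variance aspect can be made concrete with the standard decomposition of expected squared prediction error at a new point $x$. This is a sketch under stated assumptions, not a formula taken from the slide: it assumes an additive-noise model $Y = f(x) + \varepsilon$ with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, writes $\hat{f}$ for the model estimated from a training sample, and takes the new observation's noise to be independent of $\hat{f}$. The symbols $\hat{f}$, $\varepsilon$, $\sigma^2$ and $\mathrm{EPE}$ are introduced here for illustration.

```latex
\mathrm{EPE}(x)
  = E\!\left[\big(Y - \hat{f}(x)\big)^2\right]
  = \underbrace{\sigma^2}_{\text{irreducible noise}}
  + \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{estimation variance}}
```

Read this way, an explanatory analysis is mostly concerned with the bias term (how faithfully $f$ represents the theory), while a predictive analysis cares about the sum of the last two terms. A model that accepts some bias in exchange for lower variance can therefore predict better than the "true" one.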

“The goal of finding models that are predictively accurate differs from the goal of finding models that are true.”

Best explanatory model

Best predictive model

Point #1


Predict ≠ Explain


“we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However… they could not help at all for improving the [predictive] accuracy.”

Bell et al., 2008

Predict ≠ Explain

Explain ≠ Predict

The FDA considers two products bioequivalent if the 90% CI of the relative mean of the generic to brand formulation is within 80%-125%.
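The bioequivalence criterion above can be sketched as a small check. This is only an illustration under loud assumptions: the per-subject log ratios are made-up numbers, the function names `ratio_ci_90` and `is_bioequivalent` are hypothetical, and the interval uses a normal approximation on the log scale rather than the t-based analysis of a real crossover study.

```python
import math
from statistics import NormalDist, mean, stdev

def ratio_ci_90(log_ratios):
    """90% CI for the geometric mean ratio (generic / brand).

    Assumption: normal approximation on the log scale; a real
    analysis would use a t-distribution and the study design.
    """
    n = len(log_ratios)
    center = mean(log_ratios)
    se = stdev(log_ratios) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.95)  # two-sided 90% interval
    return math.exp(center - z * se), math.exp(center + z * se)

def is_bioequivalent(ci, lower=0.80, upper=1.25):
    """True if the whole interval lies inside [lower, upper]."""
    lo, hi = ci
    return lower <= lo and hi <= upper

# Hypothetical per-subject log(generic / brand) exposure ratios.
log_ratios = [0.02, -0.05, 0.10, 0.01, -0.03, 0.06, -0.08, 0.04, 0.00, 0.03]
ci = ratio_ci_90(log_ratios)
print(ci, is_bioequivalent(ci))
```

The check is a statement about the mean ratio in the population. It does not say how any individual patient will respond, which appears to be the contrast the slide draws between explaining and predicting.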

“We are planning to… develop predictive models for bioavailability and bioequivalence”

Lester M. Crawford, 2005, Acting Commissioner of Food & Drugs

“For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients.

But now we know much more: we know that it’s 100% effective in 70%-80% of the patients, and ineffective in the rest.”

Goal Definition

Design & Collection

Data Preparation

EDA

Variables?

Methods?

Evaluation, Validation & Model Selection

Model Use & Reporting

Study design & data collection

Hierarchical data

Observational or experiment?

Primary or secondary data?

Instrument (reliability + validity vs. measurement accuracy)

How much data?

How to sample?

Data Preprocessing

reduced-feature models

missing

partitioning

Data exploration & reduction

PCA, SVD

Interactive visualization

Which Variables?

Multicollinearity?

causation / associations

endogeneity / ex-post availability

A, B, A*B?

Methods / Models

Blackbox / interpretable

Mapping to theory

ensembles

Shrinkage models

variance / bias

Evaluation, Validation & Model Selection

Training data

Empirical model

Holdout data

Predictive power

Over-fitting analysis
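The training/holdout comparison can be illustrated with a deliberately over-flexible model. The sketch below is an assumption-laden toy, not the speaker's procedure: the data are synthetic, the 60/40 split and the k-nearest-neighbour averaging rule are arbitrary choices, and `knn_predict`, `rmse` and `results` are hypothetical names.

```python
import math
import random

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def knn_predict(train_x, train_y, x, k):
    """Average the outcomes of the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

random.seed(0)  # fixed seed so the partition is reproducible
xs = [i / 10 for i in range(200)]  # distinct x values
ys = [math.sin(x) + random.gauss(0, 0.5) for x in xs]  # synthetic outcome

# Partition the records: 60% training, 40% holdout.
idx = list(range(len(xs)))
random.shuffle(idx)
train, hold = idx[:120], idx[120:]
tx, ty = [xs[i] for i in train], [ys[i] for i in train]
hx, hy = [xs[i] for i in hold], [ys[i] for i in hold]

results = {}
for k in (1, 15):
    train_rmse = rmse(ty, [knn_predict(tx, ty, x, k) for x in tx])
    hold_rmse = rmse(hy, [knn_predict(tx, ty, x, k) for x in hx])
    results[k] = (train_rmse, hold_rmse)
    print(f"k={k:2d}  training RMSE={train_rmse:.3f}  holdout RMSE={hold_rmse:.3f}")
```

With k=1 each training record is its own nearest neighbour, so the training RMSE is exactly 0 while the holdout RMSE is not. The gap between the two numbers is the over-fitting signal, and only the holdout figure speaks to predictive power.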

Theoretical model

Empirical model

Data

Validation

Model fit ≠ Explanatory power

Inference

Model Use

test causal theory

• generate new theory
• develop measures
• compare theories
• improve theory
• assess relevance
• evaluate predictability

Predictive performance

Over-fitting analysis

Null hypothesis

Naïve/baseline
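The two reference points, a null hypothesis for inference and a naive baseline for prediction, can give very different pictures of the same model. The following sketch uses synthetic data with a small true effect and large noise. The sample size, effect size, the even split and the helper names `fit_line` and `rmse` are all assumptions made for illustration.

```python
import math
import random

def fit_line(x, y):
    """Least squares for y = a + b*x; returns (a, b, t-statistic of b, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    se_b = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return a, b, b / se_b, 1 - sse / sst

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

random.seed(1)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.1 * xi + random.gauss(0, 1) for xi in x]  # small effect, large noise

# First half for fitting, second half held out.
half = n // 2
x_tr, y_tr, x_ho, y_ho = x[:half], y[:half], x[half:], y[half:]
a, b, t_stat, r2 = fit_line(x_tr, y_tr)

# Test against the null hypothesis (slope = 0) vs. compare with a naive baseline.
model_rmse = rmse(y_ho, [a + b * xi for xi in x_ho])
naive_mean = sum(y_tr) / len(y_tr)
naive_rmse = rmse(y_ho, [naive_mean] * len(y_ho))  # always predict the training mean

print(f"slope t-statistic: {t_stat:.1f}, in-sample R^2: {r2:.3f}")
print(f"holdout RMSE: model {model_rmse:.3f} vs. naive {naive_rmse:.3f}")
```

In a setup like this the slope is estimated with a large t-statistic, so the null hypothesis is rejected comfortably, yet the holdout RMSE is barely better than always predicting the training mean. Statistical significance against a null and predictive gain over a baseline answer different questions.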

Point #2

Explanatory Power ≠ Predictive Power

Cannot infer one from the other

out-of-sample

Performance Metrics

type I,II errors

goodness-of-fit

p-values

over-fitting

costs

prediction accuracy

interpretation

Training vs. holdout

R²

[Figure: Predictive Power (vertical axis) vs. Explanatory Power (horizontal axis)]

The predictive power of an explanatory model has important scientific value

Relevance, reality check, predictability

Current State in Social Sciences (and MIS)

“While the value of scientific prediction… is beyond question… the inexact sciences [do not] have…the use of predictive expertise well in hand.”

Helmer & Rescher, 1959

Distinction blurred

Unfamiliarity with predictive modeling/assessment

Prediction underappreciated

How does this impact Scientific Research?

State-of-the-art in Industry

Distinction blurred

Prediction over-appreciated

“Big Data” synonymous with prediction

How does this impact an organization’s actions?

…and our lives?

What can be done?

Acknowledge difference

Learn/teach prediction

Leverage prediction in research

BUT focus on its scientific uses:

Why Predict? for Scientific Research

• generate new theory
• develop measures
• compare theories
• improve theory
• assess relevance
• evaluate predictability

Shmueli & Koppius, “Predictive Analytics in IS Research”, MIS Quarterly, 2011

Shmueli (2010) “To Explain or To Predict?”, Statistical Science

Shmueli & Koppius (2011) “Predictive Analytics in IS Research”, MISQ