Big Data – To Explain or To Predict? Talk at U Toronto's Rotman School of Management

Post on 14-Feb-2017


Big Data – To Explain or To Predict?

Big Data Experts Speaker Series, Rotman School of Management, U Toronto, March 2016

Galit Shmueli (徐茉莉)

www.galitshmueli.com

❶ 1994-2000 Israel Institute of Technology: MSc + PhD, Statistics

❷ 2000-2002 Carnegie Mellon Univ.: Visiting Assistant Prof., Dept. of Statistics

❸ 2002-2012 Univ. of Maryland College Park: Assistant then Associate Prof. of Statistics & Management Science, R H Smith School of Business

2008-2014 Rigsum Institute (Bhutan): Co-Director, Rigsum Research Lab

❹ 2011-2014 Indian School of Business: SRITNE Chaired Prof. of Data Analytics, Associate Prof. of Statistics & Info Systems

2014-… NTHU, Institute of Service Science: Director, Center for Service Innovation & Analytics

Research in Data Analytics

‘Entrepreneurial’ statistical & data mining modeling (for today’s problems)

Interdisciplinary modeling

Statistical Strategy

To Explain or To Predict?

Information Quality

Regression with Big Data

Road Map

Definitions

Explanatory-dominated social sciences

Explanatory modeling ≠ predictive modeling

Why?

Different modeling paths

Explanatory power vs. predictive power

Implications

Definitions

Explanatory modeling: Theory-based, statistical testing of causal hypotheses

Explanatory power: Strength of relationship in statistical model

Definitions

Predictive modeling: Empirical method for predicting new observations

Predictive power: Ability to accurately predict new observations

Explain

Predict

Describe

Matching Game

Social Sciences

Machine learning

Statistics

Statistical modeling in social sciences & management research

Purpose: test causal theory (“explain”)

Association-based statistical models

Prediction nearly absent

Start with a causal theory

Generate causal hypotheses on constructs

Operationalize constructs → Measurable variables

Fit statistical model

Statistical inference → Causal conclusions

Classic journal paper

In the social sciences, data analysis is mainly used for testing causal theory.

“If it explains, it predicts”

“Empirical prediction alone is un-scientific”

Some statisticians share this view:

The two goals in analyzing data... I prefer to describe as “management” and “science”. Management seeks profit... Science seeks truth.

- Parzen, Statistical Science 2001

Prediction in top research journals in Information Systems

Predictive goal? Predictive modeling? Predictive assessment?

1990-2006

52 “predictive” articles among 1,072 in Information Systems top journals

“A good explanatory model will also predict well”

“You must understand the underlying causes in order to predict”

Meanwhile… in industry

Philosophy of Science

“Explanation and prediction have the same logical structure”

Hempel & Oppenheim, 1948

“It becomes pertinent to investigate the possibilities of predictive procedures autonomous of those used for explanation”

Helmer & Rescher, 1959

“Theories of social and human behavior address themselves to two distinct goals of science: (1) prediction and (2) understanding”

Dubin, Theory Building, 1969

Why statistical explanatory modeling differs from predictive modeling

Explanatory Model: Test/quantify causal effect for “average” record in population

Predictive Model: Predict new individual observations

Different Scientific Goals

Different generalization

Theory vs. its manifestation


Four aspects

1. Theory – Data

2. Causation – Association

3. Retrospective – Prospective

4. Bias – Variance
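As a hedged illustration of the fourth aspect: the standard decomposition of expected squared prediction error at a point x can be written as below. The notation is an assumption, not taken from the slides: f is the true function, f̂ the fitted model, and Y the response.

```latex
\mathbb{E}\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\operatorname{Var}(Y \mid x)}_{\text{irreducible noise}}
  + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{estimation variance}}
```

Explanatory modeling typically aims to minimize the bias term, while predictive modeling minimizes the sum of bias² and estimation variance, so a deliberately "wrong" model can sometimes predict better.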

“The goal of finding models that are predictively accurate differs from the goal of finding models that are true.”

Best explanatory model

Best predictive model

Point #1

Predict ≠ Explain


“we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However… they could not help at all for improving the [predictive] accuracy.”

Bell et al., 2008

Explain ≠ Predict

The FDA considers two products bioequivalent if the 90% CI of the relative mean of the generic to brand formulation is within 80%-125%.

“We are planning to… develop predictive models for bioavailability and bioequivalence”

Lester M. Crawford, Acting Commissioner of Food & Drugs, 2005
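A minimal sketch of the interval rule quoted above, assuming the 90% confidence interval for the generic-to-brand mean ratio has already been computed elsewhere. The function name and the example intervals are hypothetical, and this is a simplification rather than the actual regulatory procedure.

```python
def is_bioequivalent(ci_lower: float, ci_upper: float,
                     low: float = 0.80, high: float = 1.25) -> bool:
    """Return True if the whole 90% CI of the mean ratio lies within [low, high]."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    return low <= ci_lower and ci_upper <= high


# Hypothetical intervals, for illustration only.
print(is_bioequivalent(0.92, 1.08))  # interval fully inside 80%-125%
print(is_bioequivalent(0.78, 1.10))  # lower bound falls below 80%
```

The rule is a statement about population means. It says nothing about how an individual patient will respond, which is the gap the slide points to.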

“For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients.

But now we know much more: we know that it’s 100% effective in 70%-80% of the patients, and ineffective in the rest.”
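The Tamoxifen quote can be sketched with two hypothetical populations of ten patients that share the same average effect. The 80% responder share is an assumption taken from the upper end of the quoted range, and the per-patient values are illustrative only.

```python
import math
from statistics import mean

# Hypothetical population A: every patient gets an 80% effect.
uniform_population = [0.8] * 10

# Hypothetical population B: fully effective for 8 patients, ineffective for 2.
mixed_population = [1.0] * 8 + [0.0] * 2

avg_a = mean(uniform_population)
avg_b = mean(mixed_population)

# The average (population-level) effect is the same in both cases...
print(avg_a, avg_b, math.isclose(avg_a, avg_b))

# ...but the individual-level outcomes that a predictive model targets differ.
print(sorted(set(uniform_population)), sorted(set(mixed_population)))
```

An explanatory analysis of the "average" record cannot tell these two cases apart; predicting for a new individual requires knowing which case holds.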

Goal Definition

Design & Collection

Data Preparation

EDA

Variables?

Methods?

Evaluation, Validation & Model Selection

Model Use & Reporting

Study design & data collection

Hierarchical data

Observational or experiment?

Primary or secondary data?

Instrument (reliability + validity vs. measurement accuracy)

How much data?

How to sample?

Data Preprocessing

reduced-feature models

missing

partitioning

Data exploration, viz, reduction

PCA

Factor Analysis (interpretable)

Dimension Reduction (fast, small)

Which Variables?

Multicollinearity?

causation vs. associations

endogeneity vs. ex-post availability

A, B, A*B?

ensembles

Shrinkage models

variance bias

Methods / Models

Blackbox / interpretable

Mapping to theory

Evaluation, Validation & Model Selection

Training data

Empirical model

Holdout data

Predictive power

Over-fitting analysis
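A minimal sketch of an over-fitting analysis with a training/holdout partition, using only the Python standard library. The simulated data, the 200/100 split, and the two models (a least-squares line and a 1-nearest-neighbor predictor) are all assumptions chosen for illustration.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def one_nn_predict(x_train, y_train, x_new):
    """Predict each new point with the response of its nearest training point."""
    preds = []
    for xn in x_new:
        i = min(range(len(x_train)), key=lambda j: abs(x_train[j] - xn))
        preds.append(y_train[i])
    return preds


# Simulated data (assumption): a linear signal plus unit-variance noise.
random.seed(0)
x = [random.uniform(0, 10) for _ in range(300)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

# Partition into training and holdout sets.
x_tr, y_tr, x_ho, y_ho = x[:200], y[:200], x[200:], y[200:]

a, b = fit_line(x_tr, y_tr)
line_train = rmse(y_tr, [a + b * xi for xi in x_tr])
line_holdout = rmse(y_ho, [a + b * xi for xi in x_ho])

knn_train = rmse(y_tr, one_nn_predict(x_tr, y_tr, x_tr))
knn_holdout = rmse(y_ho, one_nn_predict(x_tr, y_tr, x_ho))

print(f"line  train RMSE={line_train:.3f}  holdout RMSE={line_holdout:.3f}")
print(f"1-NN  train RMSE={knn_train:.3f}  holdout RMSE={knn_holdout:.3f}")
```

The 1-nearest-neighbor model reproduces its training data exactly, so its training error says nothing about predictive power. Only the gap between training and holdout error reveals the over-fitting.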

Theoretical model

Empirical model

Data

Validation

Model fit ≠ Explanatory power

Inference

Model Use: Industry

Identify causal factors

generate predictions for new data

Predictive performance

Over-fitting analysis

Null hypothesis

Naïve/baseline

Inference

Model Use (Science)

test causal theory

generate new theory

develop measures

compare theories

improve theory

assess relevance

Evaluate predictability

Predictive performance

Over-fitting analysis

Null hypothesis

Naïve/baseline

Point #2

Explanatory Power ≠ Predictive Power

Cannot infer one from the other

out-of-sample

Performance Metrics

type I,II errors

goodness-of-fit

p-values

over-fitting

costs

prediction accuracy

interpretation

Training vs. holdout
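The contrast between in-sample metrics (p-values, goodness-of-fit) and out-of-sample metrics (prediction accuracy against a naïve baseline) can be sketched as follows. The simulated data, the small true slope of 0.05, and the sample sizes are assumptions for illustration; the p-value uses a normal approximation, which is reasonable only because the sample is large.

```python
import math
import random


def ols_summary(x, y):
    """Fit y = a + b*x by least squares; return (a, b, t statistic of b, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    a = my - b * mx
    sse = syy - b * sxy                      # residual sum of squares
    se_b = math.sqrt(sse / (n - 2) / sxx)    # standard error of the slope
    return a, b, b / se_b, 1.0 - sse / syy


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Simulated data (assumption): a weak but real linear effect, large sample.
random.seed(1)
n = 40000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.05 * xi + random.gauss(0, 1) for xi in x]
x_tr, y_tr, x_ho, y_ho = x[:20000], y[:20000], x[20000:], y[20000:]

# Explanatory-style assessment: inference on the training sample.
a, b, t_stat, r2 = ols_summary(x_tr, y_tr)
p_value = math.erfc(abs(t_stat) / math.sqrt(2))  # two-sided, normal approximation

# Predictive-style assessment: holdout error vs. a naive baseline (training mean).
rmse_model = rmse(y_ho, [a + b * xi for xi in x_ho])
naive = sum(y_tr) / len(y_tr)
rmse_naive = rmse(y_ho, [naive] * len(y_ho))

print(f"slope={b:.4f}  t={t_stat:.2f}  p={p_value:.2e}  R^2={r2:.4f}")
print(f"holdout RMSE: model={rmse_model:.4f}  naive={rmse_naive:.4f}")
```

In this setup the slope is highly significant, yet the holdout error is almost identical to that of the naïve baseline. A small p-value therefore does not imply useful predictive power.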

R²

Explanatory Power

Predictive Power

The predictive power of an explanatory model has important scientific value

Relevance, reality check, predictability

Current state in academia (social sciences and management)

“While the value of scientific prediction… is beyond question… the inexact sciences [do not] have…the use of predictive expertise well in hand.”

Helmer & Rescher, 1959

Distinction blurred

Unfamiliarity with predictive modeling/assessment

Prediction underappreciated

State-of-the-art in industry

Distinction blurred

Prediction over-appreciated

“Big Data” synonymous with prediction

How does this impact scientific research?

How does this impact organizations’ actions?

…and our lives?

Will the customer pay?

What causes non-payment?

Explain

Predict

Predict

Potential explanations

Shmueli (2010) “To Explain or To Predict?”, Statistical Science

Shmueli & Koppius (2011) “Predictive Analytics in IS Research”, MISQ
