1ST DATA SCIENCE MEETUP IN SEOUL
JINYOUNG KIM & HEEWON JEON

Upload: barto

Post on 25-Feb-2016



TRANSCRIPT

Page 1: 1st DATA SCIENCE MEETUP in SEOUL

1ST DATA SCIENCE MEETUP IN SEOUL

JINYOUNG KIM & HEEWON JEON

Page 2

Data Science?
• Organizations use their data for decision support and to build data-intensive products and services.
• The collection of skills required by organizations to support these functions has been grouped under the term "Data Science." – J. Hammerbacher

Page 3

Taxonomy of Data Science: What
• Data Format: documents, records, sensory and linked data
• Size: small vs. big data
• Dynamics: static, dynamic, streaming
• User: End User (OLTP) vs. Business User (OLAP)
• Domain: Web Service / Health Informatics / Journalism / … / Business Intelligence (for *any* industry)
• Needs: Search / Recommendation, Fraud / Spam Detection, Trend Analysis, Exploratory Data Analysis, Decision Making

Page 4

Taxonomy of Data Science: How
• System Infrastructure: RDBMS / Business Intelligence Platforms, Specialized Platforms for Big Data
• Preparation: Data Schema Design, Exploratory Data Analysis, Preprocessing (filtering / sampling), Training Data Collection
• Analytics: Descriptive Statistics, Predictive Modeling, Specialized Analytics (e.g., ranking model for IR)
• Presentation: End-User / Annotator Interface, Information Visualization

Page 5

Data Scientist?
• "A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning." – H. Mason

Page 6

Data Science Meet-up
• Let’s learn from each other!

• Foster collaboration among participants

• Beginning of a long and fruitful journey!

Page 7

In What Follows…
• Presentation
• Discussion
  • Who should care about Big Data? Everyone?
  • Developing a Career Path as a Data Scientist

Page 8

FROM DATA SCIENCE TO INFORMATION RETRIEVAL

Page 9

Information Retrieval?
• Definition
  • The study and practice of how an automated system can enable its users to access, interact with, and make sense of information.
• Characteristics
  • More than ten blue links of search results
  • Algorithmic solutions for information problems

[Diagram: Information Retrieval / RecSys at the intersection of Large-scale System Infrastructure, Large-scale (Text) Analytics, and UX / HCI / Information Visualization]

Page 10

IR in the Taxonomy of Data Science
• What
  • Data Format (documents, records, sensory and linked data) / Size / Dynamics (static, dynamic, streaming)
  • User / Domain: End User vs. Business User; Web Service / Business Intelligence / Health Informatics
  • Needs: Search / Recommendation, Trend Analysis / Decision Making
• How
  • System: Storage / Transfer / Computation; Platform for Big Data Handling
  • Analytics: Descriptive Statistics, Predictive Modeling / Specialized Analytics
  • Presentation: User Interface, Information Visualization

Page 11

Major Problems in IR & RecSys
• Matching
  • (Keyword) Search: query – document
  • Personalized Search: (user + query) – document
  • Item Recommendation: user – item
  • Contextual Advertising: (user + context) – advertisement
• Quality
  • PageRank / Spam Filtering / Freshness
• Relevance
  • Combination of matching and quality features
  • Evaluation is critical for optimal performance
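As an illustration of the matching problem above, a query and a document can be scored by bag-of-words cosine similarity. This is a hypothetical toy sketch for the transcript, not a method from the slides; real systems use far richer features.

```python
import math
from collections import Counter

def cosine_match(query, document):
    """Toy matching score: cosine similarity of bag-of-words term vectors."""
    q, d = Counter(query.lower().split()), Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

score = cosine_match("data science meetup", "first data science meetup in seoul")
```

The same score function can serve both search (query vs. document) and item recommendation (item description vs. item description), which is one way to see why both fields need a similarity / matching score.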

Page 12

The Great Divide: IR vs. RecSys
• IR
  • Query / Document
  • Provide relevant info.
  • Reactive (given query)
  • SIGIR / CIKM / WSDM
• RecSys
  • User / Item
  • Support decision making
  • Proactive (push item)
  • RecSys / KDD / UMAP
• Common ground
  • Both require a similarity / matching score
  • Personalized search involves user modeling
  • Most recommender systems also involve keyword search
  • Both are parts of the user’s info-seeking process

Page 13

IMPROVED QUERY MODELING FOR STRUCTURED DOCUMENTS

A Sneak Peek at Information Retrieval Research

Page 14

Matching for Structured Document Retrieval [ECIR09,12, CIKM09]
• Field Relevance
  • A different field is important for each query term
  • ‘james’ is relevant when it occurs in <to>
  • ‘registration’ is relevant when it occurs in <subject>
• Why don’t we provide a field operator or an advanced UI?

Page 15

Estimating the Field Relevance
• If user feedback is available
  • Relevant documents provide sufficient information
• If no feedback is available
  • Combine field-level term statistics from multiple sources

[Diagram: field-level statistics (content / title / from/to) from the Collection and the Top-k Docs are combined (+) to approximate (≅) those of the Relevant Docs]
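The idea of combining field-level term statistics from multiple sources can be sketched as a linear interpolation of per-source estimates of P(field | term), renormalized over fields. The numbers and the 50/50 mixing weights below are made up for illustration; they are not from the paper.

```python
def combine_field_estimates(sources, weights):
    """Mix per-source P(field | term) estimates by linear interpolation,
    then renormalize so the combined estimate sums to 1 over fields."""
    fields = next(iter(sources.values())).keys()
    combined = {f: sum(weights[s] * sources[s][f] for s in sources)
                for f in fields}
    z = sum(combined.values())
    return {f: v / z for f, v in combined.items()}

# hypothetical estimates of P(field | 'james') from two sources
sources = {
    "collection": {"subject": 0.2, "to": 0.7, "content": 0.1},
    "top_k":      {"subject": 0.1, "to": 0.8, "content": 0.1},
}
p_field = combine_field_estimates(sources, {"collection": 0.5, "top_k": 0.5})
```

Both sources agree that ‘james’ belongs mostly in the <to> field, so the combined distribution concentrates there.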

Page 16

Retrieval Using the Field Relevance
• Comparison with previous work
  • Previous work scores each query term against fields f1 … fn with fixed per-term field weights w1 … wn shared by all terms
• Ranking in the Field Relevance Model
  • Each query term qi gets its own field-weight distribution P(Fk|qi)
  • Per-term field scores are weighted by P(Fk|qi) and summed over fields; the resulting per-term scores are multiplied across the query terms q1 … qm

[Diagram: query terms q1 … qm matched against fields f1 … fn, with fixed weights w1 … wn (previous work) vs. per-term field weights P(Fk|qi) (field relevance model); "sum" over fields, "multiply" over terms]
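The sum-over-fields, multiply-over-terms scheme above can be sketched in a few lines (in log space, to avoid underflow). The smoothing constant and the toy field statistics are assumptions for illustration, not values from the paper.

```python
import math

def frm_score(query_terms, field_relevance, field_term_prob, eps=1e-9):
    """Sketch of Field Relevance Model ranking: for each query term,
    mix per-field term scores with the per-term field weights P(Fk|qi)
    (sum over fields), then combine terms by multiplication (log-sum)."""
    log_score = 0.0
    for q in query_terms:
        per_term = sum(field_relevance[q][f] * field_term_prob[f].get(q, eps)
                       for f in field_term_prob)
        log_score += math.log(per_term)
    return log_score

# toy per-term field weights, echoing the slide's email example
field_rel = {"james":        {"to": 0.9, "subject": 0.1},
             "registration": {"to": 0.1, "subject": 0.9}}
# two hypothetical documents: terms in the "right" vs. "wrong" fields
doc_right = {"to": {"james": 0.5}, "subject": {"registration": 0.5}}
doc_wrong = {"to": {"registration": 0.5}, "subject": {"james": 0.5}}
```

Under this sketch, the document with ‘james’ in <to> and ‘registration’ in <subject> outscores the one with the fields swapped, which is exactly the behavior the field relevance idea is after.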

Page 17

Evaluating the Field Relevance Model
• Retrieval Effectiveness (Metric: Mean Reciprocal Rank)

          DQL    BM25F  MFLM   FRM-C  FRM-T  FRM-R
TREC      54.2%  59.7%  60.1%  62.4%  66.8%  79.4%
IMDB      40.8%  52.4%  61.2%  63.7%  65.7%  70.4%
Monster   42.9%  27.9%  46.0%  54.2%  55.8%  71.6%

[Chart: MRR per collection (TREC / IMDB / Monster), fixed field weights (DQL, BM25F, MFLM) vs. per-term field weights (FRM-C, FRM-T, FRM-R)]
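Mean Reciprocal Rank, the metric in the table above, is the average over queries of 1 / (rank of the first relevant result). A minimal sketch:

```python
def mean_reciprocal_rank(rankings):
    """MRR over a set of queries. Each ranking is a list of 0/1 relevance
    judgments in rank order; a query with no relevant result contributes 0."""
    total = 0.0
    for judged in rankings:
        for rank, rel in enumerate(judged, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# three queries: first relevant hit at rank 2, rank 1, and never
mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]])  # (1/2 + 1 + 0) / 3 = 0.5
```

MRR only rewards the position of the first relevant result, which suits known-item tasks like the email and job searches evaluated here.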

Page 18

Lessons from a Data Science Perspective
• Understanding user behavior provides key insights
  • The notion of field relevance
• Choice of estimation technique depends on many things
  • Availability of data and labels (e.g., can we use a CRF?)
  • Efficiency concerns (possibility of pre-computation)
• Evaluation is critical for continuous improvement
  • IR people are very serious about datasets and metrics

Page 19

DATA-DRIVEN PURSUIT OF HAPPINESS

Page 20

LiFiDeA (= Life + Idea) Project
• Goal
  • Improved Personal Info Mgmt. => Self-Improvement
  • Collect behavioral data (schedule and tasks)
  • Correlate them with subjective judgments of happiness
• Workflow
  • Write task-centric journals on Evernote
  • Weekly data migration into a spreadsheet
  • Statistical analysis using Excel charts and R
• Findings
  • Tracking itself helps, but not for long
  • Keeping the right amount of tension is critical

[Image: "My Source of Inspiration"]
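The correlation step in the workflow above (e.g., happiness by wake-up time) boils down to a plain Pearson correlation between two tracked series. The week of data below is entirely invented for illustration; the slides report the analysis being done in Excel and R.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical week: wake-up hour vs. self-rated happiness (1-10)
wake_up   = [6.5, 7.0, 8.5, 6.0, 9.0, 7.5, 6.5]
happiness = [8,   7,   5,   8,   4,   6,   7]
r = pearson(wake_up, happiness)
```

A strongly negative r here would suggest later wake-ups go with lower self-rated happiness, though with a sample this small, and correlation being no proof of causation, it is at best a prompt for further tracking.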

Page 21

My Self-tracking Efforts
• Life-Optimization Project (2002~2006)
  • A software dev. project for myself, over 4 years
  • Covers all aspects of personal info mgmt.
  • Core component of my Ph.D. application

Page 22

My Self-tracking Efforts
• LiFiDeA Project (2011-2012)

[Screenshots: Raw Data on Evernote; Data Moved onto Excel Sheet; Happiness by Place; Happiness by Wake-up Time]

Page 23

Lessons Learned
• Combine existing solutions whenever possible
  • "Done is better than perfect" applies here
• *You* should own your data, not the app you use
  • Apps can come and go, but the data should stay
• Minimize data-collection effort for sustainability
  • Integrate self-tracking into your daily routine
  • "Effort << Benefit" should hold at all times
• Communicating regularly helps you make progress
  • Writing has been the best way to learn about the subject

Page 24

OPTIONAL SLIDES

Page 25

Criteria for Choosing IR vs. RecSys
• Favoring IR
  • User’s willingness to express information needs
  • Lack of evidence about the user himself
• Favoring RecSys
  • Confidence in predicting the user’s preference
  • Availability of matching items to recommend

Page 26

The IR Way: Rich User Interaction
The HCIR Way: Rich User Modeling, from Query to Session

[Diagram: USER and SYSTEM exchange repeated action/response loops. The user filters, browses, and gives relevance feedback; the system returns filtering conditions and related items. The interaction history feeds a user model built from Profile / Context / Behavior.]

• Personalized results and rich interactions are complementary, and both are needed in most scenarios.
• No real distinction between IR vs. HCI, or IR vs. RecSys

Page 27

The Great Divide: IR vs. CHI
• IR
  • Query / Document
  • Relevant Results
  • Ranking / Suggestions
  • Feature Engineering
  • Batch Evaluation (TREC)
  • SIGIR / CIKM / WSDM
• CHI
  • User / System
  • User Value / Satisfaction
  • Interface / Visualization
  • Human-centered Design
  • User Study
  • CHI / UIST / CSCW

Can we learn from each other?