1st Data Science Meetup in Seoul
1
1ST DATA SCIENCE MEETUP IN SEOUL
JINYOUNG KIM & HEEWON JEON
2
Data Science?
• Organizations use their data for decision support and to build data-intensive products and services.
• The collection of skills required by organizations to support these functions has been grouped under the term "Data Science". – J. Hammerbacher
3
Taxonomy of Data Science
• What
  • Data Format: documents, records, sensory and linked data
  • Size: small vs. big data
  • Dynamics: static, dynamic, streaming
  • User: end user (OLTP) / business user (OLAP)
  • Domain: web service / health informatics / journalism / … / business intelligence (for *any* industry)
  • Needs: search / recommendation, fraud / spam detection, trend analysis, exploratory data analysis, decision making
4
Taxonomy of Data Science
• How
  • System infrastructure: RDBMS / business intelligence platforms; specialized platforms for big data
  • Preparation: data schema design, exploratory data analysis, preprocessing (filtering / sampling), training data collection
  • Analytics: descriptive statistics, predictive modeling, specialized analytics (e.g., ranking model for IR)
  • Presentation: end-user / annotator interface, information visualization
5
Data Scientist?
• A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. – H. Mason
6
Data Science Meet-up
• Let’s learn from each other!
• Foster collaboration among participants
• Beginning of a long and fruitful journey!
7
In What Follows…
• Presentation
• Discussion
  • Who should care about big data: everyone?
  • Developing a career path as a data scientist
8
FROM DATA SCIENCETO INFORMATION RETRIEVAL
9
Information Retrieval?
• Definition
  • The study and practice of how an automated system can enable its users to access, interact with, and make sense of information.
• Characteristics
  • More than ten blue links of search results
  • Algorithmic solutions for information problems
[Diagram: Information Retrieval / RecSys at the intersection of large-scale system infrastructure, large-scale (text) analytics, and UX / HCI / information visualization]
10
IR in the Taxonomy of Data Science
• What
  • Data Format: documents, records, sensory and linked data; Size; Dynamics: static, dynamic, streaming
  • User / Domain: end user vs. business user; web service / business intelligence / health informatics
  • Needs: search / recommendation, trend analysis / decision making
• How
  • System: storage / transfer / computation; platform for big data handling
  • Analytics: descriptive statistics, predictive modeling / specialized analytics
  • Presentation: user interface, information visualization
11
Major Problems in IR & RecSys
• Matching
  • (Keyword) search: query – document
  • Personalized search: (user + query) – document
  • Item recommendation: user – item
  • Contextual advertising: (user + context) – advertisement
• Quality
  • PageRank / spam filtering / freshness
• Relevance
  • Combination of matching and quality features
  • Evaluation is critical for optimal performance
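A minimal sketch of the last bullet: relevance as a weighted combination of matching features (query–document similarity) and quality features (PageRank, freshness). The feature names and weights below are invented for illustration; in practice the weights are learned from labeled data.

```python
# Hypothetical sketch: relevance = weighted combination of matching
# features and quality features. Feature names and weights are made up;
# real systems learn these weights from relevance judgments.

def relevance_score(matching, quality, weights):
    """Linear combination of feature dicts; missing weights count as 0."""
    features = {**matching, **quality}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

matching = {"bm25": 12.3, "title_match": 1.0}
quality = {"pagerank": 0.42, "freshness": 0.8}
weights = {"bm25": 0.5, "title_match": 2.0, "pagerank": 3.0, "freshness": 1.0}

score = relevance_score(matching, quality, weights)
```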
12
The Great Divide: IR vs. RecSys
• IR
  • Query / document
  • Provide relevant info.
  • Reactive (given a query)
  • SIGIR / CIKM / WSDM
• RecSys
  • User / item
  • Support decision making
  • Proactive (push items)
  • RecSys / KDD / UMAP
• Common ground
  • Both require a similarity / matching score
  • Personalized search involves user modeling
  • Most RecSys also involve keyword search
  • Both are parts of the user’s info-seeking process
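The shared ingredient named above is a similarity / matching score between two sparse vectors: query vs. document terms in IR, user vs. item features in RecSys. A sketch using cosine similarity, one common choice (the vectors below are invented examples):

```python
import math

# Illustrative sketch (not from the slides): both IR and RecSys reduce
# to scoring how well two sparse feature vectors match. Cosine similarity
# is one standard matching score.

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

query = {"data": 1.0, "science": 1.0}
document = {"data": 2.0, "science": 1.0, "meetup": 1.0}
sim = cosine_similarity(query, document)
```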
13
IMPROVED QUERY MODELING FOR STRUCTURED DOCUMENTS
A Sneak Peek at Information Retrieval Research
14
Matching for Structured Document Retrieval [ECIR09, 12, CIKM09]
• Field Relevance
  • A different field is important for each query term
  • ‘james’ is relevant when it occurs in <to>
  • ‘registration’ is relevant when it occurs in <subject>
• Why don’t we provide a field operator or an advanced UI?
15
Estimating the Field Relevance
• If user provides feedback
  • Relevant documents provide sufficient information
• If no feedback is available
  • Combine field-level term statistics from multiple sources
[Diagram: field-level statistics (content / title / from–to) from the collection plus the top-k documents approximate those of the relevant documents]
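A minimal sketch of the no-feedback case above: estimate how relevant each field is for a query term by mixing field-level term statistics from the whole collection with those from the top-k retrieved documents. The function name, mixing parameter, and counts are invented for illustration.

```python
# Hypothetical sketch of combining field-level term statistics from two
# sources (collection and top-k documents) into P(field | term).
# The interpolation weight `lam` and the counts are made up.

def field_relevance(counts_collection, counts_topk, lam=0.5):
    """Return P(field | term) as a normalized mixture of two sources."""
    fields = set(counts_collection) | set(counts_topk)

    def normalize(counts):
        total = sum(counts.values())
        if total == 0:
            return {f: 0.0 for f in fields}
        return {f: counts.get(f, 0) / total for f in fields}

    p_coll = normalize(counts_collection)
    p_topk = normalize(counts_topk)
    mixed = {f: lam * p_coll[f] + (1 - lam) * p_topk[f] for f in fields}
    z = sum(mixed.values())
    return {f: v / z for f, v in mixed.items()}

# 'james' occurs mostly in <to> among the top-k documents:
p = field_relevance({"content": 10, "title": 2, "to": 8},
                    {"content": 1, "title": 0, "to": 9})
```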
16
Retrieval Using the Field Relevance
• Comparison with previous work
  • Previous work: one fixed weight w_j per field f_j, shared by every query term q_1 … q_m
  • Field relevance model: a per-term field weight P(F_j | q_i) for each query term
• Ranking in the field relevance model
  • For each query term, per-term field scores are mixed with the per-term field weights (sum over fields), and the resulting term scores are multiplied
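The sum/multiply combination described above can be sketched as follows. Symbols follow the slide (field scores mixed per term, then multiplied across terms, here in log space); the probabilities below are invented numbers, not results from the paper.

```python
import math

# Sketch of the field relevance model's ranking rule: for each query term,
# mix field-level scores with per-term field weights P(F_j | q_i) (sum over
# fields), then multiply across terms (summed in log space). All numbers
# are invented for illustration.

def frm_score(field_scores_per_term, field_weights_per_term):
    """field_scores_per_term[i][f] = score of term i in field f;
    field_weights_per_term[i][f] = P(F_f | q_i), summing to 1 per term."""
    log_score = 0.0
    for scores, weights in zip(field_scores_per_term, field_weights_per_term):
        per_term = sum(weights[f] * scores[f] for f in scores)
        log_score += math.log(per_term)
    return log_score

# Two-term query over fields {title, content}:
scores = [{"title": 0.20, "content": 0.05}, {"title": 0.01, "content": 0.10}]
weights = [{"title": 0.8, "content": 0.2}, {"title": 0.1, "content": 0.9}]
s = frm_score(scores, weights)
```

With fixed per-field weights (previous work), `weights` would be the same dict for every term; the per-term weights are what distinguish the field relevance model.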
17
Evaluating the Field Relevance Model
• Retrieval effectiveness (metric: mean reciprocal rank)

          DQL     BM25F   MFLM    FRM-C   FRM-T   FRM-R
TREC      54.2%   59.7%   60.1%   62.4%   66.8%   79.4%
IMDB      40.8%   52.4%   61.2%   63.7%   65.7%   70.4%
Monster   42.9%   27.9%   46.0%   54.2%   55.8%   71.6%

[Chart: the same numbers plotted per collection; fixed field weights (DQL, BM25F, MFLM) vs. per-term field weights (FRM-C, FRM-T, FRM-R)]
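The metric in the table, mean reciprocal rank (MRR), averages 1 / rank of the first relevant result over all queries. A sketch with invented ranked lists:

```python
# Mean Reciprocal Rank: for each query take 1 / rank of the first
# relevant document (0 if none is retrieved), then average over queries.
# The ranked lists and judgments below are invented examples.

def mean_reciprocal_rank(ranked_results, relevant):
    """ranked_results: one ranked list of doc ids per query.
    relevant: one set of relevant doc ids per query, aligned by index."""
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        rr = 0.0
        for rank, doc in enumerate(results, start=1):
            if doc in rel:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_results)

mrr = mean_reciprocal_rank(
    [["d1", "d2", "d3"], ["d4", "d5", "d6"]],  # rankings for two queries
    [{"d2"}, {"d4"}],                          # relevant docs per query
)
```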
18
Lessons from a Data Science Perspective
• Understanding user behavior provides key insights
  • The notion of field relevance
• The choice of estimation technique depends on many things
  • Availability of data and labels (e.g., can we use a CRF?)
  • Efficiency concerns (possibility of pre-computation)
• Evaluation is critical for continuous improvement
  • IR people are very serious about datasets and metrics
19
DATA-DRIVEN PURSUIT OF HAPPINESS
20
LiFiDeA (= Life + Idea) Project
• Goal
  • Improved personal info mgmt. => self-improvement
  • Collect behavioral data (schedule and tasks)
  • Correlate them with subjective judgments of happiness
• Workflow
  • Write task-centric journals on Evernote
  • Weekly data migration into a spreadsheet
  • Statistical analysis using Excel charts and R
• Findings
  • Tracking itself helps, but not for long
  • Keeping the right amount of tension is critical
[Image: my source of inspiration]
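The analysis step in the workflow above correlates tracked variables (e.g., wake-up time) with self-rated happiness; the slides use Excel and R for this. A minimal Pearson-correlation sketch in Python, with invented sample data:

```python
import math

# Sketch of correlating a tracked variable with happiness ratings.
# The function and the sample data are illustrative only; the project
# itself used Excel charts and R for this step.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

wake_up_hour = [6, 7, 8, 9, 10]   # invented tracking data
happiness = [8, 7, 6, 5, 4]       # invented self-ratings (1-10)
r = pearson(wake_up_hour, happiness)
```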
21
My Self-tracking Efforts
• Life-optimization Project (2002–2006)
  • A software dev. project for myself, run for 4 years
  • Covers all aspects of personal info mgmt.
  • Core component of my Ph.D. application
22
My Self-tracking Efforts
• LiFiDeA Project (2011–2012)
[Screenshots: raw data on Evernote; data moved onto an Excel sheet; charts of happiness by place and happiness by wake-up time]
23
Lessons Learned
• Combine existing solutions whenever possible
  • “Done is better than perfect” applies here
• *You* should own your data, not the app you use
  • Apps can come and go, but the data should stay
• Minimize data-collection effort for sustainability
  • Integrate self-tracking into your daily routine
  • “Effort << Benefit” should hold at all times
• Communicating regularly helps you make progress
  • Writing has been the best way to learn about the subject
24
OPTIONAL SLIDES
25
Criteria for Choosing IR vs. RecSys
• IR
  • The user’s willingness to express information needs
  • Lack of evidence about the user himself
• RecSys
  • Confidence in predicting the user’s preference
  • Availability of matching items to recommend
26
The IR Way: Rich User Modeling / The HCIR Way: Rich User Interaction, from Query to Session
[Diagram: user and system exchange repeated action–response cycles; the interaction history (profile / context / behavior) feeds a user model; filtering / browsing and relevance feedback go to the system, which returns filtering conditions, related items, …]
• Personalized results and rich interactions are complementary, yet both are needed in most scenarios.
• No real distinction between IR vs. HCI, and IR vs. RecSys.
27
The Great Divide: IR vs. CHI
• IR
  • Query / document
  • Relevant results
  • Ranking / suggestions
  • Feature engineering
  • Batch evaluation (TREC)
  • SIGIR / CIKM / WSDM
• CHI
  • User / system
  • User value / satisfaction
  • Interface / visualization
  • Human-centered design
  • User studies
  • CHI / UIST / CSCW
• Can we learn from each other?