stories behind kaggle competitions

stories behind kaggle competitions wendy kan, data scientist [email protected] @wendykan 5/19/2015 @

Upload: sergey-makarevich

Post on 12-Aug-2015


Category:

Data & Analytics



TRANSCRIPT

Page 1: stories behind kaggle competitions

stories behind kaggle competitions

wendy kan, data scientist

@wendykan

5/19/2015 @

Page 2: stories behind kaggle competitions

kaggle runs public machine learning competitions

Page 3: stories behind kaggle competitions

we worked with clients/hosts on various types of problems and data of different sizes

Page 4: stories behind kaggle competitions

my job as a data scientist at kaggle

Page 5: stories behind kaggle competitions

“data science is not just kaggle competitions”

whyyyy???

Page 6: stories behind kaggle competitions

machine learning processes

● Business Problem
● Collect Data
● Transform Data
● Dataset Splitting
● Evaluation Metric
● Feature Extraction
● Feature Selection
● Model Training
● Model Ensembling
● Methodology Selection
● Production System
● Ongoing Optimization

Page 7: stories behind kaggle competitions

not every problem can be turned into a kaggle competition

Page 8: stories behind kaggle competitions
Page 9: stories behind kaggle competitions

size matters! where bigger is better (most of the time)

Page 10: stories behind kaggle competitions

data cleaning/formatting:

● easy to make a quick submission
● boosts participation
● (too) clean data kills creativity

Page 11: stories behind kaggle competitions

data privacy/anonymization

Page 12: stories behind kaggle competitions

metric: how do you measure success?

● Classification - AUC/ Logarithmic Loss/Accuracy

● Regression - RMSE/MAE

● Ranking - MAP/NDCG

● Other / Custom

https://www.kaggle.com/wiki/Metrics
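To make the metric families above concrete, here is a minimal pure-Python sketch of a few of them (accuracy, binary logarithmic loss, AUC, RMSE, MAE). The function names and signatures are illustrative assumptions, not Kaggle's own scoring code, and the AUC version uses a simple pairwise comparison that is only practical for small inputs.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logarithmic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half.

    Assumes at least one positive (1) and one negative (0) label.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error for regression."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Note that the classification metrics take different inputs: accuracy scores hard labels, log loss scores probabilities, and AUC only looks at the ordering of the scores. Choosing one over another changes what participants optimize for.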

Page 13: stories behind kaggle competitions

the design of a competition shapes how people are going to solve a problem

Page 14: stories behind kaggle competitions

Splitting dataset

● training/test

● public/private
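The two splits above can be sketched as follows: the data is first divided into training and test sets, and the test set is then divided into a public part (scored on the leaderboard during the competition) and a private part (used for final standings). This is a rough illustration only; the function name, the fractions, and the purely random shuffle are assumptions, and a real competition may need stratified or grouped splitting instead.

```python
import random


def split_competition_data(rows, test_fraction=0.3, public_fraction=0.3, seed=0):
    """Return (train, public_test, private_test) as disjoint lists.

    test_fraction: share of all rows held out as the test set (assumed value).
    public_fraction: share of the test set scored publicly (assumed value).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    n_public = int(len(test) * public_fraction)
    return train, test[:n_public], test[n_public:]
```

Keeping the private part hidden until the end is what discourages participants from overfitting to the public leaderboard.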

Page 15: stories behind kaggle competitions

Time series data
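With time series data, a random split would let a model train on the future and predict the past. One common approach, sketched here under assumptions, is to split on a cutoff time so that every training record precedes every test record. The record layout and the `timestamp` key are hypothetical.

```python
def time_based_split(records, cutoff, time_key="timestamp"):
    """Split records so training data strictly precedes the cutoff.

    records: list of dicts, each with a comparable time value under time_key.
    cutoff: first time value that belongs to the test set.
    """
    train = [r for r in records if r[time_key] < cutoff]
    test = [r for r in records if r[time_key] >= cutoff]
    return train, test
```

For example, `time_based_split(records, cutoff=8)` on records with integer timestamps 0 to 9 would put timestamps 0 to 7 in training and 8 to 9 in test.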

Page 16: stories behind kaggle competitions

data leakage

“Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from”

“the concept of identifying and harnessing leakage has been openly addressed as one of three key aspects for winning data mining competitions”

“Leakage in Data Mining: Formulation, Detection, and Avoidance”, S. Kaufman et al.
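As a rough illustration of how leakage might be screened for before a competition launches, the sketch below flags any single feature that separates a binary target almost perfectly on its own, such as a row ID that happens to be sorted by label. This is a hypothetical heuristic, not the method from the paper quoted above; the function names and the 0.95 threshold are assumptions.

```python
def single_feature_auc(y_true, values):
    """Pairwise AUC of one numeric feature against a binary (0/1) target."""
    pos = [v for t, v in zip(y_true, values) if t == 1]
    neg = [v for t, v in zip(y_true, values) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_suspicious_features(rows, target_key, threshold=0.95):
    """Return {feature: strength} for features that nearly determine the target.

    rows: list of dicts with numeric features and a 0/1 target.
    Strength is max(AUC, 1 - AUC), so inverse relationships are caught too.
    """
    y = [r[target_key] for r in rows]
    flagged = {}
    for name in rows[0]:
        if name == target_key:
            continue
        score = single_feature_auc(y, [r[name] for r in rows])
        strength = max(score, 1 - score)
        if strength >= threshold:
            flagged[name] = strength
    return flagged
```

A flagged feature is not proof of leakage, since some legitimate features are simply very predictive, but it is a cheap signal that a column deserves a closer look.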

Page 17: stories behind kaggle competitions

do you have thousands of people reviewing your performance at work 24/7?

I do.

Page 18: stories behind kaggle competitions

1. people make mistakes. honesty is the best policy.

Page 19: stories behind kaggle competitions

2. crowdsourcing is powerful. anything that can go wrong will go wrong.

Page 20: stories behind kaggle competitions