Kaggle Bosch competition retrospective
Bosch Production Line Performance 2017/1/20
hskksk
1
• Result
2
Bosch Production Line Performance
3
4
In this competition, Bosch is challenging Kagglers to predict internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable Bosch to bring quality products at lower costs to the end user.
5
• Start: 2016/8/17
• End: 2016/11/12
• 2016/9
6
Submissions are evaluated on the Matthews correlation coefficient (MCC) between the predicted and the observed response. The MCC is given by:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
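The metric above can be computed directly from the confusion-matrix counts; a minimal sketch in Python (function name is illustrative):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is defined as 0 when any marginal count is zero.
    return num / den if den else 0.0

print(mcc(tp=90, tn=900, fp=10, fn=100))
```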
7
Columns are named L{x}_S{y}_F{z} (feature) and L{x}_S{y}_D{z} (date); the date column L{x}_S{y}_D{z} is the timestamp of the feature L{x}_S{y}_F{z-1}. 8
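The naming scheme can be unpacked mechanically; a small sketch (regex and function name are my own):

```python
import re

def parse_column(name):
    """Split a Bosch column name like 'L3_S32_F3850' into
    (line, station, kind, index), where kind is 'F' (feature)
    or 'D' (date/timestamp)."""
    m = re.fullmatch(r"L(\d+)_S(\d+)_([A-Z])(\d+)", name)
    if not m:
        raise ValueError(f"unexpected column name: {name}")
    line, station, kind, idx = m.groups()
    return int(line), int(station), kind, int(idx)

print(parse_column("L3_S32_F3850"))  # (3, 32, 'F', 3850)
```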
9
0: 1,176,868 (99.4%)
1: 6,879 (0.6%)
Extremely imbalanced data
10
Result
11
• g_votte
• tkm
• hskksk
12
LB(hskksk only)
13
LB( )
14
Public Leaderboard
15
Private Leaderboard
16
Top Ten!
17
18
• LB(CV )
• ( )
19
20
• GCP with R/Python
• Rmarkdown
• xgboost
• github
GCP1
21
CV
22
LB
• 30submit LB
•LB
•
23
1. Cross-Validation fold
2.
3. MCC
24
1• Cross-Validation fold
• Predicting Red Hat Business Value
25
Redhat• CV
• CV CV ( )
•
• fold
26
•→
• ID→ ID
•
27
2•
•
28
qqplot
•
• Station32, 33OK
•
29
30
3
MCC• Gaussian Process LB
•
• ,mcc
• LB
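The slides refer to choosing the decision threshold that maximizes MCC. A minimal sketch of a grid search over out-of-fold predicted probabilities (all names are illustrative; the Gaussian-Process-based search mentioned above is not reproduced here):

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0

def best_threshold(y_true, y_prob, steps=100):
    """Return (threshold, mcc) maximizing MCC on out-of-fold predictions."""
    best = (0.5, -1.0)
    for i in range(1, steps):
        t = i / steps
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        score = mcc_from_counts(tp, tn, fp, fn)
        if score > best[1]:
            best = (t, score)
    return best
```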
31
Feature engineering
32
• 25
• 3154
33
1. ID
• The "magic feature" from the forum
2.
•
3.
•34
•
• ID
35
•
• ID
36
Station 38
•
• Station 38!!
• IDStation 38 NA
37
ID
38
• bitmap( 17017 )
• bitmap
•
•
•
•
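The slides mention a bitmap encoding of each part's path (17,017 distinct patterns). A sketch under the assumption that "path" means the set of stations where the part has any non-missing measurement (function name and station count handling are mine):

```python
def path_bitmap(visited_stations, n_stations=52):
    """Encode the set of visited station ids as an integer bitmap,
    one bit per station. Two parts took the same path through the
    line iff their bitmaps are equal."""
    bits = 0
    for s in visited_stations:
        assert 0 <= s < n_stations, f"station id out of range: {s}"
        bits |= 1 << s
    return bits

print(bin(path_bitmap([0, 2, 51])))
```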
39
40
41
42
43
44
• Stacking
• xgboost
• xgboost
• objective
45
Stacking
• Two-stage stacking
• Stacking of eight xgboost models
• narrow-deep stacking
• deep learning
• Layer
46
xgboost
• base_margin
• dart (Dropouts meet Multiple Additive Regression Trees)
47
base_margin
Pass the previous layer's predictions to xgboost as an initial margin (log-odds):

learn <- xgb.DMatrix(...)
base_margin <- logit(p(y|x))   # previous layer's predicted probabilities, as log-odds
setinfo(learn, 'base_margin', base_margin)
m <- xgb.train(data = learn, ...)
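The margin fed to base_margin must be on the log-odds scale, since with binary:logistic xgboost adds the margin before applying the sigmoid. A pure-Python sketch of the conversion (helper name is mine; in the Python API the margin is attached with DMatrix.set_base_margin):

```python
import math

def to_base_margin(p, eps=1e-6):
    """Convert a predicted probability p(y|x) from the previous layer
    into the log-odds value expected by xgboost's base_margin."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))

# In Python-xgboost the margins would then be attached with:
#   dtrain.set_base_margin([to_base_margin(p) for p in prev_preds])
```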
48
base_margin•
•
dart
49
Dart
• Dropouts meet Multiple Additive Regression Trees1
• dropout
•
• 0.5
1 Rashmi, K. V. & Gilad-Bachrach, R. (2015). DART: Dropouts meet Multiple Additive Regression Trees. AISTATS, PMLR 38.
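In xgboost, DART is enabled through the booster parameter; a minimal parameter sketch (values are illustrative, not the team's actual settings):

```python
# DART parameters for xgboost's train() — rate_drop is the per-iteration
# probability of dropping each existing tree, as described in the paper.
params = {
    "booster": "dart",
    "objective": "binary:logistic",
    "rate_drop": 0.5,   # the slide mentions 0.5
    "skip_drop": 0.0,   # probability of skipping dropout entirely
    "eta": 0.1,
    "max_depth": 6,
}
print(params["booster"])
```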
50
xgboost
• GBDT-feature + Factorization Machines
• GBDT-feature: the index of the leaf each sample reaches in each GBDT tree
• One-hot Encoding → Factorization Machines
• OpenMP libffm
• only libFM
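In Python-xgboost the per-tree leaf indices come from predict(..., pred_leaf=True); converting one sample's indices into a libffm "field:feature:value" line can be sketched as (formatting conventions here are my own assumption — field = tree index, feature = leaf index, value = 1):

```python
def leaves_to_ffm(label, leaves):
    """Turn one sample's per-tree leaf indices into a libffm line.
    Field = tree index, feature = leaf index, value = 1 (one-hot)."""
    feats = " ".join(f"{tree}:{leaf}:1" for tree, leaf in enumerate(leaves))
    return f"{label} {feats}"

print(leaves_to_ffm(1, [3, 7, 0]))  # 1 0:3:1 1:7:1 2:0:1
```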
51
objective
binary:logistic
52
smoothed-MCC
Smooth MCC into a differentiable function and use it as a custom xgboost objective, supplying the gradient and hessian (diagonal only).
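One way to make MCC differentiable, as the smoothing step suggests, is to replace the hard confusion counts with probability-weighted ("soft") counts. A sketch of the smoothed objective value only (the exact smoothing the team used, and the gradient/hessian they fed to xgboost, are not reproduced here):

```python
import math

def soft_mcc(y_true, y_prob):
    """MCC with soft counts: each prediction contributes its probability
    mass to TP/FP and its complement to FN/TN, making the score
    differentiable in the predicted probabilities."""
    tp = sum(p * y for y, p in zip(y_true, y_prob))
    fp = sum(p * (1 - y) for y, p in zip(y_true, y_prob))
    fn = sum((1 - p) * y for y, p in zip(y_true, y_prob))
    tn = sum((1 - p) * (1 - y) for y, p in zip(y_true, y_prob))
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0
```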
53
54
•
•
•
•
55
56
• hskksk: Line 2, tkm: Line 0
57
•
•
• 3 fold 1
• MCC LB Feedback
• tkm g_votte
• LB Feedback
58
Public Private• tkm submit Public
Score Private
•
•
• Publictkm
•
59
• submit
•
•
•
• mcc
•
•
60
kaggle• CV LB CV
• fold
• fold CV
•
61
kaggle•
• Accuracy confusion matrix
• mcc
•
• think more, try less 2
2 kaggle (Owen Zhang)
62
Enjoy Kaggle!
63