Ensemble Learning with Apache Spark MLlib 1.5
What is Ensemble Learning (集成学习)?
● Combines different learners (individual models) to improve the stability and predictive power of the overall model
● Ensemble learning is a typical practice-driven research area: it was first shown to work in practice, and only later did researchers analyze theoretically why it works
● Four main factors make models differ, and combinations of these factors can also produce different models:
○ Different algorithm types
○ Different hypotheses
○ Different modeling techniques
○ Different initial parameters
A pinch of math
● There are 3 (independent) binary classifiers (A, B, C), each with 70% accuracy
● For a majority vote among the 3, there are 4 possible outcomes:
○ All three correct: 0.7 × 0.7 × 0.7 = 0.343
○ Two correct: 0.7 × 0.7 × 0.3 + 0.7 × 0.3 × 0.7 + 0.3 × 0.7 × 0.7 = 0.441
○ Two wrong: 0.3 × 0.3 × 0.7 + 0.3 × 0.7 × 0.3 + 0.7 × 0.3 × 0.3 = 0.189
○ All three wrong: 0.3 × 0.3 × 0.3 = 0.027
● Majority-vote accuracy: 0.343 + 0.441 = 0.784 > 0.7
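The arithmetic above can be checked by brute-force enumeration of all eight correct/wrong patterns. A minimal Python sanity check (not part of the original slides):

```python
from itertools import product

# Three independent classifiers, each correct with probability 0.7.
p = 0.7

# Enumerate all 2^3 correct/wrong patterns and sum the probability of
# those in which at least 2 of the 3 classifiers are correct.
majority_acc = 0.0
for outcome in product([True, False], repeat=3):
    prob = 1.0
    for correct in outcome:
        prob *= p if correct else (1 - p)
    if sum(outcome) >= 2:
        majority_acc += prob

print(round(majority_acc, 3))  # 0.784, better than any single classifier
```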
Model Error
● The error of any model can be decomposed mathematically into three components:
○ Bias error measures how far the predictions are, on average, from the actual values
○ Variance measures how much the predictions for the same observation differ from one another
○ Irreducible error is the noise inherent in the data itself
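Bias and variance can be estimated empirically by training the same model many times on freshly drawn samples. A toy Monte-Carlo sketch, where the "model" is just a sample mean and all names and constants are illustrative, not from the slides:

```python
import random

random.seed(0)

true_y = 2.0  # true target value at one fixed input point

def train_and_predict():
    """Fit a trivial 'model' (mean of a noisy sample of 10) and predict."""
    sample = [true_y + random.gauss(0, 1.0) for _ in range(10)]
    return sum(sample) / len(sample)

# Train many models on independently drawn samples of the same process.
preds = [train_and_predict() for _ in range(2000)]
mean_pred = sum(preds) / len(preds)

bias = mean_pred - true_y  # systematic deviation from the truth
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)  # spread

print(f"bias ~ {bias:.3f}, variance ~ {variance:.3f}")
```

Here the bias comes out near zero (the sample mean is unbiased) while the variance stays positive, which is exactly the component that averaging more data, or bagging, drives down.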
Trade-off management of bias-variance errors
● As model complexity increases, the model eventually overfits and Variance starts to grow
● A good model keeps these two kinds of error in balance
● Ensemble learning is one way to perform this trade-off:
○ How should each algorithm be trained?
○ How should the algorithms be combined?
EL techniques (1): Bagging
● Fits similar learners on small samples of the data, then averages their predictions
● Helps reduce Variance
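A minimal, Spark-free sketch of the idea: the same simple learner is fit on bootstrap resamples and the fitted models are averaged (the slope-only regression is an illustrative stand-in, not MLlib code):

```python
import random

random.seed(42)

# Toy data drawn from y = 3x + noise.
data = [(x, 3 * x + random.gauss(0, 1)) for x in range(1, 21)]

def fit_slope(sample):
    """Least-squares slope through the origin -- a deliberately weak learner."""
    return sum(x * y for x, y in sample) / sum(x * x for x, _ in sample)

# Bagging: fit the same learner on bootstrap resamples, average the models.
slopes = [fit_slope([random.choice(data) for _ in range(len(data))])
          for _ in range(50)]
bagged_slope = sum(slopes) / len(slopes)

print(f"bagged slope estimate: {bagged_slope:.2f}")  # close to the true 3
```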
EL techniques (2): Boosting
● An iterative technique
● Adjusts the weights of the observations based on the previous round of classification: if an observation was misclassified, its weight is increased
● Reduces Bias error, but can sometimes overfit the training data
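The weight update can be sketched in a few lines of AdaBoost-style Python (toy labels and predictions, not the slides' data):

```python
import math

# One round of AdaBoost-style reweighting.
labels      = [1, 1, -1, -1, 1]
predictions = [1, -1, -1, 1, 1]   # output of one hypothetical weak learner
weights     = [1 / len(labels)] * len(labels)

# Weighted error rate of this learner.
err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
alpha = 0.5 * math.log((1 - err) / err)  # the learner's vote strength

# Misclassified points get heavier, correct ones lighter; then renormalize.
weights = [w * math.exp(-alpha * y * p)
           for w, y, p in zip(weights, labels, predictions)]
total = sum(weights)
weights = [w / total for w in weights]

print([round(w, 3) for w in weights])  # misclassified rows now weigh 0.25
```

After renormalization the misclassified observations carry half the total weight, so the next learner is forced to concentrate on them.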
EL techniques (3): Stacking
● Uses one learner to combine the outputs of several different learners
● Can reduce both Bias error and Variance
● Choosing the right ensemble components is as much an art as a pure research problem
https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
https://www.linkedin.com/pulse/ideas-sharing-kaggle-crowdflower-search-results-relevance-mark-peng
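A minimal stand-in for the tier-2 combiner: given held-out outputs of two tier-1 models, a "learner" picks the blending weight that minimizes squared error (all numbers illustrative, not from the slides):

```python
# Tier-1 outputs on held-out data and the true labels.
h1 = [0.9, 0.8, 0.3, 0.2]   # model 1's predicted probabilities
h2 = [0.6, 0.9, 0.4, 0.1]   # model 2's predicted probabilities
y  = [1, 1, 0, 0]

def mse(w):
    """Squared error of the blend w*h1 + (1-w)*h2 against the labels."""
    return sum((w * a + (1 - w) * b - t) ** 2
               for a, b, t in zip(h1, h2, y)) / len(y)

# Tier-2 'learner': grid-search the blending weight on the held-out data.
best_w = min((w / 100 for w in range(101)), key=mse)

print(f"best blend weight for model 1: {best_w:.2f}")
```

The blend can do no worse on this data than either tier-1 model alone, which is the practical appeal of stacking.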
Stacking with Apache MLlib (1)
● Dataset: UCI Covtype (Ch. 4 of Advanced Analytics with Spark)
● Baseline: RandomForest (best of 8 hyper-parameter settings with 3-fold C.V.)
○ precision = 0.956144
○ recall = 0.956144
[Diagram] RF(θ1) fits Training set X, then predicts on Training set Y, producing h1(Y, θ1)
#trees = 32; θ1: #bins = 300, #depth = 30, entropy
Stacking with Apache MLlib (2)
● Using Meta-features
[Diagram] Tier-1: RF(θ1), RF(θ2), RF(θ3) fit Training set X; their predictions over a 3-fold C.V. of X form the meta-features h1(X, θ1), h2(X, θ2), h3(X, θ3) plus the Label. A tier-2 RF(θ1) fits these meta-features, then predicts on h1(Y, θ1), h2(Y, θ2), h3(Y, θ3) obtained from Training set Y.
#trees = 32; θ1: #bins = 300, #depth = 30, entropy; θ2: #bins = 40, #depth = 30, entropy; θ3: #bins = 300, #depth = 30, gini
Results (sorted by precision):

            Baseline   Current
precision   0.956144   0.951056
recall      0.956144   0.951056
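The 3-fold C.V. step matters because the tier-2 model should only see out-of-fold tier-1 predictions. A Spark-free sketch of that bookkeeping, with a trivial mean-label "learner" standing in for RandomForest:

```python
# Out-of-fold meta-features, mirroring the 3-fold C.V. step above.
data = [(i, i % 2) for i in range(12)]   # (feature, label) toy rows
k = 3
folds = [data[i::k] for i in range(k)]   # round-robin 3-fold split

meta = []  # one out-of-fold prediction per original row
for i in range(k):
    train = [row for j in range(k) if j != i for row in folds[j]]
    mean_label = sum(y for _, y in train) / len(train)  # 'fit' the stand-in
    # Predict on the held-out fold with a model that never saw it.
    meta.extend((x, mean_label, y) for x, y in folds[i])

print(len(meta))  # 12: one meta-feature row per input row
```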
Stacking with Apache MLlib (3)
● Using Original features & Meta-features
[Diagram] Same pipeline as in (2), except the tier-2 RF(θ1) is trained on the original features f1...fn concatenated with the meta-features h1(X, θ1), h2(X, θ2), h3(X, θ3) from the 3-fold C.V. of Training set X, then evaluated on Training set Y.
#trees = 32; θ1: #bins = 300, #depth = 30, entropy; θ2: #bins = 40, #depth = 30, entropy; θ3: #bins = 300, #depth = 30, gini

Results (sorted by precision):

            Baseline   Current
precision   0.956144   0.951094
recall      0.956144   0.951094
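The only change from slide (2) to slide (3) is the tier-2 input row: the original features f1...fn are concatenated with the tier-1 meta-features. Schematically, with toy values:

```python
# Each tier-2 input row: original features f1..fn + meta-features h1..h3.
original = [[5.1, 3.5], [4.9, 3.0]]            # f1..fn per row (toy values)
meta     = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3]]  # h1..h3 per row (toy values)

stacked = [f + h for f, h in zip(original, meta)]

print(stacked[0])  # [5.1, 3.5, 0.9, 0.8, 0.7]
```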
Stacking with Apache MLlib (4)
● Retrain the tier-1 models and stack with all features
[Diagram] Tier-1: RF(θ1), RF(θ2), RF(θ3) are retrained on the full Training set X (no C.V. split); their predictions h1(X, θ1), h2(X, θ2), h3(X, θ3) plus the Label, together with the original features f1...fn, train the tier-2 model, which then predicts on Training set Y.
#trees = 32; θ1: #bins = 300, #depth = 30, entropy; θ2: #bins = 40, #depth = 30, entropy; θ3: #bins = 300, #depth = 30, gini

Results (sorted by precision):

            Baseline   Current
precision   0.956144   0.956836
recall      0.956144   0.956836