Ensemble Learning with Apache Spark MLlib 1.5
What is Ensemble Learning (集成学习)?
● Combines different learners (individual models) to improve the stability and predictive power of the overall model
● Ensemble learning is a typical practice-driven research area: it was first shown to work in practice, and only later did researchers analyze theoretically why it works
● Four main factors make models differ, and combinations of these factors can also produce different models:
○ Different algorithm types
○ Different hypotheses
○ Different modeling techniques
○ Different initial parameters
A pinch of math
● There are 3 (independent) binary classifiers (A, B, C), each with 70% accuracy
● For a majority vote among the 3, there are 4 possible outcomes:
○ All three correct: 0.7 × 0.7 × 0.7 = 0.343
○ Two correct: 0.7 × 0.7 × 0.3 + 0.7 × 0.3 × 0.7 + 0.3 × 0.7 × 0.7 = 0.441
○ Two wrong: 0.3 × 0.3 × 0.7 + 0.3 × 0.7 × 0.3 + 0.7 × 0.3 × 0.3 = 0.189
○ All three wrong: 0.3 × 0.3 × 0.3 = 0.027
● Majority-vote accuracy: 0.343 + 0.441 = 0.784 > 0.7
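The arithmetic above can be checked by brute-force enumeration of all eight correct/wrong patterns. A minimal Python sanity check (not part of the original slides):

```python
from itertools import product

# Three independent classifiers, each correct with probability 0.7.
p = 0.7

# Enumerate all 2^3 correct/wrong patterns and sum the probability of
# those in which at least 2 of the 3 classifiers are correct.
majority_acc = 0.0
for outcome in product([True, False], repeat=3):
    prob = 1.0
    for correct in outcome:
        prob *= p if correct else (1 - p)
    if sum(outcome) >= 2:
        majority_acc += prob

print(round(majority_acc, 3))  # 0.784, better than any single classifier
```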
Model Error
● The error of any model can be decomposed mathematically into three components:
○ Bias error measures how far the predictions are, on average, from the actual values
○ Variance measures how much the predictions for the same observation differ from one another
○ Irreducible error is the noise inherent in the data itself
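Bias and variance can be estimated empirically by training the same model many times on freshly drawn samples. A toy Monte-Carlo sketch, where the "model" is just a sample mean and all names and constants are illustrative, not from the slides:

```python
import random

random.seed(0)

true_y = 2.0  # true target value at one fixed input point

def train_and_predict():
    """Fit a trivial 'model' (mean of a noisy sample of 10) and predict."""
    sample = [true_y + random.gauss(0, 1.0) for _ in range(10)]
    return sum(sample) / len(sample)

# Train many models on independently drawn samples of the same process.
preds = [train_and_predict() for _ in range(2000)]
mean_pred = sum(preds) / len(preds)

bias = mean_pred - true_y  # systematic deviation from the truth
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)  # spread

print(f"bias ~ {bias:.3f}, variance ~ {variance:.3f}")
```

Here the bias comes out near zero (the sample mean is unbiased) while the variance stays positive, which is exactly the component that averaging more data, or bagging, drives down.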
Trade-off management of bias-variance errors
● As model complexity increases, the model eventually overfits and Variance starts to grow
● A good model keeps these two kinds of error in balance
● Ensemble learning is one way to perform this trade-off:
○ How should each algorithm be trained?
○ How should the algorithms be combined?
EL techniques (1): Bagging
● Fits similar learners on small samples of the data, then averages their predictions
● Helps reduce Variance
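A minimal, Spark-free sketch of the idea: the same simple learner is fit on bootstrap resamples and the fitted models are averaged (the slope-only regression is an illustrative stand-in, not MLlib code):

```python
import random

random.seed(42)

# Toy data drawn from y = 3x + noise.
data = [(x, 3 * x + random.gauss(0, 1)) for x in range(1, 21)]

def fit_slope(sample):
    """Least-squares slope through the origin -- a deliberately weak learner."""
    return sum(x * y for x, y in sample) / sum(x * x for x, _ in sample)

# Bagging: fit the same learner on bootstrap resamples, average the models.
slopes = [fit_slope([random.choice(data) for _ in range(len(data))])
          for _ in range(50)]
bagged_slope = sum(slopes) / len(slopes)

print(f"bagged slope estimate: {bagged_slope:.2f}")  # close to the true 3
```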
EL techniques (2): Boosting
● An iterative technique
● Adjusts the weights of the observations based on the previous round of classification: if an observation was misclassified, its weight is increased
● Reduces Bias error, but can sometimes overfit the training data
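The weight update can be sketched in a few lines of AdaBoost-style Python (toy labels and predictions, not the slides' data):

```python
import math

# One round of AdaBoost-style reweighting.
labels      = [1, 1, -1, -1, 1]
predictions = [1, -1, -1, 1, 1]   # output of one hypothetical weak learner
weights     = [1 / len(labels)] * len(labels)

# Weighted error rate of this learner.
err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
alpha = 0.5 * math.log((1 - err) / err)  # the learner's vote strength

# Misclassified points get heavier, correct ones lighter; then renormalize.
weights = [w * math.exp(-alpha * y * p)
           for w, y, p in zip(weights, labels, predictions)]
total = sum(weights)
weights = [w / total for w in weights]

print([round(w, 3) for w in weights])  # misclassified rows now weigh 0.25
```

After renormalization the misclassified observations carry half the total weight, so the next learner is forced to concentrate on them.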
EL techniques (3): Stacking
● Uses one learner to combine the outputs of several different learners
● Can reduce both Bias error and Variance
● Choosing the right ensemble components is as much an art as a pure research problem
https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
https://www.linkedin.com/pulse/ideas-sharing-kaggle-crowdflower-search-results-relevance-mark-peng
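A minimal stand-in for the tier-2 combiner: given held-out outputs of two tier-1 models, a "learner" picks the blending weight that minimizes squared error (all numbers illustrative, not from the slides):

```python
# Tier-1 outputs on held-out data and the true labels.
h1 = [0.9, 0.8, 0.3, 0.2]   # model 1's predicted probabilities
h2 = [0.6, 0.9, 0.4, 0.1]   # model 2's predicted probabilities
y  = [1, 1, 0, 0]

def mse(w):
    """Squared error of the blend w*h1 + (1-w)*h2 against the labels."""
    return sum((w * a + (1 - w) * b - t) ** 2
               for a, b, t in zip(h1, h2, y)) / len(y)

# Tier-2 'learner': grid-search the blending weight on the held-out data.
best_w = min((w / 100 for w in range(101)), key=mse)

print(f"best blend weight for model 1: {best_w:.2f}")
```

The blend can do no worse on this data than either tier-1 model alone, which is the practical appeal of stacking.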
Stacking with Apache MLlib (1)
● Dataset: UCI Covtype (Ch. 4 of Advanced Analytics with Spark)
● Baseline: RandomForest (best of 8 hyper-parameter settings with 3-fold C.V.)
○ precision = 0.956144
○ recall = 0.956144
[Diagram] RF(θ1) fits Training set X, then predicts on Training set Y, producing h1(Y, θ1)
#trees = 32; θ1: #bins = 300, #depth = 30, entropy
Stacking with Apache MLlib (2)
● Using Meta-features
[Diagram] Tier-1: RF(θ1), RF(θ2), RF(θ3) fit Training set X; their predictions over a 3-fold C.V. of X form the meta-features h1(X, θ1), h2(X, θ2), h3(X, θ3) plus the Label. A tier-2 RF(θ1) fits these meta-features, then predicts on h1(Y, θ1), h2(Y, θ2), h3(Y, θ3) obtained from Training set Y.
#trees = 32; θ1: #bins = 300, #depth = 30, entropy; θ2: #bins = 40, #depth = 30, entropy; θ3: #bins = 300, #depth = 30, gini
Results (sorted by precision):

            Baseline   Current
precision   0.956144   0.951056
recall      0.956144   0.951056
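The 3-fold C.V. step matters because the tier-2 model should only see out-of-fold tier-1 predictions. A Spark-free sketch of that bookkeeping, with a trivial mean-label "learner" standing in for RandomForest:

```python
# Out-of-fold meta-features, mirroring the 3-fold C.V. step above.
data = [(i, i % 2) for i in range(12)]   # (feature, label) toy rows
k = 3
folds = [data[i::k] for i in range(k)]   # round-robin 3-fold split

meta = []  # one out-of-fold prediction per original row
for i in range(k):
    train = [row for j in range(k) if j != i for row in folds[j]]
    mean_label = sum(y for _, y in train) / len(train)  # 'fit' the stand-in
    # Predict on the held-out fold with a model that never saw it.
    meta.extend((x, mean_label, y) for x, y in folds[i])

print(len(meta))  # 12: one meta-feature row per input row
```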
Stacking with Apache MLlib (3)
● Using Original features & Meta-features
[Diagram] Same pipeline as in (2), except the tier-2 RF(θ1) is trained on the original features f1...fn concatenated with the meta-features h1(X, θ1), h2(X, θ2), h3(X, θ3) from the 3-fold C.V. of Training set X, then evaluated on Training set Y.
#trees = 32; θ1: #bins = 300, #depth = 30, entropy; θ2: #bins = 40, #depth = 30, entropy; θ3: #bins = 300, #depth = 30, gini

Results (sorted by precision):

            Baseline   Current
precision   0.956144   0.951094
recall      0.956144   0.951094
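The only change from slide (2) to slide (3) is the tier-2 input row: the original features f1...fn are concatenated with the tier-1 meta-features. Schematically, with toy values:

```python
# Each tier-2 input row: original features f1..fn + meta-features h1..h3.
original = [[5.1, 3.5], [4.9, 3.0]]            # f1..fn per row (toy values)
meta     = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3]]  # h1..h3 per row (toy values)

stacked = [f + h for f, h in zip(original, meta)]

print(stacked[0])  # [5.1, 3.5, 0.9, 0.8, 0.7]
```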
Stacking with Apache MLlib (4)
● Retrain the tier-1 models and stack with all features
[Diagram] Tier-1: RF(θ1), RF(θ2), RF(θ3) are retrained on the full Training set X (no C.V. split); their predictions h1(X, θ1), h2(X, θ2), h3(X, θ3) plus the Label, together with the original features f1...fn, train the tier-2 model, which then predicts on Training set Y.
#trees = 32; θ1: #bins = 300, #depth = 30, entropy; θ2: #bins = 40, #depth = 30, entropy; θ3: #bins = 300, #depth = 30, gini

Results (sorted by precision):

            Baseline   Current
precision   0.956144   0.956836
recall      0.956144   0.956836