Ad Click-Through Rate Prediction for KDD Cup 2012 Track 2 Using Hive/Pig

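Makoto Yui [email protected]

Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST)

Twitter ID: @myui

Slides: http://www.slideshare.net/myui/dsirnlp-myuilt http://goo.gl/Ulf3A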

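KDD Cup 2012 Track 2

• The task is to estimate the click-through rate (CTR) of search-engine ads from search logs

– Real data from soso.com, one of the three major search engines in China

• Search terms and similar fields are all converted to numbers, for example by hashing

– Training data (about 10 GB + 2.2 GB, 1.5 billion records)

– Test data (about 1.3 GB, 200 million records)

• The evaluation dataset is about 13.3% of the size of the training data

– The submission format is the predicted CTR

• This is closer to regression than to classification (though it can of course be solved as classification too)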

Table layout of the training data

Training table: UserID, AdID, QueryID, Depth, Position, Impression, Click

AdID properties: DisplayURL, AdvertiserID, KeywordID, TitleID, DescriptionID

User table: UserID, Gender, Age

Query table: QueryID, Tokens

Keyword table: KeywordID, Tokens

Title table: TitleID, Tokens

Description table: DescriptionID, Tokens

The evaluation table holds the same features except impression and click.

Essentially every feature is a categorical variable → decompose each one into binary features.

Click = positive examples

Impression − Click = negative examples

CTR = Click / Impression

Label  A  B
1      1  9
-1     2  7
1      3  8

Label  A:1  A:2  A:3  B:7  B:8  B:9
1      1    0    0    0    0    1
-1     0    1    0    1    0    0
1      0    0    1    0    1    0
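The decomposition of categorical variables into binary features can be sketched in Python. This is only an illustration on the three-row Label/A/B example; the helper name `one_hot` and the `column:value` naming scheme are assumptions made for the sketch, not part of the original pipeline.

```python
def one_hot(rows, columns):
    """Expand categorical columns into binary 'column:value' features."""
    # One binary feature per distinct (column, value) pair, in sorted order.
    names = [f"{c}:{v}" for c in columns for v in sorted({r[c] for r in rows})]
    encoded = []
    for r in rows:
        active = {f"{c}:{r[c]}" for c in columns}
        encoded.append((r["Label"], [1 if n in active else 0 for n in names]))
    return names, encoded


rows = [
    {"Label": 1, "A": 1, "B": 9},
    {"Label": -1, "A": 2, "B": 7},
    {"Label": 1, "A": 3, "B": 8},
]
names, encoded = one_hot(rows, ["A", "B"])
for label, bits in encoded:
    print(label, bits)
```

With tens of millions of distinct values, a dense vector like this is impractical, so a real implementation would more likely keep only the list of active feature names per row, which is what Array<feature> suggests.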

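Predicting event occurrence with logistic regression

• A method for predicting the probability that an event occurs

• Compute how strongly each variable influences the outcome (Train)

– Input: Label, Array<feature>

– Output: per-feature weights as Map<feature, float>

– # of features = 54,686,452

• Note that the token tables are not used (Token ID = <token,..,token>)

• Compute the occurrence probability from those influences (Predict)

– P(X) = Pr(Y=1|x1,x2,..,xn)

– We want to derive a function f: X → Y s.t. the empirical loss is minimized

• Gradient descent is used for this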

$$\operatorname*{argmin}_{w} \; \frac{1}{n} \sum_{i=0}^{n} \mathrm{loss}(f(x_i; w), y_i)$$

where w holds the weight of each feature.
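As a rough illustration of the Predict step and of the empirical loss being minimized, the following Python sketch assumes binary features (so the score is simply the sum of the weights of the active features) and the logistic loss with labels 1 and -1. The function names are hypothetical and the tiny dataset is made up for the example.

```python
import math


def predict(weights, features):
    """Pr(Y=1 | x): sigmoid of the summed weights of the active features."""
    score = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))


def empirical_loss(weights, examples):
    """Mean logistic loss over (label, features) pairs, label in {1, -1}."""
    total = 0.0
    for label, features in examples:
        p = predict(weights, features)
        total -= math.log(p) if label > 0 else math.log(1.0 - p)
    return total / len(examples)


examples = [(1, ["A:1", "B:9"]), (-1, ["A:2", "B:7"]), (1, ["A:3", "B:8"])]
untrained = empirical_loss({}, examples)            # every prediction is 0.5
nudged = empirical_loss({"A:1": 2.0}, examples)     # first example fits better
print(untrained, nudged)
```

Training then amounts to searching for the weight map that makes this loss as small as possible.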

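Gradient Descent

$$w_{t+1} = w_t - \gamma_t \frac{1}{n} \sum_{i=0}^{n} \nabla \mathrm{loss}(f(x_i; w_t), y_i)$$

Here w_{t+1} is the new weight, w_t is the old weight, and γ_t is the learning rate.

The weights are updated from the gradient of the empirical loss.

From Jimmy Lin, "Large-Scale Machine Learning at Twitter": https://speakerdeck.com/u/lintool/p/large-scale-machine-learning-at-twitter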
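The update rule can be written out as a minimal Python sketch of batch gradient descent for logistic regression. It assumes binary features, labels 1 and -1, and a constant learning rate; none of the names or parameter values come from the original slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def batch_gradient(weights, examples):
    """Average gradient of the logistic loss over all examples."""
    grad = {}
    for label, features in examples:
        p = sigmoid(sum(weights.get(f, 0.0) for f in features))
        target = 1.0 if label > 0 else 0.0
        for f in features:                      # binary features: x_f = 1
            grad[f] = grad.get(f, 0.0) + (p - target)
    n = len(examples)
    return {f: g / n for f, g in grad.items()}


def gradient_descent(examples, learning_rate=0.5, iterations=100):
    """Repeat: new weight = old weight - learning rate * average gradient."""
    weights = {}
    for _ in range(iterations):
        for f, g in batch_gradient(weights, examples).items():
            weights[f] = weights.get(f, 0.0) - learning_rate * g
    return weights


examples = [(1, ["A:1", "B:9"]), (-1, ["A:2", "B:7"]), (1, ["A:3", "B:8"])]
weights = gradient_descent(examples)
print(weights)
```

Every iteration touches every training example, which is the cost that motivates the alternatives discussed next.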

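Computing the gradient in parallel

The mappers compute the gradient in parallel, and a single reducer updates the weights.

• In practice, the weight update needs the features (xi) that were updated

• w is a Map<feature, weight> with Map.size() = 54,686,452

• Many iterations are needed, which is a poor fit for MapReduce because all input and output goes through the DFS

• The computation in the reducer becomes the bottleneck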
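The division of labor between mappers and the single reducer can be imitated in plain Python. This sketch only simulates the data flow with ordinary functions (no Hadoop involved); the function names, the two-way split, and the learning rate are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def map_partial_gradient(weights, split):
    """Mapper: sum of per-example gradients over one data split."""
    partial = {}
    for label, features in split:
        p = sigmoid(sum(weights.get(f, 0.0) for f in features))
        target = 1.0 if label > 0 else 0.0
        for f in features:
            partial[f] = partial.get(f, 0.0) + (p - target)
    return partial, len(split)


def reduce_update(weights, partials, learning_rate):
    """Single reducer: add up the partial sums and apply one weight update."""
    total, n = {}, 0
    for partial, count in partials:
        n += count
        for f, g in partial.items():
            total[f] = total.get(f, 0.0) + g
    updated = dict(weights)
    for f, g in total.items():
        updated[f] = updated.get(f, 0.0) - learning_rate * g / n
    return updated


splits = [
    [(1, ["A:1", "B:9"]), (-1, ["A:2", "B:7"])],
    [(1, ["A:3", "B:8"])],
]
weights = {}
for _ in range(50):                      # one MapReduce job per iteration
    partials = [map_partial_gradient(weights, s) for s in splits]
    weights = reduce_update(weights, partials, learning_rate=0.5)
print(weights)
```

Each pass of the loop corresponds to a full job whose output feeds the next one, and the reducer has to hold the sums for every feature, which is where the bottleneck arises.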

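Stochastic Gradient Descent

• Gradient Descent

$$w_{t+1} = w_t - \gamma_t \frac{1}{n} \sum_{i=0}^{n} \nabla \mathrm{loss}(f(x_i; w_t), y_i)$$

– Updating the model requires all training instances (batch learning)

• Stochastic Gradient Descent (SGD)

$$w_{t+1} = w_t - \gamma_t \nabla \mathrm{loss}(f(x; w_t), y)$$

– The weights are updated on each individual training instance (online learning)

– With Iterative Parameter Mixing it works surprisingly well in practice and does not need that many iterations

• Split the data and compute the weights in parallel on each mapper

• Distribute the model parameters at every iteration/epoch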
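For comparison with the batch update, here is a minimal SGD sketch in Python: the weights change after every single training instance. Binary features, labels 1 and -1, a constant learning rate, and the per-epoch shuffle are assumptions of this sketch rather than details taken from the slides.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_train(examples, learning_rate=0.1, epochs=20, weights=None, seed=0):
    """Logistic regression trained by SGD over binary features."""
    weights = dict(weights or {})        # optionally start from an old model
    rng = random.Random(seed)
    order = list(examples)
    for _ in range(epochs):
        rng.shuffle(order)               # visit instances in random order
        for label, features in order:
            p = sigmoid(sum(weights.get(f, 0.0) for f in features))
            step = learning_rate * (p - (1.0 if label > 0 else 0.0))
            for f in features:           # update right away, per instance
                weights[f] = weights.get(f, 0.0) - step
    return weights


examples = [(1, ["A:1", "B:9"]), (-1, ["A:2", "B:7"]), (1, ["A:3", "B:8"])]
weights = sgd_train(examples)
print(weights)
```

The optional `weights` argument is there so that a previous model can be handed back in, which is the hook that iterative parameter mixing relies on.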

A typical machine learning data flow

Training data (Label, array<feature>) → train → Model data (Map<feature, weight>)

Test data (array<feature>) + Model data → predict → Label/Prob

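A typical data flow for parallel training

Training data → map (compute the weights with SGD, in parallel) → Map<feature, weight> → reduce (average the weights) → Model data

When iterating, pass in the old model.

Machine learning is an aggregation problem

Intuitively, it is enough to implement training as a Hive/Pig UDAF (user-defined aggregation function). Strictly speaking, it is a better fit for Dremel, which specializes in parallel aggregation, than for M/R.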


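At first I implemented this straightforwardly as a UDAF that returns a map:

create table model as
select trainLogisticUDAF(features, label [, params]) as weight
from training

The map side fits in memory if the split size is tuned, but at larger scale the reduce side runs out of memory, so this approach does not scale with the data volume.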

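Think relational

Training data → train → Model data

Test data (array<feature>) + Model data → predict → Label/Prob

Returning the model as a single scalar value does not work.

Return (feature, weight) as a relation instead.

A UDAF cannot do that, however → hence a UDTF (user-defined table-generating function).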

UDTF (parameter-mix)

select
  feature,
  CAST(avg(weight) as FLOAT) as weight
from (
  select
    TrainLogisticSgdUDTF(features, label, ..) as (feature, weight)
  from
    train
) t
group by feature;

How can this be turned into iterative parameter mixing?

The old model has to be passed in, but passing the whole model with every row is not attractive…
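The shape of this query (each mapper emits (feature, weight) rows, then the rows are averaged per feature) can be mimicked in Python. The sketch below is an assumption about the general technique only: `train_split` is a hypothetical stand-in for what the UDTF does on one mapper, not the actual UDTF implementation.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_split(split, learning_rate=0.1):
    """Stand-in for the UDTF on one mapper: one SGD pass, then emit rows."""
    weights = {}
    for label, features in split:
        p = sigmoid(sum(weights.get(f, 0.0) for f in features))
        step = learning_rate * (p - (1.0 if label > 0 else 0.0))
        for f in features:
            weights[f] = weights.get(f, 0.0) - step
    return list(weights.items())          # (feature, weight) rows


def mix(rows):
    """Equivalent of: select feature, avg(weight) ... group by feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in rows:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


splits = [
    [(1, ["A:1", "B:9"]), (-1, ["A:2", "B:7"])],
    [(1, ["A:1", "B:8"]), (-1, ["A:2", "B:9"])],
]
rows = [row for split in splits for row in train_split(split)]
model = mix(rows)
print(model)
```

Because the model is a stream of small rows rather than one large map, the averaging is an ordinary group-by that the engine can spread across reducers.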

UDTF (iterative parameter mix)

create table model1sgditor2 as
select
  feature,
  CAST(avg(weight) as FLOAT) as weight
from (
  select
    TrainLogisticIterUDTF(t.features, w.wlist, t.label, ..) as (feature, weight)
  from
    training t join feature_weight w on (t.rowid = w.rowid)
) t
group by feature;

For the inner select, mappers are launched according to Hadoop's InputSplitSize setting (map-only).

What each row needs here is the old model for its own features. It is enough to pass the equivalent of Map<feature, weight> together with the label, so build a table holding the Array<weight> that corresponds to each row's Array<feature> and pass it in through an inner join.
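The idea of handing each row only the old weights of its own features can be sketched in Python. Everything here is an assumption made for illustration: `join_old_weights` stands in for the inner join with the feature-weight table, `train_iter` for the iterative UDTF, and the round-robin split for Hadoop's input splits.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def join_old_weights(training, model):
    """Stand-in for the join: attach to each row the old weights (wlist)
    of exactly the features that appear in that row."""
    return [(features, [model.get(f, 0.0) for f in features], label)
            for label, features in training]


def train_iter(split, learning_rate=0.1):
    """Hypothetical per-mapper trainer: SGD that starts from the old weights."""
    weights = {}
    for features, wlist, label in split:
        for f, w in zip(features, wlist):
            weights.setdefault(f, w)     # first time seen: take the old weight
        p = sigmoid(sum(weights[f] for f in features))
        step = learning_rate * (p - (1.0 if label > 0 else 0.0))
        for f in features:
            weights[f] -= step
    return list(weights.items())         # emitted (feature, weight) rows


def mix(rows):
    """avg(weight) ... group by feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for f, w in rows:
        sums[f] += w
        counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}


def iterative_parameter_mix(training, n_mappers=2, epochs=5):
    model = {}
    for _ in range(epochs):              # one query per epoch
        joined = join_old_weights(training, model)
        splits = [joined[i::n_mappers] for i in range(n_mappers)]
        rows = [row for s in splits for row in train_iter(s)]
        model = mix(rows)
    return model


training = [
    (1, ["A:1", "B:9"]), (-1, ["A:2", "B:7"]),
    (1, ["A:1", "B:8"]), (-1, ["A:2", "B:9"]),
]
model_1 = iterative_parameter_mix(training, epochs=1)
model_5 = iterative_parameter_mix(training, epochs=5)
print(model_1, model_5)
```

Each row carries a weight list as long as its own feature list, so the per-row overhead stays proportional to the row rather than to the size of the whole model.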

An example flow for the Pig version

-- Weak learning: train one model per random group of the sampled data
training_raw = load '$TARGET' as (clicks: int, impression: int, displayid: int, adid: int, advertiserid: int, depth: int, position: int, queryid: int, keywordid: int, titleid: int, descriptionid: int, userid: int, gender: int, age: int);
training_bin = foreach training_raw generate flatten(predictor.ctr.BinSplit(clicks, impression)), displayid, adid, advertiserid, depth, position, queryid, keywordid, titleid, descriptionid, userid, gender, age;
training_smp = sample training_bin 0.1;
training_rnd = foreach training_smp generate (int)(RANDOM() * 100) as dataid, TOTUPLE(*) as training;
training_dat = group training_rnd by dataid;
model = foreach training_dat generate predictor.ctr.TrainLinear(training_rnd.training.training_smp);
store model into '$MODEL';

-- Ensemble learning: predict with several models and combine the results
model = load '$MODEL' as (mdl: map[]);
model_lmt = limit model 10;
testing_raw = load '$TARGET' as (dataid: int, displayid: int, adid: int, advertiserid: int, depth: int, position: int, queryid: int, keywordid: int, titleid: int, descriptionid: int, userid: int, gender: int, age: int);
testing_with_model = cross model_lmt, testing_raw;
result = foreach testing_with_model generate dataid, predictor.ctr.Pred(mdl, displayid, adid, advertiserid, depth, position, queryid, keywordid, titleid, descriptionid, userid, gender, age) as ctr;
result_grp = group result by dataid;
result_ens = foreach result_grp generate group as dataid, predictor.ctr.Ensemble(result.ctr);
result_ens_ord = order result_ens by dataid;
result_fin = foreach result_ens_ord generate $1;
store result_fin into '$RESULT';
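The ensemble step can be illustrated with a short Python sketch. How `predictor.ctr.Pred` and `predictor.ctr.Ensemble` actually work is not shown in the slides, so this sketch assumes a logistic prediction per weak model and a plain average across models; both choices, and all names, are assumptions.

```python
import math


def predict_ctr(model, features):
    """CTR estimate from one weak model: sigmoid of the summed weights."""
    score = sum(model.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))


def ensemble_ctr(models, features):
    """Combine the weak models by averaging their predictions (assumed)."""
    predictions = [predict_ctr(m, features) for m in models]
    return sum(predictions) / len(predictions)


# Three made-up weak models, each trained on a different part of the data.
models = [{"A:1": 1.0}, {"A:1": -1.0}, {"B:9": 0.0}]
ctr = ensemble_ctr(models, ["A:1", "B:9"])
print(ctr)
```

Since each weak model is trained independently on its own group, no model ever has to be passed between tasks during training.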

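Summary

• We built something that properly scales with the data volume

– An intern built the Pig version

• The Pig version does not use a UDTF; its strategy is to build the model as separate files and combine them with ensemble learning

– Doing things like online model updates takes some ingenuity, because Hive has no update and everything has to be expressed as insert

– A passive-aggressive version is also planned

• Currently AUC is about 0.75 (the winner, National Taiwan University, reached 0.8)

– On the a9a dataset, accuracy is on par with libsvm, svm-light, liblinear, tinysvm, and so on (around 0.85)

• If time permits, this will be submitted to Hive as a patch

– But the documentation, tests, and so on are xxxxx

Looking for joint-research partners who have real data (one project with an ad-delivery company is already under way)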
