Supervised Learning: Mining the Web, Chapter 5

SUPERVISED LEARNING
Mining the Web - Chapter 5
Dao Vinh Ninh

2005/5/30

Mining the Web, Chakrabarti & Ramakrishnan

Contents of the presentation
Bayesian Learners
  Naïve Bayes Learners
  Bayesian Networks
Maximum Entropy Learner
Discriminative Classification
  Linear Least-Square Regression
  Support Vector Machine


The Supervised Learning Scenario

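The training documents are reference documents that have already been classified by topic. The characteristics of each topic are learned from the training documents, and new documents are then classified on the basis of those characteristics.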


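Bayesian Learners: Overview
Probabilistic model
  Document generation is treated as a random process
  The most practical approach to text classification
Assumptions
  Each document belongs to exactly one topic
  Topic c is selected with probability Pr(c), the prior probability
  Token t occurs in topic c with probability Pr(t|c)
  Document d is generated in topic c with probability Pr(d|c)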


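Bayesian Learners: Parameter Estimation
Estimating the parameter set Θ
  Known quantity: the training document set D is examined to estimate the parameter values
$$\Pr(c \mid d) = \sum_{\Theta} \Pr(c \mid d, \Theta)\,\Pr(\Theta \mid D) = \sum_{\Theta} \frac{\Pr(c \mid \Theta)\,\Pr(d \mid c, \Theta)}{\sum_{\gamma} \Pr(\gamma \mid \Theta)\,\Pr(d \mid \gamma, \Theta)}\,\Pr(\Theta \mid D)$$
  The expression above cannot be computed in practice
Maximum Likelihood Estimate (MLE)
  The sum (integral) over Θ is replaced by the single, easily computed value $\arg\max_{\Theta} \Pr(D \mid \Theta)$
  The resulting classification accuracy is poor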
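As a rough illustration of how Pr(c|d) is obtained once a single parameter estimate has been fixed (for example the MLE), the sketch below applies Bayes' rule to log Pr(c) and log Pr(d|c). The function name `posterior`, the topic names, and all numbers are hypothetical; working in log space is a choice made here for numerical stability, not something the slides prescribe.

```python
import math

def posterior(log_prior, log_likelihood):
    """Return Pr(c|d) for every topic c, given log Pr(c) and log Pr(d|c)."""
    scores = {c: log_prior[c] + log_likelihood[c] for c in log_prior}
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(scores.values())
    normaliser = sum(math.exp(s - top) for s in scores.values())
    return {c: math.exp(s - top) / normaliser for c, s in scores.items()}

# Hypothetical two-topic example with equal priors.
log_prior = {"sports": math.log(0.5), "politics": math.log(0.5)}
log_likelihood = {"sports": math.log(0.03), "politics": math.log(0.01)}
result = posterior(log_prior, log_likelihood)
```

The denominator of Bayes' rule is the same for every topic, so only the relative sizes of Pr(c) Pr(d|c) decide the predicted topic.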


Bayesian Learners: Naïve Bayes Learners
Concept
  Simple, fast, and easy to update
Model assumption
  Token occurrences are independent of one another
Models applied
  Binary Model
  Multinomial Model


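Naïve Bayes Learners
Binary Model
  $\varphi_{c,t}$ is the probability that token t appears at least once in a document of topic c. W is the set of all tokens.
$$\Pr(d \mid c) = \prod_{t \in d} \varphi_{c,t} \prod_{t \in W,\, t \notin d} (1 - \varphi_{c,t}) = \prod_{t \in d} \frac{\varphi_{c,t}}{1 - \varphi_{c,t}} \prod_{t \in W} (1 - \varphi_{c,t})$$
Multinomial Model
  $\theta_{c,t}$ is the probability that token t is produced at each position of a document of topic c. Let L be the document length, with observed value $\ell_d$, and let n(d,t) be the number of occurrences of t in d.
$$\Pr(d \mid c) = \Pr(L = \ell_d \mid c)\,\Pr(d \mid \ell_d, c) = \Pr(L = \ell_d \mid c) \binom{\ell_d}{\{n(d,t)\}} \prod_{t \in d} \theta_{c,t}^{\,n(d,t)}$$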
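The two document models can be compared with a small sketch. It assumes the binary model takes a product over the whole vocabulary (a factor for each present token and one minus that factor for each absent token), and that the multinomial model uses a multinomial coefficient times the per-token probabilities raised to their counts. The function names, the toy vocabulary, and the parameter values are hypothetical, and the length factor Pr(L = l_d | c) is left out on the assumption that it is the same for every topic.

```python
import math
from collections import Counter

def binary_log_likelihood(doc_tokens, phi, vocabulary):
    """log Pr(d|c) under the binary model; phi[t] = Pr(t occurs in a doc of c)."""
    present = set(doc_tokens)
    total = 0.0
    for t in vocabulary:
        total += math.log(phi[t]) if t in present else math.log(1.0 - phi[t])
    return total

def multinomial_log_likelihood(doc_tokens, theta):
    """log of (multinomial coefficient * prod theta[t]^n(d,t)); length term omitted."""
    counts = Counter(doc_tokens)
    length = sum(counts.values())
    # log of l_d! / prod n(d,t)!, computed with lgamma to avoid huge factorials
    log_coeff = math.lgamma(length + 1) - sum(math.lgamma(n + 1) for n in counts.values())
    return log_coeff + sum(n * math.log(theta[t]) for t, n in counts.items())

# Hypothetical parameters for one topic.
vocabulary = ["ball", "goal", "vote"]
phi = {"ball": 0.8, "goal": 0.5, "vote": 0.1}
theta = {"ball": 0.6, "goal": 0.3, "vote": 0.1}
doc = ["ball", "ball", "goal"]

binary_ll = binary_log_likelihood(doc, phi, vocabulary)
multinomial_ll = multinomial_log_likelihood(doc, theta)
```

Note that the binary model ignores the second occurrence of "ball", while the multinomial model counts it.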


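Naïve Bayes Learners: Problems
Multiplying many small parameters makes the generation probability extremely small
  ⇒ Solution: take logarithms
Parameters
  A token that never appears in the training documents receives an occurrence probability of 0 under MLE
  Any document that contains such a token is then assigned a probability of 0.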
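The underflow problem and the logarithm fix can be seen in a few lines. This is only a sketch: the document length, the per-token probabilities, and the helper name `log_score` are made up for illustration.

```python
import math

def log_score(prior, token_probs):
    """log Pr(c) plus the sum of log Pr(t|c) over the tokens of one document."""
    return math.log(prior) + sum(math.log(p) for p in token_probs)

# Hypothetical document of 100 tokens, each with probability 1e-5 under
# topic A and 2e-5 under topic B.
probs_a = [1e-5] * 100
probs_b = [2e-5] * 100

naive_product = 1.0
for p in probs_a:
    naive_product *= p  # 1e-500 is not representable as a double

score_a = log_score(0.5, probs_a)
score_b = log_score(0.5, probs_b)
```

The direct product collapses to zero for both topics, whereas the log scores remain finite and can still be compared.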


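Naïve Bayes Learners: Parameter Smoothing
Binary Model
$$\tilde{\varphi}_{c,t} = \frac{k + \lambda}{n + 2\lambda}$$
  where k of the n training documents of topic c contain token t
  Laplace's law of succession: λ = 1
  Lidstone's law of succession: λ chosen heuristically
Multinomial Model
$$\tilde{\theta}_{c,t} = \frac{1 + \sum_{d \in D_c} n(d,t)}{|W| + \sum_{d \in D_c} \sum_{\tau \in d} n(d,\tau)}$$
  |W|: the number of distinct tokens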
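A minimal sketch of both smoothed estimators follows. It assumes the binary estimate has the form (k + λ) / (n + 2λ), where k of the n training documents of a topic contain the token, and that the multinomial estimate adds one to every token count and |W| to the total. The function names and the toy data are hypothetical.

```python
from collections import Counter

def smoothed_binary(k, n, lam=1.0):
    """(k + lam) / (n + 2*lam); lam = 1 corresponds to Laplace's law."""
    return (k + lam) / (n + 2.0 * lam)

def smoothed_multinomial(class_docs, vocabulary):
    """Add-one estimate of theta[t] for one topic, over a fixed vocabulary."""
    counts = Counter(t for doc in class_docs for t in doc)
    total = sum(counts[t] for t in vocabulary)
    return {t: (1 + counts[t]) / (len(vocabulary) + total) for t in vocabulary}

# Hypothetical training documents of a single topic.
vocabulary = ["ball", "goal", "vote"]
class_docs = [["ball", "goal"], ["ball"]]
theta = smoothed_multinomial(class_docs, vocabulary)
unseen_binary = smoothed_binary(k=0, n=2)
```

The token "vote" never occurs in the toy training data, yet both estimators give it a non-zero probability, so a document containing it is no longer assigned probability 0.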


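Naïve Bayes Learners: Evaluation
The Multinomial Model is more accurate than the Binary Model
The k-NN Model is more accurate than the Naïve Bayes Model
  However, the Naïve Bayes Model is simpler and faster than the k-NN Model
Under the Naïve Bayes Model, there is a zone between topics in which the generation probabilities are equal.
Relationships between tokens are ignored.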


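Bayesian Learners: Small-Degree Bayesian Networks
Relationships between tokens are added to the model
  The occurrence probability of each token is influenced by the topic and by the occurrence of other tokens.
Bayesian Network
  Each topic and each token is a node of a graph
  Dependencies are represented by edges
  Every token is always connected directly to one topic node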


Bayesian networks. For the naive Bayes assumption, the only edges are from the class variable to individual terms. Towards better approximations to the joint distribution over terms: the probability of a term occurring may now depend on observations about other terms as well as the class variable.


Small-Degree Bayesian Networks

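The tokens on which a token x depends are called its parent tokens, Pa(x), and the number of parents is limited to at most k. Once the values of the parent tokens are fixed, the occurrence probability of the token is determined.

The generation probability of a document is computed from these conditional probabilities:
$$\Pr(d \mid c) = \prod_{x} \Pr(x \mid \mathrm{pa}(X))$$

At present only the Binary Model is used.
Computation time is of quadratic order.
Results improve, but for text documents the approach is not there yet.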
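The product of conditional probabilities can be sketched for the binary case as follows. This assumes each term has a small tuple of parent terms plus the topic, and that a conditional probability table gives the chance of the term being present for each combination of parent values. The table layout, the function name, and the numbers are all hypothetical.

```python
import math

def bn_log_likelihood(doc_terms, parents, cpt, topic):
    """log Pr(d|c) as a sum of log Pr(x | pa(X), c) over all terms (binary model)."""
    total = 0.0
    for term, parent_terms in parents.items():
        # Observed presence or absence of each parent term in this document.
        parent_values = tuple(p in doc_terms for p in parent_terms)
        p_present = cpt[(topic, term)][parent_values]
        total += math.log(p_present if term in doc_terms else 1.0 - p_present)
    return total

# Hypothetical network: "goal" depends on "ball"; "ball" depends only on the topic.
parents = {"ball": (), "goal": ("ball",)}
cpt = {
    ("sports", "ball"): {(): 0.8},
    ("sports", "goal"): {(True,): 0.7, (False,): 0.1},
}

both_terms = bn_log_likelihood({"ball", "goal"}, parents, cpt, "sports")
goal_only = bn_log_likelihood({"goal"}, parents, cpt, "sports")
```

With an empty parent tuple for every term, this reduces to the naive Bayes binary model.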


Maximum Entropy Learners

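Problems addressed
  In Bayesian learners, the vector space of the training set has fewer dimensions than that of the test documents
  New features cannot be added
Assumptions
  Each document belongs to one topic.
  A training data set $\{(d_i, c_i),\ i = 1, \ldots, n\}$ is given.
  The relationship between a document d and a topic c is defined by indicator functions $f_j(d, c)$. Example:
$$f_{c',t}(d,c) = \begin{cases} 1 & \text{if } c = c' \text{ and } t \in d \\ 0 & \text{otherwise} \end{cases}$$
or
$$f_{c',t}(d,c) = \begin{cases} 0 & \text{if } c \neq c' \\ \dfrac{n(d,t)}{\sum_{\tau} n(d,\tau)} & \text{otherwise} \end{cases}$$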
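The two example indicator functions can be written as small closures. This is a sketch assuming a document is a list of tokens; the factory names `make_binary_feature` and `make_frequency_feature` and the toy document are hypothetical.

```python
from collections import Counter

def make_binary_feature(c_prime, t):
    """f(d, c) = 1 if c == c_prime and t occurs in d, else 0."""
    def feature(doc_tokens, c):
        return 1.0 if c == c_prime and t in doc_tokens else 0.0
    return feature

def make_frequency_feature(c_prime, t):
    """f(d, c) = n(d,t) / sum of n(d,tau) if c == c_prime, else 0."""
    def feature(doc_tokens, c):
        if c != c_prime or not doc_tokens:
            return 0.0
        return Counter(doc_tokens)[t] / len(doc_tokens)
    return feature

doc = ["ball", "ball", "goal"]
f_bin = make_binary_feature("sports", "ball")
f_freq = make_frequency_feature("sports", "ball")
```

Here `f_bin(doc, "sports")` is 1, `f_freq(doc, "sports")` is 2/3, and both are 0 for any other topic.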



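Expected value of the indicator function $f_j(d,c)$
$$E(f_j) = \sum_{d,c} \Pr(d,c)\, f_j(d,c) = \sum_{d} \Pr(d) \sum_{c} \Pr(c \mid d)\, f_j(d,c)$$

The values of Pr(d,c) and Pr(d) are estimated empirically from the training data:
$$\sum_{i} \tilde{\Pr}(d_i, c_i)\, f_j(d_i, c_i) = \sum_{i} \tilde{\Pr}(d_i) \sum_{c} \Pr(c \mid d_i)\, f_j(d_i, c)$$

With each training pair given empirical probability 1/n:
$$\frac{1}{n} \sum_{i} f_j(d_i, c_i) = \frac{1}{n} \sum_{i} \sum_{c} \Pr(c \mid d_i)\, f_j(d_i, c)$$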
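The constraint equates two averages over the n training pairs: the empirical average of a feature, and the average the model predicts through Pr(c|d). The sketch below computes both sides, assuming a uniform posterior as a deliberately poor stand-in model. The function names, the training pairs, and the feature are hypothetical.

```python
def empirical_expectation(feature, training):
    """(1/n) * sum over i of f(d_i, c_i)."""
    return sum(feature(d, c) for d, c in training) / len(training)

def model_expectation(feature, training, topics, posterior):
    """(1/n) * sum over i and c of Pr(c | d_i) * f(d_i, c)."""
    total = 0.0
    for d, _ in training:
        for c in topics:
            total += posterior(c, d) * feature(d, c)
    return total / len(training)

# Hypothetical data, feature, and stand-in model.
topics = ["sports", "politics"]
training = [(["ball"], "sports"), (["vote"], "politics")]

def feature(doc_tokens, c):
    return 1.0 if c == "sports" and "ball" in doc_tokens else 0.0

def uniform_posterior(c, doc_tokens):
    return 1.0 / len(topics)

lhs = empirical_expectation(feature, training)
rhs = model_expectation(feature, training, topics, uniform_posterior)
```

The two sides differ (0.5 against 0.25), so the uniform model violates the constraint; training a maximum entropy learner amounts to adjusting the model until the two sides agree for every feature.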



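Evaluation
  Classification results are better than those of Bayesian learners, but not stable
  Complex
  Although it uses the same features as Naïve Bayes learners, it does not depend on the independence of the features.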


Discriminative Classification

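Problems addressed
  With the Naïve Bayes and Maximum Entropy learners, the regions of different topics share a common (overlapping) space.
  Naïve Bayes:
$$\log \Pr(c \mid d) \sim \log \Pr(c) + \sum_{t \in d} n(d,t) \log \theta_{c,t}$$
  Maximum Entropy:
$$\log \Pr(c \mid d) = -\log Z(d) + \sum_{t \in d} f_{c,t}(d,c) \log \mu_{c,t}$$
Goals
  Project the features onto the topic space
  Eliminate the common space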


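Discriminative Classification: Linear Least-Square Regression
Each topic is encoded as a number.
A function that classifies documents is defined first.
  A document d is projected onto the topic space by the function $\alpha \cdot d + b$.
  The parameters are adjusted so that the error over the training documents is minimized:
$$\text{Minimize } \varepsilon = \sum_{i} (\alpha \cdot d_i + b - c_i)^2$$
Widrow-Hoff update rule (η is the learning rate; b is absorbed into α through a constant feature)
$$\alpha^{(i)} = \alpha^{(i-1)} - 2\eta\,\bigl(\alpha^{(i-1)} \cdot d_i - c_i\bigr)\, d_i$$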
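A minimal sketch of the Widrow-Hoff (least mean squares) update follows. It assumes the gradient-descent form α ← α − 2η(α·d_i − c_i)d_i, with the offset b handled by appending a constant 1 to every document vector. The function names, learning rate, epoch count, and toy data are hypothetical choices.

```python
def widrow_hoff(docs, labels, eta=0.1, epochs=50):
    """Fit alpha (last component acts as b) by repeated Widrow-Hoff updates."""
    alpha = [0.0] * (len(docs[0]) + 1)
    for _ in range(epochs):
        for d, c in zip(docs, labels):
            x = list(d) + [1.0]  # constant feature standing in for b
            error = sum(a * xi for a, xi in zip(alpha, x)) - c
            # Step against the gradient of the squared error for this document.
            alpha = [a - 2.0 * eta * error * xi for a, xi in zip(alpha, x)]
    return alpha

def predict(alpha, d):
    """Scalar score alpha . d + b; its sign gives the predicted topic."""
    return sum(a * xi for a, xi in zip(alpha, list(d) + [1.0]))

# Two hypothetical documents with topics encoded as +1 and -1.
docs = [[1.0, 0.0], [0.0, 1.0]]
labels = [1.0, -1.0]
alpha = widrow_hoff(docs, labels)
```

The learning rate has to be small relative to the document norms; too large a value makes the updates diverge instead of settling.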


Linear Least-Square Regression

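Interpretation
  The discriminant equation can be regarded as a surface that separates the documents. This surface is called a hyperplane.
  Equivalently, each document is projected onto the vector normal to that surface and classified by the resulting scalar value.
Evaluation
  Results are comparable to those of the k-NN method and better than those of the Naive Bayes method.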


Linear Least-Square Regression Hyperplane


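Discriminative Classification: Support Vector Machines (SVM)
Intuition
  The hyperplane should lie far from regions where training documents are dense
  The hyperplane should not pass through training documents
Assumptions
  Training and test documents are drawn from the same distribution.
  The topic space consists of two topics, encoded as $c_i \in \{-1, 1\}$.
  The direction vector of the hyperplane is defined by the closest points of the two topics' document regions.
  The hyperplane passes through the midpoint between the closest points of the two regions.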


Support Vector Machines (SVM)



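The function that classifies documents is redefined as $\alpha_{\mathrm{SVM}} \cdot d + b$ (α for short below).

First, α and b must satisfy the constraints
$$c_i(\alpha \cdot d_i + b) \ge 1, \quad i = 1, \ldots, n$$

By this assumption, for the closest documents $d_1$ and $d_2$ on opposite sides of the hyperplane, the distance between them along the normal is
$$\alpha \cdot (d_1 - d_2) = 2 \quad\Longrightarrow\quad \frac{\alpha}{\|\alpha\|} \cdot (d_1 - d_2) = \frac{2}{\|\alpha\|}$$

Therefore the hyperplane is chosen so that ||α|| is minimized.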
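The margin constraint and the resulting margin width 2/||α|| can be checked numerically. The sketch assumes topics encoded as +1 and -1 and a candidate hyperplane given by a vector `alpha` and offset `b`; the helper names and the toy documents are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def satisfies_margin(alpha, b, docs, labels):
    """True if c_i * (alpha . d_i + b) >= 1 holds for every training document."""
    return all(c * (dot(alpha, d) + b) >= 1.0 for d, c in zip(docs, labels))

def margin_width(alpha):
    """Distance between the two margin hyperplanes: 2 / ||alpha||."""
    return 2.0 / math.sqrt(dot(alpha, alpha))

# Hypothetical linearly separable documents.
docs = [[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-3.0, 2.0]]
labels = [1, 1, -1, -1]

ok = satisfies_margin([1.0, 0.0], 0.0, docs, labels)
too_short = satisfies_margin([0.5, 0.0], 0.0, docs, labels)
width = margin_width([1.0, 0.0])
```

A shorter α would give a wider margin, but `[0.5, 0.0]` violates the constraints for the documents nearest the hyperplane, which is why ||α|| is minimized subject to the constraints rather than freely.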


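Support Vector Machines (SVM)
In practice, the training documents are not always perfectly separable.
  Slack variables $\xi_i$ (fudge variables) are introduced:
$$\text{Minimize } \tfrac{1}{2}\,\alpha \cdot \alpha + C \sum_{i} \xi_i$$
$$\text{subject to } c_i(\alpha \cdot d_i + b) \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0, \quad i = 1, \ldots, n$$
Equivalent problem (Lagrangian dual optimization):
$$\text{Maximize } \sum_{i} \lambda_i - \tfrac{1}{2} \sum_{i,j} \lambda_i \lambda_j c_i c_j\, (d_i \cdot d_j)$$
$$\text{subject to } \sum_{i} c_i \lambda_i = 0 \ \text{ and } \ 0 \le \lambda_i \le C, \quad i = 1, \ldots, n$$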
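For a fixed hyperplane, the smallest slack that satisfies the soft-margin constraints is max(0, 1 − c_i(α·d_i + b)), so the primal objective can be evaluated directly. The sketch below does this for a candidate `alpha` and `b`; it evaluates the objective only and does not solve the optimization. The helper names, the value of C, and the toy documents are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slack(alpha, b, d, c):
    """Smallest xi >= 0 with c * (alpha . d + b) >= 1 - xi."""
    return max(0.0, 1.0 - c * (dot(alpha, d) + b))

def primal_objective(alpha, b, docs, labels, C):
    """0.5 * alpha . alpha + C * (sum of slacks) for a fixed hyperplane."""
    total_slack = sum(slack(alpha, b, d, c) for d, c in zip(docs, labels))
    return 0.5 * dot(alpha, alpha) + C * total_slack

# Hypothetical documents; the second one falls inside the margin.
docs = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
labels = [1, 1, -1]
alpha, b = [1.0, 0.0], 0.0

slacks = [slack(alpha, b, d, c) for d, c in zip(docs, labels)]
objective = primal_objective(alpha, b, docs, labels, C=10.0)
```

A larger C penalizes margin violations more heavily, trading a narrower margin for fewer training errors.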



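Computing the optimum
  The computational cost is of quadratic order
  A few λ are refined at a time (working set)
  Training time is proportional to $n^a$, with a ≈ 1.7 to 2.1
  Recently it has become computable in linear time
Results
  SVM is the most accurate classification method compared with the others
Research topics
  Non-linear SVM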



SVM training time variation as the training set size is increased, with and without sufficient memory to hold the training set. In the latter case, the memory is set to about a quarter of that needed by the training set.



Comparison of LSVM with previous classifiers on the Reuters data set (data taken from Dumais). (The naive Bayes classifier used binary features, so its accuracy can be improved.)



Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB.



Comparison between several classifiers using the Reuters collection.
