K236: Basis of Data Analytics
Lecture 7: Classification and prediction (decision tree induction)
Lecturers: Tu Bao Ho and Hieu Chi Dam
TAs: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai
Schedule of K236
1. Introduction to data science (1) 6/9
2. Introduction to data science (2) 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and examination (the date is not fixed) 7/27
Data schemas vs. mining methods

Types of data
- Flat data tables
- Relational databases
- Temporal & spatial data
- Transactional databases
- Multimedia data
- Genome databases
- Materials science data
- Textual data
- Web data
- etc.

Mining tasks and methods
- Classification/prediction
  - Decision trees
  - Bayesian classification
  - Neural networks
  - Rule induction
  - Support vector machines
  - Hidden Markov models
  - etc.
- Description
  - Association analysis
  - Clustering
  - Summarization
  - etc.
Outline
1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues
Classification and prediction
[Figure: eight cells, H1-H4 (healthy) and C1-C4 (cancerous), shown once without labels (unsupervised data) and once with labels (supervised data).]

object  color  #nuclei  #tails  class
H1      light  1        1       healthy
H2      dark   1        1       healthy
H3      light  1        2       healthy
H4      light  2        1       healthy
C1      dark   1        2       cancerous
C2      dark   2        1       cancerous
C3      light  2        2       cancerous
C4      dark   2        2       cancerous
Given: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
- $x_i$ is the description of an object, phenomenon, etc.
- $y_i$ (the label attribute) is some property of $x_i$; if it is not available, learning is unsupervised.

Find: a function $f(x)$ that characterizes $\{x_i\}$ or such that $f(x_i) = y_i$.

The problem is usually called classification if the label is categorical, and prediction if the label is continuous (in this case, if the descriptive attributes are numerical, the problem is regression).
Classification: a two-step process

- Model construction: describing a set of predetermined classes.
  - Each tuple/object is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae (classifiers).
- Model usage: classifying future or unknown objects, and estimating the accuracy of the model.
  - The known label of each test object is compared with the model's classification result.
  - The accuracy rate is the percentage of test-set objects correctly classified by the model.
  - The test set must be independent of the training set; otherwise over-fitting will occur.
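As an illustration, here is a minimal sketch of the two steps on the eight-cell data, using scikit-learn (an assumed tool choice; the slides themselves are tool-agnostic):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy cell data: [color (0 = light, 1 = dark), #nuclei, #tails]
X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],   # H1-H4
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]   # C1-C4
y = ["healthy"] * 4 + ["cancerous"] * 4

# Step 1, model construction: learn a classifier from the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2, model usage: classify an independent test set and estimate accuracy.
print(accuracy_score(y_test, model.predict(X_test)))
```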
[Figure: the two-step process on the cell data. Model construction: the training data {H1, H2, H3, H4, C1, C2} is fed to a classification algorithm, which outputs a classifier (model) such as "If color = dark and #tails = 2 then cancerous cell". Model usage: the classifier is applied to an unknown cell, which it labels cancerous.]
Criteria for classification methods

- Predictive accuracy: the ability of the classifier to correctly predict unseen data.
- Speed: the computation cost of building and using the classifier.
- Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values.
- Scalability: the ability to construct the classifier efficiently given large amounts of data.
- Interpretability: the level of understanding and insight provided by the classifier.
Machine learning: view by nature of methods

The five tribes of machine learning (Pedro Domingos):

Tribe           Origins                Master algorithm
Symbolists      Logic, philosophy      Inverse deduction
Evolutionaries  Evolutionary biology   Genetic programming
Connectionists  Neuroscience           Backpropagation
Bayesians       Statistics             Probabilistic inference
Analogizers     Psychology             Kernel machines
Symbolists
[Photos: Tom Mitchell, Steve Muggleton, Ross Quinlan]
[Figure: three alternative decision trees for the eight-cell data, combining the tests #nuclei?, color?, and #tails? in different orders to separate healthy (H) from cancerous (C) cells.]
Classification with decision trees (K236, Lecture 7)
Analogizers
[Photos: Peter Hart, Vladimir Vapnik, Douglas Hofstadter]
Kernel methods: the basic ideas

[Figure: points $x_1, x_2, \ldots, x_n$ in an input space X are mapped by $f$ to $f(x_1), \ldots, f(x_n)$ in a feature space F (with inverse map $f^{-1}$).]

A kernel function $k: X \times X \to R$ computes $k(x_i, x_j) = f(x_i) \cdot f(x_j)$ directly, yielding the kernel matrix $K_{n \times n}$. The kernel-based algorithm then operates on K alone: all computation is done on the kernel matrix, without ever constructing F explicitly.
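A tiny sketch of this idea in Python with NumPy (assumed tooling; the RBF kernel below is one common choice, not one named on the slide):

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    # k(xi, xj) = exp(-gamma * ||xi - xj||^2) equals an inner product
    # f(xi).f(xj) in an implicit, infinite-dimensional feature space.
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # points in input space X
n = len(X)

# Kernel matrix K (n x n): all a kernel-based algorithm ever needs to see.
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
print(K)
```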
Connectionists
[Photos: Yann LeCun, Geoff Hinton, Yoshua Bengio]
[Figure: a small neural network for the eight-cell data, with inputs such as "color = dark", "#nuclei = 1", and "#tails = 2" feeding output units for Healthy and Cancerous.]
Classification with neural networks (K236, Lecture 9); deep learning
Bayesians in machine learning
[Photos: David Heckerman, Judea Pearl, Michael Jordan]
(K236, Lecture 8)
Probabilistic graphical models: instances of graphical models

Probabilistic models include graphical models, which come in two families:
- Directed (Bayes nets): dynamic Bayesian networks (DBNs), hidden Markov models (HMMs), naïve Bayes classifiers, mixture models, Kalman filter models, LDA.
- Undirected (Markov random fields, MRFs): conditional random fields, MaxEnt.
(Murphy, ML for life sciences)
Outline
1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues
Mining with decision trees

A decision tree is a flow-chart-like tree structure:
- each internal node denotes a test on an attribute;
- each branch represents an outcome of the test;
- leaf nodes represent classes or class distributions;
- the top-most node in a tree is the root node.
[Figure: a decision tree classifying the eight cells {H1, H2, H3, H4, C1, C2, C3, C4}:]

#nuclei?
  = 1 {H1, H2, H3, C1}: color?
      light {H1, H3} -> H
      dark {H2, C1}: #tails?
          = 1 {H2} -> H
          = 2 {C1} -> C
  = 2 {H4, C2, C3, C4}: #tails?
      = 1 {H4, C2}: color?
          light {H4} -> H
          dark {C2} -> C
      = 2 {C3, C4} -> C
Decision tree induction (DTI)
- Decision tree generation consists of two phases:
  - Tree construction: partition the examples recursively based on selected attributes; at the start, all training objects are at the root.
  - Tree pruning: identify and remove branches that reflect noise or outliers.
- Use of decision trees: to classify unknown objects, test the attribute values of the object against the decision tree.
Tree construction: general algorithm

Two steps: recursively generate the tree (steps 1-4), then prune it (step 5). A runnable sketch follows.

1. At each node, choose the "best" attribute by a given measure for attribute selection.
2. Extend the tree by adding a new branch for each value of that attribute.
3. Sort the training examples to the leaf nodes.
4. If the examples in a node all belong to one class, stop; else repeat steps 1-4 for the leaf nodes.
5. Prune the tree to avoid over-fitting.
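A compact recursive sketch of steps 1-4 in Python (pruning, step 5, is omitted; information gain is used as the selection measure; the representation of examples as dicts with a "class" key and the helper names are our assumptions, not the slides'):

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(e["class"] for e in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(examples, attr):
    # Expected reduction in entropy from splitting on attr.
    n = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

def build_tree(examples, attrs):
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                 # step 4: pure node -> leaf
        return classes.pop()
    if not attrs:                         # no attribute left -> majority leaf
        return Counter(e["class"] for e in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))       # step 1
    tree = {best: {}}
    for v in {e[best] for e in examples}:                    # steps 2-3
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = build_tree(subset, [a for a in attrs if a != best])
    return tree
```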
Training data for the concept "play-tennis"

- A typical dataset in machine learning: 14 objects belonging to two classes {Yes, No} are observed on 4 attributes.
- Dom(Outlook) = {sunny, overcast, rain}
- Dom(Temperature) = {hot, mild, cool}
- Dom(Humidity) = {high, normal}
- Dom(Wind) = {weak, strong}

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   sunny     hot          high      weak    No
D2   sunny     hot          high      strong  No
D3   overcast  hot          high      weak    Yes
D4   rain      mild         high      weak    Yes
D5   rain      cool         normal    weak    Yes
D6   rain      cool         normal    strong  No
D7   overcast  cool         normal    strong  Yes
D8   sunny     mild         high      weak    No
D9   sunny     cool         normal    weak    Yes
D10  rain      mild         normal    weak    Yes
D11  sunny     mild         normal    strong  Yes
D12  overcast  mild         high      strong  Yes
D13  overcast  hot          normal    weak    Yes
D14  rain      mild         high      strong  No
[Figure: a decision tree for play-tennis that tests "temperature" at the root (cool / hot / mild) and then needs further tests on outlook, wind, and humidity in several branches, including a null leaf.]
A decision tree for playing tennis
A simple decision tree for playing tennis

outlook?
  sunny {D1, D2, D8, D9, D11}: humidity?
      high {D1, D2, D8} -> no
      normal {D9, D11} -> yes
  o'cast {D3, D7, D12, D13} -> yes
  rain {D4, D5, D6, D10, D14}: wind?
      strong {D6, D14} -> no
      weak {D4, D5, D10} -> yes

This tree is much simpler because "outlook" is selected at the root. How can we select a good attribute to split a decision node?
Which attribute is the best?

- The play-tennis set S contains 9 positive objects (+) and 5 negative objects (-), denoted [9+, 5-].
- If attributes "humidity" and "wind" split S into sub-nodes with the proportions of positive and negative objects shown below, which attribute is better?

A1 = humidity: [9+, 5-] splits into normal [6+, 1-] and high [3+, 4-]
A2 = wind: [9+, 5-] splits into weak [6+, 2-] and strong [3+, 3-]
Entropy

- Entropy characterizes the impurity (purity) of an arbitrary collection of objects:
  - S is the collection of positive and negative objects;
  - $p_+$ is the proportion of positive objects in S;
  - $p_-$ is the proportion of negative objects in S;
  - in the play-tennis example, these numbers are 14, 9/14, and 5/14, respectively.
- Entropy is defined as follows:

$Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
Entropy

[Figure: the entropy function for a Boolean classification, plotted as the proportion $p_+$ of positive objects varies between 0 and 1.]

If the collection contains c distinct groups of objects, the entropy is defined by

$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
Example

From the 14 examples of play-tennis, 9 positive and 5 negative objects (denoted [9+, 5-]):

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Notice:
1. Entropy is 0 if all members of S belong to the same class. For example, if all members are positive ($p_+$ = 1), then $p_-$ is 0, and Entropy(S) = -1·log2(1) - 0·log2(0) = 0 - 0 = 0 (taking 0·log2(0) = 0).
2. Entropy is 1 if the collection contains an equal number of positive and negative examples. If these numbers are unequal, the entropy is between 0 and 1.

A quick numeric check in code follows.
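A minimal sketch in Python, checking the numbers above (the helper name is ours):

```python
import math

def entropy2(p_pos, p_neg):
    # Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0*log2(0) taken as 0.
    # abs() only normalizes -0.0 to 0.0 in the pure-class case.
    return abs(-sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0))

print(round(entropy2(9/14, 5/14), 3))   # 0.94 -> Entropy([9+, 5-])
print(entropy2(1.0, 0.0))               # 0.0  -> all members the same class
print(entropy2(0.5, 0.5))               # 1.0  -> equal class counts
```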
Information gain measures the expected reduction in entropy

We define a measure, called information gain, of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the objects according to this attribute:

$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$

where Values(A) is the set of all possible values of attribute A, and $S_v$ is the subset of S for which A has value v.
Values(Wind) = {weak, strong}, S = [9+, 5-]
$S_{weak}$, the sub-node with value "weak", is [6+, 2-]; $S_{strong}$, the sub-node with value "strong", is [3+, 3-].

$Gain(S, Wind) = Entropy(S) - \sum_{v \in \{weak, strong\}} \frac{|S_v|}{|S|} Entropy(S_v)$
$= Entropy(S) - \frac{8}{14} Entropy(S_{weak}) - \frac{6}{14} Entropy(S_{strong})$
$= 0.940 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.0 = 0.048$
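The same computation as a Python sketch (H is a hypothetical helper for the entropy of a [pos, neg] node, not a name from the slides):

```python
import math

def H(pos, neg):
    # Entropy of a node with `pos` positive and `neg` negative objects.
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

gain_wind = H(9, 5) - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)
print(round(gain_wind, 3))   # 0.048
```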
Which attribute is the best classifier?

S: [9+, 5-], E = 0.940
- Humidity: high [3+, 4-] (E = 0.985), normal [6+, 1-] (E = 0.592)
  Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
- Wind: weak [6+, 2-] (E = 0.811), strong [3+, 3-] (E = 1.00)
  Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048

Humidity, with the larger information gain, is the better classifier.
Information gain of all attributes
- Gain(S, Outlook) = 0.246
- Gain(S, Humidity) = 0.151
- Gain(S, Wind) = 0.048
- Gain(S, Temperature) = 0.029
Next step in growing the decision tree

{D1, D2, ..., D14} [9+, 5-]
Outlook?
  sunny {D1, D2, D8, D9, D11} [2+, 3-] -> ? (which attribute should be tested here?)
  overcast {D3, D7, D12, D13} [4+, 0-] -> Yes
  rain {D4, D5, D6, D10, D14} [3+, 2-] -> ?

S_sunny = {D1, D2, D8, D9, D11}
Gain(S_sunny, Humidity) = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970
Gain(S_sunny, Temperature) = 0.970 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.570
Gain(S_sunny, Wind) = 0.970 - (2/5)(1.0) - (3/5)(0.918) = 0.019

So Humidity is tested at the sunny branch.
Attributes with many values

- If an attribute has many values (e.g., days of the month), information gain will favor it, so ID3 will select it.
- C4.5 uses GainRatio instead:

$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$

$SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

where $S_i$ is the subset of S for which A has value $v_i$.
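A sketch of both quantities in Python (hypothetical helper names), checked on the Wind split from the earlier example:

```python
import math

def split_information(sizes):
    # sizes: the subset sizes |S_i| induced by the values of attribute A.
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)

# Wind splits S (14 objects) into subsets of sizes 8 (weak) and 6 (strong):
print(round(gain_ratio(0.048, [8, 6]), 3))   # ~0.049
```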
Measures for attribute selection

- Gain Ratio: $\dfrac{\sum_j p_{\cdot j} \sum_i p_{i|j} \log p_{i|j} - \sum_i p_{i\cdot} \log p_{i\cdot}}{\sum_j p_{\cdot j} \log p_{\cdot j}}$ (Quinlan, C4.5, 1993)
- Gini Index: $\sum_j p_{\cdot j} \sum_i p_{i|j}^2 - \sum_i p_{i\cdot}^2$ (Breiman, CART, 1984)
- $\chi^2$: $\sum_i \sum_j \frac{(e_{ij} - n_{ij})^2}{e_{ij}}$, where $e_{ij} = \frac{n_{i\cdot} n_{\cdot j}}{n_{\cdot\cdot}}$ (statistics)
- R measure: $\sum_j p_{\cdot j} \max_i p_{i|j}$ (Ho & Nguyen, 1997)
Outline
1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues
Stopping condition

Tree growing stops at a node when:
1. every attribute has already been included along this path through the tree, or
2. the training objects associated with the node all have the same target attribute value (i.e., their entropy is zero).

Notice: algorithm ID3 uses information gain as the attribute-selection measure; its successor C4.5 uses gain ratio (a variant of information gain).
Generalization problem in classification

[Figure: three fits to the same data, illustrating underfitting, good fitting, and overfitting.]

- One of the most common tasks is to fit a "model" to a set of training data, so as to be able to make reliable predictions on general untrained data.
- Overfitting: a statistical model describes random error or noise instead of the underlying relationship.
- Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
- A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
Over-fitting in decision trees
- The generated tree may overfit the training data:
  - too many branches, some of which may reflect anomalies due to noise or outliers;
  - the result is poor accuracy on unseen objects.
- Two approaches to avoid overfitting (see the sketch below):
  - Prepruning: halt tree construction early; do not split a node if this would bring the goodness measure below a threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees, and use a set of data different from the training data to decide which is the "best pruned tree".
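As one concrete postpruning scheme, here is a hedged sketch using scikit-learn's cost-complexity pruning (an assumed library and method choice, not the slides' own algorithm): grow one tree per candidate pruning strength and keep the one that scores best on held-out data.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Toy cell data again: [color (0 = light, 1 = dark), #nuclei, #tails]
X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]
y = ["healthy"] * 4 + ["cancerous"] * 4
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate pruning strengths (alphas) for this training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# The "best pruned tree": highest accuracy on data not used for training.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val))
```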
Converting a tree to rules

outlook?
  sunny: humidity?
      high -> no
      normal -> yes
  o'cast -> yes
  rain: wind?
      strong -> no
      weak -> yes

IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes
(one rule per root-to-leaf path)
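A small sketch of extracting the paths programmatically with scikit-learn's export_text (assumed tooling; the 0/1 encoding below is hypothetical). Each printed root-to-leaf path corresponds to one IF-THEN rule:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical binary encoding: [outlook_is_sunny, humidity_is_high]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["no", "yes", "yes", "yes"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["outlook_is_sunny", "humidity_is_high"]))
```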
Visualization of decision trees
- Tree map
- Cone tree
- Fisheye view
- Hyperbolic tree
- Our D2MS system and its T2.5D view
Ensemble learning

Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models (see the sketch after the figure).
- Boosting: make the examples currently misclassified more important.
- Bagging: use a different subset of the training data for each model.
[Figure: bagging. The training data is resampled into Data1, Data2, ..., Data m; Learner1, ..., Learner m each build a model; a model combiner merges Model1, ..., Model m into the final model, which approximates some unknown distribution better than any single constituent model.]
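A minimal sketch of both strategies using scikit-learn's ready-made wrappers (an assumed library; BaggingClassifier resamples the training data per model, AdaBoostClassifier reweights currently misclassified examples):

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]
y = ["healthy"] * 4 + ["cancerous"] * 4

# Bagging: each of the 10 trees sees a different bootstrap sample.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10).fit(X, y)

# Boosting: each round up-weights the examples the previous models got wrong.
boosted = AdaBoostClassifier(n_estimators=10).fit(X, y)

print(bagged.predict([[1, 2, 2]]), boosted.predict([[1, 2, 2]]))
```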
Random forest
- A random forest is a forest of random decision trees (an ensemble). (Leo Breiman, 1928-2005)
- Tree bagging: given a training set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:
  - sample n training examples with replacement and learn a tree;
  - repeat K times to learn K decision trees;
  - predict an unknown case by the majority vote over the results of the K trees.
- Random forest: as tree bagging, but choose a random subset of the attributes to build each tree (see the sketch below).
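A minimal random-forest sketch with scikit-learn (assumed library): K bootstrap-sampled trees, each choosing splits from a random attribute subset, combined by majority vote.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]
y = ["healthy"] * 4 + ["cancerous"] * 4

forest = RandomForestClassifier(
    n_estimators=50,        # K trees, each on a bootstrap sample
    max_features="sqrt",    # random attribute subset at each split
    bootstrap=True,
    random_state=0).fit(X, y)

print(forest.predict([[1, 2, 2]]))   # majority vote of the 50 trees
```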
Issues in decision tree learning
- Attribute selection
- Pruning trees
- From trees to rules (high cost of pruning)
- Visualization
- Data access: recent development on very large training sets; fast, efficient, and scalable (well-known systems: C4.5 and CART)
- Random forest
- Further reading: http://www.jaist.ac.jp/~bao/DA-K236/TopTenDMAlgorithms.pdf
Homework

A company prepared its marketing strategy: it sent out a promotion to various houses and recorded 4 facts (attributes) about each house, together with whether the people responded or not (the outcome of the promotion). The data are as in the table.

Manually build a decision tree with the method studied in this lecture.