K236: Basis of Data Analytics
Lecture 7: Classification and prediction (decision tree induction)
Lecturers: Tu Bao Ho and Hieu Chi Dam
TAs: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai
Schedule of K236
1. Introduction to data science (1) 6/9
2. Introduction to data science (2) 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and examination (the date is not fixed) 7/27
Data schemas vs. mining methods

Types of data
- Flat data tables
- Relational databases
- Temporal & spatial data
- Transactional databases
- Multimedia data
- Genome databases
- Materials science data
- Textual data
- Web data
- etc.

Mining tasks and methods
- Classification/prediction
  - Decision trees
  - Bayesian classification
  - Neural networks
  - Rule induction
  - Support vector machines
  - Hidden Markov models
  - etc.
- Description
  - Association analysis
  - Clustering
  - Summarization
  - etc.
Outline
1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues
Classification and prediction
[Figure: eight cells, H1-H4 (healthy) and C1-C4 (cancerous), shown once without labels (unsupervised data) and once with labels (supervised data).]

object  color  #nuclei  #tails  class
H1      light  1        1       healthy
H2      dark   1        1       healthy
H3      light  1        2       healthy
H4      light  2        1       healthy
C1      dark   1        2       cancerous
C2      dark   2        1       cancerous
C3      light  2        2       cancerous
C4      dark   2        2       cancerous
Given: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
- $x_i$ is the description of an object, phenomenon, etc.
- $y_i$ (the label attribute) is some property of $x_i$; if it is not available, learning is unsupervised.

Find: a function $f(x)$ that characterizes $\{x_i\}$ or such that $f(x_i) = y_i$.

The problem is usually called classification if the label is categorical, and prediction if the label is continuous (in this case, if the descriptive attributes are numerical, the problem is regression).
Classification: a two-step process

- Model construction: describing a set of predetermined classes.
  - Each tuple/object is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae (classifiers).
- Model usage: classifying future or unknown objects, and estimating the accuracy of the model.
  - The known label of each test object is compared with the model's classification result.
  - The accuracy rate is the percentage of test-set objects correctly classified by the model.
  - The test set must be independent of the training set; otherwise over-fitting will occur.
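As an illustration, here is a minimal sketch of the two steps on the eight-cell data, using scikit-learn (an assumed tool choice; the slides themselves are tool-agnostic):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy cell data: [color (0 = light, 1 = dark), #nuclei, #tails]
X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],   # H1-H4
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]   # C1-C4
y = ["healthy"] * 4 + ["cancerous"] * 4

# Step 1, model construction: learn a classifier from the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2, model usage: classify an independent test set and estimate accuracy.
print(accuracy_score(y_test, model.predict(X_test)))
```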
[Figure: the two-step process on the cell data. Model construction: the training data {H1, H2, H3, H4, C1, C2} is fed to a classification algorithm, which outputs a classifier (model) such as "If color = dark and #tails = 2 then cancerous cell". Model usage: the classifier is applied to an unknown cell, which it labels cancerous.]
Criteria for classification methods

- Predictive accuracy: the ability of the classifier to correctly predict unseen data.
- Speed: the computation cost of building and using the classifier.
- Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values.
- Scalability: the ability to construct the classifier efficiently given large amounts of data.
- Interpretability: the level of understanding and insight provided by the classifier.
Machine learning: view by nature of methods

The five tribes of machine learning (Pedro Domingos):

Tribe           Origins                Master algorithm
Symbolists      Logic, philosophy      Inverse deduction
Evolutionaries  Evolutionary biology   Genetic programming
Connectionists  Neuroscience           Backpropagation
Bayesians       Statistics             Probabilistic inference
Analogizers     Psychology             Kernel machines
Symbolists
[Photos: Tom Mitchell, Steve Muggleton, Ross Quinlan]
[Figure: three alternative decision trees for the eight-cell data, combining the tests #nuclei?, color?, and #tails? in different orders to separate healthy (H) from cancerous (C) cells.]
Classification with decision trees (K236, Lecture 7)
Analogizers
[Photos: Peter Hart, Vladimir Vapnik, Douglas Hofstadter]
Kernel methods: the basic ideas

[Figure: points $x_1, x_2, \ldots, x_n$ in an input space X are mapped by $f$ to $f(x_1), \ldots, f(x_n)$ in a feature space F (with inverse map $f^{-1}$).]

A kernel function $k: X \times X \to R$ computes $k(x_i, x_j) = f(x_i) \cdot f(x_j)$ directly, yielding the kernel matrix $K_{n \times n}$. The kernel-based algorithm then operates on K alone: all computation is done on the kernel matrix, without ever constructing F explicitly.
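A tiny sketch of this idea in Python with NumPy (assumed tooling; the RBF kernel below is one common choice, not one named on the slide):

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    # k(xi, xj) = exp(-gamma * ||xi - xj||^2) equals an inner product
    # f(xi).f(xj) in an implicit, infinite-dimensional feature space.
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # points in input space X
n = len(X)

# Kernel matrix K (n x n): all a kernel-based algorithm ever needs to see.
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
print(K)
```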
Connectionists
[Photos: Yann LeCun, Geoff Hinton, Yoshua Bengio]
[Figure: a small neural network for the eight-cell data, with inputs such as "color = dark", "#nuclei = 1", and "#tails = 2" feeding output units for Healthy and Cancerous.]
Classification with neural networks (K236, Lecture 9); deep learning
Bayesians in machine learning
[Photos: David Heckerman, Judea Pearl, Michael Jordan]
(K236, Lecture 8)
Probabilistic graphical models: instances of graphical models

Probabilistic models include graphical models, which come in two families:
- Directed (Bayes nets): dynamic Bayesian networks (DBNs), hidden Markov models (HMMs), naïve Bayes classifiers, mixture models, Kalman filter models, LDA.
- Undirected (Markov random fields, MRFs): conditional random fields, MaxEnt.
(Murphy, ML for life sciences)
Outline
1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues
Mining with decision trees

A decision tree is a flow-chart-like tree structure:
- each internal node denotes a test on an attribute;
- each branch represents an outcome of the test;
- leaf nodes represent classes or class distributions;
- the top-most node in a tree is the root node.
[Figure: a decision tree classifying the eight cells {H1, H2, H3, H4, C1, C2, C3, C4}:]

#nuclei?
  = 1 {H1, H2, H3, C1}: color?
      light {H1, H3} -> H
      dark {H2, C1}: #tails?
          = 1 {H2} -> H
          = 2 {C1} -> C
  = 2 {H4, C2, C3, C4}: #tails?
      = 1 {H4, C2}: color?
          light {H4} -> H
          dark {C2} -> C
      = 2 {C3, C4} -> C
Decision tree induction (DTI)
- Decision tree generation consists of two phases:
  - Tree construction: partition the examples recursively based on selected attributes; at the start, all training objects are at the root.
  - Tree pruning: identify and remove branches that reflect noise or outliers.
- Use of decision trees: to classify unknown objects, test the attribute values of the object against the decision tree.
Tree construction: general algorithm

Two steps: recursively generate the tree (steps 1-4), then prune it (step 5). A runnable sketch follows.

1. At each node, choose the "best" attribute by a given measure for attribute selection.
2. Extend the tree by adding a new branch for each value of that attribute.
3. Sort the training examples to the leaf nodes.
4. If the examples in a node all belong to one class, stop; else repeat steps 1-4 for the leaf nodes.
5. Prune the tree to avoid over-fitting.
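A compact recursive sketch of steps 1-4 in Python (pruning, step 5, is omitted; information gain is used as the selection measure; the representation of examples as dicts with a "class" key and the helper names are our assumptions, not the slides'):

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(e["class"] for e in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(examples, attr):
    # Expected reduction in entropy from splitting on attr.
    n = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

def build_tree(examples, attrs):
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                 # step 4: pure node -> leaf
        return classes.pop()
    if not attrs:                         # no attribute left -> majority leaf
        return Counter(e["class"] for e in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))       # step 1
    tree = {best: {}}
    for v in {e[best] for e in examples}:                    # steps 2-3
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = build_tree(subset, [a for a in attrs if a != best])
    return tree
```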
Training data for the concept "play-tennis"

- A typical dataset in machine learning: 14 objects belonging to two classes {Yes, No} are observed on 4 attributes.
- Dom(Outlook) = {sunny, overcast, rain}
- Dom(Temperature) = {hot, mild, cool}
- Dom(Humidity) = {high, normal}
- Dom(Wind) = {weak, strong}

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   sunny     hot          high      weak    No
D2   sunny     hot          high      strong  No
D3   overcast  hot          high      weak    Yes
D4   rain      mild         high      weak    Yes
D5   rain      cool         normal    weak    Yes
D6   rain      cool         normal    strong  No
D7   overcast  cool         normal    strong  Yes
D8   sunny     mild         high      weak    No
D9   sunny     cool         normal    weak    Yes
D10  rain      mild         normal    weak    Yes
D11  sunny     mild         normal    strong  Yes
D12  overcast  mild         high      strong  Yes
D13  overcast  hot          normal    weak    Yes
D14  rain      mild         high      strong  No
[Figure: a decision tree for play-tennis that tests "temperature" at the root (cool / hot / mild) and then needs further tests on outlook, wind, and humidity in several branches, including a null leaf.]
A decision tree for playing tennis
A simple decision tree for playing tennis

outlook?
  sunny {D1, D2, D8, D9, D11}: humidity?
      high {D1, D2, D8} -> no
      normal {D9, D11} -> yes
  o'cast {D3, D7, D12, D13} -> yes
  rain {D4, D5, D6, D10, D14}: wind?
      strong {D6, D14} -> no
      weak {D4, D5, D10} -> yes

This tree is much simpler because "outlook" is selected at the root. How can we select a good attribute to split a decision node?
Which attribute is the best?

- The play-tennis set S contains 9 positive objects (+) and 5 negative objects (-), denoted [9+, 5-].
- If attributes "humidity" and "wind" split S into sub-nodes with the proportions of positive and negative objects shown below, which attribute is better?

A1 = humidity: [9+, 5-] splits into normal [6+, 1-] and high [3+, 4-]
A2 = wind: [9+, 5-] splits into weak [6+, 2-] and strong [3+, 3-]
Entropy

- Entropy characterizes the impurity (purity) of an arbitrary collection of objects:
  - S is the collection of positive and negative objects;
  - $p_+$ is the proportion of positive objects in S;
  - $p_-$ is the proportion of negative objects in S;
  - in the play-tennis example, these numbers are 14, 9/14, and 5/14, respectively.
- Entropy is defined as follows:

$Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
Entropy

[Figure: the entropy function for a Boolean classification, plotted as the proportion $p_+$ of positive objects varies between 0 and 1.]

If the collection contains c distinct groups of objects, the entropy is defined by

$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
Example

From the 14 examples of play-tennis, 9 positive and 5 negative objects (denoted [9+, 5-]):

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Notice:
1. Entropy is 0 if all members of S belong to the same class. For example, if all members are positive ($p_+$ = 1), then $p_-$ is 0, and Entropy(S) = -1·log2(1) - 0·log2(0) = 0 - 0 = 0 (taking 0·log2(0) = 0).
2. Entropy is 1 if the collection contains an equal number of positive and negative examples. If these numbers are unequal, the entropy is between 0 and 1.

A quick numeric check in code follows.
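A minimal sketch in Python, checking the numbers above (the helper name is ours):

```python
import math

def entropy2(p_pos, p_neg):
    # Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0*log2(0) taken as 0.
    # abs() only normalizes -0.0 to 0.0 in the pure-class case.
    return abs(-sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0))

print(round(entropy2(9/14, 5/14), 3))   # 0.94 -> Entropy([9+, 5-])
print(entropy2(1.0, 0.0))               # 0.0  -> all members the same class
print(entropy2(0.5, 0.5))               # 1.0  -> equal class counts
```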
Information gain measures the expected reduction in entropy

We define a measure, called information gain, of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the objects according to this attribute:

$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$

where Values(A) is the set of all possible values of attribute A, and $S_v$ is the subset of S for which A has value v.
Values(Wind) = {weak, strong}, S = [9+, 5-]
$S_{weak}$, the sub-node with value "weak", is [6+, 2-]; $S_{strong}$, the sub-node with value "strong", is [3+, 3-].

$Gain(S, Wind) = Entropy(S) - \sum_{v \in \{weak, strong\}} \frac{|S_v|}{|S|} Entropy(S_v)$
$= Entropy(S) - \frac{8}{14} Entropy(S_{weak}) - \frac{6}{14} Entropy(S_{strong})$
$= 0.940 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.0 = 0.048$
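The same computation as a Python sketch (H is a hypothetical helper for the entropy of a [pos, neg] node, not a name from the slides):

```python
import math

def H(pos, neg):
    # Entropy of a node with `pos` positive and `neg` negative objects.
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

gain_wind = H(9, 5) - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)
print(round(gain_wind, 3))   # 0.048
```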
Which attribute is the best classifier?

S: [9+, 5-], E = 0.940
- Humidity: high [3+, 4-] (E = 0.985), normal [6+, 1-] (E = 0.592)
  Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
- Wind: weak [6+, 2-] (E = 0.811), strong [3+, 3-] (E = 1.00)
  Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048

Humidity, with the larger information gain, is the better classifier.
Information gain of all attributes
- Gain(S, Outlook) = 0.246
- Gain(S, Humidity) = 0.151
- Gain(S, Wind) = 0.048
- Gain(S, Temperature) = 0.029
Next step in growing the decision tree

{D1, D2, ..., D14} [9+, 5-]
Outlook?
  sunny {D1, D2, D8, D9, D11} [2+, 3-] -> ? (which attribute should be tested here?)
  overcast {D3, D7, D12, D13} [4+, 0-] -> Yes
  rain {D4, D5, D6, D10, D14} [3+, 2-] -> ?

S_sunny = {D1, D2, D8, D9, D11}
Gain(S_sunny, Humidity) = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970
Gain(S_sunny, Temperature) = 0.970 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.570
Gain(S_sunny, Wind) = 0.970 - (2/5)(1.0) - (3/5)(0.918) = 0.019

So Humidity is tested at the sunny branch.
Attributes with many values

- If an attribute has many values (e.g., days of the month), information gain will favor it, so ID3 will select it.
- C4.5 uses GainRatio instead:

$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$

$SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

where $S_i$ is the subset of S for which A has value $v_i$.
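A sketch of both quantities in Python (hypothetical helper names), checked on the Wind split from the earlier example:

```python
import math

def split_information(sizes):
    # sizes: the subset sizes |S_i| induced by the values of attribute A.
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)

# Wind splits S (14 objects) into subsets of sizes 8 (weak) and 6 (strong):
print(round(gain_ratio(0.048, [8, 6]), 3))   # ~0.049
```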
Measures for attribute selection

- Gain Ratio: $\dfrac{\sum_j p_{\cdot j} \sum_i p_{i|j} \log p_{i|j} - \sum_i p_{i\cdot} \log p_{i\cdot}}{\sum_j p_{\cdot j} \log p_{\cdot j}}$ (Quinlan, C4.5, 1993)
- Gini Index: $\sum_j p_{\cdot j} \sum_i p_{i|j}^2 - \sum_i p_{i\cdot}^2$ (Breiman, CART, 1984)
- $\chi^2$: $\sum_i \sum_j \frac{(e_{ij} - n_{ij})^2}{e_{ij}}$, where $e_{ij} = \frac{n_{i\cdot} n_{\cdot j}}{n_{\cdot\cdot}}$ (statistics)
- R measure: $\sum_j p_{\cdot j} \max_i p_{i|j}$ (Ho & Nguyen, 1997)
Outline
1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues
Stopping condition

Tree growing stops at a node when:
1. every attribute has already been included along this path through the tree, or
2. the training objects associated with the node all have the same target attribute value (i.e., their entropy is zero).

Notice: algorithm ID3 uses information gain as the attribute-selection measure; its successor C4.5 uses gain ratio (a variant of information gain).
Generalization problem in classification

[Figure: three fits to the same data, illustrating underfitting, good fitting, and overfitting.]

- One of the most common tasks is to fit a "model" to a set of training data, so as to be able to make reliable predictions on general untrained data.
- Overfitting: a statistical model describes random error or noise instead of the underlying relationship.
- Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
- A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
Over-fitting in decision trees
- The generated tree may overfit the training data:
  - too many branches, some of which may reflect anomalies due to noise or outliers;
  - the result is poor accuracy on unseen objects.
- Two approaches to avoid overfitting (see the sketch below):
  - Prepruning: halt tree construction early; do not split a node if this would bring the goodness measure below a threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees, and use a set of data different from the training data to decide which is the "best pruned tree".
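As one concrete postpruning scheme, here is a hedged sketch using scikit-learn's cost-complexity pruning (an assumed library and method choice, not the slides' own algorithm): grow one tree per candidate pruning strength and keep the one that scores best on held-out data.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Toy cell data again: [color (0 = light, 1 = dark), #nuclei, #tails]
X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]
y = ["healthy"] * 4 + ["cancerous"] * 4
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate pruning strengths (alphas) for this training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# The "best pruned tree": highest accuracy on data not used for training.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val))
```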
Converting a tree to rules

outlook?
  sunny: humidity?
      high -> no
      normal -> yes
  o'cast -> yes
  rain: wind?
      strong -> no
      weak -> yes

IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes
(one rule per root-to-leaf path)
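A small sketch of extracting the paths programmatically with scikit-learn's export_text (assumed tooling; the 0/1 encoding below is hypothetical). Each printed root-to-leaf path corresponds to one IF-THEN rule:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical binary encoding: [outlook_is_sunny, humidity_is_high]
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["no", "yes", "yes", "yes"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["outlook_is_sunny", "humidity_is_high"]))
```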
Visualization of decision trees
- Tree map
- Cone tree
- Fisheye view
- Hyperbolic tree
- Our D2MS system and its T2.5D view
Ensemble learning

Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models (see the sketch after the figure).
- Boosting: make the examples currently misclassified more important.
- Bagging: use a different subset of the training data for each model.
[Figure: bagging. The training data is resampled into Data1, Data2, ..., Data m; Learner1, ..., Learner m each build a model; a model combiner merges Model1, ..., Model m into the final model, which approximates some unknown distribution better than any single constituent model.]
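A minimal sketch of both strategies using scikit-learn's ready-made wrappers (an assumed library; BaggingClassifier resamples the training data per model, AdaBoostClassifier reweights currently misclassified examples):

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]
y = ["healthy"] * 4 + ["cancerous"] * 4

# Bagging: each of the 10 trees sees a different bootstrap sample.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10).fit(X, y)

# Boosting: each round up-weights the examples the previous models got wrong.
boosted = AdaBoostClassifier(n_estimators=10).fit(X, y)

print(bagged.predict([[1, 2, 2]]), boosted.predict([[1, 2, 2]]))
```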
Random forest
- A random forest is a forest of random decision trees (an ensemble). (Leo Breiman, 1928-2005)
- Tree bagging: given a training set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:
  - sample n training examples with replacement and learn a tree;
  - repeat K times to learn K decision trees;
  - predict an unknown case by the majority vote over the results of the K trees.
- Random forest: as tree bagging, but choose a random subset of the attributes to build each tree (see the sketch below).
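A minimal random-forest sketch with scikit-learn (assumed library): K bootstrap-sampled trees, each choosing splits from a random attribute subset, combined by majority vote.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]
y = ["healthy"] * 4 + ["cancerous"] * 4

forest = RandomForestClassifier(
    n_estimators=50,        # K trees, each on a bootstrap sample
    max_features="sqrt",    # random attribute subset at each split
    bootstrap=True,
    random_state=0).fit(X, y)

print(forest.predict([[1, 2, 2]]))   # majority vote of the 50 trees
```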
Issues in decision tree learning
- Attribute selection
- Pruning trees
- From trees to rules (high cost of pruning)
- Visualization
- Data access: recent development on very large training sets; fast, efficient, and scalable (well-known systems: C4.5 and CART)
- Random forest
- Further reading: http://www.jaist.ac.jp/~bao/DA-K236/TopTenDMAlgorithms.pdf
Homework

A company prepared its marketing strategy: it sent out a promotion to various houses and recorded 4 facts (attributes) about each house, together with whether the people responded or not (the outcome of the promotion). The data are as in the table.

Manually build a decision tree with the method studied in this lecture.