

Decision Tree

(Decision Tree Model)

김진석

Department of Statistics and Information Science

Dongguk University

E-mail:[email protected]

September 2008


Contents

Section 1. Boston Housing Data

Section 2. Brief introduction to Tree model
  2.1 Types of decision trees
  2.2 Construction process of tree model

Section 3. Growing
  3.1 Impurity function
  3.2 Split method

Section 4. Pruning

Section 5. Selection of the best tree


Section 6. Prediction and model evaluation

Section 7. Exercise: Spam E-mail Data


Section 1. Boston Housing Data

Housing data for 506 census tracts of Boston from the 1970 census. The data frame BostonHousing contains the original data by Harrison and Rubinfeld (1979). The original data are 506 observations on 14 variables, medv being the target variable:

• crim: per capita crime rate by town

• zn: proportion of residential land zoned for lots over 25,000 sq. ft.

• indus: proportion of non-retail business acres per town

• chas: Charles River dummy variable (= 1 if tract bounds river;0 otherwise)


• nox: nitric oxides concentration (parts per 10 million)

• rm: average number of rooms per dwelling

• age: proportion of owner-occupied units built prior to 1940

• dis: weighted distances to five Boston employment centres

• rad: index of accessibility to radial highways

• tax: full-value property-tax rate per USD 10,000

• ptratio: pupil-teacher ratio by town

• b: 1000(B − 0.63)² where B is the proportion of blacks by town

• lstat: percentage of lower status of the population


• medv: median value of owner-occupied homes in USD 1000’s

The original data have been taken from the UCI Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html.

> library(tree)

> library(mlbench)

> data(BostonHousing)

> t1<-tree(medv~., BostonHousing)

> summary(t1)

Regression tree:

tree(formula = medv ~ ., data = BostonHousing)

Variables actually used in tree construction:

[1] "rm" "lstat" "dis" "crim" "ptratio"

Number of terminal nodes: 9

Residual mean deviance: 13.55 = 6734 / 497

Distribution of residuals:


Min. 1st Qu. Median Mean 3rd Qu. Max.

-1.768e+01 -2.230e+00 7.026e-02 1.639e-16 2.221e+00 1.650e+01

Snip the tree model

t1.1 <- snip.tree(t1, nodes = c(4, 5, 6, 7))

t1.1

node), split, n, deviance, yval

* denotes terminal node

1) root 506 42720 22.53

2) rm < 6.941 430 17320 19.93

4) lstat < 14.4 255 6632 23.35 *

5) lstat > 14.4 175 3373 14.96 *

3) rm > 6.941 76 6059 37.24

6) rm < 7.437 46 1900 32.11 *

7) rm > 7.437 30 1099 45.10 *
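
Nodes are numbered in the usual binary fashion (the children of node n are 2n and 2n + 1), so nodes 4-7 are the four grandchildren of the root, and snip.tree() turns them into leaves, keeping only the first two levels of splits. As a sketch, the same four-leaf tree should also come out of cost-complexity pruning, since its total deviance 6632 + 3373 + 1900 + 1099 = 13004 matches the size-4 entry of the pruning sequence shown later:

t1.1b <- prune.tree(t1, best = 4)   # should coincide with snip.tree(t1, nodes = 4:7) here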


Plotting the tree model

pdf("tree.pdf")

plot(t1, type="uniform")

text(t1)

mtext("Tree model on Boston Housing data",

side = 3, line = 1, col=2, cex=2)

dev.off()

pdf("partitiontree.pdf")

partition.tree(t1.1, main="partition.tree(t1.1)")

dev.off()

Section 2. Brief introduction to Tree model

• CART (Breiman et al., 1984), C4.5 (Quinlan, 1993)


[Figure: left panel, "Tree model on Boston Housing data", shows the full tree t1 with splits on rm, lstat, dis, crim, and ptratio; right panel, "partition.tree(t1.1)", shows the partition of the (rm, lstat) plane induced by the snipped tree, with fitted values 23.3, 15.0, 32.1, and 45.1.]

Figure 1: Regression tree using the Boston Housing data


• Recursive partitioning: X = ∪_j R_j, with R_i ∩ R_j = ∅ for i ≠ j

• Constant fitting: f(x) = Σ_{j=1}^{M} c_j I(x ∈ R_j) (see the sketch below)
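
As a small check of the constant-fitting idea against the output above (a sketch; the region below is terminal node 4 of the snipped tree t1.1):

library(mlbench); data(BostonHousing)
idx <- with(BostonHousing, rm < 6.941 & lstat < 14.4)   # region reached by rm < 6.941 and lstat < 14.4
c(n = sum(idx), c4 = mean(BostonHousing$medv[idx]))     # should reproduce n = 255 and yval = 23.35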

2.1 Types of decision trees

• y is a continuous variable, i.e. a regression tree.

• y is a categorical variable, i.e. a classification tree.

– The two class problem

– The multiclass problem

2.2 Construction process of tree model

A tree model is usually constructed in the following three steps.


1. Growing

2. Pruning (often, pruning and selection together are simply referred to as pruning)

3. Selection

Section 3. Growing

1. Which measure should we use to choose a split?

2. Stopping rule (MDL, BIC, MML).

3. Computational efficiency:

(a) For large data sets, should we subsample at each node?


(b) How do we split on a covariate that is a multiclass categorical variable?

(c) How do we split when the response is a multiclass categorical variable?

3.1 Impurity function

The impurity function at node t is a function of the class-membership probabilities of y at that node, p(1|t), ..., p(J|t), and is written i_t(p(1|t), ..., p(J|t)). It is a symmetric function that takes its largest value when all the p(j|t) are equal, and equals 0 when one of them is 1 and the rest are 0.

Several impurity functions

• Least squares: i_t = (1/n_t) Σ_{j∈t} (y_j − ȳ_t)²

• Least absolute deviation: i_t = (1/n_t) Σ_{j∈t} |y_j − med_{k∈t} y_k|


• Gini index (CART):

  i_t = Σ_{i≠j} p(i|t) p(j|t) = p(1|t) p(2|t) if J = 2, and i_t = 1 − Σ_j p(j|t)² for J ≥ 2.

• Entropy (or information gain) index (C4.5): i_t = −Σ_j p(j|t) log p(j|t)
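
A minimal R sketch of the Gini and entropy impurities above, for a vector p of class proportions p(1|t), ..., p(J|t) at a node (the function names are ours, not from the tree package):

gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))
gini(c(0.5, 0.5)); entropy(c(0.5, 0.5))   # maximal impurity for J = 2: 0.5 and log(2)
gini(c(1, 0));     entropy(c(1, 0))       # pure node: both 0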

Note (Twoing criterion): although it is not an impurity measure, the twoing rule is used as a split criterion in multiclass problems; that is, we look for the split that maximizes

  (p_L p_R / 4) [ Σ_j |p(j|t_L) − p(j|t_R)| ]².


3.2 Split method

Let t be the current node and t_L, t_R its child nodes, and let s be a rule for partitioning the data at node t. We call s a split; it consists of a split variable and a split value (or split set).

1. Using impurity

D(s, t) = Δi(s, t) = i(t) − p_L i(t_L) − p_R i(t_R).

2. Using the Twoing criterion

  D(s, t) = (p_L p_R / 4) [ Σ_j |p(j|t_L) − p(j|t_R)| ]²

We choose the best split as s* = arg max_{s∈S} D(s, t).
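
A sketch of this search for a single numeric split variable at the root node, using the least-squares impurity from Section 3.1 (best_split is our illustrative helper, not part of the tree package):

best_split <- function(x, y) {
  ss   <- function(v) sum((v - mean(v))^2)          # n_t * i_t for the least-squares impurity
  cuts <- sort(unique(x))
  cuts <- (cuts[-1] + cuts[-length(cuts)]) / 2      # candidate cut points: midpoints between values
  dec  <- sapply(cuts, function(cp)
    (ss(y) - ss(y[x < cp]) - ss(y[x >= cp])) / length(y))   # D(s, t) at the root
  c(split = cuts[which.max(dec)], decrease = max(dec))
}
## e.g. best_split(BostonHousing$rm, BostonHousing$medv) should land close to
## the root split rm < 6.941 that tree() chose above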


Section 4. Pruning

Cost-complexity pruning: for any subtree T ⪯ T_max, define its complexity |T| as the number of terminal nodes in T. Let α ≥ 0 be a real number called the complexity parameter, and define the cost-complexity measure R_α(T) as

R_α(T) = R(T) + α|T|.

Pruning Criterion:

• Misclassification Error (or generalization error)

• Entropy or deviance

> prune.tree(t1)


$size

[1] 9 8 7 6 5 4 3 2 1

$dev

[1] 6733.787 7179.269 7904.869 9041.678 10483.604

[6] 13003.931 16064.888 23376.740 42716.295

$k

[1] -Inf 445.4817 725.6002 1136.8088 1441.9267

[6] 2520.3263 3060.9575 7311.8524 19339.5550

$method

[1] "deviance"
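
The k values above are the complexity-parameter thresholds: k[i] is the value of α at which the subtree of the corresponding size becomes optimal. As a quick check against the reported deviances (a sketch): collapsing from 9 to 8 leaves removes one terminal node and raises the deviance by

(7179.269 - 6733.787) / (9 - 8)   # = 445.482, matching k[2] above

so the 8-leaf tree wins once α exceeds about 445.5.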

> cvt <- cv.tree(fgl.tr, K=10, FUN=prune.misclass)  # fgl.tr is the fgl glass-data tree fitted below

> cvt

$size

[1] 20 16 12 11 9 6 5 4 3 1


$dev

[1] 77 77 75 72 73 82 88 89 93 144

$k

[1] -Inf 0.000000 1.000000 2.000000 2.500000 4.666667 7.000000

[8] 8.000000 11.000000 27.000000

$method

[1] "misclass"

attr(,"class")

[1] "prune" "tree.sequence"

pdf("ccp.pdf", width=8)

par(mfrow=c(1,2))

data(fgl, package="MASS")

fgl.tr <- tree(type ~ ., fgl)

plot(prune.tree(fgl.tr))

mtext("plot of pruning sequence", 3,2.5, cex=1.5, col=4)


cvt<-cv.tree(fgl.tr,K=10, FUN=prune.misclass)

plot(cvt);

mtext("plot of pruning sequence", 3,2.5, cex=1.5, col=4)

dev.off()

Section 5. Selection of the best tree

Selection Criterion:

• Misclassification Error (or generalization error)

• Entropy or deviance

How to measure them:

• Using test sample


[Figure: two panels, each titled "plot of pruning sequence": deviance versus tree size for the prune.tree sequence (left) and misclassification count versus tree size for the cv.tree sequence (right), with the corresponding values of k on the top axis.]

Figure 2: Cost-complexity pruning


• Using K-fold cross-validation: let L = L_1 ∪ … ∪ L_K.

  – Build the model on L − L_j

  – Test it on L_j
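
One way to turn the cross-validation output into a selected tree size (a sketch; for the cvt output shown earlier this gives 11, which is what best = 11 uses below):

best.size <- cvt$size[which.min(cvt$dev)]   # size with the smallest CV misclassification count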

pdf("comp1.pdf", width=8)

par(mfrow=c(1,2))

data(fgl, package="MASS")

cv.fgl.tree <- prune.tree(fgl.tr, best=11)  # size 11 minimized the CV misclassification count above

plot(fgl.tr, type="uniform");

text(fgl.tr);

plot(cv.fgl.tree, type="uniform");

text(cv.fgl.tree)

dev.off()


[Figure: the full classification tree fgl.tr for the fgl glass data (left) and the pruned tree cv.fgl.tree with 11 leaves (right); splits involve Mg, Na, Al, Ba, RI, K, Ca, Fe, and Si, and the leaves are labelled with glass types such as WinF, WinNF, Veh, Con, Tabl, and Head.]

Figure 3: The tree model before and after pruning


Section 6. Prediction and model evaluation

Using the finally selected model, we can predict unseen outputs corresponding to the input variables.

pred<-predict(cv.fgl.tree, fgl)

> pred[1:5,]

WinF WinNF Veh Con Tabl Head

1 0.7142857 0.07142857 0.2142857 0 0 0.00000000

2 0.9473684 0.02631579 0.0000000 0 0 0.02631579

3 0.1714286 0.82857143 0.0000000 0 0 0.00000000

4 0.9473684 0.02631579 0.0000000 0 0 0.02631579

5 0.9473684 0.02631579 0.0000000 0 0 0.02631579

Comparing the tree model's predictions with the observed classes:

> pr.class<-predict(cv.fgl.tree, fgl, "class")


> table(true=fgl$type, pred=pr.class)

pred

true WinF WinNF Veh Con Tabl Head

WinF 56 11 3 0 0 0

WinNF 3 63 4 4 2 0

Veh 6 4 7 0 0 0

Con 0 0 0 12 0 1

Tabl 0 0 0 0 9 0

Head 1 3 0 0 1 24

Computing the prediction error:

misclass.err <- mean(fgl$type != pr.class)

misclass.err

[1] 0.2009346


Section 7. Exercise: Spam E-mail Data

The data consist of 4601 email items, of which 1813 were identified as spam. The spam7 data frame used below is provided by the DAAG package (library(DAAG)).

• crl.tot: total length of words in capitals

• dollar: number of occurrences of the $ symbol

• bang: number of occurrences of the ! symbol

• money: number of occurrences of the word ‘money’

• n000: number of occurrences of the string ‘000’

• make: number of occurrences of the word ‘make’


• yesno: outcome variable, a factor with levels n (not spam) and y (spam)

spam.t <- tree(yesno ~ crl.tot + dollar + bang + money + n000 + make,

data=spam7)

spam.p<-prune.misclass(spam.t)

which.min(spam.p$dev)   # the pruning sequence stores its deviances in $dev, as in the output earlier
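
A possible way to finish the exercise (a sketch, assuming spam.t has been fitted as above): cross-validate, prune to the selected size, and inspect the confusion matrix.

library(tree)
library(DAAG)                                   # provides the spam7 data frame
set.seed(1)                                     # cv.tree() folds are random
spam.cv   <- cv.tree(spam.t, FUN = prune.misclass, K = 10)
best.size <- spam.cv$size[which.min(spam.cv$dev)]
spam.best <- prune.misclass(spam.t, best = best.size)
spam.pred <- predict(spam.best, spam7, type = "class")
table(true = spam7$yesno, pred = spam.pred)
mean(spam7$yesno != spam.pred)                  # training misclassification rate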
