A fast ensemble pruning algorithm based on pattern mining process


A fast ensemble pruning algorithm based on pattern mining process

17 July 2009, Springer Science+Business Media, LLC 2009

69821514 洪佳瑜, 69821516 蔣克欽

Outline

Motive
Introduction
Method
Experiment
Conclusion

Motive

Most ensemble pruning methods in the literature need much pruning time and are mainly used in domains where time can be sacrificed in order to improve accuracy. This makes them unsuitable for applications that require a fast learning process, such as on-line network intrusion detection.

Introduction

Pattern mining based ensemble pruning (PMEP): the algorithm converts the ensemble pruning problem into a special pattern mining problem, which enables an FP-Tree to store the prediction results of all base classifiers, and then uses a new pattern mining method to select base classifiers.

The final output of our PMEP approach is the pruned ensemble with the highest correct value.

Properties of PMEP (1/2)

Firstly, it uses a transaction database to represent the prediction results of all base classifiers. This representation enables an FP-Tree to compact the results, and the ensemble pruning process becomes a pattern mining problem (sketched below).

Secondly, PMEP uses the majority voting principle to decide the candidate classifiers before the pattern mining process. For a given ensemble size k, PMEP only considers the paths of length ⌊k/2⌋ + 1 in the FP-Tree.
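A minimal sketch (not the authors' code) of the transaction-database representation from the first property: each training example becomes an itemset containing exactly the base classifiers that predict it correctly. The function name, the dictionary-of-predictions input, and the toy data are illustrative assumptions.

```python
# Illustrative sketch: each example X_i becomes a "transaction" (itemset)
# holding the base classifiers that predict it correctly; its length is the
# Num value shown in Method (1/7).

def build_transaction_db(predictions, labels):
    """predictions: dict of classifier name -> list of predicted labels;
    labels: list of true labels.  Returns a list of (CID, itemset) pairs."""
    db = []
    for i, y in enumerate(labels):
        correct = {h for h, preds in predictions.items() if preds[i] == y}
        db.append((f"X{i + 1}", correct))
    return db

# Toy data (three hypothetical base classifiers, four examples).
preds = {"h1": [0, 1, 1, 0], "h2": [0, 0, 1, 1], "h3": [1, 1, 1, 1]}
true_labels = [0, 1, 1, 1]
for cid, items in build_transaction_db(preds, true_labels):
    print(cid, sorted(items), len(items))
# X1 ['h1', 'h2'] 2
# X2 ['h1', 'h3'] 2
# X3 ['h1', 'h2', 'h3'] 3
# X4 ['h2', 'h3'] 2
```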

Properties of PMEP (2/2)

Thirdly, the pattern mining method greedily selects a set of classifiers, instead of one per iteration, which further saves pruning time.

Method (1/7)

CID  Itemset  Num

X1 h1, h2, h3, h4, h5, h6, h7, h8 8

X2 h2, h3, h4, h5, h7 5

X3 h2, h5, h6 3

X4 (empty) 0

X5 h1, h2, h6, h8 4

X6 h1, h2, h3, h4, h5, h6, h7, h8 8

X7 h5, h6, h7 3

X8 h2, h5, h7 3

X9 h3, h4, h5, h7 4

X10 h1, h2, h5, h6 4

X11 h2, h5, h6 3

X12 h1, h2, h4, h6, h7 5

Method (2/7)

CID Itemset Sorted Itemset

X2 h2, h3, h4, h5, h7 h2, h5, h7, h4, h3

X3 h2, h5, h6 h2, h5, h6

X5 h1, h2, h6, h8 h2, h6, h1, h8

X7 h5, h6, h7 h5, h6, h7

X8 h2, h5, h7 h2, h5, h7

X9 h3, h4, h5, h7 h5, h7, h4, h3

X10 h1, h2, h5, h6 h2, h5, h6, h1

X11 h2, h5, h6 h2, h5, h6

X12 h1, h2, h4, h6, h7 h2, h6, h7, h1, h4

For any i (1 ≤ i ≤ n), if Li = L or Li = 0 (the example is classified correctly by all base classifiers, or by none), we delete the corresponding row from table T to reduce computational cost.
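A small sketch of this preprocessing step, assuming the kept itemsets are re-ordered by descending classifier frequency with a simple lexicographic tie-break, as in standard FP-growth; applied to the table above, this ordering happens to reproduce the Sorted Itemset column. Function and variable names are illustrative.

```python
from collections import Counter

def prepare_itemsets(db, n_classifiers):
    """Drop rows where all classifiers are correct (L_i = L) or none are
    (L_i = 0), then order each remaining itemset by descending classifier
    frequency, as in standard FP-growth."""
    kept = [(cid, items) for cid, items in db if 0 < len(items) < n_classifiers]
    freq = Counter(h for _, items in kept for h in items)
    sorted_db = [(cid, sorted(items, key=lambda h: (-freq[h], h)))
                 for cid, items in kept]
    return sorted_db, freq

# Tiny demo: X1 (all correct) and X3 (none correct) are removed.
demo = [("X1", {"h1", "h2", "h3"}), ("X2", {"h2"}),
        ("X3", set()), ("X4", {"h2", "h3"})]
print(prepare_itemsets(demo, n_classifiers=3)[0])
# [('X2', ['h2']), ('X4', ['h2', 'h3'])]
```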

Method (3/7): FP-Tree
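The FP-Tree figure is not reproduced in this transcript, so here is a compact sketch of how such a tree could be built from the sorted itemsets: each itemset is inserted as a path from the root, and shared prefixes accumulate counts. Class and function names are illustrative, not taken from the paper.

```python
class FPNode:
    """One node of the FP-Tree: an item (classifier), a count, and children."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(sorted_itemsets):
    root = FPNode()
    for items in sorted_itemsets:
        node = root
        for h in items:
            node = node.children.setdefault(h, FPNode(h))
            node.count += 1
    return root

# The three itemsets X3, X10 and X11 from Method (2/7) share the prefix
# h2 -> h5 -> h6, so that path accumulates a count of 3.
root = build_fp_tree([["h2", "h5", "h6"],
                      ["h2", "h5", "h6", "h1"],
                      ["h2", "h5", "h6"]])
print(root.children["h2"].children["h5"].children["h6"].count)  # 3
```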

Method (4/7)

Suppose k = 5; then ⌊k/2⌋ + 1 = 3, and we have the Path-Table of all length-3 FP-Tree paths with their counts:
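Continuing the FPNode/build_fp_tree sketch above, one possible way to enumerate the Path-Table rows is to collect every root path of length ⌊k/2⌋ + 1 together with its end-node count. This is an illustrative reading of the slides, not the paper's exact procedure.

```python
def path_table(root, depth):
    """Collect every root path of the given length with its end-node count.
    Paths shorter than `depth` are ignored here; on the slides every kept
    itemset has at least three items, so none are lost in this example."""
    rows = []
    def walk(node, path):
        if len(path) == depth:
            rows.append((tuple(path), node.count))
            return
        for child in node.children.values():
            walk(child, path + [child.item])
    walk(root, [])
    return rows

# With k = 5, depth = 5 // 2 + 1 = 3.
tree = build_fp_tree([["h2", "h5", "h7"], ["h2", "h5", "h6"],
                      ["h5", "h6", "h7"], ["h2", "h5", "h6"]])
print(path_table(tree, depth=3))
# [(('h2', 'h5', 'h7'), 1), (('h2', 'h5', 'h6'), 2), (('h5', 'h6', 'h7'), 1)]
```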

Method (5/7)

The row with the largest count value has the classifier set {h2, h5, h6}. We add these three classifiers into S.set and set S.correct = 3. The selected rows are then deleted, leaving S.set = {h2, h5, h6}, S.correct = 3.

Method (6/7)

The first row now has the maximum count value, so the base classifier h7 is selected: S.set = {h2, h5, h6, h7}, S.correct = 7.

Method (7/7)

The classifier sets {h1} and {h4} have the same count value. Considering that the path of h1 is constructed earlier in the Path-Table than that of h4, we add h1 into S.set: S.set = {h2, h5, h6, h7, h1}, and S.correct = 8.
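The following is a reconstruction of the greedy selection in Methods (5/7)-(7/7), a sketch under stated assumptions rather than the authors' published code. It assumes that S.correct is the total count of Path-Table rows whose classifier sets are contained in S.set, that each iteration adds the classifier giving the largest increase, and that ties are broken in favour of the classifier whose path was constructed earlier. The Path-Table literal is derived from the length-3 prefixes of the Sorted Itemset column in Method (2/7); under these assumptions the sketch reproduces the values S.correct = 3, 7, 8 shown on the slides.

```python
# Path-Table reconstructed from the length-3 prefixes (k = 5, floor(k/2) + 1 = 3)
# of the Sorted Itemset column in Method (2/7), in order of first construction.
PATH_TABLE = [
    ({"h2", "h5", "h7"}, 2),   # X2, X8
    ({"h2", "h5", "h6"}, 3),   # X3, X10, X11
    ({"h2", "h6", "h1"}, 1),   # X5
    ({"h5", "h6", "h7"}, 1),   # X7
    ({"h5", "h7", "h4"}, 1),   # X9
    ({"h2", "h6", "h7"}, 1),   # X12
]

def covered(s):
    """Assumed meaning of S.correct: total count of Path-Table rows whose
    classifier set is contained in the current selection s."""
    return sum(cnt for path, cnt in PATH_TABLE if path <= s)

def prune(k=5):
    # Start from the classifier set of the row with the largest count value.
    s = set(max(PATH_TABLE, key=lambda row: row[1])[0])
    print(sorted(s), "correct =", covered(s))
    # Remember the first Path-Table row each classifier appears in;
    # earlier rows win ties, as on the Method (7/7) slide.
    first_row = {}
    for idx, (path, _) in enumerate(PATH_TABLE):
        for h in path:
            first_row.setdefault(h, idx)
    while len(s) < k:
        rest = sorted((h for h in first_row if h not in s), key=first_row.get)
        best = max(rest, key=lambda h: covered(s | {h}))  # ties -> earliest row
        s.add(best)
        print(sorted(s), "correct =", covered(s))
    return s, covered(s)

prune()
# ['h2', 'h5', 'h6'] correct = 3
# ['h2', 'h5', 'h6', 'h7'] correct = 7
# ['h1', 'h2', 'h5', 'h6', 'h7'] correct = 8
```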

Advantages

Classifiers with a negative effect on the ensemble have a low probability of being selected because of their low count values.

The selected classifiers come from multiple paths, which gives them low error correlation.

Experiment

We compared the performance of our approach, PMEP, against Bagging (Breiman 1996), GASEN (Zhou et al. 2002), and Forward Selection (FS) (Caruana et al. 2004) in our empirical study.

Test platform: AMD 4000+ CPU, 2 GB RAM, C programming language, Linux operating system.

All the tests are performed on 15 data sets from the UCI machine learning repository.

Results of prediction accuracy

Sizes of pruned ensembles for each data set; the last row is the average result over all 15 data sets.

Avg: 20  7.43  3.77  5.70

Results of pruning time (s)

Conclusion

The experimental results have shown that the proposed PMEP achieves the highest prediction accuracy and costs much less pruning time than GASEN and Forward Selection.

The design of our PMEP algorithm is aimed at the majority voting method; how to extend the algorithm to other combination strategies is part of our future work.

THANK YOU
