Upload: stewart-johnston

Post on 02-Jan-2016



Department of Computer Science, National Tsing Hua University, No. 101 Kuang Fu Road, Hsinchu 300, Taiwan Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu, Taiwan

a Department of Computer Science, University of Vermont, Burlington, VT 05405, USA

b School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China

Integrating induction and deduction for noisy data mining

Presenter: 陳重光


Outline (1/2)

1. Introduction

2. Noise modeling with associative corruption rules

– A systematic noise handling framework

– Problem statement

• Input data

• Notations and definitions

– Method

• Algorithm ACF

• Algorithm ACB


Outline (2/2)

3. Experiments

– Experiment settings

– Experimental results

4. Conclusions


1. Introduction (1/3)

1. The main purpose of data mining is to discover knowledge from large amounts of data.

2. Among the many data mining topics, the three most fundamental are:

– Classification

– Cluster analysis

– Association analysis

3. Other classification methods:

– Linear regression models

– Classification rules

– Neural networks


1. Introduction(2/3)

4. Data mining techniques have been applied to many fundamental research domains

– Biology

– Medicine

– Ecology

5. Two essential driving forces push data mining research forward:

– Large amounts of data

– Powerful hardware support


1. Introduction(3/3)

6. The main purpose of this paper is to study the integration of induction and deduction for noisy data mining.


Noise modeling with associative corruption rules

a) In large-scale data mining applications, erroneous entries in the data are almost unavoidable.

b) Such noise makes the dataset less truthful, which directly lowers data quality.

c) The robustness of data mining results relies crucially on the quality of the underlying data.


A systematic noise handling framework

• Noise from different sources can be traced by analyzing the erroneous data items, unless it is totally random.

• Gaussian noise follows a normal distribution with a certain mean and variance, so it can be regarded as a kind of systematic noise.


1. To understand the nature of the noise.

2. To eliminate the noise from the source data so as to benefit the subsequent data mining process.

[Diagram labels: nature of the noise; data]


Problem statement

a) To derive the associative rules that corrupt the original clean dataset Dc1.

b) To eliminate the noise from the noisy data Dobs and construct a robust learner for supervised learning.


Input data

a) A subset of instances suspected of being noisy is identified based on a certain criterion.

b) Error correction rules are proposed, and error correction is performed on this subset of data.


Notations and definitions

• Before defining the concepts used in this study, we give some notation:

– A: the set of feature attributes of Dobs; A = {A1, A2, ..., AN};

– C: the class attribute of Dobs;

– V: the value space of the corresponding attributes; V = {V1, V2, ..., VN, VC}, where Vi corresponds to Ai and Ai ∈ (A ∪ {C});

– H: a 2-tuple structure <Ai, vi>, where

• Ai ∈ (A ∪ {C}); Ai is called the Head of H;

• vi ∈ Vi; vi is the value of the Head, and Vi corresponds to Ai;

– T: a 2-tuple structure <p, v>, where

• p = <Ai, vi> is a structure H; Ai is called the Head of T;

• v ∈ Vi; v is the modified value of the Head, and Vi corresponds to Ai;

• vi ≠ v.
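The structures H and T can be pictured as two small record types. The sketch below is only an illustration of the definitions on this slide: the class names, field names, and the example attribute "color" are assumptions, not code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class H:
    """A predicate <Ai, vi>: attribute Ai (the Head) takes value vi."""
    head: str
    value: str


@dataclass(frozen=True)
class T:
    """A corruption <p, v>: the value in predicate p is changed to v."""
    p: H
    v: str

    def __post_init__(self):
        # The definition requires vi != v: a corruption must change the value.
        if self.p.value == self.v:
            raise ValueError("modified value must differ from the original value")

    @property
    def head(self) -> str:
        # The Head of T is the Head of its predicate p.
        return self.p.head


# Hypothetical example: attribute "color" is changed from "red" to "blue".
t = T(H("color", "red"), "blue")
```

Making both records immutable lets them be used as set members or dictionary keys when rules are collected later.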


Method

Problem description

[Diagram: 1) Noise formation: clean data Dc1 is corrupted into noisy data Dobs1. 2) Noise correction: noisy data Dobs2 is corrected into Dcor2.]


• In this study, we propose a deductive learning procedure to derive these corruption rules.

• The idea follows a two-step fashion.

– First, we propose an algorithm called ACF (Associative Corruption Forward) to learn the noise formation mechanism from Dc1 to Dobs1.

– Second, we propose an algorithm called ACB (Associative Corruption Backward) that corrects Dobs2.


Algorithm ACF

• Algorithm ACF is used to infer the set of AC rules R1 that corrupts Dc1.

• It employs classification rule induction.
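The slides do not give the steps of ACF, so the following is only a rough sketch of the idea of learning corruption rules from paired clean and noisy rows. It replaces a real classification rule inducer with a brute-force search for a single condition, and it assumes that Dc1 and Dobs1 are row-aligned lists of dictionaries. The function name `infer_ac_rules`, the `min_conf` threshold, and the toy data are all hypothetical.

```python
def infer_ac_rules(d_clean, d_obs, min_conf=0.9):
    """Return rules as ((Aj, vj), (Ai, vi, v)): if Aj = vj, then Ai changes vi -> v."""
    attrs = list(d_clean[0].keys())
    pairs = list(zip(d_clean, d_obs))
    # Every distinct change <Ai, vi> -> v seen between paired rows, in first-seen order.
    changes = dict.fromkeys(
        (a, c[a], o[a]) for c, o in pairs for a in attrs if c[a] != o[a]
    )
    rules = []
    for a, vi, v in changes:
        best = None
        for b in attrs:
            if b == a:
                continue
            for vb in sorted({c[b] for c in d_clean}):
                # Rows where the candidate condition holds and Ai originally equals vi.
                covered = [(c, o) for c, o in pairs if c[b] == vb and c[a] == vi]
                if not covered:
                    continue
                hits = sum(1 for c, o in covered if o[a] == v)
                conf = hits / len(covered)
                # Keep the confident condition that explains the most changes.
                if conf >= min_conf and (best is None or hits > best[0]):
                    best = (hits, (b, vb))
        if best is not None:
            rules.append((best[1], (a, vi, v)))
    return rules


# Toy data (hypothetical): whenever A1 = x, the value p of A2 is corrupted to q.
d_c1 = [
    {"A1": "x", "A2": "p"},
    {"A1": "x", "A2": "q"},
    {"A1": "y", "A2": "p"},
    {"A1": "y", "A2": "q"},
    {"A1": "x", "A2": "p"},
]
d_obs1 = [
    {"A1": "x", "A2": "q"},
    {"A1": "x", "A2": "q"},
    {"A1": "y", "A2": "p"},
    {"A1": "y", "A2": "q"},
    {"A1": "x", "A2": "q"},
]
rules = infer_ac_rules(d_c1, d_obs1)
```

A full rule inducer would also consider conjunctions of several conditions; the single-condition search keeps the example short.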


Algorithm ACB

• Algorithm ACB (Associative Corruption Backward) is used for noise correction.

• The correction is not a strict one-to-one mapping.

• ACB builds a Naive Bayes learner based on Dc1 for each noise-corrupted attribute.
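One way to picture the correction step is sketched below. It is not the paper's ACB procedure: it assumes that the corruptions are already known as (attribute, original value, modified value) triples, and that an observed modified value may be either genuine or corrupted, which is why the reverse mapping is not one-to-one. A small categorical Naive Bayes model trained on Dc1 picks the more likely of the two. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
import math


class NaiveBayes:
    """Categorical Naive Bayes with add-one smoothing for one target attribute."""

    def __init__(self, rows, target):
        self.features = [a for a in rows[0] if a != target]
        self.class_counts = Counter(r[target] for r in rows)
        self.counts = defaultdict(Counter)  # (feature, class) -> value counts
        self.values = defaultdict(set)      # feature -> distinct values seen
        for r in rows:
            for f in self.features:
                self.counts[(f, r[target])][r[f]] += 1
                self.values[f].add(r[f])

    def predict(self, row, candidates):
        """Return the candidate target value with the highest log-score."""
        total = sum(self.class_counts.values())
        n_classes = len(self.class_counts)
        best, best_score = None, -math.inf
        for c in candidates:
            n_c = self.class_counts[c]
            score = math.log((n_c + 1) / (total + n_classes))
            for f in self.features:
                k = len(self.values[f])
                score += math.log((self.counts[(f, c)][row[f]] + 1) / (n_c + k))
            if score > best_score:
                best, best_score = c, score
        return best


def correct(d_obs2, d_clean1, corruptions):
    """Return a corrected copy of d_obs2; the input rows are left untouched."""
    models = {}
    fixed = [dict(r) for r in d_obs2]
    for attr, vi, v in corruptions:
        if attr not in models:
            # One learner per corrupted attribute, trained on the clean data.
            models[attr] = NaiveBayes(d_clean1, attr)
        for row in fixed:
            if row[attr] == v:
                # The observed v may be genuine or a corrupted vi; let the model decide.
                row[attr] = models[attr].predict(row, [vi, v])
    return fixed


# Toy data (hypothetical): in the clean data, A1 = x goes with A2 = p and A1 = y with A2 = q.
d_c1 = [{"A1": "x", "A2": "p"} for _ in range(3)] + [{"A1": "y", "A2": "q"} for _ in range(3)]
d_obs2 = [{"A1": "x", "A2": "q"}, {"A1": "y", "A2": "q"}]
d_cor2 = correct(d_obs2, d_c1, [("A2", "p", "q")])
```

In the toy data the first row is restored to p because A1 = x only ever occurs with p in the clean data, while the second row keeps its genuine q.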


Experiments

• Our experiments have two objectives.

– First, we examine whether algorithm ACF can accurately derive the AC rules.

– Second, we verify whether the noise correction procedure ACB can produce a higher-quality dataset Dcor2 for supervised learning.


Experiment settings

• We evaluate system performance on datasets collected from the UCI database repository and compare the results with reference [22].

• To evaluate the performance of the proposed method, we first separate Dclean into two parts:

– A dataset Dbase.

– The corresponding testing set Dtest.
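The split of Dclean could look like the short sketch below. The slides do not state the split ratio or how rows are chosen, so the 30% test fraction, the fixed seed, and the function name `split_clean` are assumptions made only for illustration.

```python
import random


def split_clean(d_clean, test_fraction=0.3, seed=0):
    """Shuffle a copy of d_clean and return (d_base, d_test)."""
    rows = list(d_clean)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


# Ten placeholder instances stand in for the rows of Dclean.
d_base, d_test = split_clean(range(10))
```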


Experimental results

• The set of AC rules R that corrupts the original clean dataset may contain more than one AC rule. However, the following restrictions apply to R:

– Every rule in R is an AC rule;

– Any two rules in R have different right-hand sides;

– If (P => Q) ∈ R, where Q = <p, v>, then predicate p does not appear on both the left- and right-hand sides of any other rule in R.
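The restrictions on R can be checked mechanically. The sketch below is one possible reading, not the authors' code. The rule encoding and the function name `violations` are assumptions. The third restriction is read here as "p appears on neither side of any other rule", which is the stricter interpretation of the wording.

```python
def violations(rules):
    """Return human-readable restriction violations for a rule set R.

    Each rule is (lhs, (p, v)): lhs is a set of predicates (attribute, value),
    p is the corrupted predicate (attribute, vi), and v is the modified value.
    """
    problems = []
    for i, (lhs, (p, v)) in enumerate(rules):
        # Restriction 1: an AC rule must actually change the value (vi != v).
        if p[1] == v:
            problems.append(f"rule {i}: modified value equals original value")
        for j, (lhs2, (p2, v2)) in enumerate(rules):
            if i == j:
                continue
            # Restriction 2: right-hand sides must be pairwise different.
            if i < j and (p, v) == (p2, v2):
                problems.append(f"rules {i} and {j}: identical right-hand side")
            # Restriction 3 (as read here): p is not reused anywhere in another rule.
            if p in lhs2 or p == p2:
                problems.append(f"rule {i}: predicate {p} reused in rule {j}")
    return problems


# Hypothetical rule sets over attributes A1..A4.
ok = [
    ({("A1", "x")}, (("A2", "p"), "q")),
    ({("A1", "y")}, (("A3", "m"), "n")),
]
# The extra rule uses <A2, p> as a condition, which rule 0 already corrupts.
bad = ok + [({("A2", "p")}, (("A4", "s"), "t"))]
```

Calling `violations(ok)` returns an empty list, while `violations(bad)` reports the reuse of the predicate that rule 0 corrupts.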


Basic information on the datasets for the experiment.


Comparative results of five models, m0 through m4. m0 is a benchmark learner built on the noise-free data Dclean.


Conclusions(1/2)

• We introduce a systematic noise handling framework in which deductive reasoning about noise information and inductive learning from the input data are integrated.

• We propose a method to handle the noise caused by Associative Corruption (AC) rules in supervised learning.


Conclusions(2/2)

• To correct Dobs2, we design a two-step method that consists of algorithms ACF and ACB.

• Our experiments show that the method can infer the noise formation mechanism accurately and perform noise correction appropriately, so as to enhance the quality of the original dataset.