Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. Hung-Leng Chen, Ming-Syan Chen, and Su-Chen Lin, TKDE, Vol. 21, No. 5, 2009, pp. 652-665. Presenter: Wei-Shen Tai, 2009/7/1.

Page 1: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology

Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Hung-Leng Chen, Ming-Syan Chen, and Su-Chen Lin

TKDE, Vol.21, No. 5, 2009, pp. 652-665.

Presenter : Wei-Shen Tai

2009/7/1

Page 2: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

N.Y.U.S.T.

I. M.

Intelligent Database Systems Lab

2

Outline

Introduction
Preliminaries
Node Importance Representative
Drifting concept detection
Clustering relationship analysis
Experimental results
Conclusion
Comments

Page 3: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Motivation

Detect concept drift over time in the categorical domain. For example, the buying preferences of customers may change with time.

Page 4: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Objective

Propose a framework for clustering categorical time-evolving data. The framework detects concept drift and analyzes the relationship between drifting concepts.

Page 5: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Node importance representative (NIR)

NIR represents a cluster as the distribution of its attribute values, which are called "nodes" (e.g. [age = 50-59]). Each node Iir is assigned an importance value in cluster ci.

The importance weighting is similar to TF-IDF and entropy.
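The idea can be sketched in Python. The slide's importance formula did not survive in this transcript, so the weighting below is an assumption: a node's importance in a cluster is its frequency within that cluster, scaled by one minus the normalized entropy of the node's distribution across clusters, in the TF-IDF and entropy spirit mentioned above. The function name `build_nir` and the `(attribute_index, value)` node encoding are hypothetical.

```python
import math
from collections import Counter

def build_nir(clusters):
    """Build a node importance representative for each cluster (illustrative sketch).

    clusters: list of clusters; each cluster is a list of records;
              each record is a tuple of categorical values, one per attribute.
    Returns one dict per cluster mapping node (attr_index, value) -> importance.
    The weighting is an ASSUMED form, not taken from the slides.
    """
    k = len(clusters)
    # Count how often each node (attribute index, value) occurs in each cluster.
    counts = [Counter((a, v) for rec in c for a, v in enumerate(rec)) for c in clusters]
    nodes = set().union(*counts)
    nir = [dict() for _ in clusters]
    for node in nodes:
        per_cluster = [cnt[node] for cnt in counts]
        total = sum(per_cluster)
        # Normalized entropy of the node across clusters:
        # 0 when the node occurs in a single cluster, 1 when spread evenly.
        entropy = 0.0
        if k > 1:
            for n in per_cluster:
                if n:
                    p = n / total
                    entropy -= p * math.log(p)
            entropy /= math.log(k)
        for i, n in enumerate(per_cluster):
            if n:
                # Frequency inside the cluster, damped for nodes shared by many clusters.
                nir[i][node] = (n / len(clusters[i])) * (1.0 - entropy)
    return nir

# Toy data: two attributes, two clusters.
toy = [[("a", "x"), ("a", "y")], [("b", "x")]]
for i, weights in enumerate(build_nir(toy)):
    print(i, sorted(weights.items()))
```

Under this assumed weighting, a node that appears evenly in every cluster gets importance 0, while a node that appears in only one cluster keeps its full in-cluster frequency.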

Page 6: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Drifting concept detection

DCD detects the difference in cluster distributions between the current subset St and the last clustering result C[te, t-1].

Data labeling

Page 7: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

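Data labeling and outlier detection

The resemblance between an input data point and a cluster can be obtained directly by summing up the importance of the point's nodes in the cluster's NIR.

Example for data point p7:
p7 vs. C1: 0.029
p7 vs. C2: 0.5 + 0.029 + 1 = 1.529
[Unlabeled value in the figure: 0.5]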
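The summation above can be sketched in Python. Two points are assumptions not stated on the slide: the point is labeled with the cluster of maximal resemblance, and it is treated as an outlier when that maximum does not exceed a threshold (the parameter `outlier_threshold` and its default of 0.5 are hypothetical). The toy node weights are chosen only to reproduce the slide's arithmetic.

```python
def resemblance(point, cluster_nir):
    """Sum the importance of the point's nodes in one cluster's NIR.

    point: tuple of categorical values, one per attribute.
    cluster_nir: dict mapping node (attr_index, value) -> importance.
    Nodes absent from the cluster contribute 0.
    """
    return sum(cluster_nir.get(node, 0.0) for node in enumerate(point))

def label(point, nir, outlier_threshold=0.5):
    """Return (cluster_index or None, scores).

    ASSUMPTION: the point goes to the cluster with maximal resemblance and is
    an outlier (None) when that resemblance does not exceed the threshold.
    """
    scores = [resemblance(point, c) for c in nir]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (best if scores[best] > outlier_threshold else None), scores

# Hypothetical node weights reproducing the slide's numbers for p7.
nir = [
    {(1, "m"): 0.029},                              # C1
    {(0, "x"): 0.5, (1, "m"): 0.029, (2, "c"): 1.0},  # C2
]
p7 = ("x", "m", "c")
cluster, scores = label(p7, nir)
print(cluster, [round(s, 3) for s in scores])
```

Here index 0 stands for C1 and index 1 for C2, so p7 is labeled with C2, the cluster it resembles most.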

Page 8: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Cluster distributions comparison

Clustering results are said to be different according to the following two criteria:
1. Quite a large number of outliers are found.
2. Quite a large number of clusters vary in their ratio of data points.

[Values shown in the figure: (0.4), (0.5), (0.3)]

Example:
C1: |2/5 - 4/5| = 0.4
C2: |3/5 - 0/5| = 0.6
Difference of results: 2/2 = 1
Outliers: 1/5 = 0.2
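The two criteria can be sketched as a small Python check. The mapping of the slide's three values to thresholds is an assumption: 0.4 is taken as the outlier-ratio threshold, 0.3 as the per-cluster variation threshold, and 0.5 as the threshold on the share of varied clusters. The function and parameter names are hypothetical.

```python
def concept_drifts(last_counts, curr_counts, n_outliers,
                   outlier_threshold=0.4,
                   variation_threshold=0.3,
                   difference_threshold=0.5):
    """Compare two cluster distributions (illustrative sketch).

    last_counts: points per cluster in the last clustering result.
    curr_counts: points labeled into the same clusters in the current subset.
    n_outliers:  points in the current subset that fit no cluster.
    Threshold defaults are ASSUMED mappings of the values on the slide.
    Returns (drift_detected, outlier_ratio, varied_cluster_ratio).
    """
    n_last = sum(last_counts)
    n_curr = sum(curr_counts) + n_outliers
    # Criterion 1: share of outliers in the current subset.
    outlier_ratio = n_outliers / n_curr
    # Criterion 2: share of clusters whose data-point ratio changed noticeably.
    varied = sum(abs(a / n_last - b / n_curr) > variation_threshold
                 for a, b in zip(last_counts, curr_counts))
    varied_ratio = varied / len(last_counts)
    drift = (outlier_ratio > outlier_threshold
             or varied_ratio > difference_threshold)
    return drift, outlier_ratio, varied_ratio

# The slide's example: last result has 2 and 3 points in C1 and C2;
# the current subset has 4 and 0 points plus 1 outlier.
print(concept_drifts([2, 3], [4, 0], 1))  # (True, 0.2, 1.0)
```

With these assumed thresholds, the outlier ratio of 0.2 alone would not signal a change, but both clusters vary by more than 0.3, so the second criterion reports a drifting concept.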

Page 9: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Clustering relationship analysis

CRA explains the drifting concepts based on the evolving clustering results.
Each cluster is represented by a node importance vector.
The distance between clusters is computed with the cosine measure.

[Figure labels: A, B, X, Y]
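A minimal Python sketch of the cosine measure between two node importance vectors follows. Storing each vector as a sparse dict keyed by node, and the function name `cosine`, are assumptions for illustration; nodes missing from a cluster are treated as having importance 0.

```python
import math

def cosine(nir_a, nir_b):
    """Cosine similarity between two node importance vectors.

    nir_a, nir_b: dicts mapping node -> importance (sparse vectors).
    Returns 0.0 when either vector is all zeros.
    """
    dot = sum(w * nir_b.get(node, 0.0) for node, w in nir_a.items())
    norm_a = math.sqrt(sum(w * w for w in nir_a.values()))
    norm_b = math.sqrt(sum(w * w for w in nir_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical clusters from two consecutive clustering results.
old_cluster = {(0, "x"): 1.0, (1, "m"): 0.5}
new_cluster = {(0, "x"): 0.8, (2, "c"): 0.6}
print(round(cosine(old_cluster, new_cluster), 3))
```

A value near 1 suggests that a cluster in the new result continues a cluster from the previous one, while a value near 0 suggests the two share no important nodes.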

Page 10: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Experimental results

Scalability

Accuracy

Page 11: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Discussion and conclusions

A framework to perform clustering on categorical time-evolving data:
Detects the drifting concepts at different sliding windows.
Generates the clustering results based on the current concept.
Analyzes and shows the relationship between clustering results by visualization.

Page 12: Catching the Trend- A Framework for Clustering Concept-Drifting Categorical Data

Comments

Advantage
This proposed framework provides a solution for clustering time-evolving data in the categorical domain.
It also provides an alternative, NIR-based similarity measure between a cluster and an input in categorical data sets.

Drawback
Only categorical data can be processed by this framework with NIR; even numerical data must first be transformed into categorical labels. In other words, it seems unsuitable for clustering in a mixed data domain.
The vector dimension of each class is not reduced, so preserving the full vector information takes too much space.
The node importance vector is similar to a binary coding, which makes the result of the cosine measure very small.

Application
Concept-drift detection for time-evolving data sets in the categorical domain.