On Data Labeling for Clustering Categorical Data

Page 1: On Data Labeling for Clustering Categorical Data

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology

On Data Labeling for Clustering Categorical Data

Hung-Leng Chen, Kun-Ta Chuang, Member, and Ming-Syan Chen

TKDE, Vol. 20, No. 11, 2008, pp. 1458-1471.

Presenter: Wei-Shen Tai

2008/11/4

Page 2: On Data Labeling for Clustering Categorical Data

N.Y.U.S.T.

I. M.

Outline

Introduction
Related work
Model of MARDL (MAximal Resemblance Data Labeling)
Experimental results
Conclusions
Comments

Page 3: On Data Labeling for Clustering Categorical Data

Motivation

Sampling

Scales down the size of the database and speeds up clustering algorithms.

The problem is how to allocate the unclustered data to the appropriate clusters.

[Figure: a large database is sampled; the sampled data are clustered; the remaining unclustered data still need labeling.]

Page 4: On Data Labeling for Clustering Categorical Data

Objective

Data labeling

Gives each unclustered data point the most appropriate cluster label.

MARDL is independent of clustering algorithms, and any categorical clustering algorithm can be utilized in this framework.

Page 5: On Data Labeling for Clustering Categorical Data

Categorical cluster representative

Node

An attribute name plus an attribute value. E.g., [A1=a] and [A2=m] are nodes.

N-nodeset

A set of n nodes in which every node belongs to a distinct attribute. E.g., {[A1=a], [A2=m]} is a 2-nodeset.

Independent nodesets

Two nodesets that contain no nodes from the same attribute are said to be independent of each other in a represented cluster. E.g., {[A1=a], [A2=m]} and {[A3=c]} are independent, so

p({[A1=a], [A2=m], [A3=c]}) = p({[A1=a], [A2=m]}) * p({[A3=c]})
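
The node, nodeset, and independence definitions on this slide can be sketched in a few lines of Python. The type aliases and the helper names `make_nodeset` and `are_independent` are illustrative assumptions, not names from the paper.

```python
from typing import FrozenSet, Tuple

# Assumed encoding: a node is an (attribute name, attribute value) pair.
Node = Tuple[str, str]
Nodeset = FrozenSet[Node]


def make_nodeset(*nodes: Node) -> Nodeset:
    """Build a nodeset, rejecting two nodes that share an attribute."""
    attributes = [attr for attr, _ in nodes]
    if len(attributes) != len(set(attributes)):
        raise ValueError("each node must come from a distinct attribute")
    return frozenset(nodes)


def are_independent(x: Nodeset, y: Nodeset) -> bool:
    """Two nodesets are independent when they share no attribute."""
    attrs_x = {attr for attr, _ in x}
    attrs_y = {attr for attr, _ in y}
    return attrs_x.isdisjoint(attrs_y)


# The slide's example: {[A1=a], [A2=m]} and {[A3=c]}.
ns_am = make_nodeset(("A1", "a"), ("A2", "m"))
ns_c = make_nodeset(("A3", "c"))
# A nodeset that reuses attribute A1, so it is not independent of ns_am.
ns_b = make_nodeset(("A1", "b"))
```

Using frozensets keeps nodesets hashable, so they can later serve as dictionary keys when importance values are stored per nodeset.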

Page 6: On Data Labeling for Clustering Categorical Data

Node and n-nodeset importance

Information theory

Entropy
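
The formulas for this slide did not survive in the transcript, so the following is only a sketch of one plausible entropy-based weighting, not the paper's definition. It assumes that a node matters more to a cluster when it is frequent in that cluster and when its occurrences are concentrated in few clusters (low entropy). The function name `node_importance` and the data layout are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Node = Tuple[str, str]
Point = Dict[str, str]


def node_importance(clusters: Dict[str, List[Point]], node: Node, cluster_id: str) -> float:
    """Assumed weighting: in-cluster frequency times (1 - normalized entropy)."""
    attr, value = node
    # How often the node occurs in each cluster.
    counts = {
        cid: sum(1 for point in points if point.get(attr) == value)
        for cid, points in clusters.items()
    }
    total = sum(counts.values())
    if total == 0 or not clusters[cluster_id]:
        return 0.0
    # Shannon entropy of the node's distribution across clusters.
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            prob = count / total
            entropy -= prob * math.log(prob)
    k = len(clusters)
    normalized = entropy / math.log(k) if k > 1 else 0.0
    frequency = counts[cluster_id] / len(clusters[cluster_id])
    return frequency * (1.0 - normalized)


# Toy data: [A1=a] occurs only in c1, while [A2=m] is spread evenly.
toy_clusters = {
    "c1": [{"A1": "a", "A2": "m"}, {"A1": "a", "A2": "m"}],
    "c2": [{"A1": "b", "A2": "m"}, {"A1": "b", "A2": "m"}],
}
w_concentrated = node_importance(toy_clusters, ("A1", "a"), "c1")
w_spread = node_importance(toy_clusters, ("A2", "m"), "c1")
```

Under this assumed weighting, a node that appears equally in every cluster carries no discriminating information and scores near zero, while a node unique to one cluster scores its in-cluster frequency.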

Page 7: On Data Labeling for Clustering Categorical Data

N-nodeset importance representative (NNIR)

NNIR tree construction and pruning

An Apriori-like algorithm:

Initialization
Computing candidate nodeset importance and pruning
Generating candidate nodesets

Pruning

Threshold: the importance of a t-nodeset is less than a predefined θ.
Relative maximum: the importance of a (t+1)-nodeset is larger than the importance of a t-nodeset.
Hybrid: combines the threshold and relative maximum criteria.
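
The level-wise, Apriori-like loop above (initialize, score and prune, generate the next candidates) might look roughly like the sketch below. It implements threshold pruning only; relative maximum and hybrid pruning are left out. The `importance` callback, the function name `build_nodesets`, and the support-based stand-in score in the example are all assumptions for illustration, not the paper's definitions.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Tuple

Node = Tuple[str, str]
Nodeset = FrozenSet[Node]
Point = Dict[str, str]


def build_nodesets(
    points: List[Point],
    importance: Callable[[Nodeset], float],
    theta: float,
) -> Dict[Nodeset, float]:
    """Grow nodesets level by level, keeping those with importance >= theta."""
    kept: Dict[Nodeset, float] = {}
    # Initialization: every 1-nodeset seen in the cluster is a candidate.
    level = {frozenset([(a, v)]) for p in points for a, v in p.items()}
    while level:
        # Compute candidate importance and apply threshold pruning.
        survivors = []
        for candidate in level:
            weight = importance(candidate)
            if weight >= theta:
                kept[candidate] = weight
                survivors.append(candidate)
        # Generate (t+1)-nodeset candidates by joining surviving t-nodesets.
        next_level = set()
        for x, y in combinations(survivors, 2):
            union = x | y
            attributes = {a for a, _ in union}
            if len(union) == len(x) + 1 and len(attributes) == len(union):
                next_level.add(union)
        level = next_level
    return kept


cluster_points = [
    {"A1": "a", "A2": "m"},
    {"A1": "a", "A2": "m"},
    {"A1": "b", "A2": "m"},
]


def support(nodeset: Nodeset) -> float:
    """Stand-in importance: fraction of points containing every node."""
    hits = sum(1 for p in cluster_points if all(p.get(a) == v for a, v in nodeset))
    return hits / len(cluster_points)


nnir = build_nodesets(cluster_points, support, theta=0.5)
```

With θ = 0.5, the node [A1=b] is pruned at the first level, so no larger nodeset containing it is ever generated; that early cut is what makes the level-wise search cheaper than scoring every possible nodeset.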

Page 8: On Data Labeling for Clustering Categorical Data

Maximal resemblance data labeling

Goal of MARDL

Decide the most appropriate cluster label ci for the unlabeled data point.

Example: an unclustered data point {[A1=a], [A2=m], [A3=c]} is compared to the combination {[A1=a], [A2=m]} and {[A3=c]} in cluster c1.
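
A brute-force version of this labeling step can be sketched as follows. It assumes that the resemblance of a combination is the sum of the importance values of its nodesets, that a nodeset missing from a cluster's tree contributes zero, and that the point receives the label of the cluster with the largest maximum. The names `partitions`, `resemblance`, and `label`, and the importance values in the example, are made up for illustration.

```python
from typing import Dict, FrozenSet, Iterator, List, Tuple

Node = Tuple[str, str]
Nodeset = FrozenSet[Node]
Point = Dict[str, str]


def partitions(nodes: List[Node]) -> Iterator[List[Nodeset]]:
    """Yield every split of `nodes` into non-empty, mutually independent nodesets."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for part in partitions(rest):
        # Either `first` forms its own nodeset ...
        yield [frozenset([first])] + part
        # ... or it joins one of the existing nodesets.
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]


def resemblance(point: Point, tree: Dict[Nodeset, float]) -> float:
    """Best total importance over all combinations of the point's nodes."""
    best = 0.0
    for combination in partitions(sorted(point.items())):
        score = sum(tree.get(nodeset, 0.0) for nodeset in combination)
        best = max(best, score)
    return best


def label(point: Point, trees: Dict[str, Dict[Nodeset, float]]) -> str:
    """Return the cluster whose tree gives the maximal resemblance."""
    return max(trees, key=lambda cid: resemblance(point, trees[cid]))


# Hypothetical importance values for two clusters.
trees = {
    "c1": {
        frozenset([("A1", "a"), ("A2", "m")]): 0.8,
        frozenset([("A3", "c")]): 0.5,
        frozenset([("A1", "a")]): 0.3,
    },
    "c2": {
        frozenset([("A1", "a")]): 0.2,
        frozenset([("A3", "c")]): 0.1,
    },
}
unclustered = {"A1": "a", "A2": "m", "A3": "c"}
chosen = label(unclustered, trees)
```

For cluster c1 the best combination is {[A1=a], [A2=m]} plus {[A3=c]}, the one shown on the slide. The number of combinations grows quickly with the number of attributes, which is why an exhaustive search like this is only practical for short data points.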

Page 9: On Data Labeling for Clustering Categorical Data

Approximate algorithm for MARDL

Only one combination is considered and used.

Tree nodes are queued and sorted by importance value.
The nodeset with maximal importance is selected.
Nodesets that are not independent of the selected nodeset are removed from the queue.

Example: an unclustered data point {[A1=a], [A2=m], [A3=c]} and a tree nodeset queue.
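
The greedy procedure on this slide can be sketched as below. Two details are assumptions rather than statements from the slides: only tree nodesets fully contained in the data point enter the queue, and the resemblance is the sum of the selected importance values. Skipping a dependent nodeset when it is reached is equivalent to removing it from the queue in advance. The function name and the example values are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Node = Tuple[str, str]
Nodeset = FrozenSet[Node]
Point = Dict[str, str]


def approximate_resemblance(point: Point, tree: Dict[Nodeset, float]) -> float:
    """Greedily build a single combination from the highest-importance nodesets."""
    point_nodes = set(point.items())
    # Queue the tree nodesets that match the point, most important first.
    queue = sorted(
        (nodeset for nodeset in tree if nodeset <= point_nodes),
        key=lambda nodeset: tree[nodeset],
        reverse=True,
    )
    covered_attributes = set()
    total = 0.0
    for nodeset in queue:
        attributes = {attr for attr, _ in nodeset}
        # Keep the nodeset only if it is independent of everything selected so far.
        if attributes.isdisjoint(covered_attributes):
            total += tree[nodeset]
            covered_attributes |= attributes
    return total


# Hypothetical importance values for one cluster's tree.
tree_c1 = {
    frozenset([("A1", "a"), ("A2", "m")]): 0.8,
    frozenset([("A3", "c")]): 0.5,
    frozenset([("A1", "a")]): 0.3,
}
score_c1 = approximate_resemblance({"A1": "a", "A2": "m", "A3": "c"}, tree_c1)
```

Here the queue order is {[A1=a], [A2=m]}, {[A3=c]}, {[A1=a]}; the first two are selected and the third is dropped because it shares attribute A1 with the first. The greedy pass costs one sort per cluster instead of an enumeration of every combination, at the price of possibly missing the true maximum.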

Page 10: On Data Labeling for Clustering Categorical Data

Experimental results

Page 11: On Data Labeling for Clustering Categorical Data

Conclusions

MARDL

Allocates unlabeled data points to appropriate clusters when sampling is used to cluster a very large categorical database.

NIR

A categorical cluster representative technique.

NNIR

A more powerful representative than NIR because combinations of attribute values are considered.

Page 12: On Data Labeling for Clustering Categorical Data

Comments

Advantage

A good method for assigning unclustered data to the appropriate trained clusters when categorical data are clustered by sampling.

The concept, derived from existing methods (Apriori and information theory), is easy to understand and accept.

MARDL is independent of clustering methods; any categorical clustering algorithm can be used in this framework.

Drawback

Constructing the tree for each cluster takes considerable time, and the resulting tree is a fairly complex cluster representation.

Because the importance of a (t+1)-nodeset may be larger than that of a t-nodeset, hybrid pruning has to compute the importance of every candidate (t+1)-nodeset, which is time-consuming.

Application

Classifying unclustered data when sampling is used to cluster a very large categorical database.