A Hybrid Method to Categorical Clustering


Page 1: A Hybrid Method to Categorical Clustering

A Hybrid Method to Categorical Clustering

Student: 吳宗和
Advisor: Prof. J. Hsiang

Date : 7/4/2006

Page 2: A Hybrid Method to Categorical Clustering

Outline

Introduction
Motivation
Related Work
Research Goal / Notation
Our Method
Experiments
Conclusion
Future Work

Page 3: A Hybrid Method to Categorical Clustering

Introduction

Data clustering partitions a set of data elements into groups such that elements in the same group are similar, while elements in different groups are dissimilar [1]. The similarity must be well defined by the developer.

Data elements can be recorded as numerical values (e.g., Degree: 0°~360°) or as categorical values (e.g., Sex: {Male, Female}).

The goal is to find the latent structure in the dataset. Clustering methods proposed for numerical data cannot fit categorical data, because they lack a proper measure of similarity [21].

Page 4: A Hybrid Method to Categorical Clustering

Example:

An instance of the movie database:

Instance         | Director        | Actor         | Genre                   | Partition 1 | Partition 2
X-Man III        | Brett Ratner    | Hugh Jackman  | Action Adventure        | C1          | G1
Superman Returns | Bryan Singer    | Brandon Routh | Action Adventure        | C2          | G1
X-Man II         | Bryan Singer    | Hugh Jackman  | Action Adventure        | C1          | G1
Spiderman III    | Sam Raimi       | Tobey Maguire | Science Fiction Fantasy | C3          | G2
Van Helsing      | Stephen Sommers | Hugh Jackman  | Action Adventure        | C1          | G1

Attribute value sets:
Director: {Brett Ratner, Bryan Singer, Sam Raimi, Stephen Sommers}
Actor: {Hugh Jackman, Brandon Routh, Tobey Maguire}
Genre: {Action Adventure, Science Fiction Fantasy}

Page 5: A Hybrid Method to Categorical Clustering

Outline

Introduction
Motivation
Related Work
Research Goal / Notation
Our Method
Experiments
Conclusion
Future Work

Page 6: A Hybrid Method to Categorical Clustering

Motivation

Needs:
- For databases: much of the data in databases is categorical (described by people).
- For web navigation: web documents become categorical data after feature selection.
- For knowledge discovery: we want to find the latent structure in data.

The difficulties in clustering categorical data:
- The values a categorical element can take are not ordered.
- There is no intuitive measure of distance [21].
- Methods designed specifically for clustering categorical data are not easy to use.

Page 7: A Hybrid Method to Categorical Clustering

Motivation (cont’)

Problems in related work:
- Hard to understand or use.
- Parameters are data sensitive.
- Most of them CANNOT decide the number of groups while clustering [3,13,17,18].

Our guess/assumption: do we need an all-new method? We can instead reuse the methods proposed for numerical data, whose measures of distance cannot fit categorical data; all we NEED is a good measure of similarity/distance between two groups. We then REUSE the framework for clustering numerical data (an intuitive approach, fewer parameters, and less sensitivity to the data) and get a better clustering result.

Page 8: A Hybrid Method to Categorical Clustering

Related Work

Partition based:
- K-modes [Z. Huang, 1998]
- Monte-Carlo algorithm [Tao Li et al., 2003]
- COOLCAT [Barbara, D., Couto, J., & Li, Y., 2002]

Agglomerative based:
- ROCK [Guha, S., Rastogi, R., & Shim, K., 2000]
- CACTUS [Ganti, V., Gehrke, J., & Ramakrishnan, R., 1999]
- Best-K (ACE) [Keke Chen, Ling Liu, 2005]

Page 9: A Hybrid Method to Categorical Clustering

Summary of related work

Method                | Trapped in local minima | Parameters hard to decide | Number of groups decided by
K-modes               | Yes                     | No                        | Human
Monte-Carlo algorithm | No                      | No                        | Human
COOLCAT               | Yes                     | No                        | Human
ROCK                  | Yes                     | Hard                      | Human
CACTUS                | Yes                     | Hard                      | Human
Best-K (ACE)          | Yes                     | No                        | Machine (quality is not good)

Page 10: A Hybrid Method to Categorical Clustering

Outline

Introduction
Motivation
Related Work
Research Goal / Notation
Our Method
Experiments
Conclusion
Future Work

Page 11: A Hybrid Method to Categorical Clustering

Research Goal

Propose a method with:
- A measure of similarity between two groups.
- Reuse of the gravitation concept (for intuition and fewer parameters).
- A machine-decided, proper number of groups.
- A strengthened clustering result (for a better result).

Page 12: A Hybrid Method to Categorical Clustering

Notation

D: a data set of n elements p1, p2, p3, ..., pn, |D| = n.

Each element $p_i$ is a multidimensional vector of d categorical attributes, i.e., $p_i = \langle a_{i1}, a_{i2}, \dots, a_{id} \rangle$, where $a_{ij} \in A_j$ and $A_j$ is the finite set of all possible values $a_{ij}$ can take.

In related work, K is an integer given by the user; the elements are then divided into K groups $G_1, G_2, \dots, G_K$ such that $\bigcup_i G_i = D$ and $G_i \cap G_j = \emptyset$ for all $i \ne j$.

Page 13: A Hybrid Method to Categorical Clustering

Outline

Introduction
Motivation
Related Work
Research Goal
Our Method
- A measure of similarity
- Step 1: Gravitation model
- Step 2: Strengthen the result
Experiments
Conclusion
Future Work

Page 14: A Hybrid Method to Categorical Clustering

Our method

One similarity function, two steps.

Define a measure of similarity between two groups, and use that similarity as the measure of distance.

Two steps:
- Step 1: Use the concept of the gravitation model; find the most suitable number of groups K.
- Step 2: Overcome the local optimum; optimize the result.

Page 15: A Hybrid Method to Categorical Clustering

The Intuition of similarity

Based on the structure of the group: each group has its own probability distribution for each attribute, which can be represented as its group structure. Groups are similar if they have very similar probability distributions for each attribute.

[Figure: Similarity(Gi, Gj). Two bar charts show the attribute-value distributions of groups Gi and Gj over attributes A1 = V_{a1,1} through A12 = V_{a12,1}, with the y-axis in percent.]

Page 16: A Hybrid Method to Categorical Clustering

The Similarity function

A group $G_i = \langle A_1, A_2, A_3, \dots, A_d \rangle$, where each $A_r$ is a random variable for $r = 1, \dots, d$, and $p(A_r = v \mid G_i)$ is the probability that $A_r = v$.

Entropy of $A_r$ in group $G_i$:

$$entropy(A_r) = -\sum_{v \in A_r} p(A_r = v \mid G_i) \log p(A_r = v \mid G_i)$$

Entropy of $G_i$:

$$E(G_i) = \sum_{r=1}^{d} entropy(A_r)$$

Similarity of two groups:

$$Sim(G_i, G_j) = |G_i \cup G_j|\, E(G_i \cup G_j) - |G_i|\, E(G_i) - |G_j|\, E(G_j)$$
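As a concrete illustration, here is a minimal Python sketch of this similarity (our own code, not from the paper; it assumes each element is a tuple of d categorical values):

```python
import math
from collections import Counter

def group_entropy(group):
    """E(G): the sum over attributes of the entropy of that attribute's
    value distribution inside the group."""
    if not group:
        return 0.0
    n, d = len(group), len(group[0])
    total = 0.0
    for r in range(d):
        counts = Counter(row[r] for row in group)
        total -= sum((c / n) * math.log(c / n) for c in counts.values())
    return total

def sim(gi, gj):
    """Sim(Gi, Gj): the increase in total (size-weighted) entropy caused
    by merging the two groups; 0 when their distributions are identical."""
    merged = gi + gj
    return (len(merged) * group_entropy(merged)
            - len(gi) * group_entropy(gi)
            - len(gj) * group_entropy(gj))
```

By the concavity of entropy this quantity is never negative, which is exactly the property the next slide relies on.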

Page 17: A Hybrid Method to Categorical Clustering

To geometric analogy

Use Sim(Gi, Gj) as our geometric analogy (a distance), because:
- $Sim(G_i, G_j) \ge 0$ for all i, j.
- $Sim(G_i, G_i) = 0$ for all i.
- $Sim(G_i, G_j) = Sim(G_j, G_i)$ for all i, j.

Page 18: A Hybrid Method to Categorical Clustering

Step 1.

Use the concept of the gravitation model, with mass, radius, and time:
- Mass of a group = the number of its elements.
- Radius between two groups Gi, Gj = Sim(Gi, Gj).
- Time = ∆T, a constant.

Merging groups produces a clustering tree.

$$fg(G_1, G_2) = \frac{G \cdot M(G_1) \cdot M(G_2)}{R(G_1, G_2)^2}$$

[Figure: groups G1 and G2 at distance R(G1, G2) at time T.]

Page 19: A Hybrid Method to Categorical Clustering

A Merging Pass

M(Gi) = the number of elements in Gi. R(Gi, Gj) = Sim(Gi, Gj). The gravitational constant G is 1.0. fg(Gi, Gj) is the gravitational force. Δt is the time period, Δt = 1.0.

$$acc_{G_1} = \frac{fg(G_1, G_2)}{M(G_1)}, \qquad acc_{G_2} = \frac{fg(G_1, G_2)}{M(G_2)}$$

$$R_{new}(G_i, G_j) = R(G_i, G_j) - \frac{1}{2}\,(acc_{G_i} + acc_{G_j})\,\Delta t^2$$

If $R_{new}(G_i, G_j) < 0$, merge Gi and Gj. Groups keep getting closer; when the radius reaches 0 they merge, building a clustering tree.

[Figure: at time T, G1 and G2 are at distance R(G1, G2); at time T + ∆t, at R_new(G1, G2).]
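A sketch of one merging pass under these definitions, reusing sim() from above (the greedy closest-pair handling and the zero-radius guard are our assumptions, not from the slides):

```python
def merging_pass(groups, G=1.0, dt=1.0):
    """One pass of the gravitation model: compute the shrunken radius for
    every pair and merge the pairs whose radius would drop below zero."""
    candidates = []
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            gi, gj = groups[i], groups[j]
            r = sim(gi, gj)                            # radius R(Gi, Gj)
            fg = G * len(gi) * len(gj) / max(r, 1e-12) ** 2
            acc_i, acc_j = fg / len(gi), fg / len(gj)  # accelerations
            r_new = r - 0.5 * (acc_i + acc_j) * dt ** 2
            if r_new < 0:
                candidates.append((r_new, i, j))
    used, merged = set(), []
    for _, i, j in sorted(candidates):                 # closest pairs first
        if i not in used and j not in used:
            used.update((i, j))
            merged.append(groups[i] + groups[j])       # merge Gi and Gj
    return [g for k, g in enumerate(groups) if k not in used] + merged
```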

Page 20: A Hybrid Method to Categorical Clustering

A clustering tree.

[Figure: a clustering tree built over elements p1, p2, ..., pn, with C1 as a cluster node.]

Page 21: A Hybrid Method to Categorical Clustering

Suitable K.

In the sequence of merging passes, find the suitable number of groups in the partition. Suppose Step 1 merges everything into one group after T iterations, giving states S1, S2, S3, ..., S_{T-1}, S_T, where S_i is the state after the i-th iteration.

$$Stable_i = \Delta Time(S_i, S_{i+1}) \times Radius(S_i), \qquad Radius(S_i) = \min\{\, Sim(G_\alpha, G_\beta) : G_\alpha, G_\beta \in S_i,\ \alpha \ne \beta \,\}$$

The suitable K = the number of groups in the state $S_{i-r}$, $r = 0, \dots, l$, whose $Stable$ value is maximal; $l$ is given by the user.

In the following slides, we use K to denote the proper number of clusters in the partition.
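A sketch of this selection (ours), assuming each state in the merge sequence is recorded as a (time, groups) pair and reading ΔTime as the gap between consecutive merge times:

```python
def choose_k(states, sim_fn):
    """states: [(time_i, groups_i), ...] from the merging passes.
    Returns the number of groups in the most stable state, where
    Stable_i = DeltaTime(S_i, S_{i+1}) * Radius(S_i)."""
    def radius(groups):
        # Radius(S_i) = min pairwise Sim(G_a, G_b) over distinct groups
        return min(sim_fn(ga, gb)
                   for a, ga in enumerate(groups)
                   for gb in groups[a + 1:])
    best_k, best_stable = None, float("-inf")
    for (t_i, g_i), (t_next, _) in zip(states, states[1:]):
        if len(g_i) < 2:
            continue
        stable = (t_next - t_i) * radius(g_i)
        if stable > best_stable:
            best_k, best_stable = len(g_i), stable
    return best_k
```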

Page 22: A Hybrid Method to Categorical Clustering

Step 2.

Is there still a local minimum problem? Yes. We conquer it with a randomized algorithm.

Randomized step: exchanging two elements between two groups [11] is time consuming. Our goal is to enlarge the distance/radius between each pair of groups in the result obtained from Step 1. For efficiency, we use a tree-structured data structure: the digital search tree.

Page 23: A Hybrid Method to Categorical Clustering

Step 2 (cont’)

Why a digital search tree (D.S.T.)? Locality. Storing elements means storing binary strings derived from the elements; deeper nodes look like their siblings and children.

Page 24: A Hybrid Method to Categorical Clustering

Operations in D.S.T

A D.S.T. is tree-structured, stores binary strings, and each node has degree ≤ 2.

The operations cost little, O(d) each: Search, Insert, Delete.

Search operation (O(d)).

Page 25: A Hybrid Method to Categorical Clustering

Operations in D.S.T (cont’)

Insertion (O(d)). Ex: insert C010 into the tree.

Deletion (O(d)). Ex: delete G110.
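A minimal digital search tree sketch (ours, assuming distinct fixed-length bit strings as keys), showing why these operations stay O(d):

```python
class DSTNode:
    def __init__(self, key):
        self.key = key              # a bit string such as "0101"
        self.child = [None, None]   # child[0] on bit '0', child[1] on '1'

def dst_insert(root, key):
    """O(d): follow one bit per level, store the key at the first empty slot."""
    if root is None:
        return DSTNode(key)
    node, depth = root, 0
    while True:
        bit = int(key[depth])       # the depth-th bit picks the branch
        if node.child[bit] is None:
            node.child[bit] = DSTNode(key)
            return root
        node, depth = node.child[bit], depth + 1

def dst_search(root, key):
    """O(d): compare the key at each node, then follow the next bit."""
    node, depth = root, 0
    while node is not None:
        if node.key == key:
            return node
        node = node.child[int(key[depth])]
        depth += 1
    return None
```

Deletion is also O(d): find the node, then replace its key with one taken from a leaf of its own subtree.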

Page 26: A Hybrid Method to Categorical Clustering

Randomizing

Transform the result from Step 1 into a forest of K digital search trees, where the nodes in tree i are exactly the elements of $G_i$, for all i.

Randomly select two trees i, j, and select an internal node α from tree i. Let $g_\alpha = \{\, n : \alpha \text{ is an ancestor of node } n \,\} \cup \{\alpha\}$.

Then calculate:

$$GoodExchange(G_i, G_j, g_\alpha) = \begin{cases} Yes, & \text{if } Sim(G_i \setminus g_\alpha,\ G_j \cup g_\alpha) > Sim(G_i, G_j) \\ No, & \text{otherwise} \end{cases}$$
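A sketch of one randomized step under this rule (ours); pick_subtree is a hypothetical helper that returns α and all its descendants for a random internal node of tree i:

```python
import random

def randomized_step(groups, forest, sim_fn, pick_subtree):
    """Try one exchange: move the subtree g_alpha from tree i to tree j
    if doing so enlarges the radius Sim(Gi, Gj) between the two groups."""
    i, j = random.sample(range(len(groups)), 2)
    g_alpha = pick_subtree(forest[i])      # alpha plus its descendants
    gi_new = [p for p in groups[i] if p not in g_alpha]
    gj_new = groups[j] + list(g_alpha)
    if sim_fn(gi_new, gj_new) > sim_fn(groups[i], groups[j]):
        groups[i], groups[j] = gi_new, gj_new   # good exchange: keep it
        return True
    return False
```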

Page 27: A Hybrid Method to Categorical Clustering

Randomizing (cont’)

For example, K = 2 and α = ‘0101’.

[Figure: Phase 0 shows the forest for G1 and G2 with Sim(G1, G2) = 3.248661; Phase 1 shows the forest after moving the subtree rooted at α = ‘0101’ from G1 to G2, with Sim(G1, G2) = 8.985047. A good exchange!]

Page 28: A Hybrid Method to Categorical Clustering

Outline

Introduction
Motivation
Related Work
Research Goal / Notation
Our Method
Experiments
- Datasets
- Compared methods and criteria
Conclusion
Future Work

Page 29: A Hybrid Method to Categorical Clustering

Experiments

Datasets:
- Real datasets [7]: Congressional Voting Records, Mushroom.
- Document datasets [8,9]: Reuters-21578, 20 Newsgroups.

Page 30: A Hybrid Method to Categorical Clustering

Compared methods & Criterion

Compared methods:
- K-modes (randomly partitions elements into K groups).
- Coolcat (selects some elements as seeds, places the seeds into K groups, then places the remaining elements).

Criteria (average over 10 runs):

Category Utility:

$$CU = \frac{1}{K} \sum_{k=1}^{K} |c_k| \sum_{r=1}^{d} \sum_{v \in A_r} \left[ p(A_r = v \mid c_k)^2 - p(A_r = v)^2 \right]$$

Expected Entropy:

$$-\sum_{k=1}^{K} \sum_{r=1}^{d} \sum_{v \in A_r} p(A_r = v \mid c_k) \log p(A_r = v \mid c_k)$$

Purity:

$$Purity = \sum_{k=1}^{K} \frac{n_k}{n} P(c_k), \qquad P(c_k) = \frac{1}{n_k} \max_j n_k^j,$$

where $n_k^j$ is the number of elements of the j-th input class that were assigned to the k-th cluster.
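For instance, purity follows directly from the cluster assignments; a sketch (ours), where clusters holds the true class label of every element in each cluster:

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists, each inner list holding the true class
    labels of the elements assigned to one cluster. Implements
    sum_k (n_k / n) * max_j(n_k^j) / n_k = sum_k max_j(n_k^j) / n."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / n
```

For example, purity([['a', 'a', 'b'], ['b', 'b']]) returns 4/5 = 0.8.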

Page 31: A Hybrid Method to Categorical Clustering

Real datasets

Does the method classify like a human? Mushroom dataset: 2 classes (edible, poisonous), 22 attributes, 8124 records.

Method     | # Groups | Category Utility | Expected Entropy | Purity
K-modes    | 22       | 0.60842867       | 20.2             | 95.5%
Coolcat    | 30       | 0.25583642       | 35.1             | 96.2%
Our Method | 18       | 0.74743694       | 19.4             | 99.4%

Page 32: A Hybrid Method to Categorical Clustering

Real datasets

Voting Records dataset: 2 classes (democrat, republican), 435 records, 16 attributes (all boolean valued). Some records (10%) have missing values.

Method     | # Groups | Category Utility | Expected Entropy | Purity
K-modes    | 22       | 0.22505624       | 6.04533891       | 93.4%
Coolcat    | 29       | 0.14361348       | 7.42124111       | 94.6%
Our Method | 14       | 0.3177278        | 6.84627962       | 92.4%

Page 33: A Hybrid Method to Categorical Clustering

Document Datasets

For application: 20 Newsgroups has 20 subjects, each containing 1000 articles, 20000 documents in total (too many). We select the documents of 3 subjects as the dataset (3000 articles) and use the BOW package [4] for feature selection: 100 features.

Method     | # Groups | Expected Entropy | Purity
K-modes    | K = 5    | 3.02019429       | 92.2%
Coolcat    | K = 5    | 3.44101501       | 90.8%
Our Method | K = 5    | 2.93761945       | 95.0%

Page 34: A Hybrid Method to Categorical Clustering

Document Datasets

Reuters-21578 has 135 subjects and 21578 articles; each subject contains a different number of articles. We select the 10 subjects with the most articles and use the BOW package [4] for feature selection: 100 features.

Method     | # Groups | Expected Entropy | Purity
K-modes    | K = 18   | 2.30435514       | 51.8%
Coolcat    | K = 18   | 2.68157959       | 68.3%
Our Method | K = 18   | 0.78800815       | 66.9%

Page 35: A Hybrid Method to Categorical Clustering

Conclusion

In this work, we proposed a measure of similarity that:
- Can reuse the framework used in numerical data clustering.
- Has FEWER parameters and is easy to use.
- Tries to avoid being trapped in local minima.
- Still obtains the same or better clustering results than methods proposed for categorical data.

Page 36: A Hybrid Method to Categorical Clustering

Future Work

For the method itself: it still takes a long time to compute, because we need to fix the similarity function so that it satisfies the triangle inequality.

For applications: build a framework that clusters documents without other packages' help, and use our method to solve the "binding sets" problem in bioinformatics.

Page 37: A Hybrid Method to Categorical Clustering

Reference

[1] A.K. Jain, M.N. Murty, and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, Vol. 31, 1999.

[2] Yen-Jen Oyang, Chien-Yu Chen, Shien-Ching Huang, and Cheng-Fang Lin. Characteristics of a Hierarchical Data Clustering Algorithm Based on Gravity Theory. Technical Report of NTUCSIE 02-01.

[3] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Proc. of IEEE Intl. Conf. on Data Eng. (ICDE), 1999.

[4] McCallum, A. K. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/mccallum/bow.

[5] DIGITAL LIBRARIES: Metadata Resources. http://www.ifla.org/II/metadata.htm

[6] W.E. Wright. A formalization of cluster analysis and gravitational clustering. Doctoral Dissertation, Washington University, 1972.

[7] Newman, D.J., Hettich, S., Blake, C.L., & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

[8] David D. Lewis. Reuters-21578 text categorization test collection. AT&T Labs - Research, 1997.

[9] Ken Lang. 20 Newsgroups. http://people.csail.mit.edu/jrennie/20Newsgroups/

[10] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.

Page 38: A Hybrid Method to Categorical Clustering

Reference

[11] T. Li, S. Ma, and M. Ogihara. Entropy-based criterion in categorical clustering. Proc. of Intl. Conf. on Machine Learning (ICML), 2004.

[12] Bock, H.-H. Probabilistic aspects in cluster analysis. In O. Opitz (Ed.), Conceptual and numerical analysis of data, 12-44. Berlin: Springer-Verlag, 1989.

[13] Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data mining. Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.

[14] MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.

[15] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. A Wiley-Interscience Publication, New York, 1973.

[16] P. Andritsos, P. Tsaparas, R.J. Miller, and K.C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. Hellenic Database Symposium, 2003.

[17] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: clustering categorical data using summaries. Proc. of ACM SIGKDD Conference, 1999.

[18] D. Barbara, Y. Li, and J. Couto. Coolcat: an entropy-based algorithm for categorical clustering. Proc. of ACM Conf. on Information and Knowledge Mgt. (CIKM), 2002.

[19] R.C. Dubes. How many clusters are best? An experiment. Pattern Recognition, vol. 20, no. 6, pp. 645-663, 1987.

[20] Keke Chen and Ling Liu. "The 'Best K' for Entropy-based Categorical Clustering." Proc. of Scientific and Statistical Database Management (SSDBM05), Santa Barbara, CA, June 2005.

[21] 林正芳 and 歐陽彥正. "A Study of the Characteristics of a Gravity-Based Two-Phase Hierarchical Data Clustering Algorithm." Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University, 2002.

[24] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco, 2000.

Page 39: A Hybrid Method to Categorical Clustering

Appendix 1

K-modes extends the k-means algorithm for clustering large data sets with categorical values, using a "mode" to represent each group:

$$Mode:\ DQ_i = \{x_1, x_2, x_3, \dots, x_n\},\ x_j \in S(X_j),\ 1 \le i \le K$$

$$dist(d, DQ_i) = CountDiffAttributes(d, DQ_i)$$
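A sketch of these k-modes building blocks (our code; it assumes elements and modes are tuples of categorical values):

```python
from collections import Counter

def count_diff_attributes(element, mode):
    """dist(d, DQ): the number of attributes on which element and mode differ."""
    return sum(1 for a, m in zip(element, mode) if a != m)

def mode_of(group):
    """The mode of a group: per attribute, its most frequent value."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*group))
```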

Page 40: A Hybrid Method to Categorical Clustering

Appendix 2

ROCK uses the Jaccard coefficient to measure the distance between points $x_i, x_k$ on attribute j:

$$J_j(x_i, x_k) = \frac{a}{a + b + c} \quad \text{for attribute } j,$$

with the counts for attribute j:

A_j      | A_ij = 1 | A_ij = 0
A_ik = 1 | a        | b
A_ik = 0 | c        | d

Define $Neighbor(x_i, x_k) = true$ if $\sum_j J_j(x_i, x_k) \ge threshold\ \psi$; otherwise Neighbor(x_i, x_k) = false. Define Link(p, C_i) = the number of points q in cluster C_i with Neighbor(p, q) = 1. Best clusters (objective function):

$$\text{Maximize } \sum_{i=1}^{k} |C_i| \sum_{p \in C_i} \frac{Link(p, C_i)}{|C_i|^{1 + 2 f(\psi)}}$$

For a well-defined data set, ψ is 0.5 and f(ψ) = (1-ψ)/(1+ψ). The threshold ψ determines whether the result is good, and it is hard to decide. Relaxing the problem yields many clusters.
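A sketch of the neighbor and link computations over binary attribute vectors (ours; the per-attribute table above collapses to these counts when the attributes are 0/1):

```python
def jaccard(xi, xk):
    """a / (a + b + c): a = attributes set in both vectors,
    b, c = attributes set in exactly one of the two points."""
    a = sum(1 for u, v in zip(xi, xk) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(xi, xk) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(xi, xk) if u == 0 and v == 1)
    return a / (a + b + c) if (a + b + c) else 0.0

def neighbor(xi, xk, psi=0.5):
    """Neighbor(xi, xk) holds when the Jaccard similarity reaches psi."""
    return jaccard(xi, xk) >= psi

def link(p, cluster, psi=0.5):
    """Link(p, Ci): how many points q in the cluster are neighbors of p."""
    return sum(1 for q in cluster if neighbor(p, q, psi))
```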

Page 41: A Hybrid Method to Categorical Clustering

Appendix 3

CACTUS is an agglomerative algorithm that uses attribute values to group data. Definitions:

Support: $Pair(s_{ji}, s_{kt}) = \#\{ (S_j = s_{ji}) \wedge (S_k = s_{kt}) \}$

Strong connection: $S\_conn(s_{ij}, s_{jt}) = true$ if $Pair(s_{ij}, s_{jt}) > threshold\ \psi$.

From pairs to sets of attributes: a cluster is defined as a region of attribute values that are pairwise strongly connected, seen from the attributes' view. Find k regions of attributes. Setting the threshold ψ decides whether the result is good.
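A sketch of the support counting (ours; data is a list of records, each a tuple of attribute values):

```python
def pair_support(data, j, vj, k, vk):
    """Pair(s_ji, s_kt): the number of records where attribute j takes
    value vj and attribute k simultaneously takes value vk."""
    return sum(1 for row in data if row[j] == vj and row[k] == vk)

def strongly_connected(data, j, vj, k, vk, psi):
    """S_conn: the two attribute values are strongly connected when
    their co-occurrence count exceeds the threshold psi."""
    return pair_support(data, j, vj, k, vk) > psi
```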

Page 42: A Hybrid Method to Categorical Clustering

Appendix 4

COOLCAT [2002] uses entropy to measure the uncertainty of attribute i. For $S_i = \{s_{i1}, s_{i2}, s_{i3}, \dots, s_{it}\}$:

$$Entropy(S_i) = -\sum_{j=1}^{t} p(S_i = s_{ij}) \log p(S_i = s_{ij})$$

When $S_i$ is nearly deterministic, $Entropy(S_i) \to 0$: the uncertainty is small.

For a cluster $C_k$, the entropy of $C_k$ is the sum of the entropies of each attribute $S_i$ within $C_k$:

$$E(C_k) = \sum_{i=1}^{d} Entropy(S_i)$$

For a partition $P = \{C_1, C_2, \dots, C_k\}$, the expected entropy of P is

$$E(P) = \sum_{k} \frac{|C_k|}{m} E(C_k)$$

Page 43: A Hybrid Method to Categorical Clustering

Appendix 4 (cont’)

For a given partition P, P gives the best clusters when each attribute takes a unique value within each cluster $C_i$. COOLCAT's objective function is to minimize the expected entropy E(P).

At the beginning, COOLCAT picks k points from the data set S that are most unlike each other and assigns these points to the k clusters. For each point p in S - {assigned points}, it assigns p to the cluster i with the minimal increase in entropy.

Very easy to implement, but performance is strongly tied to the initial selection: the method is sequence sensitive and easily trapped in local minima, so the expected entropy obtained is not minimal.
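A sketch of COOLCAT's greedy placement phase as described above (ours; expected_entropy is a caller-supplied function implementing E(P)):

```python
def coolcat_assign(points, clusters, expected_entropy):
    """Place each remaining point into the cluster whose expected entropy
    grows the least when the point is added (greedy, order dependent)."""
    for p in points:
        best_k, best_inc = None, None
        base = expected_entropy(clusters)
        for k in range(len(clusters)):
            trial = clusters[:k] + [clusters[k] + [p]] + clusters[k + 1:]
            inc = expected_entropy(trial) - base
            if best_inc is None or inc < best_inc:
                best_k, best_inc = k, inc
        clusters[best_k].append(p)
    return clusters
```

Because the assignment is greedy over the input order, a different point sequence can land in a different local minimum, which is exactly the sensitivity noted above.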