Presenter: Cheng-Han Tsai. Authors: Liang Bai, Jiye Liang, Chuangyin Dang. KBS, 2011

Post on 01-Jan-2016

DESCRIPTION

An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data. Presenter: Cheng-Han Tsai. Authors: Liang Bai, Jiye Liang, Chuangyin Dang. KBS, 2011. Outline: Motivation, Objectives, Methodology, Experiments.

TRANSCRIPT

Intelligent Database Systems Lab

國立雲林科技大學 (National Yunlin University of Science and Technology)

An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data

Presenter : Cheng-Han Tsai  Authors : Liang Bai, Jiye Liang, Chuangyin Dang

KBS, 2011

N.Y.U.S.T.

I. M.

Outline

· Motivation

· Objectives

· Methodology

· Experiments

· Conclusions

· Comments

Motivation

· The k-modes algorithm is sensitive to its initial cluster centers and requires the number of clusters to be given in advance.

· There is no guarantee that the number of clusters we select is the best one.

Objectives

• To propose an initialization method to find initial cluster centers and the number of clusters.

• The method should handle large categorical data sets efficiently, with linear time complexity.

Methodology

[Flowchart of the method, with steps numbered 1 to 7 in the original figure. Boxes: Data set; Construct a potential exemplars set S; Set the estimated number of clusters; K-modes-type algorithm; The clustering result]
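The k-modes-type stage of the flowchart can be illustrated with a minimal sketch of the standard k-modes loop: assign each object to the nearest mode, then recompute each mode attribute by attribute. This is not the authors' implementation and does not include their initialization method. The names `mismatches`, `k_modes`, `data`, and `initial_modes` are hypothetical, and the starting centers are simply passed in by hand.

```python
from collections import Counter


def mismatches(x, y):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(a != b for a, b in zip(x, y))


def k_modes(data, initial_modes, max_iter=100):
    """Standard k-modes loop (illustrative; not the paper's method).

    data: list of equal-length tuples of categorical values.
    initial_modes: list of tuples used as the starting cluster centers.
    Returns (modes, labels).
    """
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatches(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # assignments did not change, so the modes are stable
        labels = new_labels
        # Update step: per attribute, take the most frequent category.
        for j in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return modes, labels


# Toy data, made up for illustration only.
data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "z")]
modes, labels = k_modes(data, initial_modes=[("a", "x"), ("b", "z")])
print(modes)
print(labels)
```

Because the result depends on `initial_modes` and on how many of them are supplied, the sketch also shows why both have to be chosen well before the loop starts.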

Methodology

· The k-modes algorithm

· Hamming distance: the number of positions at which two codes differ (computed using XOR). Example:

    10001001
XOR 10110001
------------
    00111000 → Hamming distance = 3
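The XOR example can be checked with a short sketch. The first function reproduces the bit-string calculation. The second applies the same idea to categorical objects by counting mismatched attributes, which is assumed here to be the dissimilarity k-modes works with. Both function names and the sample attribute values are hypothetical.

```python
def hamming_bits(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings, via XOR."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    # XOR leaves a 1 exactly where the two codes differ; count those 1s.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def simple_matching(x, y) -> int:
    """Number of attributes on which two categorical objects disagree."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(xi != yi for xi, yi in zip(x, y))


print(hamming_bits("10001001", "10110001"))  # 3, as in the slide example
print(simple_matching(("red", "round", "small"), ("red", "oval", "large")))  # 2
```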

Methodology

· New cluster centers initialization method

· Finding the number of clusters

Methodology

· New cluster centers initialization method.

Methodology

· Finding the number of clusters

─ We need to input a value k’, which is an estimated number of clusters

─ If k’ can’t be determined, we set k’ = |S|

Methodology

· More than one knee point of the function P(k)

· More than one peak of the function C(k)

Experiments

· Performance analysis

─ Soybean data (4 diseases)

─ Lung cancer data (3 classes)

─ Zoo data (7 classes, made up of 3 big clusters and 4 small clusters)

─ Mushroom data (2 classes)

· Scalability analysis

Experiments

· Performance analysis

Experiments

· Scalability analysis

─ 67557 data points and 42 categorical attributes

Conclusions

· The proposed method is effective and efficient for obtaining good initial cluster centers and the number of clusters

· Analysis shows that the time complexity of the method is linear

Comments

· Advantages

─ Improves on the old way of setting the two parameters (the initial cluster centers and the number of clusters)

· Applications

─ Data clustering
