A k-mean clustering algorithm for mixed numeric and categorical data

Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. A k-mean clustering algorithm for mixed numeric and categorical data. Presenter: Shao-Wei Cheng. Authors: Amir Ahmad, Lipika Dey. DKE 2007.


Page 1: A k-mean clustering algorithm for mixed numeric and categorical data

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology

A k-mean clustering algorithm for mixed numeric and categorical data

Presenter: Shao-Wei Cheng

Authors: Amir Ahmad, Lipika Dey

DKE 2007

Page 2: A k-mean clustering algorithm for mixed numeric and categorical data


Outline

Motivation

Objective

Methodology

Experiments

Conclusion

Comments

Page 3: A k-mean clustering algorithm for mixed numeric and categorical data

Motivation

The traditional k-mean algorithm is limited to numeric data.

Huang's algorithm tried to cluster mixed numeric and categorical data:

The cluster center is represented by the mode of the cluster.

The distance between two categorical attribute values is binary: 0 if they are equal, 1 otherwise.

The significance (weight) of each numeric attribute is taken to be 1, and γj, the weight of the categorical part of the cost, is a user-defined parameter.
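The points above can be illustrated with a minimal sketch of a Huang-style distance between a data element and a cluster center. It assumes squared Euclidean distance on the numeric part and a count of mismatches on the categorical part; the function name, argument layout, and sample values are hypothetical and not taken from the slides.

```python
def huang_distance(x_num, x_cat, center_num, center_cat, gamma):
    """Huang-style mixed distance: squared Euclidean + gamma * mismatches."""
    # Numeric part: every numeric attribute has weight 1.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, center_num))
    # Categorical part: binary distance, 0 if the values match, 1 otherwise.
    categorical_part = sum(a != b for a, b in zip(x_cat, center_cat))
    # gamma is the user-defined weight of the categorical part.
    return numeric_part + gamma * categorical_part


# Hypothetical element and center (the center's categorical values are modes).
d = huang_distance([1.0, 2.0], ["red", "small"],
                   [0.0, 2.0], ["red", "large"], gamma=0.5)
print(d)  # 1.0 + 0.5 * 1 = 1.5
```

Because every mismatch costs the same and gamma is chosen by hand, two categorical values can never be "slightly" different, which is the limitation the presented paper targets.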

Page 4: A k-mean clustering algorithm for mixed numeric and categorical data


Objectives

This paper attempts to alleviate the shortcomings of Huang's algorithm:

Propose a new representation for the cluster center.

Compute the distance between two categorical values from the overall distribution of the categorical attribute.

Define the weight parameter by the contribution of a categorical attribute, rather than leaving it to the user.
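The slides do not spell out the new center representation. As a hedged sketch, assume the center keeps the mean of each numeric attribute and, for each categorical attribute, the relative frequency of every value in the cluster instead of only the mode. The function name, the dictionary layout, and the sample records are hypothetical.

```python
from collections import Counter


def mixed_center(members, numeric_keys, categorical_keys):
    """Center of a cluster of dict records: means plus value distributions."""
    n = len(members)
    # Numeric attributes: the usual k-means mean.
    center = {k: sum(m[k] for m in members) / n for k in numeric_keys}
    # Categorical attributes: proportion of each value (assumed representation).
    for k in categorical_keys:
        counts = Counter(m[k] for m in members)
        center[k] = {value: c / n for value, c in counts.items()}
    return center


# Hypothetical cluster with a tie that a mode-based center would hide.
members = [
    {"age": 30, "color": "red"},
    {"age": 40, "color": "red"},
    {"age": 50, "color": "blue"},
    {"age": 40, "color": "blue"},
]
print(mixed_center(members, ["age"], ["color"]))
```

With a mode-based center, the tie between "red" and "blue" would be resolved arbitrarily; a distribution keeps both.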

Page 5: A k-mean clustering algorithm for mixed numeric and categorical data


Methodology

Cost function

Huang's cost function

The proposed cost function

What is the distance between De Niro and Stewart?
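The formulas and the example table on this slide did not survive extraction, so the following is only a sketch of one distribution-based distance, not a transcription of the slide. It assumes that two values of an attribute are far apart when the other attributes are distributed differently among the records containing each value, and it measures this as the sum over values u of max(P(u|x), P(u|y)), minus 1, averaged over the other attributes. The function names and the toy table (including the directors and genres) are hypothetical.

```python
from collections import Counter


def cond_dist(rows, attr, value, other):
    """Distribution of `other` among rows where `attr` equals `value`."""
    counts = Counter(r[other] for r in rows if r[attr] == value)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def categorical_distance(rows, attr, x, y):
    """Average over the other attributes of sum_u max(P(u|x), P(u|y)) - 1."""
    others = [a for a in rows[0] if a != attr]
    total = 0.0
    for other in others:
        px = cond_dist(rows, attr, x, other)
        py = cond_dist(rows, attr, y, other)
        values = set(px) | set(py)
        total += sum(max(px.get(v, 0.0), py.get(v, 0.0)) for v in values) - 1.0
    return total / len(others)


# Hypothetical toy table; not the table from the original slide.
rows = [
    {"actor": "De Niro", "genre": "Crime", "director": "Scorsese"},
    {"actor": "De Niro", "genre": "Crime", "director": "Coppola"},
    {"actor": "Stewart", "genre": "Thriller", "director": "Hitchcock"},
    {"actor": "Stewart", "genre": "Thriller", "director": "Hitchcock"},
    {"actor": "Grant", "genre": "Thriller", "director": "Hitchcock"},
]
print(categorical_distance(rows, "actor", "De Niro", "Stewart"))  # 1.0
print(categorical_distance(rows, "actor", "Stewart", "Grant"))    # 0.0
```

In this toy table De Niro and Stewart never share a genre or director, so their distance is the maximum, 1.0, while Stewart and Grant co-occur with exactly the same values and get distance 0.0. A binary distance would rate both pairs as equally different.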

Page 6: A k-mean clustering algorithm for mixed numeric and categorical data


Methodology

Page 7: A k-mean clustering algorithm for mixed numeric and categorical data


Methodology

Significance of numeric attribute

To compute their significance, the numeric attributes first need to be discretized. Equal-width discretization is used.
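Equal-width discretization splits the range of an attribute into intervals of the same length. A minimal sketch follows; the function name and the number of bins are assumptions, since the slide does not state how many intervals are used.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute falls entirely into one bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


print(equal_width_bins([1.0, 2.0, 3.0, 4.0, 10.0], 3))  # [0, 0, 0, 1, 2]
```

Once discretized, a numeric attribute can be treated like a categorical one for the purpose of estimating its significance.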

Page 8: A k-mean clustering algorithm for mixed numeric and categorical data


Methodology

Algorithm

① Initialization.

② Compute the cluster centers.

③ Assign each data element to the cluster whose center is closest to it.

④ Repeat ② and ③ until the clusters do not change or a fixed number of iterations is reached.
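The four steps can be sketched as a generic loop with pluggable center and distance functions. This is an illustration under assumptions, not the authors' implementation: the function name and parameters are hypothetical, and the round-robin initialization is chosen only to keep the example reproducible, since the slide does not say how initialization is done.

```python
def cluster(data, k, compute_center, distance, max_iter=100):
    """k-means-style loop; returns one cluster label per data element."""
    # Step 1: initialization (assumed: round-robin assignment to k clusters).
    labels = [i % k for i in range(len(data))]
    for _ in range(max_iter):
        # Step 2: compute the center of every non-empty cluster.
        centers = {}
        for c in range(k):
            members = [x for x, label in zip(data, labels) if label == c]
            if members:
                centers[c] = compute_center(members)
        # Step 3: assign each element to the cluster with the closest center.
        new_labels = [min(centers, key=lambda c: distance(x, centers[c]))
                      for x in data]
        # Step 4: stop when the clusters no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Numeric-only usage with the mean as center and squared difference as distance.
mean = lambda xs: sum(xs) / len(xs)
sq_diff = lambda a, b: (a - b) ** 2
print(cluster([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], 2, mean, sq_diff))
# [0, 0, 0, 1, 1, 1]
```

Passing the center and distance as functions keeps the loop unchanged when a mixed-data center representation and distance are substituted for the numeric ones.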

Page 9: A k-mean clustering algorithm for mixed numeric and categorical data


Experiments

Evaluation method

Data sets

Iris: all numeric attributes

Vote: all categorical attributes

Heart disease data: mixed data set

Australian credit data: mixed data set

Page 10: A k-mean clustering algorithm for mixed numeric and categorical data

Experiments

Page 11: A k-mean clustering algorithm for mixed numeric and categorical data

Conclusion

This paper introduced a new distance measure for categorical attribute values and proposed a modified k-mean algorithm for clustering mixed data sets.

The results obtained with this algorithm over a number of real-world data sets are highly encouraging.

Future work

Other methods for discretizing numeric-valued attributes.

Other implementations of the k-mean algorithm.

Page 12: A k-mean clustering algorithm for mixed numeric and categorical data


Comments

Advantage: The view of overall attributes is good.

Drawback: …

Application: Clustering of mixed data sets.