Information-theoretic distance measures for clustering validation: Generalization and normalization

IEEE Transactions on Knowledge and Data Engineering (TKDE), 2009

Authors: Ping Luo, Hui Xiong, Guoxing Zhan, Junjie Wu, and Zhongzhi Shi
Presenter: Lin, Shu-Han
Intelligent Database Systems Lab, N.Y.U.S.T. I.M.

Post on 23-Feb-2016


Page 1: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TKDE (2009)

Information-theoretic distance measures for clustering validation: Generalization and normalization

Presenter: Lin, Shu-Han
Authors: Ping Luo, Hui Xiong, Guoxing Zhan, Junjie Wu, and Zhongzhi Shi

Page 2: Outline

- Motivation
- Objective
- Methodology
- Experiments
- Conclusion
- Comments

Page 3: Motivation

- External criteria for clustering validation: information-theoretic distance measures are used to compare the clustering output with the “true” partition.
- Clustering ability of algorithms: compare different clustering algorithms on a given dataset.
- Clustering difficulty of datasets: compare different datasets under a given algorithm.

      A    B    C
1    30    0    1
2     2   20    0
3     0    2   15

σ: the “true” partition
π: clustering output

Page 4: Objectives

- The dimension, size, and sparseness of the data and the scales of the attributes differ from dataset to dataset.
- As a result, the range of a distance measure also differs from dataset to dataset.
- To make a fair comparison: distance normalization.

A     B     C
120   120   120

A    B    C    D    E    F    G
12   23   30   24   5    90   20
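The scale problem can be illustrated with a minimal Python sketch. It assumes, as a reading of the slide rather than something the slide states, that the two rows of numbers are class sizes of two datasets. When the clustering output puts every object into a single group, the Shannon conditional entropy H(σ|π) equals H(σ), so the entropy of the class distribution shows how far the raw measure can move on each dataset. The function name `shannon_entropy` is hypothetical.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution given by non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Assumption: the numbers on the slide are class sizes of two datasets.
balanced = [120, 120, 120]
skewed = [12, 23, 30, 24, 5, 90, 20]

h_balanced = shannon_entropy(balanced)
h_skewed = shannon_entropy(skewed)

# The upper bound log2(k) depends on the number of classes k, so the raw
# values for the two datasets do not share a common scale.
print(h_balanced, math.log2(len(balanced)))
print(h_skewed, math.log2(len(skewed)))
```

Because the two values sit on different scales, comparing raw distances across the two datasets would favor neither fairly, which is the reason the slide asks for normalization.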

Page 5: Methodology – Conditional Entropy

The equality C1=C2 yields the Shannon entropy

π: group label
σ: class label
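For the Shannon case only, the conditional entropy can be sketched in Python from a contingency table. The sketch uses the table from the Motivation slide and assumes that rows 1 to 3 are the groups of π and columns A to C are the classes of σ; the slide does not say which axis is which. The function name `conditional_entropy` is hypothetical, and the generalized form from the paper is not reproduced here.

```python
import math

def conditional_entropy(table):
    """Shannon conditional entropy H(columns | rows) in bits.

    table[i][j] is the number of objects in row group i and column group j.
    """
    n = sum(sum(row) for row in table)
    h = 0.0
    for row in table:
        row_total = sum(row)
        for count in row:
            if count > 0:
                # Accumulate -p(i, j) * log2 p(j | i)
                h -= (count / n) * math.log2(count / row_total)
    return h

# Contingency table from the Motivation slide.
# Assumption: rows are groups of pi, columns are classes of sigma.
table = [
    [30, 0, 1],
    [2, 20, 0],
    [0, 2, 15],
]

h_sigma_given_pi = conditional_entropy(table)
transposed = [list(col) for col in zip(*table)]
h_pi_given_sigma = conditional_entropy(transposed)
print(h_sigma_given_pi, h_pi_given_sigma)

# A clustering that matches the classes exactly has zero conditional entropy.
perfect = [[30, 0, 0], [0, 20, 0], [0, 0, 15]]
print(conditional_entropy(perfect))
```

Transposing the table swaps the roles of π and σ, so the same function gives both H(σ|π) and H(π|σ).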

Page 6: Methodology – Quasi-Distance

- Minimum reachable: d(π, σ) reaches its minimum over both π and σ iff π = σ
- Symmetry: d(π, σ) = d(σ, π)
- Triangle law: d(π, σ) + d(σ, τ) ≥ d(π, τ)

      A    B    C
1    30    0    1
2     2   20    0
3     0    2   15

σ: the “true” partition
π: clustering output
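The three properties can be checked numerically on a small example. The sketch below assumes the distance d(π, σ) = H(π|σ) + H(σ|π) with Shannon entropy (the variation of information), chosen here for illustration because it is known to satisfy all three; the label vectors `pi`, `sigma`, and `tau` are made up and the function names are hypothetical.

```python
import math
from collections import Counter

def cond_entropy(a, b):
    """Shannon conditional entropy H(a | b) in bits for two label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log2(c / b_counts[y]) for (_, y), c in joint.items())

def distance(a, b):
    """Assumed distance: d(a, b) = H(a | b) + H(b | a)."""
    return cond_entropy(a, b) + cond_entropy(b, a)

# Made-up partitions of eight objects, given as label vectors.
pi = [0, 0, 0, 1, 1, 2, 2, 2]
sigma = [0, 0, 1, 1, 1, 2, 2, 0]
tau = [0, 1, 0, 1, 0, 1, 0, 1]

# Minimum reachable: the distance is zero for identical partitions.
print(abs(distance(pi, pi)) < 1e-12)
# Symmetry.
print(distance(pi, sigma) == distance(sigma, pi))
# Triangle law (small tolerance for floating-point rounding).
print(distance(pi, sigma) + distance(sigma, tau) >= distance(pi, tau) - 1e-12)
```

A check on three partitions is only a sanity test, not a proof; it shows what each property means in terms of computed values.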

Page 7: Methodology – Normalization Issue

A     B     C
120   120   120

A    B    C    D    E    F    G
12   23   30   24   5    90   20

How to get it?

Page 8: Methodology – Computation of

Generate a π0 ∈ PART(A) such that

σ: n

The worst result of π (m groups)

Page 9: Methodology – Computation of

There is a difference between and

Page 10: Experiments

- Shannon Entropy
- Pal Entropy
- Gini Index
- Goodman-Kruskal

Page 11: Experiments

Page 12: Conclusions

Quasi-distance: external measure for clustering validation
- Symmetry
- Triangle law
- Minimum reachable

Normalization: maximum value of a distance measure
- Compare clustering performances of an algorithm on different datasets
- The normalized distance measures outperform the original distance measure
- Normalized Shannon distance has the best performance among the 4 observed distance measures

Page 13: Comments

Advantage
- Idea is intuitive
- Theoretical analysis

Drawback
- Describe why they think quasi-distance is better than DCV.

Application
- The same use of DCV?