IEEE Transactions on Knowledge and Data Engineering, TKDE (2009)
Intelligent Database Systems Lab
N.Y.U.S.T. I.M.
Information-theoretic distance measures for clustering validation:
Generalization and normalization
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TKDE (2009)
Presenter: Lin, Shu-Han
Authors: Ping Luo, Hui Xiong, Guoxing Zhan, Junjie Wu, and Zhongzhi Shi
Outline
Motivation, Objective, Methodology, Experiments, Conclusion, Comments
Motivation
External criteria for clustering validation: information-theoretic distance measures are used to compare the clustering output with the “true” partition.
Clustering ability of algorithms: compare different clustering algorithms on a given dataset.
Clustering difficulty of datasets: compare different datasets under a given algorithm.
     A    B    C
1   30    0    1
2    2   20    0
3    0    2   15
σ: the “true” partition
π: clustering output
Objectives
The dimension, size, and sparseness of the data, as well as the scales of the attributes, differ from dataset to dataset, so the ranges of the distance measures differ too.
To make a fair comparison: distance normalization.
A    B    C
120  120  120

A   B   C   D   E   F   G
12  23  30  24  5   90  20
Methodology – Conditional Entropy
The equality C1=C2 yields the Shannon entropy
π: group label
σ: class label
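The formula on this slide did not survive in the transcript. As a rough illustration only, the sketch below computes the standard Shannon conditional entropy from the contingency table shown on the Motivation slide. It assumes that rows 1-3 are the groups (π) and columns A-C are the classes (σ). It is the ordinary Shannon form, not necessarily the generalized form used in the paper, and the names `TABLE`, `conditional_entropy`, and `transpose` are illustrative.

```python
import math

# Contingency table from the Motivation slide.
# Assumption: rows are groups 1-3 (pi), columns are classes A-C (sigma).
TABLE = [
    [30, 0, 1],
    [2, 20, 0],
    [0, 2, 15],
]


def conditional_entropy(table):
    """Shannon conditional entropy H(columns | rows), in bits."""
    total = sum(sum(row) for row in table)
    h = 0.0
    for row in table:
        row_sum = sum(row)
        if row_sum == 0:
            continue  # an empty row contributes nothing
        for count in row:
            if count > 0:
                # accumulate -p(row, col) * log2 p(col | row)
                h -= (count / total) * math.log2(count / row_sum)
    return h


def transpose(table):
    """Swap rows and columns so the conditioning direction is reversed."""
    return [list(col) for col in zip(*table)]


h_sigma_given_pi = conditional_entropy(TABLE)             # H(sigma | pi)
h_pi_given_sigma = conditional_entropy(transpose(TABLE))  # H(pi | sigma)
print(h_sigma_given_pi, h_pi_given_sigma)
```

Both values are zero exactly when every group contains a single class and every class falls into a single group, which is the case of a perfect clustering.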
Methodology – Quasi-Distance
Minimum reachable: d(π, σ) reaches its minimum over both π and σ iff π = σ
Symmetry: d(π, σ) = d(σ, π)
Triangle law: d(π, σ) + d(σ, τ) ≥ d(π, τ)
     A    B    C
1   30    0    1
2    2   20    0
3    0    2   15
σ: the “true” partition
π: clustering output
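The three properties can be checked numerically on small label vectors. The sketch below assumes the Shannon instance d(π, σ) = H(π|σ) + H(σ|π), which is the well-known variation of information. The transcript does not show the paper's exact definition, so this choice is an assumption, and the label vectors are made up for illustration.

```python
import math
from collections import Counter


def cond_entropy(x, y):
    """Shannon conditional entropy H(x | y), in bits, for two label sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    marg_y = Counter(y)
    return -sum(c / n * math.log2(c / marg_y[b]) for (_, b), c in joint.items())


def distance(x, y):
    """Assumed Shannon quasi-distance: H(x | y) + H(y | x)."""
    return cond_entropy(x, y) + cond_entropy(y, x)


# Three hypothetical partitions of the same eight objects.
pi = [0, 0, 0, 1, 1, 2, 2, 2]
sigma = [0, 0, 1, 1, 1, 2, 2, 0]
tau = [0, 1, 0, 1, 0, 1, 0, 1]

# Minimum reachable: the distance is zero when both partitions coincide,
# even if the group names differ.
pi_renamed = [2, 2, 2, 0, 0, 1, 1, 1]
print(distance(pi, pi_renamed) == 0)

# Symmetry.
print(math.isclose(distance(pi, sigma), distance(sigma, pi)))

# Triangle law (small tolerance for floating-point rounding).
print(distance(pi, sigma) + distance(sigma, tau) >= distance(pi, tau) - 1e-12)
```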
Methodology – Normalization Issue
A    B    C
120  120  120

A   B   C   D   E   F   G
12  23  30  24  5   90  20
How to get it?
Methodology – Computation of
Generate a π0 ∈ PART(A) such that
σ: n
The worst result of π (m groups)
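The construction of the worst-case partition is not recoverable from the transcript. To illustrate what normalizing by the maximum means, the sketch below finds the maximum by brute force on a tiny example: it enumerates every labeling of the objects with exactly m groups and keeps the one farthest from σ. This is not the paper's method. It assumes the Shannon distance H(π|σ) + H(σ|π), and `brute_force_max` and the example labels are hypothetical.

```python
import itertools
import math
from collections import Counter


def cond_entropy(x, y):
    """Shannon conditional entropy H(x | y), in bits, for two label sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    marg_y = Counter(y)
    return -sum(c / n * math.log2(c / marg_y[b]) for (_, b), c in joint.items())


def distance(x, y):
    """Assumed Shannon quasi-distance: H(x | y) + H(y | x)."""
    return cond_entropy(x, y) + cond_entropy(y, x)


def brute_force_max(sigma, m):
    """Largest distance to sigma over all labelings with exactly m groups.

    Exhaustive search over m ** len(sigma) labelings, so only usable for
    very small inputs.
    """
    best, best_pi = -1.0, None
    for pi in itertools.product(range(m), repeat=len(sigma)):
        if len(set(pi)) != m:
            continue  # keep only labelings that really use m groups
        d = distance(pi, sigma)
        if d > best:
            best, best_pi = d, pi
    return best, best_pi


sigma = [0, 0, 0, 1, 1, 1]           # hypothetical "true" partition
d_max, pi_worst = brute_force_max(sigma, m=3)

pi = [0, 0, 1, 1, 2, 2]              # hypothetical clustering output, 3 groups
normalized = distance(pi, sigma) / d_max
print(d_max, pi_worst, normalized)
```

Dividing by `d_max` maps the distance into [0, 1] for a given σ and m, which is what allows results on datasets with different class structures to be compared.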
Methodology – Computation of
There is a difference between and
Experiments
Shannon Entropy
Pal Entropy
Gini Index
Goodman-Kruskal
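Two of the four measures listed above have widely used textbook forms, sketched below: Shannon entropy and the Gini index of a probability vector, plus a row-weighted conditional version over a contingency table. The Pal entropy and Goodman-Kruskal forms are left out because the transcript does not give their formulas. The function names and the reuse of the contingency table from the Motivation slide (rows taken as groups) are assumptions.

```python
import math


def shannon(p):
    """Shannon entropy of a probability vector, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def gini(p):
    """Gini index of a probability vector: 1 - sum of squared probabilities."""
    return 1.0 - sum(q * q for q in p)


def conditional(impurity, table):
    """Average impurity of the column distribution within each row,
    weighted by the share of objects in that row."""
    total = sum(sum(row) for row in table)
    result = 0.0
    for row in table:
        row_sum = sum(row)
        if row_sum:
            result += (row_sum / total) * impurity([c / row_sum for c in row])
    return result


# Contingency table from the Motivation slide (rows assumed to be groups).
TABLE = [
    [30, 0, 1],
    [2, 20, 0],
    [0, 2, 15],
]

print(conditional(shannon, TABLE), conditional(gini, TABLE))
```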
Conclusions
Quasi-distance: external measure for clustering validation
  Symmetry
  Triangle law
  Minimum reachable
Normalization: maximum value of a distance measure
  Compare clustering performances of an algorithm on different datasets
  The normalized distance measures outperform the original distance measures
  Normalized Shannon distance has the best performance among the 4 observed distance measures
Comments
Advantage: the idea is intuitive and is supported by theoretical analysis.
Drawback: the authors should describe why they think quasi-distance is better than DCV.
Application: the same uses as DCV?