A Fast and Flexible Clustering Algorithm Using Binary Discretization
DESCRIPTION
A Fast and Flexible Clustering Algorithm Using Binary Discretization, IEEE ICDM 2011, Vancouver, Canada, Dec. 11-14, 2011
TRANSCRIPT
IEEE ICDM 2011, December 11-14, Vancouver, Canada
A Fast and Flexible Clustering Algorithm Using Binary Discretization
Mahito SUGIYAMA†,‡, Akihiro YAMAMOTO†
†Kyoto University  ‡JSPS Research Fellow
Main Results
1. BOOL (Binary cOding Oriented cLustering)
   • A clustering algorithm for multivariate data to detect arbitrarily shaped clusters
   • Noise tolerant
   • Robust to changes in input parameters
2. BOOL is faster than K-means and the fastest among the algorithms that detect arbitrarily shaped clusters
   • It is about two to three orders of magnitude faster than two state-of-the-art algorithms [Chaoji et al., SDM 2011; Chaoji et al., KAIS 2009] that can detect non-convex clusters of arbitrary shapes
Evaluation with Typical Benchmarks
• We used four synthetic databases (DS1-DS4)
  – From the CLUTO website
    ∘ http://glaros.dtc.umn.edu/gkhome/views/cluto/
• Typical benchmarks for (spatial) clustering
  – CHAMELEON [Karypis et al., 1999], SPARCL [Chaoji et al., 2009], ABACUS [Chaoji et al., 2011], and so on

(Figure: scatter plots of the four datasets DS1, DS2, DS3, and DS4)
Results for Benchmarks
• Running time (in seconds)
  – In the table, n denotes the number of data points
  – Results for ABACUS and SPARCL are from [Chaoji et al., 2011]
    ∘ Their source codes are not publicly available

(Table: n and the running times of BOOL, DBSCAN, ABACUS, and SPARCL (non-convex) and K-means (convex) on DS1-DS4)
Results (Original Data)

(Figure: scatter plots of the original data for DS1, DS2, DS3, and DS4)
Results (BOOL)

(Figure: BOOL's clustering results on DS1, DS2, DS3, and DS4)
Results (K-means)

(Figure: K-means clustering results on DS1, DS2, DS3, and DS4)
Evaluation with Natural Images
• We used four natural images
  – From the Berkeley segmentation database and benchmark
    ∘ http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
  – The RGB (red-green-blue) values for each pixel were obtained by pre-processing, same as in [Chaoji et al., 2011]

(Images: Horse, Mushroom, Pyramid, Road)
Results for Natural Images
• Running time (in seconds)
  – In the table, n denotes the number of data points
  – Results for ABACUS and SPARCL are from [Chaoji et al., 2011]

(Table: n and the running times of BOOL, ABACUS, and SPARCL (non-convex) and K-means (convex) on Horse, Mushroom, Pyramid, and Road; the SPARCL entries for Mushroom, Pyramid, and Road are marked –)
Results (Original Data)

(Images: Horse, Mushroom, Pyramid, Road)
Results (BOOL)

(Images: Horse, Mushroom, Pyramid, Road)
Results (K-means)

(Images: Horse, Mushroom, Pyramid, Road)
Results for UCI Data
• Running time (in seconds)
(Table: n, d, K, and the running times of BOOL, K-means, and DBSCAN on ecoli, sonar, shuttle, wdbc, wine, and wine quality; the DBSCAN entry for shuttle is marked –)
Results for UCI Data
• Adjusted Rand index
(Table: n, d, K, and the adjusted Rand index of BOOL, K-means, and DBSCAN on ecoli, sonar, shuttle, wdbc, wine, and wine quality; the DBSCAN entry for shuttle is marked –)
Motivation
• The K-means algorithm is widely used
  – Simple and efficient
  – The main drawback is its limited clustering ability: non-spherical clusters cannot be found
• Many shape-based algorithms have been proposed
  – Can find clusters of arbitrary shape
  – Drawbacks:
    1. Not scalable (time complexity is quadratic or cubic)
    2. Not robust to input parameters
Key Strategy
• Requirements:
  1. Fast (linear in data size)
  2. Robust
  3. Find arbitrarily shaped clusters
• Key idea: Translate input data into binary words using binary discretization
  – A hierarchy of clusters is naturally produced by increasing the accuracy of the discretization
• Each cluster is constructed by agglomerating adjacent smaller clusters
  – Performed in linear time by sorting binary representations
Clustering Process
(Table: an example database of five points a, b, c, d, e with coordinates x and y)
Discretize Database at Level 1

(Figure: the five points on a 2 × 2 grid; discretization level 1)

id x y
a  0 0
b  1 1
c  1 1
d  0 1
e  1 0
Agglomerate Adjacent Clusters

(Figure: occupied adjacent cells merged into clusters at level 1)

id x y
a  0 0
b  1 1
c  1 1
d  0 1
e  1 0
Go to Next Level

(Table: the five points a-e with their original coordinates x and y)
Discretize Database at Level 2

(Figure: the five points on a 4 × 4 grid; discretization level 2)

id x y
a  0 1
b  2 2
c  2 3
d  1 3
e  2 1
Agglomerate Adjacent Clusters

(Figure: occupied adjacent cells merged into clusters at level 2)

id x y
a  0 1
b  2 2
c  2 3
d  1 3
e  2 1
Construct Clusters with Sorting

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
b  2 2
c  2 3
d  1 3
e  2 1
Sort by x-axis → Sort by y-axis

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
e  2 1
b  2 2
d  1 3
c  2 3
Compare Each Point to the Next Point

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
e  2 1
b  2 2
d  1 3
c  2 3
Sort by x-axis

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
d  1 3
e  2 1
b  2 2
c  2 3
Compare Each Point to the Next Point

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
d  1 3
e  2 1
b  2 2
c  2 3
Finish (... or go to the next level)

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
d  1 3
e  2 1
b  2 2
c  2 3
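The sorting steps of this walkthrough can be reproduced with stable sorts (a small Python sketch using the level-2 coordinates of points a-e from the example above; since Python's `sorted` is stable, sorting by one axis at a time yields exactly the orders shown on the slides):

```python
# Reproduce the walkthrough: stable sorts bring grid-adjacent cells next to
# each other, so candidate merges are found by scanning successive pairs.
pts = {"a": (0, 1), "b": (2, 2), "c": (2, 3), "d": (1, 3), "e": (2, 1)}

by_x = sorted(pts, key=lambda p: pts[p][0])     # sort by x-axis
by_xy = sorted(by_x, key=lambda p: pts[p][1])   # then a stable sort by y-axis
print(by_xy)  # ['a', 'e', 'b', 'd', 'c']

back_by_x = sorted(by_xy, key=lambda p: pts[p][0])  # sort by x-axis again
print(back_by_x)  # ['a', 'd', 'e', 'b', 'c']
```

Each point is then compared only with its successor in the sorted order, which is what makes the agglomeration step linear after sorting.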
Distance Parameter l
(Figure: cell neighborhoods for l = 1, l = 2, and l = 3)
Noise Filtering by BOOL
• Noise filtering is important in clustering
• Define 𝒞⩾N ≔ {C ∈ 𝒞 ∣ #C ⩾ N} for a partition 𝒞
  – A cluster C is treated as noise if #C < N
• Example: Given 𝒞 = {{0.1}, {0.4, 0.5, 0.6}, {0.9}}
  – 𝒞⩾2 = {{0.4, 0.5, 0.6}}, and 0.1 and 0.9 are noise
• We input the number N (the noise parameter) as the lower bound on the size of a cluster in BOOL
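The filter 𝒞⩾N can be written in a few lines (a minimal sketch; `filter_noise` is a hypothetical helper name, not part of BOOL's implementation):

```python
def filter_noise(partition, N):
    """Return (kept_clusters, noise_points): keep clusters of size >= N,
    and report the members of smaller clusters as noise."""
    kept = [C for C in partition if len(C) >= N]
    noise = [x for C in partition if len(C) < N for x in C]
    return kept, noise

clusters, noise = filter_noise([[0.1], [0.4, 0.5, 0.6], [0.9]], N=2)
# clusters == [[0.4, 0.5, 0.6]], noise == [0.1, 0.9]
```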
Conclusion
• We have developed a clustering algorithm, BOOL
  – Fastest w.r.t. finding arbitrarily shaped clusters
  – Robust, highly scalable, and noise tolerant
• The key to performance is discretization based on the binary encoding and sorting of data points
• Future work: BOOL is simple and can easily be extended to other machine learning tasks
  – e.g., anomaly detection, semi-supervised learning
Appendix
Other Experiments
• We evaluate BOOL experimentally to determine its scalability and effectiveness for various types of databases
  – Randomly generated synthetic databases (d = 2)
  – Famous synthetic databases for spatial clustering (d = 2)
  – Real databases generated from natural images (d = 3)
  – Real databases from the UCI repository (d = 7, 9, 30, 60)
• We used Mac OS X (Quad-Core Intel Xeon CPUs)
  – BOOL was implemented in C and compiled with gcc
  – All experiments were performed in R
• All databases are pre-processed by min-max normalization in BOOL
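Min-max normalization rescales each attribute into [0, 1] independently, which is what makes the domain assumption of BOOL hold. A minimal sketch (`min_max_normalize` is our helper name; it assumes no attribute is constant, since a constant column would divide by zero):

```python
def min_max_normalize(X):
    """Scale each attribute of X (a list of equal-length tuples) into [0, 1]."""
    d = len(X[0])
    lo = [min(x[i] for x in X) for i in range(d)]
    hi = [max(x[i] for x in X) for i in range(d)]
    return [
        tuple((x[i] - lo[i]) / (hi[i] - lo[i]) for i in range(d))
        for x in X
    ]

print(min_max_normalize([(0, 10), (5, 20), (10, 30)]))
# [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```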
Speed w.r.t. Data and Clusters
• Randomly generated synthetic databases
  – K = 5 (left), n = 10000 (right)

(Plots: running time (s) of BOOL, K-means, and DBSCAN vs. the number of data points n and vs. the number of clusters K)
Quality w.r.t. Data and Clusters
• Randomly generated synthetic databases
  – K = 5 (left), n = 10000 (right)

(Plots: adjusted Rand index vs. the number of data points n and vs. the number of clusters K)
Speed w.r.t. Input Parameters
• Randomly generated synthetic databases
  – K = 5 and n = 10000

(Plots: running time (s) vs. the distance parameter l and vs. the noise parameter N)
Quality w.r.t. Input Parameters
• Randomly generated synthetic databases
  – K = 5 and n = 10000

(Plots: adjusted Rand index vs. the distance parameter l and vs. the noise parameter N)
Notation
• We treat a target database X as a relation [Garcia-Molina et al.]
  – The columns are called attributes
  – The domain of each attribute is [0, 1]
  – Attributes are indexed by the natural numbers {1, 2, …, d}
    ∘ Each tuple corresponds to a data point in the d-dimensional Euclidean space ℝd
  – xi is the value of attribute i of a data point x
  – For M ⊆ {1, 2, …, d}, πM(X) denotes the database that has only the columns for attributes M of X
• Clustering is a partition of X into K subsets (clusters) C1, …, CK
  – Ci ≠ ∅ and Ci ∩ Cj = ∅ for i ≠ j
  – We call 𝒞 = {C1, …, CK} a partition of X
  – 𝒞(X) = {𝒞 ∣ 𝒞 is a partition of X}
Discretization
• For a data point x in a database X, discretization at level k is an operator Δk, where each value xi is mapped to a natural number m such that xi = 0 implies m = 0, and xi ≠ 0 implies

  m = 0        if 0 < xi ⩽ 2^(−k),
      1        if 2^(−k) < xi ⩽ 2·2^(−k),
      ⋮
      2^k − 1  if (2^k − 1)·2^(−k) < xi ⩽ 1.

  – We use the same operator Δk for discretization of a database X; i.e., each data point x in X is discretized to Δk(x) in the database Δk(X).
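For intuition, the operator Δk can be sketched as follows (a minimal illustration; the function name `discretize` is ours, and the bin index for xi > 0 is computed as ⌈xi·2^k⌉ − 1, which matches the piecewise definition above):

```python
import math

def discretize(x, k):
    """Delta_k: map each coordinate in [0, 1] to a bin in {0, ..., 2^k - 1}.

    Bin m covers the half-open interval (m * 2^-k, (m + 1) * 2^-k];
    following the definition, x_i = 0 is mapped to 0.
    """
    bins = 2 ** k
    return tuple(
        0 if v == 0 else math.ceil(v * bins) - 1  # ceil picks the covering bin
        for v in x
    )

print(discretize((0.1, 0.3), 2))  # (0, 1)
```

With k = 1 the unit square splits into a 2 × 2 grid, and with k = 2 into a 4 × 4 grid, as in the walkthrough.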
Reachability
• Given a distance parameter l ∈ ℕ, a data point x in X is reachable at level k from a data point y if there exists a chain of points z1, z2, …, zp (p ⩾ 2) such that z1 = x, zp = y, and

  d0(Δk(zi), Δk(zi+1)) ⩽ 1 and d∞(Δk(zi), Δk(zi+1)) ⩽ l

  for all i ∈ {1, 2, …, p − 1}
  – The distances between x and y are defined by

    d0(x, y) = Σ_{i=1}^{d} δ(xi, yi), where δ(xi, yi) = 0 if xi = yi and 1 if xi ≠ yi,

    d∞(x, y) = max_{i ∈ {1,…,d}} |xi − yi|

    in the L0 and L∞ metrics, respectively
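The two distances can be written down directly (a small illustration; `d0` and `dinf` are hypothetical helper names for the L0 and L∞ metrics above):

```python
def d0(x, y):
    """L0 'distance': the number of coordinates where x and y differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

def dinf(x, y):
    """L-infinity distance: the largest coordinate-wise difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# Two discretized points are directly linked when d0 <= 1 and dinf <= l:
# they differ in at most one coordinate, and by at most l bins.
print(d0((2, 1), (2, 2)), dinf((2, 1), (2, 2)))  # 1 1 -> linked for l = 1
print(d0((0, 1), (2, 1)), dinf((0, 1), (2, 1)))  # 1 2 -> not linked for l = 1
```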
Pseudo-Code (1/3)

Input: Database X,
       lower bound on the number of clusters K,
       noise parameter N, and
       distance parameter l
Output: Partition 𝒞

function BOOL(X, K, N)
  k ← 1  // k is the level of discretization
  repeat
    𝒞 ← MAKEHIERARCHY(X, k, N)
    k ← k + 1
  until #𝒞 ⩾ K
  output 𝒞
Pseudo-Code (2/3)

function MAKEHIERARCHY(X, k, N)
  𝒞 ← {{x} ∣ x ∈ X}
  h ← d  // d is the number of attributes of X
  XD ← Δk(X)  // discretize X at level k
  X ← Sπ1(XD) ∘ Sπ2(XD) ∘ … ∘ Sπd(XD)(X)
  repeat
    X ← Sπh(XD)(X)
    𝒞 ← AGGL(X, 𝒞, h)
    h ← h − 1
  until h = 0
  𝒞 ← {C ∈ 𝒞 ∣ #C ⩾ N}
  output 𝒞
Pseudo-Code (3/3)

function AGGL(X, 𝒞, h)
  for each object x of X
    y ← successive object of x
    if d0(Δk(x), Δk(y)) ⩽ 1 and d∞(Δk(x), Δk(y)) ⩽ l then
      delete C ∋ x and D ∋ y from 𝒞, and add C ∪ D
    end if
  end for
  output 𝒞
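The routines above can be condensed into a runnable sketch for a single level (an illustration in Python, with a union-find replacing the explicit cluster bookkeeping; `agglomerate`, its per-attribute sorting order, and the names used are our simplifications, not the authors' C implementation):

```python
def agglomerate(cells, l, N):
    """cells: discretized points (tuples of bin indices).
    Returns clusters of point indices with at least N members."""
    n, d = len(cells), len(cells[0])
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def d0(a, b):
        return sum(ai != bi for ai, bi in zip(a, b))

    def dinf(a, b):
        return max(abs(ai - bi) for ai, bi in zip(a, b))

    order = list(range(n))
    for h in range(d - 1, -1, -1):
        order.sort(key=lambda i: cells[i][h])  # stable sort, one attribute at a time
        for i, j in zip(order, order[1:]):     # compare each point to its successor
            if d0(cells[i], cells[j]) <= 1 and dinf(cells[i], cells[j]) <= l:
                parent[find(i)] = find(j)      # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    # size-based noise filtering: drop clusters smaller than N
    return sorted(c for c in clusters.values() if len(c) >= N)

# Level-2 cells of the running example a-e:
cells = [(0, 1), (2, 2), (2, 3), (1, 3), (2, 1)]
print(agglomerate(cells, l=1, N=1))  # [[0], [1, 2, 3, 4]]
```

On the walkthrough example this reproduces the slides' outcome: point a stays alone while b, c, d, e form one cluster, and with N = 2 the singleton is filtered out as noise.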
Level-k partition and Sorting
• BOOL constructs clusters through level-k partitions (k = 1, 2, 3, …)
  – A partition of a database X is a level-k partition, denoted by 𝒞k, if it satisfies the following condition: for all pairs x and y, the pair is in the same cluster iff y is reachable at level k from x
• Sorting of a database X is defined as follows:
  – Let Y be a key database s.t. #X = #Y and Y has only one of X's attributes
  – The expression SY(X) is the database X with its data points sorted in the order indicated by Y
    ∘ Ties keep the original order of X
Properties of Reachability
• If the distance parameter l = 1, the condition is exactly the same as

  d1(Δk(zi), Δk(zi+1)) ⩽ 1

  – d1 is the Manhattan distance (L1 metric)
• The notion of reachability is symmetric
  – If a data point x is reachable at level k from y, then y is reachable from x
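This equivalence is easy to check exhaustively on a small grid (a sketch; the `d0`, `d1`, and `dinf` helpers simply spell out the L0, L1, and L∞ metrics defined earlier):

```python
from itertools import product

def d0(x, y):
    return sum(a != b for a, b in zip(x, y))

def d1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def dinf(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

# Exhaustive check over the 3 x 3 x 3 grid: for l = 1, the conjunction
# (d0 <= 1 and d_inf <= 1) is exactly the Manhattan bound d1 <= 1.
cells = list(product(range(3), repeat=3))
assert all(
    (d0(x, y) <= 1 and dinf(x, y) <= 1) == (d1(x, y) <= 1)
    for x in cells for y in cells
)
```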
Hierarchy of Clusters
• Level-k partitions have a hierarchical structure
• For the level-k and level-(k + 1) partitions 𝒞k and 𝒞k+1, the following condition holds:
  – For every cluster C ∈ 𝒞k, there exists a set of clusters 𝒟 ⊆ 𝒞k+1 such that ⋃𝒟 = C
• This is because, for two objects x and y, if d0(Δk(x), Δk(y)) ⩽ 1 and d∞(Δk(x), Δk(y)) ⩽ l for some k, then the same holds for all k′ with k′ ⩽ k
• BOOL can be viewed as a divisive hierarchical clustering algorithm
Adjusted Rand Index
• Let the result be 𝒞 = {C1, …, CK} and the correct partition be 𝒟 = {D1, …, DM}
• Let nij ≔ #{x ∈ X ∣ x ∈ Ci and x ∈ Dj}, ai ≔ #Ci, and bj ≔ #Dj. Then the adjusted Rand index is

  [ Σ_{i,j} C(nij, 2) − ( Σ_i C(ai, 2) · Σ_j C(bj, 2) ) / C(n, 2) ]
  / [ ½( Σ_i C(ai, 2) + Σ_j C(bj, 2) ) − ( Σ_i C(ai, 2) · Σ_j C(bj, 2) ) / C(n, 2) ]

  where C(m, 2) = m(m − 1)/2 denotes the binomial coefficient
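The index can be computed directly from the contingency counts nij (a sketch using Python's `math.comb`; `adjusted_rand_index` is our helper name, not part of BOOL):

```python
import math

def adjusted_rand_index(C, D):
    """C, D: two partitions of the same items, given as lists of sets."""
    n = sum(len(c) for c in C)
    nij = [[len(c & d) for d in D] for c in C]  # contingency table
    sum_ij = sum(math.comb(v, 2) for row in nij for v in row)
    sum_a = sum(math.comb(len(c), 2) for c in C)
    sum_b = sum(math.comb(len(d), 2) for d in D)
    expected = sum_a * sum_b / math.comb(n, 2)  # expected index under chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([{0, 1}, {2, 3}], [{0, 1}, {2, 3}]))  # 1.0
print(adjusted_rand_index([{0, 1}, {2, 3}], [{0, 2}, {1, 3}]))  # approximately -0.5
```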
Shape-Based Clustering Algorithms
• Many shape-based algorithms have been proposed
  – See [Berkhin, 2006; Halkidi et al., 2001; Jain et al., 1999]
• Partitional algorithms
  – ABACUS [Chaoji et al., 2011], SPARCL [Chaoji et al., 2009]
• Mass-based algorithms [Ting and Wells, 2010]
• Density-based algorithms
  – DBSCAN [Ester et al., 1996]
• Hierarchical clustering algorithms
  – CURE [Guha et al., 2001], CHAMELEON [Karypis et al., 1999]
• Grid-based algorithms
  – STING [Wang et al., 1997]
References
[Berkhin, 2006] P. Berkhin. A survey of clustering data mining techniques. Grouping Multidimensional Data, 2006.
[Chaoji et al., 2011] V. Chaoji, G. Li, H. Yildirim, and M. J. Zaki. ABACUS: Mining arbitrary shaped clusters from large datasets based on backbone identification. In Proceedings of the SIAM International Conference on Data Mining, 2011.
[Chaoji et al., 2009] V. Chaoji, M. A. Hasan, S. Salem, and M. J. Zaki. SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters. Knowledge and Information Systems, 2009.
[Ester et al., 1996] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD, 1996.
[Garcia-Molina et al.] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall Press.
[Guha et al., 2001] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. Information Systems, 2001.
[Halkidi et al., 2001] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 2001.
[Hinneburg and Keim, 1998] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998.
[Jain et al., 1999] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 1999.
[Karypis et al., 1999] G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 1999.
[Lin et al., 2003] J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003.
[Qiu and Joe, 2006] W. Qiu and H. Joe. Generation of random clusters with specified degree of separation. Journal of Classification, 2006.
[Sheikholeslami et al., 1998] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[Ting and Wells, 2010] K. M. Ting and J. R. Wells. Multi-dimensional mass estimation and mass-based clustering. In Proceedings of the 10th IEEE International Conference on Data Mining, 2010.
[Wang et al., 1997] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases, 1997.