A Fast and Flexible Clustering Algorithm Using Binary Discretization
DESCRIPTION
A Fast and Flexible Clustering Algorithm Using Binary Discretization, IEEE ICDM 2011, Vancouver, Canada, Dec. 11-14, 2011
TRANSCRIPT
IEEE ICDM 2011, December 11-14, Vancouver, Canada
A Fast and Flexible Clustering Algorithm Using Binary Discretization
Mahito SUGIYAMA†,‡, Akihiro YAMAMOTO†
†Kyoto University  ‡JSPS Research Fellow
Main Results
1. BOOL (Binary cOding Oriented cLustering)
   • A clustering algorithm for multivariate data to detect arbitrarily shaped clusters
   • Noise tolerant
   • Robust to changes in input parameters
2. BOOL is faster than K-means and the fastest among the algorithms that detect arbitrarily shaped clusters
   • It is about two to three orders of magnitude faster than two state-of-the-art algorithms [Chaoji et al., SDM 2011; Chaoji et al., KAIS 2009] that can detect non-convex clusters of arbitrary shapes
Evaluation with Typical Benchmarks
• We used four synthetic databases (DS1-DS4)
  – From the CLUTO website
    ∘ http://glaros.dtc.umn.edu/gkhome/views/cluto/
• Typical benchmarks for (spatial) clustering
  – CHAMELEON [Karypis et al., 1999], SPARCL [Chaoji et al., 2009], ABACUS [Chaoji et al., 2011], and so on

(Figure: scatter plots of the four datasets DS1, DS2, DS3, and DS4)
Results for Benchmarks
• Running time (in seconds)
  – In the table, n denotes the number of data points
  – Results for ABACUS and SPARCL are from [Chaoji et al., 2011]
    ∘ Their source codes are not publicly available

(Table: n and the running times of BOOL, DBSCAN, ABACUS, and SPARCL (non-convex) and K-means (convex) on DS1-DS4)
Results (Original Data)

(Figure: scatter plots of the original data for DS1, DS2, DS3, and DS4)
Results (BOOL)

(Figure: BOOL's clustering results on DS1, DS2, DS3, and DS4)
Results (K-means)

(Figure: K-means clustering results on DS1, DS2, DS3, and DS4)
Evaluation with Natural Images
• We used four natural images
  – From the Berkeley segmentation database and benchmark
    ∘ http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
  – The RGB (red-green-blue) values for each pixel were obtained by pre-processing, same as in [Chaoji et al., 2011]

(Images: Horse, Mushroom, Pyramid, Road)
Results for Natural Images
• Running time (in seconds)
  – In the table, n denotes the number of data points
  – Results for ABACUS and SPARCL are from [Chaoji et al., 2011]

(Table: n and the running times of BOOL, ABACUS, and SPARCL (non-convex) and K-means (convex) on Horse, Mushroom, Pyramid, and Road; the SPARCL entries for Mushroom, Pyramid, and Road are marked –)
Results (Original Data)

(Images: Horse, Mushroom, Pyramid, Road)
Results (BOOL)

(Images: Horse, Mushroom, Pyramid, Road)
Results (K-means)

(Images: Horse, Mushroom, Pyramid, Road)
Results for UCI Data
• Running time (in seconds)
(Table: n, d, K, and the running times of BOOL, K-means, and DBSCAN on ecoli, sonar, shuttle, wdbc, wine, and wine quality; the DBSCAN entry for shuttle is marked –)
Results for UCI Data
• Adjusted Rand index
(Table: n, d, K, and the adjusted Rand index of BOOL, K-means, and DBSCAN on ecoli, sonar, shuttle, wdbc, wine, and wine quality; the DBSCAN entry for shuttle is marked –)
Motivation
• The K-means algorithm is widely used
  – Simple and efficient
  – The main drawback is its limited clustering ability: non-spherical clusters cannot be found
• Many shape-based algorithms have been proposed
  – Can find clusters of arbitrary shape
  – Drawbacks:
    1. Not scalable (time complexity is quadratic or cubic)
    2. Not robust to input parameters
Key Strategy
• Requirements:
  1. Fast (linear in data size)
  2. Robust
  3. Find arbitrarily shaped clusters
• Key idea: Translate input data into binary words using binary discretization
  – A hierarchy of clusters is naturally produced by increasing the accuracy of the discretization
• Each cluster is constructed by agglomerating adjacent smaller clusters
  – Performed in linear time by sorting binary representations
Clustering Process
(Table: an example database of five points a, b, c, d, e with coordinates x and y)
Discretize Database at Level 1

(Figure: the five points on a 2 × 2 grid; discretization level 1)

id x y
a  0 0
b  1 1
c  1 1
d  0 1
e  1 0
Agglomerate Adjacent Clusters

(Figure: occupied adjacent cells merged into clusters at level 1)

id x y
a  0 0
b  1 1
c  1 1
d  0 1
e  1 0
Go to Next Level

(Table: the five points a-e with their original coordinates x and y)
Discretize Database at Level 2

(Figure: the five points on a 4 × 4 grid; discretization level 2)

id x y
a  0 1
b  2 2
c  2 3
d  1 3
e  2 1
Agglomerate Adjacent Clusters

(Figure: occupied adjacent cells merged into clusters at level 2)

id x y
a  0 1
b  2 2
c  2 3
d  1 3
e  2 1
Construct Clusters with Sorting

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
b  2 2
c  2 3
d  1 3
e  2 1
Sort by x-axis → Sort by y-axis

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
e  2 1
b  2 2
d  1 3
c  2 3
Compare Each Point to the Next Point

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
e  2 1
b  2 2
d  1 3
c  2 3
Sort by x-axis

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
d  1 3
e  2 1
b  2 2
c  2 3
Compare Each Point to the Next Point

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
d  1 3
e  2 1
b  2 2
c  2 3
Finish (... or go to the next level)

(Figure: the five points on the 4 × 4 grid; discretization level 2)

id x y
a  0 1
d  1 3
e  2 1
b  2 2
c  2 3
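The sorting steps of this walkthrough can be reproduced with stable sorts (a small Python sketch using the level-2 coordinates of points a-e from the example above; since Python's `sorted` is stable, sorting by one axis at a time yields exactly the orders shown on the slides):

```python
# Reproduce the walkthrough: stable sorts bring grid-adjacent cells next to
# each other, so candidate merges are found by scanning successive pairs.
pts = {"a": (0, 1), "b": (2, 2), "c": (2, 3), "d": (1, 3), "e": (2, 1)}

by_x = sorted(pts, key=lambda p: pts[p][0])     # sort by x-axis
by_xy = sorted(by_x, key=lambda p: pts[p][1])   # then a stable sort by y-axis
print(by_xy)  # ['a', 'e', 'b', 'd', 'c']

back_by_x = sorted(by_xy, key=lambda p: pts[p][0])  # sort by x-axis again
print(back_by_x)  # ['a', 'd', 'e', 'b', 'c']
```

Each point is then compared only with its successor in the sorted order, which is what makes the agglomeration step linear after sorting.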
Distance Parameter l
(Figure: cell neighborhoods for l = 1, l = 2, and l = 3)
Noise Filtering by BOOL
• Noise filtering is important in clustering
• Define 𝒞⩾N ≔ {C ∈ 𝒞 ∣ #C ⩾ N} for a partition 𝒞
  – A cluster C is treated as noise if #C < N
• Example: Given 𝒞 = {{0.1}, {0.4, 0.5, 0.6}, {0.9}}
  – 𝒞⩾2 = {{0.4, 0.5, 0.6}}, and 0.1 and 0.9 are noise
• We input the number N (the noise parameter) as the lower bound on the size of a cluster in BOOL
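The filter 𝒞⩾N can be written in a few lines (a minimal sketch; `filter_noise` is a hypothetical helper name, not part of BOOL's implementation):

```python
def filter_noise(partition, N):
    """Return (kept_clusters, noise_points): keep clusters of size >= N,
    and report the members of smaller clusters as noise."""
    kept = [C for C in partition if len(C) >= N]
    noise = [x for C in partition if len(C) < N for x in C]
    return kept, noise

clusters, noise = filter_noise([[0.1], [0.4, 0.5, 0.6], [0.9]], N=2)
# clusters == [[0.4, 0.5, 0.6]], noise == [0.1, 0.9]
```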
Conclusion
• We have developed a clustering algorithm, BOOL
  – Fastest w.r.t. finding arbitrarily shaped clusters
  – Robust, highly scalable, and noise tolerant
• The key to performance is discretization based on the binary encoding and sorting of data points
• Future work: BOOL is simple and can easily be extended to other machine learning tasks
  – e.g., anomaly detection, semi-supervised learning
Appendix
Other Experiments
• We evaluate BOOL experimentally to determine its scalability and effectiveness for various types of databases
  – Randomly generated synthetic databases (d = 2)
  – Famous synthetic databases for spatial clustering (d = 2)
  – Real databases generated from natural images (d = 3)
  – Real databases from the UCI repository (d = 7, 9, 30, 60)
• We used Mac OS X (Quad-Core Intel Xeon CPUs)
  – BOOL was implemented in C and compiled with gcc
  – All experiments were performed in R
• All databases are pre-processed by min-max normalization in BOOL
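Min-max normalization rescales each attribute into [0, 1] independently, which is what makes the domain assumption of BOOL hold. A minimal sketch (`min_max_normalize` is our helper name; it assumes no attribute is constant, since a constant column would divide by zero):

```python
def min_max_normalize(X):
    """Scale each attribute of X (a list of equal-length tuples) into [0, 1]."""
    d = len(X[0])
    lo = [min(x[i] for x in X) for i in range(d)]
    hi = [max(x[i] for x in X) for i in range(d)]
    return [
        tuple((x[i] - lo[i]) / (hi[i] - lo[i]) for i in range(d))
        for x in X
    ]

print(min_max_normalize([(0, 10), (5, 20), (10, 30)]))
# [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```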
Speed w.r.t. Data and Clusters
• Randomly generated synthetic databases
  – K = 5 (left), n = 10000 (right)

(Plots: running time (s) of BOOL, K-means, and DBSCAN vs. the number of data points n and vs. the number of clusters K)
Quality w.r.t. Data and Clusters
• Randomly generated synthetic databases
  – K = 5 (left), n = 10000 (right)

(Plots: adjusted Rand index vs. the number of data points n and vs. the number of clusters K)
Speed w.r.t. Input Parameters
• Randomly generated synthetic databases
  – K = 5 and n = 10000

(Plots: running time (s) vs. the distance parameter l and vs. the noise parameter N)
Quality w.r.t. Input Parameters
• Randomly generated synthetic databases
  – K = 5 and n = 10000

(Plots: adjusted Rand index vs. the distance parameter l and vs. the noise parameter N)
Notation
• We treat a target database X as a relation [Garcia-Molina et al.]
  – The columns are called attributes
  – The domain of each attribute is [0, 1]
  – Attributes are indexed by the natural numbers {1, 2, …, d}
    ∘ Each tuple corresponds to a data point in the d-dimensional Euclidean space ℝd
  – xi is the value of attribute i of a data point x
  – For M ⊆ {1, 2, …, d}, πM(X) denotes the database that has only the columns for attributes M of X
• Clustering is a partition of X into K subsets (clusters) C1, …, CK
  – Ci ≠ ∅ and Ci ∩ Cj = ∅ for i ≠ j
  – We call 𝒞 = {C1, …, CK} a partition of X
  – 𝒞(X) = {𝒞 ∣ 𝒞 is a partition of X}
Discretization
• For a data point x in a database X, discretization at level k is an operator Δk, where each value xi is mapped to a natural number m such that xi = 0 implies m = 0, and xi ≠ 0 implies

  m = 0        if 0 < xi ⩽ 2^(−k),
      1        if 2^(−k) < xi ⩽ 2·2^(−k),
      ⋮
      2^k − 1  if (2^k − 1)·2^(−k) < xi ⩽ 1.

  – We use the same operator Δk for discretization of a database X; i.e., each data point x in X is discretized to Δk(x) in the database Δk(X).
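For intuition, the operator Δk can be sketched as follows (a minimal illustration; the function name `discretize` is ours, and the bin index for xi > 0 is computed as ⌈xi·2^k⌉ − 1, which matches the piecewise definition above):

```python
import math

def discretize(x, k):
    """Delta_k: map each coordinate in [0, 1] to a bin in {0, ..., 2^k - 1}.

    Bin m covers the half-open interval (m * 2^-k, (m + 1) * 2^-k];
    following the definition, x_i = 0 is mapped to 0.
    """
    bins = 2 ** k
    return tuple(
        0 if v == 0 else math.ceil(v * bins) - 1  # ceil picks the covering bin
        for v in x
    )

print(discretize((0.1, 0.3), 2))  # (0, 1)
```

With k = 1 the unit square splits into a 2 × 2 grid, and with k = 2 into a 4 × 4 grid, as in the walkthrough.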
Reachability
• Given a distance parameter l ∈ ℕ, a data point x in X is reachable at level k from a data point y if there exists a chain of points z1, z2, …, zp (p ⩾ 2) such that z1 = x, zp = y, and

  d0(Δk(zi), Δk(zi+1)) ⩽ 1 and d∞(Δk(zi), Δk(zi+1)) ⩽ l

  for all i ∈ {1, 2, …, p − 1}
  – The distances between x and y are defined by

    d0(x, y) = Σ_{i=1}^{d} δ(xi, yi), where δ(xi, yi) = 0 if xi = yi and 1 if xi ≠ yi,

    d∞(x, y) = max_{i ∈ {1,…,d}} |xi − yi|

    in the L0 and L∞ metrics, respectively
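The two distances can be written down directly (a small illustration; `d0` and `dinf` are hypothetical helper names for the L0 and L∞ metrics above):

```python
def d0(x, y):
    """L0 'distance': the number of coordinates where x and y differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

def dinf(x, y):
    """L-infinity distance: the largest coordinate-wise difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# Two discretized points are directly linked when d0 <= 1 and dinf <= l:
# they differ in at most one coordinate, and by at most l bins.
print(d0((2, 1), (2, 2)), dinf((2, 1), (2, 2)))  # 1 1 -> linked for l = 1
print(d0((0, 1), (2, 1)), dinf((0, 1), (2, 1)))  # 1 2 -> not linked for l = 1
```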
Pseudo-Code (1/3)

Input: Database X,
       lower bound on the number of clusters K,
       noise parameter N, and
       distance parameter l
Output: Partition 𝒞

function BOOL(X, K, N)
  k ← 1  // k is the level of discretization
  repeat
    𝒞 ← MAKEHIERARCHY(X, k, N)
    k ← k + 1
  until #𝒞 ⩾ K
  output 𝒞
Pseudo-Code (2/3)

function MAKEHIERARCHY(X, k, N)
  𝒞 ← {{x} ∣ x ∈ X}
  h ← d  // d is the number of attributes of X
  XD ← Δk(X)  // discretize X at level k
  X ← Sπ1(XD) ∘ Sπ2(XD) ∘ … ∘ Sπd(XD)(X)
  repeat
    X ← Sπh(XD)(X)
    𝒞 ← AGGL(X, 𝒞, h)
    h ← h − 1
  until h = 0
  𝒞 ← {C ∈ 𝒞 ∣ #C ⩾ N}
  output 𝒞
Pseudo-Code (3/3)

function AGGL(X, 𝒞, h)
  for each object x of X
    y ← successive object of x
    if d0(Δk(x), Δk(y)) ⩽ 1 and d∞(Δk(x), Δk(y)) ⩽ l then
      delete C ∋ x and D ∋ y from 𝒞, and add C ∪ D
    end if
  end for
  output 𝒞
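The routines above can be condensed into a runnable sketch for a single level (an illustration in Python, with a union-find replacing the explicit cluster bookkeeping; `agglomerate`, its per-attribute sorting order, and the names used are our simplifications, not the authors' C implementation):

```python
def agglomerate(cells, l, N):
    """cells: discretized points (tuples of bin indices).
    Returns clusters of point indices with at least N members."""
    n, d = len(cells), len(cells[0])
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def d0(a, b):
        return sum(ai != bi for ai, bi in zip(a, b))

    def dinf(a, b):
        return max(abs(ai - bi) for ai, bi in zip(a, b))

    order = list(range(n))
    for h in range(d - 1, -1, -1):
        order.sort(key=lambda i: cells[i][h])  # stable sort, one attribute at a time
        for i, j in zip(order, order[1:]):     # compare each point to its successor
            if d0(cells[i], cells[j]) <= 1 and dinf(cells[i], cells[j]) <= l:
                parent[find(i)] = find(j)      # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    # size-based noise filtering: drop clusters smaller than N
    return sorted(c for c in clusters.values() if len(c) >= N)

# Level-2 cells of the running example a-e:
cells = [(0, 1), (2, 2), (2, 3), (1, 3), (2, 1)]
print(agglomerate(cells, l=1, N=1))  # [[0], [1, 2, 3, 4]]
```

On the walkthrough example this reproduces the slides' outcome: point a stays alone while b, c, d, e form one cluster, and with N = 2 the singleton is filtered out as noise.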
Level-k partition and Sorting
• BOOL constructs clusters through level-k partitions (k = 1, 2, 3, …)
  – A partition of a database X is a level-k partition, denoted by 𝒞k, if it satisfies the following condition: for all pairs x and y, the pair is in the same cluster iff y is reachable at level k from x
• Sorting of a database X is defined as follows:
  – Let Y be a key database s.t. #X = #Y and Y has only one of X's attributes
  – The expression SY(X) is the database X with its data points sorted in the order indicated by Y
    ∘ Ties keep the original order of X
Properties of Reachability
• If the distance parameter l = 1, the condition is exactly the same as

  d1(Δk(zi), Δk(zi+1)) ⩽ 1

  – d1 is the Manhattan distance (L1 metric)
• The notion of reachability is symmetric
  – If a data point x is reachable at level k from y, then y is reachable from x
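This equivalence is easy to check exhaustively on a small grid (a sketch; the `d0`, `d1`, and `dinf` helpers simply spell out the L0, L1, and L∞ metrics defined earlier):

```python
from itertools import product

def d0(x, y):
    return sum(a != b for a, b in zip(x, y))

def d1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def dinf(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

# Exhaustive check over the 3 x 3 x 3 grid: for l = 1, the conjunction
# (d0 <= 1 and d_inf <= 1) is exactly the Manhattan bound d1 <= 1.
cells = list(product(range(3), repeat=3))
assert all(
    (d0(x, y) <= 1 and dinf(x, y) <= 1) == (d1(x, y) <= 1)
    for x in cells for y in cells
)
```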
Hierarchy of Clusters
• Level-k partitions have a hierarchical structure
• For the level-k and level-(k + 1) partitions 𝒞k and 𝒞k+1, the following condition holds:
  – For every cluster C ∈ 𝒞k, there exists a set of clusters 𝒟 ⊆ 𝒞k+1 such that ⋃𝒟 = C
• This is because, for two objects x and y, if d0(Δk(x), Δk(y)) ⩽ 1 and d∞(Δk(x), Δk(y)) ⩽ l for some k, then the same holds for all k′ with k′ ⩽ k
• BOOL can be viewed as a divisive hierarchical clustering algorithm
Adjusted Rand Index
• Let the result be 𝒞 = {C1, …, CK} and the correct partition be 𝒟 = {D1, …, DM}
• Let nij ≔ #{x ∈ X ∣ x ∈ Ci and x ∈ Dj}, ai ≔ #Ci, and bj ≔ #Dj. Then the adjusted Rand index is

  [ Σ_{i,j} C(nij, 2) − ( Σ_i C(ai, 2) · Σ_j C(bj, 2) ) / C(n, 2) ]
  / [ ½( Σ_i C(ai, 2) + Σ_j C(bj, 2) ) − ( Σ_i C(ai, 2) · Σ_j C(bj, 2) ) / C(n, 2) ]

  where C(m, 2) = m(m − 1)/2 denotes the binomial coefficient
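The index can be computed directly from the contingency counts nij (a sketch using Python's `math.comb`; `adjusted_rand_index` is our helper name, not part of BOOL):

```python
import math

def adjusted_rand_index(C, D):
    """C, D: two partitions of the same items, given as lists of sets."""
    n = sum(len(c) for c in C)
    nij = [[len(c & d) for d in D] for c in C]  # contingency table
    sum_ij = sum(math.comb(v, 2) for row in nij for v in row)
    sum_a = sum(math.comb(len(c), 2) for c in C)
    sum_b = sum(math.comb(len(d), 2) for d in D)
    expected = sum_a * sum_b / math.comb(n, 2)  # expected index under chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([{0, 1}, {2, 3}], [{0, 1}, {2, 3}]))  # 1.0
print(adjusted_rand_index([{0, 1}, {2, 3}], [{0, 2}, {1, 3}]))  # approximately -0.5
```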
Shape-Based Clustering Algorithms
• Many shape-based algorithms have been proposed
  – See [Berkhin, 2006; Halkidi et al., 2001; Jain et al., 1999]
• Partitional algorithms
  – ABACUS [Chaoji et al., 2011], SPARCL [Chaoji et al., 2009]
• Mass-based algorithms [Ting and Wells, 2010]
• Density-based algorithms
  – DBSCAN [Ester et al., 1996]
• Hierarchical clustering algorithms
  – CURE [Guha et al., 2001], CHAMELEON [Karypis et al., 1999]
• Grid-based algorithms
  – STING [Wang et al., 1997]
References
[Berkhin, 2006] P. Berkhin. A survey of clustering data mining techniques. Grouping Multidimensional Data, 2006.
[Chaoji et al., 2011] V. Chaoji, G. Li, H. Yildirim, and M. J. Zaki. ABACUS: Mining arbitrary shaped clusters from large datasets based on backbone identification. In Proceedings of the SIAM International Conference on Data Mining, 2011.
[Chaoji et al., 2009] V. Chaoji, M. A. Hasan, S. Salem, and M. J. Zaki. SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters. Knowledge and Information Systems, 2009.
[Ester et al., 1996] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD, 1996.
[Garcia-Molina et al.] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall Press.
[Guha et al., 2001] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. Information Systems, 2001.
[Halkidi et al., 2001] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 2001.
[Hinneburg and Keim, 1998] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998.
[Jain et al., 1999] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 1999.
[Karypis et al., 1999] G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 1999.
[Lin et al., 2003] J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003.
[Qiu and Joe, 2006] W. Qiu and H. Joe. Generation of random clusters with specified degree of separation. Journal of Classification, 2006.
[Sheikholeslami et al., 1998] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[Ting and Wells, 2010] K. M. Ting and J. R. Wells. Multi-dimensional mass estimation and mass-based clustering. In Proceedings of the 10th IEEE International Conference on Data Mining, 2010.
[Wang et al., 1997] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases, 1997.