![Page 1: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/1.jpg)
Fast and Accurate K-means for Large Datasets
Michael Shindler, Alex Wong, Adam Meyerson
Presenter: Yoh Okuno #nipsreading
![Page 2: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/2.jpg)
About Presenter
• Name: Yoh Okuno
• R&D Engineer at Yahoo! Japan
• Interests: NLP (Natural Language Processing), Machine Learning, and Data Mining
• Skills: C/C++, Java, Python, and Hadoop
• Website: http://yoh.okuno.name/
![Page 3: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/3.jpg)
Overview
1. Recent Advancement on K-means Clustering
– Batch versus Streaming Settings
– Related Works and Our Contribution
2. Algorithm for Large-Scale K-means Clustering
– Streaming + Mini-Batch + Smart Initialization
3. Incorporating Approximate Nearest Neighbor Search
– Based on Random Projection (Hashing)
4. Evaluation and Discussion
![Page 4: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/4.jpg)
1. Recent Advancement on K-means Clustering
![Page 5: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/5.jpg)
Review of the Standard K-means Clustering
• Minimize cost function below iteratively:
$$\min_{z,\,\mu} \sum_{i=1}^{N} \lVert x_i - \mu_{z_i} \rVert^2$$
1. Update z with fixed μ (assign cluster number)
2. Update μ with fixed z (calculate average)
Where:
– x_i: i-th data point
– z_i: cluster number
– μ_j: centroid of j-th cluster
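The two alternating updates can be sketched in a few lines of plain Python. This is a minimal illustration of the standard batch iteration, not the presenter's or the paper's code; the function names (`lloyd`, `kmeans_cost`) and the choice to pass the initial centroids explicitly are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_cost(points, centroids, assignments):
    """Sum over i of ||x_i - mu_{z_i}||^2."""
    return sum(squared_distance(x, centroids[z])
               for x, z in zip(points, assignments))


def lloyd(points, initial_centroids, iterations=10):
    """Alternate the two updates for a fixed number of iterations."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Step 1: update z with fixed mu (assign each point to its nearest centroid).
        assignments = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in points
        ]
        # Step 2: update mu with fixed z (average the members of each cluster).
        for j in range(k):
            members = [x for x, z in zip(points, assignments) if z == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = lloyd(points, [(0.0, 0.0), (10.0, 10.0)])
cost = kmeans_cost(points, centroids, assignments)
```

Each iteration scans every point against every centroid, which is the per-pass cost that the streaming and approximate-search ideas later in the talk aim to reduce.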
![Page 6: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/6.jpg)
Related Works and Our Contributions
• The standard batch algorithm [Lloyd 1982]
• Streaming approaches [Aggarwal 2007]
• Mini-batch approaches [Sculley 2010]
• Our work is based on a recent streaming approach [Braverman+ 2011]
• Incorporated approximate nearest neighbor search
![Page 7: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/7.jpg)
2. Algorithm for Large-Scale K-means Clustering
![Page 8: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/8.jpg)
![Page 9: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/9.jpg)
Figure labels: Initialize, Streaming, Mini Batch
![Page 10: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/10.jpg)
Initialize clusters
• Create clusters until the buffer is full
– Run nearest neighbor search on the new data point
– Add a new cluster randomly (with probability depending on the distance to its nearest cluster)
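A rough sketch of this initialization loop is below. The slide only says that a cluster is added randomly according to distance; the specific rule used here (open a new cluster with probability min(1, d / facility_cost), where d is the squared distance to the nearest existing cluster) and the names `facility_cost`, `buffer_size`, `add_point`, and `initialize` are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def add_point(point, centers, weights, facility_cost, rng):
    """Process one point: open a new cluster or merge into the nearest one."""
    if not centers:
        centers.append(list(point))
        weights.append(1)
        return
    # Nearest neighbor search over the current clusters (exact, brute force).
    nearest = min(range(len(centers)),
                  key=lambda j: squared_distance(point, centers[j]))
    d = squared_distance(point, centers[nearest])
    # Assumed rule: far-away points are more likely to start a new cluster.
    if rng.random() < min(1.0, d / facility_cost):
        centers.append(list(point))
        weights.append(1)
    else:
        weights[nearest] += 1


def initialize(points, buffer_size, facility_cost, seed=0):
    """Consume points until the buffer of clusters is full."""
    rng = random.Random(seed)
    centers, weights = [], []
    consumed = 0
    for point in points:
        if len(centers) >= buffer_size:
            break
        add_point(point, centers, weights, facility_cost, rng)
        consumed += 1
    return centers, weights, consumed


# Identical points are always merged, because d = 0 gives probability 0.
same_centers, same_weights, same_consumed = initialize(
    [(1.0, 2.0)] * 5, buffer_size=4, facility_cost=1.0)
```

The output is a set of weighted centers rather than raw points, which is what the later weighted clustering step operates on.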
![Page 11: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/11.jpg)
Streaming K-means Clustering
• Renew clusters randomly in the same way
Same as the previous page
![Page 12: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/12.jpg)
Ball k-means on weighted points
• Run ball k-means on weighted points [Braverman+ 2011] [Ostrovsky+ 2006]
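The slide does not spell out the ball k-means step, so the following is only a sketch of one common formulation: each center is re-estimated as the weighted mean of the points inside a ball around it, with the ball radius set to a fraction of the distance to the nearest other center. The fraction 1/3 (`radius_fraction`) and the function name `ball_kmeans_step` are assumptions, and at least two centers are required.

```python
import math


def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def ball_kmeans_step(points, weights, centers, radius_fraction=1.0 / 3.0):
    """One ball k-means update on weighted points (needs >= 2 centers)."""
    new_centers = []
    for j, c in enumerate(centers):
        # Ball radius: a fraction of the distance to the nearest other center.
        nearest_other = min(distance(c, o)
                            for i, o in enumerate(centers) if i != j)
        radius = radius_fraction * nearest_other
        total = 0.0
        acc = [0.0] * len(c)
        for x, w in zip(points, weights):
            if distance(x, c) <= radius:
                total += w
                for t, xt in enumerate(x):
                    acc[t] += w * xt
        # Keep the old center if no point falls inside the ball.
        new_centers.append([a / total for a in acc] if total > 0 else list(c))
    return new_centers


points = [(0.0, 0.0), (2.0, 0.0), (8.0, 0.0), (12.0, 0.0)]
weights = [1, 3, 1, 2]
centers = [(1.0, 0.0), (13.0, 0.0)]
updated = ball_kmeans_step(points, weights, centers)
```

Points far from every center, such as (8, 0) here, fall outside all balls and do not pull any center toward them.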
![Page 13: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/13.jpg)
3. Incorporating Approximate Nearest Neighbor Search
![Page 14: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/14.jpg)
Bottleneck: nearest neighbor search among points
![Page 15: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/15.jpg)
Approximate Nearest Neighbor Search
• Use simple random projection
1. Draw ω ∈ R^d at random, with each component in [0, 1)
2. Calculate the inner product of ω with each cluster center
3. Given a query x, calculate the inner product x・ω
4. Find the cluster nearest to x by comparing these inner products
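The four steps can be sketched as a small index over the cluster centers. How the inner products are compared in step 4 is not stated on the slide; this sketch assumes the centers are sorted by projection and that only the two neighbors of the query's projection are checked. The class name `ProjectionIndex` and that neighbor rule are illustrative assumptions, and the answer is approximate by design.

```python
import bisect
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


class ProjectionIndex:
    def __init__(self, centers, seed=0):
        rng = random.Random(seed)
        dim = len(centers[0])
        # Step 1: random vector with components in [0, 1).
        self.omega = [rng.random() for _ in range(dim)]
        # Step 2: project every center onto omega and sort by projection.
        self.entries = sorted((dot(self.omega, c), idx)
                              for idx, c in enumerate(centers))
        self.keys = [p for p, _ in self.entries]
        self.centers = centers

    def query(self, x):
        # Step 3: project the query onto the same vector.
        p = dot(self.omega, x)
        # Step 4 (assumed rule): look only at the centers whose projections
        # sit immediately below and above p, then pick the closer one.
        pos = bisect.bisect_left(self.keys, p)
        candidates = [self.entries[i][1] for i in (pos - 1, pos)
                      if 0 <= i < len(self.entries)]
        return min(candidates,
                   key=lambda j: squared_distance(x, self.centers[j]))


centers = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
index = ProjectionIndex(centers)
hit = index.query([5.0, 5.0])
```

Each query costs a binary search plus two distance computations instead of a scan over all centers, at the price of occasionally returning a center that is not the true nearest.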
![Page 16: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/16.jpg)
4. Evaluation and Discussions
![Page 17: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/17.jpg)
Datasets
• BigCross dataset:
– Size: 11 million points in 55 dimensions
• Census 1990: national survey
– 2 million points in 68 dimensions
• Environment: C++ / Ubuntu / 2.9 GHz / 6 GB
![Page 18: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/18.jpg)
Note: Lower cost is better
![Page 19: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/19.jpg)
Note: Lower time is better
![Page 20: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/20.jpg)
Conclusion
• Proposed a fast, accurate k-means clustering method based on a streaming algorithm
• Incorporated approximate nearest neighbor search into the proposed algorithm
• Performs well in both practice and theory
![Page 21: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/21.jpg)
References
• [Lloyd 1982] Least Squares Quantization in PCM. IEEE Transactions on Information Theory.
• [Aggarwal 2007] Data Streams: Models and Algorithms. Springer.
• [Braverman+ 2011] Streaming K-means on Well-Clusterable Data. SODA.
• [Ackermann+ 2010] StreamKM++: A Clustering Algorithm for Data Streams. ALENEX.
• [Sculley 2010] Web-Scale K-means Clustering. WWW.
![Page 22: Fast and Accurate K-means for Large Datasets #nipsereading](https://reader033.vdocuments.pub/reader033/viewer/2022060115/55763debd8b42ac31b8b46cb/html5/thumbnails/22.jpg)
Any Questions?