kmeans with canopy clustering
k-means, canopy clustering
Machine Learning Study (2015-03-05)
Seong-hyeon Jeong (정성현), Senior Researcher, Platform Research Team, BDS Research Lab
Classification: supervised learning
Training set:
ClassB
ClassA
Decision boundary
New input: classified as ClassB
Clustering: unsupervised learning
Training set:
ClusterA
ClusterB
k-means clustering
Input:
- $K$ (number of clusters)
- Training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$
Initial cluster centroids
Repeat {
  for $i$ = 1 to $m$
    $c^{(i)}$ := index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$
  for $k$ = 1 to $K$
    $\mu_k$ := average (mean) of points assigned to cluster $k$
}
K-means optimization objective
$c^{(i)}$ = index of cluster (1, 2, …, $K$) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned
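The objective formula itself did not survive in this transcript. As a sketch, assuming the usual distortion from the referenced course, $J = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$, the cost could be computed as below. The function and variable names are illustrative, and indices are 0-based here rather than 1-based as on the slide.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def distortion(points: List[Point], assignments: List[int],
               centroids: List[Point]) -> float:
    """Mean squared distance from each point to its assigned centroid."""
    total = sum(squared_distance(x, centroids[c])
                for x, c in zip(points, assignments))
    return total / len(points)


# Four points, two clusters; every point is exactly 1 unit from its centroid.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
assignments = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]
print(distortion(points, assignments, centroids))  # 1.0
```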
K-means algorithm
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K$
Repeat {
  for $i$ = 1 to $m$
    $c^{(i)}$ := index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$
  for $k$ = 1 to $K$
    $\mu_k$ := average (mean) of points assigned to cluster $k$
}
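The loop above can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the demo: the stopping rule (stop when assignments no longer change), the `max_iters` cap, and the choice to keep an empty cluster's old centroid are assumptions that the slide does not specify.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iters=100, seed=0):
    """Return (assignments, centroids) using 0-based cluster indices."""
    rng = random.Random(seed)
    # Initialization: pick K distinct training examples as centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for each point.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in points
        ]
        if new_assignments == assignments:
            break  # assumed stopping rule: nothing changed
        assignments = new_assignments
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [x for x, c in zip(points, assignments) if c == j]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[j] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return assignments, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
print(centroids)
```

On this toy data the two groups of three points end up in separate clusters; which group gets index 0 depends on the random start.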
Random initialization
Should have $K < m$.
Randomly pick $K$ training examples.
Set $\mu_1, \ldots, \mu_K$ equal to these $K$ examples.
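A small sketch of this initialization step, assuming the training set is a list of tuples. The helper name `init_centroids` and its `seed` parameter are hypothetical.

```python
import random


def init_centroids(points, k, seed=None):
    """Pick K distinct training examples to serve as initial centroids."""
    if not k < len(points):
        raise ValueError("expected K < m (fewer clusters than examples)")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no example is used twice.
    return rng.sample(points, k)


training_set = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
print(init_centroids(training_set, k=2, seed=42))
```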
Local optima
Random initialization
For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means. Get $c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K$.
  Compute cost function (distortion) $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$
}
Pick clustering that gave lowest cost
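The restart loop above might look like this in Python. It is a sketch under assumptions: the cost is taken to be the mean squared distance to the assigned centroid, and `best_of_restarts`, `n_restarts` and `seed` are illustrative names.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, rng, max_iters=100):
    """One k-means run from a random start drawn with rng."""
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        new = [min(range(k), key=lambda j: sq_dist(x, centroids[j]))
               for x in points]
        if new == assignments:
            break
        assignments = new
        for j in range(k):
            members = [x for x, c in zip(points, assignments) if c == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return assignments, centroids


def distortion(points, assignments, centroids):
    """Assumed cost: mean squared distance to the assigned centroid."""
    return sum(sq_dist(x, centroids[c])
               for x, c in zip(points, assignments)) / len(points)


def best_of_restarts(points, k, n_restarts=100, seed=0):
    """Run k-means n_restarts times and keep the lowest-cost clustering."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        assignments, centroids = kmeans(points, k, rng)
        cost = distortion(points, assignments, centroids)
        if best is None or cost < best[0]:
            best = (cost, assignments, centroids)
    return best


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0)]
cost, assignments, centroids = best_of_restarts(points, k=3, n_restarts=20, seed=1)
print(cost, assignments)
```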
Choosing the value of K
Elbow method:
[Two plots: cost function versus K (no. of clusters)]
Choose K=3
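To draw such a curve, one could compute the cost for each candidate K and look for the point where the decrease flattens. The sketch below assumes the mean-squared-distance cost and takes the best of a few random restarts per K. The name `elbow_costs` and its parameters are hypothetical, and reading off the elbow is still left to the eye.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, rng, max_iters=100):
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        new = [min(range(k), key=lambda j: sq_dist(x, centroids[j]))
               for x in points]
        if new == assignments:
            break
        assignments = new
        for j in range(k):
            members = [x for x, c in zip(points, assignments) if c == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return assignments, centroids


def distortion(points, assignments, centroids):
    return sum(sq_dist(x, centroids[c])
               for x, c in zip(points, assignments)) / len(points)


def elbow_costs(points, k_values, n_restarts=10, seed=0):
    """Map each K to the lowest cost found over n_restarts runs."""
    rng = random.Random(seed)
    costs = {}
    for k in k_values:
        costs[k] = min(distortion(points, *kmeans(points, k, rng))
                       for _ in range(n_restarts))
    return costs


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0)]
costs = elbow_costs(points, range(1, 9))
for k, cost in costs.items():
    print(k, round(cost, 3))
```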
DEMO k-means
Demo: Starbucks area clustering
K=5: Seoul, Gyeonggi / Gangwon / Busan, Gyeongsang / Jeolla, Jeju / Chungcheong
K=3: Seoul, Gyeonggi, Gangwon / Chungcheong, Jeolla, Jeju / Busan, Gyeongsang
Demo: Local optima
Demo: Find elbow
[Plot: cost versus number of clusters K, for K = 2 to 7]
Canopy Clustering: thresholds T1, T2 (T1 > T2)
- Select one of the samples as a cluster center.
- Inside T2: the sample is a member of the cluster and cannot be a cluster center.
- Outside T2 but inside T1: the sample is a member of the cluster and can also become a cluster center itself.
Canopy Clustering: overlap
Where canopies overlap, assign the sample to the closest cluster.
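The canopy rules above can be sketched as follows. This is an illustration under assumptions, not the demo code: the next center is simply the first remaining candidate (the slide only says "select one of the samples"), every sample within T1 of a center counts as a member of that canopy, and Euclidean distance is used via `math.dist` (Python 3.8+). The function names are hypothetical.

```python
import math


def canopy_clustering(points, t1, t2):
    """Return a list of (center, members) canopies; requires T1 > T2."""
    if not t1 > t2:
        raise ValueError("expected T1 > T2")
    candidates = list(points)  # samples still allowed to become centers
    canopies = []
    while candidates:
        center = candidates[0]  # assumption: take the first remaining one
        # Inside T1: member of this canopy.
        members = [p for p in points if math.dist(center, p) <= t1]
        canopies.append((center, members))
        # Inside T2: can no longer become a center.
        candidates = [p for p in candidates if math.dist(center, p) > t2]
    return canopies


def assign_closest(points, centers):
    """Resolve overlaps by assigning each sample to its closest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 10.0), (11.0, 10.0)]
canopies = canopy_clustering(points, t1=3.0, t2=1.5)
centers = [center for center, _ in canopies]
labels = assign_closest(points, centers)
print(centers)  # [(0.0, 0.0), (2.5, 0.0), (10.0, 10.0)]
print(labels)   # [0, 0, 1, 2, 2]
```

Here (1.0, 0.0) lies inside T1 of both of the first two centers, so it belongs to two canopies, and the closest-center rule places it with (0.0, 0.0).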
Canopy clustering: finding the perfect k using canopy clustering
[Figure panels: sample data; > 5% of population; > 10% of population]
Seeding k-means centroids using canopy generation
- Mahout in Action -
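One way to read this slide: generate canopies, keep only those holding more than a given share of the samples, and use the surviving centers as the initial k-means centroids, so that k falls out of the data. The sketch below follows that reading. The `min_fraction` parameter, the function names, and the use of canopy center points (rather than canopy means) as seeds are assumptions, and this is not Mahout's API.

```python
import math


def canopies(points, t1, t2):
    """Return (center, members) pairs; requires T1 > T2."""
    if not t1 > t2:
        raise ValueError("expected T1 > T2")
    candidates = list(points)
    result = []
    while candidates:
        center = candidates[0]
        members = [p for p in points if math.dist(center, p) <= t1]
        result.append((center, members))
        candidates = [p for p in candidates if math.dist(center, p) > t2]
    return result


def kmeans_from(points, centroids, max_iters=100):
    """k-means started from the given centroids instead of a random pick."""
    centroids = list(centroids)
    k = len(centroids)
    assignments = None
    for _ in range(max_iters):
        new = [min(range(k), key=lambda j: math.dist(x, centroids[j]))
               for x in points]
        if new == assignments:
            break
        assignments = new
        for j in range(k):
            members = [x for x, c in zip(points, assignments) if c == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return assignments, centroids


def canopy_seeded_kmeans(points, t1, t2, min_fraction=0.10):
    """Seed k-means with centers of canopies above the population share."""
    seeds = [center for center, members in canopies(points, t1, t2)
             if len(members) > min_fraction * len(points)]
    if not seeds:
        raise ValueError("no canopy exceeds min_fraction of the population")
    return kmeans_from(points, seeds)


blob_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
blob_b = [(x + 10.0, y + 10.0) for x, y in blob_a]
points = blob_a + blob_b + [(30.0, 30.0)]  # last point is a lone outlier
assignments, centroids = canopy_seeded_kmeans(points, t1=4.0, t2=2.0)
print(len(centroids))  # 2: the single-point canopy is filtered out
```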
DEMO canopy, canopy & k-means
Demo: Starbucks area clustering
[k-means] K=5, elapsed time=2.57 sec
[canopy] canopy[K]=5, elapsed time=0.012 sec
Demo: K-means with canopy
[k-means] K=5, elapsed time=2.57 sec
[k-means + canopy] canopy[K]=5, elapsed time=1.89 sec
Demo: Canopy clustering to find user POI (Point Of Interest)
- GPS data of one month
- All canopies
- Canopies > 10% of population
Reference
- http://www.coursera.com Machine Learning (Andrew Ng), Clustering chapter
- http://mahout.apache.org Mahout, Mahout in Action
THANKS