kmeans with canopy clustering

k-means, canopy clustering. Machine Learning Study (2015-03-05). BDS Research Institute, Platform Research Team, Seonghyun Jeong (Senior Researcher)

Uploaded by: seonghyun-jeong

Posted on 18-Jul-2015


TRANSCRIPT

Page 1: Kmeans with canopy clustering

k-means, canopy clustering

Machine Learning Study (2015-03-05)

BDS Research Institute, Platform Research Team, Seonghyun Jeong (Senior Researcher)

Page 2: Kmeans with canopy clustering

Classification (supervised learning)

[Figure: training set with points labeled ClassA and ClassB.]

Page 3: Kmeans with canopy clustering

Classification (supervised learning)

[Figure: training set with points labeled ClassA and ClassB, separated by a decision boundary.]

Page 4: Kmeans with canopy clustering

Classification (supervised learning)

[Figure: training set with points labeled ClassA and ClassB, a decision boundary, and a new input.]

Page 5: Kmeans with canopy clustering

Classification (supervised learning)

[Figure: training set with points labeled ClassA and ClassB, and a decision boundary. The new input is classified as ClassB.]

Page 6: Kmeans with canopy clustering

Clustering (unsupervised learning)

[Figure: unlabeled training set.]

Page 7: Kmeans with canopy clustering

Clustering (unsupervised learning)

[Figure: unlabeled training set grouped into ClusterA and ClusterB.]

Page 8: Kmeans with canopy clustering

k-means clustering

Input:
- $K$ (number of clusters)
- Training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$

Page 9: Kmeans with canopy clustering

k-means clustering

Initial cluster centroids

Input:
- $K$ (number of clusters)
- Training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$

Page 10: Kmeans with canopy clustering

k-means clustering

Repeat {
    for $i = 1$ to $m$
        $c^{(i)}$ := index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$
    for $k = 1$ to $K$
        $\mu_k$ := average (mean) of points assigned to cluster $k$
}

Page 17: Kmeans with canopy clustering

K-means optimization objective

$c^{(i)}$ = index of cluster (1, 2, …, $K$) to which example $x^{(i)}$ is currently assigned

$\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)

$\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
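The formula itself did not survive in the transcript. As a reference point, the standard k-means distortion built from these three symbols is written out below. The name $J$ and the $1/m$ scaling follow the convention of the Coursera course cited in the references, and are an assumption about what the slide showed.

```latex
J\bigl(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K\bigr)
  = \frac{1}{m} \sum_{i=1}^{m} \bigl\lVert x^{(i)} - \mu_{c^{(i)}} \bigr\rVert^{2},
\qquad
\min_{\substack{c^{(1)}, \dots, c^{(m)} \\ \mu_1, \dots, \mu_K}} J
```

Each term is the squared distance between an example and the centroid of the cluster it is assigned to, so a lower $J$ means tighter clusters.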

Page 18: Kmeans with canopy clustering

K-means algorithm

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K$

Repeat {
    for $i = 1$ to $m$
        $c^{(i)}$ := index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$
    for $k = 1$ to $K$
        $\mu_k$ := average (mean) of points assigned to cluster $k$
}
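The loop above can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the talk: the function names are hypothetical, indices are 0-based rather than 1-based, the initial centroids are passed in by the caller, and an empty cluster simply keeps its previous centroid (an assumption; the slides do not say how that case is handled).

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(X, centroids, max_iters=100):
    """Alternate the two steps of the loop until assignments stop changing."""
    centroids = list(centroids)
    assignments = None
    for _ in range(max_iters):
        # Cluster assignment step: c[i] := index of the centroid closest to x[i].
        new_assignments = [closest(x, centroids) for x in X]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move centroid step: mu[k] := mean of the points assigned to cluster k.
        for k in range(len(centroids)):
            members = [x for x, c in zip(X, assignments) if c == k]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[k] = mean(members)
    return assignments, centroids

X = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
c, mu = kmeans(X, centroids=[X[0], X[3]])
print(c)   # cluster index for each point
print(mu)  # final centroids
```

Stopping when the assignments no longer change is one common convergence test; a fixed iteration count or a tolerance on centroid movement would work as well.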

Page 19: Kmeans with canopy clustering

Random initialization

Should have $K < m$.

Randomly pick $K$ training examples.

Set $\mu_1, \dots, \mu_K$ equal to these $K$ examples.
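As a small sketch of this initialization rule, the hypothetical helper below samples the starting centroids from the training set without replacement. The function name and the `rng` parameter are illustrative assumptions, not part of the original demo.

```python
import random

def random_init(X, K, rng=random):
    """Pick K training examples at random and use them as initial centroids."""
    if not K < len(X):
        raise ValueError("need K < m (fewer clusters than training examples)")
    return rng.sample(X, K)  # sampling without replacement

X = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (5.0, 5.0)]
centroids = random_init(X, K=2, rng=random.Random(0))
print(centroids)
```

Passing a seeded `random.Random` instance keeps a run reproducible, which helps when comparing clusterings.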

Page 20: Kmeans with canopy clustering

Local optima

Page 21: Kmeans with canopy clustering

Local optima

Page 22: Kmeans with canopy clustering

Local optima

Page 23: Kmeans with canopy clustering

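Random initialization

For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get $c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K$.
    Compute cost function (distortion) $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$
}

Pick clustering that gave lowest cost $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$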
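The restart procedure above can be sketched like this. It is an illustrative pure-Python version under stated assumptions: `run_kmeans`, `distortion` and `best_of` are hypothetical names, each run uses a fixed number of iterations, and the cost is the mean squared distance of every point to its assigned centroid.

```python
import math
import random

def assign(X, mu):
    """Index of the closest centroid for every point."""
    return [min(range(len(mu)), key=lambda k: math.dist(x, mu[k])) for x in X]

def run_kmeans(X, K, rng, iters=50):
    """One k-means run from a random initialization; returns (c, mu)."""
    mu = rng.sample(X, K)
    for _ in range(iters):
        c = assign(X, mu)
        for k in range(K):
            pts = [x for x, ci in zip(X, c) if ci == k]
            if pts:  # assumption: an empty cluster keeps its old centroid
                mu[k] = tuple(sum(v) / len(pts) for v in zip(*pts))
    return assign(X, mu), mu

def distortion(X, c, mu):
    """Mean squared distance from each point to its assigned centroid."""
    return sum(math.dist(x, mu[ci]) ** 2 for x, ci in zip(X, c)) / len(X)

def best_of(X, K, restarts=100, seed=0):
    """Repeat random initialization + k-means and keep the lowest-cost run."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        c, mu = run_kmeans(X, K, rng)
        cost = distortion(X, c, mu)
        if best is None or cost < best[0]:
            best = (cost, c, mu)
    return best

X = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0), (9.1, 0.3)]
cost, c, mu = best_of(X, K=3, restarts=20)
print(cost, c)
```

Because every restart starts from different examples, runs that get stuck in a poor local optimum end up with a higher cost and are discarded.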

Page 24: Kmeans with canopy clustering

Choosing the value of K

Elbow method:

[Figure: cost function plotted against K (no. of clusters), for K = 1 to 8.]

Choose K=3

Page 25: Kmeans with canopy clustering

Choosing the value of K

Elbow method:

[Figure: two plots of the cost function against K (no. of clusters), one for K = 1 to 8 and one for K = 1 to 9.]

Choose K=3
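The slides pick the elbow by eye. One way to approximate that choice in code is to look for the K where the improvement in cost falls off most sharply. The heuristic below (largest second difference) is an assumption introduced here for illustration, not the method used in the talk, and the cost values are made up.

```python
def elbow(costs):
    """Return the K at which the cost curve bends most sharply.

    `costs` maps consecutive K values to the k-means cost obtained for that K.
    """
    ks = sorted(costs)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        # Drop in cost when reaching k, minus the drop when going past k.
        bend = (costs[prev] - costs[k]) - (costs[k] - costs[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up costs for illustration only; these are not measurements from the slides.
costs = {1: 100.0, 2: 60.0, 3: 25.0, 4: 20.0, 5: 17.0, 6: 15.0, 7: 14.0, 8: 13.5}
print(elbow(costs))  # prints 3
```

When the curve has no clear bend, a heuristic like this still returns some K, so the plot should be checked by eye as well.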

Page 26: Kmeans with canopy clustering

DEMO k-means


Page 27: Kmeans with canopy clustering

Demo Starbucks area clustering

k=5:
- Seoul, Gyeonggi
- Gangwon
- Busan, Gyeongsang
- Jeolla, Jeju
- Chungcheong

k=3:
- Seoul, Gyeonggi, Gangwon
- Chungcheong, Jeolla, Jeju
- Busan, Gyeongsang

Page 28: Kmeans with canopy clustering

Demo Local optima


Page 29: Kmeans with canopy clustering

Demo Find Elbow

[Figure: cost plotted against the number of clusters K, for K = 2 to 7.]

Page 30: Kmeans with canopy clustering

Canopy Clustering Threshold T1, T2

Select one of the samples as a cluster center.

Page 31: Kmeans with canopy clustering

Canopy Clustering Threshold T1, T2

A sample inside T2 is a member of the cluster and cannot become a cluster center.

Page 32: Kmeans with canopy clustering

Canopy Clustering Threshold T1, T2

A sample outside T2 but inside T1 is a member of the cluster and can still become a cluster center itself.

Page 33: Kmeans with canopy clustering

Canopy Clustering Overlap

Overlap: a sample that falls inside more than one canopy is assigned to the closest cluster.
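The T1/T2 rules and the overlap rule can be sketched as below. This is a minimal illustration that assumes T1 > T2, Euclidean distance, and that the next center is simply the first remaining sample (implementations often pick it at random). The function names are hypothetical and this is not the demo code from the talk.

```python
import math

def canopy(points, t1, t2):
    """Build canopies with loose threshold t1 and tight threshold t2 (t1 > t2)."""
    if not t1 > t2:
        raise ValueError("expected t1 > t2")
    candidates = list(points)  # samples that may still become a center
    canopies = []
    while candidates:
        center = candidates.pop(0)  # assumption: take the first remaining sample
        members = [center]
        remaining = []
        for p in candidates:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)    # inside T1: member of this canopy
            if d >= t2:
                remaining.append(p)  # outside T2: may still become a center
        candidates = remaining
        canopies.append((center, members))
    return canopies

def assign_closest(points, canopies):
    """Resolve overlap by giving each sample to the nearest canopy center."""
    centers = [center for center, _ in canopies]
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
canopies = canopy(points, t1=3.5, t2=1.5)
labels = assign_closest(points, canopies)
print([center for center, _ in canopies])
print(labels)
```

In this example the point (3.0, 0.0) lies inside T1 of the first center but outside its T2, so it joins the first canopy and later becomes a center itself. The final assignment step then places it in its own, closer canopy.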

Page 34: Kmeans with canopy clustering

Canopy clustering: finding the perfect k using canopy clustering

[Figure: three panels showing the sample data, canopies with > 5% of population, and canopies with > 10% of population.]

"Finding the perfect k using canopy clustering"

"Seeding k-means centroids using canopy generation"

- Mahout in Action -
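A sketch of how the population thresholds could turn canopies into a value of k and a set of k-means seeds is shown below. The helper name, the `(center, members)` representation and the sample canopies are hypothetical. Using the canopy center as the seed is also an assumption (a canopy's mean would be another reasonable choice).

```python
def seeds_from_canopies(canopies, population, min_fraction=0.10):
    """Keep canopies holding more than `min_fraction` of all samples.

    Returns (K, centers): the number of retained canopies and their centers,
    which can then be passed to k-means as the initial centroids.
    """
    kept = [center for center, members in canopies
            if len(members) > min_fraction * population]
    return len(kept), kept

# Hypothetical canopies as (center, members) pairs, for illustration only.
canopies = [
    ((0.0, 0.0), [(0.0, 0.0)] * 6),
    ((5.0, 5.0), [(5.0, 5.0)] * 3),
    ((9.0, 1.0), [(9.0, 1.0)] * 1),
]
K, centers = seeds_from_canopies(canopies, population=10, min_fraction=0.10)
print(K, centers)
```

Raising the threshold discards small canopies, which are often outliers, so the same data yields a smaller k at 10% than at 5%.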

Page 35: Kmeans with canopy clustering

DEMO canopy, canopy & k-means


Page 36: Kmeans with canopy clustering

Demo Starbucks area clustering

[k-means] K=5, elapsed time=2.57 sec

[canopy] canopy[K]=5, elapsed time=0.012 sec

Page 37: Kmeans with canopy clustering

Demo K-means with canopy

[k-means] K=5, elapsed time=2.57 sec

[k-means + canopy] canopy[K]=5, elapsed time=1.89 sec

Page 38: Kmeans with canopy clustering

Demo Canopy clustering to find user POI (Point of Interest)

[Figure: GPS data of one month.]

Page 39: Kmeans with canopy clustering

Demo Canopy clustering to find user POI (Point of Interest)

[Figure: all canopies.]

Page 40: Kmeans with canopy clustering

Demo Canopy clustering to find user POI (Point of Interest)

[Figure: canopies with > 10% of population.]

Page 41: Kmeans with canopy clustering

Reference

Machine Learning (Andrew Ng), Clustering chapter. http://www.coursera.com

Page 42: Kmeans with canopy clustering

Reference

Mahout. http://mahout.apache.org

Mahout in Action

Page 43: Kmeans with canopy clustering

THANKS
