
Clustering Using REpresentatives: An Efficient Clustering Algorithm for Large Databases

Bilzã Marques de Araújo

SCC5895 - Análise de Agrupamento de Dados (Data Clustering Analysis)

Seminar, December 9, 2010


Contents

1 Introduction

2 CURE

3 Improvements - Large Scale Databases

4 Results


Main References

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: an efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98 (pp. 73–84). New York, NY, USA: ACM.

Guha, S., Rastogi, R., & Shim, K. (2001). CURE: an efficient clustering algorithm for large databases. Information Systems, 26, 35–58.

Theodoridis, S. & Koutroumbas, K. (2009). Pattern Recognition (pp. 683–685). Academic Press, 4th edition.


Introduction

Clustering problem: given a set of objects, group them into clusters such that each object in a cluster is more similar to the other objects in the same cluster than to objects in other clusters.


Representatives

When merging clusters, agglomerative algorithms use "representatives":

centroids, medoids: dmean

all objects in the cluster: dmin, dmax, dave
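The four distances named above can be sketched in a few lines. This is a minimal illustration, assuming points are coordinate tuples and the underlying metric is Euclidean; the function names simply mirror the slide's notation and are not taken from any library.

```python
import math
from itertools import product

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def d_mean(a, b):
    """Distance between the centroids of the two clusters."""
    return math.dist(centroid(a), centroid(b))

def d_min(a, b):
    """Smallest pairwise distance (single link)."""
    return min(math.dist(p, q) for p, q in product(a, b))

def d_max(a, b):
    """Largest pairwise distance (complete link)."""
    return max(math.dist(p, q) for p, q in product(a, b))

def d_ave(a, b):
    """Average of all pairwise distances (average link)."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

# Two small illustrative clusters (made-up data).
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
print(d_mean(a, b), d_min(a, b), d_max(a, b), d_ave(a, b))
```

Note that d_mean needs only one point per cluster, while the other three look at every pair of objects.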


Shortcomings of classical approaches

Classical approaches favor spherical clusters of similar size, or are fragile in the presence of outliers


Clustering Using REpresentatives

Hierarchical data clustering technique

Each cluster Ci is represented by min(c, |Ci|) points that describe the shape* of the cluster

Two clusters C1 and C2 are merged if, over all pairs of distinct clusters Ci, Cj,

dcure(C1, C2) = min(dcure(Ci, Cj))

dcure(Ci, Cj) is the smallest distance between a representative of Ci and a representative of Cj

Clusters are merged until k clusters are obtained
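The merge rule above can be sketched as follows. This is a brute-force illustration, assuming Euclidean distance and representatives stored as lists of coordinate tuples; `closest_pair` is a hypothetical helper name.

```python
import math
from itertools import combinations

def d_cure(reps_i, reps_j):
    """Smallest distance between a representative of one cluster and one of the other."""
    return min(math.dist(p, q) for p in reps_i for q in reps_j)

def closest_pair(reps):
    """Indices (i, j), i < j, of the two clusters with the smallest d_cure.

    `reps` holds one list of representative points per cluster.
    """
    return min(combinations(range(len(reps)), 2),
               key=lambda ij: d_cure(reps[ij[0]], reps[ij[1]]))

# Made-up representatives for three clusters.
reps = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(1.5, 0.0), (3.0, 0.0)],
    [(10.0, 10.0)],
]
print(closest_pair(reps))
```

Scanning every pair at every step is quadratic in the number of clusters, so this only shows the rule itself, not an efficient way to apply it.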


c well-scattered objects
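The slide does not say how the c well-scattered objects are chosen. Guha et al. describe a farthest-point heuristic: start with the point farthest from the cluster mean, then repeatedly add the point farthest from those already chosen. A minimal sketch of that heuristic, assuming Euclidean distance and a hypothetical function name `well_scattered`:

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def well_scattered(points, c):
    """Pick min(c, len(points)) points using a farthest-point heuristic."""
    mean = centroid(points)
    chosen = []                           # indices of selected points
    remaining = list(range(len(points)))  # indices still available
    for _ in range(min(c, len(points))):
        if not chosen:
            # First pick: the point farthest from the cluster mean.
            best = max(remaining, key=lambda i: math.dist(points[i], mean))
        else:
            # Later picks: the point whose nearest chosen point is farthest away.
            best = max(remaining,
                       key=lambda i: min(math.dist(points[i], points[j]) for j in chosen))
        chosen.append(best)
        remaining.remove(best)
    return [points[i] for i in chosen]

# Corners of a square plus its center (made-up data).
square = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (2.0, 2.0)]
print(well_scattered(square, 4))
```

On this example the four picks are the corners, which outline the shape of the cluster, while the central point is left out.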


c representatives

Each representative p is shrunk toward the centroid by a factor α

p ← p + α(x − p), where x is the centroid of the cluster

Two clusters are merged according to their representatives
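The shrinking step is a linear interpolation between each representative and the centroid. A small sketch, assuming points are coordinate tuples; `shrink` is a hypothetical name:

```python
def shrink(reps, center, alpha):
    """Move each representative a fraction alpha of the way toward the centroid.

    Applies p + alpha * (x - p) coordinate by coordinate, with x = center.
    """
    return [tuple(p_k + alpha * (x_k - p_k) for p_k, x_k in zip(p, center))
            for p in reps]

reps = [(0.0, 0.0), (4.0, 0.0)]
center = (2.0, 1.0)
print(shrink(reps, center, 0.5))
```

With alpha = 0 the representatives stay where they are, and with alpha = 1 they all collapse onto the centroid.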


Algorithm
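Putting the previous slides together, the agglomerative loop can be sketched as below. This is an illustrative brute-force version under stated assumptions: Euclidean distance, a farthest-point choice of scattered points, and a full pairwise scan at every merge. Guha et al. use a heap and a k-d tree to make this efficient; none of that is reproduced here. All function names are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def representatives(points, c, alpha):
    """Pick up to c scattered points, then shrink them toward the centroid."""
    mean = centroid(points)
    chosen, remaining = [], list(points)
    while remaining and len(chosen) < c:
        if not chosen:
            best = max(remaining, key=lambda p: math.dist(p, mean))
        else:
            best = max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen))
        chosen.append(best)
        remaining.remove(best)
    return [tuple(pk + alpha * (mk - pk) for pk, mk in zip(p, mean)) for p in chosen]

def d_cure(reps_i, reps_j):
    """Smallest distance between representatives of two clusters."""
    return min(math.dist(p, q) for p in reps_i for q in reps_j)

def cure(points, k, c=4, alpha=0.3):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per object
    reps = [representatives(cl, c, alpha) for cl in clusters]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: d_cure(reps[ij[0]], reps[ij[1]]))
        merged = clusters[i] + clusters[j]
        # i < j, so removing index j leaves index i valid.
        del clusters[j], reps[j]
        clusters[i] = merged
        reps[i] = representatives(merged, c, alpha)
    return clusters

# Two well-separated blobs (made-up data).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
print(cure(data, k=2))
```

Representatives are recomputed from all points of the merged cluster after every merge, so the shrinking is always relative to the current centroid.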


Improvements - Large Scale Databases

Databases with hundreds of thousands of objects

Memory limitations

Computational infeasibility: O(n² log n)

Database → Sampling of the data → Partitioning of the sample → Partial clustering of each partition → Outlier elimination → Clustering of the partial clusters → Labeling of the data on disk


Sampling

A sufficiently large sample preserves the characteristics of the clusters and eliminates outliers

According to Chernoff bounds, for well-defined clusters (dense within, sparse between):

$$s_{\min} = \xi k\rho + k\rho \log\frac{1}{\delta} + k\rho \sqrt{\left(\log\frac{1}{\delta}\right)^{2} + 2\xi \log\frac{1}{\delta}}$$

It does not depend on the number of objects n, but on the number of well-defined clusters, k

With clusters of varying density, k must be chosen according to the dense regions
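The bound is easy to evaluate numerically. The sketch below assumes the symbol meanings used by Guha et al., which the slide does not spell out: ξ is the minimum number of sample points wanted from the smallest cluster, ρ > 1 relates the smallest cluster size to the average size n/k (smallest size n/(kρ)), and δ is the allowed failure probability. The natural logarithm is assumed, as is usual for Chernoff bounds. The parameter values in the example are illustrative only.

```python
import math

def min_sample_size(k, rho, xi, delta):
    """Evaluate s_min = xi*k*rho + k*rho*log(1/delta)
    + k*rho*sqrt(log(1/delta)**2 + 2*xi*log(1/delta))."""
    log_term = math.log(1.0 / delta)  # natural logarithm (assumption)
    return (xi * k * rho
            + k * rho * log_term
            + k * rho * math.sqrt(log_term ** 2 + 2 * xi * log_term))

# Illustrative values: 100 clusters, rho = 2, at least 10 points per cluster,
# failure probability 0.001.
print(round(min_sample_size(k=100, rho=2, xi=10, delta=0.001)))
```

The function takes no argument for n, which reflects the remark above: the required sample grows linearly with k and ρ, not with the size of the database.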


Partitioning

Partitioning of the sample

Hierarchical clustering of each partition until n/(pq) clusters remain

Allows better handling of outliers*

Further reduces the computational cost

a² + b² < (a + b)²
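The inequality generalizes to p partitions. As a worked illustration, assume (this is not stated on the slide) that the clustering cost grows quadratically with the number of objects and that m objects are split into p partitions of equal size; m is a hypothetical symbol introduced only for this example.

```latex
\underbrace{p \cdot \left(\frac{m}{p}\right)^{2}}_{\text{clustering } p \text{ partitions}}
  = \frac{m^{2}}{p} \;<\; m^{2}
  \qquad (p > 1),
\qquad\text{e.g. } m = 2500,\ p = 5:\quad 5 \cdot 500^{2} = 1\,250\,000 \;<\; 2500^{2} = 6\,250\,000 .
```

Under these assumptions, partitioning divides the quadratic part of the cost by p, before counting the final pass that clusters the partial clusters.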


Outlier removal

The model is still susceptible to outliers

Sampling and the parameter α may not eliminate the full effect of outliers

Since outliers tend to be merged only at the end of the hierarchy,

clusters with few objects are removed from the sample

at the end of each partition and when the number of clusters reaches k
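Removing small clusters amounts to a size filter. The slide gives no numeric threshold for "few objects", so `min_size` below is a hypothetical parameter chosen by the user:

```python
def drop_small_clusters(clusters, min_size):
    """Keep only the clusters that contain at least min_size objects."""
    return [cluster for cluster in clusters if len(cluster) >= min_size]

# Made-up partial clusters: two real groups and two likely outliers.
partial = [[1, 2, 3, 4], [5], [6, 7, 8], [9]]
print(drop_small_clusters(partial, min_size=2))
```

The filter relies on the observation above: outliers join other clusters late, so at this stage they still sit in very small clusters.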


Labeling on disk

Objects on disk are labeled according to the cluster representatives

Each object is assigned to the cluster of its nearest representative
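The labeling rule can be sketched as a nearest-representative lookup. This is a minimal version, assuming Euclidean distance and one list of representatives per cluster; `label` is a hypothetical name, and reading objects from disk is left out.

```python
import math

def label(point, cluster_reps):
    """Index of the cluster that owns the representative nearest to `point`."""
    return min(range(len(cluster_reps)),
               key=lambda i: min(math.dist(point, r) for r in cluster_reps[i]))

# Made-up representatives for two clusters.
cluster_reps = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(10.0, 10.0), (11.0, 9.0)],
]
print([label(p, cluster_reps) for p in [(0.5, 0.2), (9.0, 9.5)]])
```

Each object is compared only against the representatives of each cluster, never against the full dataset, so the pass over the data on disk stays cheap.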


Comparison with BIRCH and MST

n = 100000, s = 2500, c = 10, α = 0.3, p = 1


Comparison with BIRCH and MST

n = 121560, s = 3000, c = 100, α = 0.15, p = 1


Shrink factor α

n = 100000, s = 2500, c = 10, p = 1, α = [0.1, 0.9]

When α = 0, the result is similar to MST; when α = 1, it is similar to BIRCH


Number of representatives, c

n = 100000, s = 2500, α = 0.3, p = 1, c = [1, 100]


Size of the representative sample, s

n = 100000, α = 0.3, c = 10, p = 1, s = [500, 5000]


Number of partitions, p

n = 100000, α = 0.3, c = 10, s = 2500, p = [1, 100]


Performance compared with BIRCH


Performance


Thank you for your attention!
