Cluster Analysis

Post on 19-Dec-2015


Page 1: Cluster Analysis. 2 First used by Tryon (1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective

Cluster Analysis

Cluster analysis, a term first used by Tryon (1939), encompasses a number of different algorithms and methods for grouping objects of a similar kind into respective categories.

Purpose of Cluster Analysis

Cluster analysis simplifies and classifies data: it places similar individuals (observations) into the same group.

Objects are gathered into groups according to certain attributes, so that objects within the same cluster share the same characteristics (homogeneity), while different clusters differ markedly from one another.

Viewed geometrically, members of the same cluster should lie close together, while members of different clusters should lie far apart.

Applications of Cluster Analysis

Teaching, medicine, sociology, psychology, economics, biology

Computation in Cluster Analysis

Clustering is based on "distance" or "similarity" data for the observations. The smaller the distance measure between two observations, the more alike they are in some respect, and the larger their similarity measure.

Using the computed distance matrix or similarity-coefficient matrix, the N observations are merged step by step according to some criterion until they form a few representative clusters.

Distance measures: the distance between points serves as the measure. The most commonly used is the Euclidean distance. With N observations, each having M attributes, let X be the N × M data matrix; the Euclidean distance between points i and j is

$$d_{ij} = \left[ \sum_{p=1}^{M} (x_{ip} - x_{jp})^2 \right]^{1/2}$$

If the attributes are measured in different units, standardize each variable before computing the Euclidean distance, so that it has mean 0 and standard deviation 1.
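As a minimal sketch of the two steps above (standardize, then measure), the following Python computes a Euclidean distance matrix from standardized columns. The sample values, the function names, and the use of the population standard deviation are assumptions made for illustration; the slides do not specify them.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation; no column may be constant.
    stds = [pstdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def euclidean(u, v):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(rows):
    """Full N x N matrix of pairwise Euclidean distances."""
    return [[euclidean(u, v) for v in rows] for u in rows]


# Hypothetical data: two attributes in very different units.
data = [[170.0, 60000.0], [180.0, 62000.0], [160.0, 58000.0]]
z = standardize(data)
d = distance_matrix(z)
```

Without the standardization step the second attribute would dominate every distance simply because its units are larger.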

Similarity measures:

The greater the similarity, the smaller the dissimilarity between two observations; when working with a similarity matrix, the clusters with the largest similarity values are therefore merged first.

The similarity between two observations can be measured with the following matching coefficient:

$$M_{ij} = \frac{a + b}{m}$$

Here a is the number of attributes that observations i and j both possess, b is the number of attributes that neither i nor j possesses, and m is the total number of attributes.

Ex: Suppose i and j have the following attributes (1 means the attribute is present, 0 means it is absent):

     A  B  C  D  E  F
i    1  0  0  1  1  1
j    0  0  1  1  0  0

The matching coefficient is then:

$$M_{ij} = \frac{a + b}{m} = \frac{1 + 1}{6} = \frac{2}{6} = \frac{1}{3}$$
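The worked example can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes both observations are coded as equal-length lists of 0/1 values.

```python
def matching_coefficient(x, y):
    """Share of attributes on which two 0/1 profiles agree: (a + b) / m."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of attributes")
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)  # both present
    b = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)  # both absent
    return (a + b) / len(x)


# Attributes A..F from the example above.
i = [1, 0, 0, 1, 1, 1]
j = [0, 0, 1, 1, 0, 0]
m_ij = matching_coefficient(i, j)
```

Here only D is shared (a = 1) and only B is jointly absent (b = 1), so the coefficient is 2/6.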

Methods of Cluster Analysis

Non-hierarchical cluster analysis starts directly from the distance or similarity matrix and comes in several forms:

a. Sequential threshold method: A cluster seed is chosen in advance together with a threshold value; every observation whose distance from this center falls within the threshold forms one cluster. A new cluster seed is then chosen, the observations not yet assigned to any cluster are assigned to this second cluster, and the procedure continues in sequence.

b. Parallel threshold method: Several cluster seeds are selected simultaneously at the outset, together with a threshold value; each observation is then assigned to the nearest cluster center according to the threshold, forming the clusters. The threshold can also be adjusted to admit more (or fewer) observations into each cluster.

c. Optimizing partitioning method: Based on some criterion (for example, minimizing the average within-cluster distance), different partitions are tried repeatedly until the criterion measure reaches its best value.

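d. K-means method: This method combines the approaches above. The observations are first partitioned into K clusters; the distance from each observation to each cluster centroid is computed, and each observation is assigned to the nearest cluster. The centroids of the cluster that gained the observation and of the cluster that lost it are then recomputed, and the distances from each observation to the centroids are evaluated again. The calculation repeats until no observation needs to be reassigned.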
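The sequential threshold method (item a) can be sketched as below. The description does not say how each new seed is chosen, so this sketch assumes the first unassigned observation becomes the next seed; the function name and the sample points are hypothetical.

```python
import math


def sequential_threshold(points, threshold):
    """Form clusters one at a time around successive seeds."""
    unassigned = list(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned[0]  # assumption: first unassigned point is the seed
        members = [k for k in unassigned
                   if math.dist(points[seed], points[k]) <= threshold]
        clusters.append(members)
        unassigned = [k for k in unassigned if k not in members]
    return clusters


# Hypothetical two-dimensional observations.
points = [(0, 0), (1, 0), (10, 10), (11, 10), (0, 1)]
clusters = sequential_threshold(points, threshold=2.0)
```

The parallel threshold method differs mainly in that all seeds are fixed before any observation is assigned.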

Hierarchical Cluster Analysis

Its defining feature is that every new cluster is formed by merging or splitting clusters from the previous level, so the result of the analysis is a tree structure. Among hierarchical divisive methods, a common one is the average-distance splitting method, whose steps are:

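First find the observation whose average distance from the other observations is largest; call it the splinter group, and call the remaining observations the main group. Then compute the distances between the splinter group and the main group, and among the observations within the main group.

If the distance between an observation in the main group and the other observations in the main group is greater than its distance to the splinter group, move it to the splinter group; otherwise it stays in the main group.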

The K-means computation proceeds in these steps:

1. Partition the observations into K initial clusters.

2. Compute the distance (usually Euclidean) from an observation to each cluster center (mean), and assign the observation to the nearest cluster. Then recompute the centers of the two clusters involved: the one that gained the observation and the one that lost it.

3. Repeat step 2 until no observation needs to be reassigned to another cluster.

Ex: Values of four observations on two variables

Observation    1    2    3    4
X1            12   -8    4   -8
X2             8    4   -2   -6

First, split the four observations arbitrarily into two clusters, for example cluster {1, 2} and cluster {3, 4}, and compute the centroid coordinates of the two clusters:

Cluster {1, 2}: $\bar{X}_1 = \frac{12 + (-8)}{2} = 2$, $\bar{X}_2 = \frac{8 + 4}{2} = 6$

Cluster {3, 4}: $\bar{X}_1 = \frac{4 + (-8)}{2} = -2$, $\bar{X}_2 = \frac{-2 + (-6)}{2} = -4$

Next, compute the squared Euclidean distance from each observation to each cluster centroid and assign the observation to the nearest cluster. For example, for observation 1 and cluster {1, 2}: $D^2(1, \{1, 2\}) = (12 - 2)^2 + (8 - 6)^2 = 104$.

These calculations show that observation 4 is closer to cluster {3, 4}, so it need not be reassigned. Observation 2 is also closer to cluster {3, 4} (squared distance 100, versus 104 to cluster {1, 2}), so it is reassigned to cluster {3, 4}. This gives two new clusters, {1} and {2, 3, 4}, whose centroid coordinates are:

Cluster {1}: $\bar{X}_1 = 12$, $\bar{X}_2 = 8$

Cluster {2, 3, 4}: $\bar{X}_1 = \frac{-8 + 4 - 8}{3} = -4$, $\bar{X}_2 = \frac{4 - 2 - 6}{3} = -1.33$

Then compute the squared Euclidean distances from each observation to cluster {1} and to cluster {2, 3, 4}:

Cluster          1        2        3        4
{1}              0      416      164      596
{2, 3, 4}   343.05    44.41    64.45    37.81

The table shows that observation 1 is closest to cluster {1}, and observations 2, 3, and 4 are closest to cluster {2, 3, 4}. No further reassignment is needed, so the K = 2 clusters are cluster {1} and cluster {2, 3, 4}.
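The worked example can be reproduced with a short Python sketch. It is a simplified batch variant: all observations are reassigned in one pass before the centroids are recomputed, rather than one observation at a time as in step 2. The function names are hypothetical, and the sketch assumes no cluster ever becomes empty.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(data, initial_partition, max_iter=100):
    """data maps labels to points; initial_partition is a list of label lists."""
    clusters = [list(c) for c in initial_partition]
    for _ in range(max_iter):
        centers = [centroid([data[label] for label in c]) for c in clusters]
        new_clusters = [[] for _ in clusters]
        for label, point in data.items():
            nearest = min(range(len(centers)),
                          key=lambda k: squared_distance(point, centers[k]))
            new_clusters[nearest].append(label)
        if new_clusters == clusters:  # nothing moved: converged
            break
        clusters = new_clusters
    centers = [centroid([data[label] for label in c]) for c in clusters]
    return clusters, centers


# The four observations (X1, X2) from the example.
data = {1: (12, 8), 2: (-8, 4), 3: (4, -2), 4: (-8, -6)}
clusters, centers = k_means(data, [[1, 2], [3, 4]])
```

Starting from {1, 2} and {3, 4}, the first pass moves observation 2, and the second pass changes nothing, which matches the final clusters {1} and {2, 3, 4}.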

Two-stage Cluster Sampling When Clusters are of Unequal Size

p = n/N: desired sample proportion
a: desired number of clusters selected in the first stage
A: total number of clusters
b: sample size within each selected cluster
Ni: number of elements in cluster i

Simple Two-stage Cluster Sampling

The first-stage probability: p1 = a/A
The second-stage probability: p2 = p/(a/A)
Sample size in cluster i: ni = p2 * Ni

Probability Proportional to Size

$$p = \frac{a N_i}{\sum_{i=1}^{A} N_i} \cdot \frac{b}{N_i} = \frac{ab}{\sum_{i=1}^{A} N_i} = \frac{n}{N}$$

where

$$b = \frac{p \sum_{i=1}^{A} N_i}{a}$$

Example

Draw a sample of 1,000 households from a city that contains about 200,000 households distributed among 2,000 blocks of unequal but known size.

The desired sample proportion = 1/200
The desired number of clusters selected in the first stage = 100
How do we conduct the two-stage cluster sampling?
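One way to work through this example is sketched below in Python, using the formulas for the simple and the probability-proportional-to-size designs. The variable and function names are hypothetical, and the city total is taken as exactly 200,000 households.

```python
N = 200_000  # households in the city (taken as exact here)
A = 2_000    # blocks (clusters)
n = 1_000    # desired sample size
a = 100      # blocks selected in the first stage

p = n / N    # desired overall sampling proportion, 1/200

# Simple two-stage design: equal-probability blocks, fixed fraction within.
p1 = a / A   # first-stage probability
p2 = p / p1  # second-stage probability


def simple_sample_size(block_size):
    """Households to draw from a selected block of the given size."""
    return p2 * block_size


# Probability proportional to size: fixed take b per selected block.
b = p * N / a


def pps_first_stage_probability(block_size):
    """Chance that a block of the given size is selected."""
    return a * block_size / N
```

Under the simple design one block in twenty is chosen and one household in ten is taken within it; under PPS each selected block contributes about 10 households, and a block's selection probability times 10/Ni again equals 1/200.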

What is Cluster Analysis?

Cluster Analysis is a class of statistical techniques that can be applied to data that exhibit natural groupings.

CA is an interdependence technique that makes no distinction between dependent and independent variables.

There is NO statistical significance testing in CA.

CA is more a group of different algorithms that put objects into clusters following “well-defined similarity rules.”


What is A Cluster?

A cluster is a group of relatively homogeneous cases and observations.

Clusters exhibit high internal homogeneity and high external heterogeneity.


A Cluster Diagram: Drinker’s Perceptions of Alcohol


Characteristics of CA

Cluster Analysis is a tool of discovery.

It discovers structures in data but does NOT explain why they exist.

CA is used when we do not have an a priori hypothesis, but when we are in the exploratory phase.

How does CA differ…

From Discriminant Analysis

Discriminant analysis is a dependence technique.

It predicts the probability that an object will fall into one of two or more mutually exclusive categories based on several independent variables.

It finds a linear combination of independent variables.

Cluster analysis, by contrast, finds natural groupings based on distances among objects.

From Factor Analysis

Factor analysis is similar to cluster analysis in that it is an interdependence technique.

Primary difference lies in the focus on objects and variables.

Factor analysis reduces variables to a few factors. Cluster analysis reduces objects to a few clusters.

Cluster Analysis Methods

Three Cluster Analysis Methods:
Joining (Tree Clustering)
Two-way Joining
K-means Clustering

Joining (Tree Clustering)

A type of hierarchical clustering (agglomerative)

Initially, each unit is a cluster.

Dendrogram

Many other methods

The first level shows all samples xi as singleton clusters. As the level increases, more samples are clustered together in a hierarchical manner.


It is based on sets where each cluster level may contain sets that are subclusters as shown in the Venn diagram.


Two-way Joining Hartigan (1975)

Two-way Joining tries to cluster both variables and objects.

Only useful if you think clustering along BOTH lines will be useful.

Very rare in application.

k-Means Clustering:

Begin with a preconception about the number of clusters (k).

It can be thought of as ANOVA in reverse. ANOVA evaluates between-group variance against within-group variance when computing the statistical significance of the hypothesis that groups are different.

In k-Means, the computer moves objects in and out of the groups to get the most significant ANOVA results.

It’s all about distance…

Distance Measures:
Euclidean Distance
Squared Euclidean Distance
Manhattan Distance
Chebychev Distance
Power Distance

EQUATION: Euclidean Distance

Basic equation for determining distance measure.

$$\text{Distance}(x, y) = \left\{ \sum_i (x_i - y_i)^2 \right\}^{1/2}$$

A standard formula for determining the distance between two points on a plane
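For comparison, the distance measures listed earlier can be written as short Python functions. The definitions below are the usual textbook ones; the slides only give the Euclidean formula, so the other four (in particular the two-parameter form of the power distance) are assumptions, and all function names are hypothetical.

```python
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def power_distance(x, y, p, r):
    """(sum |a - b|^p)^(1/r); p = r = 2 gives the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / r)


# Two points on a plane.
x, y = (0, 0), (3, 4)
```

For these two points the measures give 5, 25, 7, and 4 respectively, which shows how strongly the choice of measure can change the scale of the distances.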


Fairly simple, right?


In other words, how do we get from this…


To this…


To this…


How to Determine Clusters.

Use a computer.

Call a professional.


Clusters in the Real World

Why is Cluster Analysis Important?

Relatively new/evolving technique

Highly useful for market segmentation

Segmentation = identifying groupings of customers using statistical multivariate analysis, often based on perceptions and attitudes as well as demographics and behavior.

Segmentation is helpful to small companies attempting to carve out a niche and to large companies trying to tailor their products/services to different segments.

In addition to segmentation, clusters are used to…

Design products and establish brands
Target direct mail
Make decisions about customer conversion and retention
Decide on marketing cost levels

Ex: Luxury Car Customers

Demographic examples are easier to illustrate.

Demographics:
Gender
Education
Age

149 customers (objects) of a luxury car dealership

Using SPSS for Clustering

Chose “TwoStep Cluster Analysis”

Basically, the agglomerative technique (dendrogram).

Step One: Creates very small (individual) sub-clusters.

Step Two: Clusters the sub-clusters into the desired number of clusters.

Automatically finds the optimum number of clusters.

Two-Step CA Output

What are these clusters?

Cluster Distribution

Cluster      N     % of Combined   % of Total
1            43    28.9%           28.9%
2            27    18.1%           18.1%
3            29    19.5%           19.5%
4            21    14.1%           14.1%
5            29    19.5%           19.5%
Combined     149   100.0%          100.0%
Total        149                   100.0%

Two-Step CA Output

Age (frequency and percent within each age category)

Cluster     Age 2         Age 3         Age 4         Age 5         Age 6
1            0    0.0%    16   34.8%    18   35.3%     7   33.3%     2   20.0%
2           16   76.2%     0    0.0%    10   19.6%     0    0.0%     1   10.0%
3            3   14.3%     4    8.7%    15   29.4%     1    4.8%     6   60.0%
4            2    9.5%    10   21.7%     7   13.7%     1    4.8%     1   10.0%
5            0    0.0%    16   34.8%     1    2.0%    12   57.1%     0    0.0%
Combined    21  100.0%    46  100.0%    51  100.0%    21  100.0%    10  100.0%

Education (frequency and percent within each education category)

Cluster     Educ 1        Educ 2        Educ 3        Educ 4        Educ 5
1            0    0.0%     5   41.7%     0    0.0%    38   60.3%     0    0.0%
2            2  100.0%     0    0.0%     0    0.0%    19   30.2%     6   17.6%
3            0    0.0%     0    0.0%    29   76.3%     0    0.0%     0    0.0%
4            0    0.0%     0    0.0%     0    0.0%     0    0.0%    21   61.8%
5            0    0.0%     7   58.3%     9   23.7%     6    9.5%     7   20.6%
Combined     2  100.0%    12  100.0%    38  100.0%    63  100.0%    34  100.0%

Gender (frequency and percent within each gender category)

Cluster     Gender 0      Gender 1
1            0    0.0%    43   48.9%
2           19   31.1%     8    9.1%
3           13   21.3%    16   18.2%
4            0    0.0%    21   23.9%
5           29   47.5%     0    0.0%
Combined    61  100.0%    88  100.0%

What does this mean?

Cluster 5:
Age: 36 - 65
Education: High School graduate or above
Gender: Female

Could have used k-Means, which would have generated different results.

Clustering is a powerful marketing research tool.

Claritas: Clustering Experts

Example: Claritas Corporation

Claritas founded the U.S. geodemographic industry when it launched the first PRIZM segmentation system in 1974.

PRIZM (Potential Rating Index for Zip Markets) categorizes every U.S. neighborhood into 1 of 62 “clusters.”

Descriptive Names:
Money and Brains
Young Literati
Shotguns and Pickups

Money and Brains

Sophisticated Urban Fringe Couples

Cluster is a mix of family types: singles, married couples with children and married couples without children. These families own their homes in upscale neighborhoods near cities. Dual incomes provide luxuries, travel and entertainment.

Demographics: Affluent
Age Groups: 55-64, 65+
Predominantly White, High Asian


Clusters Work!

At a conservative estimate, more than 20,000 companies in the United States and Canada alone used clusters as part of their marketing information mix last year.

Web Sources

http://cwis.livjm.ac.uk/bus/busrmccl/ae230/lect10.ppt
http://www.clusterbigip1.claritas.com/claritas/Default.jsp?main=3&submenu=seg&subcat=segprizm
http://www.clusterbigip1.claritas.com/claritas/Default.jsp?main=3&submenu=seg&subcat=segprizmne
http://www.insightsc.ie/newsletter7.htm
http://www.directionsmag.com/article.asp?article_id=12
http://fun.supereva.it/scoleri.freeweb/cern/biografie/hawking.jpg
http://www.statsoft.com/textbook/stcluan.html
http://www-db.stanford.edu/~ullman/mining/cluster1.pdf
http://www.snr.missouri.edu/multivariate/ClusterAnalysis.pdf


Print Sources

Recent Developments in Clustering and Data Analysis. Edited by Chikio Hayashi, Edwin Diday, Michel Jambou, Noboru Ohsumi. Academic Press, Inc. 1988.

Finding Groups in Data: An Introduction to Cluster Analysis. Leonard Kaufman, Peter J. Rousseeuw. John Wiley and Sons, Inc. 1990.

Marketing Research: An Aid to Decision Making. Dr. Alan T. Shao. South-Western. 2002.

Exploring Marketing Research. William G. Zikmund. South-Western. 2003.


Ex 7: Hypothetical Data

Subject Id. Income ($1000) Education (years)

S1 5 5

S2 6 6

S3 15 14

S4 16 15

S5 25 20

S6 30 19

Similarity Matrix (Squared Euclidean Distances)

Id     S1    S2    S3    S4    S5    S6
S1      0     2   181   221   625   821
S2      2     0   145   181   557   745
S3    181   145     0     2   136   250
S4    221   181     2     0   106   212
S5    625   557   136   106     0    26
S6    821   745   250   212    26     0

d(S1, S3) = (15-5)^2 + (14-5)^2 = 181

d(S1, S2) = 2 is the smallest distance (highest similarity), so S1 and S2 are merged.

Centroid Method: Five Clusters

Data For Five Clusters

Cluster   Cluster Members          Income ($1000)     Education (years)
1         S1 & S2 (5,5) (6,6)      5.5 = (5+6)/2      5.5 = (5+6)/2
2         S3                       15                 14
3         S4                       16                 15
4         S5                       25                 20
5         S6                       30                 19

Similarity Matrix (Squared Euclidean Distances)

Id         S1 & S2   S3      S4      S5      S6
S1 & S2    0         162.5   200.5   590.5   782.5
S3         162.5     0       2       136     250
S4         200.5     2       0       106     212
S5         590.5     136     106     0       26
S6         782.5     250     212     26      0

d(S1 & S2, S3) = (5.5-15)^2 + (5.5-14)^2 = 162.5

d(S3, S4) = 2 is the smallest distance (highest similarity), so S3 and S4 are merged.

Centroid Method: Four Clusters

Data For Four Clusters

Cluster   Cluster Members              Income ($1000)       Education (years)
1         S1 & S2 (5,5) (6,6)          5.5 = (5+6)/2        5.5 = (5+6)/2
2         S3 & S4 (15,14) (16,15)      15.5 = (15+16)/2     14.5 = (14+15)/2
3         S5                           25                   20
4         S6                           30                   19

Similarity Matrix (Squared Euclidean Distances)

Id         S1 & S2   S3 & S4   S5      S6
S1 & S2    0         181       590.5   782.5
S3 & S4    181       0         120.5   230.5
S5         590.5     120.5     0       26
S6         782.5     230.5     26      0

d(S1 & S2, S5) = (5.5-25)^2 + (5.5-20)^2 = 590.5

d(S5, S6) = 26 is the smallest distance (highest similarity), so S5 and S6 are merged.

Centroid Method: Three Clusters

Data For Three Clusters

Cluster   Cluster Members              Income ($1000)       Education (years)
1         S1 & S2 (5,5) (6,6)          5.5 = (5+6)/2        5.5 = (5+6)/2
2         S3 & S4 (15,14) (16,15)      15.5 = (15+16)/2     14.5 = (14+15)/2
3         S5 & S6 (25,20) (30,19)      27.5 = (25+30)/2     19.5 = (20+19)/2

Similarity Matrix (Squared Euclidean Distances)

Id         S1 & S2   S3 & S4   S5 & S6
S1 & S2    0         181       680
S3 & S4    181       0         169
S5 & S6    680       169       0

d(S1 & S2, S5 & S6) = (5.5-27.5)^2 + (5.5-19.5)^2 = 680

d(S3 & S4, S5 & S6) = 169 is the smallest distance (highest similarity), so these two clusters are merged.
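The whole merge sequence for the six subjects can be reproduced with a small Python sketch of centroid-method agglomeration on squared Euclidean distances. The function names are hypothetical, and ties are broken by taking the first pair found, which is an assumption.

```python
def centroid(members, data):
    """Mean income and education of the subjects in one cluster."""
    points = [data[m] for m in members]
    return tuple(sum(c) / len(points) for c in zip(*points))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid_clustering(data):
    """Merge the two clusters with the closest centroids until one remains."""
    clusters = [(name,) for name in data]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = squared_distance(centroid(clusters[i], data),
                                     centroid(clusters[j], data))
                if best is None or d < best[0]:  # first pair wins ties
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Income ($1000) and education (years) for the six subjects.
data = {"S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
        "S4": (16, 15), "S5": (25, 20), "S6": (30, 19)}
merges = centroid_clustering(data)
distances = [d for _, _, d in merges]
```

The merge distances come out as 2, 2, 26, 169, and 388.25; their square roots (1.4142, 1.4142, 5.0990, 13, 19.7041) are the centroid distances reported in the SAS output for the same data.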

Exhibit 7-1: SAS Output for cluster analysis on data in Table 7.1

Simple statistics

          Mean      Std Dev   Skewness   Kurtosis   Bimodality
INCOME    16.1667   9.9883     0.2684    -1.4015    0.2211
EDUC      13.1667   6.3692    -0.4510    -1.8108    0.2711

Root-Mean-Square Total-Sample Standard Deviation = 8.376555

Note: the number of clusters cannot be determined from this part of the output.

Step     Number of   Clusters   Frequency of   RMS STD of    Semipartial              Centroid
Number   Clusters    Joined     New Cluster    New Cluster   R-Squared     R-Squared  Distance
1        5           S1  S2     2              0.707107      0.001425      0.998575    1.4142
2        4           S3  S4     2              0.707107      0.001425      0.997150    1.4142
3        3           S5  S6     2              2.549510      0.018527      0.978622    5.0990
4        2           CL4 CL3    4              5.522681      0.240855      0.737767   13.0000
5        1           CL5 CL2    6              8.376555      0.737767      0.000000   19.7041

Root-Mean-Square Total-Sample Standard Deviation (RMSSTD) = 8.376555

Note: the smaller the RMSSTD, the more similar the members within a cluster (it can only be used to check similarity).

Note: three clusters is a good cut point, because R-squared drops sharply (from 0.978622 to 0.737767) at the next merge.

CLUSTER=1
OBS   SID   INCOME   EDUC
1     S1    5        5
2     S2    6        6

CLUSTER=2
OBS   SID   INCOME   EDUC
3     S3    15       14
4     S4    16       15

CLUSTER=3
OBS   SID   INCOME   EDUC
5     S5    25       20
6     S6    30       19

Exhibit 7.2: Non-hierarchical Clustering On Data

Replace=FULL Radius=0 Maxclusters=3 Maxiter=20 Converge=0.02

Initial Seeds

Cluster   INCOME    EDUC
1          5.0000    5.0000
2         30.0000   19.0000
3         16.0000   15.0000

Note: the initial seeds are observations S1, S6, and S4.

Exhibit 7-2 (continued)

Minimum Distance Between Seeds = 14.56022

Iteration   Change in Cluster Seeds
            1          2          3
1           0.707107   2.54951    0.707107
2           0          0          0

Statistics for Variables

Variable   Total STD   Within STD   R-Squared   RSQ/(1-RSQ)
INCOME     9.988327    2.121320     0.972937     35.950617
EDUC       6.369197    0.707107     0.992605    134.222222
OVER-ALL   8.376555    1.581139     0.978622     45.777778

Notes:

RMSSTD: remember to compare it with the Total STD. Ex: EDUC: 0.7071/6.3692

The overall R-squared is very close to 1, which indicates that the clusters are highly heterogeneous (well separated) on INCOME and EDUC.

The R-squared for EDUC is the larger of the two, so the clusters are more homogeneous on EDUC.

R^2 = SSB/SST
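The overall R-squared can be recomputed directly from the six observations and the three-cluster solution. The Python sketch below also derives a pseudo F value using the standard definition, (SSB/(k-1)) / (SSW/(n-k)); treating that as the formula behind the SAS figure is an assumption, and the helper names are hypothetical.

```python
def mean_point(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def sum_of_squares(points, center):
    """Total squared Euclidean distance of the points from a center."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in points)


# Income ($1000) and education (years), with the three-cluster solution.
data = {"S1": (5, 5), "S2": (6, 6), "S3": (15, 14),
        "S4": (16, 15), "S5": (25, 20), "S6": (30, 19)}
clusters = [["S1", "S2"], ["S3", "S4"], ["S5", "S6"]]

all_points = list(data.values())
sst = sum_of_squares(all_points, mean_point(all_points))  # total
ssw = 0.0                                                 # within clusters
for members in clusters:
    pts = [data[m] for m in members]
    ssw += sum_of_squares(pts, mean_point(pts))
ssb = sst - ssw                                           # between clusters

r_squared = ssb / sst
n, k = len(all_points), len(clusters)
pseudo_f = (ssb / (k - 1)) / (ssw / (n - k))
```

With these data SSW is 15 and SST is about 701.67, giving an R-squared of about 0.9786 and a pseudo F of about 68.67.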

Exhibit 7-2 (continued)

Pseudo F Statistic = 68.67

Approximate Expected Over-All R-Squared = .

Cubic Clustering Criterion = .

WARNING: The two above values are invalid for correlated variables.

Cluster Means

Cluster   INCOME    EDUC
1          5.5000    5.5000
2         27.5000   19.5000
3         15.5000   14.5000

Note: the cluster means offer another way to describe the clusters. They can be interpreted as:

Cluster 1: low income, low education
Cluster 2: high income, high education
Cluster 3: middle income, middle education

Exhibit 7.4: Hierarchical Cluster Analysis For Food Data

SINGLE LINKAGE CLUSTER ANALYSIS

SIMPLE STATISTICS

            MEAN      STD DEV   SKEWNESS   KURTOSIS   BIMODALITY
CALORIES    207.407   101.208    0.542     -0.675     0.478
PROTEIN      19.000     4.252   -0.824      1.327     0.357
FAT          13.481    11.257    0.790     -0.624     0.589
CALCIUM      43.963    78.034    3.159     11.345     0.746
IRON          2.381     1.461    1.230      1.469     0.518

Exhibit 7.4 (continued)

COMPLETE LINKAGE CLUSTER ANALYSIS (complete linkage method)

NUMBER OF                                FREQUENCY OF   RMS STD OF    SEMIPARTIAL               MAXIMUM
CLUSTERS    CLUSTERS JOINED              NEW CLUSTER    NEW CLUSTER   R-SQUARED     R-SQUARED   DISTANCE
10          CL15   CANNED CRABMEAT       4              11.32324      0.003476      0.985594     50.6665
9           CL17   ROAST LAMB SHOUL      3              12.59929      0.003226      0.982367     55.6611
8           CL14   CANNED SHRIMP         3              16.10565      0.005231      0.977136     71.1677
7           CL13   ROAST BEEF            6              14.34190      0.009755      0.967381     80.9343
6           CL10   CL8                   7              22.14096      0.023782      0.943599    108.1758
5           CL9    CL11                  11             20.22234      0.039103      0.904496    141.7814
4           CL6    CL12                  9              30.07489      0.048662      0.855835    154.4447
3           CL7    CL5                   17             38.73570      0.220433      0.635402    262.5666
2           CL4    CANNED SARDINES       10             51.36181      0.192623      0.442779    364.8934
1           CL3    CL2                   27             57.40958      0.442779      0.000000    433.7617

Exhibit 7.4 (continued)

ROOT-MEAN-SQUARE TOTAL-SAMPLE STANDARD DEVIATION = 57.4096

NUMBER OF                                         FREQUENCY OF   RMS STD OF    SEMIPARTIAL               MINIMUM
CLUSTERS    CLUSTERS JOINED                       NEW CLUSTER    NEW CLUSTER   R-SQUARED     R-SQUARED   DISTANCE
10          CANNED MACKEREL   CANNED SALMON       2              11.16786      0.001455      0.973438     35.3159
9           CL14              ROAST LAMB SHOULDER 3              12.59929      0.003226      0.970211     35.4131
8           CL11              CANNED CRABMEAT     12             16.80697      0.014701      0.955510     39.5267
7           CL15              CL9                 8              20.48901      0.028341      0.927169     40.1627
6           CL7               CL8                 20             40.04817      0.285060      0.642109     40.2746
5           CL12              CANNED SHRIMP       3              16.10565      0.005231      0.636878     44.8504
4           CL6               ROAST BEEF          21             43.49500      0.085924      0.550954     45.7642
3           CL4               CL5                 24             48.72189      0.189548      0.361406     48.7139
2           CL3               CL10                26             50.53988      0.106595      0.254811     62.2624
1           CL2               CANNED SARDINES     27             57.40958      0.254811      0.000000    211.5691

Notes:

The smaller the RMS STD, the more homogeneous the cluster.

The larger the R-squared, the greater the dissimilarity between clusters.

Too low; can be dropped.

Exhibit 7.4 (continued)

CENTROID HIERARCHICAL CLUSTER ANALYSIS (centroid method)

NUMBER OF                                   FREQUENCY OF   RMS STD OF    SEMIPARTIAL               CENTROID
CLUSTERS    CLUSTERS JOINED                 NEW CLUSTER    NEW CLUSTER   R-SQUARED     R-SQUARED   DISTANCE
10          CL15   CANNED CRABMEAT          4              11.32324      0.003476      0.985594     44.5633
9           CL16   ROAST LAMB SHOULDER      3              12.59929      0.003226      0.982367     45.5370
8           CL14   CANNED SHRIMP            3              16.10565      0.005231      0.977136     57.9815
7           CL13   CL10                     12             16.80697      0.026857      0.950279     65.6901
6           CL12   ROAST BEEF               6              14.34190      0.009755      0.940524     70.8222
5           CL6    CL9                      9              24.36751      0.039727      0.900797     92.2533
4           CL8    CL11                     5              26.85628      0.026158      0.874639     96.6423
3           CL7    CL4                      17             31.36108      0.113709      0.760930    117.4906
2           CL5    CL3                      26             50.53988      0.506119      0.254811    191.9655
1           CL2    CANNED SARDINES          27             57.40958      0.254811      0.000000    336.7134

Annotations on the output: "loss of within-cluster similarity"; "indicates higher within-cluster similarity".

Exhibit 7.4 (continued)

WARD'S MINIMUM VARIANCE CLUSTER ANALYSIS (Ward's method)

NUMBER OF                              FREQUENCY OF   RMS STD OF    SEMIPARTIAL               BETWEEN-CLUSTER
CLUSTERS    CLUSTERS JOINED            NEW CLUSTER    NEW CLUSTER   R-SQUARED     R-SQUARED   SUM OF SQUARES
10          CL14   CANNED CRABMEAT     4              11.32324      0.003476      0.985908      1489.42
9           CL16   CL20                8               7.75641      0.003541      0.982367      1517.12
8           CL15   CANNED SHRIMP       3              16.10565      0.005231      0.977136      2241.24
7           CL12   ROAST BEEF          6              14.34190      0.009755      0.967381      4179.83
6           CL10   CL8                 7              22.14096      0.023782      0.943599     10189.5
5           CL11   CL9                 11             20.22234      0.039103      0.904496     16754.1
4           CL6    CL13                9              30.07489      0.048662      0.855835     20849.7
3           CL5    CL4                 20             36.22080      0.158726      0.697109     68007.8
2           CL3    CANNED SARDINES     21             47.72546      0.240715      0.456394    103137
1           CL7    CL2                 27             57.40958      0.456394      0.000000    195548


79

Exhibit 7.5: Non-Hierarchical Analysis For Food-Nutrient Data

INITIAL SEEDS (based mainly on the centroid method)

CLUSTER  CALORIES  PROTEIN  FAT     CALCIUM  IRON
1        331.111   19.000   27.556  8.778    2.467
2        161.667   20.500   7.500   14.250   1.925
3        100.000   14.800   3.400   114.000  3.000



80

Exhibit 7.5 (continued)

MINIMUM DISTANCE BETWEEN SEEDS = 117.4876

           CHANGE IN CLUSTER SEEDS
ITERATION  1        2        3
1          10.8475  6.46446  0.3
2          0        6.85281  12.7855
3          0        0        0
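One way to read the CHANGE IN CLUSTER SEEDS column is as the Euclidean distance each seed moves in an iteration. The sketch below assumes this interpretation and checks it for cluster 1: since that seed shows zero change in iterations 2 and 3, its position after iteration 1 should equal the final cluster 1 mean reported under CLUSTER MEANS in Exhibit 7.5. Variable names are illustrative.

```python
import math

# Columns: CALORIES, PROTEIN, FAT, CALCIUM, IRON
initial_seed_1 = (331.111, 19.000, 27.556, 8.778, 2.467)  # INITIAL SEEDS, cluster 1
final_mean_1 = (341.875, 18.750, 28.875, 8.750, 2.437)    # CLUSTER MEANS, cluster 1

# Distance the seed moves between its initial and updated position.
seed_change = math.dist(initial_seed_1, final_mean_1)
print(f"change in seed 1 during iteration 1: {seed_change:.4f}")
```

The result agrees with the 10.8475 reported for cluster 1 in iteration 1, up to rounding of the printed seeds and means.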



81

CLUSTER SUMMARY

                               MAXIMUM DISTANCE
CLUSTER             RMS STD    FROM SEED         NEAREST   CENTROID
NUMBER   FREQUENCY  DEVIATION  TO OBSERVATION    CLUSTER   DISTANCE
1        8          20.8936    78.8882           2         168.5
2        12         16.3651    70.9576           3         117.9
3        6          27.8059    79.6672           2         117.9

FREQUENCY: the number of observations in the cluster.
RMS STD DEVIATION: cluster 2 is the most homogeneous.
NEAREST CLUSTER: the number of the closest cluster.
CENTROID DISTANCE: the distance between the centroids of the two clusters.

The smaller the RMS STD, the higher the within-cluster similarity.
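The RMS STD values in the summary can be reproduced from the per-observation DISTANCE values listed for each cluster later in Exhibit 7.5. The sketch below assumes that RMS STD is the square root of the within-cluster sum of squared distances divided by p(n - 1), with p = 5 variables; the function name is illustrative.

```python
import math

N_VARS = 5  # CALORIES, PROTEIN, FAT, CALCIUM, IRON

# DISTANCE column for clusters 1 and 3 in the Exhibit 7.5 listings.
cluster1_distances = [2.4357, 78.8882, 33.2744, 77.3963,
                      42.0616, 2.4311, 1.9132, 13.1779]
cluster3_distances = [34.7046, 60.5092, 63.9273, 79.6672, 61.7127, 14.8809]


def rms_std(distances, n_vars):
    """Pooled standard deviation over all variables within one cluster."""
    n_obs = len(distances)
    within_ss = sum(d * d for d in distances)
    return math.sqrt(within_ss / (n_vars * (n_obs - 1)))


print(f"cluster 1 RMS STD: {rms_std(cluster1_distances, N_VARS):.4f}")
print(f"cluster 3 RMS STD: {rms_std(cluster3_distances, N_VARS):.4f}")
```

Under this assumption the computed values agree with the 20.8936 and 27.8059 shown in the summary, so a smaller RMS STD does correspond to observations lying closer to their cluster mean.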


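82

When the variables share the same measurement scale (instrument), RMSSTD values can be compared directly. When the scales differ, the ratio Within STD / Total STD must be examined instead.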

STATISTICS FOR VARIABLES

VARIABLE  TOTAL STD  WITHIN STD  R-SQUARED  RSQ/(1-RSQ)
CALORIES  103.06085  39.89286    0.86216    6.25453
PROTEIN   4.29257    3.58590     0.35798    0.55758
FAT       11.44357   4.52989     0.85584    5.93681
CALCIUM   44.70188   22.76009    0.76150    3.19291
IRON      1.49005    1.51663     0.04688    0.04919
OVER-ALL  50.53988   20.71299    0.84547    5.47135

PSEUDO F STATISTIC = 62.92
APPROXIMATE EXPECTED OVER-ALL R-SQUARED = 0.78678
CUBIC CLUSTERING CRITERION = 2.186

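These two values are too small, which indicates that the clusters do not differ on these two measures, i.e. the clustering works poorly for them. Delete these two sets of data and run the clustering again.

The smaller the ratio WITHIN STD / TOTAL STD, the higher the within-cluster homogeneity.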
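The scale-free comparison described above can be worked through with the figures from the STATISTICS FOR VARIABLES table. This is a minimal sketch; the dictionary and function names are illustrative and not part of the original output.

```python
# variable -> (TOTAL STD, WITHIN STD, R-SQUARED) from STATISTICS FOR VARIABLES
stats = {
    "CALORIES": (103.06085, 39.89286, 0.86216),
    "PROTEIN": (4.29257, 3.58590, 0.35798),
    "FAT": (11.44357, 4.52989, 0.85584),
    "CALCIUM": (44.70188, 22.76009, 0.76150),
    "IRON": (1.49005, 1.51663, 0.04688),
}


def within_to_total(name):
    """WITHIN STD / TOTAL STD: smaller means more homogeneous clusters."""
    total_std, within_std, _ = stats[name]
    return within_std / total_std


def rsq_ratio(rsq):
    """RSQ / (1 - RSQ): between-cluster relative to within-cluster variation."""
    return rsq / (1.0 - rsq)


# Variables ordered from most to least homogeneous within clusters.
ranked = sorted(stats, key=within_to_total)
for name in ranked:
    print(f"{name:<9} within/total = {within_to_total(name):.3f}  "
          f"RSQ/(1-RSQ) = {rsq_ratio(stats[name][2]):.5f}")
```

CALORIES and FAT have the smallest ratios, while PROTEIN and IRON have the largest; for IRON the ratio exceeds 1, so the clusters barely separate on that variable.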


83

Exhibit 7.5 (continued)

CLUSTER MEANS

CLUSTER  CALORIES  PROTEIN  FAT     CALCIUM  IRON
1        341.875   18.750   28.875  8.750    2.437
2        174.583   21.083   8.750   11.833   2.083
3        98.333    14.667   3.167   101.333  2.883


Cluster 1 is named "calories".

Cluster 2 is named "obesity index".

Cluster 3 is named "low calcium".


84

Exhibit 7.5 (continued)

CLUSTER=1

OBS  NAME              CLUSTER  DISTANCE  CALORIES  PROTEIN  FAT  CALCIUM  IRON
1    BRAISED BEEF      1        2.4357    340       20       28   9        2.6
2    ROAST BEEF        1        78.8882   420       15       39   7        2.0
3    BEEF STEAK        1        33.2744   375       19       32   9        2.6
4    ROAST LAMB LEG    1        77.3963   265       20       20   9        2.6
5    ROAST LAMB        1        42.0616   300       18       25   9        2.3
6    SMOKED HAM        1        2.4311    340       20       28   9        2.5
7    PORK ROAST        1        1.9132    340       19       29   9        2.5
8    PORK SIMMERED     1        13.1779   355       19       30   9        2.4


Named the highly fattening food group (high calories, high obesity index, low calcium) (8 cases)


85

Exhibit 7.5 (continued)

CLUSTER=2

OBS  NAME              CLUSTER  DISTANCE  CALORIES  PROTEIN  FAT  CALCIUM  IRON
9    HAMBURGER         2        70.9576   245       21       17   9        2.7
10   CANNED BEEF       2        7.8135    180       22       10   17       3.7
11   BROILED CHICKEN   2        59.9964   115       20       3    8        1.4
12   CANNED CHICKEN    2        6.3070    170       25       7    12       1.5
13   BEEF HEART        2        16.4369   160       26       5    14       5.9
14   BEEF TONGUE       2        31.3971   205       18       14   7        2.5
15   VEAL CUTLET       2        10.9841   185       23       9    9        2.7
16   BAKED BLUEFISH    2        42.0215   135       22       4    25       0.6
17   FRIED HADDOCK     2        40.2403   135       16       5    15       0.5
18   BROILED MACKEREL  2        26.7634   200       19       13   5        1.0
19   FRIED PERCH       2        21.2850   195       16       11   14       1.3
20   CANNED TUNA       2        7.9719    170       25       7    7        1.2

Named the moderately fattening food group (low calories, low obesity index, low calcium) (12 cases)


86

Exhibit 7.5 (continued)

CLUSTER=3

OBS  NAME              CLUSTER  DISTANCE  CALORIES  PROTEIN  FAT  CALCIUM  IRON
21   RAW CLAMS         3        34.7046   70        11       1    82       6.0
22   CANNED CLAMS      3        60.5092   45        7        1    74       5.4
23   CANNED CRABMEAT   3        63.9273   90        14       2    38       0.8
24   CANNED MACKEREL   3        79.6672   155       16       9    157      1.8
25   CANNED SALMON     3        61.7127   120       17       5    159      0.7
26   CANNED SHRIMP     3        14.8809   110       23       1    98       2.6

Named the low-calorie, high-calcium food group (6 cases)
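The cluster 3 listing contains enough raw data to recompute its row in CLUSTER MEANS and its DISTANCE column. The sketch below assumes that DISTANCE is the unstandardized Euclidean distance from each observation to the cluster mean over the five nutrient variables; the helper names are illustrative.

```python
import math

# Cluster 3 rows: (CALORIES, PROTEIN, FAT, CALCIUM, IRON)
cluster3 = {
    "RAW CLAMS": (70, 11, 1, 82, 6.0),
    "CANNED CLAMS": (45, 7, 1, 74, 5.4),
    "CANNED CRABMEAT": (90, 14, 2, 38, 0.8),
    "CANNED MACKEREL": (155, 16, 9, 157, 1.8),
    "CANNED SALMON": (120, 17, 5, 159, 0.7),
    "CANNED SHRIMP": (110, 23, 1, 98, 2.6),
}


def centroid(rows):
    """Column-wise mean of a list of equal-length tuples."""
    n_rows = len(rows)
    return tuple(sum(column) / n_rows for column in zip(*rows))


center = centroid(list(cluster3.values()))
distances = {name: math.dist(row, center) for name, row in cluster3.items()}

print("cluster 3 mean:", tuple(round(value, 3) for value in center))
for name, dist in distances.items():
    print(f"{name:<16} {dist:.4f}")
```

The recomputed mean and distances agree with the exhibit to the printed precision, and the farthest observation, CANNED MACKEREL, gives the MAXIMUM DISTANCE FROM SEED TO OBSERVATION of 79.6672 reported for cluster 3 in the cluster summary.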