Data Mining -- Clustering

Dr. Avi Rosenfeld

The general idea: similar things are similar

• How do we group similar things?
  – Regression, Classification (Supervised): k-nn
  – Clustering (Unsupervised): k-means
  – Partitioning Algorithms (k-means), Hierarchical Algorithms
• Open questions: how do we define "closeness"?
  – Euclidean distance
  – Manhattan distance (Judea Pearl)
  – Many other options

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
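A minimal sketch of the two distance measures named above, assuming points are given as equal-length tuples of numbers. The function names `euclidean` and `manhattan` and the two sample points are illustrative choices, not part of the slides.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two illustrative points (hypothetical data)
p, q = (1.0, 1.5), (5.0, 6.0)
print(round(euclidean(p, q), 2))  # 6.02
print(manhattan(p, q))            # 8.5
```

The two measures can rank neighbors differently, so the choice of "closeness" affects every algorithm that follows.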

How do we classify the question mark?

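K-Nearest Neighbor
• Model free: the classification is worked out at query time
• The number of neighbors must be chosen
• Usually the neighbors are weighted by their distance from the point
• CBR (Case Based Reasoning) is similar
• In classification, go with the majority (or some weighting by proximity)
• In regression, the value is likewise set by the majority of the neighbors (or by some weighting by proximity)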
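The points above can be sketched as follows. This is an illustration under stated assumptions, not the lecturer's implementation: training data is a list of `(point, target)` pairs, distance is Euclidean, the proximity weight is `1 / (distance + eps)`, and regression uses the plain average of the neighbors' values. All function names are hypothetical.

```python
import math
from collections import defaultdict

def nearest(train, query, k):
    """Return the k (distance, target) pairs closest to the query point."""
    scored = [(math.dist(x, query), y) for x, y in train]
    return sorted(scored, key=lambda pair: pair[0])[:k]

def knn_classify(train, query, k, weighted=False):
    """Majority vote; with weighted=True each vote counts 1 / (distance + eps)."""
    votes = defaultdict(float)
    for d, label in nearest(train, query, k):
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

def knn_regress(train, query, k):
    """Assumed regression rule: plain average of the k nearest target values."""
    neighbours = nearest(train, query, k)
    return sum(y for _, y in neighbours) / len(neighbours)

# Hypothetical data: one "a" close to the query, two "b" further away
train = [((0, 0), "a"), ((3, 0), "b"), ((3.2, 0), "b")]
print(knn_classify(train, (1, 0), k=3))                 # b (two votes against one)
print(knn_classify(train, (1, 0), k=3, weighted=True))  # a (the close point outweighs)
```

There is no training phase: all the work happens inside `nearest` when a query arrives, which is why classification is the expensive step.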

1-Nearest Neighbor

3-Nearest Neighbor

k NEAREST NEIGHBOR

• Choosing the value of k:
  – If k is too small, sensitive to noise points
  – If k is too large, neighborhood may include points from other classes
  – Choose an odd value for k, to eliminate ties

k = 1: Belongs to square class
k = 3: Belongs to triangle class
k = 7: Belongs to square class

ICDM: Top Ten Data Mining Algorithms, k nearest neighbor classification, December 2006
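To see how the answer can flip with k, here is a small sketch. The coordinates are hypothetical, chosen only so that the outcome follows the same square / triangle / square pattern as the captions. They are not the points from the figure.

```python
import math
from collections import Counter

def knn_label(train, query, k):
    """Majority label among the k training points nearest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical layout around a query at the origin
train = [
    ((0.5, 0.0), "square"),
    ((0.0, 1.0), "triangle"),
    ((-1.1, 0.0), "triangle"),
    ((2.0, 0.0), "square"),
    ((0.0, 2.1), "square"),
    ((-2.2, 0.0), "square"),
    ((0.0, -2.5), "triangle"),
]

for k in (1, 3, 7):
    print(k, knn_label(train, (0.0, 0.0), k))
# 1 square
# 3 triangle
# 7 square
```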

Remarks
+ Highly effective inductive inference method for noisy training data and complex target functions
+ Target function for a whole space may be described as a combination of less complex local approximations
+ Learning is very simple
- Classification is time consuming

Clustering K-MEANS: the basic algorithm
1. Choose the desired number of clusters, K.
2. From the sampled population (the points, from here on), choose K points at random. These points are the initial centers (seeds) of the clusters.
3. Compute the Euclidean distance of every point from the chosen centers.
4. Assign each point to the center nearest to it. This yields K mutually disjoint clusters.
5. In each cluster, set a new center point by computing the mean of all the points in the cluster.
6. If the center points are the same as the previous ones, the process is finished; otherwise go back to step 3.
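The six steps can be sketched in a few lines. This is a minimal illustration, assuming points are tuples of floats. The `seeds` and `max_iter` parameters, and the rule that an empty cluster keeps its old center, are additions made for this sketch and are not part of the algorithm as stated above.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, seeds=None, max_iter=100):
    # Steps 1-2: k is given; the seeds default to k randomly chosen points.
    centers = [tuple(s) for s in seeds] if seeds is not None else random.sample(points, k)
    for _ in range(max_iter):
        # Steps 3-4: assign each point to the nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Step 5: recompute each center as the mean of its cluster.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster simply keeps its previous center.
            new_centers.append(mean_point(members) if members else centers[c])
        # Step 6: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Hypothetical data: two well-separated groups of three points
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(pts, 2, seeds=[pts[0], pts[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Passing explicit seeds makes a run reproducible. With random seeds the result can differ from run to run.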

Example with 6 points

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0

Example with 6 points: iteration 1
• Points 1 and 3 were chosen at random as the initial centers, C1 and C2
• Points 1, 2 are assigned to center C1; points 3, 4, 5, 6 are assigned to center C2
• Distance formula: Distance = √((x1-x2)² + (y1-y2)²)

Instance   Distance from C1   Distance from C2
1          0.00               1.00
2          3.00               3.16
3          1.00               0.00
4          2.24               2.00
5          2.24               1.41
6          6.02               5.41

Choosing new centers

• For C1
  – X = (1.0+1.0)/2 = 1.0
  – Y = (1.5+4.5)/2 = 3.0
• For C2
  – X = (2.0+2.0+3.0+5.0)/4.0 = 3.0
  – Y = (1.5+3.5+2.5+6.0)/4.0 = 3.375

Iteration 2
• The new center points: C1(1.0, 3.0), C2(3.0, 3.375)
• Points 1, 2, 3 join C1; points 4, 5, 6 join C2

Instance   Distance from C1   Distance from C2
1          1.5                2.74
2          1.5                2.29
3          1.8                2.125
4          1.12               1.01
5          2.06               0.875
6          5.00               3.30
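The two iterations can be reproduced with a short script. This is a sketch that uses the six instances from the example table and starts from instances 1 and 3. The helper name `kmeans_step` is hypothetical, and the helper assumes no cluster ends up empty, which holds for this data.

```python
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def kmeans_step(points, centers):
    """One pass: distance table, nearest-center labels, updated centers."""
    table = [[math.dist(p, c) for c in centers] for p in points]
    labels = [row.index(min(row)) for row in table]
    new_centers = []
    for c in range(len(centers)):
        members = [p for p, lab in zip(points, labels) if lab == c]
        # Assumes every cluster has at least one member.
        new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return table, labels, new_centers

centers = [points[0], points[2]]  # instances 1 and 3 act as C1 and C2
for iteration in (1, 2):
    table, labels, centers = kmeans_step(points, centers)
    print("iteration", iteration)
    for i, (row, lab) in enumerate(zip(table, labels), start=1):
        print(i, [round(d, 2) for d in row], "-> C%d" % (lab + 1))
    print("new centers:", centers)
```

The printed distances should agree with the two tables above, up to rounding, and the centers computed in the first pass are (1.0, 3.0) and (3.0, 3.375).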

The final result

CS583, Bing Liu, UIC 20

Problems with k-means
• The user has to define K in advance
• Assumes that the mean can be computed
• Very sensitive to outliers
  – Outliers are points that lie far from the others
  – They may simply be errors...
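A small sketch of why the mean is sensitive to outliers. The four cluster points and the far-away point are hypothetical values chosen for round numbers.

```python
def centroid(points):
    """Coordinate-wise mean, as used for a k-means center."""
    return tuple(sum(c) / len(points) for c in zip(*points))

cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (22.0, 22.0)  # hypothetical far-away point, e.g. a recording error

print(centroid(cluster))              # (1.5, 1.5)
print(centroid(cluster + [outlier]))  # (5.6, 5.6)
```

A single stray point drags the center well outside the original group, so points that belong together may end up nearer to a different center.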

Example of an outlier

Euclidean distance

• Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

• Properties of a metric d(i,j):
  – d(i,j) ≥ 0
  – d(i,i) = 0
  – d(i,j) = d(j,i)
  – d(i,j) ≤ d(i,k) + d(k,j)
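The four properties can be checked numerically on a handful of points. This is only a spot check on sample data, not a proof. The helper `check_metric`, the tolerance, and the sample points are assumptions made for the illustration.

```python
import itertools
import math

def check_metric(d, sample, tol=1e-12):
    """Return True if d satisfies the four metric properties on every triple in sample."""
    for i, j, k in itertools.product(sample, repeat=3):
        if d(i, j) < 0:                      # non-negativity
            return False
        if d(i, i) != 0:                     # distance to itself is zero
            return False
        if abs(d(i, j) - d(j, i)) > tol:     # symmetry
            return False
        if d(i, j) > d(i, k) + d(k, j) + tol:  # triangle inequality
            return False
    return True

def squared(a, b):
    """Squared Euclidean distance: a common shortcut that is not a metric."""
    return math.dist(a, b) ** 2

sample = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (5.0, 6.0)]
print(check_metric(math.dist, sample))                      # True
print(check_metric(squared, [(0.0,), (1.0,), (2.0,)]))      # False
```

The squared distance fails the triangle inequality on the points 0, 1, 2, since 4 > 1 + 1.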

Hierarchical Clustering
• Produce a nested sequence of clusters, a tree, also called Dendrogram.

Types of hierarchical clustering
• Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom level, and
  – merges the most similar (or nearest) pair of clusters
  – stops when all the data points are merged into a single cluster (i.e., the root cluster).
• Divisive (top down) clustering: It starts with all data points in one cluster, the root.
  – Splits the root into a set of child clusters. Each child cluster is recursively divided further
  – stops when only singleton clusters of individual data points remain, i.e., each cluster with only a single point

Agglomerative clustering

It is more popular than divisive methods.
• At the beginning, each data point forms a cluster (also called a node).
• Merge nodes/clusters that have the least distance.
• Go on merging
• Eventually all nodes belong to one cluster

Agglomerative clustering algorithm
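A minimal sketch of the merge loop, under stated assumptions: points are tuples, the distance between two clusters is the single-link (closest pair) distance, and the function returns a log of merges rather than a tree structure. The names `single_link` and `agglomerate` are hypothetical, and the brute-force search is meant for small examples only.

```python
import math

def single_link(a, b):
    """Assumed cluster distance: the closest pair, one point from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, cluster_distance=single_link):
    """Merge the two nearest clusters until one remains; return the merge log."""
    clusters = [[p] for p in points]  # each point starts as its own node
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the least distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Hypothetical data: two tight pairs and one distant point
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
for left, right in agglomerate(pts):
    print(left, "+", right)
```

Read top to bottom, the merge log is the dendrogram from the leaves to the root: n points always produce n - 1 merges.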

An example: working of the algorithm

Measuring the distance of two clusters

• A few ways to measure distances of two clusters.
• Results in different variations of the algorithm.
  – Single link
  – Complete link
  – Average link
  – Centroids
  – …

Single link method
• The distance between two clusters is the distance between two closest data points in the two clusters, one data point from each cluster.
• It can find arbitrarily shaped clusters, but
  – It may cause the undesirable “chain effect” by noisy points

Two natural clusters are split into two

Complete link method
• The distance between two clusters is the distance of two furthest data points in the two clusters.
• It is sensitive to outliers because they are far away
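The cluster-distance variants can be compared side by side. This is a sketch assuming Euclidean point distance. Average link is taken here as the mean over all cross-cluster pairs, and the centroid variant as the distance between cluster means. These are the usual definitions, stated as assumptions since the slides only name them. The data is hypothetical.

```python
import math

def single_link(a, b):
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_link(a, b):
    """Distance between the two furthest points, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)

def average_link(a, b):
    """Assumed definition: mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_link(a, b):
    """Assumed definition: distance between the two cluster means."""
    centre = lambda pts: tuple(sum(c) / len(pts) for c in zip(*pts))
    return math.dist(centre(a), centre(b))

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (7.0, 0.0)]
print(single_link(left, right))    # 2.0
print(complete_link(left, right))  # 7.0
print(average_link(left, right))   # 4.5
print(centroid_link(left, right))  # 4.5

# One far-away point changes complete link but leaves single link untouched.
noisy = right + [(50.0, 0.0)]
print(single_link(left, noisy), complete_link(left, noisy))  # 2.0 50.0
```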

EM Algorithm

• Initialize K cluster centers
• Iterate between two steps
  – Expectation step: assign points to clusters
  – Maximization step: estimate model parameters

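Expectation step:

$$P(d_i \in c_k) = \frac{w_k \Pr(d_i \mid c_k)}{\sum_j w_j \Pr(d_i \mid c_j)}$$

Maximization step:

$$\mu_k = \frac{1}{m} \sum_{i=1}^{m} \frac{d_i \, P(d_i \in c_k)}{\sum_j P(d_i \in c_j)}$$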
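A minimal sketch of the two alternating steps on one-dimensional data. Several things here are assumptions made for the illustration: each cluster is a Gaussian with fixed unit variance, the expectation step computes P(d_i ∈ c_k) as the weighted likelihood normalized over all clusters, and the maximization step uses the common responsibility-weighted average for the means and the average responsibility for the weights. The function names and data are hypothetical.

```python
import math

def gaussian(x, mu, sigma=1.0):
    """Density of a normal distribution (unit variance assumed by default)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def e_step(data, means, weights):
    """Expectation: soft assignment of each point to each cluster."""
    resp = []
    for x in data:
        scores = [w * gaussian(x, mu) for mu, w in zip(means, weights)]
        total = sum(scores)
        resp.append([s / total for s in scores])
    return resp

def m_step(data, resp):
    """Maximization (assumed form): responsibility-weighted means and weights."""
    k = len(resp[0])
    mass = [sum(r[c] for r in resp) for c in range(k)]
    means = [sum(r[c] * x for r, x in zip(resp, data)) / mass[c] for c in range(k)]
    weights = [m / len(data) for m in mass]
    return means, weights

# Hypothetical data: two groups, around 1 and around 5
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
means, weights = [0.0, 6.0], [0.5, 0.5]  # initial cluster centers and weights
for _ in range(20):
    resp = e_step(data, means, weights)
    means, weights = m_step(data, resp)
print(means, weights)
```

Unlike k-means, each point belongs to every cluster with some probability, so the assignment step is soft rather than hard.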
