[系列活動] 資料探勘速遊

Quick Tour of Data Mining Yi-Shin Chen Institute of Information Systems and Applications Department of Computer Science National Tsing Hua University [email protected]


Page 1: [系列活動] 資料探勘速遊

Quick Tour of Data Mining

Yi-Shin Chen

Institute of Information Systems and Applications

Department of Computer Science

National Tsing Hua University

[email protected]

Page 2: [系列活動] 資料探勘速遊

About Speaker

陳宜欣 Yi-Shin Chen

▷Currently
• Associate Professor, Department of Computer Science, National Tsing Hua University
• Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab)

▷Education• Ph.D. in Computer Science, USC, USA

• M.B.A. in Information Management, NCU, TW

• B.B.A. in Information Management, NCU, TW

▷Courses (all in English)

• Research and Presentation Skills

• Introduction to Database Systems

• Advanced Database Systems

• Data Mining: Concepts, Techniques, and

Applications

2

Page 3: [系列活動] 資料探勘速遊

Evolution of Data

Management

The relationships between the techniques and our world

3

Page 4: [系列活動] 資料探勘速遊

4

Timeline: 1900–1970

Manual Record Managers → Punched-Card Record Managers → Programmed Record Managers → On-line Network Databases

• Programmed Record Managers: birth of high-level programming languages; batch processing
• On-line Network Databases: indexed sequential records; data independence; concurrent access

Key milestones:
• 1931: Gödel's Incompleteness Theorem
• 1944: Mark I
• 1948: Information theory by Shannon (information entropy)
• 1950: Univac had developed a magnetic tape
• 1951: Univac I delivered to the US Census Bureau
• 1963: The origins of the Internet

Page 5: [系列活動] 資料探勘速遊

5

Timeline: 1970–2010

Relational Model → Object Relational Model → Knowledge Discovery in Databases

• Relational Model: gives database users high-level, set-oriented data access operations
• Object Relational Model: supports multiple datatypes and applications

Key milestones:
• 1974: IBM System R
• 1976: E-R Model by Peter Chen
• 1980: Artificial Neural Networks
• 1985: First standardization of SQL
• 1993: WWW
• 2001: Data Science
• 2006: Amazon.com Elastic Compute Cloud
• 2009: Deep Learning

Page 6: [系列活動] 資料探勘速遊

Data Mining

What we know, and what we do now

6

Page 7: [系列活動] 資料探勘速遊

Data Mining

▷What is data mining?

• Algorithms for seeking unexpected “pearls of wisdom”

▷Current data mining research:

• Focus on efficient ways to discover models of existing data sets

• Developed algorithms include classification, clustering, association-rule discovery, summarization, etc.

7

Page 8: [系列活動] 資料探勘速遊

Data Mining Examples

8

Slide from: Prof. Shou-De Lin

Page 9: [系列活動] 資料探勘速遊

Origins of Data Mining

▷Draws ideas from

• Machine learning/AI

• Pattern recognition

• Statistics

• Database systems

▷ Traditional Techniques may be unsuitable due to

• Enormity of data

• High dimensionality of data

• Heterogeneous, distributed nature of data

9
© Tan, Steinbach, Kumar Introduction to Data Mining

(Venn diagram: Data Mining sits at the intersection of Machine Learning/AI, Pattern Recognition, Statistics, and Database systems)

Page 10: [系列活動] 資料探勘速遊

Knowledge Discovery (KDD) Process

10

(KDD process diagram: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation)

Page 11: [系列活動] 資料探勘速遊

Database

11

Page 12: [系列活動] 資料探勘速遊

Database

12

Page 13: [系列活動] 資料探勘速遊

Database

13

Page 14: [系列活動] 資料探勘速遊

Informal Design Guidelines for Database

▷Design a schema that can be explained easily relation by

relation. The semantics of attributes should be easy to interpret

▷Should avoid update anomaly problems

▷Relations should be designed such that their tuples will have as

few NULL values as possible

▷ The relations should be designed to satisfy the lossless join

condition (guarantee meaningful results for join operations)

14

Page 15: [系列活動] 資料探勘速遊

Data Warehouse

▷Assemble and manage data from various sources

for the purpose of answering business questions

15

(Diagram: OLTP sources such as CRM, ERP, POS, … are assembled into the Data Warehouse to produce meaningful information)

Page 16: [系列活動] 資料探勘速遊

Knowledge Discovery (KDD) Process

16

(KDD process diagram: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation)

Page 17: [系列活動] 資料探勘速遊

KDD Process: Several Key Steps

▷Pre-processing

• Learning the application domain

→ Relevant prior knowledge and goals of application

• Creating a target data set: data selection

• Data cleaning and preprocessing: (may take 60% of effort!)

• Data reduction and transformation

→ Find useful features

▷Data mining

• Choosing functions of data mining

→ Choosing the mining algorithm

• Search for patterns of interest

▷Evaluation

• Pattern evaluation and knowledge presentation

→ visualization, transformation, removing redundant patterns, etc.

17

© Han & Kamber, Data Mining: Concepts and Techniques

Page 18: [系列活動] 資料探勘速遊

Data

Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation

The most important part in the whole process

18

Page 19: [系列活動] 資料探勘速遊

Types of Attributes

▷There are different types of attributes

• Nominal (=,≠)

→ Nominal values can only distinguish one object from

another

→ Examples: ID numbers, eye color, zip codes

• Ordinal (<,>)

→ Ordinal values can help to order objects

→ Examples: rankings, grades

• Interval (+,-)

→ Differences between values are meaningful

→ Examples: calendar dates

• Ratio (*,/)

→ Both differences and ratios are meaningful

→ Examples: temperature in Kelvin, length, time, counts

19

Note: only this last type (ratio) works with all of the processing methods.

Page 20: [系列活動] 資料探勘速遊

Types of Data Sets

▷Record

• Data Matrix

• Document Data

• Transaction Data

▷Graph

• World Wide Web

• Molecular Structures

▷Ordered

• Spatial Data

• Temporal Data

• Sequential Data

• Genetic Sequence Data

20

Data matrix (record data) example:

Thickness   Load   Distance   Projection of y load   Projection of x load
1.1         2.2    16.22      6.25                   12.65
1.2         2.7    15.22      5.27                   10.23

Document-term matrix (document data) example:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0

Transaction data example:

TID Time Items

1 2009/2/8 Bread, Coke, Milk

2 2009/2/13 Beer, Bread

3 2009/2/23 Beer, Diaper

4 2009/3/1 Coke, Diaper, Milk

Page 21: [系列活動] 資料探勘速遊

A Facebook Example

21

Page 22: [系列活動] 資料探勘速遊

Data Matrix/Graph Data Example

22

Page 23: [系列活動] 資料探勘速遊

Document Data

23

Page 24: [系列活動] 資料探勘速遊

Transaction Data

24

Page 25: [系列活動] 資料探勘速遊

Spatio-Temporal Data

25

Page 26: [系列活動] 資料探勘速遊

Sequential Data

26

(Screenshot: a user's posts listed in time order, from 2016/11/05 to 2017/1/3)

Page 27: [系列活動] 資料探勘速遊

Tips for Converting Text to

Numerical Values

27

Page 28: [系列活動] 資料探勘速遊

Recap: Types of Attributes

▷There are different types of attributes

• Nominal (=,≠)

→ Nominal values can only distinguish one object from

another

→ Examples: ID numbers, eye color, zip codes

• Ordinal (<,>)

→ Ordinal values can help to order objects

→ Examples: rankings, grades

• Interval (+,-)

→ Differences between values are meaningful

→ Examples: calendar dates

• Ratio (*,/)

→ Both differences and ratios are meaningful

→ Examples: temperature in Kelvin, length, time, counts

28

Page 29: [系列活動] 資料探勘速遊

Vector Space Model

▷Represent the keywords of objects using a term vector

• Term: basic concept, e.g., keywords to describe an object

• Each term represents one dimension in a vector

• N total terms define an n-dimensional vector

• The value of each term in a vector corresponds to the importance of that term

▷Measure similarity by the vector distances

29

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0

Page 30: [系列活動] 資料探勘速遊

Term Frequency and Inverse

Document Frequency (TFIDF)

▷Since not all objects in the vector space are equally

important, we can weight each term using its

occurrence probability in the object description

• Term frequency: TF(d,t)

→ number of times t occurs in the object description d

• Inverse document frequency: IDF(t)

→ to scale down the terms that occur in many descriptions

30

Page 31: [系列活動] 資料探勘速遊

Normalizing Term Frequency

▷ $n_{ij}$ represents the number of times a term $t_i$ occurs in a description $d_j$. $tf_{ij}$ can be normalized using the total number of terms in the document:

• $tf_{ij} = n_{ij} / \text{NormalizedValue}$

▷NormalizedValue could be:• Sum of all frequencies of terms

• Max frequency value

• Any other values can make tfij between 0 to 1

31

Page 32: [系列活動] 資料探勘速遊

Inverse Document Frequency

▷ IDF seeks to scale down the coordinates of terms

that occur in many object descriptions

• For example, some stop words(the, a, of, to, and…) may

occur many times in a description. However, they should

be considered as non-important in many cases

• $idf_i = \log\dfrac{N}{df_i + 1}$

→ where dfi (document frequency of term ti) is the

number of descriptions in which ti occurs

▷ IDF can be replaced with ICF (inverse class frequency) and

many other concepts based on applications

32
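A minimal Python sketch of the TF and IDF definitions above (the idf formula follows the reconstruction on this slide; other variants, for example with the +1 outside the logarithm, are also common):

import math
from collections import Counter

def tf(tokens):
    # Normalized term frequency: raw count divided by the total number of tokens
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def idf(documents):
    # idf_i = log(N / (df_i + 1)); df_i = number of descriptions in which term t_i occurs
    n_docs = len(documents)
    df = Counter(t for doc in documents for t in set(doc))
    return {t: math.log(n_docs / (d + 1)) for t, d in df.items()}

def tfidf(doc, documents):
    idf_weights = idf(documents)
    return {t: w * idf_weights.get(t, 0.0) for t, w in tf(doc).items()}

docs = [["win", "game", "season"], ["timeout", "coach", "game"], ["ball", "score", "play"]]
print(tfidf(docs[0], docs))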

Page 33: [系列活動] 資料探勘速遊

Reasons of Log

▷ Each distribution can indicate the hidden force

33

(Figure captions: Power-law distribution, Normal distribution, Normal distribution)

Page 34: [系列活動] 資料探勘速遊

Data Quality

Dirty Data

34

Page 35: [系列活動] 資料探勘速遊

Big Data?

▷ “Every day, we create 2.5 quintillion bytes of data — so much

that 90% of the data in the world today has been created in the

last two years alone. This data comes from everywhere:

sensors used to gather climate information, posts to social

media sites, digital pictures and videos, purchase transaction

records, and cell phone GPS signals to name a few. This data

is “big data.”

• --from www.ibm.com/software/data/bigdata/what-is-big-data.html

35

Page 36: [系列活動] 資料探勘速遊

4V

36

Page 37: [系列活動] 資料探勘速遊

Data Quality

▷What kinds of data quality problems?

▷How can we detect problems with the data?

▷What can we do about these problems?

▷Examples of data quality problems:

•Noise and outliers

•Missing values

•Duplicate data

37

Page 38: [系列活動] 資料探勘速遊

Noise

▷Noise refers to modification of original values

• Examples: distortion of a person’s voice when talking

on a poor phone and “snow” on television screen

38

(Figures: Two Sine Waves; Two Sine Waves + Noise)

Page 39: [系列活動] 資料探勘速遊

Outliers

▷Outliers are data objects with characteristics

that are considerably different than most of

the other data objects in the data set

39

Page 40: [系列活動] 資料探勘速遊

Missing Values

▷Reasons for missing values

• Information is not collected

→ e.g., people decline to give their age and weight

• Attributes may not be applicable to all cases

→ e.g., annual income is not applicable to children

▷Handling missing values

• Eliminate Data Objects

• Estimate Missing Values

• Ignore the Missing Value During Analysis

• Replace with all possible values

→ Weighted by their probabilities

40

Page 41: [系列活動] 資料探勘速遊

Duplicate Data

▷Data set may include data objects that are

duplicates, or almost duplicates of one another• Major issue when merging data from heterogeneous sources

▷Examples:• Same person with multiple email addresses

▷Data cleaning• Process of dealing with duplicate data issues

41

Page 42: [系列活動] 資料探勘速遊

Data Preprocessing

To be or not to be

42

Page 43: [系列活動] 資料探勘速遊

Data Preprocessing

▷Aggregation

▷Sampling

▷Dimensionality reduction

▷Feature subset selection

▷Feature creation

▷Discretization and binarization

▷Attribute transformation

43

Page 44: [系列活動] 資料探勘速遊

Aggregation

▷Combining two or more attributes (or objects) into a single

attribute (or object)

▷Purpose• Data reduction

→ Reduce the number of attributes or objects

• Change of scale

→ Cities aggregated into regions, states, countries, etc

• More “stable” data

→ Aggregated data tends to have less variability

44

SELECT d.Name, avg(Salary)

FROM Employee AS e, Department AS d

WHERE e.Dept=d.DNo

GROUP BY d.Name

HAVING COUNT(e.ID)>=2;

Page 45: [系列活動] 資料探勘速遊

Sampling

▷Sampling is the main technique employed for data

selection

• It is often used for both

→ Preliminary investigation of the data

→ The final data analysis

• Reasons:

→ Statistics: Obtaining the entire set of data of interest is too

expensive

→ Data mining: Processing the entire data set is too

expensive

45

Page 46: [系列活動] 資料探勘速遊

Key Principle For Effective Sampling

▷The sample is representative

•Using a sample will work almost as well as using the entire data set

•The sample has approximately the same properties as the original set of data

46

Page 47: [系列活動] 資料探勘速遊

Sample Size Matters

47

8000 points 2000 Points 500 Points

Page 48: [系列活動] 資料探勘速遊

Sampling Bias

▷ 2004 Taiwan presidential election polls

48

TVBS

聯合報 (United Daily News)

Survey dates: January 15–17, 2004; valid samples: 1,068; refusals: 699; sampling error: about ±3 percentage points at the 95% confidence level; survey area: Taiwan; sampling method: stratified systematic sampling from the telephone directory, with the last two digits of each number randomized

Page 49: [系列活動] 資料探勘速遊

Dimensionality Reduction

▷Purpose
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise

▷Techniques
• Principal Component Analysis
• Singular Value Decomposition
• Others: supervised and non-linear techniques

49

Page 50: [系列活動] 資料探勘速遊

Curse of Dimensionality

▷When dimensionality increases, data becomes

increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which is

critical for clustering and outlier detection, become less

meaningful

50

• Randomly generate 500

points

• Compute difference

between max and min

distance between any pair

of points

Page 51: [系列活動] 資料探勘速遊

Dimensionality Reduction: PCA

▷Goal is to find a projection that captures

the largest amount of variation in data

51

x2

x1

e
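Not from the slides: a small NumPy sketch of PCA via SVD on mean-centered data, finding the projection direction e with the largest variation:

import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: x2 is roughly a linear function of x1 plus noise
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Center the data, then take the SVD; rows of Vt are the principal directions
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

e = Vt[0]                      # first principal component (direction of largest variance)
projected = X_centered @ e     # 1-D representation of the 2-D data
explained = S**2 / np.sum(S**2)
print("first PC direction:", e, "explained variance ratio:", explained[0])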

Page 52: [系列活動] 資料探勘速遊

Feature Subset Selection

▷Another way to reduce dimensionality of data

▷Redundant features

•Duplicate much or all of the information contained in

one or more other attributes

•E.g. purchase price of a product vs. sales tax

▷Irrelevant features

•Contain no information that is useful for the data

mining task at hand

•E.g. students' ID is often irrelevant to the task of

predicting students' GPA

52

Page 53: [系列活動] 資料探勘速遊

Feature Creation

▷Create new attributes that can capture the

important information in a data set much

more efficiently than the original attributes

▷Three general methodologies:

•Feature extraction

→Domain-specific

•Mapping data to new space

•Feature construction

→Combining features

53

Page 54: [系列活動] 資料探勘速遊

Mapping Data to a New Space

▷Fourier transform

▷Wavelet transform

54

(Figures: Two Sine Waves; Two Sine Waves + Noise; Frequency-domain representation)

Page 55: [系列活動] 資料探勘速遊

Discretization Using Class Labels

▷Entropy based approach

55

3 categories for both x and y 5 categories for both x and y

Page 56: [系列活動] 資料探勘速遊

Discretization Without Using Class Labels

56

Page 57: [系列活動] 資料探勘速遊

Attribute Transformation

▷A function that maps the entire set of values of a

given attribute to a new set of replacement values

• So each old value can be identified with one of the new

values

• Simple functions: $x^k$, $\log(x)$, $e^x$, $|x|$

• Standardization and Normalization

57

Page 58: [系列活動] 資料探勘速遊

Transformation Examples

58

(Figures: the same data plotted as Log(Frequency) and as Log(Log(Frequency)))

Page 59: [系列活動] 資料探勘速遊

Preprocessing in Reality

59

Page 60: [系列活動] 資料探勘速遊

Data Collection

▷Align /Classify the attributes correctly

60

(Annotated tweet: who posted this message, mentioned user, hashtag, shared URL)

Page 61: [系列活動] 資料探勘速遊

Language Detection

▷To detect the language (or possible languages) in which the specified text is written

▷Difficulties

•Short message

•Different languages in one statement

•Noisy

61

你好 現在幾點鐘apa kabar sekarang jam berapa ?

Traditional Chinese (zh-tw) / Indonesian (id)

Page 62: [系列活動] 資料探勘速遊

Wrong Detection Examples

▷Twitter examples

62

@sayidatynet top song #LailaGhofran

shokran ya garh new album #listen

中華隊的服裝挺特別的,好藍。。。#ChineseTaipei #Sochi #2014冬奧

授業前の雪合戦w http://t.co/d9b5peaq7J

Before / after removing noise

en -> id

it -> zh-tw

en -> ja

Page 63: [系列活動] 資料探勘速遊

Removing Noise

▷Removing noise before detection•Html file ->tags

•Twitter -> hashtag, mention, URL

63

<meta name=\"twitter:description\"content=\"觸犯法國隱私法〔駐歐洲特派記者胡蕙寧、國際新聞中心/綜合報導〕網路搜 尋 引 擎 巨 擘 Google8 日 在 法 文 版 首 頁(www.google.fr)張貼悔過書 ...\"/>

觸犯法國隱私法〔駐歐洲特派記者胡蕙寧、國際新聞中心/綜合報導〕網路搜尋引擎巨擘Google8日在法文版首頁(www.google.fr)張貼悔過書 ...

Detected as English (en) before removing the tags; Traditional Chinese (zh-tw) after

Page 64: [系列活動] 資料探勘速遊

Data Cleaning

▷Special character

▷Utilize regular expressions to clean data

64

Special characters to remove:
• Unicode emoticons: ☺, ♥ …
• Symbol icons: ☏, ✉ …
• Currency symbols: €, £, $ ...
Example text: ◕‿◕ Friendship is everything ♥ ✉ [email protected]

Tweet URLs, e.g. "I added a video to a @YouTube playlist http://t.co/ceYX62StGO Jamie Riepe":
regular expression (^|\\s*)http(\\S+)?(\\s*|$)

Filter out everything that is not a letter, space, punctuation, or digit:
regular expression (\\p{L}+)|(\\p{Z}+)|(\\p{Punct}+)|(\\p{Digit}+)
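A minimal Python sketch of this cleaning step (assuming the third-party regex package, which, unlike the built-in re module, supports the \p{...} classes used on the slide):

import regex  # pip install regex; the patterns below mirror the ones on this slide

def clean_tweet(text):
    # Drop URLs using the slide's URL pattern
    text = regex.sub(r'(^|\s*)http(\S+)?(\s*|$)', ' ', text)
    # Keep only letters, separators (spaces), punctuation, and digits;
    # everything else (emoticons, symbol icons, currency marks, ...) is dropped
    pieces = regex.findall(r'(\p{L}+)|(\p{Z}+)|(\p{Punct}+)|(\p{Digit}+)', text)
    return ''.join(p for groups in pieces for p in groups if p).strip()

print(clean_tweet("I added a video to a @YouTube playlist http://t.co/ceYX62StGO Jamie Riepe"))
print(clean_tweet("◕‿◕ Friendship is everything ♥ ✉ [email protected]"))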

Page 65: [系列活動] 資料探勘速遊

Japanese Examples

▷Use regular expression remove all

special words

•うふふふふ(*^^*)楽しむ!ありがとうございま

す^o^ アイコン、ラブラブ(-_-)♡

•うふふふふ楽しむありがとうございます ア

イコンラブラブ

65

\\W

Page 66: [系列活動] 資料探勘速遊

Part-of-speech (POS) Tagging

▷Processing text and assigning parts of

speech to each word

▷Twitter POS tagging

•Noun (N), Adjective (A), Verb (V), URL (U)…

66

Happy Easter! I went to work and came home to an empty house now im

going for a quick run http://t.co/Ynp0uFp6oZ

Happy_A Easter_N !_, I_O went_V to_P work_N and_& came_V home_N

to_P an_D empty_A house_N now_R im_L going_V for_P a_D quick_A

run_N http://t.co/Ynp0uFp6oZ_U

Page 67: [系列活動] 資料探勘速遊

Stemming

▷@DirtyDTran gotta be caught up for

tomorrow nights episode

▷@ASVP_Jaykey for some reasons I found

this very amusing

67

• @DirtyDTran gotta be catch up for tomorrow night episode

• @ASVP_Jaykey for some reason I find this very amusing

RT @kt_biv : @caycelynnn loving and missing you! we are

still looking for Lucy

love miss be

look

Page 68: [系列活動] 資料探勘速遊

Hashtag Segmentation

▷By using Microsoft Web N-Gram Service

(or by using Viterbi algorithm)

68

#pray #for #boston

Wow! explosion at a boston race ... #prayforboston

#citizenscience

#bostonmarathon

#goodthingsarecoming

#lowbloodpressure

#citizen #science

#boston #marathon

#good #things #are #coming

#low #blood #pressure

Page 69: [系列活動] 資料探勘速遊

More Preprocesses for Different Web

Data

▷Extract source code without javascript

▷Removing html tags

69

Page 70: [系列活動] 資料探勘速遊

Extract Source Code Without Javascript

▷Javascript code should be considered as an exception

• it may contain hidden content

70

Page 71: [系列活動] 資料探勘速遊

Remove Html Tags

▷Removing html tags to extract meaningful content

71

Page 72: [系列活動] 資料探勘速遊

More Preprocesses for Different Languages

▷Chinese Simplified/Traditional Conversion

▷Word segmentation

72

Page 73: [系列活動] 資料探勘速遊

Chinese Simplified/Traditional Conversion

▷Word conversion• 请乘客从后门落车 → 請乘客從後門下車

▷One-to-many mapping• @shinrei 出去旅游还是崩坏 → @shinrei 出去旅游還是崩壞

游 (zh-cn) → 游|遊 (zh-tw)

▷Wrong segmentation• 人体内存在很多微生物 → 內存: 人體 記憶體 在很多微生物

→ 存在: 人體內 存在 很多微生物

73

內存|存在

Page 74: [系列活動] 資料探勘速遊

Wrong Chinese Word Segmentation

▷Wrong segmentation• 這(Nep) 地面(Nc) 積(VJ) 還(D) 真(D) 不(D) 小(VH) http://t.co/QlUbiaz2Iz

▷Wrong word• @iamzeke 實驗(Na) 室友(Na) 多(Dfa) 危險(VH) 你(Nh) 不(D) 知道(VK) 嗎

(T) ?

▷Wrong order

• 人體(Na) 存(VC) 內在(Na) 很多(Neqa) 微生物(Na)

▷Unknown word

• 半夜(Nd) 逛團(Na) 購(VC) 看到(VE) 太(Dfa) 吸引人(VH) !!

74

地面|面積

實驗室|室友

存在|內在

未知詞:團購

Page 75: [系列活動] 資料探勘速遊

Similarity and Dissimilarity

To like or not to like

75

Page 76: [系列活動] 資料探勘速遊

Similarity and Dissimilarity

▷Similarity

• Numerical measure of how alike two data objects are.

• Is higher when objects are more alike.

• Often falls in the range [0,1]

▷Dissimilarity

• Numerical measure of how different two data objects are

• Lower when objects are more alike

• Minimum dissimilarity is often 0

• Upper limit varies

76

Page 77: [系列活動] 資料探勘速遊

Euclidean Distance

Where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.

▷Standardization is necessary, if scales differ.

77

$dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$

Page 78: [系列活動] 資料探勘速遊

Minkowski Distance

▷Minkowski Distance is a generalization of Euclidean Distance

Where r is a parameter, n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.

78

$dist = \left(\sum_{k=1}^{n} |p_k - q_k|^{r}\right)^{1/r}$

Note: this distance is extremely sensitive to the scales of the variables involved
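A small plain-Python sketch of these two distances (r = 2 recovers the Euclidean case):

def minkowski(p, q, r):
    # dist = (sum_k |p_k - q_k|^r)^(1/r); r = 1 is Manhattan, r = 2 is Euclidean
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1.0 / r)

def euclidean(p, q):
    return minkowski(p, q, 2)

p, q = (0, 2), (2, 0)
print(euclidean(p, q))        # 2.828...
print(minkowski(p, q, 1))     # 4.0 (Manhattan / city-block distance)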

Page 79: [系列活動] 資料探勘速遊

Mahalanobis Distance

▷Mahalanobis distance measure:

•Transforms the variables into uncorrelated variables

•Makes their variances equal to 1

•Then calculates the simple Euclidean distance

79

$d(x, y) = \sqrt{(x - y)^{T}\, S^{-1}\, (x - y)}$

S is the covariance matrix of the input data

Page 80: [系列活動] 資料探勘速遊

Similarity Between Binary Vectors

▷ Common situation is that objects, p and q, have only binary attributes

▷ Compute similarities using the following quantitiesM01 = the number of attributes where p was 0 and q was 1

M10 = the number of attributes where p was 1 and q was 0

M00 = the number of attributes where p was 0 and q was 0

M11 = the number of attributes where p was 1 and q was 1

▷ Simple Matching and Jaccard Coefficients

SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values = (M11) / (M01 + M10 + M11)

80
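A short plain-Python sketch of SMC and Jaccard, counting M11, M10, M01, M00 exactly as defined above:

def binary_similarities(p, q):
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    smc = (m11 + m00) / (m11 + m10 + m01 + m00)                 # matches / all attributes
    jaccard = m11 / (m11 + m10 + m01) if (m11 + m10 + m01) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarities(p, q))   # SMC = 0.7, Jaccard = 0.0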

Page 81: [系列活動] 資料探勘速遊

Cosine Similarity

▷ If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d.

▷ Example:

d1 = 3 2 0 5 0 0 0 2 0 0 d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150

81
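A quick plain-Python check of the worked example above (same d1 and d2):

import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))   # 0.315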

Page 82: [系列活動] 資料探勘速遊

Correlation

▷Correlation measures the linear relationship between objects

▷ To compute correlation, we standardize data objects, p and q,

and then take their dot product

82

$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$

$q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$

$\mathrm{correlation}(p, q) = p' \cdot q'$

Page 83: [系列活動] 資料探勘速遊

Using Weights to Combine Similarities

▷May not want to treat all attributes the same.• Use weights wk which are between 0 and 1 and sum to 1.

83

Page 84: [系列活動] 資料探勘速遊

Density

▷Density-based clustering requires a notion of

density

▷Examples:• Euclidean density

→ Euclidean density = number of points per unit volume

• Probability density

• Graph-based density

84

Page 85: [系列活動] 資料探勘速遊

Data Exploration

Seeing is believing

Page 86: [系列活動] 資料探勘速遊

Data Exploration

▷A preliminary exploration of the data to better

understand its characteristics

▷Key motivations of data exploration include

• Helping to select the right tool for preprocessing or

analysis

• Making use of humans’ abilities to recognize

patterns

• People can recognize patterns not captured by

data analysis tools

86

Page 87: [系列活動] 資料探勘速遊

Summary Statistics

▷Summary statistics are numbers that

summarize properties of the data

•Summarized properties include frequency,

location and spread

→ Examples: location - mean

spread - standard deviation

•Most summary statistics can be calculated in

a single pass through the data

87

Page 88: [系列活動] 資料探勘速遊

Frequency and Mode

▷Given a set of unordered categorical values

→ Computing the frequency with which each value occurs is the easiest way

▷The mode of a categorical attribute

• The attribute value that has the highest frequency

88

$\mathrm{frequency}(v_i) = \dfrac{\text{number of objects with attribute value } v_i}{m}$

Page 89: [系列活動] 資料探勘速遊

Percentiles

▷For ordered data, the notion of a percentile is

more useful

▷Given• An ordinal or continuous attribute x

• A number p between 0 and 100

▷The pth percentile xp is a value of x• p% of the observed values of x are less than xp

89

Page 90: [系列活動] 資料探勘速遊

Measures of Location: Mean and Median

▷The mean is the most common measure of the

location of a set of points. • However, the mean is very sensitive to outliers.

• Thus, the median or a trimmed mean is also commonly used

90

Page 91: [系列活動] 資料探勘速遊

Measures of Spread: Range and Variance

▷Range is the difference between the max and min

▷The variance or standard deviation is the most

common measure of the spread of a set of points.

▷However, this is also sensitive to outliers, so that

other measures are often used

91

Page 92: [系列活動] 資料探勘速遊

Visualization

Visualization is the conversion of data into a visual or tabular format

▷Visualization of data is one of the most powerful and appealing techniques for data exploration. • Humans have a well developed ability to analyze large

amounts of information that is presented visually

• Can detect general patterns and trends

• Can detect outliers and unusual patterns

92

Page 93: [系列活動] 資料探勘速遊

Arrangement

▷Is the placement of visual elements within a display

▷Can make a large difference in how easy it is to

understand the data

▷Example

93

Page 94: [系列活動] 資料探勘速遊

Visualization Techniques: Histograms

▷Histogram

• Usually shows the distribution of values of a single variable

• Divide the values into bins and show a bar plot of the number of

objects in each bin.

• The height of each bar indicates the number of objects

• Shape of histogram depends on the number of bins

▷Example: Petal Width (10 and 20 bins, respectively)

94

Page 95: [系列活動] 資料探勘速遊

Visualization Techniques: Box Plots

▷Another way of displaying the distribution of data • The following figure shows the basic parts of a box plot

95

Page 96: [系列活動] 資料探勘速遊

Scatter Plot Array

96

Page 97: [系列活動] 資料探勘速遊

Visualization Techniques: Contour Plots

▷Contour plots • Partition the plane into regions of similar values

• The contour lines that form the boundaries of these regions

connect points with equal values

• The most common example is contour maps of elevation

• Can also display temperature, rainfall, air pressure, etc.

97

Celsius

Sea Surface Temperature (SST)

Page 98: [系列活動] 資料探勘速遊

Visualization of the Iris Data Matrix

98

standard

deviation

Page 99: [系列活動] 資料探勘速遊

Visualization of the Iris Correlation Matrix

99

Page 100: [系列活動] 資料探勘速遊

Visualization Techniques: Star Plots

▷Similar approach to parallel coordinates• One axis for each attribute

▷ The size and the shape of the polygon gives a visual description of the attribute values of the object

100

(Star plot axes: petal length, sepal length, sepal width, petal width)

Page 101: [系列活動] 資料探勘速遊

Visualization Techniques: Chernoff Faces

▷This approach associates each attribute with a characteristic of a face▷The values of each attribute determine the

appearance of the corresponding facial characteristic

▷Each object becomes a separate face

101

Data Feature Facial Feature

Sepal length Size of face

Sepal width Forehead/jaw relative arc length

Petal length Shape of forehead

Petal width Shape of jaw

Page 102: [系列活動] 資料探勘速遊

Do's and Don'ts

▷ Apprehension
• Correctly perceive relations among variables
▷ Clarity
• Visually distinguish all the elements of a graph
▷ Consistency
• Interpret a graph based on similarity to previous graphs
▷ Efficiency
• Portray a possibly complex relation in as simple a way as possible
▷ Necessity
• The need for the graph, and the graphical elements
▷ Truthfulness
• Determine the true value represented by any graphical element

102

Page 103: [系列活動] 資料探勘速遊

Data Mining Techniques

Yi-Shin Chen

Institute of Information Systems and Applications

Department of Computer Science

National Tsing Hua University

[email protected]

Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation

Page 104: [系列活動] 資料探勘速遊

Overview

Understand the objectivities

104

Page 105: [系列活動] 資料探勘速遊

Tasks in Data Mining

▷Problems should be well defined at the beginning

▷Two categories of tasks [Fayyad et al., 1996]

105

Predictive Tasks

• Predict unknown values

• e.g., potential customers

Descriptive Tasks

• Find patterns to describe data

• e.g., Friendship finding

VIPCheap

Potential

Page 106: [系列活動] 資料探勘速遊

Select Techniques

▷Problems could be further decomposed

106

Predictive Tasks

• Classification

• Ranking

• Regression

• …

Descriptive Tasks

• Clustering

• Association rules

• Summarization

• …

Supervised

Learning

Unsupervised

Learning

Page 107: [系列活動] 資料探勘速遊

Supervised vs. Unsupervised Learning

▷Supervised learning

• Supervision: The training data (observations, measurements,

etc.) are accompanied by labels indicating the class of the

observations

• New data is classified based on the training set

▷Unsupervised learning

• The class labels of training data are unknown

• Given a set of measurements, observations, etc. with the aim of

establishing the existence of classes or clusters in the data

107

Page 108: [系列活動] 資料探勘速遊

Classification

▷Given a collection of records (training set )

• Each record contains a set of attributes

• One of the attributes is the class

▷ Find a model for class attribute:

• The model forms a function of the values of other attributes

▷Goal: previously unseen records should be assigned a class as

accurately as possible.

• A test set is needed

→ To determine the accuracy of the model

▷Usually, the given data set is divided into training & test• With training set used to build the model

• With test set used to validate it

108

Page 109: [系列活動] 資料探勘速遊

Ranking

▷Produce a permutation of the items in a list
• Items ranked in higher positions should be more important
• E.g., rank webpages in a search engine: webpages in higher positions are more relevant.

109

Page 110: [系列活動] 資料探勘速遊

Regression

▷Find a function which models the data with the least error• The output might be a numerical value

• E.g.: Predict the stock value

110

Page 111: [系列活動] 資料探勘速遊

Clustering

▷Group data into clusters• Similar to the objects within the same cluster

• Dissimilar to the objects in other clusters

• No predefined classes (unsupervised classification)

111

Page 112: [系列活動] 資料探勘速遊

Association Rule Mining

▷Basic concept• Given a set of transactions

• Find rules that will predict the occurrence of an item

• Based on the occurrences of other items in the transaction

112

Page 113: [系列活動] 資料探勘速遊

Summarization

▷Provide a more compact representation of the data• Data: Visualization

• Text – Document Summarization

→ E.g.: Snippet

113

Page 114: [系列活動] 資料探勘速遊

Classification

114

Page 115: [系列活動] 資料探勘速遊

Illustrating Classification Task

115

Apply

Model

Induction

Deduction

Learn

Model

Model

Tid Attrib1 Attrib2 Attrib3 Class

1 Yes Large 125K No

2 No Medium 100K No

3 No Small 70K No

4 Yes Medium 120K No

5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No

8 No Small 85K Yes

9 No Medium 75K No

10 No Small 90K Yes

Tid Attrib1 Attrib2 Attrib3 Class

11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?

14 No Small 95K ?

15 No Large 67K ?

Test Set

Learning

algorithm

Training Set

Page 116: [系列活動] 資料探勘速遊

Decision Tree

116

Tid Refund MaritalStatus

TaxableIncome Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes

(Column types: categorical, categorical, continuous, class)

© Tan, Steinbach, Kumar Introduction to Data Mining

Model: Decision Tree (splitting attributes learned from the training data above)
• Refund = Yes → NO; Refund = No → test MarSt
• MarSt = Married → NO; MarSt = Single or Divorced → test TaxInc
• TaxInc < 80K → NO; TaxInc > 80K → YES

There could be more than one tree that fits the same data!

Page 117: [系列活動] 資料探勘速遊

Algorithm for Decision Tree Induction

▷Basic algorithm (a greedy algorithm)

• Tree is constructed in a top-down recursive divide-and-conquer

manner

• At start, all the training examples are at the root

• Attributes are categorical (if continuous-valued, they are

discretized in advance)

• Examples are partitioned recursively based on selected

attributes

• Test attributes are selected on the basis of a heuristic or

statistical measure (e.g., information gain)

Data Mining 117
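Not from the slides: a minimal scikit-learn sketch of greedy, entropy-based tree induction on the Refund / Marital Status / Taxable Income training data from the previous slide (categorical attributes are one-hot encoded, since the library's trees split on numeric values):

from sklearn.tree import DecisionTreeClassifier, export_text
import pandas as pd

data = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MarSt":   ["Single", "Married", "Single", "Married", "Divorced",
                "Married", "Divorced", "Single", "Married", "Single"],
    "TaxInc":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(data[["Refund", "MarSt", "TaxInc"]])  # one-hot encode categoricals
y = data["Cheat"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))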

Page 118: [系列活動] 資料探勘速遊

Tree Induction

▷Greedy strategy.

• Split the records based on an attribute test that optimizes

certain criterion.

▷Issues

• Determine how to split the records

→ How to specify the attribute test condition?

→ How to determine the best split?

• Determine when to stop splitting

118© Tan, Steinbach, Kumar Introduction to Data Mining

Page 119: [系列活動] 資料探勘速遊

The Problem Of Decision Tree

119

The decision tree has a hard time with correlated attributes

(Figure panels: "Deep Bushy Tree", "Deep Bushy Tree", "Useless": scatter plots of two correlated attributes on which axis-parallel splits produce deep, bushy trees)

Page 120: [系列活動] 資料探勘速遊

Advantages/Disadvantages of Decision Trees

▷Advantages:

• Easy to understand

• Easy to generate rules

▷Disadvantages:

• May suffer from overfitting.

• Classifies by rectangular partitioning (so does not handle

correlated features very well).

• Can be quite large – pruning is necessary.

• Does not handle streaming data easily

120

Page 121: [系列活動] 資料探勘速遊

Underfitting and Overfitting

121Underfitting: when model is too simple, both training and test errors are large

© Tan, Steinbach, Kumar Introduction to Data Mining

Page 122: [系列活動] 資料探勘速遊

Overfitting due to Noise

122© Tan, Steinbach, Kumar Introduction to Data Mining

Decision boundary is distorted by noise point

Page 123: [系列活動] 資料探勘速遊

Overfitting due to Insufficient Examples

123© Tan, Steinbach, Kumar Introduction to Data Mining

Lack of data points in the lower half of the diagram makes it difficult to

predict correctly the class labels of that region

- Insufficient number of training records in the region causes the decision

tree to predict the test examples using other training records that are

irrelevant to the classification task

Page 124: [系列活動] 資料探勘速遊

Bayes Classifier

▷A probabilistic framework for solving classification

problems

▷Conditional Probability:

▷ Bayes theorem:

124
© Tan, Steinbach, Kumar Introduction to Data Mining

Conditional probability: $P(C \mid A) = \dfrac{P(A, C)}{P(A)}$,  $P(A \mid C) = \dfrac{P(A, C)}{P(C)}$

Bayes theorem: $P(C \mid A) = \dfrac{P(A \mid C)\, P(C)}{P(A)}$

Page 125: [系列活動] 資料探勘速遊

Bayesian Classifiers

▷Consider each attribute and class label as random

variables

▷Given a record with attributes (A1, A2,…,An)

• Goal is to predict class C

• Specifically, we want to find the value of C that maximizes

P(C| A1, A2,…,An )

▷Can we estimate P(C| A1, A2,…,An ) directly from

data?

125© Tan, Steinbach, Kumar Introduction to Data Mining

Page 126: [系列活動] 資料探勘速遊

Bayesian Classifier Approach

▷Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem

▷Choose value of C that maximizes P(C | A1, A2, …, An)

▷Equivalent to choosing value of C that maximizesP(A1, A2, …, An|C) P(C)

▷How to estimate P(A1, A2, …, An | C )?

126© Tan, Steinbach, Kumar Introduction to Data Mining

$P(C \mid A_1, A_2, \ldots, A_n) = \dfrac{P(A_1, A_2, \ldots, A_n \mid C)\, P(C)}{P(A_1, A_2, \ldots, A_n)}$

Page 127: [系列活動] 資料探勘速遊

Naïve Bayes Classifier

▷A simplified assumption: attributes are conditionally

independent and each data sample has n attributes

▷No dependence relation between attributes

▷By Bayes theorem,

▷As P(X) is constant for all classes, assign X to the

class with maximum P(X|Ci)*P(Ci)

127

$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$

$P(C_i \mid X) = \dfrac{P(X \mid C_i)\, P(C_i)}{P(X)}$
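A compact plain-Python sketch of these two equations for categorical attributes (the add-one smoothing is my own addition, not on the slide, to avoid zero probabilities):

from collections import Counter, defaultdict

def train_nb(records, labels):
    classes = Counter(labels)                              # class counts, giving P(Ci)
    counts = defaultdict(Counter)                          # counts[(k, Ci)][value]
    for x, c in zip(records, labels):
        for k, v in enumerate(x):
            counts[(k, c)][v] += 1
    return classes, counts

def predict_nb(x, classes, counts):
    total = sum(classes.values())
    best, best_score = None, 0.0
    for c, nc in classes.items():
        score = nc / total                                  # P(Ci)
        for k, v in enumerate(x):
            seen = counts[(k, c)]
            # P(x_k | Ci) with add-one smoothing (an assumption, not on the slide)
            score *= (seen[v] + 1) / (sum(seen.values()) + len(seen) + 1)
        if score > best_score:
            best, best_score = c, score
    return best

X = [("Sunny", "Hot"), ("Rain", "Mild"), ("Sunny", "Mild"), ("Rain", "Hot")]
y = ["No", "Yes", "Yes", "No"]
classes, counts = train_nb(X, y)
print(predict_nb(("Sunny", "Hot"), classes, counts))   # 'No' for this toy data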

Page 128: [系列活動] 資料探勘速遊

Naïve Bayesian Classifier: Comments

▷Advantages :

• Easy to implement

• Good results obtained in most of the cases

▷Disadvantages

• Assumption: class conditional independence

• Practically, dependencies exist among variables

→ E.g., hospitals: patients: Profile: age, family history etc

→ E.g., Symptoms: fever, cough etc., Disease: lung cancer, diabetes

etc

• Dependencies among these cannot be modeled by Naïve

Bayesian Classifier

▷How to deal with these dependencies?

• Bayesian Belief Networks

128

Page 129: [系列活動] 資料探勘速遊

Bayesian Networks

▷A Bayesian belief network allows a subset of the variables to be conditionally independent

▷A graphical model of causal relationships

• Represents dependency among the variables

• Gives a specification of joint probability distribution

Data Mining 129

Page 130: [系列活動] 資料探勘速遊

Bayesian Belief Network: An Example

130

Family

History

LungCancer

PositiveXRay

Smoker

Emphysema

Dyspnea

LC

~LC

(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)

0.8

0.2

0.5

0.5

0.7

0.3

0.1

0.9

Bayesian Belief Networks

The conditional probability table for the variable

LungCancer:

Shows the conditional probability for each

possible combination of its parents

n

i

ZParents iziPznzP

1

))(|(),...,1(

Page 131: [系列活動] 資料探勘速遊

Neural Networks

▷Artificial neuron

• Each input is multiplied by a weighting factor.

• Output is 1 if sum of weighted inputs exceeds a threshold

value; 0 otherwise

▷Network is programmed by adjusting weights using

feedback from examples

131

Page 132: [系列活動] 資料探勘速遊

General Structure

Data Mining 132

(Figure: the input vector x_i feeds the input nodes; weights w_ij connect them to hidden nodes, which feed the output nodes that produce the output vector)

$I_j = \sum_i w_{ij} O_i + \theta_j$

$O_j = \dfrac{1}{1 + e^{-I_j}}$

$Err_j = O_j (1 - O_j)(T_j - O_j)$  (for an output node)

$Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$  (for a hidden node)

$w_{ij} = w_{ij} + (l)\, Err_j\, O_i$

$\theta_j = \theta_j + (l)\, Err_j$

Page 133: [系列活動] 資料探勘速遊

Network Training

▷The ultimate objective of training • Obtain a set of weights that makes almost all the tuples in

the training data classified correctly

▷Steps• Initialize weights with random values

• Feed the input tuples into the network one by one

• For each unit

→ Compute the net input to the unit as a linear combination of

all the inputs to the unit

→ Compute the output value using the activation function

→ Compute the error

→ Update the weights and the bias

133

Page 134: [系列活動] 資料探勘速遊

Summary of Neural Networks

▷Advantages

• Prediction accuracy is generally high

• Robust, works when training examples contain errors

• Fast evaluation of the learned target function

▷Criticism

• Long training time

• Difficult to understand the learned function (weights)

• Not easy to incorporate domain knowledge

134

Page 135: [系列活動] 資料探勘速遊

The k-Nearest Neighbor Algorithm

▷All instances correspond to points in the n-D space.

▷The nearest neighbor are defined in terms of

Euclidean distance.

▷The target function could be discrete- or real-

valued.

▷For discrete-valued, the k-NN returns the most

common value among the k training examples

nearest to xq.

135

(Figure: a query point xq surrounded by + and − training examples; its label is decided by its k nearest neighbors)
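A tiny plain-Python sketch of discrete-valued k-NN as described above (majority vote among the k nearest training points under Euclidean distance):

import math
from collections import Counter

def knn_predict(query, points, labels, k=3):
    # Sort training points by Euclidean distance to the query point
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    # Majority vote among the k nearest neighbors
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["+", "+", "+", "-", "-", "-"]
print(knn_predict((2, 2), points, labels, k=3))   # '+'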

Page 136: [系列活動] 資料探勘速遊

Discussion on the k-NN Algorithm

▷Distance-weighted nearest neighbor algorithm

• Weight the contribution of each of the k neighbors

according to their distance to the query point xq

→ Giving greater weight to closer neighbors

▷Curse of dimensionality: distance between

neighbors could be dominated by irrelevant

attributes.

• To overcome it, elimination of the least relevant attributes.

136

Page 137: [系列活動] 資料探勘速遊

Association Rule Mining

Page 138: [系列活動] 資料探勘速遊

Definition: Frequent Itemset

▷ Itemset: A collection of one or more items

• Example: {Milk, Bread, Diaper}

▷ k-itemset

• An itemset that contains k items

▷Support count (σ)

• Frequency of occurrence of an itemset

• E.g. σ({Milk, Bread, Diaper}) = 2

▷Support

• Fraction of transactions that contain an itemset

• E.g. s({Milk, Bread, Diaper}) = 2/5

▷ Frequent Itemset

• An itemset whose support is greater than or

equal to a minsup threshold

138© Tan, Steinbach, Kumar Introduction to Data Mining

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Page 139: [系列活動] 資料探勘速遊

Definition: Association Rule

139

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

© Tan, Steinbach, Kumar Introduction to Data Mining

Example: {Milk, Diaper} → {Beer}

$s = \dfrac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \dfrac{2}{5} = 0.4$

$c = \dfrac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \dfrac{2}{3} = 0.67$

Market-Basket transactions:

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke
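A small plain-Python sketch that recomputes the support and confidence above from the five market-basket transactions (itemsets are represented as sets):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    # sigma(itemset): number of transactions containing the itemset
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(lhs, rhs):
    both = support_count(lhs | rhs)
    s = both / len(transactions)        # support of the rule
    c = both / support_count(lhs)       # confidence of the rule
    return s, c

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}))   # (0.4, 0.666...)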

Page 140: [系列活動] 資料探勘速遊

Strong Rules & Interesting

140

▷Corr(A,B) = P(A∪B) / (P(A)P(B))
• Corr(A, B) = 1: A & B are independent
• Corr(A, B) < 1: occurrence of A is negatively correlated with B
• Corr(A, B) > 1: occurrence of A is positively correlated with B

▷E.g. Corr(games, videos) = 0.4 / (0.6 × 0.75) = 0.89

• In fact, games & videos are negatively associated
→ Purchase of one actually decreases the likelihood of purchasing the other

(Example data: 10,000 transactions in total; 6,000 contain games, 7,500 contain videos, 4,000 contain both)

Page 141: [系列活動] 資料探勘速遊

Clustering Analysis

Page 142: [系列活動] 資料探勘速遊

Good Clustering

▷Good clustering (produce high quality clusters)

• Intra-cluster similarity is high

• Inter-cluster class similarity is low

▷Quality factors

• Similarity measure and its implementation

• Definition and representation of cluster chosen

• Clustering algorithm

142

Page 143: [系列活動] 資料探勘速遊

Types of Clusters: Well-Separated

▷Well-Separated clusters: • A cluster is a set of points such that any point in a cluster

is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

143© Tan, Steinbach, Kumar Introduction to Data Mining

3 well-separated clusters

Page 144: [系列活動] 資料探勘速遊

Types of Clusters: Center-Based

▷Center-based• A cluster is a set of objects such that an object in a cluster

is closer (more similar) to the “center” of a cluster• The center of a cluster is often a centroid, the average of

all the points in the cluster, or a medoid, the most “representative” point of a cluster

144© Tan, Steinbach, Kumar Introduction to Data Mining

4 center-based clusters

Page 145: [系列活動] 資料探勘速遊

Types of Clusters: Contiguity-Based

▷Contiguous cluster (Nearest neighbor or transitive)• A cluster is a set of points such that a point in a cluster is

closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

145© Tan, Steinbach, Kumar Introduction to Data Mining

Page 146: [系列活動] 資料探勘速遊

Types of Clusters: Density-Based

▷Density-based• A cluster is a dense region of points, which is separated

by low-density regions, from other regions of high density. • Used when the clusters are irregular or intertwined, and

when noise and outliers are present.

146© Tan, Steinbach, Kumar Introduction to Data Mining

Page 147: [系列活動] 資料探勘速遊

Types of Clusters: Objective Function

▷Clusters defined by an objective function

• Finds clusters that minimize or maximize an objective

function.

• Naïve approaches:

→ Enumerate all possible ways

→ Evaluate the `goodness' of each potential set of clusters

→NP Hard

• Can have global or local objectives.

→ Hierarchical clustering algorithms typically have local

objectives

→ Partitioned algorithms typically have global objectives

147© Tan, Steinbach, Kumar Introduction to Data Mining

Page 148: [系列活動] 資料探勘速遊

Partitioning Algorithms: Basic Concept

▷Given a k, find a partition of k clusters that optimizes

the chosen partitioning criterion

• Global optimal: exhaustively enumerate all partitions.

• Heuristic methods.

→ k-means: each cluster is represented by the center of the

cluster

→ k-medoids or PAM (Partition Around Medoids) : each

cluster is represented by one of the objects in the cluster.

148

Page 149: [系列活動] 資料探勘速遊

K-Means Clustering Algorithm

▷Algorithm:

• Randomly initialize k cluster means

• Iterate:

→ Assign each object to the nearest cluster mean

→ Recompute cluster means

• Stop when clustering converges

149

K=4
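A minimal NumPy sketch of this loop (illustrative only, not the slides' code; cluster means are recomputed until the assignments stop changing them):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial means
    for _ in range(n_iter):
        # Assign each point to the nearest cluster mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute cluster means; stop when they no longer move
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

X = np.vstack([np.random.default_rng(1).normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (3, 0), (0, 3), (3, 3)]])
centers, assign = kmeans(X, k=4)
print(centers)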

Page 150: [系列活動] 資料探勘速遊

Two different K-means Clusterings

150© Tan, Steinbach, Kumar Introduction to Data Mining

(Figures: the same original points clustered two ways by K-means: an optimal clustering and a sub-optimal clustering)

Page 151: [系列活動] 資料探勘速遊

Solutions to Initial Centroids Problem

▷Multiple runs

• Helps, but probability is not on your side

▷Sample and use hierarchical clustering to determine initial

centroids

▷Select more than k initial centroids and then select among

these initial centroids

• Select most widely separated

▷Postprocessing

▷Bisecting K-means

• Not as susceptible to initialization issues

151© Tan, Steinbach, Kumar Introduction to Data Mining

Page 152: [系列活動] 資料探勘速遊

Bisecting K-means

▷ Bisecting K-means algorithm• Variant of K-means that can produce a partitioned or a

hierarchical clustering

152© Tan, Steinbach, Kumar Introduction to Data Mining

Page 153: [系列活動] 資料探勘速遊

Bisecting K-means Example

153

K=4

Page 154: [系列活動] 資料探勘速遊

Bisecting K-means Example

154

K=4

Produce a hierarchical clustering based on the sequence of

clusterings produced

Page 155: [系列活動] 資料探勘速遊

Limitations of K-means: Differing Sizes

155© Tan, Steinbach, Kumar Introduction to Data Mining

Original Points K-means (3 Clusters)

One solution is to use many clusters.

Find parts of clusters, but need to put together

K-means (10 Clusters)

Page 156: [系列活動] 資料探勘速遊

Limitations of K-means: Differing Density

156© Tan, Steinbach, Kumar Introduction to Data Mining

Original Points K-means (3 Clusters)

One solution is to use many clusters.

Find parts of clusters, but need to put together

K-means (10 Clusters)

Page 157: [系列活動] 資料探勘速遊

Limitations of K-means: Non-globular Shapes

157© Tan, Steinbach, Kumar Introduction to Data Mining

Original Points K-means (2 Clusters)

One solution is to use many clusters.

Find parts of clusters, but need to put together

K-means (10 Clusters)

Page 158: [系列活動] 資料探勘速遊

Hierarchical Clustering

▷Produces a set of nested clusters organized as a

hierarchical tree

▷Can be visualized as a dendrogram

• A tree like diagram that records the sequences of merges

or splits

158© Tan, Steinbach, Kumar Introduction to Data Mining

(Figures: a dendrogram over six points and the corresponding nested clusters)

Page 159: [系列活動] 資料探勘速遊

Strengths of Hierarchical Clustering

▷Do not have to assume any particular number of clusters• Any desired number of clusters can be obtained by

‘cutting’ the dendrogram at the proper level

▷They may correspond to meaningful taxonomies• Example in biological sciences (e.g., animal kingdom,

phylogeny reconstruction, …)

159© Tan, Steinbach, Kumar Introduction to Data Mining

Page 160: [系列活動] 資料探勘速遊

Density-Based Clustering

▷Clustering based on density (local cluster criterion),

such as density-connected points

▷Each cluster has a considerably higher density of

points than outside of the cluster

160

Page 161: [系列活動] 資料探勘速遊

Density-Based Clustering Methods

▷Major features:

• Discover clusters of arbitrary shape

• Handle noise

• One scan

• Need density parameters as termination condition

▷Approaches

• DBSCAN (KDD’96)

• OPTICS (SIGMOD’99).

• DENCLUE (KDD’98)

• CLIQUE (SIGMOD’98)

Data Mining 161

Page 162: [系列活動] 資料探勘速遊

DBSCAN

▷Density = number of points within a specified radius

(Eps)

▷A point is a core point if it has more than a specified

number of points (MinPts) within Eps

• These are points that are at the interior of a cluster

▷A border point has fewer than MinPts within Eps, but

is in the neighborhood of a core point

▷A noise point is any point that is not a core point or a

border point.

162© Tan, Steinbach, Kumar Introduction to Data Mining
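A compact NumPy sketch of the core / border / noise labeling described above (illustrative; a full DBSCAN would then grow clusters outward from the core points):

import numpy as np

def label_points(X, eps=10.0, min_pts=4):
    n = len(X)
    # Pairwise Euclidean distances and eps-neighborhood counts (point itself included)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = d <= eps
    counts = neighbors.sum(axis=1)

    labels = np.full(n, "noise", dtype=object)
    core = counts > min_pts                 # "more than MinPts points within Eps"
    labels[core] = "core"
    # A border point is not core but lies in the neighborhood of some core point
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    labels[border] = "border"
    return labels

X = np.random.default_rng(0).uniform(0, 100, size=(200, 2))
print(dict(zip(*np.unique(label_points(X, eps=10, min_pts=4), return_counts=True))))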

Page 163: [系列活動] 資料探勘速遊

DBSCAN: Core, Border, and Noise Points

163© Tan, Steinbach, Kumar Introduction to Data Mining

Page 164: [系列活動] 資料探勘速遊

DBSCAN Examples

164© Tan, Steinbach, Kumar Introduction to Data Mining

Original Points Point types: core,

border and noise

Eps = 10, MinPts = 4

Page 165: [系列活動] 資料探勘速遊

When DBSCAN Works Well

165© Tan, Steinbach, Kumar Introduction to Data Mining

Original Points Clusters

• Resistant to Noise

• Can handle clusters of different shapes and sizes

Page 166: [系列活動] 資料探勘速遊

Recap: Data Mining Techniques

166

Predictive Tasks

• Classification

• Ranking

• Regression

• …

Descriptive Tasks

• Clustering

• Association rules

• Summarization

• …

Page 167: [系列活動] 資料探勘速遊

Evaluation

Yi-Shin Chen

Institute of Information Systems and Applications

Department of Computer Science

National Tsing Hua University

[email protected]

Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation

Page 168: [系列活動] 資料探勘速遊

Tasks in Data Mining

▷Problems should be well defined at the beginning

▷Two categories of tasks [Fayyad et al., 1996]

168

Predictive Tasks

• Predict unknown values

• e.g., potential customers

Descriptive Tasks

• Find patterns to describe data

• e.g., Friendship finding

VIPCheap

Potential

Page 169: [系列活動] 資料探勘速遊

For Predictive Tasks

169

Page 170: [系列活動] 資料探勘速遊

Metrics for Performance Evaluation

170© Tan, Steinbach, Kumar Introduction to Data Mining

Focus on the predictive capability of a model

Confusion Matrix:

PREDICTED CLASS

ACTUAL

CLASS

Class=Yes Class=No

Class=Yes a b

Class=No c d

a: TP (true positive)

b: FN (false negative)

c: FP (false positive)

d: TN (true negative)

Page 171: [系列活動] 資料探勘速遊

Metrics for Performance Evaluation

171© Tan, Steinbach, Kumar Introduction to Data Mining

Most widely-used metric:

PREDICTED CLASS

ACTUAL

CLASS

Class=Yes Class=No

Class=Yes a

(TP)

b

(FN)

Class=No c

(FP)

d

(TN)

$\text{Accuracy} = \dfrac{a + d}{a + b + c + d} = \dfrac{TP + TN}{TP + TN + FP + FN}$
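A tiny plain-Python sketch of the accuracy computation (the call reproduces the imbalanced example on the next slide):

def accuracy(tp, fn, fp, tn):
    # (a + d) / (a + b + c + d) in the notation of the confusion matrix above
    return (tp + tn) / (tp + fn + fp + tn)

# A model that predicts everything as class 0 on 9990 class-0 / 10 class-1 examples
print(accuracy(tp=0, fn=10, fp=0, tn=9990))   # 0.999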

Page 172: [系列活動] 資料探勘速遊

Limitation of Accuracy

▷Consider a 2-class problem

• Number of Class 0 examples = 9990

• Number of Class 1 examples = 10

▷ If model predicts everything to be class 0, accuracy is

9990/10000 = 99.9 %

▷Accuracy is misleading because model does not detect any

class 1 example

172© Tan, Steinbach, Kumar Introduction to Data Mining

Page 173: [系列活動] 資料探勘速遊

Cost-Sensitive Measures

173© Tan, Steinbach, Kumar Introduction to Data Mining

$\text{Precision } (p) = \dfrac{a}{a + c}$

$\text{Recall } (r) = \dfrac{a}{a + b}$

$\text{F-measure } (F) = \dfrac{2rp}{r + p} = \dfrac{2a}{2a + b + c}$

Precision is biased towards C(Yes|Yes) & C(Yes|No)

Recall is biased towards C(Yes|Yes) & C(No|Yes)

F-measure is biased towards all except C(No|No)

$\text{Weighted Accuracy} = \dfrac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$

Page 174: [系列活動] 資料探勘速遊

Test of Significance

▷Given two models:

• Model M1: accuracy = 85%, tested on 30 instances

• Model M2: accuracy = 75%, tested on 5000 instances

▷Can we say M1 is better than M2?

• How much confidence can we place on accuracy of M1 and M2?

• Can the difference in performance measure be explained as a

result of random fluctuations in the test set?

174© Tan, Steinbach, Kumar Introduction to Data Mining

Page 175: [系列活動] 資料探勘速遊

Confidence Interval for Accuracy

▷Prediction can be regarded as a Bernoulli trial

• A Bernoulli trial has 2 possible outcomes

→ Possible outcomes for prediction: correct or wrong

• Collection of Bernoulli trials has a Binomial distribution:

→ x ≈ Bin(N, p) x: number of correct predictions

→ e.g: Toss a fair coin 50 times, how many heads would turn up?

Expected number of heads = N × p = 50 × 0.5 = 25

▷Given x (# of correct predictions) or equivalently, accuracy

(ac)=x/N, and N (# of test instances)

Can we predict p (true accuracy of model)?

175© Tan, Steinbach, Kumar Introduction to Data Mining

Page 176: [系列活動] 資料探勘速遊

Confidence Interval for Accuracy

▷For large test sets (N > 30),

• ac has a normal distribution

with mean p and variance

p(1-p)/N

• Confidence Interval for p:

176© Tan, Steinbach, Kumar Introduction to Data Mining

(Figure: standard normal curve with area $1 - \alpha$ between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$)

$P\left(Z_{\alpha/2} < \dfrac{a_c - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$

$p = \dfrac{2 N a_c + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4 N a_c - 4 N a_c^2}}{2\left(N + Z_{\alpha/2}^2\right)}$

Page 177: [系列活動] 資料探勘速遊

Example :Comparing Performance of 2 Models

▷Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25

• d = |e2 – e1| = 0.1 (2-sided test)

▷At 95% confidence level, $Z_{\alpha/2} = 1.96$

177
© Tan, Steinbach, Kumar Introduction to Data Mining

$\hat{\sigma}_d^2 = \dfrac{0.15(1-0.15)}{30} + \dfrac{0.25(1-0.25)}{5000} = 0.0043$

$d_t = 0.100 \pm 1.96 \times \sqrt{0.0043} = 0.100 \pm 0.128$

Interval contains 0 :

difference may not be statistically significant
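A quick recomputation of this example in plain Python (1.96 is the two-sided z value at 95% confidence):

import math

n1, e1 = 30, 0.15
n2, e2 = 5000, 0.25
d = abs(e2 - e1)

var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2    # approx. 0.0043
z = 1.96                                            # 95% confidence, two-sided
half_width = z * math.sqrt(var_d)                   # approx. 0.128

print(f"d = {d:.3f} +/- {half_width:.3f}")          # the interval contains 0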

Page 178: [系列活動] 資料探勘速遊

For Descriptive Tasks

178

Page 179: [系列活動] 資料探勘速遊

Computing Interestingness Measure

▷Given a rule X Y, information needed to compute rule

interestingness can be obtained from a contingency table

179© Tan, Steinbach, Kumar Introduction to Data Mining

        Y       ¬Y
X       f11     f10     f1+
¬X      f01     f00     fo+
        f+1     f+0     |T|

Contingency table for X → Y

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

Used to define various measures

support, confidence, lift, Gini,

J-measure, etc.

Page 180: [系列活動] 資料探勘速遊

Drawback of Confidence

180© Tan, Steinbach, Kumar Introduction to Data Mining

            Coffee   ¬Coffee
Tea            15         5      20
¬Tea           75         5      80
               90        10     100

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 0.75

but P(Coffee) = 0.9

Although confidence is high, the rule is misleading:

P(Coffee | ¬Tea) = 0.9375

Page 181: [系列活動] 資料探勘速遊

Statistical Independence

▷Population of 1000 students

• 600 students know how to swim (S)

• 700 students know how to bike (B)

• 420 students know how to swim and bike (S,B)

• P(S∧B) = 420/1000 = 0.42

• P(S) × P(B) = 0.6 × 0.7 = 0.42

• P(S∧B) = P(S) × P(B) => Statistical independence

• P(S∧B) > P(S) × P(B) => Positively correlated

• P(S∧B) < P(S) × P(B) => Negatively correlated

181© Tan, Steinbach, Kumar Introduction to Data Mining

Page 182: [系列活動] 資料探勘速遊

Statistical-based Measures

▷Measures that take into account statistical dependence

182© Tan, Steinbach, Kumar Introduction to Data Mining

$\text{Lift} = \dfrac{P(Y \mid X)}{P(Y)}$

$\text{Interest} = \dfrac{P(X, Y)}{P(X)\, P(Y)}$

$PS = P(X, Y) - P(X)\, P(Y)$

$\phi\text{-coefficient} = \dfrac{P(X, Y) - P(X)\, P(Y)}{\sqrt{P(X)[1 - P(X)]\, P(Y)[1 - P(Y)]}}$

Page 183: [系列活動] 資料探勘速遊

Example: Interest Factor

183© Tan, Steinbach, Kumar Introduction to Data Mining

            Coffee   ¬Coffee
Tea            15         5      20
¬Tea           75         5      80
               90        10     100

Association Rule: Tea → Coffee

P(Coffee, Tea) = 0.15

P(Coffee) = 0.9, P(Tea) = 0.2

Interest = 0.15 / (0.9 × 0.2) = 0.83 (< 1, therefore Tea and Coffee are negatively associated)
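A short plain-Python sketch of the confidence and interest computation for the tea and coffee table above:

n = 100
tea_and_coffee = 15
tea = 20
coffee = 90

p_tea_coffee = tea_and_coffee / n        # P(Coffee, Tea) = 0.15
p_tea = tea / n                          # P(Tea) = 0.2
p_coffee = coffee / n                    # P(Coffee) = 0.9

confidence = p_tea_coffee / p_tea                    # 0.75
interest = p_tea_coffee / (p_tea * p_coffee)         # 0.83 < 1 -> negatively associated
print(confidence, round(interest, 2))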

Page 184: [系列活動] 資料探勘速遊

Subjective Interestingness Measure

▷Objective measure:

• Rank patterns based on statistics computed from data

• e.g., 21 measures of association (support, confidence,

Laplace, Gini, mutual information, Jaccard, etc).

▷Subjective measure:

• Rank patterns according to user’s interpretation

→ A pattern is subjectively interesting if it contradicts the

expectation of a user (Silberschatz & Tuzhilin)

→ A pattern is subjectively interesting if it is actionable

(Silberschatz & Tuzhilin)

184© Tan, Steinbach, Kumar Introduction to Data Mining

Page 185: [系列活動] 資料探勘速遊

Interestingness via Unexpectedness

185© Tan, Steinbach, Kumar Introduction to Data Mining

Need to model expectation of users (domain knowledge)

Need to combine expectation of users with evidence from data (i.e., extracted patterns)

(Figure: a 2×2 grid crossing "pattern expected to be frequent / infrequent" (+ / −) with "pattern found to be frequent / infrequent"; cells where expectation and evidence agree are expected patterns, the mismatching cells are unexpected patterns)

Page 186: [系列活動] 資料探勘速遊

Different Propose Measures

186© Tan, Steinbach, Kumar Introduction to Data Mining

Some measures are

good for certain

applications, but not for

others

What criteria should we

use to determine

whether a measure is

good or bad?

Page 187: [系列活動] 資料探勘速遊

Comparing Different Measures

187© Tan, Steinbach, Kumar Introduction to Data Mining

Example f11 f10 f01 f00

E1 8123 83 424 1370

E2 8330 2 622 1046

E3 9481 94 127 298

E4 3954 3080 5 2961

E5 2886 1363 1320 4431

E6 1500 2000 500 6000

E7 4000 2000 1000 3000

E8 4000 2000 2000 2000

E9 1720 7121 5 1154

E10 61 2483 4 7452

10 examples of

contingency tables:

Rankings of contingency tables

using various measures:

Page 188: [系列活動] 資料探勘速遊

Property under Variable Permutation

Does M(A,B) = M(B,A)?

Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc

Asymmetric measures:

confidence, conviction, Laplace, J-measure, etc

188© Tan, Steinbach, Kumar Introduction to Data Mining

        B     ¬B
A       p      q
¬A      r      s

        A     ¬A
B       p      r
¬B      q      s

Page 189: [系列活動] 資料探勘速遊

Cluster Validity

▷ For supervised classification we have a variety of measures to

evaluate how good our model is

• Accuracy, precision, recall

▷ For cluster analysis, the analogous question is how to evaluate

the “goodness” of the resulting clusters?

▷But “clusters are in the eye of the beholder”!

▷ Then why do we want to evaluate them?

• To avoid finding patterns in noise

• To compare clustering algorithms

• To compare two sets of clusters

• To compare two clusters

189© Tan, Steinbach, Kumar Introduction to Data Mining

Page 190: [系列活動] 資料探勘速遊

Clusters Found in Random Data

190© Tan, Steinbach, Kumar Introduction to Data Mining

(Figures: random points in the unit square, and the "clusters" found in them by K-means, complete-link hierarchical clustering, and DBSCAN)

Page 191: [系列活動] 資料探勘速遊

Measures of Cluster Validity

▷Numerical measures that are applied to judge various aspects

of cluster validity, are classified into the following three types

• External Index: Used to measure the extent to which cluster

labels match externally supplied class labels, e.g., Entropy

• Internal Index: Used to measure the goodness of a clustering

structure without respect to external information, e.g., Sum of

Squared Error (SSE)

• Relative Index: Used to compare two different clusters

▷Sometimes these are referred to as criteria instead of indices

• However, sometimes criterion is the general strategy and index

is the numerical measure that implements the criterion.

191© Tan, Steinbach, Kumar Introduction to Data Mining

Page 192: [系列活動] 資料探勘速遊

Measuring Cluster Validity Via Correlation

▷ Two matrices • Proximity Matrix

• “Incidence” Matrix→ One row and one column for each data point

→ An entry is 1 if the associated pair of points belong to the same

cluster

→ An entry is 0 if the associated pair of points belongs to different

clusters

▷ Compute the correlation between the two matrices• Since the matrices are symmetric, only the correlation between

n(n-1) / 2 entries needs to be calculated.

▷ High correlation indicates that points that belong to the same

cluster are close to each other.

▷ Not a good measure for some density or contiguity based

clusters.

192© Tan, Steinbach, Kumar Introduction to Data Mining

Page 193: [系列活動] 資料探勘速遊

Measuring Cluster Validity Via Correlation

▷Correlation of incidence and proximity matrices for

the K-means clustering of the following two data sets

193© Tan, Steinbach, Kumar Introduction to Data Mining

(Figures: K-means clusterings of two data sets; the correlations between the incidence and proximity matrices are Corr = -0.9235 and Corr = -0.5810)

Page 194: [系列活動] 資料探勘速遊

Internal Measures: SSE

▷Clusters in more complicated figures aren’t well separated

▷ Internal Index: Used to measure the goodness of a clustering

structure without respect to external information

• Sum of Squared Error (SSE)

▷SSE is good for comparing two clusters

▷Can also be used to estimate the number of clusters

194© Tan, Steinbach, Kumar Introduction to Data Mining

(Figures: a clustered data set and the corresponding SSE-versus-K curve, which can be used to estimate the number of clusters)

Page 195: [系列活動] 資料探勘速遊

Internal Measures: Cohesion and Separation

▷Cluster Cohesion: Measures how closely related are objects in

a cluster• Cohesion is measured by the within cluster sum of squares (SSE)

▷Cluster Separation: Measure how distinct or well-separated a

cluster is from other clusters• Separation is measured by the between cluster sum of squares

• Where |Ci| is the size of cluster i

195© Tan, Steinbach, Kumar Introduction to Data Mining

$WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$

$BSS = \sum_i |C_i|\,(m - m_i)^2$

Page 196: [系列活動] 資料探勘速遊

Final Comment on Cluster Validity

“The validation of clustering structures is the most difficult and

frustrating part of cluster analysis.

Without a strong effort in this direction, cluster analysis will

remain a black art accessible only to those true believers who

have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes

196© Tan, Steinbach, Kumar Introduction to Data Mining

Page 197: [系列活動] 資料探勘速遊

Case Studies

Yi-Shin Chen

Institute of Information Systems and Applications

Department of Computer Science

National Tsing Hua University

[email protected]

Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation

Page 198: [系列活動] 資料探勘速遊

Case: Mining Reddit Data

Please check the data set during the breaks

198

Page 199: [系列活動] 資料探勘速遊

Reddit Data: https://drive.google.com/open?id=0BwpI8947eCyuRFVDLU4tT25JbFE

199

Page 200: [系列活動] 資料探勘速遊

Reddit: The Front Page of the

Internet

50k+ on

this set

Page 201: [系列活動] 資料探勘速遊

Subreddit Categories

▷Reddit’s structure may already provide a

baseline similarity

Page 202: [系列活動] 資料探勘速遊

Provided Data

Page 203: [系列活動] 資料探勘速遊

Recover Structure