04 association

Instructor: ดร.หทยรตน เกตมณชยรตน

Department of Production and Information Technology Management

Chapter 4: Discovering Association Rules in Data

(Association)

336331 Data Mining

WHAT IS ASSOCIATION MINING? Finding relationships between records, or groups of records, in a database

Association Rule Mining

Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases

Objective

Search for interesting relationships among items in a given data set

Association Rule

Antecedent → Consequent

Example:

{Diaper} → {Beer}

ASSOCIATION MINING

Purpose

Discover rules and analyze the relationships among the items in a set of products

Most often used for market analysis, where it is called

"Market Basket Analysis"

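MARKET / BASKET ANALYSIS

In retail business

Customers are the people who buy goods or services from a company, and different customers may show different purchasing behavior

Managers want to analyze purchase and sales data to see which products are bought together and which are not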


Through this analysis, managers want to learn

Purchasing behavior that is not specific to any single customer

Which goods or products are bought and sold together

Which purchasing patterns differ from those of other products

How to plan business operations using the knowledge gained

Applications for increasing sales

Store layout

Discounts, trade-ins, giveaways, free gifts, and other promotions

Catalog design

Product pricing and promotion

Cross-marketing

HOW ARE ASSOCIATION RULES MINED?

Association rule mining is a two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count

2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence

ASSOCIATION MINING RULE

Market basket analysis uses a technique called association rule mining (Association Mining Rule),

which focuses on Point-of-Sale (POS) data. The data to be analyzed is in the form of transactions.

EXAMPLE TRANSACTION DATA

Purchase ID (one per receipt)

Customer attributes (if available)

Quantity of each item purchased

Category of each item sold

Amount paid

DEFINITION OF ASSOCIATION MINING

Association mining derives association rules by finding patterns that frequently occur together, called frequent patterns, and the relationships among groups of items, called associations, in data stored as transactions.

The result is expressed as association rules,

each annotated with its Support and Confidence percentages:

item1 → item2 [support, confidence]

TYPES OF ASSOCIATION RULES

"Single-dimensional association rules": computer → software

[support = 1%, confidence = 50%]

"Multidimensional association rules": Computer, Book → software

[support = 5%, confidence = 80%] or

Age(X, "20..29") ∧ income(X, "20K..29K") → buys(X, "CD player")

[support = 2%, confidence = 60%]

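EXAMPLE OF WRITING AN ASSOCIATION RULE (1)

We want to write a rule that expresses the relationship between items. Suppose a company offers vehicle inspection, car wash and lubrication, car accessory sales, and so on. Association rule mining yields the following rule.

30% of all customers both use the vehicle inspection service and buy car accessories (this is the support):

vehicle inspection → buy car accessories

[30%, confidence]

Of the customers who use the vehicle inspection service, 50% also buy car accessories (this is the confidence):

vehicle inspection → buy car accessories

[30%, 50%]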


We want to write a rule that expresses the relationship between items purchased at a store. Association rule mining yields the following rule.

The store finds that 80% of customers buy both beer (Beer) and diapers (Pamper) (this is the support):

buy beer (Beer) → buy diapers (Pamper)

[80%, confidence]

Of the customers who buy beer, 50% also buy diapers (this is the confidence):

buy beer (Beer) → buy diapers (Pamper)

[80%, 50%]


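We want to write a rule that expresses the relationship between items purchased at a store. Association rule mining yields the following rule.

The store finds that 2% of customers are aged 20 to 29, have an income of about 20,000 to 29,000, and buy a CD player (this is the support):

Age(X, "20..29") ∧ income(X, "20K..29K") → buys(X, "CD player")

[2%, confidence]

Of the customers aged 20 to 29 with an income of about 20,000 to 29,000, 60% also buy a CD player (this is the confidence):

Age(X, "20..29") ∧ income(X, "20K..29K") → buys(X, "CD player")

[2%, 60%]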

Finding association rules has 2 main steps:

1) Find all frequent itemsets

That is, first find every set of items that occurs frequently. Each itemset must reach at least the specified minimum support.

2) Generate strong association rules from the frequent itemsets

That is, find the strong association rules. Each rule must reach at least the specified minimum support, called Min_sup, and at least the specified minimum confidence, called Min_Conf.

BASIC CONCEPTS FOR FINDING ASSOCIATION RULES

An itemset is a set of items

Let I = {i1, …, ik}

An association rule X → Y, where X ⊂ I, Y ⊂ I

Find all the rules X → Y that satisfy minimum support and minimum confidence; these are called strong association rules

support, s: the probability that a transaction contains X ∪ Y

confidence, c: the conditional probability that a transaction containing X also contains Y

support(X → Y) = P(X ∪ Y)

confidence(X → Y) = P(Y|X) = P(X ∪ Y) / P(X)

For example, with Min_Sup = 50% and Min_Conf = 60%:

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

1. Frequent itemsets

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

2. Rules generated from {A, C}

A → C: support = 50%, confidence = P(A ∪ C) / P(A) = (1/2) / (3/4) = (1/2) × (4/3) = 66.67%

C → A: support = 50%, confidence = P(C ∪ A) / P(C) = (1/2) / (1/2) = 100%

C → A is an exact rule
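The support and confidence figures in this example can be re-derived with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `support` and `confidence` are chosen here for illustration, and exact fractions are used so the results can be compared directly with the hand calculation.

```python
from fractions import Fraction

# The four transactions from the example (transaction id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return Fraction(hits, len(db))

def confidence(antecedent, consequent, db):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)

print(support({"A", "C"}, transactions))       # 1/2
print(confidence({"A"}, {"C"}, transactions))  # 2/3
print(confidence({"C"}, {"A"}, transactions))  # 1
```

Both rules reach Min_Sup = 50% and Min_Conf = 60%, which matches the conclusion of the worked example.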

FINDING ASSOCIATION RULES WITH THE APRIORI METHOD

Steps of the Apriori method

Input: a database of transactions; the specified min_sup and min_conf values

Output: the frequent itemsets found in the database

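THE APRIORI METHOD

Step 1: Find all frequent itemsets

That is, first find every set of items that occurs frequently

[Figure: lattice of itemsets built from the items a, b, c, with levels labeled 1-itemset (a, b, c), 2-itemset (ab, bc), 3-itemset (abc), …, k-itemset]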

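THE APRIORI METHOD

Step 1: Find all frequent itemsets

That is, first find every set of items that occurs frequently. If an itemset fails min_sup, then every larger itemset formed by combining it with other items also fails min_sup.

[Figure: the same lattice over a, b, c. With a min_sup threshold of 50%, {c} has a support of 30% and is not frequent, so the itemsets containing c (bc, abc) cannot be frequent either]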
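The pruning property can be checked by brute force on a small data set. The sketch below is an illustration rather than the slide's own example: it reuses the four-transaction table from the support and confidence example earlier in this chapter (A, B, C / A, C / A, D / B, E, F), where item D has a support of 25%, and confirms that no itemset containing D reaches a 50% threshold.

```python
from itertools import combinations

# Transactions from the earlier support/confidence example in this chapter.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]
MIN_SUP = 0.5

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))
infrequent_item = "D"
print(support({infrequent_item}))  # 0.25, below MIN_SUP

# Collect every larger itemset that contains D and still reaches MIN_SUP.
violations = [
    combo
    for k in range(2, len(items) + 1)
    for combo in combinations(items, k)
    if infrequent_item in combo and support(combo) >= MIN_SUP
]
print(violations)  # []
```

Because the list of violations is empty, Apriori never needs to count any itemset that contains D, which is exactly the saving the pruning step provides.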

EXAMPLE: THE APRIORI METHOD

Database TDB

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1

Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1 (the tables keep only itemsets with sup of at least 2)

Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2 with support counts

Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2

Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3

Itemset
{B, C, E}

3rd scan: L3

Itemset     sup
{B, C, E}   2
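The scan, join, and prune cycle traced in this example can be sketched in Python as follows. This is a compact illustration under stated assumptions, not a reference implementation: the function name `apriori` and its `min_count` parameter are chosen here, and the minimum support count of 2 is inferred from the tables, which drop every itemset with sup 1.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    def count(candidates):
        # One database scan: count the transactions containing each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # C1 -> L1: count single items and keep the frequent ones.
    items = sorted(set().union(*transactions))
    level = {c: n for c, n in count([frozenset([i]) for i in items]).items()
             if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: union pairs of frequent (k-1)-itemsets that give a k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# Database TDB from the example.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On TDB this yields the four itemsets of L1, the four of L2, and {B, C, E} with count 2. The candidates {A, B, C} and {A, C, E} are removed in the prune step without a scan because {A, B} and {A, E} are not in L2.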

THE APRIORI METHOD

Step 2: Generate strong association rules from the frequent itemsets

Strong association rules are rules that satisfy both

- minimum support

- minimum confidence

$$\mathrm{confidence}(A \rightarrow B) = P(B \mid A) = \frac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)}$$

EXAMPLE: STRONG ASSOCIATION RULES

• Suppose the data contain the frequent itemset l = {I1, I2, I5}. What association rules can be generated from l?

• The nonempty proper subsets of l are

{I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}

• List the confidence of each rule, using the transactions below:

TID    List of item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Rule             Confidence
I1 ∧ I2 → I5     2/4 = 50%
I1 ∧ I5 → I2     2/2 = 100%
I2 ∧ I5 → I1     2/2 = 100%
I1 → I2 ∧ I5     2/6 = 33%
I2 → I1 ∧ I5     2/7 = 29%
I5 → I1 ∧ I2     2/2 = 100%

• With a Min_conf threshold of 70%, only the rules whose confidence exceeds the threshold are strong: I1 ∧ I5 → I2, I2 ∧ I5 → I1, and I5 → I1 ∧ I2
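The rule generation step for l = {I1, I2, I5} can be reproduced with a short Python sketch. The helper names `support_count` and `rules_from` are hypothetical, chosen for this illustration; the code enumerates every nonempty proper subset of the itemset as an antecedent and keeps the rules that reach the confidence threshold.

```python
from itertools import combinations

# The nine transactions from the example.
transactions = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)

def rules_from(itemset, min_conf):
    """Return (antecedent, consequent, confidence) for each strong rule.

    Assumes `itemset` occurs in at least one transaction, as a frequent itemset does.
    """
    itemset = frozenset(itemset)
    total = support_count(itemset)
    strong = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            conf = total / support_count(antecedent)
            if conf >= min_conf:
                strong.append((sorted(antecedent), sorted(itemset - antecedent), conf))
    return strong

rules = rules_from({"I1", "I2", "I5"}, min_conf=0.70)
for lhs, rhs, conf in rules:
    print(lhs, "->", rhs, f"{conf:.0%}")
```

Three rules survive the 70% threshold, each with a confidence of 100%; the other three candidates (50%, 33%, 29%) are discarded.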

CORRELATION ANALYSIS

Correlation analysis (Correlation Analysis, or Lift) refers to a correlation value that indicates whether the relationship between items is interesting.

Why is correlation analysis needed?

Support and Confidence are very useful in many applications,

but be aware that they can make some rules misleading.

Definition

A measure of whether the relationship between two itemsets is strong:

$$\mathrm{corr}_{A,B} = \frac{P(A \cup B)}{P(A)\,P(B)} = \frac{P(B \mid A)}{P(B)}$$

where P(B|A) = P(A ∪ B)/P(A)

Interpreting the result of correlation analysis:

If the value is less than 1, the occurrence of item A does not actually promote item B

If the value is greater than 1, the occurrence of item A does promote item B

If the value equals 1, the occurrence of item A has no relationship with item B; the two are independent


play basketball → eat cereal [40%, 66.7%] is a misleading rule

Sup = 2000/5000 = 40%, conf = 2000/3000 = 66.7%

The overall percentage of students who eat cereal is 75% (3750/5000), which is higher than 66.7%.

play basketball → not eat cereal [20%, 33.3%] is arguably the more accurate rule, even though its Support and confidence are lower than those of the previous rule

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000


play basketball → eat cereal [40%, 66.7%]

$$\mathrm{corr}_{\text{play basketball},\ \text{eat cereal}} = \frac{0.4}{0.6 \times 0.75} \approx 0.89$$

play basketball → not eat cereal [20%, 33.3%]

$$\mathrm{corr}_{\text{play basketball},\ \text{not eat cereal}} = \frac{0.2}{0.6 \times 0.25} \approx 1.33$$
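The two correlation values for the basketball and cereal example can be recomputed directly from the counts in the contingency table. The sketch below assumes a helper named `lift` (a name chosen here, not one used in the slides) that takes raw counts rather than probabilities.

```python
def lift(n_both, n_a, n_b, n_total):
    """corr(A, B) = P(A and B together) / (P(A) * P(B)), computed from raw counts."""
    p_both = n_both / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return p_both / (p_a * p_b)

# Counts from the contingency table: 5000 students, 3000 play basketball,
# 3750 eat cereal, 1250 do not, 2000 do both, 1000 play but do not eat cereal.
cereal = lift(n_both=2000, n_a=3000, n_b=3750, n_total=5000)
no_cereal = lift(n_both=1000, n_a=3000, n_b=1250, n_total=5000)

print(round(cereal, 2))     # 0.89
print(round(no_cereal, 2))  # 1.33
```

A value below 1 for the first rule and above 1 for the second shows why the high-confidence rule is the misleading one.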

Results of association rule mining

Useful and applicable in practice

Example:

Buy {Diaper} → buy {Beer} (buy on Friday)

Not trivial or already known

Example: customers who buy the 3-way calling promotion also buy call waiting. Because these are already bought together, rules of this kind do not need to be reported.

BuyPro {3 way calling} → BuyPro {call waiting}

The cause and effect behind the rule is understood

Example: toilet rings sell well on Christmas Day. If the reason is not known, the rule cannot be put to use.

Sell {toilet ring} → Date {Christmas Day}

PREDICT

                         Predicted Label
                         Positive              Negative
Known Label   Positive   True Positive (TP)    False Negative (FN)
              Negative   False Positive (FP)   True Negative (TN)

For simplicity, the assumption is that each instance can only be assigned one of two classes: Positive or Negative (e.g. a patient's tumor may be malignant or benign). Each instance (e.g. a patient) has a Known label, and a Predicted label. Some method is used (e.g. cross-validation) to make predictions on each instance. Each instance then increments one cell in the confusion matrix.

PREDICT

Measure                Formula                           Intuitive Meaning
Precision              TP / (TP + FP)                    The percentage of positive predictions that are correct.
Recall / Sensitivity   TP / (TP + FN)                    The percentage of positive labeled instances that were predicted as positive.
Specificity            TN / (TN + FP)                    The percentage of negative labeled instances that were predicted as negative.
Accuracy               (TP + TN) / (TP + TN + FP + FN)   The percentage of predictions that are correct.


PREDICT

This seems to suggest that, without any knowledge of the distribution of data, the best measures to use are Recall (Sensitivity) and Specificity to allow one to find problems with classifiers. However, many other cases can arise other than these four boundary cases. Consider the following confusion matrix for a data set with 600 out of 11,100 instances positive:

                         Predicted Label
                         Positive   Negative
Known Label   Positive   500        100
              Negative   500        10,000

PRECISION

TP / (TP + FP) = 500 / (500 + 500) = ½ = 0.5 = 50%

RECALL / SENSITIVITY

TP / (TP + FN) = 500 / (500 + 100) = 5/6 = 0.83 = 83%

SPECIFICITY

TN / (TN + FP) = 10000 /(10000 + 500) = 0.95 = 95%

ACCURACY

(TP + TN) / (TP + TN + FP + FN) = (500 + 10000) / (500+10000+500+100) = 0.95 = 95%
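The four measures can be computed together from the cells of the confusion matrix. The following Python sketch uses a hypothetical helper, `classification_metrics`, written for this illustration; it applies the formulas from the measure table to the 500 / 100 / 500 / 10,000 example.

```python
def classification_metrics(tp, fn, fp, tn):
    """Precision, recall, specificity, and accuracy from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Counts from the example: 600 of 11,100 instances are positive.
metrics = classification_metrics(tp=500, fn=100, fp=500, tn=10_000)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With one decimal place the values are 50.0%, 83.3%, 95.2%, and 94.6%, which round to the 50%, 83%, 95%, and 95% quoted in the slides.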

                         Predicted Label
                         Positive   Negative
Known Label   Positive   500 (TP)   100 (FN)
              Negative   500 (FP)   10,000 (TN)

In this case, precision = 50%, recall = 83%, specificity = 95%, and accuracy = 95%. Precision is low, which means the classifier is predicting positives poorly. However, the three other measures seem to suggest that this is a good classifier. This shows that the problem domain has a major impact on the measures that should be used to evaluate a classifier within it, and that looking at the 4 simple cases presented is not sufficient.

Chapter 4 Exercises

What is association mining?

What is the objective of association mining?

How many types of association rules are there?

How many steps does association mining have?

Chapter 4 Exercises

Consider the database in the figure below and assume the minimum support is 3 transactions.

Chapter 4 Exercises

From the table below, calculate Recall, Specificity, Precision, and Accuracy.

LAB 4

Apriori works with categorical values only. Therefore, if a dataset contains numeric attributes, they must be converted to nominal attributes before the Apriori algorithm is applied, so data preprocessing must be performed first. Repeat LAB 3 (Data Preprocessing) if you do not know how to convert numeric attributes to nominal ones.

weather.nominal.arff

bank-data.arff

market‐basket.arff
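To illustrate the numeric-to-nominal conversion the lab refers to, here is a minimal sketch of equal-width binning in plain Python. It is an assumption-laden illustration, not the procedure from LAB 3: the function name `to_nominal`, the three labels, and the sample `ages` values are all hypothetical, and the tool used in the lab may bin values differently.

```python
def to_nominal(values, labels=("low", "medium", "high")):
    """Equal-width binning: map each number to one of `labels`."""
    lo, hi = min(values), max(values)
    # Fall back to width 1 when every value is identical (width would be 0).
    width = (hi - lo) / len(labels) or 1
    nominal = []
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        index = min(int((v - lo) / width), len(labels) - 1)
        nominal.append(labels[index])
    return nominal

ages = [18, 25, 33, 41, 52, 67]  # hypothetical numeric attribute
print(to_nominal(ages))  # ['low', 'low', 'low', 'medium', 'high', 'high']
```

Once every numeric attribute has been replaced by labels like these, each record can be treated as a set of attribute=value items, which is the form Apriori expects.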
