hash-based algorithm for mining association rules

21
1 Hash-Based Algorithm for Mining Association Rules

Upload: lilah-melendez

Post on 01-Jan-2016

71 views

Category:

Documents


0 download

DESCRIPTION

Hash-Based Algorithm for Mining Association Rules. Data Mining. Mining Association Rules. Mining Association Rules. Mining Association Rules Support Obtain Large Itemset Confidence Generate Association Rules. Apriori - رويكرد مبتني بر - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Hash-Based Algorithm for Mining Association Rules

1

Hash-Based Algorithm for Mining Association Rules

Page 2: Hash-Based Algorithm for Mining Association Rules

2

Data Mining Mining Association Rules

Page 3: Hash-Based Algorithm for Mining Association Rules

3

Mining Association Rules

Mining Association Rules Support

Obtain Large Itemset Confidence

Generate Association Rules

Page 4: Hash-Based Algorithm for Mining Association Rules

Apriori -رويكرد مبتني بر در Apriori ابتدا در ميان مجموعه ساختار هاي داده شده به دنبال زيرساختارهاي متناوبي با اندازه

رويكرد مبتني بر كوچك مي گرديم. پس از آن در هر مرحله با يك نود به يك زير ساختار متناوب، زير ساختار جديدي

.ايجاد مي شود براي افزودن نودها به يك زير ساختار متناوب، تنها نودهايي مورد استفاده قرار م يگيرند كه در

مرحله اول به عنوان نود متناوب شناخته شده باشند. با ايجاد زير ساختار جديد، مجموعه ساختارها براي مشخص شدن

تناوب يا عدم .تناوب زيرساختار جديد مورد پويش قرار م يگيرد

4

Page 5: Hash-Based Algorithm for Mining Association Rules

5

TID Items

100

A C D

200

B C E

300

A B C E

400

B E

D

ScanD

Itemset

Sup.

{A} 2

{B} 3

{C} 3

{D} 1

{E} 3

C1Itemset Sup.

{A} 2

{B} 3

{C} 3

{E} 3

L1

Itemset

{A B}

{A C}

{A E}

{B C}

{B E}

{C E}

ScanD

Itemset

Sup.

{A B} 1

{A C} 2

{A E} 1

{B C} 2

{B E} 3

{C E} 2

Itemset

Sup.

{A C} 2

{B C} 2

{B E} 3

{C E} 2

C2 C2 L2

Itemset

{B C E}

ScanD

Itemset Sup.

{B C E} 2

Itemset Sup.

{B C E} 2

C3 C3 L3

Apriori

Sup=2

Page 6: Hash-Based Algorithm for Mining Association Rules

6

Apriori Cont. Disadvantages

Inefficient Produce much more useless

candidates

Page 7: Hash-Based Algorithm for Mining Association Rules

7

DHP Prune useless candidates in advance Reduce database size at each iteration

Direct Hashing with EfficientPruning for Fast Data Mining

DHP

Page 8: Hash-Based Algorithm for Mining Association Rules

8

C1 Count

{A}

2

{B}

3

{C}

3

{D}

1

{E}

3

L1

{A}

{B}

{C}

{E}

Min sup=2

Making a hash table

100

{A C}

200

{B C},{B E},{C E}

300

{A B},{A C},{A E},{B C},{B E},{C E}

400

{B E}

H{[x y]}=((order of x )*10+(order of y)) mod 7;

{B E}

{C E}

{B C}

{B E}

{A C}

{C E}

{B C}

{B E}

{A B}

{A C}

2 0 2 0 3 1 2

0 1 2 3 4 5 6

1 0 1 0 1 0 1

Hash table H2

Hash address

The number of items hashed to bucket 0

Bit vector

TID Items

100

A C D

200

B C E

300

A B C E

400

B E

D

Page 9: Hash-Based Algorithm for Mining Association Rules

9

Perfect Hashing Schemes (PHS) for Mining Association Rules

Page 10: Hash-Based Algorithm for Mining Association Rules

10

Motivation Apriori and DHP produce Ci from Li-

1 that may be the bottleneck

Collisions in DHP

Designing a perfect hashing function for every transaction databases is a thorny problem

Page 11: Hash-Based Algorithm for Mining Association Rules

11

Definition Definition. A Join operation is to join two

different (k-1)-itemsets, , respectively, to produces a k-itemset, where

= p1p2…pk-1

= q1q2…qk-1 and p2=q1, p3=q2,…,pk-2=qk-3, pk-1=qk-2.

Example: ABC, BCD 3-itemsets of ABCD: ABC, ABD, ACD, BCD only one pair that satisfies the join definition

11kS

21kS

Page 12: Hash-Based Algorithm for Mining Association Rules

12

Algorithm PHS (Perfect Hashing and Data

Shrinking)

Page 13: Hash-Based Algorithm for Mining Association Rules

13

Example1 (sup=2)

TID Items

100 ACD

200 BCE

300 BCDE

400 BE

TID Items

100 (CD)

200 (BC) (BE)(CE)

300 (BC)(BD)(BE)(CD)(CE)(DE)

400 (BE)

Itemsets (BC)

(BD)

(BE)

(CD)

(CE)

(DE)

Support 2 1 3 2 2 1

Encoding A B C D

Original (BC) (BE) (CD) (CE)

Itemset

Sup.

{B} 3

{C} 3

{D} 2

{E} 3

L1

2 2( ) ( ) ( ( ) ( )) 1n n-index(X)hash X,Y C C index Y index X

Page 14: Hash-Based Algorithm for Mining Association Rules

14

Example2 (sup=2)

TID Items

100 Null

200 (AD)

300 (AC)(AD)

400 Null

Itemsets (AB)

(AC)

(AD)

(BC)

(BD)

(CD)

Support 0 1 2 0 0 0

Encoding A

Original (AD)

Decode -> (BC)(CE) = BCE

2 2( ) ( ) ( ( ) ( )) 1n n-index(X)hash X,Y C C index Y index X

Page 15: Hash-Based Algorithm for Mining Association Rules

15

Problem on Hash Table Consider a database contains p transactions,

which are comprised of unique items and are of equal length N, and the minimum support of 1.

Loading density :2( )

, ( 1)( 1)

N kpm N

pm k

Page 16: Hash-Based Algorithm for Mining Association Rules

16

How to Improve the Loading Density

Two level perfect hash scheme (parital hash)

  A B C

Hash Table C D Null Null

Count 1 2    

Itemsets (AB)

(AC)

(AD)

(BC)

(BD)

(CD)

Support 0 1 2 0 0 0

Page 17: Hash-Based Algorithm for Mining Association Rules

17

Page 18: Hash-Based Algorithm for Mining Association Rules

18

Experiments

T5I4D200K

0

20

40

60

80

100

1.5 1.25 1 0.75 0.5 0.25

Minimum Support (%)

Tim

e (

sec)

PHS DHP Apriori

T20I4D100K

0500

1000150020002500

1.25 1 0.75 0.5 0.25

Minimum Support (%)Tim

e (

sec)

PHS DHP MPHP

Page 19: Hash-Based Algorithm for Mining Association Rules

19

Experiments

˹

˺ ˹ ˹

˻ ˹ ˹

˼ ˹ ˹

~̊˹ ˹

�̊˹ ˹

˻ ˹ ˹ K ~̊˹ ˹ K �̊˹ ˹ K �̊˹ ˹ K ˺ ˹ ˹ ˹ K

Tim

e (s

ec)

Number of Transactions

Increasing Number of Transactions

T̊�I̊~(PHS) T˺ ˹ I̊�(PHS)

T̊�I̊~(DHP) T˺ ˹ I̊�(DHP)

Page 20: Hash-Based Algorithm for Mining Association Rules

20

Experiments

T15I8D500K

100

200

300

400500

600

700

800

1.5 1.25 1 0.75 0.5

Support (%)

Tim

e (

sec)

Direct Hash Partial Hash

T15I8D500K (sup=0.5%)

0

100

200

300

400

2 3 4 5 6 7 8 9 10

PassesM

emor

y us

age

(MB

)

Direct Hash Partial Hash

Page 21: Hash-Based Algorithm for Mining Association Rules

21

We examined in this paper the issue of mining associationrules among items in a large database of sales transactions.The problem of discovering large itemsets wassolved by constructing a candidate set of itemsets firstand then, identifying, within this candidate set, thoseitemsets that meet the large itemset requirement

Conclusions