hash-based algorithm for mining association rules

1

Hash-Based Algorithm for Mining Association Rules

2

Data Mining

Mining Association Rules

3

Mining Association Rules

Obtain large itemsets (using support)

Generate association rules (using confidence)
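The two measures can be illustrated with a minimal Python sketch. The four transactions, the function names `support` and `rules_from`, and the confidence threshold of 1.0 are assumptions chosen for illustration; they are not part of the slide.

```python
from itertools import combinations

# Hypothetical sample data: four transactions over the items A..E.
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]

def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)

def rules_from(itemset, db, min_conf):
    """Yield (antecedent, consequent, confidence) for one large itemset.

    Assumes `itemset` occurs in at least one transaction, so every
    antecedent has non-zero support.
    """
    items = sorted(itemset)
    whole = support(items, db)
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            conf = whole / support(lhs, db)
            if conf >= min_conf:
                yield lhs, rhs, conf

rules = list(rules_from({"B", "C", "E"}, transactions, min_conf=1.0))
for lhs, rhs, conf in rules:
    print(lhs, "->", rhs, f"confidence={conf:.2f}")
```

With this data, only the rules BC -> E and CE -> B reach a confidence of 1.0; every other split of {B C E} has a confidence of 2/3.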

Apriori-based approach

In Apriori, we first search the given set of structures for small frequent substructures. Then, at each step, a new substructure is created by adding one node to a frequent substructure. Only nodes that were identified as frequent in the first step are used for this extension. Once a new substructure has been created, the set of structures is scanned to determine whether it is frequent.

4

5

Apriori (Sup = 2)

Database D

TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

Scan D: C1

Itemset   Sup.
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1

Itemset   Sup.
{A}       2
{B}       3
{C}       3
{E}       3

C2

Itemset
{A B}
{A C}
{A E}
{B C}
{B E}
{C E}

Scan D: C2

Itemset   Sup.
{A B}     1
{A C}     2
{A E}     1
{B C}     2
{B E}     3
{C E}     2

L2

Itemset   Sup.
{A C}     2
{B C}     2
{B E}     3
{C E}     2

C3

Itemset
{B C E}

Scan D: C3

Itemset   Sup.
{B C E}   2

L3

Itemset   Sup.
{B C E}   2
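The level-wise procedure of this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an optimized implementation: the function name `apriori`, the use of `frozenset` for itemsets, and the join by set union are choices made for this sketch.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {k: {itemset: support}} of large itemsets, level by level."""
    def count(cands):
        # One scan of the database per level.
        return {c: sum(1 for t in db if c <= t) for c in cands}

    items = {frozenset([i]) for t in db for i in t}
    large = {c: s for c, s in count(items).items() if s >= min_sup}
    levels, k = {}, 1
    while large:
        levels[k] = large
        k += 1
        # Join: unions of two large (k-1)-itemsets that have exactly k items.
        cands = {a | b for a in large for b in large if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be large.
        cands = {c for c in cands
                 if all(frozenset(s) in large for s in combinations(c, k - 1))}
        large = {c: s for c, s in count(cands).items() if s >= min_sup}
    return levels

db = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
result = apriori(db, min_sup=2)
for k, level in result.items():
    print(k, sorted(("".join(sorted(c)), s) for c, s in level.items()))
```

Run on the database D above, the sketch yields the same L1, L2, and L3 as the tables: four large 1-itemsets, four large 2-itemsets, and the single large 3-itemset {B C E} with support 2.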

6

Apriori (cont.)

Disadvantages

Inefficient

Produces many useless candidates

7

DHP: Direct Hashing with Efficient Pruning for Fast Data Mining

Prunes useless candidates in advance

Reduces the database size at each iteration

8

DHP example (Min sup = 2)

Database D

TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

C1

Itemset   Count
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1: {A}, {B}, {C}, {E}

Making a hash table

TID   2-itemsets (items in L1 only)
100   {A C}
200   {B C}, {B E}, {C E}
300   {A B}, {A C}, {A E}, {B C}, {B E}, {C E}
400   {B E}

H({x y}) = ((order of x) * 10 + (order of y)) mod 7, with order A = 1, B = 2, C = 3, D = 4, E = 5

Hash table H2

Hash address   Itemsets hashed to the bucket   Count   Bit vector
0              {C E}, {C E}                    2       1
1              {A E}                           1       0
2              {B C}, {B C}                    2       1
3                                              0       0
4              {B E}, {B E}, {B E}             3       1
5              {A B}                           1       0
6              {A C}, {A C}                    2       1
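The hash table construction can be sketched in Python as follows. The sketch follows the slide in forming 2-itemsets only from items in L1; the names `h2`, `build_h2`, and `candidates`, and the item order A = 1 through E = 5, are assumptions made for this illustration.

```python
from itertools import combinations

ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}  # assumed order of each item
N_BUCKETS = 7

def h2(x, y):
    """H({x y}) = ((order of x) * 10 + (order of y)) mod 7."""
    return (ORDER[x] * 10 + ORDER[y]) % N_BUCKETS

def build_h2(db, l1, min_sup):
    """Hash every 2-itemset made of large items into a bucket counter."""
    counts = [0] * N_BUCKETS
    for t in db:
        kept = sorted(i for i in t if i in l1)    # drop items outside L1
        for x, y in combinations(kept, 2):
            counts[h2(x, y)] += 1
    # A bucket's bit is set when its count reaches the minimum support.
    bits = [1 if c >= min_sup else 0 for c in counts]
    return counts, bits

def candidates(l1, bits):
    """C2: pairs of large items whose bucket bit is set."""
    return [(x, y) for x, y in combinations(sorted(l1), 2) if bits[h2(x, y)]]

db = ["ACD", "BCE", "ABCE", "BE"]
l1 = {"A", "B", "C", "E"}
counts, bits = build_h2(db, l1, min_sup=2)
c2 = candidates(l1, bits)
print(bits, c2)
```

The resulting bit vector is 1 0 1 0 1 0 1, and only four pairs ({A C}, {B C}, {B E}, {C E}) survive as candidates, compared with the six candidates that Apriori generates from the same L1.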

9

Perfect Hashing Schemes (PHS) for Mining Association Rules

10

Motivation

Apriori and DHP produce C_i from L_{i-1}, which may be the bottleneck

Collisions in DHP

Designing a perfect hashing function for every transaction database is a thorny problem

11

Definition

A join operation joins two different (k-1)-itemsets, $S^1_{k-1}$ and $S^2_{k-1}$, to produce a k-itemset, where

$S^1_{k-1} = p_1 p_2 \ldots p_{k-1}$

$S^2_{k-1} = q_1 q_2 \ldots q_{k-1}$

and $p_2 = q_1$, $p_3 = q_2$, ..., $p_{k-2} = q_{k-3}$, $p_{k-1} = q_{k-2}$.

Example: ABC, BCD. Among the 3-itemsets of ABCD (ABC, ABD, ACD, BCD), this is the only pair that satisfies the join definition.
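The join condition can be checked with a short Python sketch. It assumes that each itemset is written as a string of single-character items in a fixed order; the function name `join` is chosen for this illustration.

```python
from itertools import permutations

def join(s1, s2):
    """Join two different (k-1)-itemsets p1..p(k-1) and q1..q(k-1).

    Defined only when p2..p(k-1) equals q1..q(k-2); the result is the
    k-itemset p1 p2 ... p(k-1) q(k-1). Returns None otherwise.
    """
    if s1 != s2 and len(s1) == len(s2) and s1[1:] == s2[:-1]:
        return s1 + s2[-1]
    return None

three_itemsets = ["ABC", "ABD", "ACD", "BCD"]
joinable = [(a, b, join(a, b))
            for a, b in permutations(three_itemsets, 2)
            if join(a, b) is not None]
print(joinable)
```

Trying every ordered pair of the four 3-itemsets leaves only (ABC, BCD), which joins to ABCD.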

12

Algorithm PHS (Perfect Hashing and Data Shrinking)

13

Example 1 (sup = 2)

Database

TID   Items
100   ACD
200   BCE
300   BCDE
400   BE

L1

Itemset   Sup.
{B}       3
{C}       3
{D}       2
{E}       3

2-itemsets of each transaction (items in L1 only)

TID   Items
100   (CD)
200   (BC) (BE) (CE)
300   (BC) (BD) (BE) (CD) (CE) (DE)
400   (BE)

Hash table

Itemset   (BC)  (BD)  (BE)  (CD)  (CE)  (DE)
Support    2     1     3     2     2     1

Encoding of the large 2-itemsets

Encoding   A     B     C     D
Original   (BC)  (BE)  (CD)  (CE)

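$hash(X, Y) = \left(C^{n}_{2} - C^{n - index(X)}_{2}\right) + \left(index(Y) - index(X)\right) - 1$

With n = 4 and positions in L1 counted from 0 (B = 0, C = 1, D = 2, E = 3), the six 2-itemsets (BC), (BD), (BE), (CD), (CE), (DE) receive the distinct addresses 0 to 5.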
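The collision-free counting of this example can be sketched in Python. The operators of the hash function are lost in the slide text, so the sketch assumes a lexicographic rank of the pair, with item positions counted from 0; the names `pair_address` and `count_pairs` are likewise assumptions for this illustration.

```python
from itertools import combinations
from math import comb
from string import ascii_uppercase

def pair_address(i, j, n):
    """Address of the pair with 0-based positions i < j among n items.

    Assumed form: C(n,2) - C(n-i,2) + (j - i) - 1, in 0 .. C(n,2)-1.
    """
    return comb(n, 2) - comb(n - i, 2) + (j - i) - 1

def count_pairs(db, l1):
    """One scan: count every 2-itemset of large items, with no collisions."""
    pos = {item: p for p, item in enumerate(l1)}
    table = [0] * comb(len(l1), 2)
    for t in db:
        kept = sorted(pos[i] for i in t if i in pos)   # drop items outside L1
        for i, j in combinations(kept, 2):
            table[pair_address(i, j, len(l1))] += 1
    return table

l1 = ["B", "C", "D", "E"]
db = ["ACD", "BCE", "BCDE", "BE"]
table = count_pairs(db, l1)
pairs = [x + y for x, y in combinations(l1, 2)]        # same order as addresses
large2 = [p for p, s in zip(pairs, table) if s >= 2]
encoding = dict(zip(ascii_uppercase, large2))          # new item -> original pair
print(table, encoding)
```

On the database of Example 1, the table holds the supports 2 1 3 2 2 1, and the four large 2-itemsets are encoded as A = (BC), B = (BE), C = (CD), D = (CE).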

14

Example 2 (sup = 2)

Pairs of encoded items in each transaction

TID   Items
100   Null
200   (AD)
300   (AC) (AD)
400   Null

Hash table

Itemset   (AB)  (AC)  (AD)  (BC)  (BD)  (CD)
Support    0     1     2     0     0     0

Encoding of the large itemsets

Encoding   A
Original   (AD)

Decode: (AD) -> (BC)(CE) = BCE

$hash(X, Y) = \left(C^{n}_{2} - C^{n - index(X)}_{2}\right) + \left(index(Y) - index(X)\right) - 1$
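The second pass can be sketched in Python as follows. As a simplification, the sketch re-derives the encoded items from the original transactions of Example 1 rather than from a stored, shrunken database; the names `shrink`, `joined_pairs`, and `decode` are assumptions for this illustration.

```python
from itertools import permutations

# Encoding taken from Example 1: new item -> large 2-itemset it stands for.
encoding = {"A": "BC", "B": "BE", "C": "CD", "D": "CE"}

def shrink(transaction, encoding):
    """Encoded items whose original itemset is contained in the transaction."""
    return [new for new, old in encoding.items() if set(old) <= set(transaction)]

def joined_pairs(items, encoding):
    """Pairs (X, Y) of encoded items whose originals satisfy the join definition."""
    return [x + y for x, y in permutations(items, 2)
            if encoding[x][1:] == encoding[y][:-1]]

def decode(pair, encoding):
    """Expand an encoded pair back into the original itemset."""
    x, y = pair
    return encoding[x] + encoding[y][-1]

db = ["ACD", "BCE", "BCDE", "BE"]
support = {}
for t in db:
    for pair in joined_pairs(shrink(t, encoding), encoding):
        support[pair] = support.get(pair, 0) + 1

large = [p for p, s in support.items() if s >= 2]
print(support, [decode(p, encoding) for p in large])
```

Transactions 100 and 400 contribute no pair, transaction 200 contributes (AD), and transaction 300 contributes (AC) and (AD). Only (AD) reaches the support of 2, and it decodes to BCE.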

15

Problem on Hash Table

Consider a database of p transactions that are made up of unique items and have equal length N, with a minimum support of 1.

Loading density: [formula not recoverable from the source; it is expressed in terms of N, k, p, and m]

16

How to Improve the Loading Density

Two-level perfect hash scheme (partial hash)

Item         A           B      C
Hash table   C     D     Null   Null
Count        1     2

Itemset   (AB)  (AC)  (AD)  (BC)  (BD)  (CD)
Support    0     1     2     0     0     0
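The difference between the two layouts can be sketched in Python. This is an illustration under assumptions: a Python dict stands in for the second-level table, `None` plays the role of Null, and the names `direct_table` and `two_level_table` are hypothetical.

```python
from math import comb

def direct_table(pairs_seen, items):
    """One slot per possible pair, so many slots may stay empty."""
    slots = {x + y: 0 for i, x in enumerate(items) for y in items[i + 1:]}
    for p in pairs_seen:
        slots[p] += 1
    return slots

def two_level_table(pairs_seen, items):
    """First level: one entry per leading item; second level: only pairs seen."""
    table = {x: None for x in items[:-1]}      # None plays the role of Null
    for x, y in pairs_seen:
        if table[x] is None:
            table[x] = {}
        table[x][y] = table[x].get(y, 0) + 1
    return table

items = ["A", "B", "C", "D"]
pairs_seen = ["AD", "AC", "AD"]                # pairs counted in Example 2
direct = direct_table(pairs_seen, items)
partial = two_level_table(pairs_seen, items)
# Loading density of the direct table: occupied slots over all slots.
density = sum(1 for s in direct.values() if s) / comb(len(items), 2)
print(direct, partial, round(density, 2))
```

For the pairs of Example 2, the direct table uses two of its six slots (a loading density of 1/3), while the two-level layout stores only the entries C and D under A and leaves B and C as Null.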

18

Experiments

[Figure: T5I4D200K. Time (sec) versus minimum support (%), from 1.5 down to 0.25, for PHS, DHP, and Apriori.]

[Figure: T20I4D100K. Time (sec) versus minimum support (%), from 1.25 down to 0.25, for PHS, DHP, and MPHP.]

19

Experiments

[Figure: Increasing Number of Transactions. Time (sec) versus number of transactions, from 200K to 1000K, for PHS and DHP on two datasets.]

20

Experiments

[Figure: T15I8D500K. Time (sec) versus support (%), from 1.5 down to 0.5, for Direct Hash and Partial Hash.]

[Figure: T15I8D500K (sup = 0.5%). Memory usage (MB) versus passes, from 2 to 10, for Direct Hash and Partial Hash.]

21

Conclusions

We examined in this paper the issue of mining association rules among items in a large database of sales transactions. The problem of discovering large itemsets was solved by constructing a candidate set of itemsets first and then, identifying, within this candidate set, those itemsets that meet the large itemset requirement
