Sequential PAttern Mining Using A Bitmap Representation

Jay Ayres, Johannes Gehrke, Tomi Yiu, and Jason Flannick
Dept. of Computer Science, Cornell University

(SIGKDD 2002)

Presenters: 李佩書 (P76034525), 楊璨瑜 (P76034672), 陳奕廷 (P78031125), 李昕純 (Q56034035)

2014/11/20

Outline

1. Introduction

2. The SPAM algorithm

3. Data representation

4. Experimental Evaluation

5. Conclusion & Discussion

Introduction

Sequential Patterns

• Introduced by R. Agrawal and R. Srikant (ICDE 1995)
• Algorithms: AprioriAll, AprioriSome, PrefixSpan, …

Problem

• Mining sequential patterns
• Given a minimum support minSup
• Find all frequent sequential patterns Sa, i.e. those with
  • supD(Sa) ≥ minSup
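The problem statement can be illustrated with a minimal sketch. It assumes that each customer is a time-ordered list of item sets and that minSup is an absolute customer count; the helper names `contains`, `support`, and `is_frequent` and the toy database are hypothetical and chosen only for illustration.

```python
def contains(customer, pattern):
    """Return True if `pattern` occurs, in order, in the customer's transactions."""
    pos = 0
    for itemset in pattern:
        # Advance to the earliest remaining transaction that holds this itemset.
        while pos < len(customer) and not itemset <= customer[pos]:
            pos += 1
        if pos == len(customer):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True


def support(database, pattern):
    """Number of customers whose transaction sequence contains `pattern`."""
    return sum(contains(customer, pattern) for customer in database)


def is_frequent(database, pattern, min_sup):
    """A pattern is frequent when its support reaches min_sup."""
    return support(database, pattern) >= min_sup


# Illustrative toy database: three customers, each a list of transactions.
database = [
    [{"a", "b", "d"}, {"b", "c", "d"}, {"b", "c", "d"}],
    [{"b"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b", "c", "d"}],
]

print(support(database, [{"a"}, {"b"}]))         # 2
print(is_frequent(database, [{"a"}, {"b"}], 2))  # True
```

Each customer is counted at most once, however many times the pattern occurs in that customer's transactions.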

SPAM Algorithm

• Sequential PAttern Mining algorithm
• The first DFS (depth-first search) strategy for mining sequential patterns
• Vertical bitmap representation for simple, efficient counting

The SPAM Algorithm

Lexicographic Tree

• Sequence-extended sequence (S-step)
  • Generated by adding a new transaction consisting of a single item to the end of the sequence
  • Ex: ({a, b, c}, {a, b}) → ({a, b, c}, {a, b}, {a})

• Itemset-extended sequence (I-step)
  • Generated by adding an item to the last itemset in the sequence
  • Ex 1: ({a, b, c}, {a, b}) → ({a, b, c}, {a, b, d})
  • Ex 2: ({a, b, c}, {a, b, d}) → ({a, b, c}, {a, b, d, c})

• Each node n is associated with two sets
  • Sn: the set of candidate items for S-step extensions
  • In: the set of candidate items for I-step extensions
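The two extension operations can be sketched in a few lines. This is an illustrative sketch, assuming a sequence is a tuple of frozensets; the function names `s_step` and `i_step` are hypothetical and not taken from the slides.

```python
def s_step(sequence, item):
    """Sequence extension: append a new single-item itemset to the sequence."""
    return sequence + (frozenset([item]),)


def i_step(sequence, item):
    """Itemset extension: add an item to the last itemset of the sequence."""
    return sequence[:-1] + (sequence[-1] | {item},)


seq = (frozenset("abc"), frozenset("ab"))

# S-step example from the slide: ({a, b, c}, {a, b}) -> ({a, b, c}, {a, b}, {a})
print(s_step(seq, "a") == (frozenset("abc"), frozenset("ab"), frozenset("a")))  # True

# I-step example 1 from the slide: ({a, b, c}, {a, b}) -> ({a, b, c}, {a, b, d})
print(i_step(seq, "d") == (frozenset("abc"), frozenset("abd")))  # True
```

Both functions return a new tuple and leave the parent sequence untouched, which suits a depth-first traversal where the parent is extended several times.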

[Figure: lexicographic sequence tree for the item set I = {a, b}]

Pruning

• Apriori-based
• Minimizes the size of Sn and In
• Prunes candidates during the DFS
• Two techniques: S-step pruning and I-step pruning

S-step Pruning

S({a}) = {a, b, c, d}
I({a}) = {b, c, d}
S({a}, {a}) = S({a}, {b}) = {a, b, c, d}
I({a}, {a}) = {b, c, d}
I({a}, {b}) = {c, d}

I-step Pruning

S({a, b}) = S({a, d}) = {a, b}
I({a}, {b}) = {c, d}
I({a}, {d}) = {}
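A minimal sketch of how Apriori-based pruning shrinks the candidate sets handed to a node's children. It assumes the usual rule that a child inherits only the items whose extension of the parent was frequent; the function name `child_candidates` and the two frequency predicates are hypothetical, and which extensions count as infrequent in the example is an assumption made purely for illustration.

```python
def child_candidates(s_cands, i_cands, s_frequent, i_frequent):
    """
    s_cands, i_cands: sorted candidate items S_n and I_n of the current node n.
    s_frequent(i), i_frequent(i): predicates telling whether the S-step or
    I-step extension of n by item i is frequent (hypothetical callbacks).
    Returns {("S" or "I", item): (S_child, I_child)} for every frequent child.
    """
    s_temp = [i for i in s_cands if s_frequent(i)]
    i_temp = [i for i in i_cands if i_frequent(i)]
    children = {}
    for i in s_temp:
        # S-step child: I-candidates are the surviving S-items greater than i.
        children[("S", i)] = (s_temp, [j for j in s_temp if j > i])
    for i in i_temp:
        # I-step child: I-candidates are the surviving I-items greater than i.
        children[("I", i)] = (s_temp, [j for j in i_temp if j > i])
    return children


# Node ({a}) with S = {a, b, c, d} and I = {b, c, d}.
# Assumption for illustration: ({a}, {c}), ({a}, {d}) and ({a, c}) are infrequent.
children = child_candidates(
    ["a", "b", "c", "d"],
    ["b", "c", "d"],
    s_frequent=lambda i: i in {"a", "b"},
    i_frequent=lambda i: i in {"b", "d"},
)

print(children[("S", "a")])  # (['a', 'b'], ['b'])
print(children[("I", "d")])  # (['a', 'b'], [])
```

Under these assumptions every child of ({a}) receives S = {a, b}, so items c and d are never tried again as sequence extensions anywhere in that subtree.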

Data Representation

• We store each candidate sequence as a vertical bitmap
• Each customer is assigned a fixed slice of each bitmap for all of its transactions
• If the size of a sequence is between 2^k + 1 and 2^(k+1), it is treated as a 2^(k+1)-bit sequence
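The sizing rule amounts to rounding a customer's transaction count up to the next power of two. A small sketch, with the hypothetical helper name `slice_bits`; any minimum slice width an implementation might enforce is not stated on the slide and is not modeled here.

```python
def slice_bits(num_transactions):
    """Width in bits of a customer's bitmap slice: the smallest power of two
    that is >= the number of transactions of that customer."""
    if num_transactions < 1:
        raise ValueError("a customer needs at least one transaction")
    return 1 << (num_transactions - 1).bit_length()


for n in (3, 4, 5, 8, 9):
    print(n, slice_bits(n))
# 3 4
# 4 4
# 5 8
# 8 8
# 9 16
```

A customer with 5 to 8 transactions therefore gets an 8-bit slice, and the unused trailing bits stay 0.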

Bitmap of itemset

Slice   B({a})   B({b})   B({a,b}) = B({a}) & B({b})
1       1000     1110     1000
2       0100     1100     0100
3       1000     1100     1000
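The itemset bitmap is obtained with a plain bitwise AND, slice by slice. The sketch below reuses the bitmap values shown on this slide; storing each slice as a string of 0s and 1s and the helper names `and_bitmaps` and `support` are assumptions made for readability, not the packed representation a real implementation would use.

```python
def and_bitmaps(x, y):
    """Bitwise AND of two bitmaps given as lists of equal-width bit strings."""
    return [format(int(p, 2) & int(q, 2), f"0{len(p)}b") for p, q in zip(x, y)]


def support(bitmap):
    """Count the slices that contain at least one set bit."""
    return sum("1" in s for s in bitmap)


b_a = ["1000", "0100", "1000"]  # B({a}), one 4-bit slice per row of the slide
b_b = ["1110", "1100", "1100"]  # B({b})

b_ab = and_bitmaps(b_a, b_b)
print(b_ab)           # ['1000', '0100', '1000']
print(support(b_ab))  # 3
```

Counting non-zero slices rather than set bits is what makes each slice contribute at most 1 to the support.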

Bitmap of sequence

• Define B(s) as the bitmap for sequence s.
• For each transaction j of a customer:
  • If the last itemset of s is contained in transaction j and the other itemsets of s are contained, in order, in transactions before j, set bit j to 1
  • Otherwise set bit j to 0

• Example 1: s = ({b},{c})

Customer ID   Transaction ID   Itemset   B(({b},{c}))
1             1                {b}       0
1             2                {d}       0
1             3                {e}       0
1             4                {c}       1

• Example 2: s = ({a},{b,d})

Customer ID   Transaction ID   Itemset   B(({a},{b,d}))
1             1                {a,b,d}   0
1             3                {b,c,d}   1
1             6                {b,c,d}   1
--            --               --        0
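The definition of B(s) can be checked directly against both examples. The sketch below builds one customer's slice straight from the definition rather than with the S-step and I-step operations; the names `occurs_in_order` and `sequence_bitmap` are hypothetical, and padding unused positions with 0 is assumed from the "--" row of Example 2.

```python
def occurs_in_order(transactions, pattern):
    """True if the itemsets of `pattern` occur, in order, in `transactions`."""
    pos = 0
    for itemset in pattern:
        while pos < len(transactions) and not itemset <= transactions[pos]:
            pos += 1
        if pos == len(transactions):
            return False
        pos += 1
    return True


def sequence_bitmap(transactions, pattern, width):
    """Bit j is 1 when the last itemset of `pattern` is in transaction j and
    the earlier itemsets occur, in order, in transactions before j."""
    bits = []
    for j, transaction in enumerate(transactions):
        last_ok = pattern[-1] <= transaction
        prefix_ok = occurs_in_order(transactions[:j], pattern[:-1])
        bits.append("1" if last_ok and prefix_ok else "0")
    return "".join(bits).ljust(width, "0")  # unused positions stay 0


example1 = [{"b"}, {"d"}, {"e"}, {"c"}]
example2 = [{"a", "b", "d"}, {"b", "c", "d"}, {"b", "c", "d"}]

print(sequence_bitmap(example1, [{"b"}, {"c"}], 4))       # 0001
print(sequence_bitmap(example2, [{"a"}, {"b", "d"}], 4))  # 0110
```

In Example 2 the first transaction contains {b,d} but no earlier transaction contains {a}, which is why its bit stays 0.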

S-step Process

• Step 1: apply the S-step transformation to B({a}) to construct the transformed bitmap B({a})_s
• Step 2: AND B({a})_s with B({b})
• Support = 2
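The two steps can be sketched as follows. The transformation rule is not spelled out in the slide text; this sketch assumes the rule described in the SPAM paper: within each customer slice, clear every bit up to and including the first 1 and set every later bit. The bitmaps for {a} and {b} are the ones from the "Bitmap of itemset" slide, and the helper names are hypothetical.

```python
def s_transform(slice_bits):
    """Clear bits up to and including the first 1, then set all later bits."""
    k = slice_bits.find("1")
    if k == -1:
        return "0" * len(slice_bits)  # the sequence never occurs for this customer
    return "0" * (k + 1) + "1" * (len(slice_bits) - k - 1)


def s_step(seq_bitmap, item_bitmap):
    """Step 1: transform each slice of the sequence bitmap.
    Step 2: AND it with the matching slice of the item bitmap."""
    result = []
    for s, i in zip(seq_bitmap, item_bitmap):
        anded = int(s_transform(s), 2) & int(i, 2)
        result.append(format(anded, f"0{len(s)}b"))
    return result


def support(bitmap):
    """Count the slices that contain at least one set bit."""
    return sum("1" in s for s in bitmap)


b_a = ["1000", "0100", "1000"]  # B({a})
b_b = ["1110", "1100", "1100"]  # B({b})

b_a_then_b = s_step(b_a, b_b)
print(b_a_then_b)           # ['0110', '0000', '0100']
print(support(b_a_then_b))  # 2
```

The transformed bitmap marks every transaction that comes strictly after the first occurrence of ({a}), so the AND keeps only the occurrences of {b} that follow it. The second slice yields 0000 because {b} never occurs after {a} there.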

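I-step Process

• AND the bitmap of the current sequence with the bitmap of the appended item (no transformation step)
• Support = 2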

Experimental Evaluation

Comparison With SPADE and PrefixSpan

• Method 1: compare the algorithms for various minimum support values on
  • Small datasets
  • Medium datasets
  • Large datasets

• Method 2: compare the algorithms while varying several parameters of the dataset
  • Number of customers
  • Number of transactions per customer
  • Number of items per transaction
  • Average length of the maximal sequences

Conclusion & Discussion

CONCLUSION

• ALGORITHM
  • Outperforms SPADE and PrefixSpan on large datasets
  • Faster than SPADE and PrefixSpan

• DATA REPRESENTATION
  • Bitmap representation
  • S-step/I-step traversal
  • S-step/I-step pruning
  • Especially efficient when the sequential patterns are very long

Implementing the SPAM algorithm

• SPMF is an open-source data mining framework
• Written in Java
• http://www.philippe-fournier-viger.com/spmf/index.php

DISCUSSION

1. SPAM assumes that the entire database fits completely in main memory. What can be done when it does not?

2. Why is a sequence whose size is between 2^k + 1 and 2^(k+1) stored as a 2^(k+1)-bit sequence?
