sequential pattern mining using a bitmap representation jay ayres, johannes gehrke, tomi yiu, and...
TRANSCRIPT
1
Sequential PAttern Mining Using A Bitmap Representation
Jay Ayres, Johannes Gehrke, Tomi Yiu, and Jason FlannickDept. of Computer Science Cornell University
(SIGKDD 2002)
Presenter李佩書 P76034525 楊璨瑜 P76034672 陳奕廷 P78031125 李昕純 Q56034035
2014/11/20
2
Outline
1. Introduction
2. The SPAM algorithm
3. Data representation
4. Experimental
5. Conclusion & Discussion
2014/11/20
3
Introduction
2014/11/20
4
Sequential Patterns
• R. Agrawal and R. Srikant.(In ICDE 1995)• Algorithm : AprioriALL, AprioriSOME, PrefixSpan…
2014/11/20
5
Problem
• Mining sequential patterns• Given a minimum support minSup• Find all frequent sequential patterns Sa
• supD(Sa) minSup
2014/11/20
6
SPAM Algorithm
• Sequential PAttern Mining Algorithm• The first DFS(depth-first search) strategy for mining sequential patterns
• Vertical bitmap representation for simple, efficient counting.
2014/11/20
7
The SPAM Algorithm
2014/11/20
8
Lexicographic Tree• Sequence-extended Sequence (S-step)
• Generate by adding a new transaction consisting of a single item to the end of sequence
• Ex: ({a, b, c}, {a, b})→({a, b, c}, {a, b}, {a})
• Itemset-extended sequence (I-step)• Generate by adding an item to the last itemset in the sequence• Ex 1: ({a, b, c}, {a, b}) →({a, b, c}, {a, b, d})• Ex 2: ({a, b, c}, {a, b, d}) →({a, b, c}, {a, b, d, c})
• Identifies two sets of each node n• Sn: the set of candidate items for S-step extensions• In: the set of candidate items for I-step extensions
2014/11/20
9
I={a,b}
2014/11/20
10
Pruning
• Apriori-Based• Minimizing the size of Sn and In• Pruning candidate by DFS.
S-step Pruning I-step Pruning
2014/11/20
11
S-step Pruning
S({a}) = {a, b, c, d}I({a}) = {b, c, d}S({a}, {a}) = S({a}, {b}) = {a, b, c, d}I({a}, {a}) = {b, c, d}I({a}, {b}) = {c, d}
2014/11/20
12
I-step Pruning
S({a, b}) = S({a, d}) = {a, b}I({a}, {b}) = {c, d}I({a}, {d}) = {}
2014/11/20
132014/11/20
14
Data Representation
2014/11/20
15
If the size of a sequence between 2k+1 and 2k+1 2k+1-bit sequence
• We store each candidate sequence as a vertical bitmap• Each customer is assigned a fixed slice of each bitmap
for all of its transactions
2014/11/20
16
Bitmap of itemset{a}
1000
0100
1000
{b}
1110
1100
1100
{a,b}
1000
0100
1000
&
2014/11/20
17
Bitmap of sequence• Define B(s) as the bitmap for sequence s.
• In sequence s• If the last itemset is in transaction j
and the other itemsets is in transaction before j • Then set 1 , otherwise set 0
• Example1:
Customer ID
Transaction ID
Itemset
1 1 {b}
1 2 {d}
1 3 {e}
1 4 {c}
({b},{c})
2014/11/20
18
Bitmap of sequence• Define B(s) as the bitmap for sequence s.
• In sequence s• If the last itemset is in transaction j
and the other itemsets is in transaction before j • Then set 1 , otherwise set 0
• Example1:
Customer ID
Transaction ID
Itemset
1 1 {b}
1 2 {d}
1 3 {e}
1 4 {c}
({b},{c})
2014/11/20
19
Bitmap of sequence• Define B(s) as the bitmap for sequence s.
• In sequence s• If the last itemset is in transaction j
and the other itemsets is in transaction before j • Then set 1 , otherwise set 0
• Example1:
Customer ID
Transaction ID
Itemset
1 1 {b}
1 2 {d}
1 3 {e}
1 4 {c}
({b},{c})
2014/11/20
20
Bitmap of sequence• Define B(s) as the bitmap for sequence s.
• In sequence s• If the last itemset is in transaction j
and the other itemsets is in transaction before j • Then set 1 , otherwise set 0
• Example1:
Customer ID
Transaction ID
Itemset
1 1 {b}
1 2 {d}
1 3 {e}
1 4 {c}
({b},{c})
2014/11/20
21
Bitmap of sequence• Define B(s) as the bitmap for sequence s.
• In sequence s• If the last itemset is in transaction j
and the other itemsets is in transaction before j • Then set 1 , otherwise set 0
• Example1:
Customer ID
Transaction ID
Itemset
1 1 {b}
1 2 {d}
1 3 {e}
1 4 {c}
({b},{c})
2014/11/20
22
Bitmap of sequence• Define B(s) as the bitmap for sequence s.
• In sequence s• If the last itemset is in transaction j
and the other itemsets is in transaction before j • Then set 1 , otherwise set 0
• Example1:
Customer ID
Transaction ID
Itemset
1 1 {b}
1 2 {d}
1 3 {e}
1 4 {c}
({b},{c})
1
2014/11/20
23
Bitmap of sequence• Define B(s) as the bitmap for sequence s.
• In sequence s• If the last itemset is in transaction j
and the other itemsets is in transaction before j • Then set 1 , otherwise set 0
• Example1:
Customer ID
Transaction ID
Itemset
1 1 {b}
1 2 {d}
1 3 {e}
1 4 {c}
({b},{c})
0
0
0
1
2014/11/20
24
• Example2
Customer ID
Transaction ID
Itemset
1 1 {a,b,d}
1 3 {b,c,d}
1 6 {b,c,d}
-- -- --
({a},{b,d})
0
1
1
0
2014/11/20
25
S-step ProcessStep 1 : S-Step Process to construct the transformed bitmap ({a})s
Step 2 : ANDing B({a})s and B({b})s
Support=2
2014/11/20
26
S-step ProcessStep 1:S-Step Process to construct the transformed bitmap ({a})s
Step 2:ANDing B({a}) s and B({b})s
2014/11/20
27
I-step Process
Support=2
2014/11/20
28
I-step Process
2014/11/20
29
Experimental
2014/11/20
30
Comparison With SPADE and PrefixSpan
Method-1• Compare for various minimum support values on
Small datasetsMedium datasetsLarge datasets
• Methods-2Compare several parameters in the dataset Number of customersNumber of transactions per customerNumber of items per transactionAverage length of the maximal sequences
2014/11/20
312014/11/20
32
Conclusion & Discussion
2014/11/20
33
CONCLUSION
• ALGORITHM• Outperforms SPADE and PrefixSpan on large datasets• Faster then SPADE and PrefixSpan
• DATA REPRESENTATION• Bitmap representation• S-step/I-step traversal• S-step/I-step pruning • Especially efficient when the sequential patterns are
very long
2014/11/20
34
Implement SPAM algorithm
SPMF is an mining mining frameworkWritten in Java/Open-source data http://www.philippe-fournier-viger.com/spmf/index.php
2014/11/20
35
DISCUSSION
1. SPAM assumes that the entire database completely fit into main memory, what is the solution ?
2. Why they set the size of a sequence between 2k+1 and 2k+1 ?
2014/11/20