uspan: an efficient algorithm for mining high utility sequential patterns authors: junfu yin,...

55
USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012. p. 660-668. Presenter: 0356069 江江江 0356027 江江江

Upload: martha-kennedy

Post on 26-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns

Authors: Junfu Yin, Zhigang Zheng, Longbing Cao

In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012. p. 660-668.

Presenter:0356069 江怡蕙 0356027 薛筑軒

Page 2: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Outline

• Introduction• Related work• Problem Statement• USpan algorithm• Experiment • Conclusions & Discussions

2

Page 3: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Outline

• Introduction• Background• Definition• Challenges

• Related work• Problem Statement• USpan algorithm• Experiment • Conclusions & Discussions

3

Page 4: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Introduction

Sequential pattern mining has proven to be very essential for handling order-based critical business problems. EX: structures and functions of molecular or DNA sequences

4

Page 5: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Background

The selection of interesting sequences is generally based on the frequency/support framework: sequences of high frequency are treated as significant. Under this framework, the downward closure property (also known as Apriori property) plays a fundamental role.

5

Page 6: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

• Utility • Internal utility = quantity ; External utility = quality• High utility pattern mining• Minimum utility• The utility of <ea> in sequence 2 is {(6 × 1 + 1 × 2) ,

(6 × 1 + 2 × 2)} = {8, 10}

6

Definition

Page 7: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

• The concept of sequence utility by considering the quality and quantity associated with each item in a sequence, and define the problem of mining high utility sequential patterns;• A complete lexicographic quantitative sequence

tree (LQS-tree) to construct utility-based sequences; two concatenation mechanisms I-Concatenation and S-Concatenation generate newly concatenated sequences;

7

Definition

Page 8: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Definition

• Two pruning methods, width and depth pruning, substantially reduce the search space in the LQS-tree;• USpan traverses LQS-tree and outputs all the high

utility sequential patterns.

8

Page 9: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Outline

• Introduction• Related work• Utility Itemset/Pattern Mining• Utility-based Sequential Pattern Mining

• Problem Statement• USpan algorithm• Experiment • Conclusions & Discussions

9

Page 10: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

• Mining high utility itemsets is much more challenging than discovering frequent itemsets, because the fundamental downward closure property in frequent itemset mining does not hold in utility itemsets.• The addition of ordering information in sequences makes it

fundamentally different and much more challenging than mining utility itemsets

10

Utility Itemset/Pattern Mining

Page 11: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Utility-based Sequential Pattern Mining• Mining frequent sequences many patterns being mined; • Patterns with frequencies lower than minimum support are

filtered

11

Page 12: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Outline

• Introduction• Related work• Problem Statement• USpan algorithm• Experiment • Conclusions & Discussions

12

Page 13: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Sequence Utility Framework

• I = {i1, i2, ..., in} a set of distinct items

• Each item ik I∈ (1<= k<=n) is associated with a quality (or external utility), denoted as p(ik)

• A quantitative item, or q-item, is an ordered pair (i, q), where i I ∈ represents an item and q is a positive number representing the quantity or internal utility

13

Page 14: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

• A quantitative itemset, or q-itemset, consists of more than one q-item, which is denoted and defined as l = [(ij1, q1)(ij2, q2)...(ijn, qn )]

• A quantitative sequence, or q-sequence, is an ordered list of qitemsets, which is denoted and defined as s =< l1l2 ... lm>

• A q-sequence database S consists of sets of tuples <sid, s>

14

Sequence Utility Framework

Page 15: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Sequence Utility Framework- Definitions

15

EX: (a, 4), [(a, 4)(e, 2)] and [(a, 4)(b, 1)(e, 2)] ⊆ [(a, 4)(b, 1)(e, 2)]But [(a, 2)(e, 2)] or [(a, 4)(c, 1)] not contained in [(a, 4)(b, 1)(e, 2)]

<(b, 2)>, <[(b, 2)(e, 3)]>, <[(b, 2)][(e, 3)](a, 2)>

Page 16: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Sequence Utility Framework- Definitions

16

<(e, 5)[(c, 2)(f, 1)](b, 2)> is a 4-q-sequence with size 3. <ea> is a 2-sequence with size 2.

Page 17: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

17

Sequence Utility Framework- Definitions

Page 18: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

18

Sequence Utility Framework- Definitions

Page 19: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

19

Sequence Utility Framework- Definitions

t = <ea>t’s utility in the s4 sequence in Table 2 is v(t, s4) = {u(<(e, 2)(a, 7)>), u(<(e, 2)(a, 4)>)} = {16, 10}. t’s utility in S is v(t) = {u(t, s2), u(t, s4),u(t, s5)} = {{8, 10}, {16, 10}, {15, 7}}

Page 20: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

20

High Utility Sequential Pattern Mining

Page 21: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

High Utility Sequential Pattern Mining• Definition 10. (High Utility Sequential Pattern) Because a

sequence may have multiple utility values in the q-sequence context, we choose the maximum utility as the sequence’s utility. The maximum utility of a sequence t is denoted and defined as umax(t):

• Sequence t is a high utility sequential pattern if and only if

ξ user-specified minimum utility

21

The utility of sequence ea is umax(<ea>) =10 + 16 + 15 = 41. If the minimum utility is ξ = 40, then sequence s = <ea> is a high utility sequential pattern since umax(s) = 41 ≥ ξ

Page 22: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Outline

• Introduction• Related work• Problem Statement• USpan algorithm• Lexicographic Q-Sequence Tree• Concatenations• Width Pruning• Depth Pruning• USpan Algorithm

• Experiment • Conclusions & Discussions

22

Page 23: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

USpan Algorithm

• USpan is composed of • a lexicographic q-sequence tree• two concatenation mechanisms • two pruning strategies

23

Page 24: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Lexicographic Q-Sequence Tree• Adapt the concept of the Lexicographic Sequence Tree• Suppose we have a k-sequence t, we call the operation of

appending a new item to the end of t to form (k+1)-sequence concatenation.• If the size of t does not change, we call the operation

I-Concatenation. Otherwise, if the size increases by one, we call it S-Concatenation• <ea>’s I-Concatenate and S-Concatenate with b result in

<e(ab)> and <eab>, respectively.

24

Page 25: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Lexicographic Q-Sequence Tree• Assume two k-sequences ta and tb are concatenated from

sequence t, then ta < tb if• i) ta is I-Concatenated from t, and tb is S-Concatenated from

t, • ii) both ta and tb are I-Concatenated or S-Concatenated from

t, but the concatenated item in ta is alphabetically smaller than that of tb.• <(ab)> < <(ab)b>, <(abc)> < <(ab)b>, <(ab)c> < <(ab)d> and

<(ab)(de)> < <(ab)(df )>

25

Page 26: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Lexicographic Q-Sequence Tree• Definition 11. (Lexicographic Q-sequence Tree) An

lexicographic q-sequence tree (LQS-Tree) T is a tree structure satisfying the following rules:• Each node in T is a sequence along with the utility of the

sequence, while the root is empty• Any node’s child is either an I-Concatenated or S-

concatenated sequence node of the node itself• All the children of any node in T are listed in an

incremental and alphabetical order

26

Page 27: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

• v(ea) = {{8, 10}, {16, 10}, {15, 7}} and umax(ea) = 41.• “Can any <ea>’s child’s maximum utility be calculated by

simply adding the highest utility of the q-items after <ea> to umax(ea)?”

27

Lexicographic Q-Sequence Tree

---------------- no

Page 28: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

• Depth-first search• How can we generate the node’s children’s utilities by

concatenating the corresponding items?• How can we avoid checking unpromising children?• When should USpan stop the search of deeper nodes?

28

Lexicographic Q-Sequence Tree

Concatenations

Width pruning

Depth pruning

Page 29: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

29

Concatenations

• Utility matrix• (utility, remaining utility)

Page 30: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

30

Concatenations

• Utility matrix• (utility, remaining utility)

Page 31: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

31

Concatenations

• Utility matrix• (utility, remaining utility)

Page 32: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

32

Concatenations: I-Concatenation• I-Concatenation for the sequence in s4• = { 10 + 2, 5 + 2 } = { 12, 7 }

Page 33: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

33

Concatenations: S-Concatenation• S-Concatenation for the sequence in s4• Candidates: , , , • = { 12 + 14, 12 + 8 } = { 26, 20 }

Page 34: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

34

Concatenations

• To calculate the utilities of the children of a given sequence (e.g. in s4)• The positions of the last q-items of q-subsequences that

match the sequence (e.g. e1, e3)• Pivot: e1 (the first place where ends)• Ending q-items: e3 (other places where ends)

• The utility of a sequence (e.g. 12, 7)• These values are stored in LQS-Tree.

Page 35: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

35

Concatenations

Page 36: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

36

Width Pruning

• Avoid constructing unpromising items into LQS-Tree• Sequence-weighted utilization (SWU) of sequence t

Page 37: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

37

Width Pruning

• Sequence-weighted Downward Closure Property (SDCP)• Given an utility-based sequence database S, and two

sequences t1 and t2, where t2 contains t1, then

• Item i is a promising item to t• After concatenating i to t, for the new sequence t’

Page 38: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

38

Depth Pruning

• Stop USpan from going deeper in LQS-Tree• Even if all the utilities of the remaining q-items are

counted into the utility of the sequence, the cumulative utility still cannot satisfy .

Page 39: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

39

Depth Pruning

• Given a sequence t and S, the maximum utilities of t and t’s offspring are no more than

• i: the pivot in s of t• urest: remaining utility at q-item i in q-sequence s

• E.g.

Page 40: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

40

USpan Algorithm

// includes depth pruning strategy

// width pruning strategy

// generate candidates

// deal with I-Concatenation

// deal with S-Concatenation

Page 41: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Outline

• Introduction• Related work• Problem Statement• USpan algorithm• Experiment • Settings• Results

• Conclusions & Discussions

41

Page 42: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

42

Experimental Settings

• Data Sets• DS1: C10 T2.5 S4 I2.5 DB10k N1k• DS2: C8 T2.5 S6 I2.5 DB10k N10k

• The average number of elements in a sequence is 10 (8). • The average number of items in an element is 2.5 (2.5). • The average length of a maximal pattern consists of 4 (6)

elements and each element is composed of 2.5 (2.5) items average.

• The data set contains 10k (10k) sequences. • The number of items is 1k (10k).

Page 43: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

43

Experimental Settings

• Data Sets• DS3: online shopping transactions

• There are 811 distinct products, 350,241 transactions and 59,477 customers.

• The average number of elements in a sequence is 5. • The max length of a customer’s sequence is 82. • The most popular product has been ordered 2176 times.

• DS4: mobile communication transactions• The dataset is a 100,000 mobile-call history. • There are 67,420 customers in the dataset. • The maximum length of a sequence is 152.

Page 44: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

44

Experimental Results – Execution Time & (#Patterns)

Page 45: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

45

Experimental Results – Execution Time & (#Patterns)

Page 46: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

46

Experimental Results – Distribution in Terms of Length

Page 47: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

47

Experimental Results – Distribution in Terms of Length

Page 48: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

48

Experimental Results – Pruning

Page 49: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

49

Experimental Results – Scalability

Page 50: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

50

Experimental Results – Utility vs Frequent

Page 51: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

Outline

• Introduction• Related work• Problem Statement• USpan algorithm• Experiment • Conclusions & Discussions

51

Page 52: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

52

Conclusions

• Provide a systematic statement of a generic framework for high utility sequential pattern mining. • Propose an efficient algorithm, Uspan• I-Concatenation, S-Concatenation• Width pruning, depth pruning

• USpan can efficiently identify high utility sequences in large-scale data with low minimum utility.

Page 53: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

53

Discussions

• Strongest part of this paper• USpan grows tree by DFS and needs not to store the

whole LQS-Tree in memory. • Two pruning strategies are proposed and work well in

their experiments. • Only need to calculate the tables once at beginning.

• Weak points of this paper• Each sequence needs a table to store it values and all

the tables are stored in memory. • Each single tree node contains much information.

Page 54: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

54

Discussions

• Possible improvement• Design algorithms for even bigger datasets and better

pruning strategies. • Shrink the number of tables or shrink the number of

elements in a table. • Possible extension• The metric of “utility”• Items with positive and negative unit profits• Time constraints (as in GSP)

• Possible Application• Business decision-making• Analysis of game records of experts

• But need to specify “item” and “utility” first

Page 55: USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM

END&

Thanks for your attention