uspan: an efficient algorithm for mining high utility sequential patterns authors: junfu yin,...

USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns

Authors: Junfu Yin, Zhigang Zheng, Longbing Cao

In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012. p. 660-668.

Presenter:0356069 江怡蕙 0356027 薛筑軒

Outline

• Introduction• Related work• Problem Statement• USpan algorithm• Experiment • Conclusions & Discussions

2

Outline

• Introduction• Background• Definition• Challenges

• Related work• Problem Statement• USpan algorithm• Experiment • Conclusions & Discussions

3

Introduction

Sequential pattern mining has proven to be very essential for handling order-based critical business problems. EX: structures and functions of molecular or DNA sequences

4

Background

The selection of interesting sequences is generally based on the frequency/support framework: sequences of high frequency are treated as significant. Under this framework, the downward closure property (also known as Apriori property) plays a fundamental role.

5

• Utility • Internal utility = quantity ; External utility = quality• High utility pattern mining• Minimum utility• The utility of <ea> in sequence 2 is {(6 × 1 + 1 × 2) ,

(6 × 1 + 2 × 2)} = {8, 10}

6

Definition

• The concept of sequence utility by considering the quality and quantity associated with each item in a sequence, and define the problem of mining high utility sequential patterns;• A complete lexicographic quantitative sequence

tree (LQS-tree) to construct utility-based sequences; two concatenation mechanisms I-Concatenation and S-Concatenation generate newly concatenated sequences;

7

Definition

Definition

• Two pruning methods, width and depth pruning, substantially reduce the search space in the LQS-tree;• USpan traverses LQS-tree and outputs all the high

utility sequential patterns.

8

Outline

• Introduction• Related work• Utility Itemset/Pattern Mining• Utility-based Sequential Pattern Mining

• Problem Statement• USpan algorithm• Experiment • Conclusions & Discussions

9

• Mining high utility itemsets is much more challenging than discovering frequent itemsets, because the fundamental downward closure property in frequent itemset mining does not hold in utility itemsets.• The addition of ordering information in sequences makes it

fundamentally different and much more challenging than mining utility itemsets

10

Utility Itemset/Pattern Mining

Utility-based Sequential Pattern Mining• Mining frequent sequences many patterns being mined; • Patterns with frequencies lower than minimum support are

filtered

11

Outline


12

Sequence Utility Framework

• I = {i1, i2, ..., in} a set of distinct items

• Each item ik I∈ (1<= k<=n) is associated with a quality (or external utility), denoted as p(ik)

• A quantitative item, or q-item, is an ordered pair (i, q), where i I ∈ represents an item and q is a positive number representing the quantity or internal utility

13

• A quantitative itemset, or q-itemset, consists of more than one q-item, which is denoted and defined as l = [(ij1, q1)(ij2, q2)...(ijn, qn )]

• A quantitative sequence, or q-sequence, is an ordered list of qitemsets, which is denoted and defined as s =< l1l2 ... lm>

• A q-sequence database S consists of sets of tuples <sid, s>

14

Sequence Utility Framework

Sequence Utility Framework- Definitions

15

EX: (a, 4), [(a, 4)(e, 2)] and [(a, 4)(b, 1)(e, 2)] ⊆ [(a, 4)(b, 1)(e, 2)]But [(a, 2)(e, 2)] or [(a, 4)(c, 1)] not contained in [(a, 4)(b, 1)(e, 2)]

<(b, 2)>, <[(b, 2)(e, 3)]>, <[(b, 2)][(e, 3)](a, 2)>


16

<(e, 5)[(c, 2)(f, 1)](b, 2)> is a 4-q-sequence with size 3. <ea> is a 2-sequence with size 2.

17


18


19


t = <ea>t’s utility in the s4 sequence in Table 2 is v(t, s4) = {u(<(e, 2)(a, 7)>), u(<(e, 2)(a, 4)>)} = {16, 10}. t’s utility in S is v(t) = {u(t, s2), u(t, s4),u(t, s5)} = {{8, 10}, {16, 10}, {15, 7}}

20

High Utility Sequential Pattern Mining

High Utility Sequential Pattern Mining• Definition 10. (High Utility Sequential Pattern) Because a

sequence may have multiple utility values in the q-sequence context, we choose the maximum utility as the sequence’s utility. The maximum utility of a sequence t is denoted and defined as umax(t):

• Sequence t is a high utility sequential pattern if and only if

ξ user-specified minimum utility

21

The utility of sequence ea is umax(<ea>) =10 + 16 + 15 = 41. If the minimum utility is ξ = 40, then sequence s = <ea> is a high utility sequential pattern since umax(s) = 41 ≥ ξ

Outline

• Introduction• Related work• Problem Statement• USpan algorithm• Lexicographic Q-Sequence Tree• Concatenations• Width Pruning• Depth Pruning• USpan Algorithm

• Experiment • Conclusions & Discussions

22

USpan Algorithm

• USpan is composed of • a lexicographic q-sequence tree• two concatenation mechanisms • two pruning strategies

23

Lexicographic Q-Sequence Tree• Adapt the concept of the Lexicographic Sequence Tree• Suppose we have a k-sequence t, we call the operation of

appending a new item to the end of t to form (k+1)-sequence concatenation.• If the size of t does not change, we call the operation

I-Concatenation. Otherwise, if the size increases by one, we call it S-Concatenation• <ea>’s I-Concatenate and S-Concatenate with b result in

<e(ab)> and <eab>, respectively.

24

Lexicographic Q-Sequence Tree• Assume two k-sequences ta and tb are concatenated from

sequence t, then ta < tb if• i) ta is I-Concatenated from t, and tb is S-Concatenated from

t, • ii) both ta and tb are I-Concatenated or S-Concatenated from

t, but the concatenated item in ta is alphabetically smaller than that of tb.• <(ab)> < <(ab)b>, <(abc)> < <(ab)b>, <(ab)c> < <(ab)d> and

<(ab)(de)> < <(ab)(df )>

25

Lexicographic Q-Sequence Tree• Definition 11. (Lexicographic Q-sequence Tree) An

lexicographic q-sequence tree (LQS-Tree) T is a tree structure satisfying the following rules:• Each node in T is a sequence along with the utility of the

sequence, while the root is empty• Any node’s child is either an I-Concatenated or S-

concatenated sequence node of the node itself• All the children of any node in T are listed in an

incremental and alphabetical order

26

• v(ea) = {{8, 10}, {16, 10}, {15, 7}} and umax(ea) = 41.• “Can any <ea>’s child’s maximum utility be calculated by

simply adding the highest utility of the q-items after <ea> to umax(ea)?”

27

Lexicographic Q-Sequence Tree

---------------- no

• Depth-first search• How can we generate the node’s children’s utilities by

concatenating the corresponding items?• How can we avoid checking unpromising children?• When should USpan stop the search of deeper nodes?

28

Lexicographic Q-Sequence Tree

Concatenations

Width pruning

Depth pruning

29

Concatenations

• Utility matrix• (utility, remaining utility)

30

Concatenations


31

Concatenations


32

Concatenations: I-Concatenation• I-Concatenation for the sequence in s4• = { 10 + 2, 5 + 2 } = { 12, 7 }

33

Concatenations: S-Concatenation• S-Concatenation for the sequence in s4• Candidates: , , , • = { 12 + 14, 12 + 8 } = { 26, 20 }

34

Concatenations

• To calculate the utilities of the children of a given sequence (e.g. in s4)• The positions of the last q-items of q-subsequences that

match the sequence (e.g. e1, e3)• Pivot: e1 (the first place where ends)• Ending q-items: e3 (other places where ends)

• The utility of a sequence (e.g. 12, 7)• These values are stored in LQS-Tree.

35

Concatenations

36

Width Pruning

• Avoid constructing unpromising items into LQS-Tree• Sequence-weighted utilization (SWU) of sequence t

37

Width Pruning

• Sequence-weighted Downward Closure Property (SDCP)• Given an utility-based sequence database S, and two

sequences t1 and t2, where t2 contains t1, then

• Item i is a promising item to t• After concatenating i to t, for the new sequence t’

38

Depth Pruning

• Stop USpan from going deeper in LQS-Tree• Even if all the utilities of the remaining q-items are

counted into the utility of the sequence, the cumulative utility still cannot satisfy .

39

Depth Pruning

• Given a sequence t and S, the maximum utilities of t and t’s offspring are no more than

• i: the pivot in s of t• urest: remaining utility at q-item i in q-sequence s

• E.g.

40

USpan Algorithm

// includes depth pruning strategy

// width pruning strategy

// generate candidates

// deal with I-Concatenation

// deal with S-Concatenation

Outline

• Introduction• Related work• Problem Statement• USpan algorithm• Experiment • Settings• Results

• Conclusions & Discussions

41

42

Experimental Settings

• Data Sets• DS1: C10 T2.5 S4 I2.5 DB10k N1k• DS2: C8 T2.5 S6 I2.5 DB10k N10k

• The average number of elements in a sequence is 10 (8). • The average number of items in an element is 2.5 (2.5). • The average length of a maximal pattern consists of 4 (6)

elements and each element is composed of 2.5 (2.5) items average.

• The data set contains 10k (10k) sequences. • The number of items is 1k (10k).

43

Experimental Settings

• Data Sets• DS3: online shopping transactions

• There are 811 distinct products, 350,241 transactions and 59,477 customers.

• The average number of elements in a sequence is 5. • The max length of a customer’s sequence is 82. • The most popular product has been ordered 2176 times.

• DS4: mobile communication transactions• The dataset is a 100,000 mobile-call history. • There are 67,420 customers in the dataset. • The maximum length of a sequence is 152.

44

Experimental Results – Execution Time & (#Patterns)

45

Experimental Results – Execution Time & (#Patterns)

46

Experimental Results – Distribution in Terms of Length

47

Experimental Results – Distribution in Terms of Length

48

Experimental Results – Pruning

49

Experimental Results – Scalability

50

Experimental Results – Utility vs Frequent

Outline


51

52

Conclusions

• Provide a systematic statement of a generic framework for high utility sequential pattern mining. • Propose an efficient algorithm, Uspan• I-Concatenation, S-Concatenation• Width pruning, depth pruning

• USpan can efficiently identify high utility sequences in large-scale data with low minimum utility.

53

Discussions

• Strongest part of this paper• USpan grows tree by DFS and needs not to store the

whole LQS-Tree in memory. • Two pruning strategies are proposed and work well in

their experiments. • Only need to calculate the tables once at beginning.

• Weak points of this paper• Each sequence needs a table to store it values and all

the tables are stored in memory. • Each single tree node contains much information.

54

Discussions

• Possible improvement• Design algorithms for even bigger datasets and better

pruning strategies. • Shrink the number of tables or shrink the number of

elements in a table. • Possible extension• The metric of “utility”• Items with positive and negative unit profits• Time constraints (as in GSP)

• Possible Application• Business decision-making• Analysis of game records of experts

• But need to specify “item” and “utility” first

END&

Thanks for your attention

uspan: an efficient algorithm for mining high utility sequential patterns authors: junfu yin,...

Documents