ch5 sequential

Upload: bama-raja-segaran

Post on 03-Jun-2018


  • 8/11/2019 Ch5 Sequential

    1/20

    August 19, 2014 Data Mining: Concepts and Techniques 1

    Chap 5.1: Mining Sequential Patterns

    A kind of association rule mining

    Its algorithms are closely related to association rule mining algorithms


    Sequence Databases and Sequential Pattern Analysis

    Transaction databases, time-series databases vs. sequence databases

    A time-series database stores sequences of values that change with time, such as data collected regarding the stock exchange.

    Frequent patterns vs. (frequent) sequential patterns

    Applications of sequential pattern mining

    Customer shopping sequences:

    First buy computer, then CD-ROM, and then digital camera, within 3 months.

    Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.

    Telephone calling patterns, Weblog click streams

    DNA sequences and gene structures


    What Is Sequential Pattern Mining?

    Given a set of sequences, find the complete set of frequent subsequences

    A sequence database

    [Table: SID, sequence; rows for SIDs 10, 20, 30, 40]

    A sequence: < (ef) (ab) (df) c b >

    An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.

    […] is a subsequence of […]

    Given support threshold min_sup = 2, […] is a sequential pattern


    Mining Sequential Patterns

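    A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is contained in another sequence $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$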
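The containment test can be sketched in a few lines of Python. This is a minimal illustration, assuming each sequence is a list of Python sets; the function name `is_contained` and the item numbers in the two example calls are hypothetical and do not come from the slides.

```python
def is_contained(a, b):
    """Return True if sequence `a` is contained in sequence `b`.

    Both arguments are lists of itemsets (Python sets). Matching is greedy:
    each itemset of `a` is matched to the earliest remaining itemset of `b`
    that is a superset of it, which is sufficient for this test.
    """
    j = 0
    for itemset in a:
        # Skip itemsets of `b` that do not contain the current itemset of `a`.
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next itemset of `a` must match a strictly later position
    return True


# Illustrative sequences (assumed data, not taken from the slides).
print(is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_contained([{3}, {5}], [{3, 5}]))  # False
```

The second call returns False because the two items sit in the same itemset of the longer sequence, so they cannot be matched to two distinct positions.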


    Mining Sequential Patterns: An Example

    CustomerId   TransactionTime   ItemsBought
    1            June 25 93        30
    1            June 30 93        90
    2            June 10 93        10, 20
    2            June 15 93        30
    2            June 20 93        40, 60, 70
    3            June 25 93        30, 50, 70
    4            June 25 93        30
    4            June 30 93        40, 70
    4            July 25 93        90
    5            June 12 93        90

    Database Sorted by CustomerId and Transaction Time

    CustomerId   Customer Sequence
    1            < (30) (90) >
    2            < (10 20) (30) (40 60 70) >
    3            < (30 50 70) >
    4            < (30) (40 70) (90) >
    5            < (90) >

    Customer sequence version of the DB

    Sequential patterns with support > 25%


    Mining Sequential Patterns

    Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support


    Mining Sequential Patterns

    Involves 5 phases:

    Sort phase

    Litemset phase

    Transformation phase

    Sequence phase

    Maximal phase

    Terminology

    Length of a sequence: the number of itemsets in the sequence (a sequence of length k is called a k-sequence)

    Support of an itemset i: the fraction of customers who bought the items in i in a single transaction

    An itemset with minimum support is called a large itemset or litemset


    Mining Sequential Patterns

    Involves 5 phases:

    Sort phase

    The database (D) is sorted, with customer-id as the major key and transaction time as the minor key

    This converts the original transaction database into a database of customer sequences

    Litemset phase

    Find the set of all litemsets L

    Simultaneously find the set of all large 1-sequences { < l > | l ∈ L }

    The set of litemsets is mapped to a set of contiguous integers

    Reason for mapping: by treating litemsets as single entities, we can compare two litemsets for equality in constant time, and reduce the time required to check if a sequence is contained in a customer sequence
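The sort phase described above can be sketched as follows. This is a minimal illustration, assuming rows of the form (customer id, date, items) taken from the example database. The function name `sort_phase`, the tuple layout, and the reading of "93" as the year 1993 are assumptions made for the sketch.

```python
from datetime import date
from itertools import groupby

# Rows of the example database, deliberately shuffled to show the effect of sorting.
TRANSACTIONS = [
    (4, date(1993, 7, 25), {90}),
    (2, date(1993, 6, 15), {30}),
    (1, date(1993, 6, 30), {90}),
    (5, date(1993, 6, 12), {90}),
    (2, date(1993, 6, 20), {40, 60, 70}),
    (4, date(1993, 6, 25), {30}),
    (3, date(1993, 6, 25), {30, 50, 70}),
    (1, date(1993, 6, 25), {30}),
    (2, date(1993, 6, 10), {10, 20}),
    (4, date(1993, 6, 30), {40, 70}),
]


def sort_phase(transactions):
    """Sort by customer id (major key) and transaction time (minor key),
    then group each customer's itemsets into one customer sequence."""
    ordered = sorted(transactions, key=lambda row: (row[0], row[1]))
    return {
        customer_id: [frozenset(items) for _, _, items in rows]
        for customer_id, rows in groupby(ordered, key=lambda row: row[0])
    }


for customer_id, sequence in sort_phase(TRANSACTIONS).items():
    print(customer_id, [sorted(itemset) for itemset in sequence])
```

The sort key uses only the customer id and the date, so the item sets themselves are never compared.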


    Litemset phase

    CustomerId   TransactionTime   ItemsBought
    1            June 25 93        30
    1            June 30 93        90
    2            June 10 93        10, 20
    2            June 15 93        30
    2            June 20 93        40, 60, 70
    3            June 25 93        30, 50, 70
    4            June 25 93        30
    4            June 30 93        40, 70
    4            July 25 93        90
    5            June 12 93        90

    Fig. 1 Database Sorted by CustomerId and Transaction Time

    Large itemsets are (30), (40), (70), (40 70) and (90)

    Large Itemset   Mapped to
    (30)            1
    (40)            2
    (70)            3
    (40 70)         4
    (90)            5
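The litemset phase on this example can be sketched as below. It is a brute-force illustration that enumerates every subset of every transaction, assuming a minimum support of 2 customers (more than 25% of 5). A real implementation would use Apriori-style pruning instead. The function name, the data layout, and the sort key used to number the litemsets are assumptions; the key was chosen only so that the numbering matches the mapping table above.

```python
from itertools import combinations

# Customer sequences derived from the sorted example database.
CUSTOMER_SEQUENCES = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
MIN_CUSTOMERS = 2  # assumed threshold: support > 25% of 5 customers


def litemset_phase(customer_sequences, min_customers):
    """Return {litemset (sorted tuple): integer id} for all large itemsets."""
    supporters = {}  # itemset -> set of customers that bought it in one transaction
    for customer_id, sequence in customer_sequences.items():
        for transaction in sequence:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                for subset in combinations(items, size):
                    supporters.setdefault(subset, set()).add(customer_id)
    large = [s for s, customers in supporters.items() if len(customers) >= min_customers]
    # Order by largest item, then by size, to reproduce the numbering in the table.
    large.sort(key=lambda s: (s[-1], len(s), s))
    return {itemset: number for number, itemset in enumerate(large, start=1)}


print(litemset_phase(CUSTOMER_SEQUENCES, MIN_CUSTOMERS))
# {(30,): 1, (40,): 2, (70,): 3, (40, 70): 4, (90,): 5}
```

Each customer is counted at most once per itemset, because support is defined over customers rather than transactions.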


    Transformation Phase

    We need to repeatedly determine which of a given set of large sequences are contained in a customer sequence

    To make this test fast, each customer sequence is transformed into an alternative representation

    Each transaction is replaced by the set of all litemsets contained in the transaction

    If a transaction does not contain any litemset, it is not retained in the transformed sequence

    If a customer sequence does not contain any litemsets, the sequence is dropped from the transformed database

    CustomerId   Customer Sequence
    1            < (30) (90) >
    2            < (10 20) (30) (40 60 70) >
    3            < (30 50 70) >
    4            < (30) (40 70) (90) >
    5            < (90) >

    Customer sequence version of the DB

    Large Itemset   Mapped to
    (30)            1
    (40)            2
    (70)            3
    (40 70)         4
    (90)            5


    Transformation Phase

    [Table: Transformed Database; columns: Customer ID, Customer Sequence, Transformed Customer Sequence, After Mapping; rows for customers 1 to 5]

    Large Itemset   Mapped to
    (30)            1
    (40)            2
    (70)            3
    (40 70)         4
    (90)            5
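The transformation can be sketched directly from the two rules stated for this phase. The sketch below assumes the customer sequences derived from the example database and the litemset numbering from the mapping table. The function name `transform` and the data layout are illustrative choices, not part of the slides.

```python
# Litemset -> integer id, as in the mapping table.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

# Customer sequences derived from the sorted example database.
CUSTOMER_SEQUENCES = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def transform(customer_sequences, litemset_ids):
    """Replace each transaction by the ids of all litemsets it contains."""
    transformed = {}
    for customer_id, sequence in customer_sequences.items():
        new_sequence = []
        for transaction in sequence:
            ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
            if ids:  # a transaction without any litemset is not retained
                new_sequence.append(ids)
        if new_sequence:  # a customer without any litemset is dropped
            transformed[customer_id] = new_sequence
    return transformed


for customer_id, sequence in transform(CUSTOMER_SEQUENCES, LITEMSET_IDS).items():
    print(customer_id, [sorted(ids) for ids in sequence])
```

For customer 2, the transaction (10 20) contains no litemset and disappears, while (40 60 70) becomes the set of ids {2, 3, 4}.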


    Sequence Phase

    Make multiple passes over the data

    In each pass, we start with a seed set of large sequences

    Use the seed set to generate new potentially large sequences, called candidate sequences

    Count the support of the candidates during the pass over the data

    At the end of the pass, determine which candidate sequences are large

    These large candidates become the seed for the next pass

    Involves 2 families of algorithms: count-all and count-some

    The count-all algorithm, based on the Apriori algorithm, is called AprioriAll


    Sequence Phase

    AprioriAll: Apriori candidate generation

    [Table: Large 3-Sequences; Candidate 4-Sequences (after join); Candidate 4-Sequences (after pruning)]

    Maximal phase

    Find the maximal sequences among the set of large sequences
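The join and prune steps of Apriori-style candidate generation can be sketched as follows. Sequences are tuples of litemset ids. The set of large 3-sequences used here is an assumed illustrative input, since the table contents on this slide are not available, and the function name is hypothetical.

```python
def generate_candidates(large_k):
    """Join large k-sequences, then prune, to obtain candidate (k+1)-sequences.

    Join: two sequences that agree on their first k-1 ids are combined by
    appending the last id of the second to the first.
    Prune: drop a candidate if any subsequence obtained by deleting one id
    is not itself large.
    """
    large = set(large_k)
    joined = {p + (q[-1],) for p in large for q in large if p[:-1] == q[:-1]}
    pruned = {
        c for c in joined
        if all(c[:i] + c[i + 1:] in large for i in range(len(c)))
    }
    return joined, pruned


# Assumed input for illustration only.
L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined, pruned = generate_candidates(L3)
print(sorted(joined))
print(sorted(pruned))  # [(1, 2, 3, 4)]
```

The join here also pairs a sequence with itself, which is what allows repeated ids such as (1, 1) to appear as candidates at length 2; such candidates are removed by the prune step whenever a subsequence is not large.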


    Sequence Phase

    Customer sequences

    < {1 5} {2} {3} {4} >
    < {1} {3} {4} {3 5} >
    < {1} {2} {3} {4} >
    < {1} {3} {5} >
    < {4} {5} >

    Min_support = 40% (2 customer sequences)

    L1 (large 1-sequences), supports: 4, 2, 4, 4, 4

    L2 (large 2-sequences), supports: 2, 4, 3, 3, 2, 2, 3, 2, 2

    L3 (large 3-sequences), supports: 2, 2, 3, 2, 2

    L4 (large 4-sequences), supports: 2

    Maximal large sequences: […], […], […]
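A minimal sketch of the sequence phase followed by the maximal phase, run on the five customer sequences and the 40% threshold from this slide. All function names are illustrative, and the join-and-prune candidate generation follows the usual Apriori pattern rather than any code from the source.

```python
# Transformed customer sequences from the slide: lists of sets of litemset ids.
CUSTOMER_SEQUENCES = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
MIN_COUNT = 2  # 40% of 5 customer sequences


def count_support(candidate, customer_sequences):
    """Number of customer sequences containing the ids of `candidate` in order."""
    def contains(sequence):
        pos = 0
        for litemset_id in candidate:
            while pos < len(sequence) and litemset_id not in sequence[pos]:
                pos += 1
            if pos == len(sequence):
                return False
            pos += 1  # the next id must come from a later transaction
        return True
    return sum(contains(sequence) for sequence in customer_sequences)


def sequence_phase(customer_sequences, min_count):
    """Level-wise search; returns [L1, L2, ...] as sorted lists of id tuples."""
    ids = sorted({i for seq in customer_sequences for t in seq for i in t})
    level = [(i,) for i in ids if count_support((i,), customer_sequences) >= min_count]
    levels = []
    while level:
        levels.append(level)
        large = set(level)
        joined = {p + (q[-1],) for p in level for q in level if p[:-1] == q[:-1]}
        candidates = [c for c in joined
                      if all(c[:i] + c[i + 1:] in large for i in range(len(c)))]
        level = sorted(c for c in candidates
                       if count_support(c, customer_sequences) >= min_count)
    return levels


def is_subsequence(short, long):
    it = iter(long)
    return all(x in it for x in short)


def maximal_phase(levels):
    """Keep only large sequences not contained in another large sequence."""
    all_large = [s for level in levels for s in level]
    return sorted(s for s in all_large
                  if not any(s != t and is_subsequence(s, t) for t in all_large))


levels = sequence_phase(CUSTOMER_SEQUENCES, MIN_COUNT)
print([len(level) for level in levels])  # [5, 9, 5, 1]
print(maximal_phase(levels))  # [(1, 2, 3, 4), (1, 3, 5), (4, 5)]
```

The sizes of the four levels agree with the number of support values listed for L1 to L4 on the slide.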


    Challenges on Sequential Pattern Mining

    A huge number of possible sequential patterns are hidden in databases

    A mining algorithm should

    find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold

    be highly efficient and scalable, involving only a small number of database scans

    be able to incorporate various kinds of user-specific constraints


    Studies on Sequential Pattern Mining

    Concept introduction and an initial Apriori-like algorithm

    R. Agrawal & R. Srikant. Mining sequential patterns. ICDE'95

    GSP: an Apriori-based, influential mining method (developed at IBM Almaden)

    R. Srikant & R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96

    From sequential patterns to episodes (Apriori-like + constraints)

    H. Mannila, H. Toivonen & A.I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1997

    Mining sequential patterns with constraints

    M.N. Garofalakis, R. Rastogi & K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999


    A Basic Property of Sequential Patterns: Apriori

    A basic property: Apriori (Agrawal & Srikant '94)

    If a sequence S is not frequent, then none of the super-sequences of S is frequent

    E.g., if […] is infrequent, then so are […] and […]

    Given support threshold min_sup = 2

    [Table: Seq. ID, Sequence; rows for Seq. IDs 10, 20, 30, 40, 50]


    GSP: A Generalized Sequential Pattern Mining Algorithm

    GSP (Generalized Sequential Pattern) mining algorithm, proposed by Srikant and Agrawal, EDBT'96

    Outline of the method

    Initially, every item in the DB is a candidate of length 1

    For each level (i.e., sequences of length k) do

    scan the database to collect the support count for each candidate sequence

    generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori

    Repeat until no frequent sequence or no candidate can be found

    Major strength: candidate pruning by Apriori


    Finding Length-1 Sequential Patterns

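    Examine GSP using an example

    Initial candidates: all singleton sequences (8 candidates)

    Scan database once, count support for candidates

    min_sup = 2

    [Table: Seq. ID, Sequence; rows for Seq. IDs 10, 20, 30, 40, 50]

    [Table: Cand, Sup; supports of the 8 candidates in listed order: 3, 5, 4, 3, 3, 2, 1, 1]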


    Generating Length-2 Candidates

    51 length-2 candidates

    Without the Apriori property, 8*8 + 8*7/2 = 92 candidates

    Apriori prunes 44.57% of the candidates
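The counts on this slide can be checked with a short calculation. The sketch assumes the eight length-1 support values listed on the previous slide and min_sup = 2. It counts n*n ordered pairs of single items plus n*(n-1)/2 unordered pairs that share one element, following the formula given above; the function name is illustrative.

```python
def length2_candidate_count(n_items):
    """n*n candidates with two single-item elements, plus n*(n-1)/2
    candidates made of one two-item element."""
    return n_items * n_items + n_items * (n_items - 1) // 2


supports = [3, 5, 4, 3, 3, 2, 1, 1]  # length-1 supports from the previous slide
min_sup = 2

n_all = len(supports)
n_frequent = sum(s >= min_sup for s in supports)

without_apriori = length2_candidate_count(n_all)
with_apriori = length2_candidate_count(n_frequent)
pruned_fraction = (without_apriori - with_apriori) / without_apriori

print(n_frequent)                 # 6
print(without_apriori)            # 92
print(with_apriori)               # 51
print(f"{pruned_fraction:.2%}")   # 44.57%
```

Only 6 of the 8 items meet the threshold, so candidate generation starts from 6 items instead of 8, which accounts for the 41 candidates that are never generated.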