WindMine: Fast and Effective Mining of Web-click Sequences

SDM 2011

Yasushi Sakurai (NTT), Lei Li (Carnegie Mellon Univ.), Yasuko Matsubara (Kyoto Univ.), Christos Faloutsos (Carnegie Mellon Univ.)



Introduction

Web-click sequence applications for web masters and web-site owners:
- Capacity planning
- Intrusion detection
- Advertisement design

Goal:
- Find meaningful patterns for web-click data (e.g., the lunch-break trend, huge spikes, anomalies)
- Find periodicity (daily and/or weekly, etc.)
- Determine suitable window sizes automatically

Introduction

Example: access counts from a business news site

[Figure: original web-click sequence]

Problem definition

Web-click sequences of m URLs: $(X_1, \dots, X_m)$

Web-click sequence X of duration n: $X = (x_1, \dots, x_t, \dots, x_n)$

Local Component Analysis: given m sequences of duration n, $(X_1, \dots, X_m)$,
- Find patterns, the main components of the sequences
- Find the ‘best window’ size w for the analysis

Final challenge: a scalable algorithm for the local component analysis

Background

Independent component analysis (ICA)

- PCA vs. ICA

[Figure: principal components (1PC, 2PC) compared with independent components (1IC, 2IC)]

Why not ‘PCA’?

Example of component analysis

[Figure: two source sequences and their mixture]

[Figure: components extracted by PCA and by ICA]

ICA recognizes the components successfully and separately.
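The PCA vs. ICA contrast can be tried on a toy mixture. The following is a minimal sketch in Python with NumPy, not the authors' code. The slides do not say which ICA algorithm is used, so the symmetric FastICA variant with a tanh nonlinearity is an assumption, as are the two synthetic sources, the mixing matrix, and all helper names (`whiten`, `sym_decorrelate`, `fast_ica`, `pca_scores`, `match_quality`).

```python
import numpy as np


def whiten(X):
    """Center each row of X (signals x time) and decorrelate to unit variance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ Xc


def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W so that the rows of W become orthonormal."""
    eigval, eigvec = np.linalg.eigh(W @ W.T)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ W


def fast_ica(X, n_iter=200, seed=0):
    """Symmetric FastICA with a tanh nonlinearity (assumed ICA variant)."""
    Z = whiten(X)
    k, n = Z.shape
    W = sym_decorrelate(np.random.default_rng(seed).standard_normal((k, k)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w, for every row w of W
        W = sym_decorrelate(G @ Z.T / n - np.diag((1 - G ** 2).mean(axis=1)) @ W)
    return W @ Z  # rows are the estimated independent components


def pca_scores(X):
    """Project the centered data onto its principal axes (rows = PC time courses)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U.T @ Xc


def match_quality(S, Y):
    """For each true source, the best absolute correlation with any row of Y."""
    k = S.shape[0]
    corr = np.abs(np.corrcoef(np.vstack([S, Y]))[:k, k:])
    return corr.max(axis=1)


t = np.linspace(0, 8, 2000)
S = np.vstack([np.sin(2 * t), np.sign(np.sin(3 * t))])  # two synthetic sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])                  # made-up mixing matrix
X = A @ S                                               # observed mixtures

print("ICA match:", match_quality(S, fast_ica(X)))
print("PCA match:", match_quality(S, pca_scores(X)))
```

A match value close to 1 means a recovered component lines up with one of the original sources. On a mixture like this one, the ICA rows would be expected to match the sources more closely than the PCA rows, although the exact values depend on the signals chosen.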

Main idea (1)

Multi-scale local component analysis
- Divide a sequence into subsequences of length w
- Compute the local components from the window matrix

[Figure: a sequence a b c d e f g h along the time axis, cut into the subsequences (a b), (c d), (e f), (g h) that form the rows of the window matrix]
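The window-matrix construction described above can be sketched in a few lines of Python with NumPy. This is an illustration under assumptions: the windows are taken to be disjoint, an incomplete trailing window is dropped, and the function name `window_matrix` is hypothetical.

```python
import numpy as np


def window_matrix(sequences, w):
    """Stack the disjoint length-w subsequences of every sequence as rows."""
    rows = []
    for x in sequences:
        x = np.asarray(x, dtype=float)
        n_windows = len(x) // w  # assumption: drop an incomplete trailing window
        rows.append(x[: n_windows * w].reshape(n_windows, w))
    return np.vstack(rows)


# One sequence of duration 8 cut with w = 2 gives a 4 x 2 window matrix.
print(window_matrix([np.arange(8)], w=2))
```

Each row of the result is one subsequence, so m sequences of duration n yield roughly m * n / w rows of length w. Repeating the construction for several values of w gives the multi-scale view.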

Main idea (2)

Best window size selection

Q: How can we estimate a ‘good window size’ automatically when we have multiple sequences?

Proposed criterion: CEM (Component Entropy Maximization)
- Estimate the optimal window size w for the sequence set
- Compute the entropy of the weight values of the mixing matrix A
- ‘Popular’ (widely used) components show high CEM scores

Main idea (2)

CEM criterion:

- CEM score of the j-th component for the window size w:

  $$C_{w,j} = -\frac{1}{\sqrt{w}} \sum_{i} p_{i,j} \log p_{i,j}$$

- Probability for the j-th component (size of the j-th component’s contribution to each subsequence):

  $$p_{i,j} = \frac{|a'_{i,j}|}{\sum_{i} |a'_{i,j}|}$$

- Normalized weight values for each subsequence:

  $$a'_{i,j} = \frac{a_{i,j}}{\sqrt{\sum_{j} a_{i,j}^{2}}}$$

- Mixing matrix:

  $$A_w = [a_{i,j}] \quad (i = 1, \dots, M;\ j = 1, \dots, k)$$

k: # of components, M: # of subsequences
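As a rough illustration of the CEM computation, the sketch below scores each column of a mixing matrix in Python with NumPy. It assumes the score is the entropy of the column-normalized absolute weights, taken after normalizing each row to unit length and scaled by 1/sqrt(w). The function name `cem_scores` and the toy matrices are hypothetical, and every row of A is assumed to be nonzero.

```python
import numpy as np


def cem_scores(A, w):
    """CEM score of every component (column) of an M x k mixing matrix A."""
    A = np.asarray(A, dtype=float)
    # Normalize the weights of each subsequence (row) to unit Euclidean length.
    A_norm = np.abs(A) / np.sqrt((A ** 2).sum(axis=1, keepdims=True))
    # Contribution of component j to subsequence i, as a distribution over i.
    P = A_norm / A_norm.sum(axis=0, keepdims=True)
    # Entropy per column, using the convention 0 * log 0 = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    return -plogp.sum(axis=0) / np.sqrt(w)


# Component 0 is used by all four subsequences, component 1 by only one.
A = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
print(cem_scores(A, w=4))
```

In this toy matrix the widely used component receives the higher score and the component used by a single subsequence scores zero, which matches the intuition that ‘popular’ components show high CEM scores.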

WindMine-part

Efficient solution

Q: How do we efficiently extract the best local components from large sequence sets?

Hierarchical partitioning approach: WindMine-part
- Partition the original window matrix into sub-matrices
- Extract local components from each of the sub-matrices
- Reuse the local components for the component analysis on the higher level

WindMine-part

[Figure: window matrix, sub-matrices, and local components at Level 1 and Level 2]
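The hierarchical partitioning idea can be outlined as follows in Python with NumPy. This is a sketch of the partition / extract / reuse loop only, not the authors' WindMine-part implementation. To stay self-contained it uses the top-k right singular vectors as a stand-in for the ICA step, and the names `top_components`, `hierarchical_components`, and `group_size` are hypothetical.

```python
import numpy as np


def top_components(X, k):
    """Stand-in extractor (NOT ICA): top-k right singular vectors of X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]


def hierarchical_components(window_matrix, k, group_size, extract=top_components):
    """Extract k components by repeatedly partitioning rows into sub-matrices."""
    if k >= group_size:
        raise ValueError("k must be smaller than group_size so each level shrinks")
    level = np.asarray(window_matrix, dtype=float)
    while level.shape[0] > group_size:
        # Partition the current matrix into sub-matrices of at most group_size rows.
        parts = [level[i:i + group_size] for i in range(0, level.shape[0], group_size)]
        # Reuse the local components of every part as the rows of the next level.
        level = np.vstack([extract(p, k) for p in parts])
    return extract(level, k)


# Toy window matrix: 64 subsequences of length 8 built from two hidden patterns.
rng = np.random.default_rng(0)
patterns = rng.standard_normal((2, 8))
X = rng.standard_normal((64, 2)) @ patterns
components = hierarchical_components(X, k=2, group_size=8)
print(components.shape)
```

Because each sub-matrix is processed independently of the others, the per-part extraction at a given level could be run in parallel. Any other extractor with the same signature, such as an ICA routine, can be passed through the `extract` argument.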

Experimental Results

Experiments with real datasets:
- Ondemand TV, WebClick, Automobile, Temperature, Sunspots

Evaluation:
- Accuracy for pattern discovery
- Accuracy for the best window size
- Computation time

Pattern discovery

Ondemand TV: access count of users

[Figure: original sequence and extracted components]
- Weekly pattern
- Daily pattern
- Anomaly spikes
- PCA: failed

Pattern discovery

WebClick: Q & A site
- Weekly pattern
- Low activity during sleeping time
- Dip at dinner time
- Increase from morning to night, reaching a peak

Pattern discovery

WebClick: job-seeking site
- High activity on weekdays (daily access decreases as the weekend approaches)
- Workers arrive at their office
- Job seeking during a short break
- Large spike during the lunch break

Pattern discovery

WebClick: other websites
- Educational site for kids (they visit after school, at 3pm)
- Website for baby nursery (the main users will be the parents, rather than the babies!)
- High activity 8am-11pm on weekdays (business purposes)

Pattern discovery

WebClick: other websites
- The users visit three times a day (early morning, noon, early evening)
- The users rarely visit late in the evening (which is indeed good for their health!)
- Access count is still high at night, 0am-1am (a healthy diet should include an earlier bedtime!)
- Access count increases after meal times

Pattern discovery

Generalization of WindMine


Choice of best window size

CEM score for various window sizes

Computation time

Wall clock time vs. # of subsequences
- Up to 70 times faster

Computation time

Wall clock time vs. duration

Conclusions

Scalable pattern extraction and anomaly detection in large web-click sequences

1. Scalable, parallelizable method for breaking sequences into a few fundamental ingredients

2. Scales linearly with the sequence duration and near-linearly with the number of sequences