fast mining and forecasting of complex time-stamped eventsyasuko/publications/... · money.com...

1
Fast Mining and Forecasting of Complex Time-Stamped Events Paper ID: 183 Yasuko Matsubara Kyoto University [email protected] Yasushi Sakurai NTT Communication Science Labs [email protected] Christos Faloutsos Carnegie Mellon University [email protected] Tomoharu Iwata NTT Communication Science Labs [email protected] Masatoshi Yoshikawa Kyoto University [email protected] Motivation Complex time-stamped events Web click events: {time, url, user, access device, http referrer,} Challenges We cannot see any trends!! Problem definition Given: A sequence of complex time-stamped events Goal: (1) Find major topics (2) Forecast future events Timestamp URL User Device 2012-08-01-12:00 CNN.com Smith iphone 2012-08-02-15:00 YouTube.com Brown iphone 2012-08-02-19:00 CNET.com Smith mac Proposed method: TriMine Main idea (1): M-way analysis (a) Complex time-stamped tensor (b) Decompose to 3 topic vectors (c) Inference (Gibbs sampling) Infer k topics for each element according to probability p: Main idea (2): Multi-scale analysis TriMine is linear on the input size N, i.e., (N: counts of events in X, n: duration of X) u v n Object Actor Time Topic1 (business) Topic2 (news) Topic3 (media) Object Actor Time TriMine-F – forecasts Use topic matrices O, A, C [step1] Forecast time-topic matrix C’ Forecast C’ using multiple levels of matrices [step2] Generate events using three matrices O, A, C’ (a) Count estimation – X’ = O * A * C’ (b) Complex event generation – sampling approach Experiments web click data / On demand TV data TriMine-plot (red points: each URL/user/TV program) Benefit of multi-scale forecasting Forecasting accuracy Temporal perplexity RMSE between original & forecasted events Scalability Conclusions We addressed the problem of complex event mining TriMine has following properties: • Effective: It finds meaningful patterns in real datasets • Accurate: It enables forecasting • Scalable: It is linear on the database size Code: http://www.kecl.ntt.co.jp/csl/sirg/people/yasuko/software.html Original access counts (100 users, 1 week) M th order tensor (M=3) cube {object x actor x time} : event count of object i, actor j at time t x ijt e.g., (‘Smith’, ‘ CNN.com ’, ‘Aug1 12pm’; 21 times) Object/ URL Actor/ user Time u v n x ijt e.g., Topic 1 :business topic (Higher value: related topic) Object /URL Money.com CNN.com Smith Johnson Actor/ user Time Mon-Fri Sat-Sun Object matrix Actor matrix Time matrix u v n k k k O A C time actors O A n u k u k k χ = χ (0) u χ (1) C (1) k u n / l 1 C (2) k n / l 2 χ (2) l 0 l 1 l 2 v v v v n / l 0 C (0) Share O & A for all levels Hourly pattern Daily pattern Weekly pattern Compute C for each level Future events Tensor X O A C ˆ C O A ˆ C time 1 t 2 t c r (0) c r (1) c r (2) r,t (0) ˆ c Forecasted value [Step 1] [Step 2] 10 2 10 3 10 4 10 2 10 4 10 6 Duration Wall clock time (sec.) AR TriMineF (naive) TriMineF 7x 74x 10 20 30 20 40 60 80 Time (per 2 hours) RMSE TriMineF PLiF T2 1 2 3 10 3 10 4 Time (per 1 day) RMSE TriMineF PLiF T2 AR O( N log n ) O( N ) 0 100 200 300 0 0.005 0.01 0.015 0.02 0.025 Time (per 1 hour) Value "Business" "Media" "Drive" Weekend Weekend "Business" "Drive" "Media" car&bike travel area, wordof mouth news portal business news pet restaurant money, finance URL matrix User matrix Time matrix "Blog" "Communication" "Food" blog route map, diet community pet rss web mail mobile mail restaurant 0 100 200 300 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 Time (per 1 hour) Value "Blog" "Food" "Communication" 11pm 4pm 0 20 40 60 80 0 0.005 0.01 0.015 Time (per 4 hours) Value "Sports" "Romance" "Action" Saturday Sunday Saturday "Romance" "Sports" "Action" LOST Taxi 1 & 2 Oceans 11 & 12 Australian Open tennis French Open tennis final (women) Desperate Housewives French Open tennis final (men) TV matrix Time matrix 0 50 100 150 200 250 300 10 x 10 3 Time (per 2hours) Forecasting TriMineF 0 50 100 150 200 250 300 10 x 10 3 Time (per 2hours) Forecasting TriMineF (single) 0 5 10 x 10 3 Time (per 2hours) Value 2 topics by TriMine Original C matrix Single scale: failed Multi-scale: capture patterns Any trends? Can we forecast future events? 0 50 100 150 200 250 300 0 100 200 300 Time Count of clicks URL: blog site Bursty Noisy URL matrix Time matrix Similar trends very clear groups daily + weekly periodicities 180 200 220 240 260 280 300 10 5 Time (per 2 hours) Perplexity TriMineF T2 170 180 190 200 210 220 230 240 10 6 Time (per 4 hours) Perplexity TriMineF T2 T2: [Hong et al.’11], PLiF [Li et al.’10]

Upload: others

Post on 24-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Fast Mining and Forecasting of Complex Time-Stamped Eventsyasuko/PUBLICATIONS/... · Money.com CNN.com Smith Johnson Actor/ user Time Mon-Fri Sat-Sun Object matrix Actor ... "Food"

Fast Mining and Forecasting of Complex Time-Stamped Events

Paper ID: 183

Yasuko Matsubara

Kyoto University [email protected]

Yasushi Sakurai

NTT Communication Science Labs [email protected]

Christos Faloutsos

Carnegie Mellon University [email protected]

Tomoharu Iwata

NTT Communication Science Labs [email protected]

Masatoshi Yoshikawa

Kyoto University [email protected]

Motivation Complex time-stamped events Web click events: {time, url, user, access device, http referrer,…}

Challenges

L We cannot see any trends!!

Problem definition Given: A sequence of complex time-stamped events Goal: (1) Find major topics (2) Forecast future events

Timestamp URL User Device 2012-08-01-12:00 CNN.com Smith iphone 2012-08-02-15:00 YouTube.com Brown iphone 2012-08-02-19:00 CNET.com Smith mac

… … … …

Proposed method: TriMine Main idea (1): M-way analysis (a) Complex time-stamped tensor (b) Decompose to 3 topic vectors (c) Inference (Gibbs sampling) Infer k topics for each element according to probability p: Main idea (2): Multi-scale analysis TriMine is linear on the input size N, i.e., (N: counts of events in X, n: duration of X)

u

vn

Object

Actor

Time

Topic1 (business)

Topic2 (news)

Topic3 (media)

Object

Actor

Time

TriMine-F – forecasts Use topic matrices O, A, C [step1] Forecast time-topic matrix C’ Forecast C’ using multiple levels of matrices [step2] Generate events using three matrices O, A, C’

(a)  Count estimation – X’ = O * A * C’ (b)  Complex event generation – sampling approach

Experiments web click data / On demand TV data TriMine-plot (red points: each URL/user/TV program) Benefit of multi-scale forecasting Forecasting accuracy Temporal perplexity RMSE between original & forecasted events Scalability

Conclusions We addressed the problem of complex event mining TriMine has following properties:

• Effective: It finds meaningful patterns in real datasets • Accurate: It enables forecasting • Scalable: It is linear on the database size

Code: http://www.kecl.ntt.co.jp/csl/sirg/people/yasuko/software.html

Original access counts (100 users, 1 week)

Mth order tensor (M=3) cube {object x actor x time} : event count of object i, actor j at time t xijt

e.g., (‘Smith’, ‘CNN.com’, ‘Aug1 12pm’; 21 times)

Object/URL

Actor/user

Time

u

v

n

xijt

e.g., Topic 1 :business topic (Higher value: related topic)

Object/URL

Money.com

CNN.com

Smith Johnson

Actor/user

Time

Mon-Fri Sat-Sun

Object matrix

Actor matrix

Time matrix

u

v n

k

k

k€

O

A

C

time actors

O

A

n€

u

k

u

k€

k

χ = χ(0)

u

χ(1)

C(1)

k

u

n / l1

C(2)

k

n / l2

χ(2)

l0

l1

l2

v

v

v

v

n / l0

C(0)

Share O & A for all levels

Hourly pattern

Daily pattern

Weekly pattern

Compute C for each level

Future events

Tensor X €

O

AC C

O

AC

time 1−t2−t

c r(0)

c r(1)

c r(2)

r,t(0)c

Forecasted value

[Step 1] [Step 2]

102 103 104102

104

106

Duration

Wal

l clo

ck ti

me

(sec

.)

ARTriMine−F (naive)TriMine−F

7x

74x

10 20 3020

40

60

80

Time (per 2 hours)

RM

SE

Marginal

TriMine−FPLiFT2

1 2 3

103

104

Time (per 1 day)

RM

SE

Marginal

TriMine−FPLiFT2AR

O(N logn)→O(N )

0 100 200 3000

0.005

0.01

0.015

0.02

0.025

Time (per 1 hour)

Valu

e

"Business""Media""Drive"

Weekend

Weekend

"Business"

"Drive"

"Media"

car&biketravelarea,

word−of mouth

newsportal

business news

pet

restaurant

money,finance

URL matrix User matrix Time matrix

"Blog"

"Communication"

"Food" blog

route map, diet

community

pet

rss

web mailmobile mail

restaurant0 100 200 300

0

0.002

0.004

0.006

0.008

0.01

0.012

0.014

Time (per 1 hour)

Valu

e

"Blog""Food""Communication"

11pm4pm

0 20 40 60 800

0.005

0.01

0.015

Time (per 4 hours)

Valu

e

"Sports""Romance""Action"

Saturday

SundaySaturday

"Romance"

"Sports"

"Action"

LOST

Taxi 1 & 2Oceans 11 & 12

Australian Open tennis

French Open tennis final (women)

Desperate Housewives

French Open tennis final (men)

TV matrix Time matrix

0

5

10 x 10−3

Time (per 2hours)

Valu

e

2 topics by TriMine

0 50 100 150 200 250 3000

5

10 x 10−3

Time (per 2hours)

Valu

e

Forecasting − TriMine−F (single)

0 50 100 150 200 250 3000

5

10 x 10−3

Time (per 2hours)

Valu

e

Forecasting − TriMine−F

0

5

10 x 10−3

Time (per 2hours)

Valu

e

2 topics by TriMine

0 50 100 150 200 250 3000

5

10 x 10−3

Time (per 2hours)

Valu

e

Forecasting − TriMine−F (single)

0 50 100 150 200 250 3000

5

10 x 10−3

Time (per 2hours)

Valu

e

Forecasting − TriMine−F

0

5

10 x 10−3

Time (per 2hours)

Valu

e

2 topics by TriMine

0 50 100 150 200 250 3000

5

10 x 10−3

Time (per 2hours)

Valu

e

Forecasting − TriMine−F (single)

0 50 100 150 200 250 3000

5

10 x 10−3

Time (per 2hours)

Valu

e

Forecasting − TriMine−F

Original C matrix Single scale: failed Multi-scale: capture patterns

Any trends?

Can we forecast future events?

0 50 100 150 200 250 3000

100

200

300

Time

Cou

nt o

f clic

ks URL: blog site

Bursty L

Noisy L

URL matrix Time matrix

Similar trends very clear groups

daily + weekly periodicities

180 200 220 240 260 280 300

105

Time (per 2 hours)

Perp

lexi

ty

TriMine−FT2

170 180 190 200 210 220 230 240

106

Time (per 4 hours)

Perp

lexi

ty

TriMine−FT2

T2: [Hong et al.’11], PLiF [Li et al.’10]