On the Infrastructure for Network Data Mining: Concepts and Experiences
S. (Muthu) Muthukrishnan, Google (adorisms, mysliceofpizza)
Source: …muthu/minenet06-pub.pdf



Talk Overview
• Data Analysis in Different Communities
  – Algorithms, Databases and Networking
• Infrastructure View of Data Analysis
  – Example 1: Cellphone Call Traffic
  – Example 2: IP Packet Traffic Streams
  – Example 3: Web Traffic
• Perspectives

Earliest Known Data Analysis
• Natural and political observations made upon the Bills of Mortality, by John Graunt, 1662.
  – Use figures of births and deaths in London collected by parishes.
  – Few beggars starve to death, polygamy is irrational, Head is too big for the body, …
  – How many M/F? How many married/single? What years fruitful/mortal and in what intervals? …
  – Knowledge of these is necessary to ease government, balance Parties and factions both in church and state. But is it necessary for others besides the king and his ministers?
• First life insurance tables, by Edmund Halley, 1693.

Data Analysis in Different Communities
• Networking:
  – Mining anomalies using traffic feature distributions. A. Lakhina, M. Crovella, C. Diot. SIGCOMM 05.
• Algorithms:
  – Streaming and sublinear approximation of entropy and information distances. S. Guha, A. McGregor, S. Venkatasubramanian. SODA 2006.
• Databases:
  – Holistic UDAFs at streaming speeds. G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, D. Srivastava. SIGMOD 2004.
[Slide annotations: entropy. User defined aggregate function (UDAF): Entropy.]
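The three citations above meet at one quantity: the entropy of a traffic feature distribution, computed as an aggregate over a stream. As a rough illustration, here is a minimal sketch of an entropy aggregate with an update/result interface. The class and function names are hypothetical, and this version keeps exact per-value counts, so it is not the approximate streaming method of the cited papers.

```python
import math
from collections import Counter


class EntropyAggregate:
    """Toy user-defined aggregate: feed values one at a time, read entropy at the end."""

    def __init__(self):
        self.counts = Counter()  # exact per-value counts (an assumption; real UDAFs approximate)
        self.total = 0

    def update(self, value):
        self.counts[value] += 1
        self.total += 1

    def result(self):
        """Empirical Shannon entropy in bits; 0.0 for an empty input."""
        if self.total == 0:
            return 0.0
        return -sum((c / self.total) * math.log2(c / self.total)
                    for c in self.counts.values())


def entropy_of(values):
    agg = EntropyAggregate()
    for v in values:
        agg.update(v)
    return agg.result()
```

A feature such as destination port that takes a single value gives entropy 0, while one spread evenly over four values gives 2 bits; a sudden change in this number is the kind of signal the networking citation mines for anomalies.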

Infrastructure View, Example 1: Cellphone Calls Analysis

Detailed view of CDRs
[Figure: call path from originating to terminating side; element labels: IP, GSM MSC, VoIP, VoIP, TDMA MSC]
• Each network element writes records (Call Detail Records) of what it sees in different formats.
• CDRs are several hundred bytes long.
• It is hard to even join the records.

Analyzing CDRs: Data
• Data:
  – TDMA: Ericsson, Lucent, Nortel etc. MSCs; GSM and UMTS: MSCs; VoIP: Gateways; GPRS: SGSNs, GGSNs, and MMSCs; SMS logs.
  – 10's of different data formats.
  – Side tables: LERG. Handset info. Trunk info.
  – About 1 Tbyte/month for large carriers.
[Figure labels: switch, Data collection point]

Analyzing CDRs: Analyses
• Analyses:
  – 100's of reports a month.
• Example Analyses:
  – Dropped calls per handset type
  – 2A or 2B connections.
  – Fraudulent transit calls
  – Cell adjacency graph
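To make the first example analysis concrete, the sketch below counts dropped calls per handset type by joining call records against a handset side table. The two-field record layout and the dictionary side table are assumptions for illustration only; real CDRs are several hundred bytes long and come in many formats.

```python
from collections import Counter


def dropped_calls_per_handset_type(cdrs, handset_type_of):
    """cdrs: iterable of (handset_id, dropped) pairs (hypothetical layout).
    handset_type_of: side table mapping handset_id -> handset type.
    Returns {handset type: (dropped calls, total calls)}."""
    totals, drops = Counter(), Counter()
    for handset_id, dropped in cdrs:
        # Handsets missing from the side table are grouped under "unknown".
        model = handset_type_of.get(handset_id, "unknown")
        totals[model] += 1
        if dropped:
            drops[model] += 1
    return {model: (drops[model], totals[model]) for model in totals}
```

Reporting both the dropped count and the total keeps the result usable as a rate, which matters when handset types differ widely in popularity.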

Example CDR Analysis
Distant Tower Problem
[Figure labels: D1, D2, D3]
Distant Tower Problem (Partial) Solution: Find a dropped call using cell tower C immediately preceding a successful call using cell tower D significantly far away from C.
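The partial solution can be sketched as a scan over each caller's calls in time order. Everything concrete here is an assumption made for illustration: the `Call` record fields, planar tower coordinates in kilometres, the 20 km distance cutoff, and the 120-second gap used to approximate "immediately preceding".

```python
import math
from collections import namedtuple

# Hypothetical, simplified CDR; real records carry many more fields.
Call = namedtuple("Call", "caller start tower dropped")


def distance_km(a, b):
    """Planar distance between two (x_km, y_km) tower positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def distant_tower_events(calls, tower_xy, min_km=20.0, max_gap_s=120):
    """Return (caller, tower C, tower D) for a dropped call on C followed
    directly by a successful call on a tower D at least min_km away."""
    by_caller = {}
    for c in sorted(calls, key=lambda c: (c.caller, c.start)):
        by_caller.setdefault(c.caller, []).append(c)
    events = []
    for seq in by_caller.values():
        for prev, nxt in zip(seq, seq[1:]):
            if not prev.dropped or nxt.dropped:
                continue  # need dropped call, then successful call
            if nxt.start - prev.start > max_gap_s:
                continue  # too far apart in time to count as a retry
            if distance_km(tower_xy[prev.tower], tower_xy[nxt.tower]) >= min_km:
                events.append((prev.caller, prev.tower, nxt.tower))
    return events
```

The hard part in practice is what the slides stress elsewhere: getting the records from different network elements joined and ordered per caller in the first place.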

Analyzing CDRs: Infrastructure
• Challenge is not the size of the data.
  – understanding the data, translating a business problem down to CDR analysis.
  – Turnaround time: Days or weeks.
  – Small team of analysts responsible.
• Infrastructure:
  – Large disks.
  – Multiple CPU machines.
  – Scripting languages, standard file system.


Infrastructure View: IP Traffic Analysis

Analyzing IP Traffic (ISP View): Data
• SNMP, IP flows, packet header logs, packet contents.
• Routing tables, BGP updates, Fault alarms.
• OC48, 192, 768: x Tbytes/hour.
• 6 Mpkts/sec to 96 Mpkts/sec.
Packet traffic seen as streams, other logs may be stored in databases.
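The volume and packet-rate figures can be sanity-checked with a little arithmetic. The sketch below uses the nominal SONET rate OC-n = n × 51.84 Mbit/s and ignores framing overhead. Pairing 6 Mpkts/sec with OC-48 and 96 Mpkts/sec with OC-768 is an assumption, not something the slide states.

```python
# Nominal SONET line rate: OC-n carries n * 51.84 Mbit/s (framing overhead ignored).
OC_BASE_BPS = 51.84e6


def tbytes_per_hour(n):
    """Bytes per hour at full OC-n line rate, in units of 10**12 bytes."""
    return n * OC_BASE_BPS / 8 * 3600 / 1e12


def bytes_per_packet(n, pkts_per_sec):
    """Average packet size implied by an OC-n link carrying pkts_per_sec at line rate."""
    return n * OC_BASE_BPS / 8 / pkts_per_sec


def ns_per_packet(pkts_per_sec):
    """Time budget per packet, in nanoseconds."""
    return 1e9 / pkts_per_sec
```

Under these assumptions OC-48 comes to roughly 1.1 Tbytes/hour and OC-768 to roughly 17.9 Tbytes/hour, both packet rates correspond to packets of about 52 bytes, and at 96 Mpkts/sec an analysis has on the order of 10 ns per packet.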

Analyzing IP Traffic: Analyses
• Real time, router speed analysis.
• Example:
  – Reporting, SLA mediation.
  – Anomaly/Attack detection.
  – Lawful intercept
  – Monitoring failures.
  – Traffic classification.

Latin American Carrier: Real-Time Traffic Insight (NARUS)
[Figure: graph and table showing protocols (and associated session counts and bytes) running on customer's network]

Gigascope Architecture
• Gigascope is an SQL-based operational IP traffic analysis tool at AT&T.
• Has two level arch. Low-level queries perform initial fast selection and aggregation on high speed stream.
• Complex aggregation on high level, at monitor server
• Depending on the capabilities of the NIC, can push operators and low-level queries into it.
[Figure labels: NIC, Ring Buffer, Low, Low, Low, High, High, App, NIC]

GSQL Query Splitting

Query:
    Select tb, SrcIP, count(*)
    From UDP
    Group By time/60 as tb, SrcIP

High level:
    Select tb, SrcIP, sum(Cnt)
    From Subq
    Group By tb, SrcIP

Subq (Low level):
    Select tb, SrcIP, count(*) as Cnt
    From UDP
    Group By time/60 as tb, SrcIP
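The split works because count is decomposable: partial counts from the low level can be summed at the high level without changing the answer. A small simulation of that idea follows. The packet layout `(time, src_ip)` and the `chunk` parameter, which stands in for a low-level table that is flushed often, are assumptions for illustration.

```python
from collections import Counter


def low_level(packets):
    """Subq: per-(tb, SrcIP) partial counts over one batch of (time, src_ip) packets."""
    partial = Counter()
    for time, src_ip in packets:
        partial[(time // 60, src_ip)] += 1
    return [(tb, src, cnt) for (tb, src), cnt in partial.items()]


def high_level(partial_rows):
    """High-level query: sum the partial counts per (tb, SrcIP)."""
    total = Counter()
    for tb, src, cnt in partial_rows:
        total[(tb, src)] += cnt
    return dict(total)


def unsplit(packets):
    """The original single query, evaluated in one pass."""
    return dict(Counter((t // 60, s) for t, s in packets))


def split_query(packets, chunk=3):
    """Run low_level on small batches, then merge the partial rows at the high level."""
    rows = []
    for i in range(0, len(packets), chunk):
        rows.extend(low_level(packets[i:i + chunk]))
    return high_level(rows)
```

Whatever the batch size, `split_query` and `unsplit` agree; an aggregate that cannot be merged this way (a median, say) would need a different split.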

Gigascope, Status
Currently supports:
• GSQL, UDAFs.
  – stream aggregate queries.
• Sampling.
  – Operator can be specialized to most stream sampling methods.
  – Most complex queries can be executed with sampling to provide semantically correct output.
• Sketches.
  – Heavy hitters, quantiles.
• Regex matcher for flows.
  – Match contents across packets in presence of duplicates, out-of-order or overlapping packets.
• Heartbeats.
• Prelim distributed implementation.
• Deployed
Ted Johnson, Irina Rozenbaum, Vlad Shkapenyuk, Oliver Spatscheck.

Data Streams: Algorithms and Applications, S. Muthukrishnan, Foundations and Trends in Theoretical Computer Science, 2005.

Sampling Operator
• Many sampling algorithms known for IP traffic streams.
  – Uniform random sampling
  – Priority sampling
  – Value sampling
  – Distinct, inverse, min-wise sampling.
• Observation:
  – Most sampling algorithms have an overall common execution structure.
• Our approach:
  – Define and optimize a single sampling operator.
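As a reference point for the first item in the list, uniform random sampling over a stream of unknown length is usually done with reservoir sampling (Algorithm R). The sketch below is that textbook algorithm, not Gigascope code; the random generator is passed in so runs can be reproduced.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform integer in [0, n)
            if j < k:
                sample[j] = item         # item n is kept with probability k/n
    return sample
```

Note the shape: admit a tuple, and once the sample is full, decide what to evict. That is the common execution structure the observation above refers to.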

Stream Sampling Operator
• Can be specialized for wide variety of stream sampling algorithms.
• Encourages experimentation and development of new sampling algorithms.

    Select <select expression list>
    From <stream>
    Where <predicate>
    Group by <group-by variables definition list>
    Cleaning when <predicate>
    Cleaning by <predicate>
    [Having <predicate>]

– Cleaning when: condition for triggering a cleaning phase.
– Cleaning by: condition for sample reduction.
T. Johnson, S. Muthukrishnan and I. Rozenbaum, SIGMOD 2002.
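One way to read the operator template is as a loop with pluggable callbacks: filter, group, and occasionally shrink the group table. The sketch below follows that reading and specializes it to a level-based distinct sample. All names are hypothetical. In particular, the choice that `cleaning_by` returns True for groups to drop is an assumption, since the slide does not say whether a true predicate keeps or removes a group.

```python
import zlib


def run_sampling_operator(stream, where, group_key, cleaning_when, cleaning_by, having=None):
    """Generic loop: admit tuples, count per group, clean the group table when asked."""
    groups = {}
    for tup in stream:
        if not where(tup):
            continue
        key = group_key(tup)
        groups[key] = groups.get(key, 0) + 1
        if cleaning_when(groups):
            # Assumption: cleaning_by(key, count) == True means "drop this group".
            for g in [g for g in groups if cleaning_by(g, groups[g])]:
                del groups[g]
    if having is not None:
        groups = {g: c for g, c in groups.items() if having(g, c)}
    return groups


class DistinctSampler:
    """Keep only keys whose hash is divisible by 2**level; raise level when the table is too big."""

    def __init__(self, max_groups):
        self.max_groups = max_groups
        self.level = 0

    def _h(self, key):
        return zlib.crc32(str(key).encode())  # deterministic hash across runs

    def where(self, tup):
        return self._h(tup) % (1 << self.level) == 0

    def cleaning_when(self, groups):
        if len(groups) > self.max_groups:
            self.level += 1
            return True
        return False

    def cleaning_by(self, key, count):
        return self._h(key) % (1 << self.level) != 0
```

Because a key that passes the final level also passes every earlier level, each surviving key was never dropped and its count is exact. Swapping in different callbacks changes the sampling method without touching the loop.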

Subset-Sum Sampling Query
What are the sizes (in bytes) of all flows seen over a time interval of 60 seconds?

    Select tb, srcIP, dstIP, UMAX(sum(len), ss_threshold())
    From SOURCE
    Where ssample(len, 1000) = TRUE
    Group By time/60 as tb, srcIP, dstIP
    Cleaning When ss_need_to_clean() = TRUE
    Cleaning By ss_do_clean(sum(len))
    Having ss_final_clean(sum(len))

• This query returns a sample of 1000 elements for every 60 second interval
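The function names in the query (`ssample`, `ss_threshold`, `UMAX`) suggest threshold-style subset-sum sampling, where an item of size x is kept with probability min(1, x/z) and reported as max(x, z). The sketch below shows only that core step with a fixed threshold z. This is an assumption about what the query computes, and it omits the adaptive part that adjusts z to end with a sample of about 1000 elements.

```python
import random


def threshold_sample(sizes, z, rng):
    """Keep size x with probability min(1, x/z); report it as max(x, z). Requires z > 0."""
    sample = []
    for x in sizes:
        if x >= z or rng.random() < x / z:
            sample.append(max(x, z))
    return sample


def estimate_total(sample):
    """Sum of reported values; unbiased for the true total, since (x/z) * z = x when x < z."""
    return sum(sample)
```

Large flows are always kept at their true size, while small ones are kept rarely but each stands in for z bytes, so sums over any subset of flows stay unbiased.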

Heavy Hitters Query
List the number of bytes and the number of packets for destination IP addresses which account for at least 1% of the total traffic.

    Select tb, dstIP, sum(len), count(*)
    From SOURCE
    Group By time/60 as tb, dstIP
    Cleaning When local_count(100) = TRUE
    Cleaning By count(*) < current_bucket() - first(current_bucket())
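The cleaning clauses resemble lossy counting (Manku and Motwani): clean every 100 tuples, which matches a bucket width of 1/0.01 for a 1% threshold, and drop groups whose count is small relative to the buckets elapsed since they first appeared. The sketch below is the textbook algorithm under that reading. Its eviction test differs slightly from the inequality on the slide, and it counts packets only, not bytes.

```python
def lossy_counting(stream, width=100):
    """Lossy counting with bucket width `width` (= 1/epsilon).
    Every key whose true frequency exceeds len(stream)/width is retained."""
    table = {}  # key -> [count, bucket id when the key was (re)inserted]
    n = 0
    for key in stream:
        n += 1
        bucket = (n - 1) // width + 1          # current bucket id, starting at 1
        if key in table:
            table[key][0] += 1
        else:
            table[key] = [1, bucket]
        if n % width == 0:                     # bucket boundary: cleaning phase
            # Evict when count + (first - 1) <= current bucket.
            stale = [k for k, (cnt, first) in table.items()
                     if cnt + first - 1 <= bucket]
            for k in stale:
                del table[k]
    return {k: cnt for k, (cnt, first) in table.items()}
```

Reported counts can undercount the truth by at most len(stream)/width, so a final pass that keeps keys above the 1% threshold minus that slack gives no false negatives.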

Sampling Operator
War story:
• During SYN flooding and DDOS attacks, Cisco Netflow generator is overwhelmed and produces useless output.
• Packet sampling does not provide accurate flow samples.
• By combining flow sampling and flow generation logic using the sampling operator, Gigascope produces meaningful, valuable flow samples even at peak rates of flows such as in attacks.

Sprint's CMON Software Architecture

CMON Sample System Configuration

Example Application
• Heavy hitter q-gram in packet contents.
• Design sampling+sketching method to skip over vast number of packets.
• Orders of magnitude improvement over prior work in networking, skipping fraction of packets!
S. Bhattacharyya, A. Maderia, S. Muthukrishnan and T. Ye. Sprint ATL Technical Report, 2006.

IP Traffic Analysis: Infrastructure
• Challenge:
  – Size, rate of data. Analyses: Simple.
  – Turnaround time: Minutes, days.
  – Moderate sized team of analysts.
• Special infrastructure:
  – Optical splitters, NIC
  – Multiple CPU machines
  – Data stream management systems (DSMSs)


Infrastructure View: Web Traffic Analysis

Google
• Search: Web, Image, Video, News, Usenet Groups, Blogs
• Calculator Co.: Convert units, Calculate.
• Advertising: AdWords, AdSense, Partner sites
• Earth, Map, Finance, Trends, Writely, Personalize, Froogle, …

Example: Sponsored Search
• Advertisers want to place ads in response to user queries.
• Have to figure out what queries are interesting, how much to bid on each query, what is the budget, …
• Problem: Given a set of queries and a potential bid, output the distribution of
  – Number of clicks expected
  – Expected position on the ad list
  – Expected price.
• Input: queries, ads shown, bids, price, etc. Terabytes of data on 1000's of commodity machines.

Google Sponsored Search Auction

Traffic Estimation for Sponsored Search

MapReduce [Dean, Ghemawat. OSDI04]
• Parallel programming infrastructure at Google.
• Users specify map and reduce functions.
  – Input: set of records.
  – Each record is mapped to a set of (key, value) pairs.
  – All pairs with same key are considered together and a reduce function is applied to the values.
• System automatically takes care of
  – Parallelizing on 100's++ commodity machines.
  – Fault tolerance
  – Scheduling, load balance, inter-machine communication, etc.
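The programming model described above fits in a few lines when run on one machine. The sketch below is a toy, single-process stand-in with hypothetical names; it shows only the map, group-by-key, reduce contract and none of the parallelism, fault tolerance, or scheduling that the real system provides.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """map_fn(record) yields (key, value) pairs; reduce_fn(key, values) returns one result."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)        # all pairs with the same key end up together
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_terms(queries):
    """Example job: how often each term occurs across a set of query strings."""
    return map_reduce(
        queries,
        map_fn=lambda q: [(word, 1) for word in q.split()],
        reduce_fn=lambda key, values: sum(values),
    )
```

For instance, `count_terms(["cheap flights", "cheap hotels"])` groups two `("cheap", 1)` pairs under one key and reduces them to a count of 2.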

Traffic Estimation Using MapReduce (made-up exercise)
• Logs consist of (q, b1, p1, b2, p2, .., c).
  – q is the query.
  – bi is the bid of advertiser in ith place and pi the price.
  – c is the ad clicked on.
• Map to (q, bi, pi, i, 1 if c=i) for all i; q is the key.
• Reduce will have all records with same q. Calculate
  – number of clicks,
  – average position,
  – average cost per click, etc.
• Run this periodically and index for each q. Lookup when advertiser queries.
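The made-up exercise can be sketched directly. Several details here are assumptions: a log record is taken to be `(q, [(b1, p1), (b2, p2), ...], c)` with c the 1-based position of the clicked ad or None, the averages are taken over clicked ads only, and the bid is carried along but not used, whereas the real problem would condition on a potential bid.

```python
from collections import defaultdict


def map_log(record):
    """Emit (q, (bid, price, position, clicked flag)) for every ad slot; q is the key."""
    q, slots, c = record
    for i, (bid, price) in enumerate(slots, start=1):
        yield q, (bid, price, i, 1 if c == i else 0)


def reduce_query(values):
    """Summarize all slot records of one query."""
    clicked = [(price, pos) for bid, price, pos, hit in values if hit]
    clicks = len(clicked)
    if clicks == 0:
        return {"clicks": 0, "avg_position": None, "avg_cpc": None}
    return {
        "clicks": clicks,
        "avg_position": sum(pos for _, pos in clicked) / clicks,
        "avg_cpc": sum(price for price, _ in clicked) / clicks,
    }


def estimate_traffic(logs):
    """Single-process stand-in for the map, group-by-q, reduce pipeline."""
    groups = defaultdict(list)
    for record in logs:
        for key, value in map_log(record):
            groups[key].append(value)
    return {q: reduce_query(values) for q, values in groups.items()}
```

The returned dictionary plays the role of the per-query index: it would be rebuilt periodically and consulted when an advertiser asks about a query.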

Web Traffic Analysis: Infrastructure
• Terabytes of data on 1000's of commodity machines.
• 100's of engineers running many analyses simultaneously any day.
• Enormously successful at Google for machine learning, graph computing to index generation.
• Other infrastructure: BigTable, Stubby,
MapReduce was used for 29k jobs, dealt with 3k TB, 300+ programs, 79k machine days, in Aug 04, [OSDI04]


Summary

Cellphone traffic (cellco)
• Data volume: TB/month
• Turnaround: weekly/monthly
• Use: Reports.
• People: Small team of analysts.
• Research: No publications
• Infrastructure: File system, script language, parallel CPUs.

IP Traffic (ISP)
• Data volume: TB/hour
• Turnaround: min/hours/days
• Use: Detect attacks, appl.
• People: Small/Moderate # of researchers
• Research: Alg/DB since 96. Mainly publ.
• Infrastructure: Optical splitters, NICs, stream mgmt engines.

Web Traffic (Search Engine)
• Data volume: PB/month
• Turnaround: hours/days
• Use: Nearly all services.
• People: Large number of engineers/analysts
• Research: Mainly systems.
• Infrastructure: 1000's of m/c's, GFS, MapReduce, Bigtable,

Challenges
• Data cleaning
  – Build general infrastructure for data cleaning.
  – Ex: General system for SNMP data cleaning.
• Making IP stream analyses system distributed.
  – Short term: Distributed Gigascope/CMON.
  – Long term: Planet MapReduce for IP traffic analysis.
• Privacy in web data analysis.
  – Story: Where is my spouse?
  – Theory of approximate, private computing in theory and databases research.

Acknowledgements
• Thanks to Anja Feldmann for guiding me through this talk.
• Jon Feldman for help with MapReduce for sponsored search.
• Nathan Hamilton for 5+ years of collaboration on cellular data analysis.
• Ted, Oliver, and Divesh at AT&T for several years of joint work on Gigascope.
• Supratik Bhattacharyya and Tao Ye at Sprint for joint work on CMON.
• Thanks to students and colleagues at Rutgers MassDAL.
• Thanks to colleagues at Google, Sprint, AT&T, Narus.