bonus lecture : anomaly detection in large...

20
Faloutsos 15826 1 CMU SCS 15-826: Multimedia Databases and Data Mining Bonus Lecture : Anomaly detection in large graphs Christos Faloutsos CMU SCS Repeat: BONUS LECTURE I.e., you may skip it, and still get 100/100 in the course 15-826 Copyright: C. Faloutsos (2016) 2 CMU SCS Roadmap Introduction – Motivation • Unsupervised – static graphs – Time-evolving graphs Supervised: BP • Conclusions 15-826 Copyright: C. Faloutsos (2016) 3 CMU SCS Fraud Detection: Graph Analysis Problem [www.buyfollowz.org] [buymorelikes.com] 15-826 Copyright: C. Faloutsos (2016) 4

Upload: trinhliem

Post on 01-Sep-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

Faloutsos 15826

1

CMU SCS

15-826: Multimedia Databases and Data Mining

Bonus Lecture : Anomaly detection in large graphs

Christos Faloutsos

CMU SCS

Repeat: BONUS LECTURE I.e., you may skip it, and still get 100/100 in the course

15-826 Copyright: C. Faloutsos (2016) 2

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs – Time-evolving graphs

•  Supervised: BP •  Conclusions

15-826 Copyright: C. Faloutsos (2016) 3

CMU SCS Fraud Detection: Graph Analysis

Problem

[www.buyfollowz.org]

[buymorelikes.com]

15-826 Copyright: C. Faloutsos (2016) 4

Faloutsos 15826

2

CMU SCS Fraud Detection: Graph Analysis

Problem

[buycheaplikes.com]

[reviewsteria.com]

15-826 Copyright: C. Faloutsos (2016) 5

CMU SCS

Other applications: •  Cyber-security (DDoS attacks – distributed

denial of service) •  Abnormal individuals / social groups

–  dysfunctional departments in a company – Onset of depression/hijacking wrt a FB user – …

15-826 Copyright: C. Faloutsos (2016) 6

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs •  Eigenspokes, and extensions •  fbox

– Time-evolving graphs

•  Supervised: BP •  Conclusions

15-826 Copyright: C. Faloutsos (2016) 7

CMU SCS

EigenSpokes B. Aditya Prakash, Mukund Seshadri, Ashwin

Sridharan, Sridhar Machiraju and Christos Faloutsos: EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs, PAKDD 2010, Hyderabad, India, 21-24 June 2010.

15-826 Copyright: C. Faloutsos (2016) 8

Faloutsos 15826

3

CMU SCS

EigenSpokes • Eigenvectors of adjacency matrix

!  equivalent to singular vectors (symmetric, undirected graph)

A = U�UT

�u1 �ui

N

N

details

15-826 Copyright: C. Faloutsos (2016) 9

CMU SCS

EigenSpokes •  EE plot: •  Scatter plot of

scores of u1 vs u2 •  One would expect

– Many points @ origin

– A few scattered ~randomly u1

u2 90o

15-826 Copyright: C. Faloutsos (2016) 10

CMU SCS

Bipartite Communities!

magnified bipartite community

patents from same inventor(s)

`cut-and-paste’ bibliography!

15-826 Copyright: C. Faloutsos (2016) 11

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs •  Eigenspokes, and extensions •  fbox

– Time-evolving graphs

•  Supervised: BP •  Conclusions

15-826 Copyright: C. Faloutsos (2016) 12

Faloutsos 15826

4

CMU SCS

Inferring Strange Behavior from Connectivity Pattern in Social Networks

Meng Jiang, Peng Cui, Shiqiang Yang (Tsinghua, Beijing)

Alex Beutel, Christos Faloutsos (CMU)

CMU SCS

Lockstep and Spectral Subspace Plot

•  Case #0: No lockstep behavior in random power law graph of 1M nodes, 3M edges

•  Random “Scatter”

Adjacency Matrix Spectral Subspace Plot

15-826 Copyright: C. Faloutsos (2016) 14

CMU SCS

Lockstep and Spectral Subspace Plot

•  Case #1: non-overlapping lockstep •  “Blocks” “Rays”

Adjacency Matrix Spectral Subspace Plot

15-826 Copyright: C. Faloutsos (2016) 15

CMU SCS

Lockstep and Spectral Subspace Plot

•  Case #2: non-overlapping lockstep •  “Blocks; low density” Elongation

Adjacency Matrix Spectral Subspace Plot

15-826 Copyright: C. Faloutsos (2016) 16

Faloutsos 15826

5

CMU SCS

Lockstep and Spectral Subspace Plot

•  Case #3: non-overlapping lockstep

Adjacency Matrix Spectral Subspace Plot

??

15-826 Copyright: C. Faloutsos (2016) 17

CMU SCS

Lockstep and Spectral Subspace Plot

•  Case #3: non-overlapping lockstep •  “Camouflage” (or “Fame”) Tilting

“Rays” Adjacency Matrix Spectral Subspace Plot

15-826 Copyright: C. Faloutsos (2016) 18

CMU SCS

Lockstep and Spectral Subspace Plot

•  Case #4: ? lockstep •  “?” “Pearls”

Adjacency Matrix Spectral Subspace Plot

?

15-826 Copyright: C. Faloutsos (2016) 19

CMU SCS

Lockstep and Spectral Subspace Plot

•  Case #4: overlapping lockstep •  “Staircase” “Pearls”

Adjacency Matrix Spectral Subspace Plot

15-826 Copyright: C. Faloutsos (2016) 20

Faloutsos 15826

6

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs •  Eigenspokes, and extensions •  fbox

– Time-evolving graphs

•  Supervised: BP •  Conclusions

15-826 Copyright: C. Faloutsos (2016) 21

CMU SCS

Issue: Flying below the radar •  Plotting the top $k$ eigenspokes, will miss

small groups •  Q: Can we spot them, too? •  A: ‘fBox’

15-826 Copyright: C. Faloutsos (2016) 22

CMU SCS Problem: Social Network Link Fraud

Target: find “stealthy” attackers missed by other algorithms

Clique

Bipartite core

41.7M nodes 1.5B edges

15-826 Copyright: C. Faloutsos (2016) 23

CMU SCS Problem: Social Network Link Fraud

Neil Shah, Alex Beutel, Brian Gallagher and Christos Faloutsos. Spotting Suspicious Link Behavior with fBox: An Adversarial Perspective. ICDM 2014, Shenzhen, China.

Target: find “stealthy” attackers missed by other algorithms

Takeaway: use reconstruction error between true/latent representation!

15-826 Copyright: C. Faloutsos (2016) 24

Faloutsos 15826

7

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs •  Eigenspokes, and extensions •  fBox •  Spot strange, individual nodes: oddBall

– Time-evolving graphs •  Supervised: BP •  Conclusions 15-826 Copyright: C. Faloutsos (2016) 25

CMU SCS

OddBall: Spotting Anomalies in Weighted Graphs

Leman Akoglu, Mary McGlohon, Christos Faloutsos

Carnegie Mellon University School of Computer Science

PAKDD 2010, Hyderabad, India

CMU SCS

Main idea For each node, •  extract ‘ego-net’ (=1-step-away neighbors) •  Extract features (#edges, total weight, etc

etc) •  Compare with the rest of the population

15-826 Copyright: C. Faloutsos (2016) 27

CMU SCS What is an egonet?

ego egonet

15-826 Copyright: C. Faloutsos (2016) 28

Faloutsos 15826

8

CMU SCS

Selected Features !  Ni: number of neighbors (degree) of ego i !  Ei: number of edges in egonet i !  Wi: total weight of egonet i !  λw,i: principal eigenvalue of the weighted

adjacency matrix of egonet I

15-826 Copyright: C. Faloutsos (2016) 29

CMU SCS Near-Clique/Star

15-826 Copyright: C. Faloutsos (2016) 30

CMU SCS Near-Clique/Star

15-826 Copyright: C. Faloutsos (2016) 31

CMU SCS Near-Clique/Star

15-826 Copyright: C. Faloutsos (2016) 32

Faloutsos 15826

9

CMU SCS

Andrew Lewis (director)

Near-Clique/Star

15-826 Copyright: C. Faloutsos (2016) 33

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs – Time-evolving graphs

•  ‘copyCatch’ in FaceBook •  Patterns of IAT •  [tensors]

•  Supervised: BP •  Conclusions 15-826 Copyright: C. Faloutsos (2016) 34

CMU SCS

Fraud •  Given

– Who ‘likes’ what page, and when

•  Find – Suspicious users and suspicious

products

CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks, Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow, Christos Faloutsos WWW, 2013.

15-826 Copyright: C. Faloutsos (2016) 35

CMU SCS

Fraud •  Given

– Who ‘likes’ what page, and when

•  Find – Suspicious users and suspicious

products

CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks, Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow, Christos Faloutsos WWW, 2013.

Users Pages

A

B

C

D

E

1

2

3

4

40 Z

Likes

15-826 Copyright: C. Faloutsos (2016) 36

Faloutsos 15826

10

CMU SCS

Our intuition ▪  Lockstep behavior: Same Likes, same time

Graph Patterns and Lockstep Behavior

Pages

Users

0

10

20

30

40

50

60

70

80

90

Tim

e

Users Pages

A

B

C

D

E

1

2

3

4

40 Z

Likes 15-826 Copyright: C. Faloutsos (2016) 37

CMU SCS

Our intuition ▪  Lockstep behavior: Same Likes, same time

Graph Patterns and Lockstep Behavior

Users Pages

A

B

C

D

E

1

2

3

4

40 Z

Pages

Users

0

10

20

30

40

50

60

70

80

90

Tim

e

B CD A E

Reordered Pages

4

3 2 1

Reord

ere

d U

sers

0

10

20

30

40

50

60

70

80

90

Tim

e

Likes 15-826 Copyright: C. Faloutsos (2016) 38

CMU SCS

Our intuition ▪  Lockstep behavior: Same Likes, same time

Graph Patterns and Lockstep Behavior

Users Pages

A

B

C

D

E

1

2

3

4

40 Z

Pages

Users

0

10

20

30

40

50

60

70

80

90

Tim

e

B CD A E

Reordered Pages

4

3 2 1

Reord

ere

d U

sers

0

10

20

30

40

50

60

70

80

90

Tim

e

Suspicious Lockstep Behavior Likes

15-826 Copyright: C. Faloutsos (2016) 39

CMU SCS

MapReduce Overview ▪  Use Hadoop to search for

many clusters in parallel:

1.  Start with randomly seed

2.  Update set of Pages and center Like times for each cluster

3.  Repeat until convergence

Users Pages

A

B

C

D

E

1

2

3

4

40 Z

Likes

15-826 Copyright: C. Faloutsos (2016) 40

Faloutsos 15826

11

CMU SCS

Deployment at Facebook ▪  CopyCatch runs regularly (along with many other

security mechanisms, and a large Site Integrity team)

08/25 09/08 09/22 10/06 10/20 11/03 11/17 12/01

Num

ber

of u

sers

cau

ght

Date of CopyCatch run

3 months of CopyCatch @ Facebook

#users caught

time 15-826 Copyright: C. Faloutsos (2016) 41

CMU SCS

Deployment at Facebook

23%

58%

5%9%

5%

Fake AccountsMalicious Browser ExtensionsOS MalwareCredential StealingSocial Engineering

Manually labeled 22 randomly selected clusters from February 2013

Most clusters (77%) come from real but compromised users

Fake acct

15-826 Copyright: C. Faloutsos (2016) 42

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs – Time-evolving graphs

•  ‘copyCatch’ in FaceBook •  Patterns of IAT •  [tensors]

•  Supervised: BP •  Conclusions 15-826 Copyright: C. Faloutsos (2016) 43

CMU SCS

RSC: Mining and Modeling Temporal Activity in Social Media

Alceu F. Costa* Yuto Yamaguchi Agma J. M. Traina

Caetano Traina Jr. Christos Faloutsos

Universidade de São Paulo

KDD 2015 – Sydney, Australia

*[email protected]

Faloutsos 15826

12

CMU SCS

Reddit Dataset Time-stamp from comments 21,198 users 20 Million time-stamps

Twitter Dataset Time-stamp from tweets 6,790 users 16 Million time-stamps

Pattern Mining: Datasets

For each user we have: Sequence of postings time-stamps: T = (t1, t2, t3, …) Inter-arrival times (IAT) of postings: (∆1, ∆2, ∆3, …)

t1 t2 t3 t4

∆1 ∆2 ∆3

time 15-826 Copyright: C. Faloutsos (2016) 45

CMU SCS

Guess the IAT pdf •  Gaussian (10’ +/- a bit)? •  Power law? Slope? •  Log-logistic? •  Periodic (spike for 1day, 1 week)?

IAT

PDF

15-826 Copyright: C. Faloutsos (2016) 46

CMU SCS

Human? Robots?

log

linear

15-826 Copyright: C. Faloutsos (2016) 47

CMU SCS

Human? Robots?

log

linear 2’ 3h 1day

15-826 Copyright: C. Faloutsos (2016) 48

Faloutsos 15826

13

CMU SCS

Pattern Mining Pattern 1: Distribution of IAT is heavy-tailed

Users can be inactive for long periods of time before making new postings IAT Complementary Cumulative Distribution Function (CCDF)

(log-log axis)

105 106 10710−5

100

∆, IAT (seconds)

CC

DF,

P(x

≥ ∆

)

1 day 10 days105 106 107

10−5

100

∆, IAT (seconds)

CC

DF,

P(x

≥ ∆

)1 day10 days

Reddit Users Twitter Users 15-826 Copyright: C. Faloutsos (2016) 49

CMU SCS

Pattern Mining Pattern 2: Bimodal IAT distribution

Users have highly active sections and resting periods

Log-binned histogram of postings IAT

102 104 1060

0.005

0.01

0.015

∆, IAT (seconds)

PDF

1st Mode (1min) 2nd Mode (3h)

15-826 Copyright: C. Faloutsos (2016) 50

CMU SCS

102 104 1060

0.005

0.01

∆, IAT (seconds)

PDF

Pattern Mining

Pattern 3: Periodic spikes in the IAT distribution

Caused by daily sleeping intervals

1050

0.005

0.01

0.015

∆, IAT (seconds)

PDF

7h 12h 24h 48h 72h

15-826 Copyright: C. Faloutsos (2016) 51

CMU SCS

Experiments: Can RSC-Spotter Detect Bots?

Methodology Datasets

Users were manually labeled as bot or humans

Training

Same size for train and test subsets (preserved class distribution)

Baseline features:

1,963 Humans 37 Bots Reddit

1353 Humans 64 Bots Twitter

1.  IAT Histogram Log-binned IAT histogram

2. Entropy Entropy of the IAT histogram

3. Week Hist. # of postings for day of week

4. All features Combination of 1, 2 and 3

15-826 Copyright: C. Faloutsos (2016) 52

Faloutsos 15826

14

CMU SCS

Experiments: Can RSC-Spotter Detect Bots?

Precision vs. Sensitivity Curves Good performance: curve close to the top

0 0.2 0.4 0.6 0.8 10

0.20.40.60.8

1

Sensitivity (Recall)

Prec

isio

n

0 0.2 0.4 0.6 0.8 10

0.20.40.60.8

1

Sensitivity (Recall)

Prec

isio

n

RSC−Spotter

IAT Hist.Entropy [6]Weekday Hist.

All Features

Precision > 94% Sensitivity >

70%

With strongly imbalanced

datasets

# humans >> # bots

Twitter Dataset

15-826 Copyright: C. Faloutsos (2016) 53

CMU SCS

Experiments: Can RSC-Spotter Detect Bots?

Precision vs. Sensitivity Curves Good performance: curve close to the top

0 0.2 0.4 0.6 0.8 10

0.20.40.60.8

1

Sensitivity (Recall)

Prec

isio

n

RSC−Spotter

IAT Hist.Entropy [6]Weekday Hist.

All Features

Precision > 96% Sensitivity >

47%

With strongly imbalanced

datasets

# humans >> # bots

Reddit Dataset

15-826 Copyright: C. Faloutsos (2016) 54

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs – Time-evolving graphs

•  ‘copyCatch’ in FaceBook •  Patterns of IAT •  [tensors]

•  Supervised: BP •  Conclusions 15-826 Copyright: C. Faloutsos (2016) 55

CMU SCS Anomaly detection in time-

evolving graphs

•  Anomalous communities in phone call data: – European country, 4M clients, data over 2 weeks

~200 calls to EACH receiver on EACH day!

1 caller 5 receivers 4 days of activity

=

15-826 Copyright: C. Faloutsos (2016) 56

Faloutsos 15826

15

CMU SCS Anomaly detection in time-

evolving graphs

•  Anomalous communities in phone call data: – European country, 4M clients, data over 2 weeks

~200 calls to EACH receiver on EACH day!

1 caller 5 receivers 4 days of activity

=

15-826 Copyright: C. Faloutsos (2016) 57

CMU SCS Anomaly detection in time-

evolving graphs

•  Anomalous communities in phone call data: – European country, 4M clients, data over 2 weeks

~200 calls to EACH receiver on EACH day!

=

Miguel Araujo, Spiros Papadimitriou, Stephan Günnemann, Christos Faloutsos, Prithwish Basu, Ananthram Swami, Evangelos Papalexakis, Danai Koutra. Com2: Fast Automatic Discovery of Temporal (Comet) Communities. PAKDD 2014, Tainan, Taiwan.

15-826 Copyright: C. Faloutsos (2016) 58

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs – Time-evolving graphs

•  Supervised: BP – Ebay fraud (‘NetProbe’) – Malware (Polonium)

•  Conclusions

15-826 Copyright: C. Faloutsos (2016) 59

CMU SCS

E-bay Fraud detection

w/ Polo Chau & Shashank Pandit, CMU [www’07]

15-826 Copyright: C. Faloutsos (2016) 60

Faloutsos 15826

16

CMU SCS

E-bay Fraud detection

15-826 Copyright: C. Faloutsos (2016) 61

CMU SCS

E-bay Fraud detection

15-826 Copyright: C. Faloutsos (2016) 62

CMU SCS

E-bay Fraud detection - NetProbe

15-826 Copyright: C. Faloutsos (2016) 63

CMU SCS

Popular press

And less desirable attention: •  E-mail from ‘Belgium police’ (‘copy of

your code?’) 15-826 Copyright: C. Faloutsos (2016) 64

Faloutsos 15826

17

CMU SCS

Roadmap

•  Introduction – Motivation •  Part#1: Patterns in graphs

– Patterns – Anomaly / fraud detection

•  CopyCatch •  Spectral methods (‘fBox’) •  Belief Propagation; antivirus app

•  Part#2: time-evolving graphs; tensors •  Conclusions 15-826 Copyright: C. Faloutsos (2016) 65

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised

–  static graphs – Time-evolving graphs

•  Supervised: BP – Ebay fraud (‘NetProbe’) – Malware (Polonium)

•  Conclusions

15-826 Copyright: C. Faloutsos (2016) 66

CMU SCS

PoloChauMachineLearningDept

CareyNachenbergVicePresident&Fellow

JeffreyWilhelmPrincipalSo9wareEngineer

AdamWrightSo9wareEngineer

Prof.ChristosFaloutsosComputerScienceDept

Polonium:Tera-ScaleGraphMiningandInferenceforMalwareDetecCon

PATENTPENDING

SDM 2011, Mesa, Arizona

CMU SCS

Polonium:TheData60+terabytesofdataanonymouslycontributedbyparCcipantsofworldwideNortonCommunityWatchprogram

50+millionmachines900+millionexecutablefiles

Constructedamachine-filebiparCtegraph(0.2TB+)

1billionnodes(machinesandfiles)37billionedges

15-826 Copyright: C. Faloutsos (2016) 68

Faloutsos 15826

18

CMU SCS

Polonium:KeyIdeas•  UseBeliefPropagaContopropagatedomainknowledgeinmachine-filegraphtodetectmalware

•  Use“guilt-by-associaCon”(i.e.,homophily)–  E.g.,filesthatappearonmachineswithmanybadfilesaremorelikelytobebad

•  Scalability:handles37billion-edgegraph

15-826 Copyright: C. Faloutsos (2016) 69

CMU SCS

Polonium:One-InteracDonResults

84.9%TruePosiCveRate1%FalsePosiCveRate

True Positive Rate %ofmalware

correctlyidenCfied

Ideal

False Positive Rate %ofnon-malwarewronglylabeledasmalware15-826 Copyright: C. Faloutsos (2016) 70

CMU SCS

Roadmap

•  Introduction – Motivation •  Unsupervised •  Supervised: BP (Belief Propagation)

– Ebay fraud (‘NetProbe’) – Malware (Polonium) –  Intuition behind BP

•  Conclusions

15-826 Copyright: C. Faloutsos (2016) 71

CMU SCS

UnifyingGuilt-by-AssociationApproaches:TheoremsandFastAlgorithms

Danai Koutra U Kang

Hsing-Kuo Kenneth Pao

Tai-You Ke Duen Horng (Polo) Chau

Christos Faloutsos

ECML PKDD, 5-9 September 2011, Athens, Greece

Faloutsos 15826

19

CMU SCS Problem Definition:

GBA techniques

Given: Graph; & few labeled nodes Find: labels of rest (assuming network effects)

15-826 Copyright: C. Faloutsos (2016) 73

CMU SCS

Are they related? •  RWR (Random Walk with Restarts)

–  google’s pageRank (‘if my friends are important, I’m important, too’)

•  SSL (Semi-supervised learning) – minimize the differences among neighbors

•  BP (Belief propagation) –  send messages to neighbors, on what you

believe about them

15-826 Copyright: C. Faloutsos (2016) 74

CMU SCS

Are they related? •  RWR (Random Walk with Restarts)

–  google’s pageRank (‘if my friends are important, I’m important, too’)

•  SSL (Semi-supervised learning) – minimize the differences among neighbors

•  BP (Belief propagation) –  send messages to neighbors, on what you

believe about them

YES!

15-826 Copyright: C. Faloutsos (2016) 75

CMU SCS

Correspondence of Methods

Method Matrix Unknown known RWR [I – c AD-1] × x = (1-c)ySSL [I + a(D - A)] × x = y

FABP [I + a D - c’A] × bh = φh

0 1 0 1 0 1 0 1 0

? 0 1 1

1 1

1

d1 d2 d3 final

labels/ beliefs

prior labels/ beliefs

adjacency matrix

15-826 Copyright: C. Faloutsos (2016) 76

Faloutsos 15826

20

CMU SCS

Cast

Akoglu, Leman

Chau, Polo

Kang, U

Prakash, Aditya

Koutra, Danai

Beutel, Alex

Papalexakis, Vagelis

Shah, Neil

Lee, Jay Yoon

Araujo, Miguel

15-826 Copyright: C. Faloutsos (2016) 77

CMU SCS

CONCLUSION#1 – Big data •  Patterns Anomalies

•  Large datasets reveal patterns/outliers that are invisible otherwise

15-826 Copyright: C. Faloutsos (2016) 78

CMU SCS

Conclusions, cont’d •  Spectral/tensor methods: spot

lockstep behavior

•  OddBall: strange, single nodes

•  BP: good for the supervised setting (ie., some labels are given)

15-826 Copyright: C. Faloutsos (2016) 79