CLUSTERING NETWORKED DATA BASED ON LINK AND SIMILARITY IN ACTIVE LEARNING

Advisor: Sing Ling Lee
Student: Yi Ming Chang
Speaker: Yi Ming Chang


Post on 30-Dec-2015


OUTLINE

Introduction
    Active Learning
    Networked data
Related Work
    Newman’s Modularity
    Collective Classification (ICA)
    ALFNET
CLAL
Experimental Results
Conclusion

PASSIVE LEARNING

[Figure: the labeled instances (+/-) form the training data used to train a classifier; the classifier then classifies the unlabeled instances in the testing data. In this example, 5 testing instances are classified wrongly.]

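ACTIVE LEARNING

[Figure: the classifier is trained on the labeled nodes and classifies the unlabeled nodes; it then queries the labels of selected unlabeled nodes, which are added to the training data before retraining. Example: with a query batch number of 3, 2 testing nodes are classified wrongly.]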
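The query loop on this slide can be sketched as below. This is a minimal illustration, not the method used in the talk: the 1-D data, the `oracle` function, the midpoint classifier, and the "nearest to the boundary" query rule are all assumptions chosen to keep the example self-contained. Only the batch size of 3 comes from the slide.

```python
def train_threshold(labeled):
    """Fit a 1-D classifier: the midpoint between the two class means."""
    pos = [x for x, y in labeled if y == +1]
    neg = [x for x, y in labeled if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict(threshold, x):
    return +1 if x > threshold else -1


def active_learning(pool, oracle, seed, budget, batch_size=3):
    """Query up to `batch_size` labels per round until `budget` labels are used."""
    labeled = [(x, oracle(x)) for x in seed]
    unlabeled = [x for x in pool if x not in seed]
    while len(labeled) < budget and unlabeled:
        threshold = train_threshold(labeled)
        # Assumed query rule: points nearest the decision boundary go first.
        unlabeled.sort(key=lambda x: abs(x - threshold))
        take = min(batch_size, budget - len(labeled))
        batch, unlabeled = unlabeled[:take], unlabeled[take:]
        labeled += [(x, oracle(x)) for x in batch]
    return train_threshold(labeled), labeled


# Hypothetical data: points above 5.0 are positive.
pool = [0.5, 1.0, 1.5, 2.0, 4.0, 4.5, 5.5, 6.0, 8.0, 9.0]
oracle = lambda x: +1 if x > 5.0 else -1
threshold, labeled = active_learning(pool, oracle, seed=[0.5, 9.0], budget=8)
errors = sum(predict(threshold, x) != oracle(x) for x in pool)
print(threshold, errors)
```

The seed must contain at least one instance of each class, otherwise the midpoint classifier cannot be fitted.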

NETWORK DATA

[Figure: a network of linked nodes, some labeled (+/-) and some unlabeled; a classifier is trained on the labeled nodes and classifies the unlabeled nodes.]


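NEWMAN’S MODULARITY FOR CLUSTERING

$m = 5$ : number of edges
$A_{ij}$ : real edge between nodes $i$ and $j$
$k_i$ : degree of node $i$
$s_i \in \{1, -1\}$ : group of node $i$

$B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$

Terms involving node 1 in the 5-node example graph:

$B_{12} s_1 s_2$, with $B_{12} = 1 - \frac{2 \cdot 2}{10}$
$B_{13} s_1 s_3$, with $B_{13} = 0 - \frac{2 \cdot 2}{10}$
$B_{14} s_1 s_4$, with $B_{14} = 1 - \frac{2 \cdot 3}{10}$
$B_{15} s_1 s_5$, with $B_{15} = 0 - \frac{2 \cdot 1}{10}$

Each product $s_1 s_j$ is $1$ when nodes 1 and $j$ are in the same group and $-1$ otherwise.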

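NEWMAN’S MODULARITY FOR CLUSTERING

Example (nodes 1, 2, 3 with $k_1 = 5$, $k_2 = 2$, $k_3 = 3$ and $2m = 16$):

$B_{12} = 1 - \frac{5 \cdot 2}{16} = 0.375$
$B_{13} = 0 - \frac{5 \cdot 3}{16} = -0.9375$
$B_{21} = 1 - \frac{2 \cdot 5}{16} = 0.375$
$B_{23} = 1 - \frac{2 \cdot 3}{16} = 0.625$
$B_{31} = 0 - \frac{3 \cdot 5}{16} = -0.9375$
$B_{32} = 1 - \frac{3 \cdot 2}{16} = 0.625$

$B_{23} + B_{32} = 0.625 + 0.625 > B_{12} + B_{21} = 0.375 + 0.375$, so placing nodes 2 and 3 in the same group contributes more than placing nodes 1 and 2 in the same group.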
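The arithmetic of this example can be checked with a few lines of Python. The sketch below assumes the degrees 5, 2, 3 and the value 2m = 16 that appear in the slide's fractions, and that edges exist between nodes 1-2 and 2-3 but not 1-3. The names `modularity_entry`, `degree`, and `adjacent` are illustrative.

```python
def modularity_entry(a_ij, k_i, k_j, m):
    """B_ij = A_ij - k_i * k_j / (2m)."""
    return a_ij - k_i * k_j / (2.0 * m)


degree = {1: 5, 2: 2, 3: 3}   # k_i, read off the slide's fractions
adjacent = {(1, 2), (2, 3)}   # node pairs with A_ij = 1 in the example
m = 8                         # so that 2m = 16

B = {}
for i in degree:
    for j in degree:
        if i != j:
            a_ij = 1 if (i, j) in adjacent or (j, i) in adjacent else 0
            B[(i, j)] = modularity_entry(a_ij, degree[i], degree[j], m)

for pair in sorted(B):
    print(pair, B[pair])

# Compare the two candidate groupings from the slide.
print(B[(2, 3)] + B[(3, 2)] > B[(1, 2)] + B[(2, 1)])
```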

NEWMAN’S MODULARITY FOR CLUSTERING

Maximizing $u^T B u$

The eigenvector entries $(0.3,\ 0.1,\ -0.5)$ are mapped by sign to the groups $s = (1,\ 1,\ -1)$.
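A minimal sketch of this step: build B, find the eigenvector with the largest eigenvalue, and split the nodes by the sign of its entries. To stay dependency-free it uses power iteration on B plus a diagonal shift instead of a library eigen-solver; that choice, the function names, and the example graph (two triangles joined by one edge) are assumptions for illustration, not taken from the slides.

```python
def modularity_matrix(edges, n):
    """Return B with B[i][j] = A_ij - k_i * k_j / (2m) for an undirected graph."""
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0
    k = [sum(row) for row in A]
    two_m = sum(k)
    return [[A[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]


def leading_eigenvector(B, iterations=500):
    """Power iteration on B + shift*I; the shift makes every eigenvalue non-negative."""
    n = len(B)
    shift = max(sum(abs(x) for x in row) for row in B)
    u = [float(i + 1) for i in range(n)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        v = [sum(B[i][j] * u[j] for j in range(n)) + shift * u[i] for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        u = [x / norm for x in v]
    return u


def split_by_sign(u):
    """Map eigenvector entries to group labels 1 / -1."""
    return [1 if x >= 0 else -1 for x in u]


# Hypothetical graph: triangle {0, 1, 2} and triangle {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
s = split_by_sign(leading_eigenvector(modularity_matrix(edges, 6)))
print(s)
```

Which triangle receives the label 1 depends on the sign of the converged eigenvector, so only the grouping itself is meaningful.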

COLLECTIVE CLASSIFICATION (ICA)

Iterative Classification Algorithm (ICA)

1. Train the Content-Only (CO) learner on the labeled nodes.
2. Compute the neighbor feature of each node using the CO predictions.
3. Train the Collective (CC) learner.
4. Iteration 1, 2, 3, ...: compute the neighbor feature using the CC predictions and classify again, until the result is stable or a threshold number of iterations has elapsed.

Input of each learner:
CO : feature, e.g. 1 0 0 1 0 … 1
CC : feature plus neighbor feature, e.g. 1 0 0 1 0 … 1 followed by 3/5 2/5
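The neighbor feature on this slide (3/5, 2/5) is the proportion of a node's neighbors in each class. A minimal sketch of how it can be computed and appended to the local feature follows; the function names, the node names, and the concrete local feature vector are hypothetical, and the full train/iterate loop is left out.

```python
from collections import Counter


def neighbor_feature(node, graph, labels, classes):
    """Fraction of a node's labeled or predicted neighbors that fall in each class."""
    counts = Counter(labels[n] for n in graph[node] if n in labels)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(classes)
    return [counts[c] / total for c in classes]


def collective_input(node, local, graph, labels, classes):
    """Input of the CC learner: local feature followed by the neighbor feature."""
    return local[node] + neighbor_feature(node, graph, labels, classes)


# Hypothetical node "v" with five neighbors: three labeled "+", two labeled "-".
graph = {"v": ["a", "b", "c", "d", "e"]}
labels = {"a": "+", "b": "+", "c": "+", "d": "-", "e": "-"}
local = {"v": [1, 0, 0, 1, 0, 1]}
x = collective_input("v", local, graph, labels, ["+", "-"])
print(x)
```

In ICA the `labels` mapping would hold true labels for labeled nodes and the current predictions for the others, and it would be refreshed after every iteration.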

CC PROBLEM

How to set the threshold?

Inferred neighbor feature (proportions for + and -) of nodes 1, 2, 3 in each iteration:

Iteration    Node 1       Node 2       Node 3
1            2/5 3/5      3/5 2/5      0/1 1/1
2            3/5 2/5      2/5 3/5      1/1 0/1
3            2/5 3/5      4/5 1/5      0/1 1/1
4            3/5 2/5      2/5 3/5      1/1 0/1
5            2/5 3/5      4/5 1/5      0/1 1/1

Iterations 2 and 4 give the same values, as do iterations 3 and 5, so the predicted labels depend on the iteration at which the algorithm stops.

ALFNET

1. Cluster the data into at least k clusters.

2. Pick k clusters based on size and initialize the Content-Only (CO) classifier (SVM).

[Figure: k of the clusters are selected and used to initialize the CO classifier.]

ALFNET

3. While (labeled nodes < budget):

3.1 Re-train the CO and CC classifiers.

3.2 Pick k clusters based on a score. [Score formula not preserved.]

3.3 Pick an item from each cluster based on a score. [Score formula not preserved.]

[Figure: the picked items join the training set, which is used to train CO and CC.]

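ALFNET

[Figure: CO, CC, and the Main Label each give a predicted category among Class A, Class B, Class C, and Class D. The score sums the entropy of the proportion of the three predictions that fall in each predicted category.]

All three predict different categories: entropy(1/3) + entropy(1/3) + entropy(1/3) = 0.3662 * 3
Two of the three predict the same category: entropy(2/3) + entropy(1/3) = 0.2703 + 0.3662
All three predict the same category: entropy(3/3) = 0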
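The three entropy values on this slide can be reproduced with a short function. The sketch assumes entropy(p) = -p ln p with the natural logarithm, which matches the figures 0.3662 and 0.2703; the name `disagreement` is illustrative.

```python
import math
from collections import Counter


def disagreement(predictions):
    """Sum of -p * ln(p), where p is the share of predictions for each category."""
    n = len(predictions)
    return sum(-(c / n) * math.log(c / n) for c in Counter(predictions).values())


# Predictions of CO, CC, and the main label for three hypothetical nodes.
all_differ = disagreement(["A", "B", "C"])
two_agree = disagreement(["A", "A", "B"])
all_agree = disagreement(["A", "A", "A"])
print(round(all_differ, 4), round(two_agree, 4), round(all_agree, 4))
```

A higher value means the three predictions disagree more.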


MODULARITY AND SIMILARITY

EX: node feature vectors

Node 1 : 1 1 0 0
Node 2 : 1 0 0 0
Node 3 : 1 1 0 0
Node 4 : 0 0 1 1

[The worked similarity and modularity values for this example are not recoverable from the transcript.]

MAXIMUM Q

Maximizing Q [objective formula not preserved]

CLAL

[Figure: in each round the CO classifiers are trained on the labeled nodes, then nodes are queried and the unlabeled nodes are classified. The rounds repeat until labeled nodes > budget.]

TUNING AND GREEDY MECHANISM

[Figure: clusters of labeled and unlabeled nodes, each with its own CO classifier. After each "query & classify" step the CO classifiers are retrained and nodes are moved between clusters.]

Move: a node is moved when out-link > in-link.
Moving priority: (out-link - in-link), largest first, e.g. 3 -> 2 -> 1 -> 1.
Clustering priority: low accuracy -> high accuracy.
Reserve the greater COs.
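The moving rule can be sketched as follows. The slide does not define the terms, so this assumes that "in-link" means a link to a node in the same cluster and "out-link" a link to a node in another cluster; the graph, the cluster assignment, and all function names are hypothetical.

```python
def build_graph(edges):
    """Undirected adjacency lists from an edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def link_counts(node, graph, cluster):
    """Return (out_links, in_links) of a node under the assumed definitions."""
    inside = sum(1 for n in graph[node] if cluster[n] == cluster[node])
    return len(graph[node]) - inside, inside


def move_candidates(graph, cluster):
    """Nodes with out-link > in-link, ordered by (out-link - in-link), largest first."""
    gaps = {}
    for node in graph:
        out_links, in_links = link_counts(node, graph, cluster)
        if out_links > in_links:
            gaps[node] = out_links - in_links
    return sorted(gaps, key=lambda n: (-gaps[n], n))


edges = [("a", "x"), ("a", "y"), ("a", "z"), ("b", "x"), ("b", "y"),
         ("x", "y"), ("x", "z"), ("x", "w"), ("y", "z"), ("y", "w"), ("z", "w")]
cluster = {"a": 0, "b": 0, "x": 1, "y": 1, "z": 1, "w": 1}
order = move_candidates(build_graph(edges), cluster)
print(order)
```

Here node "a" has a gap of 3 and node "b" a gap of 2, so "a" is moved first; the other nodes have more in-links than out-links and stay where they are.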


BACKGROUND

Networked data

Social network: each node is a person (person name) with attribute features; nodes are linked by "friend" relations.

Citation network: each node is a paper (paper no.) with word features; nodes are linked by "cite" relations.


APPENDIX

SVM

Training data set: $\{(x_i, y_i)\},\ i = 1, 2, \dots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, -1\}$

[Figure: + and - training points separated by a hyperplane, shown twice with different margins.]

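CHALLENGE

Query efficiency from discriminative features

[Figure: papers (paper name) described by word features, with word counts for Class 1, Class 2, and the sum of the 2 classes. Values shown on the slide: 510, 400, 250, 250, 260, 180, 220, 100, 150; their alignment to words and classes is not recoverable.]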

CC PROBLEM: HOW TO SET THE TERMINAL CONDITION?

Different numbers of iterations will yield different results.

[Figure: nodes of classes A and B, marked as CO predicted label, true label, or labeled. The CC classifier takes the local feature (F1, F2, … = 0, 1, 0, …) and the inferred neighbor feature (NF_A, NF_B), e.g. 3/5 2/5. Neighbor features shown for Iteration 1 and Iteration 2 include 2/3 1/3, 1/3 2/3, and 4/5 1/5.]

ALFNET

[Flowchart: Query and train CO -> Query and train classifier -> Compute (two steps; formulas not preserved) -> "Iteration > ?" (Y/N) -> "Labeled node > Budget?" (Y/N) -> Output.]

REPRESENTATION AND CHALLENGE

In a citation network, how can the link information between nodes be used?

[Figure: a citation network of linked nodes.]