2015/10/111 dbconnect: mining research community on dblp data osmar r. zaïane, jiyang chen, randy...




DBconnect: Mining Research Community on DBLP Data

Osmar R. Zaïane, Jiyang Chen, Randy Goebel

Web Mining and Social Network Analysis Workshop in conjunction with ACM SIGKDD, SNA-KDD'07

Presenter: 吳建良

Outline
- Community
- Motivation
  - Understand the research community and recommend collaborations
- Proposed Approach
  - Rank the relevance with a random walk approach
- DBconnect
  - A navigational system to investigate community relations
- Conclusion

What is community?
- In Graph Theory: densely connected groups of vertices, with sparser connections between groups
- In Social Network Analysis: groups of entities that share similar properties or connect to each other via certain relations

Why is community important?
- Interesting data with community structure: researcher collaboration, friendship networks, the WWW, massively multiplayer online gaming, electronic communications, …
- Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.

Motivation
- Understand the research network between authors, conferences and topics (rank entities by relevance for given entities)
- Find and recommend research collaborators for given authors
- Explore the academic social network

Proposed Approach
- Build a bipartite graph in the author-conference space
- Limitation of the traditional bipartite graph model
- Extend the bipartite model to include co-authorship information
- Further extend the model to tripartite to include topic information
- Use random walk with restart on such models

An example: author publication records in conferences
- a, b, c, d, e are authors
- ac(3) means that authors a and c published three papers together in the KDD(y) conference

Bipartite model for conference-author social network
- Weight(edge) = publishing frequency of the author in a certain conference
- Limitation: fails to represent any co-authorships
- Options for capturing the co-author relations, and the problem with each:
  1. Add a link between a and c: misses the role of KDD
  2. Make the link connecting a and c go to KDD: makes the random walk infeasible
  3. Add additional nodes to represent each co-author relation: impractical, because there is a huge number of such relations
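The edge weights of the plain bipartite model can be sketched as a count over publication records. The records below are hypothetical (they are not taken from DBLP or from the slide's figure), and `build_weight_matrix` is an illustrative name.

```python
from collections import Counter

# Hypothetical records: one (conference, author list) pair per paper.
papers = [
    ("KDD", ["a", "c"]),
    ("KDD", ["a", "c"]),
    ("KDD", ["a", "c"]),
    ("KDD", ["e"]),
    ("SIGMOD", ["b", "d"]),
    ("SIGMOD", ["e"]),
]

def build_weight_matrix(papers):
    """Return (conferences, authors, M), where M[i][j] is the number of
    papers author j published in conference i."""
    counts = Counter()
    for conference, paper_authors in papers:
        for author in set(paper_authors):
            counts[(conference, author)] += 1
    conferences = sorted({c for c, _ in counts})
    authors = sorted({a for _, a in counts})
    M = [[counts[(c, a)] for a in authors] for c in conferences]
    return conferences, authors, M

conferences, authors, M = build_weight_matrix(papers)
```

The resulting matrix would be identical if a and c had written their three KDD papers separately, which is the limitation noted above: the co-authorship is not represented.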

Extend the bipartite model to include co-authorship information
- Add a virtual level of nodes to replace the conference partition, and add direction to the edges
- A nodes then connect to their own split relation nodes with the original weight
- C' nodes connect to all author nodes
  - If the A node and the C' node have a co-author relation, the edge weight is the co-author frequency multiplied by a parameter f
  - Otherwise, the edge is weighted as in the original model
- Set f = k (k is the total number of authors of a conference)
[Figure: directed bipartite example with edge weights such as 3, 7 and 3f]
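One possible reading of these weighting rules is sketched below. The data layout, the function name `directed_edges`, and the choice to link a relation node to every author of its conference (including its own author, with the original weight) are assumptions made for illustration; the slide's figure is not available to confirm them.

```python
def directed_edges(conf_weight, coauthor_count):
    """Build edge weights for the directed model (illustrative reading).

    conf_weight[conf][author]              -> papers by author in conf
    coauthor_count[conf][frozenset({a,b})] -> joint papers of a and b in conf
    A relation node is the pair (conf, author).
    """
    author_to_relation, relation_to_author = {}, {}
    for conf, weights in conf_weight.items():
        f = len(weights)  # f = k, the number of authors in this conference
        joint_counts = coauthor_count.get(conf, {})
        for owner, weight in weights.items():
            node = (conf, owner)
            # Author -> own relation node keeps the original weight.
            author_to_relation[(owner, node)] = weight
            for target, target_weight in weights.items():
                joint = joint_counts.get(frozenset((owner, target)), 0)
                # Co-author edges are boosted by f; others keep the original weight.
                relation_to_author[(node, target)] = joint * f if joint else target_weight
    return author_to_relation, relation_to_author

# Hypothetical input: a and c wrote three KDD papers together.
conf_weight = {"KDD": {"a": 3, "c": 7, "e": 7}}
coauthor_count = {"KDD": {frozenset(("a", "c")): 3}}
a_to_r, r_to_a = directed_edges(conf_weight, coauthor_count)
```

With three authors in the conference, f = 3, so the edge from the relation node (KDD, a) to c carries weight 3f = 9, while the edge to e keeps its original weight of 7.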

Further extend the model to tripartite to include topic information
- Research topic is an important component to differentiate any research community
- Authors that attend the same conferences might work on various topics

Adding topic information
- Very few conference proceedings have their table of contents included in DBLP
  - Tables of contents include session titles
- Extract relevant topics from DBLP
  - Use paper titles, and find frequent co-locations in the title text
- Method
  - Manually select a list of stopwords to remove frequently used but non-topic-related words
    - Ex: Towards, Understanding, Approach, …

Adding topic information (cont.)
- Count the frequency of every co-located pair of stemmed words
- Select the top 1000 most frequent bi-grams as topics
- Manually add several tri-grams
  - Ex: World Wide Web, Support Vector Machine, …
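The topic extraction steps can be sketched roughly as follows. The stopword list, the toy `stem` function (a stand-in for a real stemmer), and the decision to drop stopwords before pairing adjacent words are all assumptions; the slides do not specify these details.

```python
from collections import Counter
import re

# Hypothetical stopword list; the slides only name a few examples.
STOPWORDS = {"towards", "understanding", "approach", "a", "an", "the",
             "of", "for", "in", "on", "and", "to", "with"}

def stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def top_bigrams(titles, k=1000):
    """Return the k most frequent pairs of adjacent stemmed words."""
    counts = Counter()
    for title in titles:
        words = [stem(w) for w in re.findall(r"[a-z]+", title.lower())
                 if w not in STOPWORDS]
        counts.update(zip(words, words[1:]))
    return [bigram for bigram, _ in counts.most_common(k)]

# Hypothetical titles, for illustration only.
titles = [
    "Towards Mining Data Streams",
    "Mining Data Streams with Restart",
    "An Approach to Data Streams",
]
topics = top_bigrams(titles, k=2)
```

Tri-grams such as "World Wide Web" would still need to be added by hand, as the slide notes, because this count only looks at pairs.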

Random walk on DBLP social network
- Problem to solve: given an author node $a \in A$, compute a relevance score for each author $b \in A$
- Simple example: conference-author network G
[Figure: example network G and its relational matrix $M_{3 \times 5}$]

Random walk on DBLP social network (cont.)
- Normalize M such that every column sums to 1: $Q(M) = \mathrm{col\_norm}(M)$, $Q(M^T) = \mathrm{col\_norm}(M^T)$
- Construct the adjacency matrix J of G after normalization

$$J = \begin{pmatrix} 0 & Q(M) \\ Q(M^T) & 0 \end{pmatrix}$$

$$Q(M) = \begin{pmatrix} 0.62 & 0.2 & 0 & 0 & 0 \\ 0.38 & 0 & 1.0 & 0 & 0.77 \\ 0 & 0.8 & 0 & 1.0 & 0.22 \end{pmatrix}$$

$$Q(M^T) = \begin{pmatrix} 0.84 & 0.18 & 0 \\ 0.16 & 0 & 0.44 \\ 0 & 0.41 & 0 \\ 0 & 0 & 0.33 \\ 0 & 0.41 & 0.22 \end{pmatrix}$$

Rows of $Q(M)$ are the three conferences and its columns are authors a to e; the SIGMOD row is $(0, 0.8, 0, 1.0, 0.22)$.
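The normalization and the block layout of J can be sketched in plain Python. The raw counts in `M` are not given in the transcript (the figure is missing); the values below are an assumption, chosen so that column normalization lands within 0.01 of the entries shown on the slide, with SIGMOD taken as the third conference.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def col_norm(matrix):
    """Scale every column so it sums to 1 (all-zero columns stay zero)."""
    sums = [sum(col) for col in zip(*matrix)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in matrix]

def build_J(M):
    """Assemble J = [[0, Q(M)], [Q(M^T), 0]] for n conferences, m authors."""
    n, m = len(M), len(M[0])
    QM, QMT = col_norm(M), col_norm(transpose(M))
    conference_rows = [[0.0] * n + row for row in QM]
    author_rows = [row + [0.0] * m for row in QMT]
    return conference_rows + author_rows

# Assumed publication counts (rows: conferences, columns: authors a..e).
M = [
    [5, 1, 0, 0, 0],
    [3, 0, 7, 0, 7],
    [0, 4, 0, 3, 2],  # SIGMOD
]
J = build_J(M)
```

Every column of `J` sums to 1, so multiplying a probability vector by `J` yields another probability vector.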

Random walk on DBLP social network (cont.)
- Normalized adjacency matrix J of G
[Figure: the matrix J, with its $Q(M^T)$ and $Q(M)$ blocks labeled]

Random walk on DBLP social network (cont.)
- A random walk on this graph moves from one node to one of its neighbors based on a transition probability
  - Probability: the weight of the edge divided by the sum of the weights of all edges that connect to this node
- Ex: if we start from node SIGMOD, then build u as the start vector
  - u is a one-column vector consisting of (3+5) elements, one per conference and author
  - The value of the element corresponding to SIGMOD is set to 1

Random walk on DBLP social network (cont.)
- Each step of the walk applies u = Ju
- After step 1 of the first iteration, the random walk hits the author nodes with b = 1×0.44, d = 1×0.33, e = 1×0.22
- After step 2 of the first iteration, the chance that the random walk goes back to SIGMOD is 0.44×0.8 + 0.33×1 + 0.22×0.22 = 0.73, and the other 0.27 goes to the other two conference nodes
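The arithmetic for the first iteration can be checked with a few lines of Python. The numbers are the rounded matrix entries quoted on the slide; the variable names are illustrative.

```python
# Step 1: starting at SIGMOD, the walk reaches authors b, d, e with these
# probabilities (the SIGMOD column of Q(M^T)).
after_step1 = {"b": 0.44, "d": 0.33, "e": 0.22}

# Step 2: probability that each of those authors steps back to SIGMOD
# (the SIGMOD row of Q(M)).
to_sigmod = {"b": 0.8, "d": 1.0, "e": 0.22}

back = sum(after_step1[x] * to_sigmod[x] for x in after_step1)
elsewhere = 1 - back

print(round(back, 2), round(elsewhere, 2))  # prints: 0.73 0.27
```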

Random walk on DBLP social network (cont.)
- After a few iterations, the vector converges and gives a stable score to every node
- However, these scores are always the same no matter where the walk begins
- Solved by random walk with restart
  - Given a restarting probability c
  - Use another vector v, in which the value of the element corresponding to SIGMOD is set to 1
  - In each random walk iteration, the walker goes back to the start node with the restart probability: u = (1-c)u + cv
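Combining the walk step u = Ju with the restart step gives a single update per iteration. As a short derivation, assuming J is column-stochastic and 0 < c ≤ 1 (the superscript t is an iteration index introduced here for illustration):

```latex
u^{(t+1)} = (1-c)\,J\,u^{(t)} + c\,v
\quad\Longrightarrow\quad
u^{*} = (1-c)\,J\,u^{*} + c\,v
\quad\Longrightarrow\quad
u^{*} = c\,\bigl(I - (1-c)\,J\bigr)^{-1} v
```

The inverse exists because the spectral radius of (1-c)J is at most 1-c < 1. Since the fixed point depends on v, the scores now depend on the start node, which is what the restart is meant to achieve.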

Random walk on DBLP social network (cont.)
Random walk with restart algorithm (1)
Input: node α ∈ A, a bipartite graph model G, restarting probability c, convergence threshold ε.
Output: relevance score vector B for author nodes.
1. Compute the adjacency matrix J, of size (n+m)×(n+m), of G. /* n conferences and m authors */
2. Initialize vα = 0, set element for α to 1: vα(α) = 1.
3. While (Δuα > ε):
     uα = Juα
     uα = (1 − c) uα + cvα
4. Set vector B = uα(n+1 : n+m).
5. Return B.
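A minimal runnable sketch of these steps follows. Several details are assumptions rather than part of the slide: the walk is initialized at the query node (u = v), Δu is measured as an L1 difference, c = 0.15 is an arbitrary default, and the counts in `M` are the same assumed values used to approximate the slide's example.

```python
def col_norm(matrix):
    sums = [sum(col) for col in zip(*matrix)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in matrix]

def build_J(M):
    n, m = len(M), len(M[0])
    MT = [list(col) for col in zip(*M)]
    return ([[0.0] * n + row for row in col_norm(M)]
            + [row + [0.0] * m for row in col_norm(MT)])

def rwr(J, start, c=0.15, eps=1e-9, max_iter=10_000):
    """Random walk with restart; returns the converged score vector u."""
    size = len(J)
    v = [0.0] * size
    v[start] = 1.0
    u = v[:]  # assumption: the walk begins at the query node
    for _ in range(max_iter):
        walked = [sum(J[i][j] * u[j] for j in range(size)) for i in range(size)]
        nxt = [(1 - c) * w + c * r for w, r in zip(walked, v)]
        delta = sum(abs(a - b) for a, b in zip(nxt, u))  # L1 change
        u = nxt
        if delta <= eps:
            break
    return u

# Assumed counts (rows: conferences, columns: authors a..e).
M = [[5, 1, 0, 0, 0], [3, 0, 7, 0, 7], [0, 4, 0, 3, 2]]
J = build_J(M)
n = len(M)

scores_a = rwr(J, n)[n:]      # B for query author a (index n in u)
scores_d = rwr(J, n + 3)[n:]  # B for query author d
```

Here `scores_a` and `scores_d` differ, unlike the plain random walk, whose stable scores do not depend on the start node.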

Random walk on DBLP social network (cont.)
- Extend the bipartite model into a directed bipartite graph G' = (C', A, E')
  - A has m author nodes, and C has n conference nodes
  - C' is generated based on C and has n*m nodes
    - Assume every node in C is split into m nodes
- First generate a matrix M of size (n*m)×m for directional edges from C' to A
- Then form a matrix N of size m×(n*m) for edges from A to C'

Random walk on DBLP social network (cont.)
- The adjacency matrix J of G'
- Algorithm (2): the random walk with restart algorithm for the directed bipartite model

Random walk on DBLP social network (cont.)
- Extend to the tripartite graph model G'' = (C, A, T, E'')
  - Assume n conferences, m authors and l topics in G''
  - Three corresponding matrices: U (n×m), V (m×l) and W (n×l)
- The adjacency matrices of G'' after normalization
[Figure: normalized adjacency matrix of G'']
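The transcript does not preserve the tripartite adjacency matrix, so the sketch below is only one plausible layout by analogy with the bipartite J: zero blocks on the diagonal, U, V, W and their transposes off the diagonal, and every column of the full matrix scaled to sum to 1. Both the block order (conferences, authors, topics) and the whole-column normalization are assumptions, and the small matrices are made up.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def tripartite_J(U, V, W):
    """U: n x m (conference-author), V: m x l (author-topic),
    W: n x l (conference-topic). Returns an (n+m+l) square matrix."""
    n, m, l = len(U), len(V), len(V[0])
    UT, VT, WT = transpose(U), transpose(V), transpose(W)
    rows = (
        [[0.0] * n + list(U[i]) + list(W[i]) for i in range(n)]    # conferences
        + [UT[j] + [0.0] * m + list(V[j]) for j in range(m)]       # authors
        + [WT[k] + VT[k] + [0.0] * l for k in range(l)]            # topics
    )
    sums = [sum(col) for col in zip(*rows)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in rows]

# Hypothetical counts: 2 conferences, 2 authors, 3 topics.
U = [[1, 2], [0, 3]]
V = [[1, 0, 2], [0, 1, 1]]
W = [[1, 1, 0], [0, 2, 1]]
J3 = tripartite_J(U, V, W)
```

Under this layout a walker at an author node can step to either a conference or a topic, in proportion to the edge weights.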

Random walk on DBLP social network (cont.)
- Algorithm (3): the random walk with restart algorithm for the tripartite model

DBLP dataset
- Download the publication data for conferences from the DBLP website in July 2007
- It contains more than 300,000 authors, about 3,000 conferences and the selected 1,000 N-gram topics
- The entire adjacency matrix becomes too big to make the random walk efficient
  - Use the METIS algorithm to partition the large graph into ten subgraphs of about the same size

The DBconnect System
- http://kingman.cs.ualberta.ca/research/demos/content/dbconnect/
- A navigational system to investigate the community connections and relations
- Displaying researcher statistics from academic search engines
- Providing lists of recommended entities to given authors, topics and conferences

The DBconnect System (cont.)
- Academic Information
  - Conference contribution, earliest publication year and average publications per year
  - H-index is calculated based on information retrieved from Google Scholar
  - Approximate citation numbers
- Related Conferences
  - Based on the author-conference-topic model
- Related Topics
  - Based on the author-conference-topic model
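For reference, the H-index is the largest h such that the author has h papers with at least h citations each. A minimal sketch, with made-up citation counts (how DBconnect retrieves them from Google Scholar is not described in the slides):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for one author's papers.
example = h_index([10, 8, 5, 4, 3])  # four papers have at least 4 citations
```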

The DBconnect System (cont.)
- Co-authors
  - Co-author name and number of papers
- Related Researchers
  - Based on the directed bipartite graph model
- Recommended Collaborators
  - Based on the author-conference-topic model
  - Co-authors' names are not shown here
  - The result implies that the given author shares similar topics and conference experiences with these listed researchers, hence the recommendation
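Turning relevance scores into a list of recommended collaborators might look like the sketch below: rank the other authors by score and leave out existing co-authors. The function name, the cutoff `k`, and the scores are hypothetical; the slides do not state how many recommendations are shown.

```python
def recommend_collaborators(scores, author, coauthors, k=10):
    """Rank other authors by relevance score, skipping existing co-authors."""
    candidates = {
        name: score
        for name, score in scores.items()
        if name != author and name not in coauthors
    }
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

# Hypothetical relevance scores for query author "a".
scores = {"a": 0.40, "b": 0.20, "c": 0.25, "d": 0.10, "e": 0.05}
recommended = recommend_collaborators(scores, "a", coauthors={"c"}, k=2)
```

Here c has the highest score among the other authors but is omitted because a and c already collaborate.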

The DBconnect System (cont.)
- Recommended To
  - The recommendation is not symmetric
  - Author A may be recommended as a possible future collaborator to author B but not vice versa
  - Ex: Jiawei Han has been recommended as a collaborator for 6201 authors, but apparently only a few of them are recommended as collaborators to him
  - The given author has been recommended to the listed authors
- Symmetric Recommendations
  - The listed authors have been recommended to the given author
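The "Recommended To" list can be obtained by inverting the per-author recommendation lists. In the sketch below, the recommendation data is made up, and treating "symmetric" as "recommended in both directions" is an interpretation rather than something the slide states explicitly.

```python
from collections import defaultdict

def recommended_to(recommendations):
    """Map each author to the set of authors they were recommended to."""
    inverse = defaultdict(set)
    for author, recs in recommendations.items():
        for rec in recs:
            inverse[rec].add(author)
    return inverse

def symmetric(recommendations, author):
    """Authors recommended to `author` who also have `author` recommended to them."""
    inverse = recommended_to(recommendations)
    return set(recommendations.get(author, [])) & inverse[author]

# Hypothetical lists: author -> recommended collaborators.
recs = {
    "x": ["han", "y"],
    "y": ["han", "x"],
    "han": ["z"],
    "z": ["han"],
}
han_recommended_to = recommended_to(recs)["han"]
han_symmetric = symmetric(recs, "han")
```

In this toy data "han" is recommended to three authors, but only one of them is recommended back, which mirrors the asymmetry described above.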

Conclusion
- Extend a bipartite graph model to incorporate co-authorship
- Propose a random walk with restart approach
  - Find related conferences, authors, and topics for a given entity
- Present the DBconnect system
  - Help explore the relational structure and discover implicit knowledge within the DBLP data collection