Fast Nearest-neighbor Search in Disk-resident Graphs

Upload: mikel

Post on 23-Feb-2016

Page 1: Fast Nearest-neighbor Search in Disk-resident Graphs

Fast Nearest-neighbor Search in Disk-resident Graphs

Presenter: Lu Yiqi (鲁轶奇)

Page 2: Fast Nearest-neighbor Search in Disk-resident Graphs

IBM – China Research Lab

Outline

- Introduction
- Background & related work
- Proposed work
- Experiments

Page 3: Fast Nearest-neighbor Search in Disk-resident Graphs

Introduction-Motivation

- Graphs are becoming enormous.
- Streaming algorithms must take passes over the entire dataset.
- Other approaches perform clever preprocessing, but it is tied to a specific similarity measure.

This paper introduces analysis and algorithms which try to address the scalability problem in a generalizable way: not specific to one kind of graph partitioning nor one specific proximity measure.

Page 4: Fast Nearest-neighbor Search in Disk-resident Graphs

Introduction-Motivation(cont.)

- Real-world graphs contain high-degree nodes.
- Many algorithms compute a node's value by combining the values of its neighbors.
- Whenever a high-degree node is encountered, these algorithms have to examine a much larger neighborhood, leading to severely degraded performance.

Page 5: Fast Nearest-neighbor Search in Disk-resident Graphs

Introduction-Motivation(cont.)

- Algorithms can no longer assume that the entire graph can be stored in memory.
- Compression techniques help, but there are at least three settings where they might not work:
  - Social networks are far less compressible than Web graphs.
  - Decompression might lead to an unacceptable increase in query response time.
  - Even if a graph could be compressed down to a gigabyte, it might be undesirable to keep it in memory on a machine which is running other applications.

Page 6: Fast Nearest-neighbor Search in Disk-resident Graphs

Contribution

- A simple transform of the graph (turning high-degree nodes into sinks).
- A deterministic local algorithm guaranteed to return nearest neighbors in personalized pagerank from the disk-resident clustered graph.
- A fully external-memory clustering algorithm (RWDISK) that uses only sequential sweeps over data files.

Page 7: Fast Nearest-neighbor Search in Disk-resident Graphs

Background-Personalized Pagerank

- A random walk starts at node a; at any step the walk can be reset to the start node with probability α.
- PPV(a, j): the personalized pagerank vector (PPV) entry from a to j.
- A large value indicates high similarity.
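The random walk with restart described above can be sketched as a plain power iteration. This is a minimal illustration, not the presenter's code: the function name `personalized_pagerank`, the adjacency-list format, and the uniform choice among out-neighbors are all assumptions made for the example.

```python
def personalized_pagerank(adj, start, alpha=0.1, iters=100):
    """Approximate the PPV of `start` by power iteration.

    adj maps every node to a list of its out-neighbors (assumed format).
    The walk moves to a uniformly chosen out-neighbor (assumption) and is
    reset to `start` with probability alpha at each step.
    """
    ppv = {n: 0.0 for n in adj}
    ppv[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        nxt[start] = alpha  # mass returned to the start node by the reset
        for u, nbrs in adj.items():
            if not nbrs:
                continue  # a node without out-edges passes nothing on
            share = (1.0 - alpha) * ppv[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        ppv = nxt
    return ppv


if __name__ == "__main__":
    toy = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    for node, value in sorted(personalized_pagerank(toy, "a").items()):
        print(node, round(value, 4))
```

Each iteration shrinks the remaining error by a factor of (1 - α), so a larger reset probability converges in fewer sweeps.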

Page 8: Fast Nearest-neighbor Search in Disk-resident Graphs

Background-Clustering

- Random-walk-based approaches are used to compute a good-quality local graph partition near a given anchor node.
- Main intuition: a random walk started inside a low-conductance cluster will mostly stay inside the cluster.
- Conductance: $\Phi_V(A)$ denotes the conductance of a node set A, where $\mu(A) = \sum_{i \in A} \mathrm{degree}(i)$.
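The conductance formula itself did not survive in the transcript. The sketch below uses the commonly used definition, cut edges divided by min(μ(A), μ(V \ A)); treating that as the definition meant on the slide is an assumption. The function name and the undirected adjacency-list input are also illustrative choices.

```python
def conductance(adj, cluster):
    """Conductance of `cluster` in an undirected graph.

    adj maps each node to a list of neighbors (each edge listed in both
    directions). Uses cut / min(mu(A), mu(V - A)), where mu is the sum of
    degrees; this is the usual definition and is assumed here.
    """
    cluster = set(cluster)
    cut = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    vol_in = sum(len(adj[u]) for u in cluster)
    vol_out = sum(len(adj[u]) for u in adj if u not in cluster)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


if __name__ == "__main__":
    # Two triangles joined by the single edge 2-3.
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    print(conductance(g, {0, 1, 2}))
```

A low value means few edges leave the cluster relative to its volume, which is why a random walk started inside tends to stay inside.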

Page 9: Fast Nearest-neighbor Search in Disk-resident Graphs

Proposed Work

- First issue: most local algorithms for computing nearest neighbors suffer from the presence of high-degree nodes.
- Second issue: computing proximity measures on large disk-resident graphs.
- Third issue: finding a good clustering.

Page 10: Fast Nearest-neighbor Search in Disk-resident Graphs

Effect of high degree nodes

- High-degree nodes are a performance bottleneck.
- Effect on personalized pagerank:
  - Main intuition: a very high-degree node passes on only a small fraction of its value to each out-neighbor, which might not be significant enough to invest our computing resources on.
  - Claim: stopping a random walk at a high-degree node does not change the personalized pagerank value at other nodes which have relatively smaller degree.
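Stopping the walk at high-degree nodes amounts to deleting their out-edges while keeping the edges that point into them. A minimal sketch of that transform follows; the name `make_sinks` and the strict "degree above the threshold" rule are assumptions for illustration.

```python
def make_sinks(adj, max_degree):
    """Return a copy of `adj` in which every node whose degree is above
    `max_degree` has no out-edges, so a random walk stops there.

    Edges pointing into such a node are left untouched.
    """
    return {
        u: ([] if len(nbrs) > max_degree else list(nbrs))
        for u, nbrs in adj.items()
    }


if __name__ == "__main__":
    star = {"hub": ["a", "b", "c", "d"],
            "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}
    print(make_sinks(star, max_degree=3))
```

The input graph is not modified, so the original and the transformed graph can be compared side by side.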

Page 11: Fast Nearest-neighbor Search in Disk-resident Graphs

Effect of high degree nodes

The error incurred in personalized pagerank is inversely proportional to the degree of the sink node.

Page 12: Fast Nearest-neighbor Search in Disk-resident Graphs

Effect of high degree nodes

$fa_\alpha(i, j)$ is simply the probability of hitting a node j for the first time from node i, in this α-discounted walk.
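The discounted hitting probability defined above can be approximated by fixed-point iteration. The sketch assumes one particular convention that the slide does not spell out: the walk survives each step with probability 1 - α, moves to a uniformly chosen neighbor, and the value at the target itself is 1. The function name is hypothetical.

```python
def discounted_hitting_prob(adj, target, alpha=0.1, iters=200):
    """Probability of reaching `target` before the alpha-discounted walk dies.

    Assumed recurrence: f(target) = 1 and, for every other node u,
    f(u) = (1 - alpha) * average of f over the neighbors of u.
    """
    f = {n: 0.0 for n in adj}
    f[target] = 1.0
    for _ in range(iters):
        nxt = {}
        for u, nbrs in adj.items():
            if u == target:
                nxt[u] = 1.0
            elif nbrs:
                nxt[u] = (1.0 - alpha) * sum(f[v] for v in nbrs) / len(nbrs)
            else:
                nxt[u] = 0.0  # a sink other than the target never reaches it
        f = nxt
    return f


if __name__ == "__main__":
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(discounted_hitting_prob(path, "c"))
```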

Page 13: Fast Nearest-neighbor Search in Disk-resident Graphs

Effect of high degree nodes

Page 14: Fast Nearest-neighbor Search in Disk-resident Graphs

Effect of high degree nodes

the error for introducing a set of sink nodes

Page 15: Fast Nearest-neighbor Search in Disk-resident Graphs

Nearest-neighbors on clustered graphs

Goal: use the clusters for deterministic computation of nodes "close" to an arbitrary query.

- Use degree-normalized personalized pagerank.
- For a given node i, the PPV from j to it, i.e. PPV(j, i), can be written as: [equation missing from the transcript]

Page 16: Fast Nearest-neighbor Search in Disk-resident Graphs

Assume that j and i are in the same cluster S.

- We do not have access to the PPV values of nodes k outside S, so we replace them with upper and lower bounds.
- Lower bound: 0, i.e. we pretend that S is completely disconnected from the rest of the graph.
- Upper bound: a random walk from outside S has to cross the boundary of S to hit node i.

Page 17: Fast Nearest-neighbor Search in Disk-resident Graphs

- S is small in size, so the power method suffices.
- At each iteration, maintain the upper and lower bounds for nodes within S.
- To expand S: bring in the clusters for x of the external neighbors of [expression missing from the transcript].
- Expansion continues until this global upper bound falls below a pre-specified small threshold γ.
- In practice, an additive slack ε is used: $(ub_{k+1} - \epsilon)$.
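The bound-maintenance step can be sketched as follows. This is one reading of the slides, not a confirmed reproduction of the method: it assumes an undirected graph with uniform transition probabilities, a first-step update of the form "restart term plus (1 - α) times the average over neighbors", a lower bound of 0 for every node outside S, and a single global upper bound for outside nodes equal to the largest current upper bound on the boundary of S. All names are hypothetical, and cluster expansion is left out.

```python
def ppv_bounds(adj, cluster, query, alpha=0.1, iters=100):
    """Lower and upper bounds on PPV(j, query) for every j in `cluster`.

    adj: undirected graph as node -> list of neighbors (assumed format).
    Nodes outside the cluster contribute 0 to the lower bound and the
    current global upper bound to the upper bound.
    """
    cluster = set(cluster)
    # Boundary: cluster nodes with at least one neighbor outside the cluster.
    boundary = [j for j in cluster if any(k not in cluster for k in adj[j])]
    lb = {j: 0.0 for j in cluster}
    ub = {j: 1.0 for j in cluster}
    for _ in range(iters):
        # A walk from outside must enter through the boundary, so no outside
        # node can exceed the largest upper bound found on the boundary.
        outside_ub = max((ub[j] for j in boundary), default=0.0)
        new_lb, new_ub = {}, {}
        for j in cluster:
            restart = alpha if j == query else 0.0
            lo = hi = 0.0
            for k in adj[j]:
                p = 1.0 / len(adj[j])
                if k in cluster:
                    lo += p * lb[k]
                    hi += p * ub[k]
                else:
                    hi += p * outside_ub  # lower bound adds 0 here
            new_lb[j] = restart + (1.0 - alpha) * lo
            new_ub[j] = restart + (1.0 - alpha) * hi
        lb, ub = new_lb, new_ub
    return lb, ub


if __name__ == "__main__":
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    lower, upper = ppv_bounds(g, {0, 1, 2}, query=0)
    for j in sorted(lower):
        print(j, round(lower[j], 4), round(upper[j], 4))
```

When the cluster has no boundary (it is a whole connected component), the two bounds converge to the same value.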

Page 18: Fast Nearest-neighbor Search in Disk-resident Graphs

Ranking Step

Return all nodes whose lower bound is greater than the (k+1)th largest upper bound.

Why: all nodes outside the cluster are guaranteed to have personalized pagerank smaller than the global upper bound, which is smaller than γ.
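The ranking rule is short enough to write down directly. In this sketch the function name, the dictionary inputs, and the behavior when fewer than k+1 candidates exist are assumptions; the optional `slack` argument stands in for the additive slack ε mentioned earlier.

```python
def top_k_certified(lb, ub, k, slack=0.0):
    """Nodes whose lower bound exceeds the (k+1)-th largest upper bound.

    lb, ub: dicts mapping node -> bound. `slack` is subtracted from the
    threshold, trading exactness for earlier termination.
    """
    if len(ub) <= k:
        # Fewer than k+1 candidates: nothing to compare against, so return
        # everything (a choice made for this sketch only).
        return sorted(lb, key=lb.get, reverse=True)
    threshold = sorted(ub.values(), reverse=True)[k] - slack
    winners = [n for n in lb if lb[n] > threshold]
    return sorted(winners, key=lb.get, reverse=True)


if __name__ == "__main__":
    lower = {"a": 0.5, "b": 0.3, "c": 0.05, "d": 0.01}
    upper = {"a": 0.6, "b": 0.35, "c": 0.1, "d": 0.02}
    print(top_k_certified(lower, upper, k=2))
```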

Page 19: Fast Nearest-neighbor Search in Disk-resident Graphs

Clustered Representation on Disk

- Intuition: use a set of anchor nodes and assign each remaining node to its "closest" anchor.
- Personalized pagerank is used as the measure of "closeness".
- Algorithm:
  - Start with a random set of anchors.
  - Iteratively add new anchors from the set of unreachable nodes, and then recompute the cluster assignments.
- Two properties:
  - New anchors are far away from the existing anchors.
  - When the algorithm terminates, each node i is guaranteed to be assigned to its closest anchor.
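The anchor-based procedure above might be sketched in memory as follows. Everything concrete here is an assumption made for illustration: the helper names, the truncated PPV computation (entries below `eps` are dropped so that distant nodes count as unreachable), and adding one new anchor at a time. The disk-based version is what RWDISK provides.

```python
import random


def truncated_ppv(adj, anchor, alpha=0.1, iters=30, eps=1e-3):
    """PPV from `anchor`, dropping walk probabilities below `eps` each step."""
    x = {anchor: 1.0}      # distribution of the walk after t steps
    ans = {anchor: alpha}  # accumulated PPV values
    for t in range(1, iters + 1):
        nxt = {}
        for u, mass in x.items():
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0.0) + mass / len(adj[u])
        x = {v: m for v, m in nxt.items() if m >= eps}
        for v, m in x.items():
            ans[v] = ans.get(v, 0.0) + alpha * (1.0 - alpha) ** t * m
    return ans


def cluster_by_anchors(adj, anchors, alpha=0.1, iters=30, eps=1e-3, seed=0):
    """Assign every node to the anchor with the largest PPV value to it."""
    rng = random.Random(seed)
    best = {}  # node -> (ppv value, anchor)
    pending = list(anchors)
    while pending:
        for a in pending:
            for node, val in truncated_ppv(adj, a, alpha, iters, eps).items():
                if node not in best or val > best[node][0]:
                    best[node] = (val, a)
        # Nodes no anchor has reached yet supply the next anchor.
        unreachable = sorted(n for n in adj if n not in best)
        pending = [rng.choice(unreachable)] if unreachable else []
    return {node: a for node, (_, a) in best.items()}


if __name__ == "__main__":
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1],
         3: [4, 5], 4: [3, 5], 5: [3, 4]}
    print(cluster_by_anchors(g, anchors=[0]))
```

Because a new anchor always reaches itself, the set of unreachable nodes shrinks on every round and the loop terminates.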

Page 20: Fast Nearest-neighbor Search in Disk-resident Graphs

RWDISK

Four kinds of files:

- Edges file: each line represents an edge by a triplet {src, dst, p}, where $p = P(X_t = dst \mid X_{t-1} = src)$.
- Last file: each line is {src, anchor, value}, where $value = P(X_{t-1} = src \mid X_0 = anchor)$.
- Newt file: contains $x_t$; each line is {src, anchor, value}, where $value = P(X_t = src \mid X_0 = anchor)$.
- Ans file: represents the values for $v_t$; each line is {src, anchor, value}, where value = [formula missing from the transcript].

Algorithm to compute $v_t$ by power iterations.

Page 21: Fast Nearest-neighbor Search in Disk-resident Graphs

RWDISK(cont.)

- Newt is simply a matrix-vector product between the transition matrix stored in Edges and Last.
- The files are sorted lexicographically, so this product can be obtained by a file-join-like algorithm.
- First step: join the two files and accumulate the probability values at a node from its in-neighbors.
- Next step: the Newt file is sorted and compressed, in order to add up the values from different in-neighbors.
- Multiply the probabilities by $\alpha(1-\alpha)^{t-1}$.
- Fix the number of iterations at maxiter.
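The join and compress steps can be illustrated with in-memory lists standing in for the sorted files. This is a sketch under assumptions: the function names `join_two_files` and `compress` are hypothetical, the rows are tuples rather than text lines, and the weighting by α(1-α)^(t-1) and the update of Ans are omitted.

```python
from itertools import groupby


def join_two_files(edges, last):
    """Merge-join two lists that are both sorted by src.

    edges: (src, dst, p) rows; last: (src, anchor, value) rows.
    Emits (dst, anchor, p * value): the probability mass that the walk
    started at `anchor` carries from src to dst in one step.
    """
    out = []
    i = 0
    for src, anchor, value in last:
        # Advance the edge cursor to the first edge leaving this src.
        while i < len(edges) and edges[i][0] < src:
            i += 1
        j = i
        while j < len(edges) and edges[j][0] == src:
            out.append((edges[j][1], anchor, edges[j][2] * value))
            j += 1
    return out


def compress(rows):
    """Sort by (node, anchor) and add up values that share the same key."""
    rows = sorted(rows)
    return [
        (node, anchor, sum(r[2] for r in grp))
        for (node, anchor), grp in groupby(rows, key=lambda r: (r[0], r[1]))
    ]


if __name__ == "__main__":
    # Path a - b - c with uniform transition probabilities.
    edges = sorted([("a", "b", 1.0), ("b", "a", 0.5),
                    ("b", "c", 0.5), ("c", "b", 1.0)])
    last = [("a", "a", 1.0)]  # the walk starts at anchor "a"
    for step in range(3):
        last = compress(join_two_files(edges, last))
        print(step + 1, last)
```

Both functions only scan their inputs in order, which is what makes the same logic workable as sequential sweeps over files on disk.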

Page 22: Fast Nearest-neighbor Search in Disk-resident Graphs

- One major problem is that intermediate files can become much larger than the number of edges.
- In most real-world networks, it is possible to reach a huge fraction of the whole graph within 4-5 steps.
- When intermediate files get too large, rounding is used to reduce file sizes.
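The slides do not say exactly how the rounding works. One plausible reading, assumed here, is that rows whose value falls below a threshold ε are treated as zero and dropped from the intermediate file. The function name and row format are hypothetical.

```python
def drop_small_values(rows, eps=0.001):
    """Keep only (node, anchor, value) rows whose value is at least eps.

    Assumption: 'rounding' means treating values below eps as zero.
    """
    return [row for row in rows if row[2] >= eps]


if __name__ == "__main__":
    rows = [("a", "x", 0.5), ("b", "x", 0.0004), ("c", "x", 0.001)]
    print(drop_small_values(rows))
```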

Page 23: Fast Nearest-neighbor Search in Disk-resident Graphs

Experiments

Dataset

Page 24: Fast Nearest-neighbor Search in Disk-resident Graphs

Experiments(cont.)

System details:

- An off-the-shelf PC
- Least-recently-used (LRU) page replacement scheme
- Page size: 4KB
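To make the page-fault measurements concrete, a least-recently-used page cache can be simulated in a few lines. This is a generic illustration, not the experimental harness: the class name, the capacity parameter, and the access trace are all hypothetical.

```python
from collections import OrderedDict


class LRUPageCache:
    """Counts page faults for a fixed-capacity cache with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # least recently used page comes first
        self.faults = 0

    def access(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            return
        self.faults += 1  # page is not in memory and must be read from disk
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used
        self.pages[page_id] = True


if __name__ == "__main__":
    cache = LRUPageCache(capacity=2)
    for page in [1, 2, 1, 3, 2]:
        cache.access(page)
    print(cache.faults)
```

A clustering that keeps a random walk inside one page for many steps shows up here as a long run of hits between faults.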

Page 25: Fast Nearest-neighbor Search in Disk-resident Graphs

Experiments(cont.)-Effect of high degree nodes

Three-fold advantages:

- Speed up external-memory clustering
- Reduce the number of page faults in random-walk simulation

Effect on RWDISK

Page 26: Fast Nearest-neighbor Search in Disk-resident Graphs

Experiments(cont.)-Deterministic vs. Simulations

Computing the top-10 neighbors with approximation slack 0.005 for 500 randomly picked nodes:

- Citeseer: original graph
- DBLP: nodes with degree above 1000 turned into sinks
- LiveJournal: nodes with degree above 100 turned into sinks

Page 27: Fast Nearest-neighbor Search in Disk-resident Graphs

Experiments(cont.)-RWDISK vs. METIS

- Parameters for PPV: maxiter = 30, α = 0.1 and ε = 0.001.
- METIS is the baseline algorithm:
  - Breaking DBLP into 50000 parts used 20GB of RAM.
  - Breaking LiveJournal into 75000 parts used 50GB of RAM.
- In comparison, RWDISK can be executed on a standard 2-4 GB PC.

Page 28: Fast Nearest-neighbor Search in Disk-resident Graphs

Experiments(cont.)-RWDISK vs. METIS

Measure of cluster quality. A good disk-based clustering must satisfy:

- Low conductance
- Fit in disk-sized pages

Page 29: Fast Nearest-neighbor Search in Disk-resident Graphs

Experiments(cont.)-RWDISK vs. METIS

Page 30: Fast Nearest-neighbor Search in Disk-resident Graphs

Experiments(cont.)-RWDISK vs. METIS