fast pair-wise and node-wise algorithms for commute times and katz scores
DESCRIPTION
Slides from a seminar I gave at Emory University on 2012-01-20TRANSCRIPT
FAST MATRIX COMPUTATIONS FOR
PAIRWISE AND COLUMN-WISE KATZ
SCORES AND COMMUTE TIMES
David F. Gleich
Purdue University
Emory University
Math/CS Seminar
January 20, 2012
With Pooya Esfandiar, Francesco Bonchi, Chen Greif,
Laks V. S. Lakshmanan, and Byung-Won On
David F. Gleich (Purdue) Emory Math/CS Seminar 1 / 47
MAIN RESULTS
A – adjacency matrix
L – Laplacian matrix
Katz score :
Commute time:
For Commute
Compute one fast
Estimate
David F. Gleich (Purdue) Emory Math/CS Seminar
For Katz Compute one fast Compute top fast
2 of 47
OUTLINE
Why study these measures?
Katz Rank and Commute Time
How else do people compute them?
Quadrature rules for pairwise scores
Sparse linear systems solves for top-k
As many results as we have time for…
David F. Gleich (Purdue) Emory Math/CS Seminar 3 of 47
WHY? LINK PREDICTION
David F. Gleich (Purdue) Emory Math/CS Seminar
Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient
Neighborhood based
Path based
4 of 47
NOTE
All graphs are undirected
All graphs are connected
David F. Gleich (Purdue) Emory Math/CS Seminar 5 of 47
LEO KATZ
David F. Gleich (Purdue) Emory Math/CS Seminar 6 of 47
NOT QUITE, WIKIPEDIA
: adjacency, : random walk
PageRank
Katz
These are equivalent if has constant degree
David F. Gleich (Purdue) Emory Math/CS Seminar 7 of 47
WHAT KATZ ACTUALLY SAID
Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43
“we assume that each link independently has the
same probability of being effective” …
“we conceive a constant , depending
on the group and the context of the particular
investigation, which has the force of a probability
of effectiveness of a single link. A k-step chain
then, has probability of being effective.”
“We wish to find the column sums of the matrix”
David F. Gleich (Purdue) Emory Math/CS Seminar 8 of 47
A MODERN TAKE
The Katz score (node-based) is
The Katz score (edge-based) is
David F. Gleich (Purdue) Emory Math/CS Seminar 9 of 47
RETURNING TO THE MATRIX
Carl Neumann
David F. Gleich (Purdue) Emory Math/CS Seminar 10 of 47
Carl Neumann
I’ve heard the Neumann series called the “von Neumann”
series more than I’d like! In fact, the von Neumann kernel
of a graph should be named the “Neumann” kernel!
David F. Gleich (Purdue) Emory Math/CS Seminar
Wikipedia page
11 / 47
PROPERTIES OF KATZ’S MATRIX
is symmetric
exists when
is spd when
Note that 1/max-degree suffices
David F. Gleich (Purdue) Emory Math/CS Seminar 12 of 47
COMMUTE TIME
David F. Gleich (Purdue) Emory Math/CS Seminar
Picture taken from Google images, seems to be
Bay Bridge Traffic by Jim M. Goldstein.
13 / 47
COMMUTE TIME
Consider a uniform random walk on a graph
David F. Gleich (Purdue) Emory Math/CS Seminar
Also called the hitting
time from node i to j, or
the first transition time
14 of 47
SKIPPING DETAILS
: graph Laplacian
is the only null-vector
David F. Gleich (Purdue) Emory Math/CS Seminar 15 of 47
WHAT DO OTHER PEOPLE DO?
1) Just work with the linear algebra formulations
2) For Katz, Truncate the Neumann series as a few (3-5) terms, or MMQ (Benzi & Boito)
3) Use low-rank approximations from EVD(A) or EVD(L)
4) For commute, use Johnson-Lindenstrauss inspired random sampling
5) Approximately decompose into smaller problems
David F. Gleich (Purdue) Emory Math/CS Seminar
Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,
Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007, Wang et al. ICDM2007 16 of 47
THE PROBLEM
All of these techniques are
preprocessing based because
most people’s goal is to compute
all the scores.
We want to avoid
preprocessing the graph.
David F. Gleich (Purdue) Emory Math/CS Seminar
There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse
17 of 47
WHY NO PREPROCESSING?
The graph is constantly changing
as I rate new movies.
David F. Gleich (Purdue) Emory Math/CS Seminar 18 of 47
WHY NO PREPROCESSING?
David F. Gleich (Purdue) Emory Math/CS Seminar
Top-k predicted “links”
are movies to watch!
Pairwise scores give
user similarity
19 of 47
PAIR-WISE ALGORITHMS
David F. Gleich (Purdue) Emory Math/CS Seminar 20 / 47
PAIRWISE ALGORITHMS
Katz
Commute
David F. Gleich (Purdue) Emory Math/CS Seminar
Golub and Meurant
to the rescue!
21 of 47
MMQ - THE BIG IDEA
Quadratic form
Weighted sum
Stieltjes integral
Quadrature approximation
Matrix equation David F. Gleich (Purdue) Emory Math/CS Seminar
Think
A is s.p.d. use EVD
“A tautology”
Lanczos
22 of 47
LANCZOS
, k-steps of the Lanczos method produce
and
David F. Gleich (Purdue) Emory Math/CS Seminar
=
23 of 47
PRACTICAL LANCZOS
Only need to store the last 2 vectors in
Updating requires O(matvec) work
is not orthogonal
David F. Gleich (Purdue) Emory Math/CS Seminar 24 of 47
MMQ PROCEDURE
Goal
Given
1. Run k-steps of Lanczos on starting with
2. Compute , with an additional eigenvalue at ,
set 3. Compute , with an additional eigenvalue at , set
4. Output as lower and upper bounds on
David F. Gleich (Purdue) Emory Math/CS Seminar
Correspond to a Gauss-Radau rule, with
u as a prescribed node
Correspond to a Gauss-Radau rule, with
l as a prescribed node
25 of 47
PRACTICAL MMQ
Increase k to become more accurate
Bad eigenvalue bounds yield worse results
and are easy to compute
not required, we can iteratively
update it’s LU factorization
David F. Gleich (Purdue) Emory Math/CS Seminar 26 of 47
PRACTICAL MMQ
David F. Gleich (Purdue) Emory Math/CS Seminar 27 of 47
ONE LAST STEP FOR KATZ
Katz
David F. Gleich (Purdue) Emory Math/CS Seminar 28 of 47
COLUMN-WISE
ALGORITHMS
David F. Gleich (Purdue) Emory Math/CS Seminar 29 / 47
COLUMN-WISE COMMUTE-TIME
requires entire diagonal of
David F. Gleich (Purdue) Emory Math/CS Seminar
Paige and Saunders, 1975
Each vector is computed by the a Lanczos based CG algorithm.
30 of 47
=
COLUMN-WISE COMMUTE TIME
David F. Gleich (Purdue) Emory Math/CS Seminar
following Hein et al. 2010 is a MUCH better and faster approximation.
31
KATZ SCORES ARE LOCALIZED
David F. Gleich (Purdue) Emory Math/CS Seminar
Up to 50 neighbors is 99.65% of the total mass
32 of 47
PARTICIPATION RATIOS
David F. Gleich (Purdue) Emory Math/CS Seminar
Participation
Ratios
“effective non-zeros” in a vector
33 of 47
TOP-K ALGORITHM FOR KATZ
Approximate
where is sparse
Keep sparse too
Ideally, don’t “touch” all of
David F. Gleich (Purdue) Emory Math/CS Seminar 34 of 47
INSPIRATION - PAGERANK
Approximate
where is sparse
Keep sparse too? YES!
Ideally, don’t “touch” all of ? YES!
David F. Gleich (Purdue) Emory Math/CS Seminar
McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too.
35 of 47
THE ALGORITHM - MCSHERRY
For
Start with the Richardson iteration
Rewrite
Richardson converges if
David F. Gleich (Purdue) Emory Math/CS Seminar 36 of 47
THE ALGORITHM
Note is sparse.
If , then is sparse.
Idea
only add one component of to
David F. Gleich (Purdue) Emory Math/CS Seminar 37 of 47
THE ALGORITHM
For
Init:
How to pick ?
David F. Gleich (Purdue) Emory Math/CS Seminar 38 of 47
THE ALGORITHM FOR KATZ
For
Init:
Pick as max David F. Gleich (Purdue) Emory Math/CS Seminar
Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more
39 of 47
CONVERGENCE?
If 1/max-degree then is sub-stochastic and the PageRank based proof applies because the matrix is diagonally dominant
For , then for symmetric , this algorithm is the Gauss-Southwell procedure and it still converges.
David F. Gleich (Purdue) Emory Math/CS Seminar
40 of 47
NONSYMMETRIC ALGORITHM?
Was this algorithm known in the non-symmetric case?
YES! It’s just Gauss-Seidel/Gauss-Southwell, but with a column vs. row orientation.
Called dynamic residual to AMGers
David F. Gleich (Purdue) Emory Math/CS Seminar 41 of 47
RESULTS – DATA, PARAMETERS
All unweighted, connected graphs
Easy
Hard
David F. Gleich (Purdue) Emory Math/CS Seminar 42 of 47
KATZ BOUND CONVERGENCE
David F. Gleich (Purdue) Emory Math/CS Seminar 43 of 47
COMMUTE BOUND CONVERG.
David F. Gleich (Purdue) Emory Math/CS Seminar 44 of 47
KATZ SET CONVERGENCE
David F. Gleich (Purdue) Emory Math/CS Seminar
For arXiv graph.
45 of 47
TIMING
David F. Gleich (Purdue) Emory Math/CS Seminar 46 of 47
CONCLUSIONS
These algorithms are faster than many alternatives.
For pairwise commute, stopping criteria are simpler with bounds
For top-k problems, we often need less than 1 matvec for good enough results
David F. Gleich (Purdue) Emory Math/CS Seminar 47 of 47
By AngryDogDesign on DeviantArt
Paper at WAW2010 and J. Internet Mathematics
Slides online
Code is online already
www.cs.purdue.edu/
homes/dgleich/codes/fast-katz-2011
David F. Gleich (Purdue) Emory Math/CS Seminar 48