Fast and Accurate Influence Maximization
on Large Networks
with Pruned Monte-Carlo Simulations
Naoto Ohsaka (UTokyo)
Takuya Akiba (UTokyo)
Yuichi Yoshida (NII & PFI)
Ken-ichi Kawarabayashi (NII)
JST, ERATO, Kawarabayashi Large Graph Project
2014/7/30 AAAI-14 @ Québec, Canada
Influence Maximization [Kempe, Kleinberg, Tardos. KDD'03]
Input
Directed graph G = (V, E)
Edge probability p_e (e ∈ E)
Size of seed set k
Problem
maximize σ(S) subject to |S| ≤ k
σ(·): the spread of influence
[Figure: example directed graph on vertices A-F with edge probabilities between 0.1 and 0.8]
Motivation
Viral (word-of-mouth) Marketing [Domingos, Richardson. KDD'01], [Richardson, Domingos. KDD'02]
Q. How to find a small group of influential individuals?
mathematically formalizing
Independent Cascade Model [Goldenberg, Libai, Muller. Marketing Letters'01]
Each vertex has 2 states (inactive / active)
Diffusion Process
0. Activate vertices in S ⊆ V, called the seed set
1. Active vertex u activates inactive vertex v with probability p_uv (single trial)
2. Repeat 1 while new activations occur
[Figure: active vertex u tries to activate inactive vertex v with p_uv = 0.1; the trial ends in success or failure]
Influence spread σ(S)
Expected number of active vertices given a seed set S
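The diffusion process above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the adjacency layout (`graph[u]` as a list of `(v, p_uv)` pairs) and the function names are assumptions made for this example.

```python
import random
from collections import deque


def simulate_ic(graph, seeds, rng):
    """Run one Independent Cascade diffusion and return the set of active vertices.

    graph: dict mapping u to a list of (v, p_uv) pairs (hypothetical layout).
    """
    active = set(seeds)
    queue = deque(active)
    while queue:
        u = queue.popleft()
        for v, p_uv in graph.get(u, []):
            # A newly active u gets a single trial on each inactive neighbor v.
            if v not in active and rng.random() < p_uv:
                active.add(v)
                queue.append(v)
    return active


def estimate_spread_by_simulation(graph, seeds, num_simulations, rng):
    """Average the number of active vertices over repeated simulations."""
    total = 0
    for _ in range(num_simulations):
        total += len(simulate_ic(graph, seeds, rng))
    return total / num_simulations
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing seed sets.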
Example of
Independent Cascade Model
[Figure: three snapshots of a cascade on the example graph (vertices A-F, edge probabilities between 0.1 and 0.8). Legend: seed, inactive, and active vertices; successful and failed activation attempts]
Previous Results
Hardness
Influence Maximization is NP-hard [Kempe, Kleinberg, Tardos. KDD'03]
Exact computation of σ(·) is #P-hard [Chen, Wang, Wang. KDD'10]
Original Greedy Approach
Greedy Algorithm [Kempe, Kleinberg, Tardos. KDD'03]: approx. ratio ≈ 63%
Monte-Carlo Simulations: good approximation
Original Greedy Approach
Monte-Carlo Simulations ((1 ± ε)-approximation) [Kempe, Kleinberg, Tardos. KDD'03]
Simulating the diffusion process repeatedly
Averaging # of active vertices
Greedy Algorithm [Kempe, Kleinberg, Tardos. KDD'03]
S ← ∅
while |S| < k do
    t ← argmax_{v ∈ V} σ(S ∪ {v}) − σ(S)
    S ← S ∪ {t}
Due to submodularity of σ(·): σ(S) ≥ (1 − 1/e) OPT ≥ 0.63 OPT [Nemhauser, Wolsey, Fisher. Mathematical Programming'78]
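The greedy loop can be written against any estimator of σ(·). The sketch below assumes a callable `spread(S)` supplied by the caller (a hypothetical interface, e.g. a Monte-Carlo estimator) and breaks ties by taking the first vertex encountered.

```python
def greedy_seed_selection(vertices, k, spread):
    """Greedily pick up to k seeds, maximizing the marginal gain of spread(S)."""
    seeds = []
    current = spread(seeds)
    for _ in range(min(k, len(vertices))):
        best_vertex, best_gain = None, float("-inf")
        for v in vertices:
            if v in seeds:
                continue
            # Marginal gain sigma(S + {v}) - sigma(S).
            gain = spread(seeds + [v]) - current
            if gain > best_gain:
                best_vertex, best_gain = v, gain
        seeds.append(best_vertex)
        current += best_gain
    return seeds


def make_coverage_spread(reach):
    """Toy deterministic spread: size of the union of fixed reachable sets."""
    def spread(seed_list):
        return len(set().union(*(reach[s] for s in seed_list)))
    return spread
```

The toy `make_coverage_spread` is only there to exercise the loop; each round of the loop calls `spread` once per remaining candidate vertex.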
Produces near-optimal (1 − 1/e − ε′)-approximate solutions
Issue: Original Greedy Approach Suffers from Scalability
Greedy Algorithm
# of evaluations of σ(·): kn
Monte-Carlo Simulations
Computation time of σ(·): O(mR)
Total time: O(knmR) (R ≈ 10,000)
n = |V| > 10^6
m = |E| > 10^7
k: # of seeds
R = poly(ε^-1): # of simulations
TOO SLOW
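To get a feel for the O(knmR) bound, the arithmetic can be worked out with the lower bounds on this slide and k = 50 (the seed set size used in the experiments). The rate of 10^9 elementary steps per second is an assumption for illustration only, and constant factors are ignored.

```python
k = 50          # seed set size used in the experiments
n = 10**6       # |V|, lower bound from the slide
m = 10**7       # |E|, lower bound from the slide
R = 10_000      # number of simulations per evaluation of sigma

total_steps = k * n * m * R

# Assumed machine speed; purely illustrative.
assumed_steps_per_second = 10**9
seconds = total_steps / assumed_steps_per_second
years = seconds / (3600 * 24 * 365)

print(f"{total_steps:.1e} steps, roughly {years:.0f} years at the assumed rate")
```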
Previous Methods
for Influence Maximization
Simulation-based (high quality, slow)
Greedy Approach [Kempe, Kleinberg, Tardos. KDD'03]
CELF [Leskovec, Krause, Guestrin, Faloutsos, VanBriesen, Glance. KDD'07]
StaticGreedyDU [Cheng, Shen, Huang, Zhang, Cheng. CIKM'13]
Heuristic-based (low quality, fast)
DegreeDiscount [Chen, Wang, Yang. KDD'09]
PMIA [Chen, Wang, Wang. KDD'10]
SAEDV [Jiang, Song, Cong, Wang, Si, Xie. AAAI'11]
IRIE [Jung, Heo, Chen. ICDM'12]
CHALLENGE: fast and high quality
Our Contribution
Propose a simulation-based fast algorithm
Fast
Comparable to heuristics
Can handle graphs
with 60M edges in 20 min.
Accurate
Has a theoretical guarantee
Better than heuristics
Outline of Proposed Method
Preprocessing: Generating random graphs
Greedy Strategy
S ← ∅
while |S| < k do
    t ← argmax_{v ∈ V} σ(S ∪ {v}) − σ(S)
    S ← S ∪ {t}
The argmax step uses: Our Speed-up Techniques, Coin Flip Technique
Preprocessing:
Generating Random Graphs
Edge e lives w.p. p_e
Input graph G ⇒ R random graphs G_1, …, G_R
[Figure: the input graph G on vertices A-F and random graphs sampled from it]
Coin Flip Technique [Kempe, Kleinberg, Tardos. KDD'03]
Computing influence spread σ(S) is equivalent to counting, in expectation, the # of vertices reachable from S on a random graph
live edge: success
blocked edge: failure
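The preprocessing step (flip one coin per edge, keep the live edges) might look as follows. The edge-list layout `(u, v, p_e)` and the function names are assumptions for this sketch, not the authors' code.

```python
import random


def sample_live_edge_graph(edges, rng):
    """Flip one coin per edge; return the adjacency lists of live edges only.

    edges: list of (u, v, p_e) triples (hypothetical layout).
    """
    live = {}
    for u, v, p_e in edges:
        if rng.random() < p_e:          # success: the edge is live
            live.setdefault(u, []).append(v)
        # failure: the edge is blocked and simply left out
    return live


def sample_random_graphs(edges, num_graphs, seed=0):
    """Generate num_graphs independent random graphs G_1, ..., G_R."""
    rng = random.Random(seed)
    return [sample_live_edge_graph(edges, rng) for _ in range(num_graphs)]
```

Because the coins are flipped once up front, every later reachability query on G_i sees the same live edges.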
How to Approximate σ(S)
σ(S) ≈ (1/R) Σ_{i=1}^{R} r_{G_i}(S)
r_{G_i}(S) = # of vertices reachable from S on G_i
[Table: one row per vertex v (A-F in the example); columns r_{G_1}(v), …, r_{G_R}(v) and the resulting estimate of σ(v)]
CHALLENGE
Computing this table as fast as possible (R = 200 columns, 10^6 rows)
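Given already-sampled random graphs, the estimate of σ(S) is an average of reachable-set sizes. The sketch below uses a plain BFS per graph (the naive baseline, without any pruning); the adjacency-dict layout and function names are assumptions.

```python
from collections import deque


def reachable_count(live_graph, seeds):
    """r_G(S): number of vertices reachable from the seed set on one random graph."""
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in live_graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)


def estimate_spread(random_graphs, seeds):
    """Average r_{G_i}(S) over the R sampled graphs."""
    total = sum(reachable_count(g, seeds) for g in random_graphs)
    return total / len(random_graphs)
```

Filling the whole table this way costs one BFS per vertex per random graph, which is the cost the speed-up techniques aim to reduce.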
Proposed Speed-up Techniques (applied to each random graph)
1. Pruned BFS for reachability tests on random graphs (we will focus on this; CORE IDEA of our paradigm)
[Akiba, Iwata, Yoshida. SIGMOD'13]
[Yano, Akiba, Iwata, Yoshida. CIKM'13]
[Akiba, Iwata, Kawarabayashi, Kawata. ALENEX'14]
2. Reducing unnecessary influence recomputations
3. Reducing # of random graphs by the Sample Average Approximation approach [Kimura, Saito, Nakano. AAAI'07], [Cheng, Shen, Huang, Zhang, Cheng. CIKM'13], [Sheldon et al. UAI'10]
We provide a nice theoretical bound
These techniques do NOT affect the estimation of σ(·)
Pruned BFS
Idea: Most BFSs are redundant
Preprocessing: Compute ancestors and descendants of the vertex H with max. degree
Pruning (BFS from v): If v is an ancestor of H, we ignore descendants of H
[Figure: example graph with hub H and vertices A-F]
(# of vertices visited during BFS) + (# of descendants of H, precomputed) = 2 + 4 in the example
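A simplified sketch of the pruning rule on a general directed graph. Assumptions for this example: "max. degree" is taken as in-degree plus out-degree, the graph is an adjacency dict, and all function names are hypothetical. Counting is correct because any vertex reached through a descendant of H is itself a descendant of H.

```python
from collections import deque


def bfs(adj, source, blocked=frozenset()):
    """Vertices reachable from source without ever entering `blocked`."""
    if source in blocked:
        return set()
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen and v not in blocked:
                seen.add(v)
                queue.append(v)
    return seen


def reverse(adj):
    rev = {}
    for u, outs in adj.items():
        for v in outs:
            rev.setdefault(v, []).append(u)
    return rev


def max_degree_vertex(adj):
    """Hub choice assumed here: largest in-degree + out-degree."""
    degree = {}
    for u, outs in adj.items():
        degree[u] = degree.get(u, 0) + len(outs)
        for v in outs:
            degree[v] = degree.get(v, 0) + 1
    return max(degree, key=degree.get)


def pruned_reach_counts(adj, vertices):
    """Number of vertices reachable from each vertex, using the hub to prune."""
    hub = max_degree_vertex(adj)
    descendants = bfs(adj, hub)            # precomputed once, includes the hub
    ancestors = bfs(reverse(adj), hub)     # precomputed once, includes the hub
    counts = {}
    for v in vertices:
        if v in ancestors:
            # Skip the hub's descendants and add their precomputed number.
            visited = bfs(adj, v, blocked=descendants)
            counts[v] = len(visited) + len(descendants)
        else:
            counts[v] = len(bfs(adj, v))
    return counts
```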
Is Pruned BFS Really Effective?
For Path Graphs
Pruned BFS is NOT effective: Θ(n^2)
But, for Social Networks
Pruned BFS works effectively since there is a hub (or giant component)
[Figure: a path graph with vertex H, and a social network whose giant component contains H]
Effect of Pruned BFS on Social Networks
(LiveJournal dataset, |V| = 4.8M, |E| = 69M, p_e = 0.1 ∀e)
Average # of visited vertices (from each vertex):
400,000 (Naive BFS) ⇒ 6 (Pruned BFS)
[Figure: # of vertices visited during Naive and Pruned BFSs; the giant component contains H]
Experiments: Influence Spread
We set p_e = P for every edge. Size of seed set = 50
Ours & StaticGreedyDU give the best results

Dataset                 Ours (this work)   StaticGreedyDU [Cheng+'13]   IRIE [Jung+'12]   PMIA [Chen+'10]   SAEDV [Jiang+'11]
DBLP, P = 0.01          332                330                          323               317               76
DBLP, P = 0.1           100076             --                           99533             99505             99579
LiveJournal, P = 0.01   47527              --                           41906             40544             26066
LiveJournal, P = 0.1    1686629            --                           1682436           --                1682242

Dataset       |V|    |E|
DBLP          655K   2.0M
LiveJournal   4.8M   69M
significantly
better
Experiments: Running Time [s]
We set p_e = P for every edge. Size of seed set = 50
Environment: Intel Xeon X5670 (2.93 GHz), 48 GB, Language: C++
As fast as heuristics
Robust against the value of P

Dataset                 Ours (this work)   StaticGreedyDU [Cheng+'13]   IRIE [Jung+'12]   PMIA [Chen+'10]   SAEDV [Jiang+'11]
DBLP, P = 0.01          27                 117                          77                4                 388
DBLP, P = 0.1           52                 OOM                          77                289               388
LiveJournal, P = 0.01   327                OOM                          1622              500               1275
LiveJournal, P = 0.1    663                OOM                          1635              OOM               1294
(OOM: out of memory)
Future Work
Applying our method to other diffusion models
Parallelization
Analysis of Pruned BFS on social networks
Supplement
Running Time [s] for Each Variant of Our Method

Dataset                 Pruned BFS + Technique 2   Naive BFS + Technique 2   Pruned BFS   Naive BFS
DBLP, P = 0.01          27                         26                        149          158
DBLP, P = 0.1           54                         3036                      306          3275
LiveJournal, P = 0.01   327                        1934                      2176         3820
LiveJournal, P = 0.1    634                        272518                    2426         272973
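Construct a Vertex-weighted DAG from a Random Graph
Strongly Connected Component Decomposition
[Figure: a random graph whose strongly connected components are contracted into weighted vertices A, B, C of a DAG]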
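One way to build such a vertex-weighted DAG is sketched below. Kosaraju's two-pass algorithm is used here as an assumption (the slides do not say which SCC algorithm is used), the weight of a DAG vertex is taken to be the size of its component, and all names are hypothetical.

```python
def condense(adj, vertices):
    """Contract strongly connected components into a vertex-weighted DAG.

    Returns (component_of, weights, dag): component id per vertex,
    component sizes, and DAG adjacency sets between component ids.
    """
    # Pass 1: iterative DFS, recording vertices in order of completion.
    order, seen = [], set()
    for root in vertices:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj.get(root, ())))]
        while stack:
            u, neighbors = stack[-1]
            for v in neighbors:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj.get(v, ()))))
                    break
            else:
                # All neighbors of u are explored: u is finished.
                order.append(u)
                stack.pop()

    # Build the reversed graph.
    rev = {}
    for u in vertices:
        for v in adj.get(u, ()):
            rev.setdefault(v, []).append(u)

    # Pass 2: sweep the reversed graph in decreasing completion order.
    component_of, weights = {}, []
    for root in reversed(order):
        if root in component_of:
            continue
        cid = len(weights)
        component_of[root] = cid
        size, stack = 0, [root]
        while stack:
            u = stack.pop()
            size += 1
            for v in rev.get(u, ()):
                if v not in component_of:
                    component_of[v] = cid
                    stack.append(v)
        weights.append(size)

    # Keep only edges that cross between different components.
    dag = {cid: set() for cid in range(len(weights))}
    for u in vertices:
        for v in adj.get(u, ()):
            if component_of[u] != component_of[v]:
                dag[component_of[u]].add(component_of[v])
    return component_of, weights, dag
```

On the resulting DAG, the number of original vertices reachable from a component is the sum of the weights of the DAG vertices reachable from it.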
Other Models for Information Diffusion
Linear Threshold Model [Kempe, Kleinberg, Tardos. KDD'03]
Inactive vertex v becomes active if Σ_{u: active neighbor of v} b_uv ≥ θ_v
θ_v: Threshold chosen from [0, 1] uniformly at random
Equivalent to reachability tests on random graphs
Independent Cascade with Meeting Events [Chen, Lu, Zhang. AAAI'12]
Maximizing the influence spread within a given deadline
We have to consider shortest paths (not only reachability)
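For the Linear Threshold Model, the equivalent random graphs can be sampled by letting each vertex keep at most one incoming edge, chosen with probability equal to its weight, as in the live-edge construction of [Kempe, Kleinberg, Tardos. KDD'03]. The sketch assumes the incoming weights of every vertex sum to at most 1; the input layout and function name are hypothetical.

```python
import random


def sample_lt_live_edge_graph(in_weights, rng):
    """Sample one live-edge graph for the Linear Threshold Model.

    in_weights: dict mapping v to a list of (u, b_uv) pairs whose weights
    sum to at most 1 (assumed). Returns adjacency lists u -> [v, ...].
    """
    live = {}
    for v, incoming in in_weights.items():
        x = rng.random()
        cumulative = 0.0
        for u, b_uv in incoming:
            cumulative += b_uv
            if x < cumulative:
                # v keeps the edge (u, v); every other incoming edge is blocked.
                live.setdefault(u, []).append(v)
                break
        # If no edge was chosen, v keeps no incoming edge in this sample.
    return live
```

Reachability counts on graphs sampled this way can then be averaged just as for the Independent Cascade Model.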
Running Time for Each Value of P
[Figure: running time versus the value of P]
A Social Network
http://www.cise.ufl.edu/research/sparse/matrices/SNAP/soc-LiveJournal1.html