partialorderstructureand informationgeometry - mahito · november16,2016 ibis2016...
Post on 19-Jun-2020
0 Views
Preview:
TRANSCRIPT
November 16, 2016IBIS2016
Partial Order Structure andInformation Geometry(順序構造と情報幾何)
Mahito Sugiyama (ISIR, Osaka University, PRESTO)(杉山 麿人;大阪大学産業科学研究所, JSTさきがけ)
Today’s Model on Poset (S, ≤)
log p(x) =s∈S
ζ(s, x)θ(s)
p(x) =s∈S
µ(x , s)η(s)
1/39
Today’s Model on Poset (S, ≤)
log p(x) =s∈S
ζ(s, x)θ(s)
p(x) =s∈S
µ(x , s)η(s)
Probability
Coe�cient of log-linear model(Bias/weight in Boltzmann machines)(Natural parameter of exponential family)
Zeta function
Möbius function Expectation(Frequency in pattern mining)(Su�cient statistics in exponential family)
log p(x) =s∈S
ζ(s, x)θ(s)
p(x) =s∈S
µ(x , s)η(s)
1/39
Outcome
• Given a poset (S , ≤) and consider distributions on S– The least element⊥ ∈ S is assumed
1. KL divergence decomposition:DKL[P, R] = DKL[P, Q] + DKL[Q , R]with Q s.t. θQ (x) = θR(x) or ηQ (x) = ηP(x) for all x ∈ S \ {⊥}
2. The set of probability distributions on (S , ≤) isa dually flat manifold w.r.t. θ and η– p, θ, and η are coordinate systems– θ and η are orthogonal– θ introduces the structure of exponential family– η introduces the structure of mixture family
2/39
Partially Ordered Sets
∅
{x, y, z}
{x, z} {y, z}
{y} {z}
{x, y}
{x}
Power set
3/39
Partially Ordered Sets
∅
{x, y, z}
{x, z} {y, z}
{y} {z}
{x, y}
{x}
Power set
0
1
2
3
Positive integers
3/39
Partially Ordered Sets
∅
{x, y, z}
{x, z} {y, z}
{y} {z}
{x, y}
{x}
Power set
0
1
2
3
Positive integers
01 10
0 1
1100
111110101100011010001000
λ
Pre�xes
3/39
Partially Ordered Sets
∅
{x, y, z}
{x, z} {y, z}
{y} {z}
{x, y}
{x}
Power set
0
1
2
3
Positive integers
01 10
0 1
1100
111110101100011010001000
λ
Pre�xes
Directed Acyclic Graph
3/39
Posets with Probability Distribution
Probability distributionon posets (partially ordered sets)
Prob
abili
ty
4/39
Posets with Probability Distribution
Probability distributionon posets (partially ordered sets)
Prob
abili
ty
Informationgeometry
log p(x) = ∑ ζ(s, x)θ(s)Decomposition inthe log-linear model
4/39
Posets with Probability Distribution
Probability distributionon posets (partially ordered sets)
Prob
abili
ty
log p(x) = ∑ ζ(s, x)θ(s)Decomposition inthe log-linear model
(e.g. Neurons, SNPs, ...)
0 0 11 0 01 1 10 0 01 1 00 1 11 0 11 0 11 0 11 1 0
Numerical score(KL divergence)and the p-valuefor higher-orderintractions
∅
{x, y, z}{x, z}
{y, z}
{y}
{z}
{x, y}
{x}
∅
{x, y, z}{x, z}
{y, z}
{y}
{z}
{x, y}
{x}
4/39
ID 1:ID 2:ID 3:ID 4:ID 5:ID 6:ID 7:ID 8:ID 9:ID10:
Binary vectors(Transactiondatabase)
1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
Poset (itemset lattice)
5/39
ID 1:ID 2:ID 3:ID 4:ID 5:ID 6:ID 7:ID 8:ID 9:ID10:
Binary vectors(Transactiondatabase)
1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
Poset (itemset lattice)
1.0
0.9 0.7 0.5
0.6 0.5 0.3
Frequency = 0.3
6/39
ID 1:ID 2:ID 3:ID 4:ID 5:ID 6:ID 7:ID 8:ID 9:ID10:
Binary vectors(Transactiondatabase)
1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
Poset (itemset lattice)
1.0
0.9 0.7 0.5
0.6 0.5 0.3
Frequency = 0.3
7/39
ID 1:ID 2:ID 3:ID 4:ID 5:ID 6:ID 7:ID 8:ID 9:ID10:
Binary vectors(Transactiondatabase)
1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
Poset (itemset lattice)
1.0
0.9 0.7 0.5
0.6 0.5 0.3
Frequency = 0.3
, 0.0
0.1 0.1 0.0
0.3 0.2 0.0
Probability = 0.3
7/39
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
Poset (itemset lattice)
1.0
0.9 0.7 0.5
0.6 0.5 0.3
Frequency = 0.3
, 0.0
0.1 0.1 0.0
0.3 0.2 0.0
Probability = 0.3
Upward =Pattern mining
η( ) = p( ) + p( ){ , } { , } { , , }
η: Frequencyp: Probability
7/39
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
Poset (itemset lattice)
1.0
0.9 0.7 0.5
0.6 0.5 0.3
Frequency = 0.3
, 0.0
0.1 0.1 0.0
0.3 0.2 0.0
Probability = 0.3
Upward =Pattern mining
η( ) = p( ) + p( ){ , } { , } { , , }
η: Frequencyp: Probabilityη: Frequencyp: Probabilityθ: Coe�cient of
log-linear model
log p( ) = θ( ) + θ( ) + θ( ) + θ(∅){ , } { , } { } { }
Downward =Log-linear analysis
8/39
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
log p(x) = ∑ ζ(s, x)θ(s)
9/39
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
log p(x) = ∑ ζ(s, x)θ(s)
p(x) = exp( ∑ θ(s)Fs(x) – ψ(θ) )Exponentialfamily:
Naturalparameter
e.g. Gaussian
10/39
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
log p(x) = ∑ ζ(s, x)θ(s)
p(x) = exp( ∑ θ(s)Fs(x) – ψ(θ) )Exponentialfamily:
Naturalparameter
e.g. Gaussian
Su�cientstatistics ofexponentialfamily
η(x) = �[ Fx(s) ]
η(x) = ∑ ζ(x, s)p(s)
11/39
Möbius Inversion on Posets
• Zeta function ζ∶S × S → {0, 1}:ζ(s, x) = { 1 if s ≤ x ,
0 otherwise
• Möbius function µ∶S × S → Z, defined as µ = ζ −1:
µ(x , y) = ⎧⎪⎪⎪⎨⎪⎪⎪⎩1, if x = y,−∑x≤s<y µ(x , s) if x < y,0 otherwise
• The Möbius inversion formula [Rota (1964)]:g(x) = ∑
s∈Sζ(s, x)f (s) ⇔ f (x) = ∑
s∈Sµ(s, x)g(s)
12/39
Möbius Function Is Generalization ofInclusion-Exclusion Principle
• For sets A, B, C ,∣A ∪ B ∪ C∣ = ∣A∣ + ∣B∣ + ∣C∣ − ∣A ∩ B∣ − ∣B ∩ C∣ − ∣A ∩ C∣+ ∣A ∩ B ∩ C∣
• In general, for A1 , A2 , . . . , An ,»»»»»»»»»»⋃i A i
»»»»»»»»»» = ∑J⊆{1, . . . ,n}, J/=∅(−1)∣J∣−1
»»»»»»»»»»»⋂j∈J A j
»»»»»»»»»»»• The Möbius function µ is the generalization of “(−1)∣J∣−1”
13/39
Log-linear Model with Möbius Inversion
• Log-linear model and its sufficient statistics:
log p(x) = ∑s∈S
ζ(s, x)θ(s) = ∑s≤x
θ(s),η(x) = ∑
s∈Sζ(x , s)p(s) = ∑
s≥xp(s)
– Generalization of the log-linear model on binary vectors:
log p(x) = ∑i
θ i x i +∑i< j
θ i j x i x j + ⋅ ⋅ ⋅ + θ 1. . .n x 1x2 . . . xn ,
• From the Möbius inversion formula,θ(x) = ∑
s∈Sµ(s, x) log p(s), p(x) = ∑
s∈Sµ(x , s)η(s)
14/39
∅
{ , }
{ } { }
0.30.5
1.10
0.20.2
–1.79
0.11.0
–2.30
0.40.6
1.39
pηθ
Triple for each node
η(x) = ∑ p(s)s ≥ x
log p(x) = ∑ θ(s)s ≤ x
15/39
∅
{ , }
{ } { }
0.30.5
1.10
0.20.2
–1.79
0.11.0
–2.30
0.40.6
1.39
pηθ
Triple for each node
η(x) = ∑ p(s)s ≥ x
log p(x) = ∑ θ(s)s ≤ x
{ }
{ }p( )
p( )
{ , }p( )
15/39
∅
{ , }
{ } { }
0.30.5
1.10
0.20.2
–1.79
0.11.0
–2.30
0.40.6
1.39
pηθ
Triple for each node
η(x) = ∑ p(s)s ≥ x
log p(x) = ∑ θ(s)s ≤ x
{ }
{ }p( )
p( )
{ , }p( )0.3
0.20.2
0.40.4
Probability distributionis a “point” in 3D space
16/39
∅
{ , }
{ } { }
0.30.5
1.10
0.20.2
–1.79
0.11.0
–2.30
0.40.6
1.39
pηθ
Triple for each node
η(x) = ∑ p(s)s ≥ x
log p(x) = ∑ θ(s)s ≤ x
{ }
{ }η( )
η( )
{ , }η( )
0.20.2
0.60.6
0.50.5
Probability distributionis a “point” in 3D space
17/39
∅
{ , }
{ } { }
0.30.5
1.10
0.20.2
–1.79
0.11.0
–2.30
0.40.6
1.39
pηθ
Triple for each node
η(x) = ∑ p(s)s ≥ x
log p(x) = ∑ θ(s)s ≤ x
{ }
{ }θ( )
θ( )
{ , }θ( )
–1.79–1.791.391.39
1.101.10
Probability distributionis a “point” in 3D space
18/39
∅
{ , }
{ } { }
0.30.5
1.10
0.20.2
–1.79
0.11.0
–2.30
0.40.6
1.39
pηθ
Triple for each node
η(x) = ∑ p(s)s ≥ x
log p(x) = ∑ θ(s)s ≤ x
{ , }
{ }
{ }
one-to-one
19/39
∅
{ , }
{ } { }
0.30.5
1.10
0.20.2
–1.79
0.11.0
–2.30
0.40.6
1.39
pηθ
Triple for each node
η(x) = ∑ p(s)s ≥ x
log p(x) = ∑ θ(s)s ≤ x
{ , }
one-to-one
{ }θ( ){ }
{ }η( )θ and η aredually orthogonal
19/39
Orthogonality of θ and η
• FromMöbius inversion,
∑s∈S
ζ(x , s)µ(s, y) = δx ,y , δx ,y = { 1 if x = y,0 otherwise
• θ and η are dually orthogonal:
E [ ∂∂θ(x) log p(s) ∂
∂η(y) log p(s)] = ∑s∈S
ζ(x , s)µ(s, y) = δx ,y
• Partial order structure leads to the samedually flat structure with the exponential family
20/39
Existing Approach Limited To Power Set
∅
{x, y, z}
{x, z} {y, z}
{y} {z}
{x, y}
{x}
Power set
21/39
Our Approach Applies To Any Posets
∅
{x, y, z}
{y, z}
{y}
{x, y}
{x}
Subset of power set
22/39
Our Approach Applies To Any Posets
0
1
2
3
Positive integers
01 10
0 1
1100
111110101100011010001000
λ
Pre�xes
Directed Acyclic Graph
∅
{x, y, z}
{y, z}
{y}
{x, y}
{x}
Subset of power set
22/39
KL Divergence Decomposition
• KL divergence decomposition:DKL[P, R] = DKL[P, Q] + DKL[Q , R]with Q s.t. θQ (x) = θR(x) or ηQ (x) = ηP(x) for all x ∈ S \ {⊥}– Q is called the mixed distribution of (P, R)– It is known as the (generalized) Pythagoras theoremin Information Geometry
• We can derive fromMöbius inversion:DKL[P, Q] + DKL[Q , R] − DKL[P, R]
= ∑s∈S
(ηQ (s) − ηP(s)) (θQ (s) − θR(s))23/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.70.81.95
0.10.1
–1.950.10.20.0
Dist. P Dist. Rpηθ
24/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.70.81.95
0.10.1
–1.950.10.20.0
Dist. P Dist. Rpηθ
choose ηchoose θMixed
distribution Q
∅
??????1.95
???0.2???
??????0.0
24/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.70.81.95
0.10.1
–1.950.10.20.0
Dist. P Dist. Rpηθ
choose ηchoose θMixed
distribution Q
∅
0.620.51.95
0.0890.60.0
–1.13
0.20.2
–1.13
25/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.70.81.95
0.10.1
–1.950.10.20.0
Dist. P Dist. Rpηθ
choose ηchoose θMixed
distribution Q
∅
0.620.51.95
0.0890.60.0
–1.13
0.20.2
–1.13
KL[P, R]
26/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.70.81.95
0.10.1
–1.950.10.20.0
Dist. P Dist. Rpηθ
choose ηchoose θMixed
distribution Q
∅
0.620.51.95
0.0890.60.0
–1.13
0.20.2
–1.13
KL[P, R] KL[P, Q]+ KL[Q, R]
=
Nonnegative decompositionof the KL divergence
27/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.70.81.95
0.10.1
–1.950.10.20.0
Dist. P Dist. Rpηθ
choose ηchoose θMixed
distribution Q
∅
0.620.51.95
0.0890.60.0
–1.13
0.20.2
–1.13
+ =
Nonnegative decompositionof the KL divergence
0.3946 0.04440.4390
28/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.250.50.0
0.250.250.0
0.250.50.0
Dist. P Uniform dist. P0pηθ
29/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.250.50.0
0.250.250.0
0.250.50.0
Dist. P Uniform dist. P0pηθ
choose η choose θ(KNOCK DOWN)
log p(x) = ∑ θ(s)s ≤ x
Log-linear model
∅
???0.5???
??????0.0
???0.6???
Mixeddistribution Q
29/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.250.50.0
0.250.250.0
0.250.50.0
Dist. P Uniform dist. P0pηθ
choose η choose θ(KNOCK DOWN)
log p(x) = ∑ θ(s)s ≤ x
Log-linear model
∅
0.20.50.0
0.30.30.0
0.30.60.405
Mixeddistribution R
Contribution of the node= KL[P, Q] = 0.086
30/39
∅
0.30.51.10
0.20.2
–1.790.40.61.39 ∅
0.250.50.0
0.250.250.0
0.250.50.0
Dist. P Uniform dist. P0pηθ
choose η choose θ(KNOCK DOWN)
log p(x) = ∑ θ(s)s ≤ x
Log-linear model
∅
0.20.50.0
0.30.30.0
0.30.60.405
Mixeddistribution R
Contribution of the node= KL[P, Q] = 0.086
The statistics λ: λ = 2·[sample size]·KL[P, Q]follows χ2-distributionwith d.f. [#nodes – 1]⇒ p-value can be obtained!
31/39
Poset of Subgraphs
32/39
Log-Linear Model on Subgraphs
log p(x) = ∑ θ(s)s ⊑ x
Log-linear model:
Natural parameterof exponential familyNatural parameterof exponential family
η(x) = ∑ p(s)s ⊒ x
Su�cient statisticsof exponential familySu�cient statisticsof exponential family
33/39
Information of Each Subgraph
η0 η1 η2 η3 η4 η5 ηkθ0 θ1 θ2 θ3 θ4 θ5 θk
P: Empirical distribution ηj
θi
Freq.Coef.
34/39
Information of Each Subgraph
η0 η1 η2 η3 η4 η5 ηkθ0 θ1 θ2 θ3 θ4 θ5 θk
P: Empirical distribution ηj
θi
Freq.Coef.
KL(P, Q)
η0 η1 η2 η3 ? η5 ηk? ? ? ? 0 ? ?
Q: Null distribution KL(P, Q)
Freq.Coef.
34/39
Information of Each Subgraph
η0 η1 η2 η3 η4 η5 ηkθ0 θ1 θ2 θ3 θ4 θ5 θk
P: Empirical distribution ηj
θi
Freq.Coef.
KL(P, Q)
η0 η1 η2 η3 ? η5 ηk? ? ? ? 0 ? ?
Q: Null distribution KL(P, Q)
Freq.Coef.
= KL(P, Q)= KL(P, R)+ KL(Q, Q
KL(Q, R)
KL(P, R)= KL(P, Q)+ KL(Q, R)
1.0 ? ? ? ? ? ?θ0’ 0 0 0 0 0 0
R: Uniform distribution KL(Q, R)
Freq.Coef.
34/39
Make a Poset from Data
ID 1:ID 2:ID 3:ID 4:ID 5:ID 6:ID 7:ID 8:ID 9:ID10:
Dataset
1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0
Number of nodes = 2#features
⇒combinatorial explosion!
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
1.0
0.9 0.7 0.5
0.6 0.5 0.3
Frequency = 0.3
, 0.0
0.1 0.1 0.0
0.3 0.2 0.0
Probability = 0.3
35/39
Make a Poset from Data
ID 1:ID 2:ID 3:ID 4:ID 5:ID 6:ID 7:ID 8:ID 9:ID10:
Dataset
1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0
∅
{ , }{ , }
{ } { }
{ , }
{ , , }
{ }
1.0
0.9 0.7 0.5
0.6 0.5 0.3
Frequency = 0.3
, 0.0
0.1 0.1 0.0
0.3 0.2 0.0
Probability = 0.3
Probability ≥ 0.2(user speci�ed threshold)
35/39
Remove Nodes with Probability 0
ID 1:ID 2:ID 3:ID 4:ID 5:ID 6:ID 7:ID 8:ID 9:ID10:
Dataset
1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0
∅
{ , }{ , }
{ , , }0.30.30.0
pηθ
0.30.6
0.405
0.20.50.0
0.21.0
–1.61
= 1 – ∑ p(s)
36/39
Example on Real Data (kosarak)
ID 1:ID 2:ID 3:ID 4:ID 5:
1 1 01 1 11 1 01 1 11 1 0
Sample size:990,002
# features: 41,270# nodes: 3,253(Threshold: 10–5)
# signi�cant interactions: 583Single feature: 537Pairwise interactions: 41Triple interactions: 5
Total runtime:4.95 seconds
37/39
Example on Real Data (accidents)
ID 1:ID 2:ID 3:ID 4:ID 5:
1 1 01 1 11 1 01 1 11 1 0
Sample size:340,183
# nodes: 281(Threshold: 5×10–6)
# signi�cant interactions: 280# features in each interactionis between 26 to 41
Total runtime:4.95 seconds
# features: 468
38/39
Conclusion
• A close connection between the partial order structureand information geometry– Möbius inversion leads to the dually flat manifolds
◦ M. Sugiyama, H. Nakahara, K. Tsuda,Information Decomposition on Structured Space,IEEE ISIT (2016)
◦ S. Amari, Information geometry on hierarchy ofprobability distributions, IEEE Trans. Info. Theory (2001)
◦ H. Nakahara, S. Amari, Information-geometric measurefor neural spikes, Neural Computation (2002)
• We can decompose the KL divergence andasses the significance on any posets
39/39
top related