partialorderstructureand informationgeometry - mahito · november16,2016 ibis2016...

November 16, 2016IBIS2016

Partial Order Structure andInformation Geometry（順序構造と情報幾何）

Mahito Sugiyama (ISIR, Osaka University, PRESTO)（杉山麿人;大阪大学産業科学研究所, JSTさきがけ）

Today’s Model on Poset (S, ≤)

log p(x) =s∈S

ζ(s, x)θ(s)

p(x) =s∈S

µ(x , s)η(s)

Today’s Model on Poset (S, ≤)

log p(x) =s∈S

ζ(s, x)θ(s)

p(x) =s∈S

µ(x , s)η(s)

Probability

Coe�cient of log-linear model(Bias/weight in Boltzmann machines)(Natural parameter of exponential family)

Zeta function

Möbius function Expectation(Frequency in pattern mining)(Su�cient statistics in exponential family)

log p(x) =s∈S

ζ(s, x)θ(s)

p(x) =s∈S

µ(x , s)η(s)

Outcome

• Given a poset (S , ≤) and consider distributions on S– The least element⊥ ∈ S is assumed

1. KL divergence decomposition:DKL[P, R] = DKL[P, Q] + DKL[Q , R]with Q s.t. θQ (x) = θR(x) or ηQ (x) = ηP(x) for all x ∈ S \ {⊥}

2. The set of probability distributions on (S , ≤) isa dually flat manifold w.r.t. θ and η– p, θ, and η are coordinate systems– θ and η are orthogonal– θ introduces the structure of exponential family– η introduces the structure of mixture family

Partially Ordered Sets

{x, y, z}

{x, z} {y, z}

{y} {z}

{x, y}

Power set

{x, y, z}

{x, z} {y, z}

{y} {z}

{x, y}

Power set

Positive integers

{x, y, z}

{x, z} {y, z}

{y} {z}

{x, y}

Power set

Positive integers

111110101100011010001000

Pre�xes

{x, y, z}

{x, z} {y, z}

{y} {z}

{x, y}

Power set

Positive integers

111110101100011010001000

Pre�xes

Directed Acyclic Graph

Posets with Probability Distribution

Probability distributionon posets (partially ordered sets)

Informationgeometry

log p(x) = ∑ ζ(s, x)θ(s)Decomposition inthe log-linear model

(e.g. Neurons, SNPs, ...)

0 0 11 0 01 1 10 0 01 1 00 1 11 0 11 0 11 0 11 1 0

Numerical score(KL divergence)and the p-valuefor higher-orderintractions

{x, y, z}{x, z}

{y, z}

{x, y}

{x, y, z}{x, z}

{y, z}

{x, y}

ID 1:ID 2:ID 3:ID 4:ID 5:ID 6:ID 7:ID 8:ID 9:ID10:

Binary vectors(Transactiondatabase)

1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0

{ , }{ , }

{ } { }

{ , , }

Poset (itemset lattice)

1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0

{ , }{ , }

{ } { }

{ , , }

0.9 0.7 0.5

0.6 0.5 0.3

Frequency = 0.3

1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0

{ , }{ , }

{ } { }

{ , , }

0.9 0.7 0.5

0.6 0.5 0.3

Frequency = 0.3

1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0

{ , }{ , }

{ } { }

{ , , }

0.9 0.7 0.5

0.6 0.5 0.3

Frequency = 0.3

0.1 0.1 0.0

0.3 0.2 0.0

Probability = 0.3

{ , }{ , }

{ } { }

{ , , }

0.9 0.7 0.5

0.6 0.5 0.3

Frequency = 0.3

0.1 0.1 0.0

0.3 0.2 0.0

Probability = 0.3

Upward =Pattern mining

η( ) = p( ) + p( ){ , } { , } { , , }

η: Frequencyp: Probability

{ , }{ , }

{ } { }

{ , , }

0.9 0.7 0.5

0.6 0.5 0.3

Frequency = 0.3

0.1 0.1 0.0

0.3 0.2 0.0

Probability = 0.3

Upward =Pattern mining

η( ) = p( ) + p( ){ , } { , } { , , }

η: Frequencyp: Probabilityη: Frequencyp: Probabilityθ: Coe�cient of

log-linear model

log p( ) = θ( ) + θ( ) + θ( ) + θ(∅){ , } { , } { } { }

Downward =Log-linear analysis

{ , }{ , }

{ } { }

{ , , }

log p(x) = ∑ ζ(s, x)θ(s)

{ , }{ , }

{ } { }

{ , , }

p(x) = exp( ∑ θ(s)Fs(x) – ψ(θ) )Exponentialfamily:

Naturalparameter

e.g. Gaussian

{ , }{ , }

{ } { }

{ , , }

p(x) = exp( ∑ θ(s)Fs(x) – ψ(θ) )Exponentialfamily:

Naturalparameter

e.g. Gaussian

Su�cientstatistics ofexponentialfamily

η(x) = �[ Fx(s) ]

η(x) = ∑ ζ(x, s)p(s)

Möbius Inversion on Posets

• Zeta function ζ∶S × S → {0, 1}:ζ(s, x) = { 1 if s ≤ x ,

0 otherwise

• Möbius function µ∶S × S → Z, defined as µ = ζ −1:

µ(x , y) = ⎧⎪⎪⎪⎨⎪⎪⎪⎩1, if x = y,−∑x≤s<y µ(x , s) if x < y,0 otherwise

• The Möbius inversion formula [Rota (1964)]:g(x) = ∑

s∈Sζ(s, x)f (s) ⇔ f (x) = ∑

s∈Sµ(s, x)g(s)

Möbius Function Is Generalization ofInclusion-Exclusion Principle

• For sets A, B, C ,∣A ∪ B ∪ C∣ = ∣A∣ + ∣B∣ + ∣C∣ − ∣A ∩ B∣ − ∣B ∩ C∣ − ∣A ∩ C∣+ ∣A ∩ B ∩ C∣

• In general, for A1 , A2 , . . . , An ,»»»»»»»»»»⋃i A i

»»»»»»»»»» = ∑J⊆{1, . . . ,n}, J/=∅(−1)∣J∣−1

»»»»»»»»»»»⋂j∈J A j

»»»»»»»»»»»• The Möbius function µ is the generalization of “(−1)∣J∣−1”

Log-linear Model with Möbius Inversion

• Log-linear model and its sufficient statistics:

log p(x) = ∑s∈S

ζ(s, x)θ(s) = ∑s≤x

θ(s),η(x) = ∑

s∈Sζ(x , s)p(s) = ∑

s≥xp(s)

– Generalization of the log-linear model on binary vectors:

log p(x) = ∑i

θ i x i +∑i< j

θ i j x i x j + ⋅ ⋅ ⋅ + θ 1. . .n x 1x2 . . . xn ,

• From the Möbius inversion formula,θ(x) = ∑

s∈Sµ(s, x) log p(s), p(x) = ∑

s∈Sµ(x , s)η(s)

{ } { }

0.30.5

0.20.2

–1.79

0.11.0

–2.30

0.40.6

Triple for each node

η(x) = ∑ p(s)s ≥ x

log p(x) = ∑ θ(s)s ≤ x

{ } { }

0.30.5

0.20.2

–1.79

0.11.0

–2.30

0.40.6

η(x) = ∑ p(s)s ≥ x

{ }p( )

{ , }p( )

{ } { }

0.30.5

0.20.2

–1.79

0.11.0

–2.30

0.40.6

η(x) = ∑ p(s)s ≥ x

{ }p( )

{ , }p( )0.3

0.20.2

0.40.4

Probability distributionis a “point” in 3D space

{ } { }

0.30.5

0.20.2

–1.79

0.11.0

–2.30

0.40.6

η(x) = ∑ p(s)s ≥ x

{ }η( )

{ , }η( )

0.20.2

0.60.6

0.50.5

{ } { }

0.30.5

0.20.2

–1.79

0.11.0

–2.30

0.40.6

η(x) = ∑ p(s)s ≥ x

{ }θ( )

{ , }θ( )

–1.79–1.791.391.39

1.101.10

{ } { }

0.30.5

0.20.2

–1.79

0.11.0

–2.30

0.40.6

η(x) = ∑ p(s)s ≥ x

one-to-one

{ } { }

0.30.5

0.20.2

–1.79

0.11.0

–2.30

0.40.6

η(x) = ∑ p(s)s ≥ x

one-to-one

{ }θ( ){ }

{ }η( )θ and η aredually orthogonal

Orthogonality of θ and η

• FromMöbius inversion,

∑s∈S

ζ(x , s)µ(s, y) = δx ,y , δx ,y = { 1 if x = y,0 otherwise

• θ and η are dually orthogonal:

E [ ∂∂θ(x) log p(s) ∂

∂η(y) log p(s)] = ∑s∈S

ζ(x , s)µ(s, y) = δx ,y

• Partial order structure leads to the samedually flat structure with the exponential family

Existing Approach Limited To Power Set

{x, y, z}

{x, z} {y, z}

{y} {z}

{x, y}

Power set

Our Approach Applies To Any Posets

{x, y, z}

{y, z}

{x, y}

Subset of power set

Our Approach Applies To Any Posets

Positive integers

111110101100011010001000

Pre�xes

Directed Acyclic Graph

{x, y, z}

{y, z}

{x, y}

Subset of power set

KL Divergence Decomposition

• KL divergence decomposition:DKL[P, R] = DKL[P, Q] + DKL[Q , R]with Q s.t. θQ (x) = θR(x) or ηQ (x) = ηP(x) for all x ∈ S \ {⊥}– Q is called the mixed distribution of (P, R)– It is known as the (generalized) Pythagoras theoremin Information Geometry

• We can derive fromMöbius inversion:DKL[P, Q] + DKL[Q , R] − DKL[P, R]

= ∑s∈S

(ηQ (s) − ηP(s)) (θQ (s) − θR(s))23/39

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.70.81.95

0.10.1

–1.950.10.20.0

Dist. P Dist. Rpηθ

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.70.81.95

0.10.1

–1.950.10.20.0

choose ηchoose θMixed

distribution Q

??????1.95

???0.2???

??????0.0

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.70.81.95

0.10.1

–1.950.10.20.0

distribution Q

0.620.51.95

0.0890.60.0

–1.13

0.20.2

–1.13

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.70.81.95

0.10.1

–1.950.10.20.0

distribution Q

0.620.51.95

0.0890.60.0

–1.13

0.20.2

–1.13

KL[P, R]

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.70.81.95

0.10.1

–1.950.10.20.0

distribution Q

0.620.51.95

0.0890.60.0

–1.13

0.20.2

–1.13

KL[P, R] KL[P, Q]+ KL[Q, R]

Nonnegative decompositionof the KL divergence

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.70.81.95

0.10.1

–1.950.10.20.0

distribution Q

0.620.51.95

0.0890.60.0

–1.13

0.20.2

–1.13

Nonnegative decompositionof the KL divergence

0.3946 0.04440.4390

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.250.50.0

0.250.250.0

0.250.50.0

Dist. P Uniform dist. P0pηθ

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.250.50.0

0.250.250.0

0.250.50.0

choose η choose θ(KNOCK DOWN)

Log-linear model

???0.5???

??????0.0

???0.6???

Mixeddistribution Q

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.250.50.0

0.250.250.0

0.250.50.0

Log-linear model

0.20.50.0

0.30.30.0

0.30.60.405

Mixeddistribution R

Contribution of the node= KL[P, Q] = 0.086

0.30.51.10

0.20.2

–1.790.40.61.39 ∅

0.250.50.0

0.250.250.0

0.250.50.0

Log-linear model

0.20.50.0

0.30.30.0

0.30.60.405

Mixeddistribution R

Contribution of the node= KL[P, Q] = 0.086

The statistics λ: λ = 2·[sample size]·KL[P, Q]follows χ2-distributionwith d.f. [#nodes – 1]⇒ p-value can be obtained!

Poset of Subgraphs

Log-Linear Model on Subgraphs

log p(x) = ∑ θ(s)s ⊑ x

Log-linear model:

Natural parameterof exponential familyNatural parameterof exponential family

η(x) = ∑ p(s)s ⊒ x

Su�cient statisticsof exponential familySu�cient statisticsof exponential family

Information of Each Subgraph

η0 η1 η2 η3 η4 η5 ηkθ0 θ1 θ2 θ3 θ4 θ5 θk

P: Empirical distribution ηj

Freq.Coef.

KL(P, Q)

η0 η1 η2 η3 ? η5 ηk? ? ? ? 0 ? ?

Q: Null distribution KL(P, Q)

Freq.Coef.

KL(P, Q)

η0 η1 η2 η3 ? η5 ηk? ? ? ? 0 ? ?

Q: Null distribution KL(P, Q)

Freq.Coef.

= KL(P, Q)= KL(P, R)+ KL(Q, Q

KL(Q, R)

KL(P, R)= KL(P, Q)+ KL(Q, R)

1.0 ? ? ? ? ? ?θ0’ 0 0 0 0 0 0

R: Uniform distribution KL(Q, R)

Freq.Coef.

Make a Poset from Data

Dataset

1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0

Number of nodes = 2#features

⇒combinatorial explosion!

{ , }{ , }

{ } { }

{ , , }

0.9 0.7 0.5

0.6 0.5 0.3

Frequency = 0.3

0.1 0.1 0.0

0.3 0.2 0.0

Probability = 0.3

Make a Poset from Data

Dataset

1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0

{ , }{ , }

{ } { }

{ , , }

0.9 0.7 0.5

0.6 0.5 0.3

Frequency = 0.3

0.1 0.1 0.0

0.3 0.2 0.0

Probability = 0.3

Probability ≥ 0.2(user speci�ed threshold)

Remove Nodes with Probability 0

Dataset

1 1 01 1 11 1 01 1 11 1 01 0 11 0 11 1 11 0 00 1 0

{ , }{ , }

{ , , }0.30.30.0

0.30.6

0.20.50.0

0.21.0

–1.61

= 1 – ∑ p(s)

Example on Real Data (kosarak)

ID 1:ID 2:ID 3:ID 4:ID 5:

1 1 01 1 11 1 01 1 11 1 0

Sample size:990,002

# features: 41,270# nodes: 3,253(Threshold: 10–5)

# signi�cant interactions: 583Single feature: 537Pairwise interactions: 41Triple interactions: 5

Total runtime:4.95 seconds

Example on Real Data (accidents)

ID 1:ID 2:ID 3:ID 4:ID 5:

1 1 01 1 11 1 01 1 11 1 0

Sample size:340,183

# nodes: 281(Threshold: 5×10–6)

# signi�cant interactions: 280# features in each interactionis between 26 to 41

Total runtime:4.95 seconds

# features: 468

Conclusion

• A close connection between the partial order structureand information geometry– Möbius inversion leads to the dually flat manifolds

◦ M. Sugiyama, H. Nakahara, K. Tsuda,Information Decomposition on Structured Space,IEEE ISIT (2016)

◦ S. Amari, Information geometry on hierarchy ofprobability distributions, IEEE Trans. Info. Theory (2001)

◦ H. Nakahara, S. Amari, Information-geometric measurefor neural spikes, Neural Computation (2002)

• We can decompose the KL divergence andasses the significance on any posets

partialorderstructureand informationgeometry - mahito · november16,2016 ibis2016...

Documents

你幾點下課 final

learningandcomputing - mahito · data law that generalizes...

olivier sigaud isir olivier.sigaud@lip6.fr

caractérisation de signaux pour l'interaction sociale...

modis 影像轉檔與幾何糾正

フィンスラー幾何による...

olivier sigaud isir olivier.sigaud@lip6.fr 01.44.27.88.53

comportements d’agents...

スピン幾何入門その2...

institut des systèmes intelligents et de robotique...

絕美的森林道場 -...

到底你在那裡 -幾米

partialorderstructureand informationgeometrynovember21,2016...

情報幾何学 #2 #infogeo16

ookk333311...

cours de traitement du signal - isir

情報幾何学と統計多様体の幾何学kawabi/matsuzoe-okayama.pdf ·...

ammar mahdhaoui - isir mahdhaoui/cv... · 2011-01-17 ·...

情報幾何学 #2.4

幾何概論 ii 講義ノート目次 - kumamoto...