Deep Exponential Families

Yusuf Hakan Kalayci

Bogazici University

March 22, 2018

Overview

1 Exponential Family
    Definition
    Sufficient Statistics
    Mean and Variance

2 Deep Exponential Families
    What is DEF
    Modeling Document
    Example

3 Inference
    Variational Inference
    Black Box Variational Inference
    Inference on DEFs


Exponential Family

Definition

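A probability density function $p(x \mid \theta)$, $x \in \mathcal{X} \subset \mathbb{R}^q$, which is labeled by $\theta \in \Theta \subset \mathbb{R}^k$, is said to belong to the k-parameter exponential family if it is of the form

$$p(x \mid \theta) = h(x) \cdot \exp\left[ \sum_{j=1}^{k} \eta_j(\theta) \cdot T_j(x) - B(\theta) \right]$$

where

$\eta_j, B$ are real-valued functions $\Theta \to \mathbb{R}$,
$T_j, h$ are real-valued functions $\mathbb{R}^q \to \mathbb{R}$.

We also say that the family is regular whenever $\mathcal{X}$ does not depend on $\theta$, and non-regular otherwise.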


Canonical Form

Re-parametrizing with $\eta = \eta(\theta)$ as the natural parameter gives the following density function:

$$p(x \mid \eta) = h(x) \cdot \exp\left[ \eta^T T(x) - A(\eta) \right]$$

Here:

$\eta = (\eta_1, \dots, \eta_k)^T$: natural parameter

$T(x) = (T_1(x), \dots, T_k(x))^T$: sufficient statistics

$A(\eta) = \log \underbrace{\int \int \cdots \int}_{q \text{ times}} h(x) \cdot \exp\left[ \eta^T T(x) \right] dx$: logarithm of the normalization constant, i.e. the log normalizer.


Bernoulli Distribution

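$$p(x \mid \pi) = \pi^x (1 - \pi)^{1 - x} = \exp\left[ \log\left( \frac{\pi}{1 - \pi} \right) x + \log(1 - \pi) \right] \tag{1}$$

$\eta = \log\left( \frac{\pi}{1 - \pi} \right)$

$T(x) = x$

$A(\eta) = -\log(1 - \pi) = \log(1 + e^{\eta})$

$h(x) = 1$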
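The identification of $\eta$, $T$, $A$ and $h$ above can be checked numerically. The following is a minimal sketch in Python; the function names and the value $\pi = 0.3$ are chosen only for illustration and do not come from the slides.

```python
import math

def bernoulli_pmf(x, pi):
    """Standard form: pi^x * (1 - pi)^(1 - x) for x in {0, 1}."""
    return pi ** x * (1.0 - pi) ** (1 - x)

def natural_param(pi):
    """eta = log(pi / (1 - pi)), the log-odds."""
    return math.log(pi / (1.0 - pi))

def log_normalizer(eta):
    """A(eta) = log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def bernoulli_expfam_pmf(x, eta):
    """Canonical form h(x) * exp(eta * T(x) - A(eta)) with h(x) = 1 and T(x) = x."""
    return math.exp(eta * x - log_normalizer(eta))

pi = 0.3  # illustrative value, not from the slides
eta = natural_param(pi)
for x in (0, 1):
    # The two columns should agree up to floating point error.
    print(x, bernoulli_pmf(x, pi), bernoulli_expfam_pmf(x, eta))
```

Both forms assign the same probability to each outcome, and `log_normalizer(eta)` agrees with $-\log(1 - \pi)$.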


Sufficient Statistics

Definition

A statistic is any function on the sample space that is not a function of the parameter.

$T(x)$ is sufficient for $\theta$ if there is no information in $x$ regarding $\theta$ beyond that in $T(x)$.

Bayesian approach:

$$p(\theta \mid T(x), x) = p(\theta \mid T(x))$$

Frequentist approach:

$$p(x \mid T(x), \theta) = p(x \mid T(x))$$
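A small illustration of sufficiency, assuming i.i.d. Bernoulli observations. The two data sets, the helper name and the parameter grid are made up for this sketch. The samples differ in order but share the same value of the statistic (the sum of the observations), so their likelihoods coincide for every $\pi$.

```python
def bernoulli_likelihood(xs, pi):
    """Likelihood of an i.i.d. Bernoulli sample xs under success probability pi."""
    out = 1.0
    for x in xs:
        out *= pi ** x * (1 - pi) ** (1 - x)
    return out

# Two hypothetical samples with the same sum (3 successes out of 5).
sample_a = [1, 0, 0, 1, 1]
sample_b = [0, 1, 1, 1, 0]

for pi in (0.1, 0.5, 0.9):
    # The likelihoods match because they depend on the data only through the sum.
    print(pi, bernoulli_likelihood(sample_a, pi), bernoulli_likelihood(sample_b, pi))
```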


Mean and Variance

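$$
\begin{aligned}
\frac{\partial A}{\partial \eta^T}
  &= \frac{\partial}{\partial \eta^T} \left\{ \log \int h(x) \cdot \exp\left\{ \eta^T T(x) \right\} \nu(dx) \right\} \\
  &= \frac{\int T(x) h(x) \cdot \exp\left\{ \eta^T T(x) \right\} \nu(dx)}{\int h(x) \cdot \exp\left\{ \eta^T T(x) \right\} \nu(dx)} \\
  &= \int T(x) h(x) \cdot \exp\left\{ \eta^T T(x) - A(\eta) \right\} \nu(dx) \\
  &= E[T(X)].
\end{aligned}
\tag{2}
$$

$$\frac{\partial^2 A}{\partial \eta \, \partial \eta^T} = \mathrm{Var}[T(X)] \tag{3}$$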
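Identities (2) and (3) can be sanity-checked on the Bernoulli case, where $A(\eta) = \log(1 + e^{\eta})$, $E[X] = \pi$ and $\mathrm{Var}[X] = \pi(1 - \pi)$. The sketch below uses central finite differences; the step sizes and the value $\pi = 0.3$ are illustrative assumptions.

```python
import math

def A(eta):
    """Bernoulli log normalizer A(eta) = log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

pi = 0.3  # illustrative value
eta = math.log(pi / (1 - pi))

mean_fd = first_derivative(A, eta)   # should be close to E[X] = pi
var_fd = second_derivative(A, eta)   # should be close to Var[X] = pi * (1 - pi)
print(mean_fd, pi)
print(var_fd, pi * (1 - pi))
```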



Deep Exponential Families are:

a flexible family of distributions,

a hierarchical latent variable model,

built from layers using exponential family distributions,

designed to capture hidden patterns from coarse-grained to fine-grained.



For each data point $x_n$, there are $L$ layers of hidden variables $\{z_{n,1}, \dots, z_{n,L}\}$.

Each $z_{n,l} = \{z_{n,l,1}, \dots, z_{n,l,K_l}\}$.

Each weight $W_l$ is shared across data points and is a collection of $K_l$ vectors of dimension $K_{l+1}$.


Deep Exponential Families (incomplete)

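$$z_{n,L,k} \sim \mathrm{expfam}_L(\eta)$$

$$z_{n,l,k} \sim \mathrm{expfam}_l\left( g_l(w_{l,k}^T z_{n,l+1}) \right)$$

$$x_{n,i} \sim \mathrm{expfam}_1\left( g_1(w_{0,i}^T z_{n,1}) \right)$$

$$W \sim \mathrm{expfam}(\xi)$$

Using the mean identity of the exponential family, we can say:

$$E[T(z_{n,l,k})] = \nabla_\eta A\left( g_l(w_{l,k}^T z_{n,l+1}) \right)$$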


Modeling Documents

Aim: cluster words into topics hierarchically and in a Bayesian manner (with probability distributions over topics) in order to analyze large volumes of text.

An example of the document modeling problem as a DEF:

documents: vectors of term counts,

topics: first latent layer,

super topics: second latent layer,

concepts: third latent layer.


Poisson DEF Example

Canonical exponential family form of the Poisson distribution:

$$p(z) = (z!)^{-1} \cdot \exp(\eta z - \exp(\eta))$$

$$E[Z] = \exp(\eta)$$

The link function is chosen as the logarithm. The conditional distribution is then:

$$p(z_{l,k} \mid z_{l+1}, w_{l,k}) = (z_{l,k}!)^{-1} \cdot \exp\left( \log(z_{l+1}^T w_{l,k}) \, z_{l,k} - z_{l+1}^T w_{l,k} \right)$$

Since $A$ is the exponential function,

$$E[z_{l,k}] = \nabla_\eta A\left( \log(z_{l+1}^T w_{l,k}) \right) = z_{l+1}^T w_{l,k}$$

In the case of document modeling, the value of $z_{n,2,k}$ represents "how many times super topic $k$ is represented in the $n$th example".
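To make the generative process concrete, here is a minimal sketch of ancestral sampling from a two-layer Poisson DEF with the log link, so that each unit's Poisson rate is the dot product of the layer above with its weight vector. Everything specific is an assumption made for illustration: the layer sizes, the gamma-distributed weights (chosen only to keep rates non-negative), the top-level $\eta = 0$, and the helper names. The Poisson sampler is Knuth's simple multiplication method, which is adequate only for small rates.

```python
import math
import random

def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's multiplication method (small rates only)."""
    if rate <= 0.0:
        return 0
    threshold = math.exp(-rate)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count

def sample_layer(z_above, weights, rng):
    """One unit per weight vector w_k, drawn from Poisson(z_above^T w_k)."""
    return [
        sample_poisson(sum(z * w for z, w in zip(z_above, w_k)), rng)
        for w_k in weights
    ]

def sample_document(eta, W1, W0, rng):
    """Top layer z2, then z1 given z2, then observed counts x given z1."""
    num_top_units = len(W1[0])
    z2 = [sample_poisson(math.exp(eta), rng) for _ in range(num_top_units)]
    z1 = sample_layer(z2, W1, rng)
    x = sample_layer(z1, W0, rng)
    return z2, z1, x

rng = random.Random(0)
K2, K1, V = 3, 5, 8  # hypothetical sizes: super topics, topics, vocabulary
# W1 holds K1 vectors of dimension K2; W0 holds V vectors of dimension K1.
W1 = [[rng.gammavariate(0.5, 1.0) for _ in range(K2)] for _ in range(K1)]
W0 = [[rng.gammavariate(0.5, 1.0) for _ in range(K1)] for _ in range(V)]

z2, z1, x = sample_document(0.0, W1, W0, rng)
print("super topics:", z2)
print("topics:      ", z1)
print("term counts: ", x)
```

If every unit in a layer is zero, all rates below it are zero and the layers underneath are zero as well, which is one way the model produces sparse counts.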


Variational Inference

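In VI, we seek to minimize the KL divergence from an approximate distribution $q_\lambda$ to the posterior of our model.

Minimizing

$$\mathrm{KL}(q_\lambda \,\|\, p) = E_q[\log q(z \mid \lambda)] - E_q[\log p(z \mid x)]$$

is equivalent to maximizing

$$\mathcal{L}(q_\lambda) = E_q[\log p(x, z) - \log q(z \mid \lambda)].$$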
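The equivalence follows in one step from Bayes' rule. The short derivation below uses only the identity $\log p(z \mid x) = \log p(x, z) - \log p(x)$.

```latex
\begin{aligned}
\mathrm{KL}(q_\lambda \,\|\, p)
  &= E_q[\log q(z \mid \lambda)] - E_q[\log p(z \mid x)] \\
  &= E_q[\log q(z \mid \lambda)] - E_q[\log p(x, z)] + \log p(x) \\
  &= \log p(x) - \mathcal{L}(q_\lambda).
\end{aligned}
```

Since $\log p(x)$ does not depend on $\lambda$, any decrease in the KL divergence is matched by an equal increase in $\mathcal{L}(q_\lambda)$.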


Black Box Variational Inference

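Aim: find $\nabla_\lambda \mathcal{L}$ without requiring gradients of the model, then take a step in the direction of that gradient to maximize $\mathcal{L}$.

Key observations:

Write $\nabla_\lambda \mathcal{L}$ as an expectation under $q$:

$$\nabla_\lambda \mathcal{L} = E_q\left[ \nabla_\lambda \log q(z \mid \lambda) \left( \log p(x, z) - \log q(z \mid \lambda) \right) \right]$$

Estimate this expectation with Monte Carlo sampling:

$$\nabla_\lambda \mathcal{L} \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_\lambda \log q(z_s \mid \lambda) \left( \log p(x, z_s) - \log q(z_s \mid \lambda) \right)$$

where $z_s \sim q(z \mid \lambda)$.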
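The Monte Carlo estimator can be tried on a toy model where the answer is known. The sketch below is not a DEF. It assumes $z \sim N(0, 1)$, $x \mid z \sim N(z, 1)$ with one observation $x = 2$, and a Gaussian $q$ with fixed variance $1/2$ whose mean $\lambda$ is the only variational parameter. The exact posterior is $N(x/2, 1/2)$, so $\lambda$ should approach $1$. The step size schedule, sample count and names are illustrative choices.

```python
import math
import random

X_OBS = 2.0   # single hypothetical observation
Q_VAR = 0.5   # fixed variance of q; only the mean lam is learned

def log_normal(value, mean, var):
    """Log density of N(mean, var) at value."""
    return -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)

def log_joint(x, z):
    """log p(x, z) for the toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)

def bbvi_gradient(lam, num_samples, rng):
    """Score-function estimate of the ELBO gradient with respect to lam."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(lam, math.sqrt(Q_VAR))   # z_s ~ q(z | lam)
        score = (z - lam) / Q_VAR              # gradient of log q(z | lam) in lam
        total += score * (log_joint(X_OBS, z) - log_normal(z, lam, Q_VAR))
    return total / num_samples

def fit(num_steps=1000, num_samples=50, seed=0):
    """Stochastic gradient ascent on the ELBO with step size 0.5 / t."""
    rng = random.Random(seed)
    lam = -1.0
    for t in range(1, num_steps + 1):
        lam += (0.5 / t) * bbvi_gradient(lam, num_samples, rng)
    return lam

lam_hat = fit()
print(lam_hat, "exact posterior mean:", X_OBS / 2)
```

Only evaluations of `log_joint` are needed, never its gradient, which is what makes the method "black box". The estimator is noisy, so in practice variance reduction is usually added on top of this basic form.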


Stochastic Optimization

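Let

$f(x)$ be a function to be maximized,

$h_t(x) \sim H(x)$ be a realization of a random variable with $E[H(x)] = \nabla_x f(x)$,

$\rho_t$ be the learning rate.

The update

$$x_{t+1} \leftarrow x_t + \rho_t h_t(x_t)$$

converges to a maximum point of $f(x)$ when the learning rate schedule follows the Robbins-Monro conditions

$$\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty.$$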
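A minimal sketch of this update on a toy problem. The objective $f(x) = -(x - 3)^2$, the unit-variance Gaussian noise and the schedule $\rho_t = 1/t$ are assumptions chosen for illustration. The schedule $1/t$ satisfies both Robbins-Monro conditions, since its sum diverges while the sum of its squares converges.

```python
import random

def noisy_gradient(x, rng):
    """Unbiased estimate of f'(x) for f(x) = -(x - 3)^2: true gradient plus zero-mean noise."""
    return -2.0 * (x - 3.0) + rng.gauss(0.0, 1.0)

def stochastic_ascent(x0, num_steps, rng):
    """Iterate x_{t+1} = x_t + rho_t * h_t(x_t) with rho_t = 1 / t."""
    x = x0
    for t in range(1, num_steps + 1):
        rho = 1.0 / t
        x += rho * noisy_gradient(x, rng)
    return x

x_final = stochastic_ascent(x0=0.0, num_steps=5000, rng=random.Random(0))
print(x_final)  # the maximizer of f is x = 3
```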

Inference on DEFs

For the approximate distribution $q$, we use the mean-field variational family:

$$q(z, W) = q(W_0) \prod_{l=1}^{L-1} q(W_l) \prod_{n=1}^{N} q(z_{n,l})$$

Each $q(W_l)$ and $q(z_{n,l})$ is fully factorized, and each factor is in the same exponential family as the corresponding layer, i.e.

$$q(z_{n,l,k}) = \mathrm{expfam}_l(z_{n,l,k}, \lambda_{n,l,k})$$


BBVI on DEFs

To update the variational parameters $\lambda_{n,l,k}$ using BBVI, we need to calculate the following gradient for each coordinate:

$$\nabla_{\lambda_{n,l,k}} \mathcal{L} = E_q\left[ \nabla_{\lambda_{n,l,k}} \log q(z_{n,l,k}) \left( \log p_{n,l,k}(x, z, W) - \log q(z_{n,l,k}) \right) \right]$$

The only quantity that we need to calculate for a DEF is $\log p_{n,l,k}(x, z, W)$, which collects the terms in the log-joint that contain $z_{n,l,k}$:

$$\log p_{n,l,k}(x, z, W) = \log p(z_{n,l,k} \mid z_{n,l+1}, W_{l,k}) + \log p(z_{n,l-1} \mid z_{n,l}, W_{l-1})$$

Similarly, for the weights $W$ this is

$$\log p_{l,i,j}(x, z, W) = \log p(w_{l,i,j} \mid \xi_{l,i,j}) + \log p(z_{l,j} \mid z_{l+1}, w_{l,j})$$
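As an illustration of the local term for a latent unit, the sketch below evaluates $\log p_{n,l,k}$ for Poisson layers with the log link, where each rate is a dot product. The function and argument names are hypothetical, the document index $n$ is dropped, and the layer below is assumed to factorize over its units given the layer above it.

```python
import math

def poisson_logpmf(count, rate):
    """log Poisson(count; rate), with the rate = 0 case handled explicitly."""
    if rate == 0.0:
        return 0.0 if count == 0 else -math.inf
    return count * math.log(rate) - rate - math.lgamma(count + 1)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def local_log_joint(k, z_below, z_layer, z_above, W_layer, W_below):
    """Terms of the log-joint that involve z_layer[k].

    W_layer[k] has one weight per unit of z_above;
    W_below[i] has one weight per unit of z_layer.
    """
    # log p(z_{l,k} | z_{l+1}, w_{l,k}): the unit given its parents.
    parent_term = poisson_logpmf(z_layer[k], dot(z_above, W_layer[k]))
    # log p(z_{l-1} | z_l, W_{l-1}): every child whose rate depends on z_layer.
    child_term = sum(
        poisson_logpmf(c, dot(z_layer, w_i)) for c, w_i in zip(z_below, W_below)
    )
    return parent_term + child_term

# Tiny hypothetical configuration: 2 units above, 2 in the layer, 1 below.
value = local_log_joint(
    k=0,
    z_below=[3],
    z_layer=[2, 1],
    z_above=[1, 2],
    W_layer=[[0.5, 0.25], [1.0, 1.0]],
    W_below=[[1.0, 0.5]],
)
print(value)
```

In a BBVI update this value would be evaluated at samples of $z$ and $W$ drawn from $q$ and plugged into the gradient estimator in place of the full log-joint.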


References

Ranganath, Tang, Charlin, Blei (2016)

Deep Exponential Families

https://www.cs.princeton.edu/~rajeshr/papers/def_aistats.pdf

Ranganath, Gerrish, Blei (2013)

Black Box Variational Inference

https://arxiv.org/pdf/1401.0118.pdf

Robbins, Monro (1951)

A Stochastic Approximation Method

https://projecteuclid.org/euclid.aoms/1177729586

The End

