ABC with data cloning for MLE in state-space models
TRANSCRIPT
Maximum likelihood estimation of state-space SDE models using data-cloning approximate Bayesian computation
Umberto Picchini
Centre for Mathematical Sciences, Lund University
AMS-EMS-SPM 2015, Porto
Umberto Picchini ([email protected])
Nowadays there are several ways to deal with “intractable likelihoods”, that is, models for which an explicit likelihood function is unavailable.

“Plug-and-play” methods: the only requirement is the ability to simulate from the data-generating model.
- particle marginal methods (PMMH, PMCMC) based on SMC filters [Andrieu et al. 2010]
- iterated filtering [Ionides et al. 2011]
- approximate Bayesian computation (ABC) [Marin et al. 2012]

In the following I will focus on ABC methods.

- Andrieu, Doucet and Holenstein (2010). Particle Markov chain Monte Carlo methods. JRSS-B.
- Ionides, Bhadra, Atchadé and King (2011). Iterated filtering. Ann. Stat.
- Marin, Pudlo, Robert and Ryder (2012). Approximate Bayesian computational methods. Stat. Comput.
A state-space model (SSM):

    Y_t ∼ f(y_t | X_t, φ),  t > t_0
    X_t ∼ g(x_t | x_{t−1}, η)          (1)

We have data y = (y_0, y_1, ..., y_n) from (1) at discrete time points 0 ≤ t_0 < ... < t_n.

Transition densities g(x_t | x_{t−1}, η) are typically unknown.

We are interested in inference for the parameter vector θ = (φ, η); however, the likelihood function is intractable:

    p(y|θ) = ∫ ∏_{t=1}^{T} p(y_t | x_t; θ) · p(x_1) ∏_{t=2}^{T} p(x_t | x_{t−1}; θ) dx_{1:T}

where the term p(x_1) ∏_{t=2}^{T} p(x_t | x_{t−1}; θ) is unavailable.
Approximate Bayesian computation (ABC)
Consider the posterior distribution of θ:
π(θ|y) ∝ p(y|θ)π(θ)
The purpose of ABC is to obtain an approximation πδ(θ|y) to the true posterior π(θ|y).

Here δ > 0 is a tolerance value. The smaller δ, the better the approximation to π(θ|y).

In practice, inference is carried out via some Monte Carlo sampling from πδ(θ|y). However, for a “small” δ, sampling from πδ(θ|y) can be difficult (high rejection rates).
ABC gives a way to approximate a posterior distribution
π(θ|y) ∝ p(y|θ)π(θ)
Key to the success of ABC is the ability to bypass the explicit calculation of the likelihood p(y|θ)... only forward simulation from the model is required!

Simulate artificial data y∗ from the SSM model (1):

    y∗ ∼ p(y|θ)

For SDEs, use a numerical discretization (arbitrarily accurate as the stepsize h → 0) or exact simulation (see Beskos, Roberts, Fearnhead, Papaspiliopoulos). ABC has had incredible success in genetic studies since the mid '90s (Tavaré et al. '97, Pritchard et al. '99). Now it is everywhere.
ABC basics
Generate θ∗ ∼ π(θ), x∗_t ∼ p(x|θ∗), y∗ ∼ f(y_t | x∗_t, θ∗).
The proposal θ∗ is accepted if y∗ is “close” to the data y, according to a threshold δ > 0.

The above generates draws from the augmented approximate posterior

    πδ(θ, y∗ | y) ∝ Jδ(y, y∗; θ) · p(y∗|θ) π(θ),    where p(y∗|θ) π(θ) ∝ π(θ|y∗)

Jδ(·) weights the intractable posterior π(θ|y∗) ∝ p(y∗|θ)π(θ), taking high values when y∗ ≈ y.

Rationale: if Jδ(·) is constant when δ = 0 (y = y∗), we recover the exact posterior π(θ|y).

Example (Gaussian kernel):

    Jδ(y, y∗; θ) ∝ ∏_{i=1}^{n} (1/δ) exp( −(y∗_i − y_i)² / (2δ²) )
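A minimal rejection-ABC sketch of this accept/reject step. This is not the talk's SSM: it assumes a hypothetical toy model y_i ∼ N(θ, 1) with a uniform prior, and for numerical convenience applies the Gaussian kernel to a summary statistic (the sample mean) rather than to the full data.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta, n):
    # toy data-generating model standing in for the SSM (an assumption):
    # y_i ~ N(theta, 1)
    return theta + rng.normal(size=n)

n = 20
y = simulate(2.0, n)        # "observed" data, true theta = 2
s_obs = y.mean()            # summary statistic

delta = 0.1
draws = []
for _ in range(20000):
    theta_star = rng.uniform(-5, 5)     # draw from the (uniform) prior
    y_star = simulate(theta_star, n)    # forward simulation only
    s_star = y_star.mean()
    # Gaussian ABC kernel on the summary statistic, scaled by its maximum
    # so that it is a valid acceptance probability
    weight = np.exp(-(s_star - s_obs) ** 2 / (2 * delta ** 2))
    if rng.uniform() < weight:
        draws.append(theta_star)
```

The accepted `draws` approximate πδ(θ|y); shrinking `delta` sharpens the approximation at the cost of a higher rejection rate, which is exactly the trade-off discussed above.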
ABC within MCMC (Marjoram et al. 2003)
Data: y ∈ Y. Realizations y∗ from the SSM, y∗ ∈ Y.
Algorithm 1: a generic iteration of ABC-MCMC (fixed threshold δ).
At the r-th iteration:
1. generate θ∗ ∼ q(θ|θ_r), e.g. using a Gaussian random walk
2. simulate x∗|θ∗ ∼ p(x|θ∗) and y∗ ∼ p(y|x∗, θ∗)
3. accept (θ∗, y∗) with probability

    min( 1, [Jδ(y, y∗; θ∗) p(y∗|θ∗) π(θ∗)] / [Jδ(y, y_r; θ_r) p(y_r|θ_r) π(θ_r)] × [q(θ_r|θ∗) p(y_r|θ_r)] / [q(θ∗|θ_r) p(y∗|θ∗)] )

then set r = r + 1 and go to 1.
ABC within MCMC (Marjoram et al. 2003)
Data: y ∈ Y. Realizations y∗ from the SSM, y∗ ∈ Y.
Algorithm 2: the same iteration, noting that the intractable likelihoods cancel.
At the r-th iteration:
1. generate θ∗ ∼ q(θ|θ_r), e.g. using a Gaussian random walk
2. simulate x∗|θ∗ ∼ p(x|θ∗) and y∗ ∼ p(y|x∗, θ∗)
3. accept (θ∗, y∗) with probability

    min( 1, [Jδ(y, y∗; θ∗) π(θ∗)] / [Jδ(y, y_r; θ_r) π(θ_r)] × q(θ_r|θ∗) / q(θ∗|θ_r) )

since p(y∗|θ∗) and p(y_r|θ_r) appear in both the target ratio and the proposal ratio, and cancel. Then set r = r + 1 and go to 1.

Samples are from πδ(θ|y), or from the exact posterior when δ = 0.
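A minimal sketch of one such ABC-MCMC chain, under the same hypothetical toy model as before (y_i ∼ N(θ, 1), uniform prior). As a simplifying assumption the kernel is evaluated on the sample mean rather than the full data; with a symmetric random-walk proposal the q-ratio is 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta, n):
    # toy stand-in for forward simulation of the SSM (an assumption)
    return theta + rng.normal(size=n)

def log_J(y_obs, y_sim, delta):
    # log of a Gaussian ABC kernel on a summary statistic (the sample mean)
    return -(y_sim.mean() - y_obs.mean()) ** 2 / (2 * delta ** 2)

def log_prior(theta):
    # uniform prior on (-5, 5)
    return 0.0 if -5.0 < theta < 5.0 else -np.inf

n, delta = 20, 0.2
y = simulate(2.0, n)
theta_r, y_r = 0.0, simulate(0.0, n)
chain = []
for _ in range(5000):
    theta_star = theta_r + 0.3 * rng.normal()   # step 1: symmetric RW proposal
    y_star = simulate(theta_star, n)            # step 2: likelihood-free simulation
    log_alpha = (log_J(y, y_star, delta) + log_prior(theta_star)
                 - log_J(y, y_r, delta) - log_prior(theta_r))
    if np.log(rng.uniform()) < log_alpha:       # step 3: accept/reject
        theta_r, y_r = theta_star, y_star
    chain.append(theta_r)
```

Note that the accept/reject step only ever evaluates the kernel, the prior and the proposal: no likelihood appears, mirroring the cancellation above.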
a completely made-up illustration
Green: the target posterior; the prior distribution is uniform. Let's decrease δ progressively...
Typically we cannot reduce δ as much as we would like. When running into high rejection rates we might have to stop at the pink approximation.

[Figure: posterior approximations πδ(θ|y) for progressively smaller δ, against the target posterior]

For the “best feasible δ” (pink) we get the MAP pretty much right. The tails are awful though...
Suppose we are in a scenario where it is not feasible to decrease δ further... What to do?

Here I am borrowing the data-cloning idea. Data cloning was independently introduced in:
1. Doucet, Godsill, Robert. Statistics and Computing (2002)
2. Jacquier, Johannes, Polson. J. Econometrics (2007)
3. and popularized in ecology by Lele, Dennis, Lutscher. Ecology Letters (2007).
“data cloning” for state-space models
(Forget about ABC for the moment.)
Data: y. Likelihood: L(θ; y).
Choose an integer K ≥ 1 and stack K copies of your data:

    y(K) = (y, y, ..., y)    (K times)

The corresponding posterior is

    π(θ | y(K)) ∝ L(θ; y(K)) π(θ)

Consider K independent realizations X(1), ..., X(K) of {X_t}, with X(k) = (X(k)_0, ..., X(k)_n), k = 1, ..., K:

    L(θ; y(K)) = ∏_{k=1}^{K} ∫ f(y | X(k), θ) p(X(k) | θ) dX(k) = (L(θ; y))^K.

Use MCMC to sample from π(θ | y(K)) for “large” K.
Asymptotics, K →∞ (Jacquier et al. 2007; Lele et al. 2007)
K is the number of data “clones”.

When K → ∞ we have:

- the sample mean of MCMC draws from π(θ|y(K)) ⇒ θ_mle (whatever the prior!)
- K × [sample covariance of draws from π(θ|y(K))] ⇒ I⁻¹(θ_mle), the inverse of the Fisher information at the MLE
- θ ⇒ N(θ_mle, K⁻¹ · I⁻¹(θ_mle))
1 Jacquier, Johannes, Polson. J. Econometrics (2007)
2 Lele, Dennis, Lutscher. Ecology Letters (2007).
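These limits can be checked on a conjugate toy model where the cloned posterior is available in closed form. The model below is an illustrative assumption, not part of the talk: y_i ∼ N(θ, s²) with known s² and prior θ ∼ N(m₀, v₀), so conditioning on K stacked copies of the data is equivalent to having K·n observations.

```python
import numpy as np

rng = np.random.default_rng(3)

n, s2 = 30, 2.0
y = rng.normal(1.5, np.sqrt(s2), size=n)
m0, v0 = -10.0, 5.0            # deliberately bad prior, to show it washes out

def cloned_posterior(K):
    """Mean and variance of pi(theta | y^(K)) for the conjugate normal model."""
    prec = 1 / v0 + K * n / s2
    mean = (m0 / v0 + K * np.sum(y) / s2) / prec
    return mean, 1 / prec

theta_mle = y.mean()           # MLE; the inverse Fisher information is s2/n
results = {K: cloned_posterior(K) for K in (1, 10, 100, 1000)}
for K, (mean, var) in results.items():
    # as K grows: mean -> theta_mle (whatever the prior), K*var -> s2/n
    print(K, mean, K * var)
```

Even with a prior centred at −10, the cloned posterior mean converges to the MLE and K times the posterior variance converges to the inverse Fisher information, as stated above.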
Our idea
Compensate for the inability to decrease δ by increasing K:
1. Run ABC-MCMC for decreasing δ (fix K = 1, no data cloning);
2. Stop decreasing δ and start increasing K > 1 (data cloning);
3. the distribution shrinks around the MLE (thick vertical line).
Rationale
Rationale (with abuse of notation):

From ABC theory:

    lim_{δ→0} πδ(θ | y(K)) = π(θ | y(K))

From data-cloning theory:

    lim_{K→∞} π(θ | y(K)) = N(θ_mle, K⁻¹ · I⁻¹(θ_mle))

Hence first reduce δ, then enlarge K:

    lim_{K→∞} ( lim_{δ→0} πδ(θ | y(K)) ) = N(θ_mle, K⁻¹ · I⁻¹(θ_mle))
    lim_{K→∞} ( lim_{δ→0} πδ(θ | y(K)) ) = N(θ_mle, K⁻¹ · I⁻¹(θ_mle))

Now:
- of course we cannot really let both δ → 0 and K → ∞: the two criteria compete! It is computationally not feasible to satisfy both.
- I have no proof of the quality of the estimates for δ > 0 and K finite.
In summary:

The non-ABC (augmented) target posterior for an SSM:

    π(θ, X(K) | y(K)) ∝ { ∏_{k=1}^{K} f(y | X(k), θ) p(X(k) | θ) } π(θ)

where X(K) = (X(1), ..., X(K)), each X(k) ∼ p(X|θ) i.i.d.

My ABC data-cloned posterior for an SSM:

    πδ(θ, y∗(K) | y(K)) ∝ { ∏_{k=1}^{K} Jδ(y, y∗(k); θ) p(X(k) | θ) } π(θ)

As an example:

    Jδ(y, y∗(k); θ) := ∏_{i=1}^{n} (1/δ) exp( −(y∗(k)_i − y_i)² / (2δ²) )
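One practical note: a product of K such kernels underflows quickly in floating point, so an implementation would evaluate it in log space. A minimal sketch with the Gaussian kernel above (array shapes and values are illustrative assumptions):

```python
import numpy as np

def log_J(y, y_star, delta):
    # log of the Gaussian kernel: prod_i (1/delta) exp(-(y*_i - y_i)^2 / (2 delta^2))
    n = y.shape[0]
    return -n * np.log(delta) - np.sum((y_star - y) ** 2) / (2 * delta ** 2)

def log_cloned_kernel(y, y_star_K, delta):
    """log of prod_{k=1}^K J_delta(y, y*(k); theta), where y_star_K has
    shape (K, n): one row per independently simulated cloned dataset."""
    return sum(log_J(y, ys, delta) for ys in y_star_K)

rng = np.random.default_rng(4)
y = rng.normal(size=12)                          # 12 observations, as in the example
y_star_K = y + 0.3 * rng.normal(size=(50, 12))   # K = 50 cloned simulations
lk = log_cloned_kernel(y, y_star_K, delta=0.5)
```

In the acceptance ratio of the cloned ABC-MCMC, only the difference of two such log-kernels is needed, so the normalizing constants could equally be dropped.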
Main problem with ABC: for complex models it is difficult to obtain a decent acceptance rate during ABC-MCMC when δ is “small”.

Idea: set δ to a large (manageable) value, and compensate by “powering up” the posterior → data cloning. That is...

1. Preliminary step: use a typical ABC-MCMC with K = 1. Determine the main mode θ̄ of πδ(θ|y), with δ “not too small” (about a 5% acceptance rate).
2. Start a further ABC-MCMC with K > 1, drawing proposals from an independence Metropolis sampler centred at θ̄.
3. Increase K progressively...
Algorithm 4: data-cloning ABC (P. 2015)

ABC-MCMC stage (K = 1), using an adaptive Metropolis random walk AMRW:
1. Generate X∗ from p(X|θ∗) and a corresponding y∗ from the SSM. Compute Jδ(y, y∗; θ∗).
2. Generate θ# := AMRW(θ∗, Σ). Generate X# from p(X|θ#) and a corresponding y#. Compute Jδ(y, y#; θ#).
3. Accept θ# with probability

    α = min[ 1, Jδ(y, y#; θ#) / Jδ(y, y∗; θ∗) × u1(θ∗|θ#, Σ) / u1(θ#|θ∗, Σ) × π(θ#) / π(θ∗) ]

Data-cloning stage, using a Metropolis independence sampler MIS:
4. Fetch the maximum θ̄ from ABC-MCMC, then proceed as above but propose using θ# := MIS(θ̄, Σ̄).
5. Increase K := K + 1. Generate independently y#(1), ..., y#(K) from p(y|θ#).
6. Accept the proposal with probability

    α = min[ 1, ∏_{k=1}^{K} Jδ(y, y#(k); θ#) / ∏_{k=1}^{K} Jδ(y, y∗(k); θ∗) × u2(θ∗|θ̄, Σ̄) / u2(θ#|θ̄, Σ̄) × π(θ#) / π(θ∗) ].
Stochastic Gompertz model
dXt = BCe−CtXtdt + σXtdWt, X0 = Ae−B
Used in ecology for population growth, e.g. chicken growth data [Donnet, Foulley, Samson 2010]
[Figure: a simulated trajectory of the Gompertz model over t ∈ [0, 40], with the 12 observations]
12 observations from {log Xt}. X0 is assumed known. We wish to estimate θ = (A, B, C, σ). The exact MLE is available since the transition densities are known.
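Forward simulation, which is all ABC requires, can be obtained by Euler–Maruyama discretization of the SDE above. A minimal sketch; the parameter values passed at the bottom are illustrative assumptions, not those of the talk's experiment.

```python
import numpy as np

def simulate_gompertz(A, B, C, sigma, T=40.0, h=0.01, rng=None):
    """Euler-Maruyama discretization of
    dX_t = B*C*exp(-C*t)*X_t dt + sigma*X_t dW_t,  X_0 = A*exp(-B)."""
    rng = rng or np.random.default_rng()
    nsteps = int(T / h)
    x = np.empty(nsteps + 1)
    x[0] = A * np.exp(-B)
    t = 0.0
    for i in range(nsteps):
        drift = B * C * np.exp(-C * t) * x[i]     # time-inhomogeneous drift
        diffusion = sigma * x[i]                  # multiplicative noise
        x[i + 1] = x[i] + drift * h + diffusion * np.sqrt(h) * rng.normal()
        t += h
    return x

# illustrative parameter values (an assumption for this sketch)
path = simulate_gompertz(A=3000.0, B=1.6, C=0.15, sigma=0.05)
```

With sigma = 0 the scheme reduces to Euler's method for the deterministic Gompertz curve X_t = A·exp(−B·e^{−Ct}), which gives a convenient sanity check of the discretization.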
Priors: log A ∼ U(6, 9), log C ∼ U(0.5, 4), σ ∼ LN(0, 0.15)
Comparison with exact MLE
[Figure: ABC-MCMC trace plots for log A, log C and log σ over 3 × 10⁶ iterations]
              True value    Exact MLE        ABC ((K, δ) = (5, 0.5))
    log A     8.01          7.8 (0.486)      7.716 (0.471)
    log B(∗)  1.609         1.567            1.550
    log C     2.639         2.755 (0.214)    2.872 (0.473)
    log σ     0             −0.14 (0.211)    −0.251 (0.228)

Table: (∗) log B is deterministically determined as log(log(A/X0)), since X0 = Ae−B with X0 known.
Gompertz state-space model
    Y_{ti} = log(X_{ti}) + ε_{ti},  ε_{ti} ∼ N(0, σ²_ε)
    dXt = BCe^{−Ct} Xt dt + σ Xt dWt,  X0 = Ae^{−B}

12 observations from {Y_{ti}}. The state {Xt} is unobserved. X0 is assumed known.

We wish to estimate θ = (A, B, C, σ, σε).
Figure: data and three sample trajectories from the estimated state-space model.
              True value    ABC-DC ((K, δ) = (4, 0.8))
    log A     8.01          8.01 (0.567)
    log B(∗)  1.609         1.611
    log C     2.639         3.152 (0.982)
    log σ     0             −0.080 (0.258)
    log σε    −0.799        −0.577 (0.176)
Take-home message
1. Sometimes we want to do MLE but we are unable to...
2. Sometimes we want to go fully Bayesian but we can't...
3. Sometimes even ABC is challenging...
4. There are endless possibilities out there (EP, VB and more...)
5. Working paper: P. (2015), “Approximate maximum likelihood estimation using data-cloning ABC”, arXiv:1505.06318.
6. Blog discussion by Christian P. Robert (2 June): https://xianblog.wordpress.com
Thank You
Appendix
“Likelihood free” Metropolis-Hastings
Suppose at a given iteration of Metropolis-Hastings we are in the (augmented) state (θ#, x#) and wonder whether to move to a new state (θ′, x′). The move is generated via a proposal distribution q((θ#, x#) → (θ′, x′)).

E.g. q((θ#, x#) → (θ′, x′)) = u(θ′|θ#) v(x′|θ′); the move (θ#, x#) → (θ′, x′) is accepted with probability

    α = min( 1, [π(θ′) π(x′|θ′) π(y|x′, θ′) q((θ′, x′) → (θ#, x#))] / [π(θ#) π(x#|θ#) π(y|x#, θ#) q((θ#, x#) → (θ′, x′))] )
      = min( 1, [π(θ′) π(x′|θ′) π(y|x′, θ′) u(θ#|θ′) v(x#|θ#)] / [π(θ#) π(x#|θ#) π(y|x#, θ#) u(θ′|θ#) v(x′|θ′)] )

Now choose v(x|θ) ≡ π(x|θ); the π(x|θ) terms cancel:

    α = min( 1, [π(θ′) π(y|x′, θ′) u(θ#|θ′)] / [π(θ#) π(y|x#, θ#) u(θ′|θ#)] )

This is likelihood-free! And we only need to know how to generate x′ (not a problem...)