Page 1:

PDE analysis for sampling dynamics and generative models

Jianfeng Lu (鲁剑锋)

Duke University
[email protected]

IPAM, UCLA, April 2020

Joint with
Yu Cao (Duke → Courant Institute)
Yulong Lu (Duke → UMass Amherst)
Lihan Wang (Duke)

Page 2:

Sampling high dimensional probability distributions is a ubiquitous challenge in many fields:
• machine learning;
• Bayesian statistics;
• computational statistical mechanics;
• quantum many-body problems;
• ...

Popular approaches for sampling:
• Markov chain Monte Carlo, e.g., diffusion based sampling;
• Interacting samplers, e.g., particle filters;
• Generative models, e.g., normalizing flows, generative adversarial networks.

Page 3:

In this talk, we will discuss two recent sampling-related works:

Convergence to equilibrium of Langevin dynamics (joint with Yu Cao and Lihan Wang)

Generative models for high dimensional probability distributions (joint with Yulong Lu)

Page 5:

Langevin dynamics

Consider the (underdamped) Langevin dynamics:

$$\mathrm{d}x_t = v_t\,\mathrm{d}t, \qquad m\,\mathrm{d}v_t = -\nabla U(x_t)\,\mathrm{d}t - \gamma v_t\,\mathrm{d}t + \sqrt{2\gamma k_B T}\,\mathrm{d}W_t$$

• $x$: position; $v$: velocity;
• $U$: potential energy;
• $\gamma$: friction coefficient;
• $m$: particle mass;
• $T$: temperature;
• $k_B$: Boltzmann constant;
• $W_t$: $d$-dimensional standard Brownian motion.

[Photo: Paul Langevin (1872–1946)]

For simplicity, we will set $m = 1$ and $k_B T = 1$ (after possibly a rescaling).
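As a concrete companion to the dynamics above (an addition, not from the original slides), here is a minimal Euler-Maruyama simulation in the units $m = k_B T = 1$; the quadratic potential, step size, and trajectory length are placeholder choices for the sketch.

```python
import numpy as np

def underdamped_langevin(grad_U, x0, v0, gamma=1.0, dt=1e-2, n_steps=10_000, rng=None):
    """Euler-Maruyama discretization of
        dx = v dt,    dv = -grad_U(x) dt - gamma * v dt + sqrt(2 * gamma) dW
    in units with m = k_B T = 1. Returns the trajectory of positions."""
    rng = np.random.default_rng() if rng is None else rng
    x, v = np.array(x0, dtype=float), np.array(v0, dtype=float)
    xs = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x + dt * v
        v = v - dt * grad_U(x) - dt * gamma * v \
            + np.sqrt(2.0 * gamma * dt) * rng.standard_normal(v.size)
        xs[k] = x
    return xs

# Example: U(x) = |x|^2 / 2, so the x-marginal of the invariant measure is N(0, I).
traj = underdamped_langevin(grad_U=lambda x: x, x0=np.zeros(2), v0=np.zeros(2),
                            gamma=1.0, rng=np.random.default_rng(0))
print(traj[2000:].mean(axis=0), traj[2000:].var(axis=0))  # roughly 0 and 1
```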

Page 6:

Langevin dynamics (with only the friction parameter $\gamma$)

$$\mathrm{d}x_t = v_t\,\mathrm{d}t, \qquad \mathrm{d}v_t = -\nabla U(x_t)\,\mathrm{d}t - \gamma v_t\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}W_t$$

As $\gamma \to 0$, we get the Hamiltonian dynamics for $H(x,v) = \frac{1}{2}|v|^2 + U(x)$:

$$\mathrm{d}x_t = v_t\,\mathrm{d}t = \frac{\partial H(x_t,v_t)}{\partial v}\,\mathrm{d}t, \qquad \mathrm{d}v_t = -\nabla U(x_t)\,\mathrm{d}t = -\frac{\partial H(x_t,v_t)}{\partial x}\,\mathrm{d}t$$

As $\gamma \to \infty$, on the $O(1)$ time scale the dynamics is dominated by the Ornstein-Uhlenbeck process in the velocity variable with "equilibrium"

$$v_t\,\mathrm{d}t = -\gamma^{-1}\nabla U(x_t)\,\mathrm{d}t + \sqrt{2\gamma^{-1}}\,\mathrm{d}W_t;$$

after rescaling $\gamma t \mapsto t$, we obtain the overdamped Langevin dynamics

$$\mathrm{d}x_t = -\nabla U(x_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t.$$
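For comparison (again an addition, not from the slides), the overdamped limit can be simulated directly; the unadjusted discretization below targets the Gibbs distribution $\propto e^{-U}$ up to a discretization bias, with the same placeholder quadratic potential.

```python
import numpy as np

def overdamped_langevin(grad_U, x0, dt=1e-2, n_steps=10_000, rng=None):
    """Euler-Maruyama discretization of dx = -grad_U(x) dt + sqrt(2) dW
    (the unadjusted Langevin algorithm); biased at finite step size dt."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    xs = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x - dt * grad_U(x) + np.sqrt(2.0 * dt) * rng.standard_normal(x.size)
        xs[k] = x
    return xs

samples = overdamped_langevin(grad_U=lambda x: x, x0=np.zeros(2),
                              rng=np.random.default_rng(0))
print(samples[2000:].var(axis=0))  # close to 1 for U(x) = |x|^2 / 2
```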

Page 9:

Langevin dynamics

$$\mathrm{d}x_t = v_t\,\mathrm{d}t, \qquad \mathrm{d}v_t = -\nabla U(x_t)\,\mathrm{d}t - \gamma v_t\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}W_t$$

The corresponding backward Kolmogorov equation (known as the kinetic Fokker-Planck equation) is given by

$$\partial_t f = \mathcal{L} f, \qquad f(0,x,v) = f_0(x,v),$$

with the generator given by $\mathcal{L} = \mathcal{L}_{\mathrm{ham}} + \gamma\,\mathcal{L}_{\mathrm{FD}}$, with

$$\mathcal{L}_{\mathrm{ham}} = v\cdot\nabla_x - \nabla_x U\cdot\nabla_v \quad\text{and}\quad \mathcal{L}_{\mathrm{FD}} = \Delta_v - v\cdot\nabla_v.$$

The invariant measure of the Langevin dynamics is given by

$$\rho_\infty(\mathrm{d}x, \mathrm{d}v) = \frac{1}{Z}\,e^{-\beta H(x,v)}\,\mathrm{d}x\,\mathrm{d}v = \frac{1}{Z}\,e^{-U(x)-\frac{1}{2}|v|^2}\,\mathrm{d}x\,\mathrm{d}v,$$

where $Z$ is the normalizing constant (recall that $\beta^{-1} = k_B T = 1$).

Page 11:

The invariant measure of Langevin dynamics is

$$\rho_\infty(\mathrm{d}x, \mathrm{d}v) \propto e^{-U(x)-\frac{1}{2}|v|^2}\,\mathrm{d}x\,\mathrm{d}v,$$

and its marginal in $x$ is the Boltzmann-Gibbs distribution $\mu_\infty \propto e^{-U(x)}$.

Thus, the Langevin dynamics and its overdamped limit can be used to sample the Boltzmann-Gibbs distribution.
• Smart Monte Carlo [Rossky, Doll, Friedman 1978] (which in fact discussed both underdamped and overdamped Langevin);
• The overdamped version is known in the statistics literature as the Metropolis-adjusted Langevin algorithm (MALA) [Besag 1994; Roberts, Tweedie 1996] (a minimal sketch of a MALA step follows after this list);
• The underdamped version has become popular in machine learning in recent years, see e.g. [Cheng, Chatterji, Bartlett, Jordan 2018];
• The underdamped version is also closely related to the Hamiltonian Monte Carlo method [Duane, Kennedy, Pendleton, Roweth 1987];
• Often combined with the Metropolis algorithm, though it can also be used without accept/reject, see e.g. [Neal 1993].
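A minimal sketch of one MALA step (an addition, not from the slides), targeting a density proportional to $e^{-U(x)}$; the Gaussian test potential and the step size are placeholder choices.

```python
import numpy as np

def mala_step(x, U, grad_U, dt, rng):
    """One Metropolis-adjusted Langevin step targeting exp(-U(x))."""
    x_prop = x - dt * grad_U(x) + np.sqrt(2.0 * dt) * rng.standard_normal(x.size)

    def log_q(y, z):
        # log density (up to a constant) of proposing y when starting from z
        return -np.sum((y - z + dt * grad_U(z)) ** 2) / (4.0 * dt)

    log_alpha = (U(x) - U(x_prop)) + (log_q(x, x_prop) - log_q(x_prop, x))
    if np.log(rng.uniform()) < log_alpha:
        return x_prop, True
    return x, False

# Usage on the standard Gaussian target, U(x) = |x|^2 / 2:
rng = np.random.default_rng(0)
x, n_accept = np.zeros(2), 0
for _ in range(5000):
    x, accepted = mala_step(x, U=lambda y: 0.5 * np.sum(y ** 2),
                            grad_U=lambda y: y, dt=0.5, rng=rng)
    n_accept += accepted
print("acceptance rate:", n_accept / 5000)
```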

Page 13:

Question: How fast does Langevin dynamics converge to equilibrium?

Recall the kinetic Fokker-Planck equation

$$\partial_t f = \mathcal{L} f = (\mathcal{L}_{\mathrm{ham}} + \gamma\,\mathcal{L}_{\mathrm{FD}})f; \qquad f(0,x,v) = f_0(x,v),$$

where

$$\mathcal{L}_{\mathrm{ham}} = v\cdot\nabla_x - \nabla_x U\cdot\nabla_v \quad\text{and}\quad \mathcal{L}_{\mathrm{FD}} = \Delta_v - v\cdot\nabla_v.$$

Since $\mathcal{L}^*\rho_\infty = 0$, $\int f(t,\cdot)\,\mathrm{d}\rho_\infty$ is invariant in time. As $t\to\infty$, $f$ will converge to the constant $\int f(0,\cdot)\,\mathrm{d}\rho_\infty$. How fast?

Cf. the Fokker-Planck equation for the overdamped dynamics,

$$\partial_t h = -\nabla_x U\cdot\nabla_x h + \Delta_x h, \qquad h(0,x) = h_0(x).$$

The convergence of this Fokker-Planck equation is well understood, as the generator is self-adjoint and coercive with respect to $L^2(\mu_\infty)$.

Page 15:

Assumption (Poincaré inequality for $\mu_\infty$)

$$\int \Bigl(h - \int h\,\mathrm{d}\mu_\infty\Bigr)^2\,\mathrm{d}\mu_\infty \;\le\; \frac{1}{m}\int |\nabla_x h|^2\,\mathrm{d}\mu_\infty$$

This implies that the overdamped dynamics has convergence rate $m$; the assumption holds, e.g., when $U$ is $m$-strongly convex.

Question: Any improvement by the underdamped Langevin dynamics?

gradient descent ⟶ accelerated gradient descent
overdamped dynamics ⟶ Langevin dynamics

Langevin dynamics can be thought of as a stochastic version of accelerated gradient dynamics. Thus, following the analogy between optimization and sampling, we would hope that (underdamped) Langevin dynamics gives convergence rate $\mathcal{O}(\sqrt{m})$ for strongly convex $U$.
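To spell out the step from the Poincaré inequality to the rate $m$ (an added remark following the standard argument, not on the original slide): for the overdamped Fokker-Planck equation the mean $\bar h := \int h\,\mathrm{d}\mu_\infty$ is conserved, and

$$\frac{\mathrm{d}}{\mathrm{d}t}\,\frac{1}{2}\|h(t,\cdot) - \bar h\|_{L^2(\mu_\infty)}^2 = -\int |\nabla_x h(t,\cdot)|^2\,\mathrm{d}\mu_\infty \;\le\; -m\,\|h(t,\cdot) - \bar h\|_{L^2(\mu_\infty)}^2,$$

so $\|h(t,\cdot) - \bar h\|_{L^2(\mu_\infty)} \le e^{-mt}\,\|h(0,\cdot) - \bar h\|_{L^2(\mu_\infty)}$, i.e., exponential convergence with rate $m$. It is exactly this coercivity step that fails for the kinetic generator $\mathcal{L}$, since $\mathcal{L}_{\mathrm{FD}}$ only controls $\nabla_v$.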

Page 17:

Assumption (Poincaré inequality for $\mu_\infty$)

$$\int \Bigl(h - \int h\,\mathrm{d}\mu_\infty\Bigr)^2\,\mathrm{d}\mu_\infty \;\le\; \frac{1}{m}\int |\nabla_x h|^2\,\mathrm{d}\mu_\infty$$

This implies that the overdamped dynamics has convergence rate $m$.

Question: Any improvement by the underdamped Langevin dynamics?

Theorem (Cao-L.-Wang 2019)
For convex $U$ satisfying $|\operatorname{Hess} U| \lesssim 1 + |\nabla U|$ and superlinear as $|x|\to\infty$,

$$\Bigl\|f(t,\cdot) - \int f(t,\cdot)\,\mathrm{d}\rho_\infty\Bigr\|_{L^2(\rho_\infty)} \le C\exp(-\lambda t)\,\Bigl\|f(0,\cdot) - \int f(0,\cdot)\,\mathrm{d}\rho_\infty\Bigr\|_{L^2(\rho_\infty)}$$

with an explicit estimate of $\lambda$:

$$\lambda = \sqrt{m}\,\log\Bigl(1 + \frac{\gamma\sqrt{m}}{c\,(\sqrt{m}+\gamma)^2}\Bigr),$$

which is $\mathcal{O}(\sqrt{m})$ if we take $\gamma = \mathcal{O}(\sqrt{m})$.

Results are available for more general cases; we will not discuss those here.

Page 18:

Exponential convergence with rate $\mathcal{O}(\sqrt{m})$ (setting $\int f\,\mathrm{d}\rho_\infty = 0$):

$$\|f(t,\cdot)\|_{L^2(\rho_\infty)} \le C\exp(-c\sqrt{m}\,t)\,\|f(0,\cdot)\|_{L^2(\rho_\infty)}$$

• The $\mathcal{O}(\sqrt{m})$ convergence rate is optimal, as can be seen when $U$ is quadratic (Gaussian target), where an explicit calculation can be done; see the sketch after this list;
• First result in the literature with the sharp $\sqrt{m}$ convergence rate (an acceleration compared with the overdamped dynamics, whose rate is $m$);
• Convergence in $L^2$ implies convergence of the density in $\chi^2$-divergence, and thus in relative entropy and total variation distance, with $\mathcal{O}(\sqrt{m})$ rate;
• The constant $C > 1$ is unavoidable, since the operator $\mathcal{L}$ is not coercive (more on the next slide).
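To illustrate the optimality claim in the Gaussian case (an added sketch, not from the slides): for $U(x) = \frac{m}{2}|x|^2$ the linear Langevin dynamics has drift matrix $A = [[0, 1], [-m, -\gamma]]$ per coordinate pair $(x, v)$, and its spectral abscissa gives the exponential decay rate, which is maximized at $\gamma = 2\sqrt{m}$ with value $\sqrt{m}$.

```python
import numpy as np

def gaussian_decay_rate(m, gamma):
    """Exponential decay rate for U(x) = m |x|^2 / 2: minus the largest real part
    of the eigenvalues of the drift matrix [[0, 1], [-m, -gamma]]."""
    A = np.array([[0.0, 1.0], [-m, -gamma]])
    return -np.max(np.linalg.eigvals(A).real)

m = 0.04                                   # example strong-convexity constant
gammas = np.linspace(0.05, 5.0, 400)
rates = np.array([gaussian_decay_rate(m, g) for g in gammas])
print("best gamma:", gammas[rates.argmax()], " vs 2*sqrt(m) =", 2 * np.sqrt(m))
print("best rate: ", rates.max(), " vs sqrt(m)   =", np.sqrt(m))
```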

Page 19:

Kinetic Fokker-Planck equation

$$\partial_t f = \mathcal{L} f = (\mathcal{L}_{\mathrm{ham}} + \gamma\,\mathcal{L}_{\mathrm{FD}})f; \qquad f(0,x,v) = f_0(x,v),$$

where

$$\mathcal{L}_{\mathrm{ham}} = v\cdot\nabla_x - \nabla_x U\cdot\nabla_v \quad\text{and}\quad \mathcal{L}_{\mathrm{FD}} = \Delta_v - v\cdot\nabla_v.$$

The operator is hypo-elliptic [Hörmander 1967] (as the diffusion is degenerate in the $x$ direction). In particular, we cannot hope for the exponential convergence to follow from a Poincaré (coercivity) estimate for $\mathcal{L}$; hypo-coercivity needs to be established.

Page 20:

Kinetic Fokker-Planck equation

$$\partial_t f = \mathcal{L} f = (\mathcal{L}_{\mathrm{ham}} + \gamma\,\mathcal{L}_{\mathrm{FD}})f; \qquad f(0,x,v) = f_0(x,v),$$

where

$$\mathcal{L}_{\mathrm{ham}} = v\cdot\nabla_x - \nabla_x U\cdot\nabla_v \quad\text{and}\quad \mathcal{L}_{\mathrm{FD}} = \Delta_v - v\cdot\nabla_v.$$

The analysis of kinetic Fokker-Planck equations dates back to [Kolmogorov 1934]. Related previous results on convergence:
• Convergence in $H^1$ norm [Villani 2009];
• Convergence in a modified $L^2$ norm [Dolbeault, Mouhot, Schmeiser 2009; 2015] (also an earlier idea from [Hérau 2006]); this was applied to the kinetic Fokker-Planck equation by [Roussel, Stoltz 2018], which gives explicit, though not sharp, constants;
• Convergence in Wasserstein distance: Bakry-Émery framework [Baudoin 2016]; coupling approaches [Eberle, Guillin, Zimmer 2019; Cheng, Chatterji, Bartlett, Jordan 2018; Dalalyan, Riou-Durand 2018].

Page 21:

Kinetic Fokker-Planck equation

$$\partial_t f = \mathcal{L} f = (\mathcal{L}_{\mathrm{ham}} + \gamma\,\mathcal{L}_{\mathrm{FD}})f; \qquad f(0,x,v) = f_0(x,v),$$

where

$$\mathcal{L}_{\mathrm{ham}} = v\cdot\nabla_x - \nabla_x U\cdot\nabla_v \quad\text{and}\quad \mathcal{L}_{\mathrm{FD}} = \Delta_v - v\cdot\nabla_v.$$

Our analysis method was inspired by a recent variational framework [Armstrong, Mourrat 2019] for the kinetic FP equation on a compact domain without potential, which implicitly used the bracket condition dating back to [Hörmander 1967]. As $\mathcal{L}$ is not coercive, the idea is to augment the state space by a time interval $I = (0,T)$ equipped with the Lebesgue measure $\lambda$: over time, the diffusion in the $v$ direction propagates to the $x$ direction.

Page 22:

Kinetic Fokker-Planck equation

$$\partial_t f = \mathcal{L} f = (\mathcal{L}_{\mathrm{ham}} + \gamma\,\mathcal{L}_{\mathrm{FD}})f; \qquad f(0,x,v) = f_0(x,v),$$

where

$$\mathcal{L}_{\mathrm{ham}} = v\cdot\nabla_x - \nabla_x U\cdot\nabla_v \quad\text{and}\quad \mathcal{L}_{\mathrm{FD}} = \Delta_v - v\cdot\nabla_v.$$

Let $\kappa$ be the standard Gaussian measure in velocity (so that $\rho_\infty(\mathrm{d}x\,\mathrm{d}v) = \mu_\infty(\mathrm{d}x)\,\kappa(\mathrm{d}v)$). The exponential convergence follows by splitting time into intervals of length $T = 1/\sqrt{m}$ and applying the inequality below on each interval.

Theorem (Poincaré inequality in the time-augmented state space)

$$\|f - (f)_{I\times\rho_\infty}\|_{L^2(I\times\mathbb{R}^{2d};\,\lambda\otimes\rho_\infty)} \lesssim \Bigl(1 + \frac{1}{T\sqrt{m}}\Bigr)\,\|\nabla_v f\|_{L^2(I\times\mathbb{R}^{2d};\,\lambda\otimes\rho_\infty)} + \Bigl(\frac{1}{\sqrt{m}} + T\Bigr)\,\|\partial_t f - \mathcal{L}_{\mathrm{ham}} f\|_{H^{-1}(I\times\mathbb{R}^{2d};\,\lambda\otimes\rho_\infty)},$$

where $(f)_{I\times\rho_\infty} := \int_I\!\int f(t,x,v)\,\mathrm{d}t\,\mathrm{d}\rho_\infty$.

Page 23:

Convergence to equilibrium of Langevin dynamics (joint with Yu Cao and Lihan Wang)

Generative models for high dimensional probability distributions (joint with Yulong Lu)

Page 24:

Generative models: transform a simple probability distribution $p$ (such as a Gaussian), often by a neural network $g$, into the desired probability distribution:

$$\pi \approx g_{\#}\,p.$$

Question: Given source and target distributions, can one construct a DNN to approximate the push-forward map? This is a distributional version of the question of universal approximation.

cf. [Lee, Ge, Ma, Risteski, Arora 2017]

To make the question more quantitative, we need to specify the metric used to measure the difference between two distributions.

Integral probability metric (IPM):

$$d_{\mathcal{F}}(p, \pi) := \sup_{f\in\mathcal{F}} \bigl|\,\mathbb{E}_{X\sim p}\,f(X) - \mathbb{E}_{Y\sim\pi}\,f(Y)\,\bigr|,$$

where $f$ can be understood as a discriminator.

Page 26:

Integral probability metric (IPM):

$$d_{\mathcal{F}}(p, \pi) := \sup_{f\in\mathcal{F}} \bigl|\,\mathbb{E}_{X\sim p}\,f(X) - \mathbb{E}_{Y\sim\pi}\,f(Y)\,\bigr|.$$

1-Wasserstein distance: $\mathcal{F}$ = 1-Lipschitz functions,

$$\mathcal{W}_1(p, \pi) = \inf_{\gamma\in\Gamma(p,\pi)} \int |x - y|\,\gamma(\mathrm{d}x\,\mathrm{d}y).$$

Maximum mean discrepancy (MMD): $\mathcal{F}$ = unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$,

$$\mathrm{MMD}(p, \pi) = \sup_{\|f\|_{\mathcal{H}}\le 1} \mathbb{E}_{X\sim p}\,f(X) - \mathbb{E}_{Y\sim\pi}\,f(Y).$$

Kernelized Stein discrepancy (KSD): $\mathcal{F} = \{\mathcal{T}_\pi f \mid f\in\mathcal{H}^d,\ \|f\|_{\mathcal{H}^d}\le 1\}$,

$$\mathrm{KSD}(p, \pi) = \sup_{\|f\|_{\mathcal{H}^d}\le 1} \mathbb{E}_{X\sim p}\,\mathcal{T}_\pi f(X),$$

where $\mathcal{T}_\pi$ is the Stein operator (for $f:\mathbb{R}^d\to\mathbb{R}^d$)

$$\mathcal{T}_\pi f := \nabla\log\pi\cdot f + \nabla\cdot f.$$
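As a concrete instance of one of these metrics (an addition, not from the slides), here is the standard plug-in estimator of $\mathrm{MMD}^2$ between two samples with a Gaussian RBF kernel; the bandwidth and sample sizes are placeholder choices.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-|x - y|^2 / (2 h^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimator of MMD^2 between samples X ~ p and Y ~ pi."""
    return (rbf_kernel(X, X, bandwidth).mean()
            + rbf_kernel(Y, Y, bandwidth).mean()
            - 2.0 * rbf_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(0)
X1 = rng.standard_normal((500, 2))          # samples from p
X2 = rng.standard_normal((500, 2))          # independent samples from p
Y = rng.standard_normal((500, 2)) + 1.0     # samples from a shifted pi
print(mmd2(X1, X2), mmd2(X1, Y))            # near zero vs. clearly positive
```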

Page 27:

Theorem (L.-Lu 2020)
For a source distribution $p$ absolutely continuous with respect to the Lebesgue measure, there exists a fully connected and feed-forward deep neural network $u$ with depth $L = \lceil\log_2 n\rceil$ and width $N = \mathcal{O}(n)$ such that

$$d_{\mathcal{F}}\bigl((\nabla u)_{\#}\,p,\ \pi\bigr) \le \varepsilon,$$

where the complexity $n$ depends on the choice of metric.
• If $d_{\mathcal{F}} = \mathcal{W}_1$ and $\pi$ has finite third moment,

$$n \lesssim \begin{cases} \varepsilon^{-2}, & d = 1;\\ \log^2(\varepsilon)\,\varepsilon^{-2}, & d = 2;\\ \varepsilon^{-d}, & d \ge 3.\end{cases}$$

Page 28:

Theorem (L.-Lu 2020)
For a source distribution $p$ absolutely continuous with respect to the Lebesgue measure, there exists a fully connected and feed-forward deep neural network $u$ with depth $L = \lceil\log_2 n\rceil$ and width $N = \mathcal{O}(n)$ such that

$$d_{\mathcal{F}}\bigl((\nabla u)_{\#}\,p,\ \pi\bigr) \le \varepsilon,$$

where the complexity $n$ depends on the choice of metric.
• If $d_{\mathcal{F}} = \mathrm{MMD}$ with a kernel $k$ such that $\sup_x |k(x,x)| \lesssim 1$,

$$n \lesssim \varepsilon^{-2}.$$

• If $d_{\mathcal{F}} = \mathrm{KSD}$ with a kernel $k$ (plus some technical assumptions), $\pi$ sub-Gaussian, and $\nabla\log\pi(x)$ Lipschitz,

$$n \lesssim d\,\varepsilon^{-2}.$$

Remark: Due to regularity, it is better to parameterize $u$ by a NN and push forward with $\nabla u$ than to parameterize the transport map directly.

Page 29:

Proof sketch: The overall strategy is similar to standard approximation theory: take a sieve for the whole space and construct an approximation for elements of the sieve.
• Approximate $\pi$ by the empirical measure of i.i.d. samples $X_i \sim \pi$:

$$\pi \approx \pi_n = \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}.$$

Wasserstein: [Fournier, Guillin 2015; Weed, Bach 2019; Lei 2020]; MMD: [Sriperumbudur 2016]; KSD: our work. (A small numerical illustration of the empirical rate follows after this list.)
• Push forward the source distribution $p$ to the empirical distribution $\pi_n$; this is based on the theory of semi-discrete optimal transport.
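To make the first step tangible (an added sketch, not from the slides): in one dimension the 1-Wasserstein distance between two equal-size samples has a closed form via sorting, which lets one eyeball the $n^{-1/2}$ scaling for a Gaussian target; comparing against a second independent sample is a crude proxy for $\mathcal{W}_1(\pi_n, \pi)$ with the same scaling.

```python
import numpy as np

def w1_empirical_1d(x, y):
    """1-Wasserstein distance between two equal-size 1D empirical measures:
    the mean absolute difference of sorted values (monotone coupling is optimal)."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

rng = np.random.default_rng(0)
for n in [100, 400, 1600, 6400]:
    vals = [w1_empirical_1d(rng.standard_normal(n), rng.standard_normal(n))
            for _ in range(50)]
    print(n, np.mean(vals), np.mean(vals) * np.sqrt(n))  # last column roughly constant
```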

Page 30:

Monge formulation of optimal transport with quadratic loss:

$$\inf_{T:\,\mathbb{R}^d\to\mathbb{R}^d} \int \frac{1}{2}|x - T(x)|^2\,\mu(\mathrm{d}x) \quad\text{s.t.}\quad T_{\#}\mu = \nu.$$

Kantorovich formulation of optimal transport:

$$\mathcal{W}_2^2(\mu, \nu) = \inf_{\gamma\in\Gamma(\mu,\nu)} \int \frac{1}{2}|x - y|^2\,\gamma(\mathrm{d}x\,\mathrm{d}y).$$

The Kantorovich dual formulation:

$$\mathcal{W}_2^2(\mu, \nu) = \sup_{\substack{\varphi\in L^1(\mu),\ \psi\in L^1(\nu)\\ \varphi(x)+\psi(y)\le\frac{1}{2}|x-y|^2}} \int\varphi\,\mathrm{d}\mu + \int\psi\,\mathrm{d}\nu = \sup_{\psi\in L^1(\nu)} \int\psi^c\,\mathrm{d}\mu + \int\psi\,\mathrm{d}\nu,$$

where $\psi^c$ is given by the $c$-transform:

$$\psi^c(x) = \inf_{y\in\mathbb{R}^d} \frac{1}{2}|x - y|^2 - \psi(y).$$

Page 32:

Monge formulation of optimal transport with quadratic loss:

$$\inf_{T:\,\mathbb{R}^d\to\mathbb{R}^d} \int \frac{1}{2}|x - T(x)|^2\,\mu(\mathrm{d}x) \quad\text{s.t.}\quad T_{\#}\mu = \nu.$$

The Kantorovich dual formulation:

$$\mathcal{W}_2^2(\mu, \nu) = \sup_{\psi\in L^1(\nu)} \int\psi^c\,\mathrm{d}\mu + \int\psi\,\mathrm{d}\nu,$$

where $\psi^c$ is given by the $c$-transform:

$$\psi^c(x) = \inf_{y\in\mathbb{R}^d} \frac{1}{2}|x - y|^2 - \psi(y).$$

Theorem (Brenier 1987)
Assume that $\mu$ is absolutely continuous with respect to the Lebesgue measure. Then there exists a convex function $\bar{u}:\mathbb{R}^d\to\mathbb{R}$ such that the optimal transport map is

$$T(x) = \nabla\bar{u}(x), \quad \mu\text{-a.e. } x.$$

In fact, $\bar{u}(x)$ can be chosen as $\frac{|x|^2}{2} - \varphi(x)$, where $\varphi$ is the optimal Kantorovich potential on the $\mu$ side (i.e., $\varphi = \psi^c$ at optimality).

Page 33:

Semi-discrete optimal transport:
• Source measure $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ absolutely continuous: $\mathrm{d}\mu = \rho\,\mathrm{d}x$;
• Target measure $\nu = \sum_{j=1}^{n} \nu_j\,\delta_{y_j}$.

In the discrete case, the Kantorovich dual becomes the following functional depending on $(\psi_1,\ldots,\psi_n)$:

$$\mathcal{F}(\psi_1,\cdots,\psi_n) = \int \psi^c(x)\,\rho(x)\,\mathrm{d}x + \sum_{j}\psi_j\,\nu_j = \int \inf_{j}\Bigl(\frac{1}{2}|x - y_j|^2 - \psi_j\Bigr)\rho(x)\,\mathrm{d}x + \sum_{j}\psi_j\,\nu_j.$$

Theorem (Gu-Luo-Sun-Yau 2016 for a compact domain; L.-Lu 2020)
The optimal transport map $T$ is given by, for some $m_1,\ldots,m_n\in\mathbb{R}$,

$$T(x) = \nabla\bar{u}(x) := \nabla\max_{j}\{x\cdot y_j + m_j\}.$$

In particular, $\bar{u}$ is piecewise linear and can be easily approximated by a ReLU NN.
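As an illustration of this structure (an added sketch, not from the slides): the map $T(x) = \nabla\max_j\{x\cdot y_j + m_j\}$ sends $x$ to the atom $y_j$ attaining the maximum, and the offsets $m_j$ can be fitted by a crude stochastic ascent on the semi-discrete dual (a placeholder for proper semi-discrete OT solvers); the atoms, weights, step size, and iteration counts below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder target: n atoms y_j with uniform weights nu_j = 1/n.
n, d = 8, 2
ys = rng.standard_normal((n, d))
nu = np.full(n, 1.0 / n)

# Fit the offsets m_j: the dual gradient in m_j is nu_j - mu(cell_j), where
# cell_j = {x : x . y_j + m_j is maximal}; estimate mu(cell_j) with samples x ~ mu = N(0, I).
m = np.zeros(n)
for _ in range(2000):
    x = rng.standard_normal((256, d))                 # minibatch from the source mu
    j_star = np.argmax(x @ ys.T + m, axis=1)          # cell assignment
    cell_freq = np.bincount(j_star, minlength=n) / x.shape[0]
    m += 0.5 * (nu - cell_freq)                       # ascent step

# Push forward fresh source samples: T(x) = y_{argmax_j (x . y_j + m_j)}.
x_test = rng.standard_normal((20000, d))
freq = np.bincount(np.argmax(x_test @ ys.T + m, axis=1), minlength=n) / 20000
print(np.round(freq, 3))                              # approximately the weights nu
```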

Page 35:

• Underdamped Langevin dynamics has faster convergence to equilibrium than the overdamped dynamics for convex potentials;
• Universal approximation without curse of dimensionality for generative models of distributions.

Thank you! Any questions?

Email: [email protected]

url: http://www.math.duke.edu/~jianfeng/

References:
• with Yu Cao and Lihan Wang, On explicit $L^2$-convergence rate estimate for underdamped Langevin dynamics, arXiv:1908.04746
• with Yulong Lu, A universal approximation theorem of deep neural networks for expressing distributions, arXiv:2004.08867