
Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations

Piotr Białas^a, Piotr Korcyl^b, Tomasz Stebel^b

^a Institute of Applied Computer Science, Jagiellonian University, ul. Łojasiewicza 11, 30-348 Kraków, Poland
^b Institute of Theoretical Physics, Jagiellonian University, ul. Łojasiewicza 11, 30-348 Kraków, Poland

Abstract

We provide a deepened study of autocorrelations in Neural Markov Chain Monte Carlo simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We propose several estimates of autocorrelation times, some inspired by analytical results derived for the Metropolized Independent Sampler, which we compare and study as a function of inverse temperature β. Based on that we propose an alternative loss function and study its impact on the autocorrelation times. Furthermore, we investigate the impact of imposing system symmetries (Z2 and/or translational) in the neural network training process on the autocorrelation times. Eventually, we propose a scheme which incorporates partial heat-bath updates. The impact of the above enhancements is discussed for a 16 × 16 spin system. The summary of our findings may serve as a guidance to the implementation of Neural Markov Chain Monte Carlo simulations of more complicated models.

1. Introduction

The original idea behind Monte Carlo simulations, proposed by Stanislaw Ulam and Enrico Fermi and later expressed in the form of an algorithm applied to a simple classical statistical mechanics problem by Metropolis et al. [1], turned out to be a very powerful approach for tackling a large variety of problems in all computational sciences (see for example [2]). In the case of complicated probability distributions most formulations resort to the construction of an associated Markov chain of consecutive proposals. The statistical uncertainty of any outcome of a Monte Carlo simulation depends directly on the number of statistically independent configurations used to estimate it. Hence, the effectiveness of simulation algorithms is quantified by the autocorrelation time, which measures how many configurations are produced by the algorithm before a new, statistically independent configuration appears. Increasing autocorrelation times, a phenomenon called critical slowing down, is usually the main factor which limits the statistical precision of outputs. In the context of field theory simulations in elementary particle physics several proposals were advanced in order to alleviate that problem, in particular metadynamics [3, 4], instanton updates [5] or multiscale thermalization [6].

The recent great interest in machine learning techniques has also provided new ideas in the domain of Monte Carlo simulations which aim at reducing the autocorrelation times. The ability of artificial neural networks to approximate a very wide class of probability distributions was used in Ref. [7] to propose a variational estimate of the free energy of statistical systems. Subsequently the idea was extended and used as a mechanism for providing uncorrelated proposals in a Monte Carlo simulation in Ref. [8]. The Authors call the resulting algorithm Neural Markov Chain Monte Carlo (NMCMC).

Email addresses: [email protected] (Piotr Białas), [email protected] (Piotr Korcyl), [email protected] (Tomasz Stebel)

Preprint submitted to Elsevier January 13, 2022

arXiv:2111.10189v2 [cond-mat.stat-mech] 11 Jan 2022


Since the proposal of the NMCMC algorithm a few studies have discussed its effectiveness. In particular, the Authors of Ref. [9] applied a similar method to spin glass models, whereas in Ref. [10] an alternative update mechanism was proposed. In the present study, we concentrate on the classical test bench provided by the two-dimensional Ising model in order to stay in a simple and well understood physical situation and focus our attention on the algorithmic effectiveness of the approach. We quantify the latter using several possible measures of autocorrelation times and compare them with the original Metropolis algorithm [1] and with the cluster algorithm proposed by Wolff [11, 12]. The Ising model also offers an exact solution for the partition function of the finite-size system [13], from which the free energy can be obtained and which provides measures of how well the neural network approximates the desired probability distribution.

In the following we start in Section 1.1 by briefly defining the model; in Section 1.2 we present the Neural Markov Chain Monte Carlo (NMCMC) algorithm using Autoregressive Networks (AN) [8]. In Section 2 we summarize the known analytical results for the autocorrelation time in a simulation using the Metropolized Independent Sampler (MIS) derived by Liu in Ref. [14] and discuss them in the present setup. Subsequently, in Section 3 we define several possible estimators for the autocorrelation time. In Section 4 we collect two loss functions used in the present context in the literature and supplement them with a new loss function inspired by the analytic insight discussed in Section 2. The results of our numerical experiments are divided into two parts: Sections 5 and 6. In the former part we use a small system size which enables us to estimate all analytical quantities of interest and provide comparisons with their approximated counterparts defined on a small number of configurations contained in a batch. In the latter part we use a larger system size in order to check the algorithmic effectiveness for more realistic system sizes. Finally, in the remaining Section 7 we describe two techniques which improve the quality of the training of the neural network. First, we exploit the symmetries: in Section 7.1 we monitor the effect of enforcing the Z2 and translational symmetries during the training process. Second, in Section 7.2, we use the locality of the nearest-neighbour interactions of the Hamiltonian of the Ising model to reduce the number of relevant spins that have to be modelled by the neural network. By using the so-called chessboard partitioning, half of the spins have to be generated by the neural network while the remaining half can be obtained exactly from the local Boltzmann probability distribution, providing a heat-bath type of update. We provide a summary of our results and an outlook in Section 8.

1.1. Ising model

We consider the two-dimensional ferromagnetic Ising model with N = L^2 spins located on an L × L square lattice. We always compute observables on configurations coming from a Markov chain, hence labelled by a "Monte Carlo" time. We denote the k-th configuration in the chain by s_k. The spins within a single configuration are labelled by a linear upper index i; thus the i-th spin of the k-th configuration is s_k^i. Each state of the system is thus a configuration of spins s_k = (s_k^1, \ldots, s_k^N), with s_k^i = ±1. The energy of a configuration s_k is given by the Hamiltonian

H(s_k) = -J \sum_{\langle i,j \rangle} s_k^i s_k^j ,    (1)

where the sum runs over all neighbouring pairs of lattice sites. In the following we set the coupling constant J = 1 without loss of generality. We assume periodic boundary conditions for the lattice.

The corresponding Boltzmann distribution is then given by

p(s_k) = \frac{1}{Z} \exp(-βH(s_k)) ,    (2)

with the partition function Z = \sum_{s} \exp(-βH(s)), where the sum runs over all M = 2^N states of the system.
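Equations (1) and (2) can be sketched in a few lines of numpy; this is an illustrative implementation (not the authors' code), where `np.roll` implements the periodic boundary conditions by coupling each site to its right and lower neighbour so that every bond is counted exactly once:

```python
import numpy as np

def ising_energy(s, J=1.0):
    """Energy H(s) of an L x L configuration s with entries +/-1, Eq. (1),
    with periodic boundary conditions; each bond is counted once."""
    return -J * np.sum(s * np.roll(s, 1, axis=0) + s * np.roll(s, 1, axis=1))

def boltzmann_weight(s, beta):
    """Unnormalised Boltzmann weight exp(-beta H(s)) of Eq. (2);
    the partition function Z is intractable for large N."""
    return np.exp(-beta * ising_energy(s))
```

For the fully ordered 4 × 4 configuration the energy is -2N = -32, and flipping a single spin breaks four bonds, raising the energy by 8.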

It is a well known fact that, in the infinite volume limit, the model undergoes a second-order phase transition at β_c = \frac{1}{2} \ln(1 + \sqrt{2}) ≈ 0.441. The corresponding order parameter is the absolute magnetization per spin,

m = \left\langle \frac{1}{N} \left| \sum_{i=1}^{N} s^i \right| \right\rangle_p = \sum_{k=1}^{M} p(s_k) \left( \frac{1}{N} \left| \sum_{i=1}^{N} s_k^i \right| \right) ≈ \frac{1}{N_{batch}} \sum_{k=1}^{N_{batch}} \frac{1}{N} \left| \sum_{i=1}^{N} s_k^i \right| ,    (3)


where the last expression is an average over N_batch ≫ 1 configurations sampled from the distribution p. In a finite volume the system exhibits a crossover at β_c(L), i.e. the value of β at given L where |∂m/∂β| is maximal. We assume that the finite-volume system is in an ordered state when β > β_c(L) and in a disordered state for β < β_c(L).
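The batch estimate on the right-hand side of Eq. (3) amounts to one line of numpy; a minimal sketch (illustrative, not the authors' code), assuming `batch` holds N_batch configurations of N spins each:

```python
import numpy as np

def magnetization_per_spin(batch):
    """Batch estimate of the order parameter m, Eq. (3).
    `batch` has shape (N_batch, N) with entries +/-1."""
    return np.mean(np.abs(batch.mean(axis=1)))
```

A fully ordered batch gives m = 1, while a batch of configurations with equally many up and down spins gives m = 0.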

1.2. Neural Markov Chain Monte Carlo (NMCMC)

In the approach discussed in this paper, we use an artificial neural network to model the probability p(s). In order to do so we must first recast p(s) in a form that ensures proper normalisation and allows for easy sampling from the learned distribution. To this end we rewrite p(s_k) as a product of conditional probabilities of consecutive spins, i.e.

p(s_k) = p(s^1 = s_k^1) \prod_{i=2}^{N} p(s^i = s_k^i \,|\, s^1 = s_k^1, s^2 = s_k^2, \ldots, s^{i-1} = s_k^{i-1}) .    (4)

Note that the factorization of the total probability of the configuration into N conditional probabilities requires a unique ordering of all spins on the lattice. However, the precise form of the ordering does not matter as long as it is maintained during the entire simulation. For definiteness, in our implementation we use lexicographic ordering on the two-dimensional lattice, with the x coordinate running faster than the y coordinate.

Now we can use an artificial neural network to model the conditional probabilities p(s^i|·). In the simplest realization, the neural network has N input neurons and N output neurons. The value of the output of the i-th neuron is interpreted as

q_θ(s^i = 1 \,|\, s^1, s^2, \ldots, s^{i-1}) .    (5)

In order to ensure that the conditional probabilities depend only on the values of the preceding spins, a particular neural network architecture is used, the so-called autoregressive network. Such networks are built out of dense or convolutional layers in which connections are only allowed to the neurons associated with the preceding spins. The details of such a construction can be found in Ref. [7].

It is the purpose of the training step to tune the weights of the neural network, collectively denoted as θ, so as to approximate p(s) as closely as possible. This is accomplished by looking for the minimum of a loss function, usually the Kullback–Leibler divergence, which quantifies the difference between two probability distributions. In the following we also discuss other possible choices and their interpretation.

We can easily generate configurations from the q_θ(s) distribution by generating subsequent spins from the conditional probabilities q_θ(s^i | s^1, \ldots, s^{i-1}). This requires N sequential calls to the q_θ(s|·) functions, i.e. making N forward passes through the neural network. Note that this procedure does not rely in any place on a preceding spin configuration; hence it provides a mechanism for generating uncorrelated configurations from the q_θ distribution, avoiding the construction of a Markov chain.
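The sequential sampling just described can be sketched as follows. Here `toy_conditional` is a hypothetical stand-in for the i-th network output of Eq. (5); in a real implementation each call would be one forward pass through the autoregressive network:

```python
import numpy as np

def toy_conditional(prefix):
    """Stand-in for q_theta(s^i = 1 | s^1, ..., s^{i-1}); a real
    implementation would evaluate the autoregressive network here."""
    return 1.0 / (1.0 + np.exp(-0.5 * np.sum(prefix)))

def sample_autoregressive(n_spins, conditional, rng):
    """Draw one configuration spin by spin from the conditionals of Eq. (4),
    accumulating log q_theta(s) along the way (N sequential calls)."""
    spins = np.zeros(n_spins)
    log_q = 0.0
    for i in range(n_spins):
        p_up = conditional(spins[:i])      # i-th "forward pass"
        up = rng.random() < p_up
        spins[i] = 1.0 if up else -1.0
        log_q += np.log(p_up if up else 1.0 - p_up)
    return spins, log_q
```

Each draw is independent of all previous configurations, which is the key property exploited by the NMCMC algorithm.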

In the end we are interested in spin configurations generated with the target Boltzmann probability distribution. Having a neural network generating states according to the distribution q_θ, we can correct the difference between p(s) and q_θ(s) with an accept/reject step, accepting the proposed state s_{k+1} with probability

\min\left(1, \frac{p(s_{k+1})\, q_θ(s_k)}{p(s_k)\, q_θ(s_{k+1})}\right) .    (6)

This is known as the Metropolized Independent Sampler [14, 15, 16, 17]. The fraction of accepted propositions is another measure of the difference between q_θ and p. When q_θ = p the acceptance ratio is 1. When the two distributions are not identical, some propositions are rejected, which leads to repetitions in the Markov chain. Therefore, even though the propositions are independent from each other, a nonzero autocorrelation is possible. The range of those correlations is determined by the eigenvalues and eigenvectors of the transition matrix of the Markov chain, which are the subject of the next section.
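The accept/reject step of Eq. (6) can be sketched as follows (an illustrative implementation, not the authors' code). Working with logarithms avoids overflow, and log p may be unnormalised since Z cancels in the ratio:

```python
import numpy as np

def metropolized_independent_sampler(proposals, log_p, log_q, rng):
    """Turn a stream of independent proposals into a Markov chain with
    stationary distribution p via the accept/reject step of Eq. (6).
    `log_p`, `log_q` are per-proposal arrays of log p (possibly
    unnormalised) and log q_theta."""
    log_w = log_p - log_q              # log importance ratios
    chain = [0]                        # indices into `proposals`
    for k in range(1, len(proposals)):
        if np.log(rng.random()) < log_w[k] - log_w[chain[-1]]:
            chain.append(k)            # accept the new proposal
        else:
            chain.append(chain[-1])    # reject: repeat the previous state
    return [proposals[k] for k in chain]
```

When q_θ = p all importance ratios are equal and every proposal is accepted, so the chain reproduces the proposal stream without repetitions.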


2. Analytic results for the eigenvalues of transition matrix

For the sake of completeness we include some results from Ref. [14] and use this opportunity to set up the notation. We consider two probability distributions, q(s) and p(s), defined on some set X. The distribution p(s) is the target probability distribution. Using the Metropolized Independent Sampler described in the previous section to sample from p, we generate a configuration s_{k+1} from q and accept it with probability

P(s_k → s_{k+1}) = \min\left\{1, \frac{w(s_{k+1})}{w(s_k)}\right\} ,    (7)

where the importance ratio w(s) is defined for each configuration s ∈ X by

w(s) = \frac{p(s)}{q(s)} .    (8)

Since p(s), q(s) > 0, it follows that 0 < w(s) < ∞.¹

The main result of Liu presented in Ref. [14] provides analytic formulae for the eigenvalues of the transition matrix of the resulting Markov chain in terms of the importance ratios. For a system with a finite number M of elements in X one can order the importance ratios w(s) in decreasing order (countable and continuous sets X were also discussed in [14]), so we adopt a labeling of configurations s_i (which changes if p or q is modified) such that the corresponding importance ratios satisfy

w(s_1) ≥ w(s_2) ≥ \cdots ≥ w(s_M) .    (9)

Direct diagonalization of the transition matrix of the Markov chain gives its eigenvalues λ_k:

λ_0 = 1 ,  λ_k = \sum_{i=k}^{M} q(s_i) \left(1 - \frac{w(s_i)}{w(s_k)}\right)  for 0 < k ≤ M-1 .    (10)

The eigenvalue λ_0 = 1 corresponds to the stationary state, and all remaining eigenvalues λ_{k>0} are nonnegative and smaller than 1. Due to the introduced ordering of importance ratios (9) and the fact that all w are positive, it follows that 1 > λ_1 > λ_2 > \cdots > λ_{M-1} > 0.
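Equation (10) is straightforward to evaluate for a toy state space; the following sketch (illustrative, with made-up distributions p and q) sorts the states by decreasing importance ratio as in Eq. (9) and applies the formula directly:

```python
import numpy as np

def transition_eigenvalues(p, q):
    """Eigenvalues of the MIS transition matrix, Eq. (10): sort states by
    decreasing w = p/q, then lambda_k = sum_{i>=k} q_i (1 - w_i / w_k)."""
    w = p / q
    order = np.argsort(-w)             # decreasing importance ratios, Eq. (9)
    w, q = w[order], q[order]
    M = len(p)
    lam = np.ones(M)                   # lambda_0 = 1 (stationary state)
    for k in range(1, M):
        lam[k] = np.sum(q[k - 1:] * (1.0 - w[k - 1:] / w[k - 1]))
    return lam
```

Since ⟨w⟩_q = 1, the largest nontrivial eigenvalue reduces to λ_1 = 1 − 1/w(s_1), which the test below checks, along with the ordering 1 = λ_0 > λ_1 > … ≥ 0 and the fact that all λ_{k>0} vanish when q = p.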

In the case of the Neural Markov Chain Monte Carlo algorithm, the sampling distribution q(s) is encoded in the neural network and therefore depends on all the weights θ of the network, q(s) = q_θ(s). The aim of the training process is to make q_θ(s) as close to p(s) as possible, thus driving the importance ratios closer to 1 and hence decreasing the values of λ_{k>0}. This in turn translates into a reduction of the autocorrelation times, as explained in more detail in the next section.

In Fig. 1 we show histograms of all eigenvalues of the transition matrix smaller than 1, given by Eq. (10), for the Ising model of size 4 × 4 (so 2^16 − 1 = 65535 eigenvalues in total). The temperature was chosen as β = 0.6 and the D_KL loss function was employed (see Section 4). The interval (0, 1) was divided into 100 bins.

The left histogram shows the situation when the network is in its initial state (1 epoch); the right panel shows the histogram for the trained network. At the beginning of the simulation (1 epoch) the eigenvalues are distributed randomly, with fluctuations coming from the random initial network weights. During the training process the eigenvalues decrease, as expected, since q_θ approaches the distribution p. The largest eigenvalue, λ_1, is around 0.75 when the network is fully trained.

3. Autocorrelation times

In order to quantitatively discuss the algorithmic effectiveness we estimate and compare the associated autocorrelation times. In practice there exist several ways to define the relevant autocorrelation time.

¹ In principle q(s) could become zero for some s, but then this configuration would never be proposed as the trial configuration, thus avoiding the problem of infinite w(s).


Figure 1: Histograms of the M − 1 eigenvalues of the transition matrix, λ_{k>0}, for the 4 × 4 system. Training was performed using the D_KL loss function. The left panel corresponds to the initial state of the network (1 epoch), the right panel to the fully trained network (30000 epochs). The green line denotes the value M − 1 = 2^16 − 1 = 65535.

On one hand, one can focus on the transition matrix of the Markov chain and consider the analytically known or approximated autocorrelation time extracted from the eigenvalues of that matrix (see Section 2). On the other hand, one can take the perspective of a particular observable of interest and estimate the autocorrelation time of that observable. In certain limiting situations all mentioned approaches should provide comparable estimates of the same, say longest, autocorrelation time. However, in the case of finite Markov chains and finite statistical precision, the autocorrelation times extracted along the three ways may vary by several orders of magnitude. We explain this effect in Section 5; meanwhile we provide all needed definitions below. The autocorrelation function is defined as

Γ_O(t) = \frac{1}{\mathrm{var}\,O}\, E\left[ (O(s_i) - \bar{O})(O(s_{i+t}) - \bar{O}) \right] ,    (11)

where the expectation value is calculated with respect to all realisations of the Markov chain, and \bar{O} is the average of the observable O with respect to the equilibrium distribution of the Markov chain. The normalisation is such that Γ(0) = 1. This function can be determined solely from the properties of the eigenvectors and eigenvalues of the Markov chain transition matrix. In general the autocorrelation function Γ(t) of any observable O has the form

Γ_O(t) = \sum_{k>0} c_{O,k}^2\, λ_k^t = \sum_{k>0} c_{O,k}^2\, e^{-t/τ_k} , \quad τ_k = -(\log λ_k)^{-1} .    (12)

The coefficients c_{O,k} can be calculated using the eigenvectors of the transition matrix. One expects that for large separations t the autocorrelation function decays exponentially:

Γ_O(t) \sim e^{-t/τ_L^O}  for  t → ∞ .    (13)

Here τ_L^O is given by the largest eigenvalue λ_k for which c_{O,k} > 0, and we can define

τ_L = \sup_O τ_L^O = -(\log λ_1)^{-1} ,    (14)

where the supremum is taken over all possible observables. In the extreme case when w(s) = 1 for all s, i.e. q = p, it follows that λ_{k>0} = 0 and therefore τ_L = 0. Note that we do not refer to any particular observable in defining τ_L: it is a property of the Markov chain itself. Different observables couple to different modes of the Markov chain and exhibit different leading autocorrelation times, which however are always smaller than or equal to τ_L.

The most important measure of autocorrelations is the integrated autocorrelation time, which enters as a factor in the statistical uncertainty of any value estimated from the Markov chain. For the observable O it is defined as [18]:

τ_{O,int} = 1 + 2 \sum_{t=1}^{∞} Γ_O(t) .    (15)

In order to operate at the level of integrated autocorrelation times we can define an "integrated version" of τ_L^O:

τ_L^{O,int} ≡ 1 + 2 \sum_{t=1}^{∞} e^{-t/τ_L^O} = \frac{1 + e^{-1/τ_L^O}}{1 - e^{-1/τ_L^O}} ,    (16)

and similarly for τ_L^{int}:

τ_L^{int} = \frac{1 + e^{-1/τ_L}}{1 - e^{-1/τ_L}} .    (17)

Note that among those three theoretical autocorrelation times only τ_L^{int} can be calculated without knowing the c_{O,k} coefficients. What is more, this is possible only for very small systems where all states can be constructed explicitly. Calculating c_{O,k} is beyond the scope of this paper. Instead, we define below other estimators of integrated autocorrelation times which are calculable for systems of any size.
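The closed form in Eqs. (16) and (17) is just the summed geometric series; a one-line sketch makes this explicit and lets one verify it against the truncated sum:

```python
import numpy as np

def tau_int_from_tau_L(tau_L):
    """Eq. (17): tau_int = 1 + 2 * sum_{t>=1} exp(-t/tau_L), summed as a
    geometric series with ratio x = exp(-1/tau_L)."""
    x = np.exp(-1.0 / tau_L)
    return (1.0 + x) / (1.0 - x)
```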

3.1. Estimators of autocorrelation times

In practice we estimate Γ_O(t) from a finite sample:

Γ_O(t) ≈ \hat{Γ}_O(t) = \frac{1}{\widehat{\mathrm{var}}\,O}\, \frac{1}{N_{sample}} \sum_{i=1}^{N_{sample}} (O(s_i) - \hat{O})(O(s_{i+t}) - \hat{O}) ,    (18)

where \hat{O} and \widehat{\mathrm{var}}\,O are now the mean and variance calculated from the N_sample states. In the following we mark all quantities estimated from a finite sample with a hat. The autocorrelation function Γ_O(t) is always nonnegative, but its estimator \hat{Γ}_O(t) can become negative beyond some separation t_max due to finite-sample fluctuations. For the purpose of these studies we therefore consider the autocorrelation function only for separations smaller than t_max: \hat{Γ}_O(t), t < t_max. This immediately gives us the estimator of the integrated time of Eq. (15):

\hat{τ}_{O,int} = 1 + 2 \sum_{t=1}^{t_{max}} \hat{Γ}_O(t) .    (19)
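Equations (18) and (19) can be sketched as follows (an illustrative implementation, not the authors' code; the estimator here normalises each lag by the number of available pairs, a common variant of Eq. (18)):

```python
import numpy as np

def autocorrelation_estimators(obs):
    """Normalised autocorrelation estimator Gamma_hat(t), Eq. (18), and the
    integrated time tau_hat of Eq. (19), summed up to the first separation
    t_max at which Gamma_hat turns negative."""
    o = np.asarray(obs, dtype=float)
    o = o - o.mean()
    var = o.var()
    n = len(o)
    gamma = np.array([np.dot(o[:n - t], o[t:]) / (n - t) / var
                      for t in range(n // 2)])
    negative = np.where(gamma < 0.0)[0]
    t_max = negative[0] if negative.size else len(gamma)
    return gamma, 1.0 + 2.0 * np.sum(gamma[1:t_max])
```

For an uncorrelated sequence the estimator should return a value close to 1, while a chain in which each state is repeated several times (as happens after rejections in the MIS) yields a correspondingly larger integrated time.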

To estimate τ_L^{O,int} we fit an exponential to the autocorrelation function \hat{Γ} at large t. In order to also take into account the subleading autocorrelation time, which can contribute at moderate t, we use a two-exponential function of the form

a_1 e^{-t/τ_1} + a_2 e^{-t/τ_2} .    (20)

We have checked that introducing more terms in Eq. (20) does not change the value of the largest τ_i. We thus define \hat{τ}_F^O = \max(τ_1, τ_2), which describes the behaviour of \hat{Γ}_O(t) at large t (but t < t_max). The integrated version of \hat{τ}_F^O is

\hat{τ}_F^{O,int} = \frac{1 + e^{-1/\hat{τ}_F^O}}{1 - e^{-1/\hat{τ}_F^O}} .    (21)

The quantity \hat{τ}_F^{O,int} given by this definition is an estimator of τ_L^{O,int} defined in Eq. (16). Also the definition Eq. (14) of τ_L has an estimator that is readily available and can be monitored during the training process of the neural network. The exact value of λ_1 is not available for most systems, but it can be approximated by \hat{λ}_1 calculated from a finite sample of configurations (also called a batch). Hence, we define the "batch version" of the largest eigenvalue λ_1,

\hat{λ}_1 ≡ \frac{1}{N_{batch}} \sum_{k=1}^{N_{batch}} \left(1 - \frac{w(s_k)}{w_{max\,on\,batch}}\right) \Big|_{s_k \text{ sampled from } q_θ} ,    (22)


and correspondingly the "batch version" of the leading autocorrelation time τ_L:

\hat{τ}_B = -(\log \hat{λ}_1)^{-1} ,    (23)

which is an analogue of Eq. (14). The integrated version of \hat{τ}_B is

\hat{τ}_B^{int} ≡ \frac{1 + e^{-1/\hat{τ}_B}}{1 - e^{-1/\hat{τ}_B}} .    (24)

To conclude, we have defined here \hat{τ}_{O,int}, \hat{τ}_F^{O,int} and \hat{τ}_B^{int}, which are estimators of three versions of the theoretical integrated autocorrelation times: τ_{O,int}, τ_L^{O,int} and τ_L^{int}. In the following we choose the energy H, Eq. (1), as the observable O. We have checked that the absolute value of the magnetization gives very similar results. We estimate \hat{Γ}_O(t) at a fixed stage of the training process by generating a long, O(10^5), sequence of states. Subsequently we apply the condition Eq. (6) to consecutive states in this sequence and obtain a Markov chain whose stationary state is the Boltzmann distribution.
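Equations (22)-(24) are cheap to evaluate from the log importance ratios of a batch; a minimal sketch (illustrative, not the authors' code), where subtracting the maximum log w implements the division by w_max on batch in a numerically stable way:

```python
import numpy as np

def batch_tau(log_w):
    """Batch estimates of Eqs. (22)-(24): lambda_hat_1 from the importance
    ratios of a batch sampled from q_theta, then tau_B and tau_B^int.
    `log_w` holds log w(s_k) = log p(s_k) - log q_theta(s_k)."""
    w = np.exp(log_w - np.max(log_w))          # w / w_max_on_batch
    lam1 = np.mean(1.0 - w)                    # Eq. (22)
    tau_B = -1.0 / np.log(lam1)                # Eq. (23)
    tau_B_int = (1.0 + lam1) / (1.0 - lam1)    # Eq. (24) with exp(-1/tau_B) = lam1
    return lam1, tau_B, tau_B_int
```

Note that any unknown additive constant in log w (such as log Z) cancels after subtracting the batch maximum, so unnormalised target probabilities suffice.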

3.2. Some relations among the different definitions

Let us now comment on the hierarchy of the (integrated) autocorrelation times we have just defined: τ_{O,int}, τ_L^{O,int} and τ_L^{int}. By construction the largest of them is τ_L^{int}. As mentioned earlier, it is a property of the Markov chain, not associated with any particular observable. Hence, τ_L^{O,int} and τ_{O,int} are smaller than or equal to τ_L^{int}. In principle τ_{O,int} is equal to τ_L^{int} only if the observable O couples exclusively to the slowest mode of the Markov chain. One can show that once the observable has a nonzero coupling to another mode, τ_{O,int} is strictly smaller than τ_L^{int}. Also by definition, τ_{O,int}, being the integral of the autocorrelation function, is smaller than τ_L^{O,int}, except for the case where the observable couples to exactly one of the modes of the Markov chain; in that situation we expect τ_{O,int} = τ_L^{O,int}.

As for the estimators \hat{τ}_{O,int}, \hat{τ}_F^{O,int} and \hat{τ}_B^{int}, their hierarchy cannot be established theoretically, as they are all calculated in very different ways. However, we could expect that the ordering presented above is, at least approximately, preserved, leading to \hat{τ}_B^{int} ≥ \hat{τ}_F^{O,int} ≥ \hat{τ}_{O,int}. We present measurements of the above integrated autocorrelation times for the Ising model on 4 × 4 and 16 × 16 lattices in the following sections and find that while \hat{τ}_F^{O,int} ≥ \hat{τ}_{O,int}, \hat{τ}_B^{int} can be smaller than \hat{τ}_F^{O,int}. We would also expect all autocorrelation times to be smaller than τ_L^{int}, and this is indeed the case.

4. Neural network loss functions

We explained in the previous section that the difference between the target probability distribution p(s) and the sampling probability distribution q_θ(s) directly affects the autocorrelation times via Eq. (10). Several measures can be used to quantify the distance between two distributions; in the context of machine learning the most popular is the Kullback–Leibler divergence. Of course, in the ideal case, finding the global zero of any correctly defined measure leads to weights θ such that q_θ(s) = p(s). However, for a finite size of the neural network, and thus a limited ability of the network to model an arbitrary probability function, as well as for a limited number of training epochs, some measures may be more efficient than others, as they may lead to a q_θ(s) that gives smaller autocorrelation times. Below we provide three definitions of loss functions which we apply to the case of the two-dimensional Ising model in the following sections.

4.1. Kullback–Leibler (KL) divergence

The loss function most commonly used in the literature is defined as:

D_{KL}(q_θ|p) = \sum_{k=1}^{M} q_θ(s_k) \log\left(\frac{q_θ(s_k)}{p(s_k)}\right) .    (25)


Substituting Eq. (2) into the last equation one obtains:

D_{KL}(q_θ|p) = β(F_q - F) ,    (26)

where F = -\frac{1}{β} \log Z is the true free energy and

F_q = \frac{1}{β} \sum_{k=1}^{M} q_θ(s_k) \left[ βH(s_k) + \log q_θ(s_k) \right]    (27)

is the variational free energy. Since D_KL is a non-negative quantity, the minimization of the KL divergence is equivalent to the minimization of the variational free energy.

As far as the importance ratios are concerned, we can easily identify their presence in Eq. (25), since we can write

D_{KL}(q_θ|p) = -\sum_{k=1}^{M} q_θ(s_k) \log w(s_k) = -\langle \log w \rangle_{q_θ} .    (28)

It should be clear from this reformulation that during the training process the neural network will try to improve q_θ first for the configurations for which the importance ratio is smallest, i.e. for which q_θ(s_k) is drastically overestimated with respect to p(s_k).

Calculating the gradient of the loss function,

\frac{dD_{KL}(q_θ|p)}{dθ} = -\frac{d}{dθ} \langle \log w \rangle_{q_θ} ,    (29)

requires certain caution. The derivative in Eq. (29) acts not only on the expression being averaged but also on the distribution used for averaging. More generally,

\frac{d}{dθ} \left\langle S(s|θ) \right\rangle_{q_θ} = \frac{d}{dθ} \sum_{s} q_θ(s) S(s|θ) = \left\langle \frac{d \log q_θ}{dθ} S \right\rangle_{q_θ} + \left\langle \frac{dS}{dθ} \right\rangle_{q_θ} .    (30)

In the case of the KL divergence, S = \log q_θ - \log p, and

\left\langle \frac{dS}{dθ} \right\rangle_{q_θ} = \frac{d}{dθ} \sum_{s} q_θ(s) = 0 .    (31)

So finally

\frac{dD_{KL}(q_θ|p)}{dθ} = \left\langle \frac{d \log q_θ}{dθ} (\log q_θ - \log p) \right\rangle_{q_θ} .    (32)

As was noticed in Ref. [7], Eq. (32) can be interpreted as a REINFORCE algorithm [19] with score function -S.
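As a sanity check of Eq. (32), one can compare it against a numerical derivative on a toy model where the sums over states can be performed exactly. The two-state system below, with the parametrization q_θ(1) = sigmoid(θ) and a fixed target p, is an illustrative assumption, not the model used in the paper:

```python
import numpy as np

# Toy check of Eq. (32): a two-state system with q_theta(1) = sigmoid(theta).
p = np.array([0.3, 0.7])               # fixed target distribution

def q_of(theta):
    s = 1.0 / (1.0 + np.exp(-theta))
    return np.array([1.0 - s, s])

def d_kl(theta):
    """Eq. (25) evaluated exactly by summing over both states."""
    q = q_of(theta)
    return np.sum(q * np.log(q / p))

def grad_kl(theta):
    """Gradient via Eq. (32): < (d log q / d theta) (log q - log p) >_q."""
    q = q_of(theta)
    s = q[1]
    dlogq = np.array([-s, 1.0 - s])    # d log q(0)/d theta, d log q(1)/d theta
    return np.sum(q * dlogq * np.log(q / p))
```

The gradient from Eq. (32) agrees with a central finite difference of Eq. (25), and it vanishes at the θ for which q_θ = p.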

4.2. Largest eigenvalue of the transition matrix λ1

An alternative way to measure the difference between q_θ and p is to directly use the expression for the largest eigenvalue of the associated Markov chain transition matrix provided by Eq. (10),

D_{λ_1}(q_θ|p) ≡ λ_1 = \sum_{k=1}^{M} q_θ(s_k) \left(1 - \frac{w(s_k)}{w(s_1)}\right) = \left\langle 1 - \frac{w}{w(s_1)} \right\rangle_{q_θ} .    (33)

Because \langle w \rangle_{q_θ} = 1,

D_{λ_1}(q_θ|p) = 1 - \frac{1}{w(s_1)} .    (34)


We cannot, however, use this form without knowing the overall normalization factor, i.e. the partition function Z. In Eq. (33) p enters only through the ratio w, so the partition function Z is not needed to calculate the eigenvalue. As already stated, λ_1 ≥ 0 and it vanishes only when q_θ ≡ p.

Although this loss function has the same minimum as the KL divergence, the contributions coming from particular configurations are different, and hence during minimization q_θ will be brought closer to p for a different subset of configurations first.

Using Eq. (30) one can obtain the gradient of D_{λ_1},

\frac{dD_{λ_1}(q_θ|p)}{dθ} = \left\langle \frac{d \log q_θ}{dθ} \right\rangle_{q_θ} - \left\langle \frac{w}{w(s_1)} \frac{d \log q_θ(s_1)}{dθ} \right\rangle_{q_θ} .    (35)

Alternatively, one can follow the REINFORCE algorithm and use the quantity

\left\langle \frac{d \log q_θ}{dθ} \left(1 - \frac{w}{w(s_1)}\right) \right\rangle_{q_θ}    (36)

in the backward propagation, instead of the derivative Eq. (35). Note that, in contrast to the case of D_KL, these two approaches are different. Numerical studies showed that the approach given by Eq. (36) results in a much more efficient training of the neural network. A detailed comparison of the two approaches and their efficiencies is beyond the scope of this paper. In what follows we show results only for the REINFORCE approach.

4.3. Effective Sample Size (ESS)

Following Refs. [14, 20] we define yet another possible loss function, which for the sake of completeness we compare with the two previously defined. The effective sample size is

ESS = \frac{\langle w \rangle_{q_θ}^2}{\langle w^2 \rangle_{q_θ}} .    (37)

The quantity is normalized such that ESS ∈ [0, 1], and for perfectly matched distributions one has ESS = 1. We therefore define the loss function as 1 − ESS:

D_{ESS}(q_θ|p) = 1 - \frac{\langle w \rangle_{q_θ}^2}{\langle w^2 \rangle_{q_θ}} = \frac{\langle w^2 \rangle_{q_θ} - \langle w \rangle_{q_θ}^2}{\langle w^2 \rangle_{q_θ}} = \frac{\mathrm{var}_{q_θ}\, w}{\langle w^2 \rangle_{q_θ}} .    (38)
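The ESS of Eq. (37) is routinely computed from the log importance ratios of a batch; a minimal sketch (illustrative, not the authors' code), again using the shift by the maximum so that any unknown log Z cancels:

```python
import numpy as np

def ess(log_w):
    """Normalised effective sample size, Eq. (37), from log importance
    ratios; ESS is invariant under a constant shift of log_w, so an
    unnormalised target probability suffices."""
    w = np.exp(log_w - np.max(log_w))
    return np.mean(w) ** 2 / np.mean(w ** 2)
```

A batch with constant importance ratios gives ESS = 1, and any spread in w pushes it strictly below 1.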

Again, the partition function Z is not needed to calculate the ESS. This loss function cannot be expressed as a weighted average, so the REINFORCE algorithm is not available. Differentiating Eq. (38) we obtain:

\frac{dD_{ESS}(q_θ|p)}{dθ} = -\frac{\langle w \rangle_{q_θ}^2}{\langle w^2 \rangle_{q_θ}^2} \left\langle \frac{d \log q_θ}{dθ}\, w^2 \right\rangle_{q_θ} .    (39)

4.4. Batch versions of the loss functions

For all but very small systems, the sum over all M configurations is prohibitively large to perform exactly. In the training process one has at one's disposal a set of configurations s drawn from the current probability distribution qθ(s). Hence, all three loss functions defined above can be expressed as batch averages. To this goal we make the following replacement in Eqs. (28), (33) and (38):
\[
\langle \ldots \rangle_{q_\theta} = \sum_{i=1}^{M} q_\theta(s_i)(\ldots) \;\to\; \frac{1}{N_{\mathrm{batch}}} \sum_{i=1}^{N_{\mathrm{batch}}} (\ldots). \tag{40}
\]

For the KL divergence we thus have,
\[
D_{\mathrm{KL}}(q_\theta|p) \approx \widehat{D}_{\mathrm{KL}}(q_\theta|p) \equiv -\frac{1}{N_{\mathrm{batch}}} \sum_{k=1}^{N_{\mathrm{batch}}} \log w(s_k)\,\Big|_{s_k \sim q_\theta} = -\langle \log w \rangle_{\mathrm{batch}}, \tag{41}
\]


where we use the hat symbol to denote a quantity calculated from the batch of configurations. In the case of Dλ1(qθ|p) we also need to modify the value of w(s̄1), since in practice its value may be unknown. We thus set
\[
D_{\lambda_1}(q_\theta|p) \approx \widehat{D}_{\lambda_1}(q_\theta|p) \equiv \frac{1}{N_{\mathrm{batch}}} \sum_{k=1}^{N_{\mathrm{batch}}} \left(1 - \frac{w(s_k)}{w_{\max}}\right)\Big|_{s_k \sim q_\theta} = \left\langle 1 - \frac{w}{w_{\max}} \right\rangle_{\mathrm{batch}}, \tag{42}
\]
where w_max is the largest importance ratio in the batch. It should be clear that w(s̄1) ≥ w_max. This, however, does not necessarily entail that the batch estimate of Dλ1(qθ|p) is smaller than its exact value, as the relation Eq. (34) need not hold for a finite sample.
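The batch estimators of Eqs. (41) and (42) reduce to a few lines once log qθ and the unnormalized log p are available for the sampled configurations. A sketch under those assumptions (function and variable names are ours):

```python
import numpy as np

def batch_losses(log_q, log_p):
    """Batch estimates of Eqs. (41)-(42) from configurations s_k ~ q_theta.

    log_q: log q_theta(s_k); log_p: unnormalized log p(s_k), e.g. -beta*H(s_k).
    The unknown log Z shifts the D_KL estimate by a constant only, and
    cancels entirely in the D_lambda1 estimate.
    """
    log_w = np.asarray(log_p, float) - np.asarray(log_q, float)
    d_kl_hat = -log_w.mean()                      # Eq. (41), up to log Z
    w_over_wmax = np.exp(log_w - log_w.max())     # w(s_k) / w_max on batch
    d_lambda1_hat = (1.0 - w_over_wmax).mean()    # Eq. (42)
    return d_kl_hat, d_lambda1_hat
```

Both estimates vanish when log qθ matches the unnormalized log p on the batch, in line with the common minimum of the two loss functions.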

5. Small lattice

In Section 2 we recalled analytical results concerning NMCMC algorithms, in particular explicit expressions for the eigenvalues of the transition matrix. The evaluation of these formulae is only possible when the system size is small and one can visit all configurations s ∈ X. The aim of this section is to study a sufficiently small system, we choose a lattice of size 4 × 4, and to investigate the differences between the exact analytic results and their approximated batch versions. For systems of physically reasonable sizes only the latter are available, and therefore it is important to understand their properties.

5.1. Neural network architecture

In this work we employ the simplest implementation of an autoregressive neural network with dense layers. We use one hidden layer, and the width of all layers is equal to the number of spins on the lattice. As activation functions we use PReLU between layers and a sigmoid for the output layer. The Adam optimizer is used. β-annealing is used for training, except for Figs. 1, 2, 3. We demonstrate in the remainder of this section that such an architecture is sufficient for our purposes. We checked that deeper neural networks do not provide improvements in the final results; in particular, no significant change can be seen in the autocorrelation times.

5.2. Dynamics of the neural network during training

The small size of the 4 × 4 Ising model allows us to monitor the dynamics of the neural network during the training process. All loss functions have a common, unique minimum where the probabilities provided by the neural network, log qθ, are equal to the Boltzmann probabilities, log p. By plotting the former against the latter at several values of the training time we can monitor how the differences between the two evolve during the training. In the presented case we used the KL divergence loss function, Eq. (28). Other loss functions show a similar qualitative behaviour, with differences at the intermediate stages of the training associated to the different weighting of contributions from particular configurations. At the initial moment the weights of the neural network are random and all probabilities are approximately equal. They form a horizontally elongated cloud along the yellow line shown in Fig. 2. The yellow line corresponds to a uniform probability distribution, qθ(s) = 2^{-16}. When log qθ = log p the data points form a straight line, as can be seen in the last stages of the training shown in the rightmost plots of Fig. 2. The two plots in between are the most interesting ones, as they capture the way the algorithm improves on qθ. As expected, first the configurations which have the largest probabilities log p (upper right corner of the figures) are tuned to the desired values and position themselves on the straight line. Once they have the correct values, the training improves the approximation of less probable configurations. Please note that at the same time the approximated probabilities for configurations which are very rare are also improved, although it is very unlikely that they have appeared in any batch. This is due to the generalisation properties of the network.

The same information, plotted in a somewhat different manner, is provided in Fig. 3, where we show snapshots of the histograms of the log of importance ratios, log w. In the ideal case, the neural network would provide log qθ equal to log p and all importance ratios would be equal to 1. Indeed, as the training proceeds the histogram becomes more and more peaked at log 1 = 0. It is also evident that it is much easier for the neural network to learn the probability distribution p at β = 0.3 than at β = 0.6.


Figure 2: Dynamics of the neural network training using the KL divergence loss function Eq. (28). For two inverse temperatures, β = 0.3 (upper row) and β = 0.6 (lower row), we provide snapshots of log qθ vs. log p after 1, 300, 4000 and 30000 epochs. It is evident from the plots that the training is more efficient for smaller values of β: dots representing states position themselves on the log qθ = log p line (green) much faster. The yellow horizontal line shows the uniform probability distribution qθ(s) = 2^{-16}.


Figure 3: Dynamics of the neural network training using the KL divergence loss function Eq. (28). For two inverse temperatures, β = 0.3 (upper row) and β = 0.6 (lower row), we provide snapshots of the histograms of log w after 1, 300, 4000 and 30000 epochs.


Figure 4: Free energy F for the system of size 4 × 4 at β = 0.6 (left) and β = 0.3 (right). We show the value estimated in the batch, F̂q (red dots), and the value Fq calculated using all states (black curve). They are compared with the analytical result for F (green horizontal line). It has to be emphasized that the red data obtained from the batch averages follow closely the full estimate of the free energy plotted with the black line.

5.3. Variational free energy

Now we turn our attention to the variational estimation of the free energy given by Eq. (27). This is a particular observable for which we search for a variational approximation from above, using the setup described in the previous section. Note that the results presented below are not outcomes of a Monte Carlo simulation. The discussion below serves two purposes. On the one hand, we demonstrate how accurately the correct probability distribution can be reproduced from the point of view of a physically relevant observable. On the other hand, this benchmark observable can be used to select the simplest neural network architecture which is sufficient for our purposes. In Fig. 4 we show the free energy evaluated in three different ways as a function of the training epoch at β = 0.6. We used a small neural network architecture: 2 dense layers with autoregressive connections, each containing 16 (i.e. the number of spins) neurons. We use the Z2 symmetry of the system (see also Section 7.1). The different data sets in Fig. 4 correspond to: Eq. (27), which is the full Fq (black curve); the batch estimate F̂q, which is Eq. (27) with the replacement Eq. (40) (red dots); and the value of F calculated analytically (green horizontal line) – see Ref. [13] or [8] for formulas. The convergence of Fq and F̂q to the exact value F confirms that such a very small neural network architecture is enough for this system size. Additionally, we remark that the two ways of calculating the free energy, Fq and F̂q, agree very well. Note also that at early epochs Fq is smaller than the true F because β-annealing was used. A similar test performed at other values of β confirmed that the neural network architecture is sufficient. When considering larger system sizes we will keep the number of layers equal to 2 and only adjust their width to match the increasing number of spins.
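The batch estimate F̂q can be sketched as follows, assuming the standard variational form Fq = β⁻¹⟨log qθ(s) + βH(s)⟩qθ for Eq. (27); the per-spin normalization and all names are our conventions, not taken from the paper's code:

```python
import numpy as np

def free_energy_batch(log_q, energies, beta, n_spins):
    """Batch estimate of the variational free energy per spin,
    F_q = (1/beta) < log q_theta(s) + beta * H(s) >_{q_theta},
    with the average over all states replaced by the batch average, Eq. (40)."""
    log_q = np.asarray(log_q, float)
    energies = np.asarray(energies, float)
    return np.mean(log_q + beta * energies) / (beta * n_spins)
```

By Gibbs' inequality this quantity is bounded from below by the exact free energy, which is why the approximation approaches F from above as training proceeds.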

5.4. Autocorrelation times

The next quantity which we can discuss at this small system size are the different integrated autocorrelation times defined in Section 3 by Eqs. (17), (19), (21) and (24). We start by showing how the autocorrelation time changes during the training process. The left panel of Fig. 5 shows the evolution at β = 0.6 as a function of training epochs. The most striking is the black curve corresponding to the theoretical integrated autocorrelation time τ^int_L, Eq. (17). It provides the reference scale for the worst possible behaviour. It is interesting that after a period of very pronounced values, larger than 10^4, it decreases and stabilizes around 10 when the training is finished. The transition time coincides with the time at which the variational free energy converges towards the exact value, see Fig. 4, i.e. around 8000 epochs. Similar behaviour can be seen for the other estimators; they all show a correlated drop in this period of training.
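For reference, the integrated autocorrelation time of an observable (e.g. the energy) can be estimated directly from its Monte Carlo time series. The sketch below uses the common normalization τ_int = 1/2 + Σ_t ρ(t) and a simple truncation at the first non-positive autocorrelation coefficient, a crude stand-in for the windowing procedures referred to in Section 3:

```python
import numpy as np

def tau_int(series):
    """Integrated autocorrelation time, tau = 1/2 + sum_t rho(t),
    truncated at the first non-positive autocorrelation coefficient."""
    x = np.asarray(series, float)
    x = x - x.mean()
    n = len(x)
    c0 = np.dot(x, x) / n            # variance of the series
    tau = 0.5
    for t in range(1, n):
        rho = np.dot(x[:-t], x[t:]) / ((n - t) * c0)
        if rho <= 0.0:
            break
        tau += rho
    return tau
```

For uncorrelated samples this estimator fluctuates around 1/2, while for a chain in which each state is repeated on average k times it grows to roughly k/2.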

The large discrepancy between τ^int_L and the other three curves has two sources. The large value of τ^int_L is driven by some underestimated configuration which has a very small probability to appear in the batch;


Figure 5: Four versions of the (integrated) autocorrelation time (defined in Section 3) for the system of size 4×4. Left: autocorrelation times during the training process for β = 0.6. Right: autocorrelation times as a function of β, measured at epoch 15000, when the networks were fully trained. The system size 4×4 allows us to compare the theoretical integrated autocorrelation time with the other definitions. The behaviour of the autocorrelation times for the large system of size 16×16 is shown in Fig. 8, in which case the theoretical autocorrelation time τ^int_L is not available.

therefore, it will not contribute to τ^int_B. The configurations which contribute to τ^int_B are those which are generated in the batch, and these are the same configurations which are used to improve qθ. Hence, τ^int_B is estimated from configurations whose importance ratios are closer to one, leading to a smaller autocorrelation time. The other source is related to the fact that τ^{H,int}_F and τ^{H,int} are calculated for a specific observable, in our case the energy, and hence may couple differently to different modes of the Markov chain. The discrepancy with τ^int_L signals that the contribution of the slowest mode is small in the case of such physical observables as the energy or magnetization. As far as the statistical uncertainties of such physical observables are concerned, it is τ^{H,int} and τ^{H,int}_F which matter. It is worth noticing that τ^int_B is very close to τ^{H,int}_F and τ^{H,int}; hence it provides a reasonable estimator of the autocorrelation time in the Markov chain without making reference to any specific observable. The increased fluctuations of τ^int_B are present because at each epoch new states in the batch are drawn and different importance ratios appear.

In the right panel of Fig. 5 we show the dependence of the various definitions of the autocorrelation time, estimated after 15000 epochs of training, as a function of β. From the traditional Metropolis algorithm we would expect a sharp maximum around the ordered/disordered phase cross-over inverse temperature, around β ≈ 0.4; however, no rise of the autocorrelation time is observed in this region. What we see instead is a strong rise of τ^int_L, which signifies that there are configurations for which qθ provides an increasingly badly underestimated approximation to the target probability. Such configurations have a very small probability of being generated, and so they did not appear in any batch; hence they do not contribute to any other autocorrelation time: τ^{H,int}, τ^{H,int}_F and also τ^int_B are always of the order of 1.0.
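The mechanism by which rarely proposed, underestimated configurations affect the chain is easy to see from the accept/reject step of the Metropolized Independent Sampler, where a proposal s' replaces the current state s with probability min(1, w(s')/w(s)). A sketch of that step (all names are ours):

```python
import numpy as np

def mis_chain(log_w, rng):
    """Metropolized Independent Sampler on a stream of independent
    proposals with log importance ratios log_w[i] = log w(s_i).
    Returns, for each step, the index of the proposal the chain sits on."""
    chain = [0]
    cur = log_w[0]
    for i in range(1, len(log_w)):
        # accept with probability min(1, w(s') / w(s))
        if np.log(rng.random()) < log_w[i] - cur:
            cur = log_w[i]
            chain.append(i)
        else:
            chain.append(chain[-1])   # rejection: the old state repeats
    return chain

# With a perfectly trained network all ratios are equal and every
# proposal is accepted, so the chain has no repetitions at all
rng = np.random.default_rng(1)
assert mis_chain(np.zeros(5), rng) == list(range(5))
```

Once a state with an exceptionally large w is accepted, subsequent proposals are rejected for a number of steps growing with that ratio, which is the source of the large τ^int_L values discussed above.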

6. Medium lattice: loss functions and critical slowing down

We now move on to describe results from larger lattices. We study the Ising model on a lattice of extent L = 16. For this system size the eigenvalues given by the analytic formula derived by Liu, Eq. (10), cannot be evaluated directly. Instead, one has to rely on quantities which can be estimated as batch averages. We showed how the two, the exact one and the batch-approximated one, correlate in a smaller system in the previous section.

We consider different loss functions and study the behaviour of several observables as a function of training epochs to visualise some aspects of the neural network training process. Again we keep the simplest neural network architecture, with two dense layers of width equal to the number of spins on the lattice: 256.


We checked that deeper architectures do not provide an improvement of the results, but they increase thenecessary number of epochs in the training phase.

6.1. Comparison of loss functions

We mentioned in Section 4 that the different loss functions are sensitive to different regions of the support of the qθ probability distribution. In order to quantify the effect of these differences on the observables accessible from the level of a simulation, we now take a closer look at six quantities: the free energy F̂q, the ESS, the acceptance ratio in NMCMC, and three definitions of the integrated autocorrelation time. We show all of them as a function of training epochs in Figs. 6 and 7, each of them calculated with the three loss functions introduced earlier. In order to have a better perspective we compare the situation at β = 0.6 (Fig. 6) and β = 0.3 (Fig. 7). All data points have been averaged over 16 simulations with different, random initial weights of the neural network. We also stress that at β = 0.3 we impose the translational symmetry (to be described and discussed in Section 7.1).

In the first panel of Fig. 6 we present the variational batch estimate of the free energy. The green line denotes the exact result, and one clearly sees that the Kullback–Leibler divergence loss function is the most effective, i.e. it provides the best approximation at a given number of training epochs. The DESS loss function fails to converge to the expected value for this larger system. The authors of Ref. [20] also note, in the context of normalizing flows, that this function is unsuitable for training. The same ordering of the discussed loss functions is reproduced in the other observables. In other words, the better approximated the free energy, the larger the ESS coefficient, the larger the acceptance ratio and the smaller the integrated autocorrelation time τ^{H,int}. Also the other estimates of the autocorrelation time, τ^{H,int}_F and τ^int_B, exhibit the same ordering. Analogous conclusions can be drawn from Fig. 7, where we present simulations performed at β = 0.3. Note that the DESS loss function is unable to show any improvement over the training period.

We note that in all panels of Fig. 6 the observables behave monotonically with the training time. Some of them show a scatter related to the variations between consecutive batches; however, this does not affect the general trend. In particular, we do not see any period of stagnation of the training process.

When comparing the two most effective loss functions, DKL and Dλ1, we observe that the failure to reproduce the correct value of the free energy by Dλ1 by 0.5‰ corresponds to an increase of the integrated autocorrelation time τ^{H,int} by a factor of 2.

6.2. Critical slowing down

The central issue of MCMC simulations is the critical slowing down: the increase of the autocorrelation time when the system is simulated closer to the phase transition. In the traditional approach to simulations of the Ising model, with algorithms built upon local updates, this can be very pronounced, as is shown in the left panel of Fig. 8. For the sake of comparison, we show there the integrated autocorrelation time τ^{H,int} measured for the Markov chain obtained using three algorithms: the NMCMC, the original Metropolis algorithm [1] and Wolff's cluster algorithm [11, 12]. The latter is a well-known example of a cluster algorithm where the critical slowing down has been completely eliminated. In order to compare the efficiencies of the Wolff and Metropolis algorithms we measured the energy after one sweep (a sweep consists of L² proposals, so that each spin of the lattice is updated). In the case of the cluster algorithm one sweep was defined as L² updates of clusters initiated at each spin of the lattice.

The autocorrelation time τ^{H,int} in the NMCMC algorithm (magenta curve) clearly exhibits an increase around β ≈ 0.44, with a maximum value close to 2. Still, such an increase is much less pronounced than that of the Metropolis algorithm (red curve). Although from the presented comparison the cluster algorithm (black curve) is clearly preferred, one should remember that there exist relevant physical systems, for instance lattice field theories with fermionic degrees of freedom, for which a cluster formulation is unknown, whereas there are in principle no limitations to implementing the NMCMC approach for such systems.

In the right panel of Fig. 8 we show, for NMCMC, the dependence of various estimates of the integrated autocorrelation time on β. A similar result was discussed in the context of the small system size in Fig. 5, where no significant dependence on β was seen for these autocorrelation times. Here we clearly observe a maximum around β ≈ 0.44 for all three quantities.


Figure 6: Learning process for the 16 × 16 system at β = 0.6 using three loss functions. Different observables are shown as a function of training epochs. The curves shown in the plots are averages over 16 learning processes, in order to exclude effects of the random network initialization.


Figure 7: The same as Fig. 6 but for β = 0.3.


Figure 8: Left: comparison of the autocorrelation time τ^{H,int} between different algorithms for the system of size 16×16 as a function of β. Right: different autocorrelation times in NMCMC, measured at epoch 30000, when the network was fully trained.

7. Reduction of autocorrelation time

In the NMCMC approach the autocorrelation times can be reduced by improving the training, hence reducing the difference between qθ and p. Here we discuss two ideas which help achieve this goal: i) making use of the symmetries of the model, and ii) reducing the effective number of spins using the nearest-neighbour structure of the interaction. We test both approaches on the larger system size, i.e. 16 × 16 spin lattices.

7.1. Impact of symmetry on training efficiency and autocorrelation times

In the NMCMC algorithm discussed so far we did not exploit any of the symmetries of the simulated system. We define symmetries as global transformations of configurations which do not change their energies. It immediately follows that configurations related by a symmetry have equal target probabilities. To be more precise, consider a finite group consisting of I elements {S0, S1, ..., S_{I−1}}, where S0 ≡ 1 is the identity transformation, S0 s = s. The transformations S leave the energy unchanged, i.e. H(s) = H(S1 s) = ... = H(S_{I−1} s). The symmetries are not present in the conditional probabilities parametrized by the neural network, and hence there is no obvious way of imposing them on the output of the neural network. The approach suggested in Ref. [7] was to force the network to treat equivalent configurations as equally probable by symmetrizing their approximated probabilities within the batch. For that purpose, for each configuration s proposed by the neural network we replace its probability qθ(s) by the following average:
\[
q_\theta(s) \to \frac{1}{I} \sum_{i=0}^{I-1} q_\theta(\mathcal{S}_i s), \tag{43}
\]

where qθ(S_i s) are obtained by applying the network to the transformed spin configuration. This induces an additional numerical cost, because for each configuration from the batch we need to call the neural network I − 1 more times to estimate the probabilities of the transformed configurations. However, note that already in order to generate the original configuration s we have to evaluate L² conditional probabilities, meaning that we have called the neural network L² times. Hence, if I ≪ L² the additional cost of imposing the symmetry is subleading, whereas the benefits may be significant, as we discuss below.

In the case of the Ising model we consider two different types of symmetry:

• the Z2 symmetry, already discussed in Ref. [7]. It amounts to the statement that flipping all the spins of a configuration produces a configuration with the same energy. In this case I = 2 and
\[
\mathcal{S}^{Z_2}_1 s_{x,y} = -s_{x,y}. \tag{44}
\]
Eq. (43) becomes qθ(s) → [qθ(s) + qθ(−s)]/2,


• periodic boundary conditions in both directions of the plane allow for the existence of translational symmetries, which we denote Tx and Ty, each composed of I = L elements, defined as
\[
\mathcal{S}^{T_x}_n s_{x,y} = s_{x-n,y}, \qquad \mathcal{S}^{T_y}_n s_{x,y} = s_{x,y-n}, \qquad 0 \leq n < L. \tag{45}
\]
In the following we only implement the Ty symmetry. Eq. (43) in this situation becomes
\[
q_\theta(s) \to \left[ q_\theta(s) + q_\theta(\mathcal{S}^{T_y}_1 s) + \cdots + q_\theta(\mathcal{S}^{T_y}_{L-1} s) \right]/L. \tag{46}
\]
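The averaging of Eqs. (43), (44) and (46) can be sketched for a stand-in log qθ evaluated on L × L configurations; the log-space averaging and all names below are our own choices:

```python
import numpy as np

def symmetrized_log_q(log_q_fn, s):
    """Average q_theta over the Z2 flip (Eq. 44) and the T_y cyclic
    shifts (Eq. 45), following Eq. (43) with I = 2L variants in total.

    log_q_fn: callable returning log q_theta for one L x L array of
    +/-1 spins (a stand-in for the autoregressive network)."""
    L = s.shape[1]
    log_qs = []
    for n in range(L):                    # T_y translations
        t = np.roll(s, n, axis=1)
        log_qs.append(log_q_fn(t))
        log_qs.append(log_q_fn(-t))       # Z2 flip of each translate
    log_qs = np.array(log_qs)
    m = log_qs.max()                      # stable log-mean-exp
    return m + np.log(np.mean(np.exp(log_qs - m)))

# A q_theta that is already symmetric (here: uniform) is unchanged
uniform = lambda conf: -conf.size * np.log(2.0)
s = np.where(np.random.default_rng(2).random((4, 4)) < 0.5, 1, -1)
assert np.isclose(symmetrized_log_q(uniform, s), -16 * np.log(2.0))
```

Combining both symmetries simultaneously, as done in the text, simply enlarges the set of variants averaged over.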

The symmetries Z2 and Ty are independent and can be imposed separately or simultaneously. We show their effects in Figs. 9 and 10, respectively for β = 0.6 and β = 0.3, by discussing four observables: the free energy F̂q, the acceptance ratio in NMCMC, the ESS and τ^{H,int}. Each figure contains five sets of data: results from the NMCMC simulation without the imposition of any of the system symmetries, the NMCMC simulation with the Z2 symmetry alone, the NMCMC simulation with the Ty symmetry alone, the NMCMC simulation with both Z2 and Ty symmetries imposed simultaneously, and finally a data set obtained from a NMCMC simulation where heat-bath updates were incorporated on a chessboard lattice. The details of the latter are described in the next subsection. The figures showing the free energy additionally contain a green line at the exact value. All plots show the evolution of the four observables as a function of the training period, so we expect convergence of the different data sets at the right edges of the figures.

Fig. 9 shows the results obtained at β = 0.6, when the system is expected to be in the ordered phase. From the first plot one readily concludes that the imposition of the Z2 symmetry is crucial. Data sets corresponding to simulations without this symmetry converge very slowly, or do not converge at all, towards the exact value of the free energy. Only when we include the Z2 symmetry is the algorithm able to reproduce the expected value, which is clearly visible on the zoomed plot. Other observables exhibit a similar behaviour: the acceptance rate is of the order of 40% without the imposition of the Z2 symmetry, but once it is included it jumps above 90%.

The situation is different at β = 0.3, which we show in Fig. 10. At this temperature the system is expected to be in the disordered phase, and the roles of the Z2 and Ty symmetries are reversed. In this situation the imposition of the Z2 symmetry does not produce any visible improvement: for all observables the data sets obtained from the simulation without any symmetry and with the Z2 symmetry only are barely distinguishable. It is now the translational symmetry which improves the performance of the algorithm. Let us also note that the training is much more efficient than it was at β = 0.6, as all four observables reach satisfactory values already after 10000 epochs.

Comparing the results shown in Figs. 9 and 10, one concludes that the imposition of system symmetries is pivotal in order to make the training process more efficient. Additionally, we observe that the different symmetries play different roles in the various regions of the phase space of the system, and hence one should implement as many of them as possible. Let us highlight again that in the case of two-dimensional systems, as the system size grows, the number of invocations of the neural network scales like L², whereas the cost of the symmetries depends only on the number of symmetry transformations that are used to average the approximated probabilities.

7.2. Chessboard factorization

Figs. 9 and 10 contain another data set which was not commented on in the previous section and which we explain below. The chessboard factorization which was used to generate this data can be applied to models with only nearest-neighbour interactions. One divides all spins into two species, odd and even, according to the parity of the sum of their coordinates on the lattice; this results in a "chessboard" pattern. From the locality of the interaction it follows that the conditional probability of a spin at an even site depends exclusively on the state of the four neighbouring spins at odd sites. In other words, if all spins at odd sites have been fixed, the spins at even sites can be generated from the exact, local Boltzmann probability distribution.

In the present context the above formulation means that one can reduce the number of spins which have to be set by the neural network by half. Not only does this significantly reduce the number of invocations of the



Figure 9: Four quantities (free energy, accept ratio in NMCMC, ESS and the integrated autocorrelation time τ^{H,int}) as functions of training epochs for the 16 × 16 system at β = 0.6. For each quantity we compare the quality of the learning process for different imposed symmetries. The free energy and accept ratio are also shown zoomed, in order to distinguish between the most efficient approaches.


Figure 10: The same as Fig. 9 but for β = 0.3.


neural network, but also the size of the neural network is reduced; more importantly, half of the spins can be generated independently from each other from the target probability distribution, further decorrelating the system.

The formula Eq. (4) is now modified:
\[
q_\theta(s) = \prod_{i=1}^{N/2} q_O(s_i | s_{i-1} \ldots s_1) \prod_{i=N/2+1}^{N} q_E(s_i, n(s_i)), \tag{47}
\]

where qO(si|si−1 ... s1) is the probability of the i-th odd spin to be si when the other odd spins si−1 ... s1 are fixed, and qE(si, n(si)) is the probability of the i-th even spin to be si when its neighbours n(si) are fixed (note that even spins have only odd neighbours). The procedure to obtain the full spin state is the following: i) run the network forward to get the odd spins, ii) using the values of the fixed odd spins, calculate the probabilities qE of all even spins, iii) draw the even spins using the calculated qE. The probability of the i-th even spin to be +1 or −1 is equal to
\[
q_E(s_i = \pm 1, n(s_i)) = \left[ 1 + \exp\left( \mp 2\beta \sum_{j \in n(s_i)} s_j \right) \right]^{-1}, \tag{48}
\]

where the sum is performed over the nearest neighbours n(si) of the spin si. As the neural network is used to generate only half of the spins in the full state, we expect the training to be faster and more effective when the chessboard factorization is used. In Figs. 9 and 10 we show (light blue diamonds) the observables during the training when this factorization is applied; no symmetries are involved here (so it should be compared with the magenta triangles). We observe a dramatic difference in the performance of the network: the chessboard factorization significantly improves all measures of the training quality, to a degree comparable to that of the symmetry imposition.

8. Summary and outlook

In this work we have investigated, from many perspectives, the Neural Markov Chain Monte Carlo algorithm for simulating statistical systems with a finite number of discrete degrees of freedom. We quantified its performance by measuring various estimators of the integrated autocorrelation time and discussed their behaviour as a function of the inverse temperature β and the neural network training time. We have verified that the NMCMC algorithm is only mildly affected by critical slowing down close to the phase transition, i.e. the integrated autocorrelation time does not increase considerably in this region. Of course, this is only true if the neural network was trained well enough so that the encoded probability distribution is sufficiently close to the target one. We suggested some measures of the quality of the training based on importance ratios. The latter are the building blocks of the analytic expressions for the eigenvalues of the Markov chain transition matrix, and hence describe the autocorrelations in the Metropolized Independent Sampler class of algorithms. Comparing with the local Metropolis algorithm, we stress that the critical slowing down of the Metropolis algorithm can be explained by the growing size of spin clusters in the parameter region around the phase transition, and hence has a physical interpretation. In the case of the NMCMC algorithm the non-zero autocorrelation times are due to an imperfect approximation of the Boltzmann distribution by qθ. Therefore the phenomenon of critical slowing down, if any, stems from the dynamics of the neural network training. Furthermore, the configurations with very small p are the ones for which the network has the biggest difficulty in getting qθ right. This can lead to very large relative errors and hence very large values of τL. As those configurations have qθ(s) ≪ p(s), they will be proposed very rarely, if ever. However, if they do appear, they will be repeated approximately w(s) times. At this moment it is unclear to us if this may have an impact on the results of simulations, as we have not encountered such a situation in practice.

As far as the loss functions are concerned, we have constructed a loss function directly proportional to the largest eigenvalue of the transition matrix of the Markov chain. Although its performance is inferior to the previously studied DKL loss function, it provides an interesting example of a loss function which behaves

21

Page 22: Analysis of autocorrelation times in Neural Markov Chain

differently when the neural network training is implemented with the REINFORCE algorithm or with thegradient learning. This is in contrast to DKL loss function that is the same in both approaches. Hence, infurther studies Dλ1

may be used to investigate different approaches to neural network training. As a thirdalternative, we checked the DESS loss function discussed in Ref. [20], and confirmed that its efficiency isvery poor in the studied context.
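As a reminder of the quantity being measured throughout, the integrated autocorrelation time of an observable is τ_int = 1/2 + Σ_{t≥1} ρ(t), where ρ(t) is the normalized autocorrelation function of its Monte Carlo time series; in practice the sum must be truncated. A schematic pure-Python estimator with a simple fixed window (the estimators compared in this work differ precisely in how the truncation is chosen, e.g. the adaptive windowing of Ref. [18]):

```python
def autocorr(x, t):
    """Normalized autocorrelation function rho(t) of a scalar time series x."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[i] - mean) * (x[i + t] - mean) for i in range(n - t)) / (n - t)
    return cov / var

def tau_int(x, window):
    """Integrated autocorrelation time tau = 1/2 + sum_{t=1}^{window} rho(t),
    with a fixed summation window supplied by the caller."""
    return 0.5 + sum(autocorr(x, t) for t in range(1, window + 1))
```

For an uncorrelated chain, ρ(t) fluctuates around zero and τ_int ≈ 1/2; strong correlations between successive configurations push τ_int up, which is what the estimators discussed above quantify.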

We have investigated the impact of symmetries on the training process. It was already noted in Ref. [7] that the Z2 symmetry is crucial in order to correctly reproduce the free energy of the system. In our detailed study we observed that this is true only in a specific range of inverse temperatures β. For instance, for small β the Z2 symmetry does not improve the quality of training. Instead, another symmetry turns out to be more important, namely the translational symmetry, which in turn was not as efficient at large values of β as the Z2 symmetry. Hence, we conclude that in order to efficiently train the neural network in the full range of β it may be advisable to impose both, or all accessible, symmetries simultaneously.
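To illustrate what imposing a symmetry means at the level of the encoded distribution, a learned probability q can be averaged over the group generated by the Z2 flip and the lattice translations; the group average is exactly invariant under both. A toy sketch (the function and its flat spin-tuple convention are ours for illustration; in the actual training the symmetrization is applied to samples and probabilities during learning, not as a post-hoc average):

```python
def symmetrize_q(q, spins, L):
    """Group-average a probability function q over Z2 spin flips and all
    L*L lattice translations of an L x L configuration.
    `spins` is a flat tuple of +/-1 values, index i = row * L + col.
    Averaging over a group leaves normalization intact and makes the
    result exactly invariant under every group element."""
    total, count = 0.0, 0
    for flip in (1, -1):
        for dx in range(L):
            for dy in range(L):
                shifted = tuple(
                    flip * spins[((i // L + dx) % L) * L + ((i % L) + dy) % L]
                    for i in range(L * L)
                )
                total += q(shifted)
                count += 1
    return total / count
```

Because Z2 flips commute with translations, the two averages can be combined freely; the cost grows with the group size, which is why sample-level symmetrization during training is the practical choice.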

Our third observation concerns the chessboard factorization, which turns out to considerably reduce both the width of the neural network and the training time. We find that its effects are as important as the effects of the symmetries and therefore it may be advantageous to incorporate it in the algorithm in the full range of β.

In summary, we note that the improvements proposed and discussed in this work also apply to the variational estimation of the free energy, as proposed in Ref. [7].
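For reference, the variational estimator of Ref. [7] is F_q = β⁻¹ E_{s∼q}[βE(s) + ln q(s)], an upper bound on the true free energy that is saturated when q equals the Boltzmann distribution, so each improvement of q discussed above tightens the bound. A toy check on a two-level system (all names here are illustrative):

```python
import math

def variational_free_energy(beta, states, q, energy):
    """F_q = (1/beta) * sum_s q(s) * (beta*E(s) + ln q(s)).
    Equals F = -ln(Z)/beta exactly when q is the Boltzmann distribution,
    and is strictly larger for any other normalized q."""
    return sum(q[s] * (energy[s] + math.log(q[s]) / beta) for s in states)

# Two-level toy system with energies 0 and 1 at beta = 1.
beta = 1.0
states = [0, 1]
E = {0: 0.0, 1: 1.0}
Z = sum(math.exp(-beta * E[s]) for s in states)
boltzmann = {s: math.exp(-beta * E[s]) / Z for s in states}
F_exact = -math.log(Z) / beta
```

With the exact Boltzmann weights the estimator reproduces F_exact; any suboptimal q, such as the uniform distribution, gives a strictly larger value, which is the sense in which better training directly improves the free-energy estimate.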

We leave as an open question, and as a next step in the investigation of NMCMC algorithms, how many of the above improvements can be applied in the context of systems with continuous degrees of freedom.

Acknowledgement

Computer time allocation 'plgnnformontecarlo' on the Prometheus supercomputer hosted by AGH Cyfronet in Krakow, Poland was used through the Polish PLGRID consortium. T.S. kindly acknowledges support of the Polish National Science Center (NCN) Grant No. 2019/32/C/ST2/00202 and support of the Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University Grant No. 2021-N17/MNS/000062. This research was funded by the Priority Research Area Digiworld under the program Excellence Initiative – Research University at the Jagiellonian University in Krakow.

Author contributions

All Authors have equally contributed to all stages of this work and of the manuscript preparation.

Declaration of competing interest

The Authors declare no competing interests.

References

[1] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of state calculations by fast computing machines," The Journal of Chemical Physics, vol. 21, no. 6, pp. 1087–1092, 1953.

[2] K. Binder and D. Heermann, Monte Carlo Simulation in Statistical Physics: An Introduction. Springer, 2019.

[3] A. Laio and F. L. Gervasio, "Metadynamics: a method to simulate rare events and reconstruct the free energy in biophysics, chemistry and material science," Reports on Progress in Physics, vol. 71, p. 126601, Dec. 2008.

[4] G. Bussi and A. Laio, "Using metadynamics to explore complex free-energy landscapes," Nature Reviews Physics, vol. 2, p. 200, 2020.

[5] F. Fucito, C. Rebbi, and S. Solomon, "Finite Temperature QCD in the Presence of Dynamical Quarks," Nucl. Phys. B, vol. 248, pp. 615–628, 1984.

[6] M. G. Endres, R. C. Brower, W. Detmold, K. Orginos, and A. V. Pochinsky, "Multiscale Monte Carlo equilibration: Pure Yang-Mills theory," Phys. Rev. D, vol. 92, p. 114516, Dec. 2015.

[7] D. Wu, L. Wang, and P. Zhang, "Solving Statistical Mechanics Using Variational Autoregressive Networks," Phys. Rev. Lett., vol. 122, p. 080602, Mar. 2019.

[8] K. A. Nicoli, S. Nakajima, N. Strodthoff, W. Samek, K.-R. Muller, and P. Kessel, "Asymptotically unbiased estimation of physical observables with neural samplers," Phys. Rev. E, vol. 101, p. 023304, Feb. 2020.

[9] B. McNaughton, M. V. Milosevic, A. Perali, and S. Pilati, "Boosting Monte Carlo simulations of spin glasses using autoregressive neural networks," Phys. Rev. E, vol. 101, p. 053312, May 2020.

[10] D. Wu, R. Rossi, and G. Carleo, "Unbiased Monte Carlo Cluster Updates with Autoregressive Neural Networks," arXiv e-prints, p. arXiv:2105.05650, May 2021.

[11] U. Wolff, "Collective Monte Carlo updating for spin systems," Phys. Rev. Lett., vol. 62, pp. 361–364, Jan. 1989.

[12] U. Wolff, "Comparison between cluster Monte Carlo algorithms in the Ising model," Physics Letters B, vol. 228, no. 3, pp. 379–382, 1989.

[13] A. E. Ferdinand and M. E. Fisher, "Bounded and inhomogeneous Ising models. I. Specific-heat anomaly of a finite lattice," Phys. Rev., vol. 185, pp. 832–846, Sep. 1969.

[14] J. S. Liu, "Metropolized independent sampling with comparisons to rejection sampling and importance sampling," Statistics and Computing, vol. 6, no. 2, pp. 113–119, 1996.

[15] W. K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, vol. 57, pp. 97–109, 1970.

[16] L. Tierney, "Markov Chains for Exploring Posterior Distributions," The Annals of Statistics, vol. 22, no. 4, pp. 1701–1728, 1994.

[17] G. Roberts and J. Rosenthal, "Markov-Chain Monte Carlo: Some Practical Implications of Theoretical Results," The Canadian Journal of Statistics / La Revue Canadienne de Statistique, vol. 26, pp. 5–20, 1998.

[18] A. Sokal, "Monte Carlo methods in statistical mechanics: Foundations and new algorithms," in Functional Integration: Basics and Applications (C. DeWitt-Morette, P. Cartier, and A. Folacci, eds.), (Boston, MA), pp. 131–192, Springer US, 1997.

[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2018.

[20] M. S. Albergo, D. Boyda, D. C. Hackett, G. Kanwar, K. Cranmer, S. Racaniere, D. Jimenez Rezende, and P. E. Shanahan, "Introduction to Normalizing Flows for Lattice Field Theory," arXiv e-prints, p. arXiv:2101.08176, Jan. 2021.
