Riemannian manifold Langevin dynamics without the manifolds
Published:
I discuss how Riemannian manifold Langevin dynamics (RMLD) can be defined without appealing to manifolds by connecting RMLD with reversible perturbations.
Quick review of Riemannian manifold Langevin dynamics
Let π(x) be a target probability density on Rd. Riemannian manifold Langevin dynamics (RMLD) has been presented as a better proposal for the Metropolis-adjusted Langevin algorithm (MALA) than standard Langevin dynamics (LD) [1]. Let Xt be a diffusion evolving according to the stochastic differential equation (SDE)
dXt=12G−1(Xt)∇logπ(Xt)dt+12Ω(Xt)dt+√G−1(Xt)dWt
where G(Xt) is the Riemmanian metric, which is a d×d positive-definite matrix, Wt is a standard d-dimensional Brownian motion, and Ω(x) is a vector-valued function defined entry-wise as follows:
Ωi(x)=|G(x)|−1/2d∑j=1∂∂xj[(G−1)ij(x)|G(x)|1/2].
Here |⋅| denotes the determinant. In [1], the authors argue that using the Fisher information matrix will help with choosing G(x). Some intuition for why using LD as a proposal for MCMC is attractive partially stems from the fact that LD has π as its invariant density. It was noted in [2], however, that the invariant density of the RMLD presented above is not π! While the distribution it converges to is π(⋅), the density it samples from is defined with respect to the measure defined on the manifold! In fact, on Rd, RMLD has invariant density π∗(x)=π(x)|G(x)|1/2, and the note [2] corrects RMLD so that it has π(x) as its invariant density. We will walk through the steps for deriving the corrected version of RMLD in the sequel.
Position-dependent MALA
The Fokker-Planck equation defines the evolution of densities of SDEs. Suppose Xt is a diffusion process evolving in Rd according to dXt=a(Xt)dt+b(Xt)dWt and define η(t,x) to be the density of Xt. The Fokker-Planck equation is ∂η(t,x)∂t=−∑i∂∂xi[ai(x)η(t,x)]+12∑i,j∂2∂xi∂xj[Vij(x)η(t,x)] where V(x)=b(x)b(x)∗. As a quick example, one can easily check that π(x) is the invariant density of LD. Recall that LD is dXt=∇logπ(Xt)/2dt+dWt. We have −d∑i=1∂∂xi[12π∂π∂xiπ]+12d∑i=1∂2π∂x2i=−12d∑i=1∂2π∂x2i+12d∑i=1∂2π∂x2i=0. Note, however, that this is not the only choice of SDE that has π as the invariant density. Indeed, by appealing to (4), if a and b are chosen such that ai(x)π(x)+12d∑j=1∂∂xj[Vij(x)π(x)]=0, then the resulting diffusion process will have π as the invariant density. The notion of preconditioning the gradient was the inspiration for pursuing RMLD [1], and they used Riemannian geometry concepts to derive the correction term Ω(x). In the same spirit, [2] also supposed that we start with a diffusion of the form dXt=12A(Xt)∇logπ(Xt)dt+12Γ(Xt)dt+√A(Xt)dWt where A(x) is a positive-definite matrix and Γ is the correction term. Using the conditions derived above, [2] derives the correction term to be Γi(x)=d∑j=1∂Aij(x)∂xj. The resulting algorithm is called position-dependent MALA or PMALA. Clearly this is a much cleaner formulation than RM-MALA, and since it truly has π(x) as the invariant density, numerical results show that PMALA performs better than RM-MALA on a few examples. More specifically, they show that the essential sample size (ESS) and ESS per second is larger for PMALA than for RM-MALA. See [2] for further details.
RMLD as a reversible perturbation
To my knowledge, there is no rigorous analysis for why PMALA or RM-MALA works as well as they do. Studying RMLD as a reversible perturbation may provide some intuition as to why it performs better. In [3], the authors, in effect, study the unadjusted Langevin algorithm in the continuous case, and show that reversibly perturbed LD can accelerate convergence to the invariant density and reduce the asymptotic variance. This section is entirely based on [3].
Let f be a function on Rd, we are interested in computing its expectation with respect to π: ˉf=∫f(x)π(x)dx. Because LD is ergodic, this implies that limt→∞1t∫t0f(Xs)ds=∫f(x)π(x)dx. Therefore, a good enough estimate for ˉf is Stf/t for large enough t, where Stf=∫t0f(Xs)ds.
We present RMLD as a reversible perturbation by appealing to the theory of Markov generators. Define T0t to be the Markov semigroup on L2(π) of LD. note that T0tf(x)=E[f(Xt)|X0=x]. The generator of the semigroup is a linear operator acting on functions in L2(π)
L0f=limt→0T0tf−ft,
defined for functions where the limit is well-defined. For SDEs, the generator can be expressed in terms of the drift and diffusion terms:
L0f=d∑i=1ai(x)∂f∂xi+12d∑i,j=1(b(x)b(x)∗)ij∂2f∂xi∂xj.
The L2-adjoint (not weighted by π) of L0 is the Fokker-Planck operator of the system. Define the inner product ⟨f,g⟩=∫f(x)g(x)π(dx). Reversible processes have generators that are symmetric with respect to this inner product where π is the invariant distribution, i.e., ⟨f,L0g⟩=⟨L0f,g⟩. Denote σ(L0) to be the spectrum of T0t (which by the spectral mapping theorem is also the spectrum of L0). We say T0t has a spectral gap if there exists λ0>0 such that σ(L0)∖{0}⊂(−∞,−λ0], and −λ0 is called the spectral gap. Alternatively, λ0=−sup{Re(z)|z∈σ(L0),z≠0}. The existence of a spectral gap implies that L0 is negative semi-definite, i.e., ⟨f,L0f⟩≤0. Since the generator is a linear operator, the following result should not be surprising:
‖T0tf−∫fdπ‖≤Ce−λ0t‖f−∫fdπ‖
where the norm is defined to be that of L2(π).
Reversible perturbations on LD are not understood to be changes to the drift or diffusion terms directly. Rather they are understood in terms of perturbing the generator of the process, and then determining how that affects the terms in the SDE.
Definition 1
Let L=L0+S. We say S is a reversible perturbation to L0 if
The domain of L0 is contained in the domain of S, and L has invariant distribution π
⟨f,Sg⟩=⟨Sf,g⟩
⟨f,Sf⟩≤0, that is, S is negative semi-definite.
From this perspective, the RMLD can be viewed as a reversible perturbation to LD. The generator of (5) and LD are, respectively,
Lf=12(A∇logπ⋅∇f+(∇⋅A)∇f+TrA∇2f)=12(A∇logπ⋅f+∇⋅(A∇f))L0f=12(∇logπ⋅∇f+Δf).Observe that by some algebra and the divergence theorem,
⟨f,Lg⟩=12∫f(A∇logπ⋅∇g+∇⋅(A∇g))πdx=12∫f(∇π⋅A∇g+π∇⋅(A∇g))dx=12∫f∇⋅(πA∇g)dx=12∫∇⋅(fπA∇g)dx−12∫∇f⋅(πA∇g)dx=−12∫(∇f⋅A∇g)πdx.Likewise, we have ⟨f,L0g⟩=−12∫(∇f⋅∇g)πdx. Therefore, the reversible perturbation is ⟨f,Sg⟩=−12∫∇f⋅(A−I)∇gπdx. This operator is negative semi-definite if A(x)−I is positive semi-definite. This shows that RMLD is a reversibly perturbed version of LD. We can show that when this is the case, the spectral gap of L is smaller than that of L0. This is Lemma 1 in [3]
Lemma 1
The spectral gap λ of the generator L=L0+S is smaller than the spectral gap of L0.
We presented a watered down version of the proof that does not make rigorous use of results from semigroup theory and instead appeal to the reader’s linear algebra intuition.
Proof
Without loss of generality, suppose f is such that ˉf=0. Let Tt be the semigroup generated from L. If one thinks of L as a matrix, then Ttf=eLtf, where eL is something akin to the matrix exponential. (I believe this is formally referred to as the C0-semigroup. The key theorem used in the formal proof is the Hille-Yosida theorem). Then observe that ddt‖Ttf‖2=2⟨Ttf,(L0+S)Ttf⟩=2⟨Ttf,(L0+S)Ttf⟩≤2⟨Ttf,L0Ttf⟩≤−2λ0‖Ttf‖2. This implies that ‖Ttf‖≤e−λ0t‖f‖, and therefore that Re(⟨(L+λ0)f,f⟩)≤0. Therefore, the spectrum of Tt is such that Re(z)≤−λ0, which implies the spectral gap is smaller.
However, we still have not addressed the question: why is this useful for sampling? The next part shows that a smaller spectral gap implies a smaller asymptotic variance of Stt/t. First note that we may write
σ2(f)≡limt→∞tVarπ(Stf/t)=2∫∞0⟨Tt(f−ˉf),f−ˉf⟩dt=2⟨(f−ˉf),(−L)−1(f−ˉf)⟩.
For the proof, refer to Chapter 4.1-4.2 of [4]. The following lemma shows that the asymptotic variance decreases with the appropriate reversible perturbation.
Lemma 2
For any f∈L2(π), σ20(f)≤σ2(f).
The proof is straightforward and worth writing down to demonstrate the power of the semigroup/generator perspective.
Proof
We must show that ⟨(f−ˉf),(−L)−1(f−ˉf)⟩≤⟨(f−ˉf),(−L0)−1(f−ˉf)⟩. Restrict L0 to only act on functions f∈L2(π) where ˉf=0. This can be done without loss of generality. Then L0 will be negative definite (not just semidefinite) on this restricted space, which implies −L0 is positive definite and its square root is well defined. Then observe that −L0−S=(−L0)1/2(1+(−L0)−1/2(−S)(−L0)−1/2)(−L0)1/2
which implies that (−L0−S)−1=(−L0)−1/2(1+(−L0)−1/2(−S)(−L0)−1/2)−1(−L0)−1/2.
Define R=(−L0)−1/2(−S)(−L0)−1/2. The reversible perturbation S is defined to be negative semidefinite, so −S is positive semidefinite, and so R is positive semidefinite as well. Define g=(−L0)1/2(f−ˉf). The statement we need to prove reduces to ⟨g,(1+R)−1g⟩≤⟨g,g⟩. But this statement is trivial since R is positive semidefinite. (As an aside, this is true by the spectral theorem for compact self-adjoint operators and the functional calculus.) This shows that as long as S is a reversible perturbation that is negative semidefinite, the asymptotic variance of the ergodic average estimator will decrease. The rest of the paper in [3] derives many large deviations principles for these reversibly (and irreversibly) perturbed systems, and relates the asymptotic variance with the derived rate functions. They also formally show conditions that are required of the perturbations to see strict improvement. (Namely, they show that A(x)−I needs to be positive semidefinite for a reversible perturbation to lead to improvements).
References
[1] Girolami, M. and Calderhead, B., 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2), p.123-214.
[2] Xifara, T., Sherlock, C., Livingstone, S., Byrne, S. and Girolami, M., 2014. Langevin di usions and the Metropolis-adjusted Langevin algorithm. Statistics & Probability Letters, 91, pp.14-19.
[3] Rey-Bellet, L. and Spiliopoulos, K., 2016. Improving the convergence of reversible samplers. Journal of Statistical Physics, 164(3), pp.472-494.
[4] Asmussen, S. and Glynn, P.W., 2007. Stochastic simulation: algorithms and analysis (Vol. 57). Springer Science & Business Media.