Riemannian manifold Langevin dynamics without the manifolds
Published:
I discuss how Riemannian manifold Langevin dynamics (RMLD) can be defined without appealing to manifolds by connecting RMLD with reversible perturbations.
Quick review of Riemannian manifold Langevin dynamics
Let $\pi(x)$ be a target probability density on $\mathbb{R}^d$. Riemannian manifold Langevin dynamics (RMLD) has been presented as a better proposal for the Metropolis-adjusted Langevin algorithm (MALA) than standard Langevin dynamics (LD) [1]. Let $X_t$ be a diffusion evolving according to the stochastic differential equation (SDE)
\begin{align} dX_t = \frac{1}{2}G^{-1}(X_t) \nabla \log \pi(X_t) \,dt + \frac{1}{2}\Omega(X_t) \, dt + \sqrt{G^{-1}(X_t)} \, dW_t \end{align}
where $G(X_t)$ is the Riemmanian metric, which is a $d\times d$ positive-definite matrix, $W_t$ is a standard $d$-dimensional Brownian motion, and $\Omega(x)$ is a vector-valued function defined entry-wise as follows:
\begin{align} \Omega_i(x) = |G(x)|^{-1/2} \sum_{j = 1}^d \frac{\partial}{\partial x_j}[(G^{-1})_{ij} (x) |G(x)|^{1/2}]. \end{align}
Here $|\cdot|$ denotes the determinant. In [1], the authors argue that using the Fisher information matrix will help with choosing $G(x)$. Some intuition for why using LD as a proposal for MCMC is attractive partially stems from the fact that LD has $\pi$ as its invariant density. It was noted in [2], however, that the invariant density of the RMLD presented above is not $\pi$! While the distribution it converges to is $\pi(\cdot)$, the density it samples from is defined with respect to the measure defined on the manifold! In fact, on $\mathbb{R}^d$, RMLD has invariant density $\pi^*(x) = \pi(x) |G(x)|^{1/2}$, and the note [2] corrects RMLD so that it has $\pi(x)$ as its invariant density. We will walk through the steps for deriving the corrected version of RMLD in the sequel.
Position-dependent MALA
The Fokker-Planck equation defines the evolution of densities of SDEs. Suppose $X_t$ is a diffusion process evolving in $\mathbb{R}^d$ according to \begin{align} dX_t = a(X_t) \, dt + b(X_t) \, dW_t \end{align} and define $\eta(t,x)$ to be the density of $X_t$. The Fokker-Planck equation is \begin{align} \frac{\partial \eta(t,x)}{\partial t} = -\sum_{i} \frac{\partial}{\partial x_i} \left[a_i(x) \eta(t,x) \right] + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} \left[V_{ij}(x) \eta(t,x)\right] \label{eq:fp} \end{align} where $V(x) = b(x) b(x)^*$. As a quick example, one can easily check that $\pi(x)$ is the invariant density of LD. Recall that LD is $dX_t = \nabla \log \pi(X_t)/2 \, dt + dW_t$. We have \begin{align*} -\sum_{i = 1}^d \frac{\partial}{\partial x_i}\left[ \frac{1}{2\pi} \frac{\partial \pi}{\partial x_i} \pi\right] + \frac{1}{2} \sum_{i = 1}^ d \frac{\partial^2 \pi}{\partial x_i^2} = - \frac{1}{2} \sum_{i = 1}^d \frac{\partial^2\pi}{\partial x_i^2} + \frac{1}{2} \sum_{i = 1}^ d \frac{\partial^2 \pi}{\partial x_i^2} = 0. \end{align*} Note, however, that this is not the only choice of SDE that has $\pi$ as the invariant density. Indeed, by appealing to \eqref{eq:fp}, if $a$ and $b$ are chosen such that \begin{align*} a_i(x) \pi(x) + \frac{1}{2} \sum_{j = 1}^d \frac{\partial}{\partial x_j}\left[ V_{ij}(x) \pi(x) \right] = 0, \end{align*} then the resulting diffusion process will have $\pi$ as the invariant density. The notion of preconditioning the gradient was the inspiration for pursuing RMLD [1], and they used Riemannian geometry concepts to derive the correction term $\Omega(x)$. In the same spirit, [2] also supposed that we start with a diffusion of the form \begin{align} dX_t = \frac{1}{2} A(X_t) \nabla \log \pi(X_t) \, dt+\frac{1}{2}\Gamma(X_t) dt + \sqrt{A(X_t)} \, dW_t \label{eq:pld} \end{align} where $A(x)$ is a positive-definite matrix and $\Gamma$ is the correction term. Using the conditions derived above, [2] derives the correction term to be \begin{align} \Gamma_i(x) = \sum_{j = 1}^d \frac{\partial A_{ij}(x)}{\partial x_j}. \end{align} The resulting algorithm is called position-dependent MALA or PMALA. Clearly this is a much cleaner formulation than RM-MALA, and since it truly has $\pi(x)$ as the invariant density, numerical results show that PMALA performs better than RM-MALA on a few examples. More specifically, they show that the essential sample size (ESS) and ESS per second is larger for PMALA than for RM-MALA. See [2] for further details.
RMLD as a reversible perturbation
To my knowledge, there is no rigorous analysis for why PMALA or RM-MALA works as well as they do. Studying RMLD as a reversible perturbation may provide some intuition as to why it performs better. In [3], the authors, in effect, study the unadjusted Langevin algorithm in the continuous case, and show that reversibly perturbed LD can accelerate convergence to the invariant density and reduce the asymptotic variance. This section is entirely based on [3].
Let $f$ be a function on $\mathbb{R}^d$, we are interested in computing its expectation with respect to $\pi$: $\bar{f} = \int f(x) \pi(x) \, dx$. Because LD is ergodic, this implies that \begin{align*} \lim_{t \to \infty} \frac{1}{t} \int_0^t f(X_s) \, ds = \int f(x) \pi(x) \, dx. \end{align*} Therefore, a good enough estimate for $\bar{f}$ is $S_t f/t$ for large enough $t$, where $S_tf = \int_0^t f(X_s) \, ds$.
We present RMLD as a reversible perturbation by appealing to the theory of Markov generators. Define $T_t^0$ to be the Markov semigroup on $L^2(\pi)$ of LD. note that $T_t^0 f(x) = \mathbb{E}[f(X_t) |X_0 = x]$. The generator of the semigroup is a linear operator acting on functions in $L^2(\pi)$
\begin{align} \mathcal{L}_0f = \lim_{t \to 0} \frac{T_t^0 f - f}{t}, \end{align}
defined for functions where the limit is well-defined. For SDEs, the generator can be expressed in terms of the drift and diffusion terms:
\begin{align*} \mathcal{L}_0f = \sum_{i =1 }^d a_i(x) \frac{\partial f}{\partial x_i} + \frac{1}{2} \sum_{i,j = 1}^d (b(x)b(x)^*)_{ij} \frac{\partial^2 f}{\partial x_i\partial x_j}. \end{align*}
The $L^2$-adjoint (not weighted by $\pi$) of $\mathcal{L}_0$ is the Fokker-Planck operator of the system. Define the inner product $\langle f,g\rangle = \int f(x) g(x) \pi(dx)$. Reversible processes have generators that are symmetric with respect to this inner product where $\pi$ is the invariant distribution, i.e., $\langle f,\mathcal{L}_0 g\rangle = \langle \mathcal{L}_0 f,g\rangle$. Denote $\sigma(\mathcal{L}_0)$ to be the spectrum of $T_t^0$ (which by the spectral mapping theorem is also the spectrum of $\mathcal{L}_0$). We say $T_t^0$ has a spectral gap if there exists $\lambda_0>0$ such that $\sigma(\mathcal{L}_0) \backslash \{0\} \subset (-\infty, -\lambda_0]$, and $-\lambda_0$ is called the spectral gap. Alternatively, $ \lambda_0 = -\sup\{\text{Re}(z)| z\in \sigma(\mathcal{L}_0), z\neq 0\}$. The existence of a spectral gap implies that $\mathcal{L}_0$ is negative semi-definite, i.e., $\langle f,\mathcal{L}_0 f \rangle \le 0$. Since the generator is a linear operator, the following result should not be surprising:
\begin{align} \left\lVert T_t^0 f - \int f\, d\pi \right\rVert \le Ce^{-\lambda_0 t} \left\lVert f- \int f \, d\pi \right\rVert \end{align}
where the norm is defined to be that of $L^2(\pi)$.
Reversible perturbations on LD are not understood to be changes to the drift or diffusion terms directly. Rather they are understood in terms of perturbing the generator of the process, and then determining how that affects the terms in the SDE.
Definition 1
Let $\mathcal{L} = \mathcal{L}_0 + \mathcal{S}$. We say $\mathcal{S}$ is a reversible perturbation to $\mathcal{L}_0$ if
The domain of $\mathcal{L}_0$ is contained in the domain of $\mathcal{S}$, and $\mathcal{L}$ has invariant distribution $\pi$
$\langle f,\mathcal{S} g\rangle = \langle \mathcal{S}f , g\rangle $
$\langle f,\mathcal{S} f\rangle \le 0$, that is, $\mathcal{S}$ is negative semi-definite.
From this perspective, the RMLD can be viewed as a reversible perturbation to LD. The generator of \eqref{eq:pld} and LD are, respectively,
\[\begin{align} \mathcal{L}f &= \frac{1}{2}(A \nabla \log \pi \cdot \nabla f + (\nabla \cdot A) \nabla f + \text{Tr} A \nabla^2 f ) = \frac{1}{2}\left(A\nabla \log \pi \cdot f + \nabla \cdot( A\nabla f) \right) \\ \mathcal{L}_0 f &= \frac{1}{2} \left(\nabla \log \pi \cdot \nabla f + \Delta f \right). \end{align}\]Observe that by some algebra and the divergence theorem,
\[\begin{align*} \langle f,\mathcal{L}g\rangle &= \frac{1}{2} \int f(A\nabla \log \pi \cdot \nabla g + \nabla \cdot (A\nabla g) ) \pi \,dx \nonumber \\ &= \frac{1}{2} \int f(\nabla \pi \cdot A\nabla g + \pi \nabla \cdot (A\nabla g)) \,dx \nonumber \\ &= \frac{1}{2} \int f\nabla \cdot (\pi A\nabla g) \,dx \nonumber \\ &= \frac{1}{2} \int \nabla\cdot(f\pi A\nabla g) \, dx - \frac{1}{2} \int \nabla f \cdot ( \pi A\nabla g) \, dx \nonumber \\ &= -\frac{1}{2} \int (\nabla f \cdot A\nabla g) \pi \, dx. \nonumber \end{align*}\]Likewise, we have $\langle f, \mathcal{L}_0 g\rangle = -\frac{1}{2} \int (\nabla f \cdot \nabla g) \pi \,dx$. Therefore, the reversible perturbation is \begin{align} \langle f,\mathcal{S} g\rangle = - \frac{1}{2} \int \nabla f \cdot (A - I) \nabla g \, \pi \, dx. \end{align} This operator is negative semi-definite if $A(x) - I$ is positive semi-definite. This shows that RMLD is a reversibly perturbed version of LD. We can show that when this is the case, the spectral gap of $\mathcal{L}$ is smaller than that of $\mathcal{L}_0$. This is Lemma 1 in [3]
Lemma 1
The spectral gap $\lambda$ of the generator $\mathcal{L}= \mathcal{L}_0 + \mathcal{S}$ is smaller than the spectral gap of $\mathcal{L}_0$.
We presented a watered down version of the proof that does not make rigorous use of results from semigroup theory and instead appeal to the reader’s linear algebra intuition.
Proof
Without loss of generality, suppose $f$ is such that $ \bar{f} = 0$. Let $T_t$ be the semigroup generated from $\mathcal{L}$. If one thinks of $\mathcal{L}$ as a matrix, then $T_t f = e^{\mathcal{L} t} f$, where $e^{\mathcal{L}}$ is something akin to the matrix exponential. (I believe this is formally referred to as the $C_0$-semigroup. The key theorem used in the formal proof is the Hille-Yosida theorem). Then observe that \begin{align*} \frac{d}{dt} \left\lVert T_t f\right\rVert^2 = 2\langle T_t f, (\mathcal{L}_0 + \mathcal{S}) T_t f\rangle = 2\langle T_t f, (\mathcal{L}_0 + \mathcal{S}) T_t f \rangle \le 2 \langle T_t f, \mathcal{L}_0T_t f\rangle \le -2 \lambda_0 \left\lVert T_tf \right\rVert^2. \end{align*} This implies that $\left\lVert T_t f\right\rVert \le e^{-\lambda_0t} \left\lVert f \right\rVert$, and therefore that $\text{Re}(\langle ( \mathcal{L} + \lambda_0) f, f\rangle) \le 0$. Therefore, the spectrum of $T_t$ is such that $\text{Re}(z) \le -\lambda_0$, which implies the spectral gap is smaller.
However, we still have not addressed the question: why is this useful for sampling? The next part shows that a smaller spectral gap implies a smaller asymptotic variance of $S_t t/t$. First note that we may write
\begin{align*} \sigma^2(f) \equiv \lim_{t\to\infty} t \text{Var}_\pi(S_t f/t) = 2\int_0^\infty \langle T_t (f-\bar{f}), f-\bar{f} \rangle \, dt = 2 \langle (f-\bar{f}),(-\mathcal{L})^{-1}(f-\bar{f}) \rangle. \end{align*}
For the proof, refer to Chapter 4.1-4.2 of [4]. The following lemma shows that the asymptotic variance decreases with the appropriate reversible perturbation.
Lemma 2
For any $f\in L^2(\pi)$, $\sigma_0^2(f) \le \sigma^2(f)$.
The proof is straightforward and worth writing down to demonstrate the power of the semigroup/generator perspective.
Proof
We must show that $\langle (f-\bar{f}),(-\mathcal{L})^{-1}(f-\bar{f})\rangle \le \langle (f-\bar{f}),(-\mathcal{L}_0)^{-1}(f-\bar{f})\rangle$. Restrict $\mathcal{L}_0$ to only act on functions $f\in L^2(\pi)$ where $\bar{f} = 0$. This can be done without loss of generality. Then $\mathcal{L}_0$ will be negative definite (not just semidefinite) on this restricted space, which implies $-\mathcal{L}_0$ is positive definite and its square root is well defined. Then observe that \begin{align*} -\mathcal{L}_0 - \mathcal{S} = (-\mathcal{L}_0)^{1/2}\left(\mathbf{1} +(-\mathcal{L}_0)^{-1/2}(-\mathcal{S})(-\mathcal{L}_0)^{-1/2}\right) (-\mathcal{L}_0)^{1/2} \end{align*}
which implies that \begin{align*} (-\mathcal{L}_0 - \mathcal{S})^{-1} = (-\mathcal{L}_0)^{-1/2}\left(\mathbf{1} +(-\mathcal{L}_0)^{-1/2}(-\mathcal{S})(-\mathcal{L}_0)^{-1/2}\right)^{-1} (-\mathcal{L}_0)^{-1/2}. \end{align*}
Define $\mathcal{R} = (-\mathcal{L}_0)^{-1/2}(-\mathcal{S})(-\mathcal{L}_0)^{-1/2}$. The reversible perturbation $\mathcal{S}$ is defined to be negative semidefinite, so $-\mathcal{S}$ is positive semidefinite, and so $\mathcal{R}$ is positive semidefinite as well. Define $g = (-\mathcal{L}_0)^{1/2}(f-\bar{f})$. The statement we need to prove reduces to \begin{align} \langle g,(\mathbf{1}+ \mathcal{R})^{-1}g \rangle \le \langle g,g\rangle. \end{align} But this statement is trivial since $\mathcal{R}$ is positive semidefinite. (As an aside, this is true by the spectral theorem for compact self-adjoint operators and the functional calculus.) This shows that as long as $\mathcal{S}$ is a reversible perturbation that is negative semidefinite, the asymptotic variance of the ergodic average estimator will decrease. The rest of the paper in [3] derives many large deviations principles for these reversibly (and irreversibly) perturbed systems, and relates the asymptotic variance with the derived rate functions. They also formally show conditions that are required of the perturbations to see strict improvement. (Namely, they show that $A(x)-I$ needs to be positive semidefinite for a reversible perturbation to lead to improvements).
References
[1] Girolami, M. and Calderhead, B., 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2), p.123-214.
[2] Xifara, T., Sherlock, C., Livingstone, S., Byrne, S. and Girolami, M., 2014. Langevin di usions and the Metropolis-adjusted Langevin algorithm. Statistics & Probability Letters, 91, pp.14-19.
[3] Rey-Bellet, L. and Spiliopoulos, K., 2016. Improving the convergence of reversible samplers. Journal of Statistical Physics, 164(3), pp.472-494.
[4] Asmussen, S. and Glynn, P.W., 2007. Stochastic simulation: algorithms and analysis (Vol. 57). Springer Science & Business Media.