A MCMC Game-Changer
9/10/2024
A continuous-time variant of MCMC algorithms
A Blog Entry on Bayesian Computation by an Applied Mathematician
$$
$$
PDMP (Piecewise Deterministic1 Markov Process2) (Davis, 1984)
Plays a complementary role to SDEs / Diffusions
Property | PDMP | SDE |
---|---|---|
Exactly simulatable? | ||
Subject to discretization errors? | ||
Driving noise | Poisson | Gauss |
History of PDMP Applications
What We’ve Learned
The new algorithm ‘Zig-Zag Sampler’ is based on comtinuous-time process called PDMP.
What We’ll Learn in the Rest of this Section 1
We will review 3 instances of the standard (discrete-time) MCMC algorithm: MH, Lifted MH, and MALA.
Input: Target distribution p, (symmetric) proposal distribution q
Alternative View: MH is a generic procedure to turn a simple q-Markov chain into a Markov chain converging to p.
The Choise of Proposal q
Reversibility (a.k.a detailed balance): p(x)q(x|y)=p(y)q(y|x). In words: \text{Probability}[\text{Going}\;x\to y]=\text{Probability}[\text{Going}\;y\to x]. Harder to explore the entire space
Slow mixing of MH
Lifting: A method to make MH’s dynamics irreversible
How?: By adding an auxiliary variable \sigma\in\{\pm1\}, called momentum
Lifted MH (Turitsyn et al., 2011)
Input: Target p, two proposals q^{(+1)},q^{(-1)}, and momentum \sigma\in\{\pm1\}
q^{(+1)}: Only propose \rightarrow moves
q^{(-1)}: Only propose \leftarrow moves
Once going uphill, it continues to go uphill.
This is irreversible, since
\begin{align*} &\text{Probability}[x\to y]\\ &\qquad\ne\text{Probability}[y\to x]. \end{align*}
Reversible dynamic of MH has ‘irreversified’
Caution
Scale is different in the vertical axis!
Lifted MH successfully explores the edges of the target distribution.
Zig-Zag corresponds to the limiting case of lifted MH as the step size of proposal q goes to zero, as we’ll learn later.
Zig-Zag has a maximum irreversibility.
Irreversibility actually improves the efficiency of MCMC.
Faster decay of autocorrelation \rho_t\approx\mathrm{Corr}[X_0,X_t] implies
Langevin diffusion: A diffusion process defined by the following SDE:
dX_t=\nabla\log p(X_t)\,dt+\sqrt{2\beta^{-1}}dB_t.
Langevin diffusion itself converges to the target distribution p in the sense that 1 \|p_t-p\|_{L^1}\to0,\qquad t\to\infty.
Two MCMC algorithms derived from Langevin diffusion:
ULA (Unadjusted Langevin Algorithm)
\quad Use the discretization of (X_t). Discretization errors accumulate.
MALA (Metropolis Adjusted Langevin Algorithm)
\quad Use ULA as a proposal in MH, erasing the errors by MH steps.
How fast do they go back to high-probability regions? 1
Irreversibility of Zig-Zag accelerates its convergence.
Caution: Fake Continuity
The left plot looks continuous, but it actually is not.
MH, including MALA, is actually a discrete-time process.
The plot is obtained by connecting the points by line segments.
Monte Carlo estimation is also done differently:
MALA outputs (X_n)_{n\in[N]} defines
\frac{1}{N}\sum_{n=1}^Nf(X_n)\xrightarrow{N\to\infty}\int_{\mathbb{R}^d} f(x)p(x)\,dx.
Zig-Zag outputs (X_t)_{t\in[0,T]} defines
\int^T_0f(X_t)\,dt\xrightarrow{T\to\infty}\int_{\mathbb{R}^d} f(x)p(x)\,dx.
Fast and exact simulation of continuous trajectory.
As we’ve learned before, Zig-Zag corresponds to the limiting case of lifted MH as the step size of proposal q goes to zero.
‘Limiting case of lifted MH’ means that we only simulate where we should flip the momentum \sigma\in\{\pm1\} in Lifted MH.
‘Limiting case of lifted MH’ means that we only simulate where we should flip the momentum \sigma\in\{\pm1\} in Lifted MH.
(1d 1 Zig Zag sampler Bierkens, Fearnhead, et al., 2019)
Input: Gradient \nabla\log p of log target density p
For n\in\{1,2,\cdots,N\}:
(Fundamental Property of Zig-Zag Sampler (1d) Bierkens, Fearnhead, et al., 2019)
Let U(x):=-\log p(x). Simluating a Poisson point process with a rate function \lambda(x,\sigma):=\biggr(\sigma U'(x)\biggl)_++\;\gamma(x) ensures the Zig-Zag sampler converges to the target p, where \gamma is an arbitrary non-negative function.
Its ergodicity is ensured as long as there exists c,C>0 such that1 p(x)\le C\lvert x\rvert^{-c}.
Given a rate function \lambda(x,\sigma):=\biggr(\sigma U'(x)\biggl)_++\;\gamma(x) how to simulate a corresponding Poisson point process?
What We’ll Learn in the Rest of this Section 2
Take Away: Zig-Zag sampling reduces to Poisson Thinning.
What is a Poisson Point Process with rate \lambda?
The number of points in [0,t] follows a Poisson distribution with mean \int^t_0\lambda(x_s,\sigma_s)\,ds: N([0,t])\sim\mathrm{Pois}\left(M(t)\right),\qquad M(t):=\int^t_0\lambda(x_s,\sigma_s)\,ds. We want to know when the first point T_1 falls on [0,\infty).
When \displaystyle\lambda(x,\sigma)\equiv c\;(\text{constant}),
blue line: Poisson Process
red dots: Poisson Point Process
satisfying \displaystyle\textcolor{#0096FF}{N_t}=\textcolor{#E95420}{N([0,t])}\sim\mathrm{Pois}(ct).
Proposition (Simulation of Poisson Point Process)
The first arrival time T_1 of a Poisson Point Process with rate \lambda can be simulated by T_1\overset{\text{d}}{=}M^{-1}(E),\qquad E\sim\operatorname{Exp}(1),M(t):=\int^t_0\lambda(x_s,\sigma_s)\,ds, where \operatorname{Exp}(1) denotes the exponential distribution with parameter 1.
Since \displaystyle\lambda(x,\sigma):=\biggr(\sigma U'(x)\biggl)_++\;\gamma(x), M can be quite complicated.
Inverting M can be impossible.
We need more general techniques: Poisson Thinning.
To obtain the first arrival time T_1 of a Poisson Point Process with rate \lambda,
m(t): Defined via \displaystyle\lambda(x,\sigma):=\biggr(\sigma U'(x)\biggl)_++\;\gamma(x).
M(t): Simple upper bound m\le M, such that M^{-1} is analytically tractable.
In order to simulate a Poisson Point Process with rate \lambda(x,\sigma):=\biggr(\sigma U'(x)\biggl)_++\;\gamma(x), we find a invertible upper bound M that satisfies \int^t_0\lambda(x_s,\sigma_s)\,ds=m(t)\le\textcolor{#0096FF}{M}(t). for all possible Zig-Zag trajectories \{(x_s,\sigma_s)\}_{s\in[0,T]}.
Quick demonstration of the state-of-the-art performance on a toy example.
Given a target p,
Setting
Data: y_1,\cdots,y_n\in\mathbb{R} aquired by y_i\overset{\text{i.i.d.}}{\sim}\mathrm{N}(x_0,\sigma^2),\qquad i\in[n], with \sigma>0 known, x_0\in\mathbb{R} unknown.
Prior: \mathrm{N}(0,\rho^2) with known \rho>0.
Goal: Sampling from the posterior p(x)\,\propto\,\left(\prod_{i=1}^n\phi(x|y_i,\sigma^2)\right)\phi(x|0,\rho^2), where \phi(x|y,\sigma^2) is the \mathrm{N}(y,\sigma^2) density.
The negative log-likelihood: \begin{align*} U(x)&=-\log p(x)\\ &=\frac{x^2}{2\rho^2}+\frac{1}{2\sigma^2}\sum_{i=1}^n(x-y_i)^2+\mathrm{const.},\\ U'(x)&=\frac{x}{\rho^2}+\frac{1}{\sigma^2}\sum_{i=1}^n(x-y_i),\\ U''(x)&=\frac{1}{\rho^2}+\frac{n}{\sigma^2}. \end{align*}
In the rest of this Section 3, we’ll learn:
Fixing \gamma\equiv0, we obtain the upper bound M \begin{align*} m(t)&=\int^t_0\lambda(x_s,\sigma_s)\,ds=\int^t_0\biggr(\sigma U'(x_s)\biggl)_+\,ds\\ &\le\left(\frac{\sigma x}{\rho^2}+\frac{\sigma}{\sigma^2}\sum_{i=1}^n(x-y_i)+t\left(\frac{1}{\rho^2}+\frac{n}{\sigma^2}\right)\right)_+\\ &=:(a+bt)_+=\textcolor{#0096FF}{M}(t), \end{align*}
where a=\frac{\sigma x}{\rho^2}+\frac{\sigma}{\sigma^2}\sum_{i=1}^n(x-y_i),\quad b=\frac{1}{\rho^2}+\frac{n}{\sigma^2}.
We generated 100 samples from \mathrm{N}(x_0,\sigma^2) with x_0=1.
MSE (Mean Squared Error) of \{X_i\}_{i=1}^n is defined as \frac{1}{n}\sum_{i=1}^n(X_i-x_0)^2. Epoch: Unit computational cost.
The following is considered as one epoch:
Case-by-case construction of an upper bound M is too complicated / demanding.
Therefore, we are trying to automate the whole procedure.
Automatic Zig-Zag
Construction of ZZ-CV (Zig-Zag with Control Variates).
U' has an alternative form:
\begin{align*} U'(x)&=\frac{x}{\rho^2}+\frac{1}{\sigma^2}\sum_{i=1}^n(x-y_i)=:\frac{1}{n}\sum_{i=1}^nU'_i(x), \end{align*} where U'_i(x)=\frac{x}{\rho^2}+\frac{n}{\sigma^2}(x-y_i).
We only need one sample y_i to evaluate U'_i.
Instead of \lambda_{\textcolor{#78C2AD}{\text{ZZ}}}(x,\sigma)=\biggr(\sigma U'(x)\biggl)_+ we use \lambda_{\textcolor{#E95420}{\text{ZZ-CV}}}(x,\sigma)=\biggr(\sigma U'_I(x)\biggl)_+,\qquad I\sim\mathrm{U}([n]). Then, the latter is an unbiased estimator of the former: \operatorname{E}_{I\sim\mathrm{U}([n])}\biggl[\lambda_{\textcolor{#E95420}{\text{ZZ-CV}}}(x,\sigma)\biggr]=\lambda_{\textcolor{#78C2AD}{\text{ZZ}}}(x,\sigma).
Find an invertible upper bound M that satisfies \int^t_0\lambda_{\textcolor{#E95420}{\text{ZZ-CV}}}(x_s,\sigma_s)\,ds=:m_I(t)\le\textcolor{#0096FF}{M}(t),\qquad I\sim\mathrm{U}([n]). It is harder to bound \lambda_{\textcolor{#E95420}{\text{ZZ-CV}}}, since it is now an estimator (random function).
Preprocessing (once and for all)
Then, with a re-parameterization of m_i, m_i(t)\le M(t):=a+bt,
where a=(\sigma U'(x_*))_++\|U'\|_\mathrm{Lip}\|x-x_*\|_p,\qquad b:=\|U'\|_\mathrm{Lip}. And m_i is redefined as m_i(t)=U'(x_*)+U'_i(x)-U'_i(x_*).
Zig-Zag sampler with the random rate function \lambda_{\textcolor{#E95420}{\text{ZZ-CV}}}(x,\sigma)=\biggr(\sigma U'_I(x)\biggl)_+,\qquad I\sim\mathrm{U}([n]). and the upper bound M(t)=a+bt is called Zig-Zag with Control Variates (Bierkens, Fearnhead, et al., 2019).
There are currently two main approaches to scaling up MCMC for large data.
Devide-and-conquer
Devide the data into smaller chunks and run MCMC on each chunk.
Subsampling
Use a subsampling estimate of the likelihood, which does not require the entire data.
Devide the data into smaller chunks and run MCMC on each chunk.
Unbiased? | Method | Reference |
---|---|---|
WASP | (Srivastava et al., 2015) | |
Consensus Monte Carlo | (Scott et al., 2016) | |
Monte Carlo Fusion | (Dai et al., 2019) |
Use a subsampling estimate of the likelihood, which does not require the entire data.
Unbiased? | Method | Reference |
---|---|---|
Stochastic Gadient MCMC | (Welling and Teh, 2011) | |
Zig-Zag with Subsampling | (Bierkens, Fearnhead, et al., 2019) | |
Stochastic Gradient PDMP | (Fearnhead et al., 2024) |