Roberts and Rosenthal (1998) Optimal Scaling of Discrete Approximations to Langevin Diffusions

概要

Roberts and Rosenthal [Journal of the Royal Statistical Society. Series B 60(1998) 255-268] は MALA (Metropolis-Adjusted Langevin Algorithm) の最適スケーリングを論じたもの．

概要

MALA は (Besag, 1994) で提案され，(Roberts and Tweedie, 1996) で指数エルゴード性が示されている．(Roberts and Rosenthal, 1998) では最適スケーリングが論じられている．

Roberts and Tweedie (1996) Exponential Convergence of Langevin Distributions and Their Discrete Approximations

Roberts and Tweedie [Bernoulli 2(1996) 341-363] は MALA (Metropolis-Adjusted Langevin Algorithm) の指数エルゴード性を議論したもの．

なお，本論文 (Roberts and Rosenthal, 1998) では MALA を “Hastings-Metropolis algorithms derived from Langevin diffusions” と呼んでいる．(Neal, 1993) では Langevin Monte Carlo を Hybrid Monte Carlo の特別な場合として取り上げている．

Brown 運動の離散化を Monte Carlo 法に用いるというアイデアは (Rossky et al., 1978) でもすでに見られていた．

(Fearnhead et al., 2018) において，MALA は BPS と比較されている．¹ モデルは AR(1) を用いており，低次元ではほとんど変わらないが，高次元では BPS の方が自己相関時間が５倍良いという結論が得られている．

1 導入

1.1 MALA について

乱歩 MH は目標分布に依存せず同一の実装をもつ点が利点であるが，それ故に目標分布が複雑である場合は収束が遅いことがある．

一方で，Langevin アルゴリズムは目標分布の勾配を利用した対称的な手法で，\(\pi\) が増加する方向への提案を増やすことで収束を速めた MH 法である．

Langevin algorithms use local problem-specific information and are therefore often almost as easy to implement.

(Roberts and Tweedie, 1996) では Langevin アルゴリズムはいつでも指数エルゴード性を持つ訳ではないことがわかったが，それでも，特に高次元の設定で，乱歩 MH 法より速い収束が確認されている．本論文の結果を通じて，このことを理論的に確認することもできる！

1.2 最適スケーリングについて

(Roberts et al., 1997) では乱歩 MH の最適スケーリングが調べられ，漸近的な採択率を \(0.234\) とするのが良いこと，そして提案分散は次元 \(n\) に対して \(n^{-1}\) のオーダーで漸近的に分散させるのが良いことが示されている．

いずれも \(n\to\infty\) の漸近論的な結果であるが，(Gelman et al., 1996) で確認されたように，比較的低次元でもこの指針は有効である．

これと同じように，Langevin アルゴリズムにおいても最適スケーリングを論じたい．

MALA は目標分布の情報を用いた提案をするため，最適な採択率はより高い水準で調整されるべきであるだろう，という結果の予測が立つ．

1.3 主結果の概要

最適な漸近的スケーリングは，採択率を \(0.574\) としたものである．これは (Mountain and Thirumalai, 1994) でも実験的に検証されていた．
提案分散は \(n^{-1/3}\) のスケーリングを持つべきである．これは (Kennedy and Pendleton, 1991) でも実験的に観察されていた．
従って，アルゴリズムが収束するには \(O(n^{1/3})\) のステップが必要であり，ランダムウォーク MH 法の \(O(n)\) よりもよっぽど良い（スピード測度 1.4 を踏まえた議論）．

Furthermore, the proposal variance should scale as \(n^{-1/3}\), and thus \(O(n^{1/3})\) steps are required for the Langevin algorithm to converge.

1.4 スピード測度

speed measure が MCMC の効率性を測るにあたって極めて重要な指標であることを説明している．

まず第一に，ベイズ計算法としては，Markov 連鎖 \(X\) と関数 \(f:E\to\mathbb{R}\) に対して，その Monte Carlo 推定量の漸近分散の逆数 \[ e_f:=\left(\lim_{n\to\infty}n\mathrm{V}\left[\frac{1}{n}\sum_{i=1}^nf(X_i)\right]\right)^{-1} \] が最重要指標の１つとして挙げられる．

しかし拡散極限では，\(e_f\) は \(f\) に依存せず，スピード測度のみの関数になるのである．

これにより，高次元極限 \(n\to\infty\) においては MCMC アルゴリズムの性能比較が理論的に行える．

All other measures of efficiency are equivalent (up to a normalization constant), including those described above.

また，\(0.574\) がスピード速度 \(h\) の最大値点であるが，\([0.4,0.8]\) の区間ならばほとんど変わらない．

1.5 設定

目標分布は，ある密度 \(f\) に従う確率変数の独立同分布観測であるとし， \[ \pi_n(x):=\prod_{i=1}^nf(x_i)=:\prod_{i=1}^ne^{g(x_i)} \] であると仮定して議論する．

\(\pi_n\) に対する分散パラメータ \(\sigma^2\) を持った 可逆な Langevin 拡散 (reversible Langevin diffusion) とは， \[ d\Lambda_t=\sigma dB_t+\frac{\sigma^2}{2}\nabla\log(\pi_n(\Lambda_t))dt \] をいう．

この step variance \(\{\sigma_n^2\}\) による離散化を \[ \widetilde{\Lambda}_{t+1}=\widetilde{\Lambda}_t+\sigma_nZ_{t+1}+\frac{\sigma^2_n}{2}\nabla\log(\pi_n(\widetilde{\Lambda}_t)) \] とする．これは \(\sigma_n^2\) をどんなに小さく取っても収束しない可能性がある．

\(\widetilde{\Lambda}\) の不変分布はそのままでは \(\pi_n\) ではないから，\(\widetilde{\Lambda}\) の MH 法による修正を考える． \[ Y_{t+1}:=X_t+\sigma_nZ_{t+1}+\frac{\sigma^2_n}{2}\nabla\log(\pi_n(X_t)) \] とし， \[ X_{t+1}:=\begin{cases}Y_{t+1}&\text{確率}\;\alpha_n(X_t,Y_{t+1})\\X_t&\text{確率}1-\alpha_n(X_t,Y_{t+1})\end{cases} \] \[ \alpha_n(X_t,Y_{t+1}):=\frac{\pi_n(Y_{t+1})q_n(Y_{t+1},Y_t)}{\pi_n(X_t)q_n(X_t,Y_{t+1})}\land1 \] \[\begin{align*} &q_n(x,y):=\\ &\frac{1}{(2\pi\sigma^2_n)^{n/2}}\exp\left(-\frac{1}{2\sigma^2_n}\left\|y-x-\frac{\sigma^2_n}{2}\nabla\log(\pi_n(x))\right\|^2_2\right)\\ &=:\prod_{i=1}^nq(x_i^n,y_i). \end{align*}\]

ステップサイズ \(\sigma_n^2\) はちょうど乱歩 MH のステップサイズのような役割を果たす．大きくすると，一歩で動く幅が大きくなるが，採択率が低くなりすぎると逆効果である．

2 主結果

MALA \(\{X_t\}\) については，定常性を仮定する：\(X_0\sim\pi_n\)．

\(\pi_n\) は任意次の積率を持ち，\(g\in C^8_p(\mathbb{R}^n)\) で，ある多項式 \(M_0\) について， \[ \lvert g^{(i)}(x)\rvert\le M_0(x) \] \[ x\in\mathbb{R}^n,i=0,1,\cdots,8, \] とする．加えて，SDE の議論のために \(g'\in\mathrm{Lip}(\mathbb{R}^n)\) も仮定する．

\(\{J_t\}\subset\mathcal{L}(\Omega;\mathbb{R}^n)\) をレート \(n^{1/3}\) を持つ Poisson 過程で，\(\sigma^2_n:=l^2n^{-1/3}\;(l>0)\) に対して \[ \Gamma^n_t:=X_{J_t} \] で定まる \(\{\Gamma^n_t\}\) を純粋跳躍過程とする．

\[ \alpha_n(l):=\iint_{\mathbb{R}^n}\pi_n(x)q_n(x,y)\alpha_n(x,y)dxdy=\operatorname{E}\left[\frac{\pi_n(Y)q_n(Y,X)}{\pi_n(X)q_n(X,Y)}\land 1\right] \] を \(\Gamma\) を定める採択率の \(\pi_n\)-平均とする．

定理１（平均採択率の極限）

\[ \lim_{n\to\infty}a_n(l)=a(l):=2\Phi\left(-\frac{Kl^3}{2}\right) \] \[ K:=\sqrt{\operatorname{E}_{\pi_1}\left[\frac{5g'''(X)^2-3g''(X)^3}{48}\right]}>0. \]

定理２（拡散極限）

\(\{U^n\}\subset\mathcal{L}(\Omega;\mathbb{R})\) を \(\Gamma^n\) の第一成分とする．このとき，\(\{U^n\}\) は Skorokhod 位相について次の拡散過程 \(U\) に弱収束する： \[ dU_t=\sqrt{h(l)}dB_t+\frac{1}{2}h(l)\frac{d }{d x}\log(\pi_1(U_t))dt. \] \[ h(l):=2l^2\Phi\left(-\frac{Kl^3}{2}\right). \]

この \(h\) は \(a(l)=0.574\) を満たす点 \(l\) で最大化される．

2.1 実際的含意

Besag, J. E. (1994). Comments on “Representations of Knowledge in Complex Systems” by U. Grenander and M. I. Miller. Journal of the Royal Statistical Society. Series B (Methodological), 56(4), 591–592.

Fearnhead, P., Bierkens, J., Pollock, M., and Roberts, G. O. (2018). Piecewise deterministic markov processes for continuous-time monte carlo. Statistical Science, 33(3), 386–412.

Gelman, A., Roberts, G. O., and Gilks, W. R. (1996). Efficient Metropolis Jumping Rules. In Bayesian Statistics 5: Proceedings of the Fifth Valencia International Meeting. Oxford University Press.

Kennedy, A. D., and Pendleton, B. (1991). Acceptances and autocorrelations in hybrid monte carlo. Nuclear Physics B - Proceedings Supplements, 20, 118–121.

Mountain, R. D., and Thirumalai, D. (1994). Quantative measure of efficiency of Monte Carlo simulations. Physica A: Statistical Mechanics and Its Applications, 210(3), 453–460.

Neal, R. M. (1993). Probabilistic Inference using Markov Chain Monte Carlo Methods. Department of Computer Science, University of Toronto. Retrieved from https://glizen.com/radfordneal/review.abstract.html

Roberts, G. O., Gelman, A., and Gilks, W. R. (1997). Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. The Annals of Applied Probability, 7(1), 110–120.

Roberts, G. O., and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 60(1), 255–268.

Roberts, G. O., and Tweedie, R. L. (1996). Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, 2(4), 341–363.

Rossky, P. J., Doll, J. D., and Friedman, H. L. (1978). Brownian dynamics as smart Monte Carlo simulation. The Journal of Chemical Physics, 69(10), 4628–4633.

Footnotes

その際なぜか (Roberts and Rosenthal, 1998) を引用．↩︎