最適化手法 – Hirofumi Shiba

非斉次 Markov 過程とも見れる？ (Robert and Casella, 2004, p. 162)．

1 確率的最適化の概観

1.1 はじめに

ニューラルネットワークの訓練に関して，目的関数の勾配を上るようにパラメータを更新する手法が取られている．

一度に全てのデータを使って勾配を計算する場合を バッチ学習 (batch method) と呼び，これが勾配降下法または最急降下法にあたる．

一方で，データを分割して逐次的に勾配を計算し最適化を実施する手法をオンライン学習 (online learning) といい，これを 確率的勾配降下法 (stochastic gradient descent, SGD) または逐次的勾配降下法 (sequential gradient descent) と呼ぶ．¹

1.2 確率的最適化の歴史

確率的最適化は，はじめは統計学の文脈で (Robbins and Monro, 1951) によってオンラインの最尤推定を題材に考察された．

これを一般化する形で (Kiefer and Wolfowitz, 1952) は 確率的勾配降下法 (SGD) を提案した．

SGD を拡張し，適応的に学習率を調整する手法としては，AdaGrad (Duchi et al., 2011) や RMSprop (Tieleman and Hinton, 2012)，そしてこれら２つの長所を組み合わせた Adam (Kingma and Ba, 2017) が提案された．²

1.3 (Robbins and Monro, 1951)

目的関数が \[ h(x)=\operatorname{E}[H(x,Z)] \] の形で与えられるとする．³

\(h(x)=\beta\) の解 \(x=\theta\) を求める問題を考える．のちに \(\max_{x\in\mathcal{X}}h(x)\) を求める問題に拡張したのが (Kiefer and Wolfowitz, 1952) である．

定理（Robbins-Monro アルゴリズム）⁴

\(Z_j\overset{\text{i.i.d.}}{\sim}p(z|X_j)\) とし， \[ X_{j+1}=X_j+\gamma_j\biggr(\beta-H(Z_j,X_j)\biggl) \] によって Markov 連鎖 \(\{X_j\}\) を定める．

このとき，\(\{\gamma_n\}\subset\mathbb{R}^+\) が次の３条件を満たすならば，\(X_j\xrightarrow{\;\text{a.s.}}\theta\) が成り立つ．

\[ \|\gamma\|_1=\infty,\quad\|\gamma\|_2<\infty \]
\(\{X_j\}\) は有界 \(\sup_{j\in\mathbb{N}}\lvert X_j\rvert<\infty\) で \[ \operatorname{E}[H(X_j,Z_j)|Z_j]=h(X_j) \]
ある \(\theta\in\mathcal{X}\) が存在して，任意の \(\delta\in(0,1)\) について \[ \inf_{\delta\le\lvert\theta-x\rvert\le1/\delta}(x-\theta)(h(x)-\beta)>0 \]

証明

(Robbins and Monro, 1951, pp. 401–402) の結果を拡張した形になっている．

2 SGD の振る舞い

2.1 はじめに

ニューラルネットワークの訓練において，SGD は特に良い性質を示しているが，その理由は未だ十分に解明されていない．

例えば，正則化に寄与している（暗黙的正則化 implicit regularization）ということが明らかになりつつある．⁵

(Smith and Le, 2018) によると，鋭い谷 (sharp minima) に捕まりにくく，広い谷 (flat minima) に入りやすいという性質が汎化性能に寄与しているという．(Imaizumi and Schmidt-Hieber, 2023) は理論的な説明を与えた．

References

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer New York.

Bouleau, N., and Lépingle, D. (1993). Numerical methods for stochastic processes. Wiley.

Chizat, L., and Bach, F. (2020). Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In J. Abernethy and S. Agarwal, editors, Proceedings of thirty third conference on learning theory,Vol. 125, pages 1305–1338. PMLR.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. In Journal of machine learning research,Vol. 12, pages 2121–2159.

Imaizumi, M., and Schmidt-Hieber, J. (2023). On generalization bounds for deep networks based on loss surface implicit regularization. IEEE Transactions on Information Theory, 69(2), 1203–1223.

Kiefer, J., and Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 22(3), 462–466.

Kingma, D. P., and Ba, J. (2017). Adam: A method for stochastic optimization.

Moroshko, E., Woodworth, B. E., Gunasekar, S., Lee, J. D., Srebro, N., and Soudry, D. (2020). Implicit bias in deep linear classification: Initialization scale vs training accuracy. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in neural information processing systems,Vol. 33, pages 22182–22193. Curran Associates, Inc.

Murphy, K. P. (2022). Probabilistic machine learning: An introduction. MIT Press.

Robbins, H., and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.

Robert, C. P., and Casella, G. (2004). Monte carlo statistical methods. Springer New York.

Smith, S. L., and Le, Q. V. (2018). A bayesian perspective on generalization and stochastic gradient descent. In International conference on learning representations.

Tieleman, T., and Hinton, G. (2012). Neural networks for machine learning. Lecture 6.5: Divide the gradient by a running average of its recent magnitude.

麻生英樹, 安田宗樹, 前田新一, 岡野原大輔, 岡谷貴之, 久保陽太郎, and ボレガラダヌシカ. (2015). 深層学習. 近代科学社.

Footnotes

(Bishop, 2006, p. 240) 5.2.4節．↩︎
(麻生英樹 et al., 2015, p. 145) など．↩︎
この稿も参照．↩︎
(Bouleau and Lépingle, 1993)，(Robert and Casella, 2004, p. 202) 定理5.24．↩︎
(Murphy, 2022, p. 455), (Chizat and Bach, 2020), (Moroshko et al., 2020) も参照．↩︎