強化学習 – Hirofumi Shiba

1 導入

1.1 他の機械学習手法との違い

強化学習は機械学習の３分類の１つとして「他の２類型（教師あり・教師なし学習）とは大きく文脈が違う」という注記と共によく紹介される．それもその通り．強化学習は，エージェントが世界の中でどのように行動するかを，環境との相互作用を通じて自律的に学ぶ，という人工知能分野の問題設定から生じた学問である (Sutton and Barto, 2018)．

強化学習の設定にはいくつか特徴がある．

試行錯誤の中で学ぶこと：教師あり学習のように，報酬を最大化するように学んでいくが，学習データというものはなく，試行錯誤の中で学ぶ必要がある．
報酬は遅れてくるものもあること：行動の結果がすぐに報酬として返ってくるわけではなく，以前の行動が未来の報酬に影響を与えることがあり，それを踏まえて学習をすることが求められる．

すると，強化学習は専ら動的計画法の議論が中心となる．

1.2 強化学習の応用

強化学習は，部分的に観測されている Markov 決定過程の最適制御として理解される．

在庫管理 (Van Roy et al., 1997)，動的なチャンネルの割り当て (Singh and Bertsekas, 1996)，エレベータ制御 (Crites and Barto, 1998)，テーブルゲーム (Silver et al., 2018)，気候変動対策 (Rolnick et al., 2022) などにも用いられている．

また，深層学習と組み合わせることで，DeepMind の AlphaGo (Silver et al., 2016) と AlphaGoZero (Silver et al., 2017) は囲碁において人類の追随を許さない実力をつけた．

今後も，人間のフィードバックによる強化学習 (RLHF: Reinforcement Learning through Human Feedback) (Christiano et al., 2017) や GNN のトレーニングなど，他の機械学習手法と組み合わせることでより大きな AI システムを作るにあたって，強化学習は必要不可欠な立場を占めていくだろう．

1.3 歴史

2 Markov 決定過程

定義 (Markov decision process, policy)¹

状態空間 \(\mathcal{X}\)，行動空間 \(\mathcal{A}\) という２つの可測空間と，確率核 \[ P:\mathcal{X}\times\mathcal{A}\to\mathcal{X}\times\mathbb{R} \] との3-組を Markov 決定過程 という．
\(\mathcal{X},\mathcal{A}\) がいずれも有限集合であるとき，MDP \((\mathcal{X},\mathcal{A},P)\) を有限であるという．
確率核 \(\pi:\mathcal{X}\to\mathcal{A}\) を方策という．

多くの場合，\(P(x,a)\in\mathcal{P}(\mathcal{X}\times\mathbb{R})\) は直積 \[ P(x,a)=P_{\mathcal{X}}(x,a)\otimes P_{\mathcal{R}}(x,a) \] で与えられるとする．

3 \(Q\)-学習

強化学習の最も標準的かつ古典的なアルゴリズムが，\(Q\)-学習 (Watkins, 1989), (Watkins and Dayan, 1992) と，その派生アルゴリズムである SARSA である (Powell, 2011, p. 122)．

3.1 TD 学習

推移確率が未知である場合，これをシミュレーションによって推定することができる．この設定では 時間差分学習 (Temporal Difference Learning) とも呼ばれる (Sutton, 1988)．

References

Bellemare, M. G., Dabney, W., and Rowland, M. (2023). Distributional reinforcement learning. The MIT Press.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in neural information processing systems,Vol. 30.

Crites, R. H., and Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33, 235–262.

Powell, W. B. (2011). Approximate dynamic programming: Solving the curses of dimensionality. John Wiley & Sons.

Rolnick, D., Donti, P. L., Kaack, L. H., Kochanski, K., Lacoste, A., Sankaran, K., … Bengio, Y. (2022). Tackling climate change with machine learning. ACM Computing Survey, 55(2).

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Driessche, G. van den, … Hassabis, D. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., … Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419), 1140–1144.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … Hassabis, D. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354–359.

Singh, S., and Bertsekas, D. (1996). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in neural information processing systems 9.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.

Sutton, R. S., and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

Van Roy, B., Bertsekas, D. P., Lee, Y., and Tsitsiklis, J. N. (1997). A neuro-dynamic programming approach to retailer inventory management. In Proceedings of the 36th IEEE conference on decision and control.

Watkins, C. J. C. H. (1989). Learning from delayed rewards (PhD thesis). University of London. Retrieved from https://www.cs.rhul.ac.uk/~chrisw/thesis.html

Watkins, C. J. C. H., and Dayan, P. (1992). Q-leraning. Machine Learning, 8, 279–292.

Footnotes

(Bellemare et al., 2023, p. 15), (Sutton and Barto, 2018, p. 48)．↩︎