VAE: Variational Autoencoders

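Overview

A variational autoencoder (VAE) is an algorithm that uses variational Bayesian inference to learn a latent variable model whose marginal distribution is the data distribution. Its key idea is to approximate the variational lower bound, which was traditionally hard to compute or approximate, with neural networks. Because new data can be generated by sampling from the learned Bayesian latent variable model, the VAE is also classified as a deep generative model.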

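1 Autoencoders (AE)

1.1 Introduction

Principal component analysis (PCA) is a multivariate analysis method that seeks a good summary of the data by mapping it linearly onto a low-dimensional linear subspace.

Making this method nonlinear via kernel methods allows it to represent the hidden structure of the data better (Lawrence, 2005). In exactly the same way, PCA can be made nonlinear with neural networks (Cottrell and Munro, 1988); the resulting architecture is called an autoencoder.¹

When contrasted with the VAE, an autoencoder is also called a deterministic autoencoder.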

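Differences between AE and VAE

The encoder \(q\) is a probability kernel in a VAE, whereas in an AE it is a deterministic function.²
The training objectives differ. An AE uses only the reconstruction error, whereas a VAE adds a KL divergence term that requires the latent representation to be close to a prescribed prior distribution \(p(z)\,dz=\operatorname{N}(0,I_d)\).³

As a result, an AE basically cannot be used as a generative model: its decoder is never trained to handle samples \(Z\sim\operatorname{N}(0,I_d)\) from the prior.

On the other hand, an AE is better at reconstructing images in the dataset, and the larger the \(\beta\) of a \(\beta\)-VAE, the blurrier the reconstructed images become.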

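1.2 PCA by neural networks

In fact, PCA is equivalent to minimizing the squared reconstruction error of a three-layer autoencoder, except that the learned basis is not orthonormalized (Bourlard and Kamp, 1988), (Baldi and Hornik, 1989), (Karhunen and Joutsensalo, 1995).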
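The equivalence can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a linear autoencoder with tied weights given by the top-\(k\) principal directions; the function name `pca_reconstruct` and the toy data are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def pca_reconstruct(X, k):
    """Encode centered data with the top-k principal directions and decode it back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions (right singular vectors).
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                # (d, k): tied encoder/decoder weights
    Z = Xc @ W                  # encoder: linear map to the k-dim bottleneck
    X_hat = Z @ W.T + mean      # decoder: linear map back to data space
    return X_hat, s

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy correlated data
X_hat, s = pca_reconstruct(X, k=2)
err = np.sum((X - X_hat) ** 2)   # squared reconstruction error of the linear AE
```

The squared error equals the sum of the squared discarded singular values, which is the minimum attainable by any linear bottleneck of width \(k\). Replacing `W` by `W @ A` in the encoder and `np.linalg.inv(A) @ W.T` in the decoder, for any invertible `A`, gives the same error, which is why the learned basis need not be orthonormal.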

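However, a three-layer network cannot perform nonlinear dimensionality reduction even when nonlinear activations are added (Bourlard and Kamp, 1988). With four or more layers the situation changes, and the network becomes a genuinely nonlinear extension of PCA (Japkowicz et al., 2000).

The price is that the objective is no longer guaranteed to be quadratic, so the problem leaves the realm of convex optimization and loses theoretical guarantees such as the global optimum always being found.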

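1.3 Learning latent representations via penalties

The network described above has an hourglass shape: it tries to obtain a low-dimensional latent representation by making the middle hidden layer narrow.

Instead of controlling the latent representation through the architecture in this way, one can encourage the desired latent representation explicitly through the objective function.

For example, adding a LASSO-like penalty term to the original objective \(E\), \[ \widetilde{E}(w)=E(w)+\lambda\sum_{k=1}^K\lvert z_k\rvert, \] encourages a sparse latent representation, where \(z_k\) denotes the activation of the \(k\)-th latent unit. This regularization term is called activity regularization.⁴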
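As a minimal sketch of this penalized objective, assuming a squared reconstruction error for \(E\) and NumPy arrays for the data, reconstruction, and latent activations (the function name `sparse_ae_loss` is hypothetical):

```python
import numpy as np

def sparse_ae_loss(x, x_hat, z, lam):
    """Reconstruction error plus an L1 penalty on the latent activations z."""
    recon = np.sum((x_hat - x) ** 2)       # E(w), assumed here to be squared error
    penalty = lam * np.sum(np.abs(z))      # lambda * sum_k |z_k|
    return recon + penalty

loss = sparse_ae_loss(
    x=np.array([1.0, 2.0]),
    x_hat=np.array([1.0, 1.0]),
    z=np.array([0.5, -1.5, 0.0]),
    lam=0.1,
)
```

The penalty acts on the activations \(z_k\) rather than on the weights \(w\), which is what distinguishes activity regularization from ordinary weight decay.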

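1.4 Sparse representations via the choice of neuron

Instead of imposing a penalty, using rectifying neurons \(f(x)=x\lor0\) (that is, \(\max(x,0)\)) also leads to sparse latent representations (Glorot et al., 2011).

Latent representations obtained this way have fewer units that are "always zero" than those obtained with an \(l^1\) penalty. This is closer to the activity of the brain (Beyeler, 2019) and is therefore considered desirable.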

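1.5 Denoising Autoencoder (DAE)

Training the network to restore the original data vector \(x_n\) from a noise-corrupted version \(\widetilde{x}_n\), using an objective such as \[ E(w)=\sum_{n=1}^N\|y_w(\widetilde{x}_n)-x_n\|^2, \] yields latent representations that are robust to noise.

This was proposed as the denoising autoencoder (DAE) (Vincent et al., 2008), (Vincent et al., 2010), and it was soon recognized that it is equivalent to estimating a certain energy-based model by score matching (Vincent, 2011).

The motivating question of (Vincent et al., 2008) was why layer-wise unsupervised pretraining succeeds in initializing deep models. Their results suggest that objectives like that of the denoising autoencoder are what yields initial values from which deep models can be trained successfully.

The success of the DAE can be attributed to the fact that it learns the score vector field. Suppose noise is added by \[ \widetilde{x}_i=x_i+\sigma\epsilon,\qquad\epsilon\sim\operatorname{N}_d(0,I_d), \] and a DAE \(r\) is trained with the loss \[ \ell(x_i,r(\widetilde{x}_i))=\|r(\widetilde{x}_i)-x_i\|_2^2. \] Then, under certain conditions, \[ r(\widetilde{x})-\widetilde{x}\approx\sigma^2\nabla\log p(\widetilde{x})\qquad(\sigma\to0) \] is said to hold. That is, given slightly perturbed data, the DAE can project it back toward the true data manifold (remove the noise) and return the result.⁵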
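The relation between the denoising residual and the score can be checked on a toy case where everything is known in closed form. The sketch below assumes one-dimensional Gaussian data \(x\sim\operatorname{N}(0,s^2)\), for which \(\nabla\log p(x)=-x/s^2\), and a linear denoiser fitted by least squares; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
s, sigma = 2.0, 0.3                      # data std and noise std (illustrative)
x = rng.normal(0.0, s, size=200_000)     # clean data x ~ N(0, s^2)
x_tilde = x + sigma * rng.normal(size=x.shape)

# Least-squares fit of a linear denoiser r(x_tilde) = a * x_tilde,
# i.e. the minimizer of sum_i (a * x_tilde_i - x_i)^2.
a = np.dot(x_tilde, x) / np.dot(x_tilde, x_tilde)

# The residual r(x_tilde) - x_tilde = (a - 1) * x_tilde, rescaled by sigma^2,
# should approximate the score -x_tilde / s^2 when sigma is small.
score_slope_est = (a - 1.0) / sigma**2
score_slope_true = -1.0 / s**2
```

Here the optimal coefficient is \(a=s^2/(s^2+\sigma^2)\), so the estimated slope is \(-1/(s^2+\sigma^2)\) up to sampling error, which tends to the true score slope \(-1/s^2\) as \(\sigma\to0\).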

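1.6 Contractive Autoencoder (CAE) (Rifai et al., 2011)

Consider penalizing the Frobenius norm of the Jacobian of the encoder \(f\) in addition to the original objective \(E\): \[ \widetilde{E}(w)=E(w)+\lambda\|J_f(x)\|_F^2. \] Since this favors an encoder \(f\) whose Jacobian is contractive, the learned \(f\) pushes inputs that lie off the submanifold formed by the data back into that submanifold.

This is called a contractive autoencoder. Training is slowed down by the need to compute \(J_f\).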
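For a single-layer encoder the Jacobian penalty has a cheap closed form. The following is a minimal sketch, assuming the encoder \(f(x)=\operatorname{sigmoid}(Wx+b)\); the function names are hypothetical, and deeper encoders would generally require automatic differentiation instead.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(W, b, x):
    """Squared Frobenius norm of the Jacobian of f(x) = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)
    # J = diag(h * (1 - h)) @ W, so ||J||_F^2 factorizes row by row.
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)
penalty = contractive_penalty(W, b, x)
```

The factorized form avoids building the full Jacobian, which is one way the extra training cost can be kept small in this special case.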

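1.7 Learning latent representations via masking

BERT (Devlin et al., 2019) randomly drops parts of the data (masking) and learns to predict them, thereby acquiring extremely rich latent representations of language.

The masked autoencoder (K. He et al., 2022) trains an AE by dropping parts of the data instead of adding noise. This is the current state of the art.

This method is used for pretraining ViT. Unlike with language, for images richer latent representations are obtained by dropping a larger fraction of the input.⁶

Masking then also drastically reduces the amount of data the encoder has to process, which is why the method is often chosen for pretraining large transformers. In this setting the architecture is typically asymmetric, with a decoder that is lighter than the encoder.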
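The data reduction from masking can be sketched as follows, assuming the image has already been split into a sequence of patch vectors. The function name `random_masking`, the patch count, and the embedding size are illustrative assumptions; the 75% ratio follows the figure quoted in the footnote.

```python
import numpy as np

def random_masking(patches, mask_ratio, rng):
    """Keep a random subset of patches; return the kept patches and their indices."""
    n = patches.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return patches[keep_idx], keep_idx

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))    # e.g. 14 x 14 patches (hypothetical sizes)
visible, keep_idx = random_masking(patches, mask_ratio=0.75, rng=rng)
```

Only the visible quarter of the patches would be passed to the encoder, while the decoder is responsible for predicting the dropped ones.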

In addition, once training is finished the decoder is often removed, and a new decoder tailored to each downstream task is trained and used in its place.

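2 Variational Autoencoders (VAE) (Kingma and Welling, 2014)

Figure: Samples from VQ-VAE-2. Taken from Figure 6 of (Razavi et al., 2019, p. 8).

2.1 Introduction

Like a GAN, a VAE (Variational Auto-Encoder) (Kingma and Welling, 2014), (Rezende et al., 2014) pairs a deep generative model \(p_\theta\) with a second deep neural network \(q_\phi\).

Unlike in a GAN, however, this network \(q_\phi\) does not act as a discriminator. It is a recognition model that tries to reconstruct (an extension of) the data-generating distribution in the form \(q_\phi(x,z)p(z)\) through approximate inference.⁷

In the context of variational Bayes, this scheme is also called amortized inference (Section 2.4). \(q_\phi\) is also called the encoder and \(p\) the prior.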

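2.2 Representation learning by the encoder

In other words, a VAE is an autoencoder whose encoder is designed to perform "inference" (at least in a formal sense). This Bayesian inference is carried out by variational inference, and the innovation is that, thanks to the reparametrization trick, the variational inference of \(q_\phi\) can be run by SGD simultaneously with the training of the decoder \(p_\theta\).

The encoder \(q_\phi\) is assumed to have the form \[ q_\phi(x,z)\,dz=\operatorname{N}\biggl(\mu_\phi(x),\mathrm{diag}(\sigma^2_\phi(x))\biggr), \] and the functional forms of the mean \(\mu_\phi\) and the variance \(\sigma^2_\phi\) are represented by neural networks.

The decoder \(p_\theta(z,x)\), on the other hand, aims to reconstruct the data from this latent representation; once trained, data can be generated in the form \(p(z)p_\theta(z,x)\).

Training simultaneously minimizes the discrepancy between the deep generative model \(p_\theta\) and the data, and fits the approximate inference network \(q_\phi\) for the posterior of the latent variable \(Z\) given the data, both by stochastic gradient descent.

Since the advent of diffusion models, the VAE itself has fallen out of favor as an image generation model, but the encoder \(q_\phi\) is still used as a component in other downstream tasks, such as learning compressed representations of video data in Sora (Brooks et al., 2024) (see also VQ-VAE in Section 3).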

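2.3 Variational Bayesian learning of the decoder

Consider the model \(p_\theta(z)p_\theta(x|z)\) for the generating process \(Z\to X\) of the data \(X\). If this model is given by neural networks, the marginal likelihood \[ p_\theta(x)=\int_\mathcal{Z}p_\theta(z)p_\theta(x|z)\,dz \] is not easy to evaluate.

Recall that the log marginal likelihood can be bounded from below as follows:⁸ \[ \begin{align*} \log p_\theta(x)&=\log\int_\mathcal{Z}p_\theta(x,z)\,dz\\ &=\log\int_\mathcal{Z}q_\phi(z)\frac{p_\theta(x,z)}{q_\phi(z)}\,dz\\ &\ge\int_\mathcal{Z}q_\phi(z)\log\frac{p_\theta(x|z)p_\theta(z)}{q_\phi(z)}\,dz\\ &=-\operatorname{KL}(q_\phi,p_\theta)+\int_\mathcal{Z}q_\phi(z)\log p_\theta(x|z)\,dz\\ &=:F(\theta,\phi) \end{align*} \] Here the inequality is Jensen's inequality, and \(\operatorname{KL}(q_\phi,p_\theta)\) is the KL divergence between \(q_\phi\) and the prior \(p_\theta(z)\).

This \(F\) is called the variational lower bound (ELBO in machine learning). Variational Bayes finds a \(\theta\) that maximizes \(\log p_\theta\) without evaluating it directly, by maximizing \(F\) alternately in \(\theta\) and \(\phi\) (maximizing in \(\phi\) amounts to minimizing the KL divergence between \(q_\phi\) and the posterior \(p_\theta(z|x)\)).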
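As a short supplementary derivation (a standard identity, written in the notation used above rather than quoted from the cited papers), the gap between the log marginal likelihood and the bound can be made explicit: \[ \begin{align*} \log p_\theta(x)-F(\theta,\phi)&=\int_\mathcal{Z}q_\phi(z)\log\frac{q_\phi(z)\,p_\theta(x)}{p_\theta(x,z)}\,dz\\ &=\int_\mathcal{Z}q_\phi(z)\log\frac{q_\phi(z)}{p_\theta(z|x)}\,dz\\ &=\operatorname{KL}\bigl(q_\phi,p_\theta(\cdot|x)\bigr)\ge0. \end{align*} \] Hence, for fixed \(\theta\), maximizing \(F\) over \(\phi\) is the same as minimizing the KL divergence between \(q_\phi\) and the posterior, and the bound is tight exactly when \(q_\phi\) coincides with the posterior.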

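Carrying this out for a general model requires additional assumptions on \(q_\phi\), such as a mean-field approximation, or approximations of the E-step. Here, instead, \(q_\phi\) is a recognition model built from neural networks, and \(p_\theta\) and \(q_\phi\) can be learned simultaneously using an estimator of the gradient \(D_\phi F\) of \(F\).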

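2.4 Amortized inference

Suppose the data \(x_1,\cdots,x_n\) are mutually independent, and introduce the same number of latent variables \(z_1,\cdots,z_n\), also mutually independent. Indeed, in a VAE one takes \(z\sim\operatorname{N}_n(\mu,\Sigma)\) with \(\Sigma\) a diagonal matrix.

The variational lower bound can then be written as \[ F(\theta,\phi)=\sum_{i=1}^n\int_\mathcal{Z}q_\phi(z_i)\log\frac{p_\theta(x_i|z)p_\theta(z_i)}{q_\phi(z_i)}\,dz_i. \tag{1}\] If we further assume \(p_\theta(x_i|z)=p_\theta(x_i|z_i)\), then \(F\) is maximized by the choice \[ q_\phi(z_i)=p(z_i|x_i)=\frac{p(x_i|z_i)p(z_i)}{p(x_i)}. \]

Amortized inference (Gershman and Goodman, 2014), (Rezende et al., 2014) refers to the stochastic variational inference method in which, rather than fitting a separate distribution for each \(i\in[n]\), the probability kernel \(p(x_i,z_i)\,dz_i\) is modeled by a single neural network \(q_\phi\) that does not depend on \(i\in[n]\).⁹

That is, instead of explaining the data separately for each \(i\in[n]\), one seeks a \(q_\phi\) that fits the whole dataset. The name reflects the fact that, once this up-front cost is paid, inference can be updated at a very low marginal cost when new data arrive.¹⁰

In a VAE, \(\theta\) and \(\phi\) are not updated alternately as in the EM algorithm; since both are neural networks, they are optimized simultaneously by SGD. In other words, unlike in the EM algorithm, the objective is not designed to perform the variational inference that truly explains the data best; the goal is generation and representation learning.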

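2.5 Stochastic Gradient Variational Bayes (SGVB)

Writing \(q_\phi(z_i|x_i)\) for \(q_\phi(z_i)\) to make the dependence on the data explicit, equation (1) can be rewritten, data point by data point, as \[ F(\theta,\phi)=\sum_{i=1}^n\left(\int_\mathcal{Z}q_\phi(z_i|x_i)\log p_\theta(x_i|z_i)\,dz_i-\operatorname{KL}\biggl(q_\phi(z_i|x_i),p_\theta(z_i)\biggr)\right). \tag{2}\] Since the prior \(p_\theta(z_i)=\operatorname{N}(0,I_m)\) and the encoder \(q_\phi\) are both Gaussian, the second term can be computed in closed form: \[ \operatorname{KL}\biggl(q_\phi(z_i|x_i),p_\theta(z_i)\biggr)=-\frac{1}{2}\sum_{j=1}^m\biggl(1+\log\sigma^2_j(x_i)-\mu^2_j(x_i)-\sigma^2_j(x_i)\biggr). \]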
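The closed-form KL term is easy to evaluate directly. The following is a minimal NumPy sketch, assuming a standard normal prior and an encoder that outputs the mean and the log-variance of a diagonal Gaussian; the function name `gaussian_kl` is hypothetical.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

kl_zero = gaussian_kl(np.zeros(3), np.zeros(3))            # posterior equals prior
kl_shift = gaussian_kl(np.array([1.0, 0.0]), np.zeros(2))  # unit shift in one coordinate
```

Parameterizing the encoder by the log-variance rather than the variance is a common design choice because it keeps the variance positive without any constraint on the network output.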

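The first term is therefore the problem. Even if the gradients \(D_\phi F,D_\theta F\) themselves cannot be computed, can we obtain unbiased estimators of them?

Moreover, the crude Monte Carlo approximation based on samples \(z_i^{(n)}\) from \(q_\phi(z|x)\), \[ \int_\mathcal{Z}q_\phi(z_i|x_i)\log p_\theta(x_i|z_i)\,dz_i\approx\frac{1}{N}\sum_{n=1}^N\log p_\theta(x_i|z_i^{(n)}), \] leads to estimators with very large variance (Paisley et al., 2012), so the unbiased estimator also needs to be efficient. In addition, although the gradient with respect to \(\theta\) can be computed numerically from this expression, it is difficult to obtain \(D_\phi F\) from it, because the sampling distribution itself depends on \(\phi\).

The SGVB estimator of \(D_\phi F,D_\theta F\) resolves this with an idea borrowed from importance sampling.¹¹ (Kingma and Welling, 2014) call it the reparameterization trick.

Many works also propose replacing this importance sampling step with more efficient schemes such as SIS or AIS (Thin et al., 2021).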

SGVB estimator in a general setting

Suppose we can find a distribution \(P\in\mathcal{P}(E)\) and a map \(g_\phi:E\times\mathcal{X}\to\mathcal{Z}\), diffeomorphic in \(\epsilon\), such that \[ g_\phi(\epsilon,x)\sim q_\phi(x,z)\,dz\quad(\epsilon\sim P). \] Then the importance sampling estimator with proposal \(P\), \[ \begin{align*} \operatorname{E}_{q_\phi}[f(Z)]&=\operatorname{E}_{P}[f(g_\phi(\epsilon,x))]\\ &\simeq\frac{1}{M}\sum_{i=1}^Mf(g_\phi(\epsilon^i,x)), \end{align*} \] reduces the variance of the Monte Carlo estimator. Taking \(f\) to be the integrand of \(F\) yields the SGVB estimator.

Rather than sampling \(z_i\) directly from the encoder \(q_\phi\), one obtains Monte Carlo samples by \[ z_i=\sigma_\phi(x_i)\odot\epsilon+\mu_\phi(x_i),\qquad\epsilon\sim\operatorname{N}_m(0,I_m), \] which separates the sampling from the differentiation with respect to \(\phi\).

In addition, the Monte Carlo variance is reduced compared with the original method.
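The separation of sampling from differentiation can be illustrated with a toy integrand. The sketch below assumes a diagonal Gaussian encoder output and uses the test function \(f(z)=\sum_j z_j^2\), for which the exact gradients of \(\operatorname{E}[f(Z)]\) are \(2\mu\) and \(2\sigma\); the function name `reparameterize` and all constants are illustrative.

```python
import numpy as np

def reparameterize(mu, sigma, rng, n_samples):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); z is a smooth function of (mu, sigma)."""
    eps = rng.normal(size=(n_samples,) + np.shape(mu))
    return mu + sigma * eps, eps

rng = np.random.default_rng(0)
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 2.0])
z, eps = reparameterize(mu, sigma, rng, n_samples=100_000)

# Pathwise gradients of E[f(z)] for f(z) = sum_j z_j^2:
# df/dmu = 2 z and df/dsigma = 2 z * eps, averaged over the noise.
grad_mu = np.mean(2.0 * z, axis=0)            # exact value: 2 * mu
grad_sigma = np.mean(2.0 * z * eps, axis=0)   # exact value: 2 * sigma
```

Because the randomness enters only through `eps`, whose distribution does not depend on the parameters, the chain rule can be applied sample by sample; in a VAE this step would be carried out by automatic differentiation.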

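2.6 Objective function

Putting everything together, the objective to be minimized (the negative of the estimated variational lower bound) is \[ \mathcal{L}=\sum_{i=1}^n\left(\operatorname{KL}\biggl(q_\phi(z_i|x_i),p_\theta(z_i)\biggr)-\frac{1}{N}\sum_{n=1}^N\log p_\theta(x_i|z_i^{(n)})\right). \] In practice \(N=1\) Monte Carlo sample is used; combined with SGD, this setting is reported to be efficient.¹²

Training procedure

Propagate the data forward through the encoder to obtain the values of \(\mu,\sigma\), and sample from the resulting distribution.
Feed this Monte Carlo sample into the decoder and propagate it forward to evaluate (the estimator of) the variational lower bound \(F\).
Compute the gradient of \(\mathcal{L}\) with respect to \(\theta,\phi\) by automatic differentiation.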
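The first two steps of the procedure above can be sketched as a single forward pass. This is a minimal NumPy illustration under loud assumptions: linear encoder and decoder, a Gaussian decoder \(p_\theta(x|z)=\operatorname{N}(\hat{x},I)\), and one Monte Carlo sample; the parameter dictionaries `enc` and `dec` and the function name `negative_elbo` are hypothetical. NumPy has no automatic differentiation, so the third step is only indicated in a comment.

```python
import numpy as np

def negative_elbo(x, enc, dec, rng):
    """One-sample estimate of the loss L for a batch x of shape (batch, d)."""
    # Step 1: encoder forward pass gives mu and log sigma^2, then sample z.
    mu = x @ enc["W_mu"] + enc["b_mu"]
    log_var = x @ enc["W_lv"] + enc["b_lv"]
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    # Step 2: decoder forward pass; p(x|z) = N(x_hat, I) is assumed here.
    x_hat = z @ dec["W"] + dec["b"]
    d = x.shape[1]
    log_lik = -0.5 * np.sum((x - x_hat) ** 2, axis=1) - 0.5 * d * np.log(2 * np.pi)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)
    # Step 3 (gradients in theta, phi) would be done by an autodiff framework.
    return np.sum(kl - log_lik)

rng = np.random.default_rng(0)
d, m = 4, 2
enc = {"W_mu": np.zeros((d, m)), "b_mu": np.zeros(m),
       "W_lv": np.zeros((d, m)), "b_lv": np.zeros(m)}
dec = {"W": np.zeros((m, d)), "b": np.zeros(d)}
x = rng.normal(size=(8, d))
loss = negative_elbo(x, enc, dec, rng)
```

With all weights set to zero the encoder returns the prior, so the KL term vanishes and the loss reduces to the Gaussian negative log-likelihood of the data around zero, which gives a simple sanity check.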

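2.7 ELBO approximation by AIS (Neal, 2001)

2.7.1 Introduction

When the time reversal of a diffusion process is used for the AIS proposal, one obtains an efficient and differentiable AIS estimator of the ELBO (Doucet et al., 2022).

SGM (score-based generative modeling) (Song et al., 2019) can be used to learn the time reversal of the diffusion process.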

2.7.2 The idea of AIS

Annealed importance sampling (AIS) is an importance sampling technique for the situation where a sequence \(\pi_0,\cdots,\pi_p=\pi\) leading to the target distribution \(\pi\) is available and sampling from \(\pi_0\) is possible.

Concretely, importance weights are computed for the target distribution \[ \pi_p\otimes P_p'\otimes\cdots\otimes P_1' \] on the extended space \(\mathcal{X}^{p+1}\), using \(\pi_0\otimes P_1\otimes P_2\otimes\cdots\otimes P_p\) as the proposal. In particular, (Neal, 2001) takes as \(P_i'\) the backward kernel \(P_i^{-1}\) satisfying \[ P_i(x_{i-1},x_i)\pi_{i-1}(x_{i-1})=\pi_i(x_i)P_i^{-1}(x_{i-1},x_i), \] in which case the importance weight has the following expression:¹³ \[ w(X_{1:p}):=\frac{\pi_p(X_p)}{\pi_{p-1}(X_{p})}\frac{\pi_{p-1}(X_{p-1})}{\pi_{p-2}(X_{p-1})}\cdots\frac{\pi_2(X_2)}{\pi_1(X_2)}\frac{\pi_1(X_1)}{\pi_0(X_1)} \]
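The product form of the weight translates directly into a loop that alternates a weight update with an MCMC move. The following is a minimal one-dimensional sketch under stated assumptions: a geometric path between \(\pi_0\) and the target, a random-walk Metropolis kernel for each \(\pi_i\), and a toy Gaussian target whose normalizing constant is known. The function name `ais_log_weights` and all tuning constants are illustrative, not taken from (Neal, 2001).

```python
import numpy as np

def ais_log_weights(log_gamma0, log_gamma1, sample0, betas, n, step, rng):
    """AIS along the geometric path gamma_b = gamma0^(1-b) * gamma1^b."""
    def log_gamma(x, b):
        return (1.0 - b) * log_gamma0(x) + b * log_gamma1(x)

    x = sample0(n, rng)                      # exact draws from pi_0
    log_w = np.zeros(n)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Weight update: ratio gamma_i(x) / gamma_{i-1}(x) at the current state.
        log_w += log_gamma(x, b) - log_gamma(x, b_prev)
        # One random-walk Metropolis step that leaves pi_i invariant.
        prop = x + step * rng.normal(size=n)
        accept = np.log(rng.uniform(size=n)) < log_gamma(prop, b) - log_gamma(x, b)
        x = np.where(accept, prop, x)
    return log_w

rng = np.random.default_rng(0)
log_g0 = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)   # normalized N(0, 1)
log_g1 = lambda x: -0.5 * ((x - 1.0) / 0.5) ** 2            # unnormalized N(1, 0.25)
log_w = ais_log_weights(log_g0, log_g1, lambda n, r: r.normal(size=n),
                        np.linspace(0.0, 1.0, 201), 5000, 1.0, rng)
Z_hat = np.mean(np.exp(log_w))   # estimates the constant sqrt(2 * pi) * 0.5
```

The average of the weights is an unbiased estimate of the ratio of normalizing constants regardless of how well the Metropolis kernel mixes; mixing and the number of intermediate distributions only affect the variance.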

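In other words, AIS performs importance sampling on \(\mathcal{X}^{p+1}\) and, by looking only at the component \(x_p\), realizes importance sampling for \(\pi_p\) as a marginal. The detour through the extended space, however, costs some efficiency (Doucet et al., 2022).

For example, if simulating from the target \(\pi_p\) is in fact possible and one can take \(P_i\equiv\pi_p\), importance sampling can be carried out exactly, whereas AIS, because of its detour, incurs the variance \[ \mathrm{V}[\log w(X_{1:p})]=\sum_{i=1}^p\mathrm{V}\left[\log\frac{\pi_i(X_i)}{\pi_{i-1}(X_i)}\right]>0. \]

Likewise, suppose one takes \(P_i\equiv P\) for an MCMC kernel \(P\) converging to \(\pi_p\). If the ratio \[ \frac{\pi_p(X_p)}{\pi_0P_1\cdots P_p(X_p)} \] could be computed, extremely efficient importance sampling would be possible by taking \(p>0\) large enough, whereas in AIS the variance of \(w\) becomes large.

In fact, the variance-minimizing extended target on \(\mathcal{X}^{p+1}\) is obtained not from the backward kernels \(P_i^{-1}\) but from the time reversal of the proposal \(Q:=\pi_0\otimes P_1\otimes P_2\otimes\cdots\otimes P_p\) (Del Moral et al., 2006).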

2.7.3 Using an inhomogeneous Langevin process

Take a continuous bridge \((\pi_t)_{t\in[0,p]}\) and consider the diffusion \[ dX_t=\nabla\log\pi_t(X_t)\,dt+\sqrt{2}\,dB_t,\qquad X_0\sim\pi_0, \] together with a discretization \[ P_k(x_{k-1},dx_k):=\operatorname{N}_d\biggl(x_{k-1}+\delta\nabla\log\pi_k(x_{k-1}),2\delta I_d\biggr),\qquad\delta>0, \] that has this diffusion as its continuous-time limit. This setting is treated in (Heng et al., 2020), (Wu et al., 2020), (Thin et al., 2021), among others. Results on how far the law of this process at time \(p\) is from \(\operatorname{N}_d(0,I_d)\) have accumulated in the mathematical analysis of simulated annealing (Fournier and Tardif, 2021), (Tang et al., 2024).
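One draw from the kernel \(P_k\) is a single unadjusted Langevin step. The sketch below assumes, purely for illustration, a time-homogeneous target \(\operatorname{N}_d(0,I_d)\) with score \(-x\) instead of a genuine bridge \((\pi_t)\); the function name `langevin_kernel` and the step size are hypothetical choices.

```python
import numpy as np

def langevin_kernel(x, grad_log_pi_k, delta, rng):
    """One draw from P_k(x, .) = N(x + delta * grad log pi_k(x), 2 * delta * I)."""
    noise = rng.normal(size=x.shape)
    return x + delta * grad_log_pi_k(x) + np.sqrt(2.0 * delta) * noise

rng = np.random.default_rng(0)
x = rng.normal(size=(20_000, 2)) * 3.0        # chains started far from the target
for _ in range(2_000):
    # Target N(0, I): the score is grad log pi(y) = -y.
    x = langevin_kernel(x, lambda y: -y, 0.01, rng)
```

For this Gaussian target the chain settles at a variance of \(1/(1-\delta/2)\) rather than exactly 1, which shows the discretization bias that vanishes as \(\delta\to0\).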

Its time reversal was derived by (Haussmann and Pardoux, 1986): \[ d\overline{X}_t=\biggl(-\nabla\log\pi_{T-t}(\overline{X}_t)+2\nabla\log q_{T-t}(\overline{X}_t)\biggr)\,dt+\sqrt{2}\,d\overline{B}_t,\qquad\overline{X}_0\sim q_T, \] where \(q_t\) denotes the marginal distribution of \((X_t)\) and \(T\) the terminal time (here \(T=p\)).

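3 Vector Quantized Variational Autoencoders (VQ-VAE)

The VQ-VAE is a variant of the VAE in which the latent layer is a discrete variable, designed specifically for representation learning. The set of discrete latent codes used here is also called a codebook.

In addition, (van den Oord et al., 2017) propose that, when the posterior \(q_\phi(z|x)\) is not sufficiently close to the prior \(p(z)\), samples should be generated not from the prior but from a distribution over the latents produced by \(q_\phi(z|x)\) that is fitted anew with a model such as PixelCNN.

(van den Oord et al., 2017) used CNNs, but in recent years transformer-based decoders are also often used.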

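3.1 Vector quantization

In general, tasks on complex data such as images, audio, and video involve, as a key step, obtaining a low-dimensional latent representation that captures the underlying structure well, so general-purpose methods for obtaining latent representations of data are highly valuable. This task is called representation learning (Bengio, Courville, et al., 2013).

Image data is a major application of the VAE. There, besides serving as an image generation model through the decoder, compressing data with the encoder is also an important use (Ballé et al., 2017).

By making the latent space discrete, images, which are continuous data, can be discretized. The VQ-VAE connects this to vector quantization. Vector quantization is also used as a component of larger image generation models such as DALL-E (Ramesh et al., 2021).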

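3.2 Posterior collapse

The biggest problem in using a VAE for representation learning is posterior collapse (J. He et al., 2019). This refers to the situation where the decoder is so powerful that it succeeds in generating the data almost by itself, so that optimization finishes before the latent representation is properly organized and the latent representation degenerates.

The VQ-VAE is claimed to solve this by making the latent representation discrete, matching the reconstruction quality of a VAE with continuous latent variables while also learning a good latent representation.

Indeed, the argument of (van den Oord et al., 2017) is quite persuasive: if one accepts that language is discrete, then, just as humans can grasp the gist of an image or a video through verbal description, discrete variables should suffice as effective latent representations of images and videos.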

3.3 VAEs for representation learning

3.3.1 \(\beta\)-VAE (Higgins et al., 2017)

A new hyperparameter \(\beta>0\) is introduced in front of the KL divergence term of the variational lower bound (2): \[ \int_\mathcal{Z}q_\phi(z_i)\log p_\theta(x_i|z_i)\,dz_i-\beta\operatorname{KL}(q_\phi,p_\theta). \] Formally, \(\beta=0\) corresponds to a deterministic AE and \(\beta=1\) to the original VAE.

Gradually raising \(\beta\) from \(0\) to \(1\) along a suitable schedule helps prevent posterior collapse. This is called KL annealing (Bowman et al., 2016).
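A schedule of this kind can be as simple as a linear ramp. The following is a minimal sketch, assuming a linear warm-up over a fixed number of training steps; the function names `kl_weight` and `beta_vae_loss` and the linear shape are illustrative assumptions, not the specific schedule of (Bowman et al., 2016).

```python
def kl_weight(step, warmup_steps):
    """Linear KL annealing: beta rises from 0 to 1 over `warmup_steps` steps."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))

def beta_vae_loss(recon_log_lik, kl, beta):
    """Negative beta-weighted lower bound for one data point."""
    return -(recon_log_lik - beta * kl)

# Example: halfway through a 100-step warm-up the KL term has weight 0.5.
loss = beta_vae_loss(recon_log_lik=-3.0, kl=2.0, beta=kl_weight(50, 100))
```

Early in training the model is thus driven mostly by reconstruction, which gives the latent code a chance to become informative before the KL term starts pulling the encoder toward the prior.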

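In general \(\beta\) controls how strongly the latent representation is compressed: \(\beta<1\) favors image reconstruction, while \(\beta>1\) favors data compression (Higgins et al., 2017).

In particular, the \(\beta\)-VAE is regarded as good at disentangling the latent representation of the data, which gives it important applications in representation learning (Locatello et al., 2019).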

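3.3.2 Variational Lossy Autoencoder (Chen et al., 2017)

This work makes the decoder \(p(x|z)\) and the prior \(p(z)\) autoregressive models and uses the VAE scheme purely to train the encoder \(q(z|x)\).

It argues that controlling the predictive power of the autoregressive model makes it possible to control what kind of latent representation is learned (Chen et al., 2017).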

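3.4 VQ-VAE

VQ-VAE (van den Oord et al., 2017) and VQ-VAE-2 (Razavi et al., 2019) apply vector quantization to the intermediate representation of an autoencoder, compressing image data in a way similar to JPEG (Wallace, 1992) and thereby avoiding the modeling of unnecessary information.

That is, the encoder output \(z\in\mathbb{R}^{H\times W\times L}\) is finally compared against the codebook \(\{e_k\}_{k=1}^K\subset\mathbb{R}^L\), and only the index \(k\in[K]\) of the nearest codebook entry is recorded at each position. The decoder receives only elements of the codebook \(\{e_k\}_{k=1}^K\) as input. Another advantage is that training can thus operate on a representation less than one thirtieth the size of the original data.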
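The nearest-neighbor lookup described above can be sketched in a few lines. This is a minimal NumPy illustration assuming squared Euclidean distance and a tiny hand-written codebook; the function name `vector_quantize` and the example values are hypothetical.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each L-dim vector in z (H, W, L) to its nearest codebook entry (K, L)."""
    flat = z.reshape(-1, z.shape[-1])                               # (H*W, L)
    # Squared Euclidean distance between every vector and every code.
    d2 = np.sum((flat[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    idx = np.argmin(d2, axis=1)                                     # recorded codes k
    z_q = codebook[idx].reshape(z.shape)                            # decoder input
    return idx.reshape(z.shape[:-1]), z_q

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])        # K = 2 codes in R^2
z = np.array([[[0.1, -0.1], [0.9, 1.2]]])            # H = 1, W = 2, L = 2
idx, z_q = vector_quantize(z, codebook)
```

Only the integer array `idx` needs to be stored or modeled by a prior, while the decoder sees the corresponding codebook vectors `z_q`.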

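The codebook is learned jointly, and terms for this purpose are added to the objective.

One technical difficulty is that the discretization step in the middle of the network makes gradients hard to compute; this is resolved by using the straight-through estimator (Bengio, Léonard, et al., 2013).

GANs tend to ignore low-likelihood parts of the original data and thus lose sample diversity; the VQ-VAE resolves this problem. Moreover, several metrics for model evaluation that are unavailable for GANs have been proposed.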

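3.5 Continuous relaxation

In the VQ-VAE the assignment to the codebook is hard: every output selects exactly one entry \(e_k\), and only \(k\) is recorded. One can instead consider a soft assignment that allows continuous representations.¹⁴

A further advantage is that, although the original reparametrization trick (Section 2.5) does not immediately generalize to discrete variables, new methods have been found that keep gradient-based optimization possible.

For sampling categorical variables, it is effective to use the Gumbel distribution as the proposal distribution in place of the standard normal distribution \(\operatorname{N}(0,1)\); this reparametrization trick is called the Gumbel-Max trick (Chris J. Maddison et al., 2014), (Jang et al., 2017).

Concrete (Continuous Relaxation of Discrete) (Chris J. Maddison et al., 2017) extends this to a continuous distribution and applies it to the reparametrization trick.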
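Both the discrete trick and its relaxation fit in a short sketch. The code below assumes NumPy's `Generator.gumbel` for the noise and a softmax with a temperature parameter for the relaxation; the function names and the example probabilities are illustrative.

```python
import numpy as np

def gumbel_max_sample(log_probs, rng, n):
    """Draw n categorical samples as argmax_k (log p_k + G_k), G_k ~ Gumbel(0, 1)."""
    g = rng.gumbel(size=(n, log_probs.shape[0]))
    return np.argmax(log_probs + g, axis=1)

def gumbel_softmax_sample(log_probs, temperature, rng):
    """Continuous relaxation: softmax((log p + G) / temperature)."""
    g = rng.gumbel(size=log_probs.shape)
    y = (log_probs + g) / temperature
    y = np.exp(y - np.max(y))        # subtract the max for numerical stability
    return y / np.sum(y)

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])
samples = gumbel_max_sample(np.log(p), rng, n=100_000)
freq = np.bincount(samples, minlength=3) / samples.size   # empirical frequencies
soft = gumbel_softmax_sample(np.log(p), temperature=0.5, rng=rng)
```

The argmax is not differentiable, whereas the softmax output is a smooth function of `log_probs`; lowering the temperature moves the relaxed sample closer to a one-hot vector at the cost of noisier gradients.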

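These techniques are applied not only to VAEs but also to the training of DALL-E (Ramesh et al., 2021).

3.6 VQ-VAE-2 (Razavi et al., 2019)

VQ-VAE-2 extends the VQ-VAE by giving the latent space a hierarchical structure, repeating the encoding and decoding two or more times each.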

3.7 Codebook collapse

The VQ-VAE suffers from redundancy in the codebook: part of the codebook ends up unused. To address this, softening the assignment to the codebook with the softmax function has been proposed as the dVAE (Ramesh et al., 2021).

However, the dVAE is not completely free from codebook collapse either. This is thought to stem from properties of the softmax function, and indeed it is mitigated by a Bayesian model that introduces a Dirichlet prior (Baykal et al., 2023).

Techniques of this kind are called evidential deep learning (EDL) (Sensoy et al., 2018), (Amini et al., 2020).¹⁵

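3.8 Comparison with GANs

A known weakness of VAEs is that the images they generate have lower resolution than those of GANs.

3.8.1 Wasserstein VAE (Tolstikhin et al., 2018)

The Wasserstein Auto-encoder (Tolstikhin et al., 2018) claims that this can be resolved by reformulating the objective in terms of the Wasserstein distance.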

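3.8.2 VQ-GAN (Esser et al., 2021)

On the other hand, taking the view that the use of an \(L^2\) loss in the objective is itself the weakness, VQ-GAN was proposed, which transplants the idea of vector quantization into GANs.

In VQ-GAN a transformer is used to learn the prior on the latent space. The follow-up work, which performs generation in the latent space of a VAE, is the latent diffusion model (Rombach et al., 2022), on which Stable Diffusion is based.

Meanwhile, VIM (Vector-quantized Image Modeling) (Yu et al., 2022), which is neither a VAE nor a GAN, reports that making both the encoder and the decoder transformers yields further gains in accuracy.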

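4 Bibliographic notes

A detailed account of deterministic autoencoders is given in Section 19.1 of (Bishop and Bishop, 2024).

Experiments comparing AE and VAE can be found in the Jupyter Notebook accompanying (Murphy, 2023).

For a simple implementation of a VAE, see also the following post:

VAE: Variational Autoencoders (a hands-on with PyTorch)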

References

Amini, A., Schwarting, W., Soleimany, A., and Rus, D. (2020). Deep evidential regression.

Baldi, P., and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53–58.

Ballé, J., Laparra, V., and Simoncelli, E. P. (2017). End-to-end optimized image compression. In International conference on learning representations.

Baykal, G., Kandemir, M., and Unal, G. (2023). EdVAE: Mitigating codebook collapse with evidential discrete variational autoencoders.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8), 1798–1828.

Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation.

Beyeler, M., Rounds, E. L., Carlson, K. D., Dutt, N., and Krichmar, J. L. (2019). Neural correlates of sparse coding and dimensionality reduction. PLOS Computational Biology, 15(6), 1–33.

Bishop, C. M., and Bishop, H. (2024). Deep learning: Foundations and concepts. Springer Cham.

Bourlard, H., and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4), 291–294.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. (2016). Generating sentences from a continuous space. In S. Riezler and Y. Goldberg, editors, Proceedings of the 20th SIGNLL conference on computational natural language learning, pages 10–21. Berlin, Germany: Association for Computational Linguistics.

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., … Ramesh, A. (2024). Video generation models as world simulators. OpenAI. Retrieved from https://openai.com/research/video-generation-models-as-world-simulators

Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., … Abbeel, P. (2017). Variational Lossy Autoencoder. In International conference on learning representations.

Cottrell, G. W., and Munro, P. (1988). Principal component analysis of images via back propagation. In Proceedings of SPIE visual communications and image processing, Vol. 1001, pages 1070–1076.

Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo Samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3), 411–436.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, Vol. 1, pages 4171–4186.

Doucet, A., Grathwohl, W. S., Matthews, A. G. D. G., and Strathmann, H. (2022). Score-based diffusion meets annealed importance sampling. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in neural information processing systems.

Esser, P., Rombach, R., and Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 12873–12883.

Fournier, N., and Tardif, C. (2021). On the simulated annealing in \(\mathbb{R}^d\). Journal of Functional Analysis, 281(5), 109086.

Gershman, S., and Goodman, N. (2014). Amortized Inference in Probabilistic Reasoning. In Proceedings of the annual meeting of the cognitive science society, Vol. 36.

Geyer, C. (1996). Markov chain monte carlo in practice. In W. R. Gilks, S. Richardson, and D. Spiegelhalter, editors, pages 241–258. Chapman; Hall.

Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In G. Gordon, D. Dunson, and M. Dudík, editors, Proceedings of the fourteenth international conference on artificial intelligence and statistics, Vol. 15, pages 315–323. Fort Lauderdale, FL, USA: PMLR.

Habermann, D., Schmitt, M., Kühmichel, L., Bulling, A., Radev, S. T., and Bürkner, P.-C. (2024). Amortized bayesian multilevel models.

Haussmann, U. G., and Pardoux, E. (1986). Time Reversal of Diffusions. The Annals of Probability, 14(4), 1188–1205.

He, J., Spokoyny, D., Neubig, G., and Berg-Kirkpatrick, T. (2019). Lagging inference networks and posterior collapse in variational autoencoders. In International conference on learning representations.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 16000–16009.

Heng, J., Bishop, A. N., Deligiannidis, G., and Doucet, A. (2020). Controlled sequential monte carlo. The Annals of Statistics, 48(5), 2904–2929.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., … Lerchner, A. (2017). Beta-VAE: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations.

Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with gumbel-softmax. In International conference on learning representations.

Japkowicz, N., Hanson, S. J., and Gluck, M. A. (2000). Nonlinear autoassociation is not equivalent to PCA. Neural Computation, 12(3), 531–545.

Karhunen, J., and Joutsensalo, J. (1995). Generalizations of principal component analysis, optimization problems, and neural networks. Neural Networks, 8(4), 549–562.

Kingma, D. P., and Welling, M. (2014). Auto-encoding variational bayes. In International conference on learning representations, Vol. 2.

Kingma, D. P., and Welling, M. (2019). An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4), 307–392.

Lawrence, N. (2005). Probabilistic non-linear principal component analysis with gaussian process latent variable models. Journal of Machine Learning Research, 6(60), 1783–1816.

Locatello, F., Bauer, S., Lučić, M., Rätsch, G., Gelly, S., Schölkopf, B., and Bachem, O. F. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In International conference on machine learning.

Maddison, Chris J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. In International conference on learning representations.

Maddison, Chris J., Tarlow, D., and Minka, T. (2014). A* sampling. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in neural information processing systems, Vol. 27. Curran Associates, Inc.

Murphy, K. P. (2023). Probabilistic machine learning: Advanced topics. MIT Press.

Neal, R. M. (2001). Annealed Importance Sampling. Statistics and Computing, 11, 125–139.

Paisley, J., Blei, D. M., and Jordan, M. I. (2012). Variational bayesian inference with stochastic search. In Proceedings of the 29th international conference on machine learning, pages 1363–1370.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., … Sutskever, I. (2021). Zero-shot text-to-image generation. In M. Meila and T. Zhang, editors, Proceedings of the 38th international conference on machine learning, Vol. 139, pages 8821–8831. PMLR.

Razavi, A., van den Oord, A., and Vinyals, O. (2019). Generating diverse high-fidelity images with VQ-VAE-2. In Advances in neural information processing systems, Vol. 32.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Approximate inference in deep generative models. In Proceedings of the 31st international conference on machine learning, Vol. 32, pages 1278–1286.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th international conference on international conference on machine learning, pages 833–840. Madison, WI, USA: Omnipress.

Robert, C. P., and Casella, G. (2004). Monte carlo statistical methods. Springer New York.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 10684–10695.

Sensoy, M., Kaplan, L., and Kandemir, M. (2018). Evidential deep learning to quantify classification uncertainty. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in neural information processing systems, Vol. 31. Curran Associates, Inc.

Song, Y., Garg, S., Shi, J., and Ermon, S. (2019). Sliced Score Matching: A Scalable Approach to Density and Score Estimation. In Proceedings of the 35th conference on uncertainty in artificial intelligence.

Tang, W., Wu, Y., and Zhou, X. (2024). Discrete-Time Simulated Annealing: A Convergence Analysis via the Eyring-Kramers Law. Numerical Algebra, Control and Optimization.

Thin, A., Kotelevskii, N., Doucet, A., Durmus, A., Moulines, E., and Panov, M. (2021). Monte carlo variational auto-encoders. In M. Meila and T. Zhang, editors, Proceedings of the 38th international conference on machine learning, Vol. 139, pages 10247–10257. PMLR.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. (2018). Wasserstein auto-encoders. In International conference on learning representations.

van den Oord, A., Vinyals, O., and Kavukcuoglu, K. (2017). Neural discrete representation learning. In Advances in neural information processing systems, Vol. 30.

Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7), 1661–1674.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on machine learning, pages 1096–1103. New York, NY, USA: Association for Computing Machinery.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(110), 3371–3408.

Wallace, G. K. (1992). The JPEG still picture compression standard. In IEEE transactions on consumer electronics, Vol. 38, page 1.

Wu, H., Köhler, J., and Noe, F. (2020). Stochastic Normalizing Flows. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in neural information processing systems, Vol. 33, pages 5933–5944. Curran Associates, Inc.

Yu, J., Li, X., Koh, J. Y., Zhang, H., Pang, R., Qin, J., … Wu, Y. (2022). Vector-quantized image modeling with improved VQGAN. In International conference on learning representations.

Footnotes

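¹ Also called an auto-associative NN (Bishop and Bishop, 2024, p. 563).
² A VAE models both parameters \(\operatorname{E}[Z|X],\mathrm{V}[Z|X]\) of the Gaussian family, whereas an AE models only the former.
³ The model with a coefficient \(\beta\) on this term is called the \(\beta\)-VAE; with \(\beta=0.5\), for example, it has a character intermediate between an AE and a VAE.
⁴ See also (Murphy, 2023, p. 681), Section 20.3.4.
⁵ See also (Murphy, 2023, p. 681), Section 20.3.2.
⁶ BERT masks 15% of the text, whereas for ViT close to 75% is reportedly masked (Bishop and Bishop, 2024, p. 568).
⁷ Following the terminology of (Kingma and Welling, 2019, p. 321).
⁸ See also the post on variational Bayes.
⁹ See also (Habermann et al., 2024) and (Murphy, 2023, p. 438), Section 10.1.5.
¹⁰ (Habermann et al., 2024) discuss the estimation of Bayesian hierarchical models and argue that this feature is an advantage over MCMC.
¹¹ In the optimization context, it has long been proposed that when the objective is hard to evaluate and is replaced by a Monte Carlo estimator, importance sampling works well (Geyer, 1996). See also (Robert and Casella, 2004, p. 203).
¹² See also (Bishop and Bishop, 2024, p. 576).
¹³ Here \(P_i^{-1}\) is the probability kernel determined by \[ P_i(x_{i-1},x_i)\pi_{i-1}(x_{i-1})=\pi_i(x_i)P_i^{-1}(x_{i-1},x_i). \] For the \(\otimes\) notation, see also the linked post.
¹⁴ This resembles the soft and hard variants of \(k\)-means clustering.
¹⁵ There are also articles by Present Square and GIGAZINE.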