ELBO
Published:
Introduction
Consider two thermodynamic systems: A, described by state variables \(\{\mathbf{x}\}\) and parameters \(\{\boldsymbol{\theta}\}\), and B, described by state variables \(\{\mathbf{z}\}\) and parameters \(\{\boldsymbol{\phi}\}\). We assume that the state of system B determines the state of system A. (For example, \(\mathbf{x}\) could represent observed neural activity and \(\mathbf{z}\) hidden neural activity.) The probability of a state \(\{\mathbf{z}\}\) given a state \(\{\mathbf{x}\}\) is then \[p_{\theta}(\mathbf{z|x})=\frac{e^{-\beta E(\mathbf{z|x})}}{\mathcal{Z}}\] Here, \(\beta=(k_B T)^{-1}\) and \(\mathcal{Z}\) is the partition function, \[\begin{aligned} \mathcal{Z}=\int e^{-\beta E(\mathbf{z|x})}\, d\mathbf{z}\end{aligned}\]
We observe the state of system A and want to find the parameters that describe systems A and B. The general idea is to minimize the Helmholtz free energy of system B given the observations of system A. The free energy is \[F=U-TS\] The entropy of system B is \[S=-k_B \int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(z|x)) \, dz\]
Solving the Boltzmann distribution for the energy gives \(E(z|x)=-\frac{1}{\beta}\left[\mathrm{ln}(p_{\theta}(z|x))+\mathrm{ln}(\mathcal{Z})\right]\), so the internal energy of system B is \[\begin{aligned} U&=\int p_{\theta}(z|x) E(z|x) \, dz \\ &=-\frac{1}{\beta}\int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(z|x)) \, dz-\frac{\mathrm{ln}(\mathcal{Z})}{\beta}\\ &=-\frac{1}{\beta}\int p_{\theta}(z|x)\mathrm{ln}\left[\frac{p_{\theta}(x,z)}{p_{\theta}(x)}\right] \, dz-\frac{\mathrm{ln}(\mathcal{Z})}{\beta}\\ &=-\frac{1}{\beta}\int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(x,z)) \, dz+\frac{\mathrm{ln}(p_{\theta}(x))-\mathrm{ln}(\mathcal{Z})}{\beta}\end{aligned}\] Identifying the Boltzmann weight \(e^{-\beta E(z|x)}\) with the joint \(p_{\theta}(x,z)\) gives \(\mathcal{Z}=\int p_{\theta}(x,z)\, dz=p_{\theta}(x)\), so the second term vanishes.
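As a sanity check on the expression for \(U\) derived above, here is a minimal Python sketch, assuming a discrete hidden state so that the integrals become sums. The joint probabilities in `p_xz` are made-up illustrative numbers. The choice of energy \(E=-\mathrm{ln}(p_{\theta}(x,z))/\beta\) is an assumption, made so that \(\mathcal{Z}=p_{\theta}(x)\) as stated in the text.

```python
import math

# Hypothetical toy joint p(x, z) for one fixed observation x and a
# discrete hidden state z in {0, 1, 2}. The numbers are illustrative only.
p_xz = [0.10, 0.25, 0.05]
beta = 1.0  # inverse temperature

# Assumption: the energy is chosen so that exp(-beta * E) equals p(x, z).
E = [-math.log(p) / beta for p in p_xz]

# Partition function; a sum over z replaces the integral.
Z = sum(math.exp(-beta * e) for e in E)

# Boltzmann distribution, which plays the role of the posterior p(z | x).
post = [math.exp(-beta * e) / Z for e in E]

# Internal energy, computed directly and via the derived identity
# U = -(1/beta) * sum(p ln p) - ln(Z) / beta.
U_direct = sum(p * e for p, e in zip(post, E))
U_identity = (-sum(p * math.log(p) for p in post) - math.log(Z)) / beta

# Entropy in units of k_B, and free energy F = U - T S with k_B T = 1 / beta.
S_over_kB = -sum(p * math.log(p) for p in post)
F = U_direct - S_over_kB / beta

print(Z, U_direct, U_identity, F)
```

In this toy setting the two expressions for \(U\) agree up to floating-point error, \(\mathcal{Z}\) equals the marginal \(p_{\theta}(x)=0.40\), and \(F=-\mathrm{ln}(\mathcal{Z})/\beta\).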
Assuming \(\beta=1\), we find \[\begin{aligned} F(x)=-\int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(x,z))\, dz+\int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(z|x)) \, dz\end{aligned}\] Since the posterior \(p_{\theta}(z|x)\) is intractable, we approximate it by \(q_{\phi}(z|x)\) to obtain the variational Helmholtz free energy, \[F[q_{\phi}(z|x)]=-\int q_{\phi}(z|x)\mathrm{ln}(p_{\theta}(x,z))\, dz+\int q_{\phi}(z|x)\mathrm{ln}(q_{\phi}(z|x)) \, dz\] We vary \(q_{\phi}(z|x)\) to find the minimum of this functional. Because \(F[q_{\phi}(z|x)]=-\mathrm{ln}(p_{\theta}(x))+D_{KL}(q_{\phi}(z|x)||p_{\theta}(z|x))\geq-\mathrm{ln}(p_{\theta}(x))\), its negative is a lower bound on the log evidence \(\mathrm{ln}(p_{\theta}(x))\) and is called the evidence lower bound (ELBO). We thus have \[\begin{aligned} \mathrm{ELBO}&=\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(p_{\theta}(x,z))]-\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(q_{\phi}(z|x))]\\ &=\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(p_{\theta}(z)p_{\theta}(x|z))]-\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(q_{\phi}(z|x))]\\ &=\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(p_{\theta}(x|z))]+\mathbb{E}_{q_{\phi}(z|x)}\left[\mathrm{ln}\frac{p_{\theta}(z)}{q_{\phi}(z|x)}\right]\\ &=-D_{KL}(q_{\phi}(z|x)||p_{\theta}(z))+\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(p_{\theta}(x|z))]\end{aligned}\]
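The two forms of the ELBO above, and the fact that it never exceeds \(\mathrm{ln}(p_{\theta}(x))\), can be checked on a small discrete example. The following is a sketch under stated assumptions, not a reference implementation: the hidden state is discrete, and the prior, likelihood, and trial distribution `q_guess` are hypothetical numbers chosen for illustration.

```python
import math

def elbo(q, p_xz):
    """ELBO = E_q[ln p(x, z)] - E_q[ln q(z|x)] for a discrete z."""
    return sum(qi * (math.log(pi) - math.log(qi))
               for qi, pi in zip(q, p_xz) if qi > 0)

def elbo_kl_form(q, prior, lik):
    """ELBO = -KL(q(z|x) || p(z)) + E_q[ln p(x|z)] for a discrete z."""
    kl = sum(qi * math.log(qi / pz) for qi, pz in zip(q, prior) if qi > 0)
    rec = sum(qi * math.log(li) for qi, li in zip(q, lik) if qi > 0)
    return -kl + rec

# Hypothetical model for one fixed observation x and z in {0, 1, 2}.
prior = [0.5, 0.3, 0.2]   # p(z)
lik = [0.2, 0.6, 0.1]     # p(x | z) evaluated at the observed x
p_xz = [pz * li for pz, li in zip(prior, lik)]   # joint p(x, z)

evidence = sum(p_xz)                     # p(x)
log_evidence = math.log(evidence)        # ln p(x)
posterior = [p / evidence for p in p_xz] # exact p(z | x)

q_guess = [0.4, 0.4, 0.2]  # an arbitrary approximate posterior

print(elbo(q_guess, p_xz), elbo_kl_form(q_guess, prior, lik), log_evidence)
print(elbo(posterior, p_xz))
```

For any valid `q_guess` the first two printed values coincide and lie below the third. The bound becomes tight when the approximate posterior equals the exact one, which is why maximizing the ELBO over \(q_{\phi}\) drives it toward \(p_{\theta}(z|x)\).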
We assume \(q_{\phi}(\mathbf{z|x}) = \mathcal{N}(\boldsymbol{\mu},\mathrm{diag}(\boldsymbol{\sigma}^2))\) and \(p_{\theta}(\mathbf{z}) = \mathcal{N}(\mathbf{0},\mathbf{I})\), so the KL divergence in the first term becomes \[\begin{aligned} D_{KL}(q_{\phi}(z|x)||p_{\theta}(z))&=\frac{1}{2}\sum_{j}[\mu_j^2+\sigma^2_j-\mathrm{ln}(\sigma^2_j)-1]\end{aligned}\] We further assume that \(p_{\theta}(x|z)\) is Bernoulli or Gaussian, depending on the type of data. For the Gaussian case, \(p_{\theta}(x|z) = \mathcal{N}(\boldsymbol{\mu},\sigma^2 \mathbf{I})\), where \(\boldsymbol{\mu}\) and \(\sigma^2\) now denote the mean and variance of the likelihood (distinct from the parameters of \(q_{\phi}\) above). We get \[\begin{aligned} \mathrm{ln}[p_{\theta}(x|z)]=-\sum_i\left[\frac{(x_i-\mu_i)^2}{2\sigma^2}+\frac{\mathrm{ln}[2\pi\sigma^2]}{2}\right]\end{aligned}\] For the Bernoulli case with success probabilities \(y_i\), \[\begin{aligned} \mathrm{ln}[p_{\theta}(x|z)]=\sum_i \left[x_i\mathrm{ln}[y_i]+(1-x_i)\mathrm{ln}[1-y_i]\right]\end{aligned}\] In this case the second term of the ELBO is the negative of the binary cross-entropy between the reconstruction \(\mathbf{\hat{x}}\) (whose components are the \(y_i\)) and \(\mathbf{x}\).
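The closed-form terms above translate directly into code. The following is a minimal pure-Python sketch. The function names and the clipping constant `eps` are illustrative assumptions, not part of the original derivation, and `negative_elbo_bernoulli` assumes the reconstruction `y` was produced from a single sample \(z\sim q_{\phi}(z|x)\).

```python
import math

def kl_to_standard_normal(mu, sigma2):
    """KL(N(mu, diag(sigma2)) || N(0, I)) = 0.5 * sum(mu^2 + s2 - ln s2 - 1)."""
    return 0.5 * sum(m * m + s2 - math.log(s2) - 1.0
                     for m, s2 in zip(mu, sigma2))

def gaussian_log_likelihood(x, mean, var):
    """ln p(x|z) for an isotropic Gaussian likelihood N(mean, var * I)."""
    return sum(-(xi - mi) ** 2 / (2.0 * var) - 0.5 * math.log(2.0 * math.pi * var)
               for xi, mi in zip(x, mean))

def bernoulli_log_likelihood(x, y, eps=1e-12):
    """ln p(x|z) for a Bernoulli likelihood; equals minus the binary cross-entropy.

    y is clipped to [eps, 1 - eps] to avoid log(0) (an implementation choice).
    """
    total = 0.0
    for xi, yi in zip(x, y):
        yi = min(max(yi, eps), 1.0 - eps)
        total += xi * math.log(yi) + (1.0 - xi) * math.log(1.0 - yi)
    return total

def negative_elbo_bernoulli(x, y, mu, sigma2):
    """Single-sample estimate of -ELBO: KL term plus binary cross-entropy."""
    return kl_to_standard_normal(mu, sigma2) - bernoulli_log_likelihood(x, y)

# Hypothetical example values.
x = [1.0, 0.0, 1.0]        # binary data
y = [0.9, 0.2, 0.7]        # reconstruction probabilities
mu = [0.5, -0.3]           # mean of q(z|x)
sigma2 = [0.8, 1.2]        # variances of q(z|x)

print(negative_elbo_bernoulli(x, y, mu, sigma2))
```

Minimizing this quantity over the parameters of both distributions is equivalent to maximizing the ELBO. The KL term vanishes when `mu` is zero and `sigma2` is one in every component, that is, when \(q_{\phi}(z|x)\) coincides with the prior.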