ELBO


Published: August 10, 2021

Introduction

Consider two thermodynamic systems: system A, described by state variables \(\{\mathbf{x}\}\) and parameters \(\{\boldsymbol{\theta}\}\), and system B, described by state variables \(\{\mathbf{z}\}\) and parameters \(\{\boldsymbol{\phi}\}\). We assume that the state of system B determines the state of system A. (For example, \(\mathbf{x}\) could represent observed neural activity and \(\mathbf{z}\) hidden neural activity.) The probability of a state \(\{\mathbf{z}\}\) given a state \(\{\mathbf{x}\}\) is then \[p_{\theta}(\mathbf{z|x})=\frac{e^{-\beta E(\mathbf{z|x})}}{\mathcal{Z}}\] Here \(\beta=(k_B T)^{-1}\), and \(\mathcal{Z}\) is the partition function: \[\begin{aligned} \mathcal{Z}=\int e^{-\beta E(\mathbf{z|x})}\, d\mathbf{z}\end{aligned}\]

Now suppose we observe the state of system A and want to find the parameters that describe systems A and B. The general idea is to minimize the Helmholtz free energy of system B given the observations of system A. The free energy is \[F=U-TS\] The entropy of system B is \[S=-k_B \int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(z|x)) \, dz\] Using \(E(z|x)=-\frac{1}{\beta}\left[\mathrm{ln}(p_{\theta}(z|x))+\mathrm{ln}(\mathcal{Z})\right]\), which follows from the Boltzmann distribution above, the internal energy of system B is \[\begin{aligned} U&=\int p_{\theta}(z|x) E(z|x) \, dz \\ &=-\frac{1}{\beta}\int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(z|x)) \, dz-\frac{\mathrm{ln}(\mathcal{Z})}{\beta}\\ &=-\frac{1}{\beta}\int p_{\theta}(z|x)\mathrm{ln}\left[\frac{p_{\theta}(x,z)}{p_{\theta}(x)}\right] \, dz-\frac{\mathrm{ln}(\mathcal{Z})}{\beta}\\ &=-\frac{1}{\beta}\int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(x,z)) \, dz+\frac{\mathrm{ln}(p_{\theta}(x))-\mathrm{ln}(\mathcal{Z})}{\beta}\end{aligned}\] If we identify the Boltzmann factor \(e^{-\beta E(z|x)}\) with the joint distribution \(p_{\theta}(x,z)\), then \(\mathcal{Z}=\int p_{\theta}(x,z)\,dz=p_{\theta}(x)\), and the second term vanishes.
Assuming \(\beta=1\) and substituting \(U\) and \(S\) into \(F=U-TS\), we find \[\begin{aligned} F(x)=-\int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(x,z))\, dz+\int p_{\theta}(z|x)\mathrm{ln}(p_{\theta}(z|x)) \, dz\end{aligned}\] Since the posterior \(p_{\theta}(z|x)\) is intractable, we approximate it by \(q_{\phi}(z|x)\) to get the variational Helmholtz free energy, \[F[q_{\phi}(z|x)]=-\int q_{\phi}(z|x)\mathrm{ln}(p_{\theta}(x,z))\, dz+\int q_{\phi}(z|x)\mathrm{ln}(q_{\phi}(z|x)) \, dz\] We vary \(q_{\phi}(z|x)\) to find the minimum of this functional. Its negative is called the evidence lower bound (ELBO), so minimizing the variational free energy is the same as maximizing the ELBO. We thus have \[\begin{aligned} \mathrm{ELBO}&=\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(p_{\theta}(x,z))]-\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(q_{\phi}(z|x))]\\ &=\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(p_{\theta}(z)p_{\theta}(x|z))]-\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(q_{\phi}(z|x))]\\ &=\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(p_{\theta}(x|z))]+\mathbb{E}_{q_{\phi}(z|x)}\left[\mathrm{ln}\frac{p_{\theta}(z)}{q_{\phi}(z|x)}\right]\\ &=-D_{KL}(q_{\phi}(z|x)||p_{\theta}(z))+\mathbb{E}_{q_{\phi}(z|x)}[\mathrm{ln}(p_{\theta}(x|z))]\end{aligned}\]
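
The relations above can be checked numerically on a small discrete model, where the integrals become sums. The sketch below is only an illustration under stated assumptions: the three-state latent variable, the numbers in `prior` and `likelihood`, and the trial distribution `q_guess` are all hypothetical and not taken from any real system. It confirms that the free energy at the exact posterior equals \(-\mathrm{ln}(p_{\theta}(x))\), that an arbitrary \(q\) gives a smaller ELBO, and that the two forms of the ELBO agree.

```python
import math

# Hypothetical toy model: z takes 3 discrete values, x is one fixed observation.
prior = [0.5, 0.3, 0.2]           # p(z), assumed values
likelihood = [0.10, 0.60, 0.30]   # p(x|z) for the observed x, assumed values

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                                   # p(x) = sum_z p(x, z)
posterior = [j / evidence for j in joint]               # exact p(z|x)


def free_energy(q):
    """Variational free energy F[q] = -sum_z q ln p(x,z) + sum_z q ln q (beta = 1)."""
    return sum(qz * (math.log(qz) - math.log(j)) for qz, j in zip(q, joint))


def elbo_kl_form(q):
    """ELBO written as -KL(q || p(z)) + E_q[ln p(x|z)]."""
    kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior))
    reconstruction = sum(qz * math.log(px) for qz, px in zip(q, likelihood))
    return reconstruction - kl


q_guess = [0.2, 0.5, 0.3]  # an arbitrary approximate posterior (hypothetical)

log_evidence = math.log(evidence)
elbo_exact = -free_energy(posterior)  # ELBO at the exact posterior
elbo_guess = -free_energy(q_guess)    # ELBO at the approximate posterior

print("ln p(x)            :", log_evidence)
print("ELBO, exact q      :", elbo_exact)
print("ELBO, guessed q    :", elbo_guess)
print("ELBO, KL form      :", elbo_kl_form(q_guess))
```

In this toy setting the gap between `log_evidence` and `elbo_guess` is the KL divergence between `q_guess` and the exact posterior, which is why improving \(q\) tightens the bound.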

We assume \(q_{\phi}(\mathbf{z|x}) \sim \mathcal{N}(\boldsymbol{\mu,\sigma}^2\mathbf{I})\) and \(p_{\theta}(\mathbf{z})\sim \mathcal{N}(\boldsymbol{0,\mathbf{I}})\), so the KL divergence in the first term becomes \[\begin{aligned} D_{KL}(q_{\phi}(z|x)||p_{\theta}(z))&=\frac{1}{2}\sum_{j}[\mu_j^2+\sigma^2_j-\mathrm{ln}(\sigma^2_j)-1]\end{aligned}\] We further assume that \(p_{\theta}(x|z)\) is Bernoulli or Gaussian, depending on the type of data. For the Gaussian case, \(p_{\theta}(x|z) \sim \mathcal{N}(\boldsymbol{\mu,\sigma^2 \mathbf{I}})\), where \(\mu\) and \(\sigma^2\) now denote the mean and variance of the likelihood (not those of \(q_{\phi}\) above). For each component of \(x\) we get \[\begin{aligned} \mathrm{ln}[p_{\theta}(x|z)]=-\frac{(x-\mu)^2}{2\sigma^2}-\frac{\mathrm{ln}[2\pi\sigma^2]}{2}\end{aligned}\] For the Bernoulli case, \[\begin{aligned} \mathrm{ln}[p_{\theta}(x|z)]=\sum_i x_i\mathrm{ln}[y_i]+(1-x_i)\mathrm{ln}[1-y_i]\end{aligned}\] where \(y_i\) is the predicted probability for component \(i\), that is, the reconstruction \(\hat{x}_i\). In this case the second term of the ELBO is estimated by computing the negative of the binary cross entropy between the reconstruction \(\mathbf{\hat{x}}\) and \(\mathbf{x}\).
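
The closed-form terms above are short enough to write out directly. The following is a minimal sketch in plain Python, not a reference implementation: the function names, the clipping constant `eps`, and the example numbers are assumptions made for illustration. It also compares the closed-form KL divergence in one dimension against a simple numerical integration of \(\int q\,\mathrm{ln}(q/p)\,dz\).

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)) = 1/2 sum_j [mu_j^2 + sigma_j^2 - ln sigma_j^2 - 1]."""
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0 for m, s in zip(mu, sigma))


def gaussian_log_likelihood(x, mean, sigma):
    """Sum over components of ln N(x_i; mean_i, sigma^2), with a shared scalar sigma."""
    return sum(
        -((xi - mi) ** 2) / (2 * sigma**2) - 0.5 * math.log(2 * math.pi * sigma**2)
        for xi, mi in zip(x, mean)
    )


def bernoulli_log_likelihood(x, y, eps=1e-12):
    """sum_i x_i ln y_i + (1 - x_i) ln(1 - y_i); this is minus the binary cross entropy."""
    total = 0.0
    for xi, yi in zip(x, y):
        yi = min(max(yi, eps), 1.0 - eps)  # clip to avoid ln(0); eps is an assumed choice
        total += xi * math.log(yi) + (1 - xi) * math.log(1 - yi)
    return total


def kl_numeric_1d(mu, sigma, lo=-12.0, hi=12.0, n=24001):
    """Riemann-sum estimate of the integral of q ln(q/p) for q = N(mu, sigma^2), p = N(0, 1)."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        log_q = -((z - mu) ** 2) / (2 * sigma**2) - 0.5 * math.log(2 * math.pi * sigma**2)
        log_p = -0.5 * z * z - 0.5 * math.log(2 * math.pi)
        total += math.exp(log_q) * (log_q - log_p) * h
    return total


# Hypothetical example values.
kl_closed = kl_to_standard_normal([0.5], [0.8])
kl_check = kl_numeric_1d(0.5, 0.8)
print("closed-form KL:", kl_closed, " numerical KL:", kl_check)

x = [1, 0]          # binary data
x_hat = [0.9, 0.2]  # reconstruction probabilities (assumed decoder output)
print("Bernoulli ln p(x|z):", bernoulli_log_likelihood(x, x_hat))
```

With these pieces, a single-sample ELBO estimate for one data point would be the log-likelihood evaluated at a \(z\) drawn from \(q_{\phi}(z|x)\), minus the closed-form KL term.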


Achint Kumar
