
Why is the equation $$\log p_{\theta}(x^{(i)})=D_{KL}(q_{\phi}(z \mid x^{(i)})\,\|\,p_{\theta}(z \mid x^{(i)}))+\mathcal{L}(\theta,\phi;x^{(i)})$$ true, where the $x^{(i)}$ are data points and $z$ are latent variables?

I was reading the original variational autoencoder paper and I don't understand how the marginal log-likelihood can be equal to the right-hand side. How does the marginal equal the KL divergence between the approximate posterior and the true posterior plus the variational lower bound?

user8714896

1 Answer


In variational inference, the original objective is to minimize the Kullback-Leibler divergence between the variational distribution, $q(z \mid x)$, and the posterior, $p(z \mid x) = \frac{p(x, z)}{\int_z p(x, z) \, dz}$, given that the posterior can be difficult to infer directly with Bayes' rule, because of the denominator, which can contain an intractable integral.

Therefore, more formally, the optimization objective can be written as

\begin{align} q^*(z \mid x) = \operatorname{argmin}_{q(z \mid x)} D_{\text{KL}}(q(z \mid x) \| p(z \mid x))\tag{1} \label{1} \end{align}

However, solving this optimization problem can be as difficult as the original inference problem of computing the posterior $p(z \mid x)$ with Bayes' rule, given that the objective still involves the possibly intractable term $p(z \mid x)$.

If you use the definition of the KL divergence, you can derive the following equation

\begin{align} D_{\text{KL}}(q(z \mid x) \| p(z \mid x)) = \mathbb{E}_{q(z \mid x)} \left[ \log q(z \mid x) \right] - \mathbb{E}_{q(z \mid x)} \left[ \log p(z, x) \right] + \log p(x) \tag{2} \label{2} \end{align}
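
For completeness, the derivation is just the definition of the KL divergence combined with $p(z \mid x) = \frac{p(z, x)}{p(x)}$, plus the fact that $\log p(x)$ does not depend on $z$, so its expectation under $q(z \mid x)$ is simply $\log p(x)$:

\begin{align} D_{\text{KL}}(q(z \mid x) \| p(z \mid x)) &= \mathbb{E}_{q(z \mid x)} \left[ \log \frac{q(z \mid x)}{p(z \mid x)} \right] \\ &= \mathbb{E}_{q(z \mid x)} \left[ \log q(z \mid x) \right] - \mathbb{E}_{q(z \mid x)} \left[ \log p(z, x) \right] + \mathbb{E}_{q(z \mid x)} \left[ \log p(x) \right] \\ &= \mathbb{E}_{q(z \mid x)} \left[ \log q(z \mid x) \right] - \mathbb{E}_{q(z \mid x)} \left[ \log p(z, x) \right] + \log p(x) \end{align}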

First, note that the expectations are taken with respect to the variational distribution. This means that, if you want to approximate them with Monte Carlo estimates, you only need to sample from the variational distribution, and, given that the variational distribution is assumed to be easy to sample from (e.g. a Gaussian), this is a nice feature.

Second, the KL divergence still contains the term $p(x) = \int_z p(x, z) \, dz$, the denominator of Bayes' rule for the posterior $p(z \mid x)$, which (as I said) can be intractable. $p(x)$ is often called the evidence.

The solution is then to optimize an objective that does not contain this annoying intractable term $p(x)$. The objective that is optimized is the so-called ELBO objective

\begin{align} \text{ELBO}(q) = \mathbb{E}_{q(z \mid x)} \left[ \log p(z, x) \right] - \mathbb{E}_{q(z \mid x)} \left[ \log q(z \mid x) \right]\tag{3} \label{3} \end{align}

The KL divergence \ref{2} and the ELBO objective \ref{3} are similar. In fact, ELBO is an abbreviation of Evidence Lower BOund, because the ELBO is a lower bound on the log-evidence, i.e. it is a number that is smaller than (or equal to) $\log p(x)$ or, more formally, $\text{ELBO}(q) \leq \log p(x)$. Therefore, by maximizing $\text{ELBO}(q)$, we push up a lower bound on the log-evidence $\log p(x)$ of the data (where $x$ is the data in your dataset).
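
To make the Monte Carlo point above concrete, here is a minimal numerical sketch. It uses a toy 1-D conjugate Gaussian model of my own choosing (not the model from the paper), where the evidence happens to be available in closed form, so the bound $\text{ELBO}(q) \leq \log p(x)$ can be checked directly.

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D conjugate model (my own illustrative assumption, not the model from the paper):
#   prior       p(z)      = N(0, 1)
#   likelihood  p(x | z)  = N(z, 1)
#   variational q(z | x)  = N(mu, sigma^2), with mu and sigma deliberately crude here
x = 1.5                  # a single observed data point
mu, sigma = 0.0, 1.0     # parameters of the variational Gaussian q(z | x)

rng = np.random.default_rng(0)
z = rng.normal(mu, sigma, size=100_000)    # sampling from q(z | x) is easy by design

log_q = norm.logpdf(z, mu, sigma)                         # log q(z | x)
log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)   # log p(z, x) = log p(z) + log p(x | z)

# Monte Carlo estimate of ELBO(q) = E_q[log p(z, x)] - E_q[log q(z | x)]
elbo = np.mean(log_joint - log_q)

# In this conjugate case the evidence is known in closed form: p(x) = N(x; 0, 2),
# so the bound ELBO(q) <= log p(x) can be checked directly.
log_evidence = norm.logpdf(x, 0, np.sqrt(2))
print("ELBO estimate:", elbo)
print("log p(x):     ", log_evidence)
```

With this deliberately crude choice of $\mu$ and $\sigma$, the estimate comes out clearly below $\log p(x)$; the gap is (up to Monte Carlo error) exactly the KL divergence in \ref{2}.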

So, the objective in variational inference is

\begin{align} q^*(z \mid x) &= \operatorname{argmax}_{q(z \mid x)} \operatorname{ELBO}({q}) \\ &= \operatorname{argmax}_{q(z \mid x)} \mathbb{E}_{q(z \mid x)} \left[ \log p(z, x) \right] - \mathbb{E}_{q(z \mid x)} \left[ \log q(z \mid x) \right] \tag{4} \label{4} \end{align}

First, note that \ref{4} only involves the variational distribution and the joint $p(z, x)$ (which, unlike the evidence, is typically tractable, since it factorizes as $p(x \mid z) p(z)$), so we got rid of the intractable term, which was our goal.

Second, note that, as opposed to \ref{1}, we are maximizing (or finding the parameters that maximize the objective).
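
Continuing the toy example above, here is a sketch of this maximization, under the same assumptions (the same 1-D Gaussian model, a reparameterized Monte Carlo estimate of the ELBO, and a generic optimizer standing in for the gradient-based methods used in practice); the function name `negative_elbo` is mine.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Same toy 1-D Gaussian model as above (an illustrative assumption):
#   p(z) = N(0, 1),  p(x | z) = N(z, 1),  q(z | x) = N(mu, sigma^2)
x = 1.5
rng = np.random.default_rng(0)
eps = rng.normal(size=50_000)      # fixed noise: reparameterize z = mu + sigma * eps

def negative_elbo(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)      # keep sigma positive by optimizing its logarithm
    z = mu + sigma * eps           # samples from q(z | x)
    log_q = norm.logpdf(z, mu, sigma)
    log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)
    return -np.mean(log_joint - log_q)   # maximizing the ELBO = minimizing its negative

result = minimize(negative_elbo, x0=[0.0, 0.0], method="Nelder-Mead")
mu_opt, sigma_opt = result.x[0], np.exp(result.x[1])
print(mu_opt, sigma_opt)   # should be close to the exact posterior N(0.75, 0.5), i.e. sigma ~ 0.707
```

Because the toy posterior is itself Gaussian, the optimal $q$ essentially recovers it; in a VAE, the same maximization is carried out with stochastic gradients, with a neural network producing $\mu$ and $\sigma$ as a function of $x$.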

The ELBO objective is actually the negative of the KL divergence \ref{2} plus the logarithm of the evidence, $\log p(x)$ (you can easily verify it; a one-line check is given at the end of this answer), that is

\begin{align} \text{ELBO}(q) = -D_{\text{KL}}(q(z \mid x) \| p(z \mid x)) + \log p(x) \end{align}

which can also be rearranged as

\begin{align} \log p(x) = D_{\text{KL}}(q(z \mid x) \| p(z \mid x)) + \text{ELBO}(q) \tag{5}\label{5} \end{align}

which is your equation (where $\text{ELBO}(q)$ is your $\mathcal{L}$). Therefore, your equation is true by definition, i.e. we define the ELBO so that \ref{5} holds. However, note that we haven't defined the ELBO this way just for the sake of it, but because it is a lower bound on the log-evidence (which follows from \ref{5} and the fact that the KL divergence is never negative).
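
For completeness, the verification mentioned above is a one-line substitution: take the negative of \ref{2}, add $\log p(x)$, and you recover exactly the definition \ref{3}

\begin{align} -D_{\text{KL}}(q(z \mid x) \| p(z \mid x)) + \log p(x) &= -\mathbb{E}_{q(z \mid x)} \left[ \log q(z \mid x) \right] + \mathbb{E}_{q(z \mid x)} \left[ \log p(z, x) \right] - \log p(x) + \log p(x) \\ &= \mathbb{E}_{q(z \mid x)} \left[ \log p(z, x) \right] - \mathbb{E}_{q(z \mid x)} \left[ \log q(z \mid x) \right] = \text{ELBO}(q) \end{align}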

nbro