
In a lot of the VAE implementations I've seen in code, it's not obvious to me how the loss actually equates to the ELBO.

$$L(X)=H(Q)-H(Q:P(X,Z))=\sum_Z Q(Z)\log P(Z,X)-\sum_Z Q(Z)\log Q(Z)$$

The above is the definition of the ELBO, where $X$ is some input, $Z$ is a latent variable, $H(\cdot)$ is the entropy, and $Q(\cdot)$ is a distribution used to approximate the distribution $P(\cdot)$; in the above case, both $P(\cdot)$ and $Q(\cdot)$ are discrete distributions (hence the sums).
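As a quick sanity check of this identity, here is a small sketch with arbitrary toy discrete distributions (the particular values of `Q` and `P_joint` are made up for illustration, not taken from any real model), verifying that $H(Q)-H(Q:P(X,Z))$ matches the two-sum form:

```python
import numpy as np

# Toy discrete distributions over 3 latent states (values are arbitrary assumptions).
Q = np.array([0.2, 0.5, 0.3])          # approximate posterior Q(Z)
P_joint = np.array([0.1, 0.3, 0.2])    # joint P(Z, X) for a fixed X (need not sum to 1)

entropy_Q = -np.sum(Q * np.log(Q))             # H(Q)
cross_entropy = -np.sum(Q * np.log(P_joint))   # H(Q : P(X, Z))

elbo_from_entropies = entropy_Q - cross_entropy
elbo_from_sums = np.sum(Q * np.log(P_joint)) - np.sum(Q * np.log(Q))

assert np.isclose(elbo_from_entropies, elbo_from_sums)
```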

VAEs are often built to reconstruct discrete data types, for example a binary image where each pixel is black or white, i.e. $0$ or $1$. The main steps of a VAE that I've seen in code are as follows (a minimal code sketch of these steps follows the notation below):

  1. $\text{Encoder}(Y) \rightarrow Z_\mu, Z_{\sigma}$
  2. $\text{Reparameterization Trick}(Z_\mu, Z_\sigma) \rightarrow Z$
  3. $\text{Decoder}(Z) \rightarrow \hat{Y}$
  4. $L(Y)= \text{CrossEntropy}(\hat{Y}, Y) - 0.5*(1+Z_{\sigma}-Z_{\mu}^2-exp(Z_\sigma))$

where

  • $Z$ represents the latent embedding of the auto-encoder
  • $Z_\mu$ and $Z_\sigma$ represent the mean and standard deviation used to sample $Z$ from a Gaussian distribution.
  • $Y$ represents the binary image being reconstructed
  • $\hat{Y}$ represents its reconstruction from the VAE.
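To make the four steps above concrete, here is a minimal sketch of what such an implementation typically looks like (PyTorch is assumed; the layer sizes and names are illustrative assumptions, not a reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.encoder = nn.Linear(input_dim, 400)
        self.fc_mu = nn.Linear(400, latent_dim)      # produces Z_mu
        self.fc_logvar = nn.Linear(400, latent_dim)  # produces Z_sigma (really a log-variance)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 400), nn.ReLU(), nn.Linear(400, input_dim)
        )

    def forward(self, y):
        h = F.relu(self.encoder(y))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)             # step 1: Encoder(Y) -> Z_mu, Z_sigma
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # step 2: reparameterization trick
        y_hat = torch.sigmoid(self.decoder(z))                    # step 3: Decoder(Z) -> Y_hat
        return y_hat, mu, logvar

def loss_fn(y_hat, y, mu, logvar):
    # step 4: reconstruction cross-entropy plus the KL term,
    # i.e. CE(Y_hat, Y) - 0.5 * sum(1 + Z_sigma - Z_mu^2 - exp(Z_sigma))
    ce = F.binary_cross_entropy(y_hat, y, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return ce + kl
```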

As we can see from the ELBO, it is the entropy of the latent distribution being learned, $Q(\cdot)$, which is a Gaussian, minus the cross-entropy between $Q(\cdot)$ and the actual joint distribution $P(Z, X)$.

The main points that confuse me are

  • how $\text{CrossEntropy}(\hat{Y}, Y)$ equates to the CE of the distribution for generating latents and its Gaussian approximation, and
  • how $(0.5*(1+Z_{\sigma}-Z_{\mu}^2-exp(Z_\sigma)))$ equates to the entropy

Is it just assumed that the CE of $Y$ with $\hat{Y}$ also gives the CE of the latent distribution with its approximation, because they're part of $\hat{Y}$'s generation? It still seems a bit off, because you're computing the cross-entropy of $Y$ with its reconstruction, not with the Gaussian distribution used to learn the latents $Z$.

Note: $Z_\sigma$ is usually not passed through a softplus to make it strictly positive, as a standard deviation of a Gaussian must be, so I think that's what $exp(Z_\sigma)$ is for.


1 Answer


I don't want to dwell on the correctness of your supposed ELBO equation here. Nevertheless, it's true that the ELBO can be rewritten in different ways (e.g. if you expand the KL divergence below, by applying its definition, you will end up with a different but equivalent version of the ELBO). I will use the most common (and, at least to me, most intuitive) one, which you can find in the paper that originally introduced the VAE, and which you should use as a reference when you are confused (although that paper may require at least two readings before you fully understand it).

Here's the most common form of the ELBO (for the VAE), which immediately explains the common VAE implementations (including yours):

$$\mathcal{L}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right)= \underbrace{ \color{red}{ -D_{K L}\left(q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right) \| p_{\boldsymbol{\theta}}(\mathbf{z})\right) } }_{\text{KL divergence}} + \underbrace{ \color{green}{ \mathbb{E}_{q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right)}\left[\color{blue}{ \log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}\right)}\right] } }_{\text{Expected log-likelihood}}, $$

where

  • $p_{\boldsymbol{\theta}}(\mathbf{z})$ is the prior over $\mathbf{z}$ (i.e. the latent variable); in practice, the prior is not actually parametrized by $\boldsymbol{\theta}$ (i.e. it's just a Gaussian with mean zero and variance one, or whatever, depending on your assumptions about $\mathbf{z}$!), but in the paper they assume that $\mathbf{z}$ depends on $\boldsymbol{\theta}$ (see figure 1).

  • $q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right)$ is the encoder parametrized by $\boldsymbol{\phi}$

  • $p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}\right)$ is the decoder, parametrized by $\boldsymbol{\theta}$

If you assume that $p_{\boldsymbol{\theta}}(\mathbf{z})$ and $q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right) $ are Gaussian distributions, then it turns out that the KL divergence has an analytical form, which is also derived in appendix B of the VAE paper (here is a detailed derivation)

$$ \color{red}{ -D_{K L}\left(q_{\boldsymbol{\phi}}\left(\mathbf{z} \mid \mathbf{x}^{(i)}\right) \| p_{\boldsymbol{\theta}}(\mathbf{z})\right)} = \color{red}{ \frac{1}{2} \sum_{j=1}^{J}\left(1+\log \left(\left(\sigma_{j}\right)^{2}\right)-\left(\mu_{j}\right)^{2}-\left(\sigma_{j}\right)^{2}\right) } $$

Hence this implementation, which should be equivalent to your term $0.5*(1+Z_{\sigma}-Z_{\mu}^2-exp(Z_\sigma))$ (where your $Z_\sigma$ plays the role of a log-variance, hence the $exp$).
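In code, this analytical KL term is just a couple of lines; a minimal PyTorch sketch, assuming (as most implementations do) that the encoder outputs a log-variance:

```python
import torch

def gaussian_kl(mu, logvar):
    # Analytical KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the J latent
    # dimensions, i.e. the negative of 0.5 * sum(1 + logvar - mu^2 - exp(logvar)) above.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```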

The expected log-likelihood is actually an expectation, so you cannot compute it exactly, in general. Hence, you can approximate it with Monte Carlo sampling (aka sampling averages: remember the law of large numbers?). More concretely, if you assume that you have a Bernoulli likelihood, i.e. $p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}\right)$ is a Bernoulli, then its definition is (again from the VAE paper, Appendix C.1)

$$ \color{blue}{ \log p(\mathbf{x} \mid \mathbf{z})}= \color{blue}{ \sum_{i=1}^{D} x_{i} \log y_{i}+\left(1-x_{i}\right) \cdot \log \left(1-y_{i}\right) }\tag{1}\label{1}, $$ where $\mathbf{y}$ is the output of the decoder (i.e. the reconstruction/generation of the original input).

This formula should be very familiar to you if you are familiar with the cross-entropy. In fact, minimizing the cross-entropy is equivalent to maximizing the log-likelihood (this may still be confusing because of the flipped signs in the ELBO above, but just remember that maximizing a function is equivalent to minimizing its negative!). Hence this loss.
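To see the equivalence numerically, here is a small sketch (with arbitrary made-up values) showing that the summed binary cross-entropy is exactly the negative of the Bernoulli log-likelihood in equation \ref{1}:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0., 1., 1., 0.])        # binary input, the x_i in equation (1)
y = torch.tensor([0.1, 0.8, 0.6, 0.3])    # decoder output, the y_i in equation (1)

log_likelihood = torch.sum(x * torch.log(y) + (1 - x) * torch.log(1 - y))
bce = F.binary_cross_entropy(y, x, reduction='sum')

assert torch.isclose(bce, -log_likelihood)  # minimizing BCE == maximizing the log-likelihood
```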

To answer/address some of your questions/doubts directly

how $(0.5*(1+Z_{\sigma}-Z_{\mu}^2-exp(Z_\sigma)))$ equates to the entropy?

I answered this above. That's just the analytical expression for the KL divergence (by the way, the KL divergence is also known as relative entropy). See this answer for a derivation.

It still seems a bit off, because you're computing the cross-entropy of $Y$ with its reconstruction, not with the Gaussian distribution used to learn the latents $Z$.

As you can see from the definition of the ELBO above (from the VAE paper), the expected log-likelihood is, as the name suggests, an expectation with respect to the encoder (i.e. the Gaussian, in case you choose a Gaussian). However, the equivalence is between the log-likelihood (the term inside the expectation) and the cross-entropy: once you have sampled from the encoder, you just need to compute the term inside the expectation (i.e. the cross-entropy). Your term $\text{CrossEntropy}(\hat{Y}, Y)$ represents the CE, but only after you have sampled a latent variable from the encoder (or Gaussian); otherwise, you could not have obtained the reconstruction $\hat{Y}$ (i.e. the reconstruction depends on this latent variable, see figure 1).

In equation \ref{1}, note that there is no expectation. In fact, in the implementations, you may just sample once from the encoder, and then immediately compute the ELBO. I have seen this also in implementations of Bayesian neural networks (basically, normal neural networks trained with the same principles as the VAE). However, in principle, you could sample multiple times from the Gaussian encoder, compute \ref{1} multiple times, and then average the results, to compute a better approximation of the expected log-likelihood.
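As a sketch of that last point (the function and argument names here are hypothetical, not from any particular library), a multi-sample estimate just repeats the sampling and averages the per-sample log-likelihoods:

```python
import torch

def expected_log_likelihood(decoder, x, mu, logvar, num_samples=1):
    # Monte Carlo estimate of E_{q(z|x)}[ log p(x | z) ] for a Bernoulli decoder:
    # sample z with the reparameterization trick `num_samples` times, evaluate
    # equation (1) for each sample, and average. num_samples=1 is what most
    # implementations effectively do.
    total = 0.0
    for _ in range(num_samples):
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample
        y = torch.sigmoid(decoder(z))                            # reconstruction of x
        total = total + torch.sum(x * torch.log(y) + (1 - x) * torch.log(1 - y))
    return total / num_samples
```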

Hopefully, some of the information in this answer is clear. Honestly, I don't have much time to write a better answer now (maybe I will come back later). In any case, I think you can find all answers to your questions/doubts in the VAE paper (although, as I said, you may need to read it at least twice to understand it).

By the way, the simplest/cleanest implementation of the VAE that I have found so far is this one.

    Really good explanation, every part of my question is answered except maybe the $CE(\hat{Y}, Y)$ being equivalent to expected log likelihood. – user8714896 Nov 12 '20 at 23:25
  • @user8714896 I think you can find the answer to that question on the web. If not, you can ask another separate question, and maybe later I will answer. – nbro Nov 12 '20 at 23:28