
In VAEs, we try to maximize the ELBO $= \mathbb{E}_q [\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$, but I see that many implementations realize the first term as the MSE between the image and its reconstruction. Here's a paper (section 5) that seems to do that: Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse (2019) by James Lucas et al. Is this mathematically sound?
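
For concreteness, the pattern I'm referring to looks roughly like this (just a sketch; `vae_loss`, `mu`, `logvar`, etc. are illustrative names, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term implemented as the (summed) MSE between the input x
    # and its reconstruction x_hat -- this is the pattern I am asking about.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL divergence between q(z|x) = N(mu, diag(exp(logvar)))
    # and the standard normal prior p(z) = N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # total loss to be minimized
```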

IttayD
  • It may be a good idea to provide 1-2 examples where you saw this because the MSE is not always used. For example, [here](https://github.com/pytorch/examples/blob/master/vae/main.py) they use the cross-entropy. – nbro Apr 15 '21 at 10:09
  • I recently read that MSE loss optimization is equivalent to minimizing the Pearson $\chi^{2}$ divergence. Kullback–Leibler divergence (and also cross-entropy) has its [drawbacks](https://ai.stackexchange.com/a/25288/12841). Here is an [explanation](https://paperswithcode.com/method/lsgan) of the least-squares loss for GANs. – Aray Karjauv Apr 15 '21 at 14:06
  • As you mentioned, MSE is used to measure the difference between the original and generated images. This encourages the model to preserve the original content. MSE loss can be used as an additional term, as is done in [CycleGAN](https://paperswithcode.com/method/cyclegan), where the authors use the LSGAN loss and a cycle-consistent loss, which is an MSE-like loss. – Aray Karjauv Apr 15 '21 at 14:17
  • @nbro, it is not clear why they use BCE there... In fact, that implementation doesn't seem to sample between the encoder and decoder, which is even stranger. It looks like they treat the distribution parameters as the input to the decoder – IttayD Apr 16 '21 at 09:36
  • What do you mean by "sample between the encoder and decoder"? Yes, the input to the decoder is a sample from the latent space, so I am not sure what you mean. – nbro Apr 16 '21 at 09:39
  • @nbro, I don't see where they sample in the implementation. But my main question is how BCE is applied to data points (instead of distributions) – IttayD Apr 16 '21 at 11:05
  • @IttayD Please, provide the link to the examples of the VAE implementations that use the MSE. To give a proper answer, I would need more context. – nbro Apr 16 '21 at 11:26
  • @nbro, MSE or BCE, it's the same issue. The VAE paper doesn't talk about comparing the reconstructed image with the original one, just about optimizing PDFs. See section C.2 in the original paper, where they calculate $p(x|z)$ with no MSE / BCE – IttayD Apr 18 '21 at 05:51
  • @nbro, from the article: "for the decoder we used MLPs with either Gaussian or Bernoulli output". That is, the decoder outputs the parameters of a distribution, not a reconstructed image; we sample from that distribution to get the image. – IttayD Apr 18 '21 at 05:54

2 Answers


If $p(x|z) \sim \mathcal{N}(f(z), I)$, then

$$\log p(x|z) \sim \log \exp\left(-(x - f(z))^2\right) \sim -(x - f(z))^2 = -(x - \hat{x})^2,$$

where $\hat{x}$, the reconstructed image, is just the distribution mean $f(z)$.

It also makes sense to use the distribution mean at inference time (not just during training), since it is the point with the highest pdf value. So the decoder produces a distribution, and we take its mean as our result.
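
As a rough sketch of what this means in code (the `decoder` architecture and the shapes here are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical decoder: maps a latent z to the mean f(z) of p(x|z) = N(f(z), I).
decoder = nn.Sequential(nn.Linear(20, 400), nn.ReLU(), nn.Linear(400, 784))

def reconstruction_loss(x, z):
    x_hat = decoder(z)  # x_hat = f(z), the distribution mean
    # Up to additive constants, -log p(x|z) = 0.5 * ||x - f(z)||^2, i.e. the
    # (scaled) squared error between the image and its reconstruction.
    return 0.5 * ((x - x_hat) ** 2).sum()

@torch.no_grad()
def decode(z):
    # At inference time we return the mean itself instead of sampling from
    # N(f(z), I), since it is the highest-density point of the output distribution.
    return decoder(z)
```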

IttayD
  • This seems correct to me (and this is what I had in mind when I wrote my answer above, although I didn't write the derivation), but note that the second $\sim$ is exactly $=$. The first one is correct because you're omitting the denominator of the Gaussian density (which becomes a constant if you optimize wrt $\theta$), so you use approximately equal, which is correct. Moreover, it might be a good idea to add a subscript to $f$, i.e. $f_\theta$, to emphasize that you will be optimizing with respect to those parameters, although you jointly optimize them with $\phi$ (the encoder parameters). – nbro Jun 12 '22 at 08:16

On page 5 of the VAE paper, it's clearly stated:

We let $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ be a multivariate Gaussian (in case of real-valued data) or Bernoulli (in case of binary data) whose distribution parameters are computed from $\mathbf{z}$ with a MLP (a fully-connected neural network with a single hidden layer, see appendix $\mathrm{C}$ ).

...

As explained above and in appendix $\mathrm{C}$, the decoding term $\log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i, l)}\right)$ is a Bernoulli or Gaussian MLP, depending on the type of data we are modelling.

So, if you are trying to predict real numbers (in the case of images, these can be the RGB values in the range $[0, 1]$), then you can assume $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ is a Gaussian.

It turns out that maximising the Gaussian likelihood is equivalent to minimising the MSE between the prediction of the decoder and the real image. You can easily show this: just replace $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ with the Gaussian pdf, then maximise that wrt the parameters, and you should end up with something that resembles the MSE. G. Hinton shows this in this video lesson. See also this related answer.

So, yes, minimizing the MSE is theoretically founded, provided that you're trying to predict some real number.
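
Here's a quick numerical check of that equivalence, using made-up tensors (nothing here comes from the paper; it just illustrates that the Gaussian log-likelihood and the negative squared error differ only by a constant that doesn't depend on the prediction):

```python
import torch
from torch.distributions import Normal

# Made-up "image" x with real values in [0, 1] and a decoder prediction x_hat.
x = torch.rand(784)
x_hat = torch.rand(784)

# Log-likelihood of x under a unit-variance Gaussian centred at the prediction.
log_lik = Normal(loc=x_hat, scale=1.0).log_prob(x).sum()

# Negative half of the summed squared error (the MSE up to scaling).
neg_half_sse = -0.5 * ((x - x_hat) ** 2).sum()

# The difference is just the Gaussian log-normalizer, which does not depend on
# x_hat, so maximizing the likelihood and minimizing the MSE coincide.
diff = log_lik - neg_half_sse
expected = -0.5 * 784 * torch.log(torch.tensor(2 * torch.pi))
print(torch.allclose(diff, expected))  # True
```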

When the binary cross-entropy (instead of the MSE) is used (e.g. here), the assumption is that you're maximizing a Bernoulli likelihood (instead of a Gaussian) - this can also be easily shown.
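
Again, a small sanity check with made-up tensors (assuming binary targets and decoder outputs that have already been passed through a sigmoid):

```python
import torch
import torch.nn.functional as F

# Made-up binary targets x and predicted Bernoulli probabilities p.
x = (torch.rand(784) > 0.5).float()
p = torch.rand(784).clamp(1e-6, 1 - 1e-6)

# Negative Bernoulli log-likelihood of x under probabilities p.
neg_log_lik = -(x * torch.log(p) + (1 - x) * torch.log(1 - p)).sum()

# Binary cross-entropy, as used in the linked PyTorch VAE example.
bce = F.binary_cross_entropy(p, x, reduction="sum")

print(torch.allclose(neg_log_lik, bce))  # True: BCE is exactly the Bernoulli NLL
```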

nbro
  • the pdf is $\frac{1}{\sigma_z \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu_z}{\sigma_z}\right)^2}$. MSE is $\|\hat{x}-x\|^2$. I don't see the connection other than the fact that there's an $(x-\mu_z)^2$ term in the exponent. Note that in $p(x|z)$, the original $x$ is used, not the reconstruction. – IttayD Apr 18 '21 at 05:48
  • See section C.2 in the original paper where they calculate p(x|z) with no MSE / BCE – IttayD Apr 18 '21 at 05:51
  • Not sure why you say "Note that in p(x|z), the original x is used, not the reconstruction.". Both the original and the reconstructed images are used. z is used to compute the reconstructed image (or pixel). – nbro Apr 18 '21 at 11:28