
I understand that there are a few reasons why vanilla VAEs produce blurry images. The InfoVAE paper describes one case: when the decoder is flexible enough, it can ignore the latent code and generate an averaged image that minimizes the reconstruction loss, hence the blur.

How much of the problem of blurring is really mitigated by the MMD formulation in practical experiments? If someone has experience working with MMD-VAEs, I'd like to know their opinion on what the reconstruction quality of MMD-VAEs is really like.

Also, does replacing the MSE reconstruction loss with a perceptual similarity metric improve generated image quality?

Ananda

1 Answer


[Answering my own question after 5 months of studying VAE models]

The point of the MMD-VAE, or InfoVAE, is not primarily the visual quality of generated samples. It is to preserve more information through the encoding process. The MMD formulation stems from introducing a mutual information term into the Evidence Lower BOund (ELBO) objective of VAEs. Refer to the paper's appendices for the full derivation. This formulation increases the information content of the latent space and yields a more accurate approximation of the true posterior; both results are demonstrated empirically in the paper.
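To make the MMD term concrete, here is a minimal NumPy sketch of the squared MMD between a batch of latent codes and samples from the prior, MMD²(q, p) = E[k(z, z')] + E[k(z̃, z̃')] − 2 E[k(z, z̃)]. The Gaussian kernel, the `bandwidth` value, the function names and the toy data are my own assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(z_q, z_p, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two sample sets."""
    return (
        rbf_kernel(z_q, z_q, bandwidth).mean()
        + rbf_kernel(z_p, z_p, bandwidth).mean()
        - 2.0 * rbf_kernel(z_q, z_p, bandwidth).mean()
    )

rng = np.random.default_rng(0)
prior = rng.standard_normal((500, 2))                # samples from p(z) = N(0, I)
codes_matched = rng.standard_normal((500, 2))        # stand-in for codes that match the prior
codes_shifted = rng.standard_normal((500, 2)) + 3.0  # stand-in for codes far from the prior

mmd_matched = mmd2(codes_matched, prior)
mmd_shifted = mmd2(codes_shifted, prior)
print(mmd_matched, mmd_shifted)
```

The penalty is close to zero when the aggregate of the codes looks like the prior and grows when it drifts away. Note that it acts only on the latent distribution; it places no constraint on how sharp the decoder output is.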

However, the InfoVAE uses a pixel-wise (element-wise) reconstruction loss. An element-wise reconstruction loss is likely to lead to some degree of blurriness irrespective of the prior loss term. On GitHub, several developers have implemented the InfoVAE model and shown their results. Here is a link to one such implementation whose results I could personally verify.

From my own experiments, I can say that even though InfoVAE may give better reconstructions for some data, there is still considerable blurriness.
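A toy case shows why an element-wise loss pulls toward blur regardless of the prior term. Suppose two sharp images are equally plausible for the same latent code; the output that minimizes the expected MSE is their pixel-wise mean. The 1-D "images" and the helper names below are made up for illustration.

```python
import numpy as np

def edge_image(position, width=8):
    """1-D 'image' with a sharp step from 0 to 1 at the given position."""
    img = np.zeros(width)
    img[position:] = 1.0
    return img

# Two equally plausible sharp targets for the same latent code.
targets = np.stack([edge_image(3), edge_image(5)])

def expected_mse(candidate):
    """Mean squared error of one candidate output, averaged over both targets."""
    return np.mean((targets - candidate) ** 2)

blurred = targets.mean(axis=0)           # pixel-wise average: a soft ramp, not a step
loss_blurred = expected_mse(blurred)
loss_sharp = expected_mse(targets[0])    # committing to one sharp answer

print(blurred)
print(loss_blurred, loss_sharp)
```

The averaged output gets the lower loss even though it matches neither target, which is the behaviour that shows up as blur in image reconstructions.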

Perceptual similarity metrics may be learned or computed as a static function of the input image. With a learned perceptual loss, VAEs can produce much sharper images. VAE-GAN is a well-known example: it measures reconstruction error in the feature space of a jointly trained discriminator. PixelVAE also produces sharper samples, though it does so with an autoregressive PixelCNN decoder rather than a learned perceptual loss. For a static function of the image itself, reconstruction quality depends on the nature of that function, and such a model may not be useful for all kinds of datasets. With measures like SSIM or FSIM, we may still end up with blurred images.
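As an example of a static metric, here is a simplified SSIM computed over the whole image as a single window. This is only a sketch: standard SSIM averages the same statistic over local sliding windows, and the function name, `data_range` default and test images are my own choices.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM statistic over one global window (standard SSIM uses local windows)."""
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

rng = np.random.default_rng(0)
image = rng.random((16, 16))
noisy = np.clip(image + 0.2 * rng.standard_normal((16, 16)), 0.0, 1.0)

ssim_same = global_ssim(image, image)
ssim_noisy = global_ssim(image, noisy)
print(ssim_same, ssim_noisy)
```

Used as a reconstruction loss, this would typically enter as `1 - SSIM`. Because it is still built from fixed means, variances and covariances of pixel values, it does not necessarily penalize blur the way a learned feature-space loss can.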

Ananda