Mode collapse is a common problem faced by GANs. I am curious: why don't VAEs suffer from mode collapse?
1 Answer
With Generative Adversarial Networks, all the generator cares about is fooling the discriminator. There's no requirement to be clever, or exhaustive, or make efficient use of the input space. As long as the discriminator returns "real" (vs. "fake") the generator "wins".
The hope is that, as the generator and discriminator are trained simultaneously, each will exploit the faults of the other. As the discriminator gets better at telling real from fake for one mode, the generator will shift toward the other modes, where it has an easier time fooling the discriminator.
But that isn't the only way things can go. When the discriminator gets better at distinguishing one mode, the generator can instead concentrate on that mode. Rather than spreading its training across different modes, it can try to get really good at producing samples within a single mode. As long as the discriminator can be fooled into reporting "real" for the samples being produced, the generator gets the same reward as if it had spread itself across multiple modes.
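The point can be made concrete with a toy calculation (the scores below are made-up numbers, not from any trained model): the standard non-saturating generator loss sees only the discriminator's scores, never the diversity of the samples, so a collapsed generator that fools the discriminator slightly better is actually rewarded.

```python
import numpy as np

# Non-saturating generator loss: -E[log D(G(z))]. It depends only on the
# discriminator's scores in (0, 1], not on how diverse the samples are.
def generator_loss(d_scores):
    return -np.mean(np.log(d_scores))

# Scores for a generator spread across many modes, each fooling D moderately...
spread_scores = np.array([0.8, 0.8, 0.8, 0.8])
# ...versus a generator collapsed onto one mode it has perfected.
collapsed_scores = np.array([0.95, 0.95, 0.95, 0.95])

# The collapsed generator's loss is LOWER: specialising on one mode that
# reliably fools the discriminator is rewarded, not penalised.
assert generator_loss(collapsed_scores) < generator_loss(spread_scores)
```

Nothing in this objective measures coverage of the data distribution, which is exactly the gap mode collapse exploits.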
Think about the forger/detective analogy. Instead of trying to learn how to be good at making a range of paintings (Caravaggios, van Goghs, Warhols, etc.), the forger can concentrate on being really good at a particular style of painting (e.g. Dutch Masters) or even a particular artist. Someone who spends all their time perfecting their Rembrandt forgeries will do a much better job evading detection than someone who tries to be good at both Picasso and Rembrandt (and Klee and Miró and ...).
Variational Autoencoders are different, though. With a VAE, the latent space is built from all of the input examples; there's no way for a VAE to concentrate on a particular mode. All of the training examples are mapped into the same latent space. If, for some reason, the decoder network concentrated on a single mode, mapping the latent space to outputs of just that mode, then all the training examples from the other modes would be reconstructed poorly. That would produce a large loss, and a gradient pushing the decoder back toward covering those modes. Because decoder training is balanced across all of the input examples and modes, the decoder must be similarly balanced when decoding the latent space. GANs have no such balance: the generator's output need not be balanced at all, so long as it reliably fools the discriminator.
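A toy numeric sketch of that balancing effect (hypothetical 1-D "images" and a squared-error reconstruction loss, chosen for illustration): because the loss averages over every training example, a decoder that abandons one mode pays for it directly.

```python
import numpy as np

# Toy dataset with two modes, clustered at -1 and +1.
data = np.array([-1.0, -1.0, 1.0, 1.0])

# Reconstruction loss averaged over ALL training examples.
def recon_loss(reconstructions, targets):
    return np.mean((reconstructions - targets) ** 2)

# A decoder covering both modes reconstructs every example reasonably well...
balanced = np.array([-0.9, -0.9, 0.9, 0.9])
# ...while a decoder collapsed onto the +1 mode fails badly on the -1 examples.
collapsed = np.array([1.0, 1.0, 1.0, 1.0])

# The neglected mode dominates the loss, so its gradient pulls the
# decoder back toward covering it.
assert recon_loss(collapsed, data) > recon_loss(balanced, data)
```

This is the structural difference from the GAN objective above: here coverage of every example is baked into the loss itself.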
That doesn't mean VAEs are free of related issues, though they're not identical to mode collapse. The problem with VAEs is that each individual example is (in extremis) mapped to the same random distribution in the latent space. This can encourage the decoder to ignore the latent input and generate output more or less arbitrarily. A number of recent approaches (notably Adversarial Autoencoders, as well as InfoVAE and the closely related Wasserstein Autoencoders) attempt to remedy this by constraining the latent distribution at the ensemble level, rather than per individual training example.
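The mechanism can be seen in the closed-form KL term of a Gaussian VAE (toy numbers for illustration): an informative posterior pays a KL cost that varies with the input, while a posterior collapsed onto the prior pays none, at which point the latent code carries no information and the decoder is free to ignore it.

```python
import numpy as np

# Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension,
# as appears in the standard Gaussian-VAE ELBO.
def kl_to_standard_normal(mu, sigma):
    return 0.5 * (mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

# An informative posterior q(z|x) pays a positive KL cost...
assert kl_to_standard_normal(mu=2.0, sigma=0.5) > 0.0

# ...while a posterior collapsed to the prior, q(z|x) = N(0, 1) for EVERY x,
# pays zero KL: z then carries no information about x at all.
assert kl_to_standard_normal(mu=0.0, sigma=1.0) == 0.0
```

Minimising the KL term alone therefore pushes exactly toward the constant-posterior failure the comment below describes; only the reconstruction term resists it.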

Exactly. Vanilla VAEs collapse to a constant variational posterior too often. – shouldsee Sep 18 '22 at 07:56