I'm currently playing around with LDMs on the Fashion-MNIST dataset. I thought the VQ-VAEs used in the original paper were a bit overkill for what I'm doing (and I don't fully understand how they construct the discretized codebook latent space), so I went with a simple convolutional autoencoder with a KL regularizer that maps to an approximately Gaussian latent space. I've trained this model a few times and verified that it reconstructs the original image inputs fairly well.
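For context, the autoencoder itself looks roughly like this (a simplified sketch, not my exact channel counts or KL weight):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagonalGaussian:
    # wraps (mu, logvar) so that encode(x).sample() works like in the training loop below
    def __init__(self, mu, logvar):
        self.mu, self.logvar = mu, logvar

    def sample(self):
        return self.mu + torch.exp(0.5 * self.logvar) * torch.randn_like(self.mu)

    def kl(self):
        return -0.5 * torch.mean(1 + self.logvar - self.mu.pow(2) - self.logvar.exp())

class KLAutoencoder(nn.Module):
    def __init__(self, latent_channels=4):
        super().__init__()
        # 1x28x28 -> latent_channels x 7x7 (channel counts are illustrative)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
            nn.Conv2d(64, 2 * latent_channels, 3, padding=1),      # -> (mu, logvar)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.ReLU(),  # 7 -> 14
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),               # 14 -> 28
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def encode(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=1)
        return DiagonalGaussian(mu, logvar)

    def decode(self, z):
        return self.decoder(z)

    def loss(self, x, kl_weight=1e-6):
        posterior = self.encode(x)
        recon = self.decode(posterior.sample())
        # reconstruction loss plus a small KL penalty towards N(0, I)
        return F.mse_loss(recon, x) + kl_weight * posterior.kl()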
I run into issues when I try to use this autoencoder with my LDM implementation. I first made sure standard diffusion directly in image space works, which it does. But with latent diffusion on top of my trained autoencoder, the training loss plateaus around ~1.0 and the decoded outputs still look like pure Gaussian noise.
As described in the paper, I sample latent vectors from the encoder part of the autoencoder and rescale them using statistics collected from the first batch of data, like so:
# encode images to latents with the frozen autoencoder (no gradients through it)
batch = autoencoder.encode(batch).sample().detach()

# rescale the embeddings to be unit variance, using the std of the first batch
if epoch == 0 and step == 0:
    print("Calculating scale factor...")
    std = batch.flatten().std()
    scale_factor = 1. / std
    cfg.scale_factor = scale_factor.item()

batch *= cfg.scale_factor
And then pretty much everything else (applying noise, calculating the loss, etc.) is the same as standard diffusion. Am I missing something, or is the latent space of my simple conv autoencoder hard to learn for some reason? Since the autoencoder is just a deterministic mapping from image space to a lower-dimensional one, I would think the LDM should be able to learn the latent space just fine. Or is there something inherent in the original paper's architecture for the encoder / codebook latent space that is important for the LDM to learn?
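For concreteness, the noising / loss step I'm referring to is just the standard epsilon-prediction objective, roughly like this (simplified sketch; unet, alphas_cumprod, and T are placeholder names for my actual model and noise schedule):

import torch
import torch.nn.functional as F

# standard DDPM-style step, now on the rescaled latents instead of images
# `unet`, `alphas_cumprod` (tensor of shape [T]) and `T` are placeholder names
t = torch.randint(0, T, (batch.shape[0],), device=batch.device)
noise = torch.randn_like(batch)
a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
noisy = a_bar.sqrt() * batch + (1 - a_bar).sqrt() * noise

pred = unet(noisy, t)              # model predicts the added noise
loss = F.mse_loss(pred, noise)
loss.backward()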