I'm currently playing around with LDMs on the Fashion-MNIST dataset. I thought the VQ-VAEs used in the original paper were a bit overkill for what I'm doing (and I don't fully understand how they construct the discretized codebook latent space), so I went with a simple convolutional autoencoder with a KL regularizer that maps to an approximately Gaussian latent space. I've trained this model a few times and verified that it reconstructs the original image inputs fairly well.
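For context, the autoencoder itself looks roughly like this (a simplified sketch, not my exact channel counts or KL weight):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagonalGaussian:
    # wraps (mu, logvar) so that encode(x).sample() works like in the training loop below
    def __init__(self, mu, logvar):
        self.mu, self.logvar = mu, logvar

    def sample(self):
        return self.mu + torch.exp(0.5 * self.logvar) * torch.randn_like(self.mu)

    def kl(self):
        return -0.5 * torch.mean(1 + self.logvar - self.mu.pow(2) - self.logvar.exp())

class KLAutoencoder(nn.Module):
    def __init__(self, latent_channels=4):
        super().__init__()
        # 1x28x28 -> latent_channels x 7x7 (channel counts are illustrative)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
            nn.Conv2d(64, 2 * latent_channels, 3, padding=1),      # -> (mu, logvar)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.ReLU(),  # 7 -> 14
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),               # 14 -> 28
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def encode(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=1)
        return DiagonalGaussian(mu, logvar)

    def decode(self, z):
        return self.decoder(z)

    def loss(self, x, kl_weight=1e-6):
        posterior = self.encode(x)
        recon = self.decode(posterior.sample())
        # reconstruction loss plus a small KL penalty towards N(0, I)
        return F.mse_loss(recon, x) + kl_weight * posterior.kl()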
I run into issues when I try to use this autoencoder with my LDM implementation. I first made sure standard diffusion directly in image space works, which it does. But with latent diffusion on top of my trained autoencoder, the training loss plateaus around ~1.0 and the decoded outputs still look like pure Gaussian noise.
As described in the paper, I sample latent vectors from the encoder part of the autoencoder and rescale them using statistics collected from the first batch of data, like so:
# encode images to latents with the frozen autoencoder (no gradients through it)
batch = autoencoder.encode(batch).sample().detach()

# rescale the embeddings to be unit variance, using the std of the first batch
if epoch == 0 and step == 0:
    print("Calculating scale factor...")
    std = batch.flatten().std()
    scale_factor = 1. / std
    cfg.scale_factor = scale_factor.item()

batch *= cfg.scale_factor
And then pretty much everything else (applying noise, calculating the loss, etc.) is the same as standard diffusion. Am I missing something, or is the latent space of my simple conv autoencoder hard to learn for some reason? Since the autoencoder is just a deterministic mapping from image space to a lower-dimensional one, I would think the LDM should be able to learn the latent space just fine. Or is there something inherent in the original paper's architecture for the encoder / codebook latent space that is important for the LDM to learn?
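For concreteness, the noising / loss step I'm referring to is just the standard epsilon-prediction objective, roughly like this (simplified sketch; unet, alphas_cumprod, and T are placeholder names for my actual model and noise schedule):

import torch
import torch.nn.functional as F

# standard DDPM-style step, now on the rescaled latents instead of images
# `unet`, `alphas_cumprod` (tensor of shape [T]) and `T` are placeholder names
t = torch.randint(0, T, (batch.shape[0],), device=batch.device)
noise = torch.randn_like(batch)
a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
noisy = a_bar.sqrt() * batch + (1 - a_bar).sqrt() * noise

pred = unet(noisy, t)              # model predicts the added noise
loss = F.mse_loss(pred, noise)
loss.backward()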