
To generate a synthetic dataset using a trained VAE, there is confusion between two approaches:

  1. Use the learned latent space: z = mu + (eps * log_var) to generate (theoretically, infinite amounts of) data. Here, we learn the mu and log_var vectors from the data, and eps is sampled from a multivariate standard Gaussian distribution.

  2. Sample z directly from a multivariate standard Gaussian distribution, N(0, I).

I am leaning more towards point 1, since we learn the mu and log_var vectors from our dataset, whereas point 2 uses the uninformative prior, which contains no particular information about the dataset.
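To make the two options concrete, here is a minimal PyTorch-style sketch; the names ```encoder```, ```decoder```, ```x```, and ```latent_dim``` are hypothetical placeholders, not taken from my linked code:

```
import torch

# Hypothetical objects: a trained `encoder` mapping x -> (mu, log_var),
# a trained `decoder` mapping z -> x_hat, a batch of real inputs `x`,
# and the latent size `latent_dim`.

# Approach 1: sample z from the posterior q(z|x) of real data points.
# Note the reparameterization uses sigma = exp(0.5 * log_var), not log_var itself.
mu, log_var = encoder(x)
eps = torch.randn_like(mu)
z_posterior = mu + eps * torch.exp(0.5 * log_var)
x_gen_posterior = decoder(z_posterior)

# Approach 2: sample z directly from the prior p(z) = N(0, I).
z_prior = torch.randn(x.shape[0], latent_dim)
x_gen_prior = decoder(z_prior)
```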

One of the purposes of a VAE is to learn this "unknown" latent-space distribution by constraining it to approximate a multivariate standard Gaussian distribution, while at the same time allowing it sufficient flexibility to deviate from it.
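For reference, this constraint is the KL term in the standard VAE objective (the ELBO), which trades reconstruction quality against how close each posterior stays to the standard Gaussian prior:

$$\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,\mathcal{N}(0, I)\big)$$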

What are your thoughts? I have implemented some VAE and Conditional VAE code in both TensorFlow 2 and PyTorch, which you can refer to here.

Arun
  • Is log_var the log variance of the empirical latent distribution (I'll denote this $\log \sigma_E^2$)? And if so, shouldn't you calculate $z = \epsilon \sigma_E + \mu$ to give $z$ the distribution $N(\mu, \sigma_E^2)$? – Lee Reeves Jun 10 '22 at 14:09

2 Answers


A few more clarifications. While the correct thing to do is to draw from the prior, we have no guarantee that the aggregated posterior will cover the prior. Think of the aggregated posterior as the distribution of the latent variables for your dataset (see here for a nice explanation and visualization). Our hope is that it will be like the prior, but in practice we often get a mismatch between the prior and the aggregated posterior. In this case, sampling from the prior might fail because part of it is not covered by the aggregated posterior. This can be addressed in various ways, such as learning the prior or computing the aggregated posterior after training.
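As an illustration, here is a rough sketch of the "compute the aggregated posterior after training" option, assuming a trained ```encoder``` and a ```data_loader``` over the training set (both names are placeholders):

```
import torch

# The aggregated posterior is the mixture q(z) = (1/N) * sum_i q(z|x_i),
# so sampling from it means picking a random training example and
# reparameterizing from its posterior.
mus, log_vars = [], []
with torch.no_grad():
    for x, _ in data_loader:
        mu, log_var = encoder(x)
        mus.append(mu)
        log_vars.append(log_var)
mus, log_vars = torch.cat(mus), torch.cat(log_vars)

idx = torch.randint(len(mus), (1,))
eps = torch.randn_like(mus[idx])
z = mus[idx] + eps * torch.exp(0.5 * log_vars[idx])
```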


Maybe there's a misconception: we are not learning a single mu and log_var but a mapping (the encoder) from an image to mu and log_var. This is quite different, because mu and log_var are not two fixed vectors for the whole dataset but are computed separately for each image.

In a similar fashion, the decoder is a learned mapping from the latent space (with prior $N(0,I)$) back to the image space.

Essentially, the encoder takes the image as input and spits out the parameters of another Gaussian (the posterior). This means that during training the input of the decoder is conditioned on the image. Let's take MNIST as an example. We hope that after training the encoder has learned to output similar mu and log_var for similar digits, and that the decoder has learned to decode noise from a posterior into a specific digit.

For example, with a 1-dimensional latent, what we hope for is something like this:

Input digit 0 --> Encoder gives mu 0.1 log_var 0.3
Input digit 0 --> Encoder gives mu 0.2 log_var 0.2
Input digit 1 --> Encoder gives mu 1.4 log_var 0.2
Input digit 1 --> Encoder gives mu 1.5 log_var 0.1
...
Input digit 9 --> Encoder gives mu -4.5 log_var 0.3

This blog post has a nice visualization with 2D latents.
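For instance, a rough sketch of that kind of visualization, assuming a trained ```encoder``` with a 2-dimensional latent and an MNIST-style ```data_loader``` (both hypothetical names):

```
import torch
import matplotlib.pyplot as plt

# Plot the posterior means of the whole dataset, coloured by digit label.
zs, labels = [], []
with torch.no_grad():
    for x, y in data_loader:
        mu, _ = encoder(x)
        zs.append(mu)
        labels.append(y)
zs, labels = torch.cat(zs).numpy(), torch.cat(labels).numpy()

plt.scatter(zs[:, 0], zs[:, 1], c=labels, cmap="tab10", s=2)
plt.colorbar(label="digit")
plt.show()
```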

If we didn't have the encoder, we would always draw noise from the same $N(0,I)$ Gaussian. This could also work, but then we'd need a different training technique, like in GANs.

At test time, we often want to draw a sample from the whole data distribution, and for that reason we should use the prior $N(0,I)$. If for some reason you want to condition the output to look like a specific sample, then you can use the posterior. For example, if you only want digits of 1, you can pass an image of a 1 through the encoder and then use the resulting mu and log_var to draw samples.
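As a sketch of that last case (again with hypothetical names: a trained ```encoder```/```decoder``` and ```x_one```, a batch holding an image of the digit 1):

```
import torch

# Each draw from the posterior of `x_one` gives a different variation of that digit.
with torch.no_grad():
    mu, log_var = encoder(x_one)
    sigma = torch.exp(0.5 * log_var)
    variations = [decoder(mu + torch.randn_like(mu) * sigma) for _ in range(10)]
```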

So the question is: do you want a sample from the whole distribution? Then use the prior.

sfotiadis
  • I agree that if you always draw noise from N(0, I), it's akin to a vanilla GAN, which does not model the latent space at all. In a VAE, we are updating the posterior q(z|X) (q = encoder). During training, if you increase the weight for the reconstruction loss, the KL-divergence loss keeps increasing. Now, after training, if you sample from N(0, I), you are not using the learned posterior but the prior, which will be _far_ from it. Consequently, your reconstructions look "fuzzy". Only wanting to produce the digit 1 is a Conditional VAE, which is conceptually different. So, shouldn't I use the posterior instead of N(0, I)? – Arun Jun 15 '22 at 06:58
  • You don't need to draw from the learned posterior because you're not trying to reconstruct a specific x sample but rather to probe the distribution $p(x)$ you have learned. After training the VAE, the expected $q(z|x)$ over all $x$ should resemble $p(z)=N(0,I)$. You can see that as $p(z)=\int p(z|x)p(x)\,dx$, and if $p(z|x)$ and $q(z|x)$ are close, this should be the case for $q(z|x)$ too. In lower dimensions you can "fill" $N(0,I)$ with a reasonable number of samples, but in higher dimensions this number grows exponentially and your model can't cover the whole space, so you get blurry images. – sfotiadis Jun 16 '22 at 08:51
  • Many VAE implementations have ```total_loss = (alpha * recon_loss) + (beta * kl_loss)```. To get better reconstructions, ```alpha``` is tuned up (say, alpha = 300); then q(z|x) ends up _far_ from p(z), and sampling from p(z) is suboptimal since the learned q(z|x) gave more weight to recon_loss and deviated away from p(z). Therefore, in such cases, sampling from p(z) is not the right way. Correct me if I am wrong? – Arun Jun 16 '22 at 13:26
  • What you try to optimize in variational inference (and hence in VAEs) is the ELBO, the Evidence Lower Bound ("evidence" meaning the likelihood of the data). The problem is that this is only a heuristic. Also, due to your model assumptions (i.e., Gaussian latents and decoder) and training simplifications (stochastic estimation of the ELBO per batch, not really sampling from the decoder but using its moments), it is really unlikely that your q(z|x) will be anything close to the real p(z|x). – sfotiadis Jun 16 '22 at 15:24
  • Now, in models like the beta-VAE, there is a weighting of the ELBO terms. But if you think about it, putting a weight in front of the reconstruction loss is like saying you want your Gaussian decoder to have a smaller variance. So your model loses some "imagination" (variance). This essentially only affects your decoder assumptions, not how you should sample. – sfotiadis Jun 16 '22 at 15:28

I think method 1 will provide the best output.

Approximating the empirical distribution of $z$ should provide decoder inputs in the subset of latent space that the decoder was trained on.

Sampling from $N(0,I)$ could undersample or omit some regions of the true distribution, oversample others, and even provide inputs to the decoder that it isn't trained for (and neural networks aren't usually good at extrapolation).
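One quick (and rough) way to check this mismatch, assuming a trained ```encoder``` and a ```data_loader``` (hypothetical names), is to compare the per-dimension statistics of the posterior means with the zero mean and unit variance of the prior:

```
import torch

# Rough check: a fuller estimate of the aggregate variance would also add
# the average posterior variance E[sigma^2] to the variance of the means.
mus = []
with torch.no_grad():
    for x, _ in data_loader:
        mu, _ = encoder(x)
        mus.append(mu)
mus = torch.cat(mus)

print("per-dimension mean:", mus.mean(dim=0))  # far from 0 => mismatch
print("per-dimension var: ", mus.var(dim=0))   # far from 1 => mismatch
```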

Lee Reeves
  • I think so too, but I was told that method #2 is the _standard way_, which I have trouble with, since this is just the prior and has no knowledge of our training data. Can you help me prove this? – Arun Jun 10 '22 at 22:05
  • Yeah, I saw that thread on reddit. I don't know what "the standard way" is, or how one would even find out. Take a poll perhaps? Neither the original VAE paper nor the $\beta$-VAE paper seems to specify the best way to generate images. The latter does say: "The most informative latent units $z_m$ of $\beta$-VAE have the highest KL divergence from the unit Gaussian prior", confirming at least that the posterior distribution is not $N(0,I)$ and that the difference matters. (https://www.deepmind.com/publications/beta-vae-learning-basic-visual-concepts-with-a-constrained-variational-framework) – Lee Reeves Jun 11 '22 at 00:18
  • Perhaps you could investigate this experimentally by finding a latent variable (a single dimension) whose distribution is far from standard normal and comparing the generated outputs when that single latent variable is drawn from each of the two distributions. – Lee Reeves Jun 11 '22 at 00:27
  • Nice find, the DeepMind publication! I have been doing this, and you can find the code [here](https://github.com/arjun-majumdar/Synthetic_Data/blob/main/VAE_LSTM-Synthetic_Time_Series_Generation-Household_power_consumption.ipynb), where the ```generate_new_data(model, x, n_past, n_features, n_future, random_noise = 1)``` function uses 3 different ways to generate data. The question is more about the __theoretical justification__. – Arun Jun 11 '22 at 07:06
  • I think the theoretical justification would go along the lines we've discussed: it's better to sample from the true posterior distribution, which is not $N(0,I)$; sampling from $N(0,I)$ is likely to oversample some regions of the true distribution and undersample others. – Lee Reeves Jun 11 '22 at 10:46
  • I haven't read _Variational inference & deep learning: A new synthesis_, the PhD thesis of Kingma, D.P., which should hopefully throw more light on this. But I agree with you. Say you have two datasets: of course, N(0, I) does not care about these two different datasets, whereas the learned latent spaces for the two would contain relevant information, and sampling from these posteriors is way better than sampling from the N(0, I) prior. – Arun Jun 11 '22 at 11:26