Caveat: importance sampling doesn't actually work for variational auto-encoders, but the question makes sense regardless

In "L4 Latent Variable Models (VAE) -- CS294-158-SP20 Deep Unsupervised Learning", we see the following.

We want to optimize the following:

$$ \sum_i\log\sum_zp_z(z)p_\theta(x^{(i)}|z) $$

But this is hard, since we cannot enumerate over all possible $z$. So, we could use random sampling instead:

$$ \sum_i\log\frac{1}{K}\sum_{k=1}^Kp_\theta(x^{(i)}|z_k^{(i)}) $$

where $z_k^{(i)} \sim p_z(z)$.

However, this doesn't work because for most values of $z_k^{(i)}$, $p_\theta$ (and its gradient) will be near 0. For example, consider a 100-dimensional binary latent variable space. The probability you sample the correct $z$ to produce the given $x^{(i)}$ is $0.5^{100}$.
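To make the failure mode concrete, here is a minimal sketch (not from the lecture) of the naive estimator on a toy model with a binary latent space. The likelihood function, dimensionality, and sample count are all made up for illustration; the "correct" code `z_true` stands in for the latent that actually explains $x^{(i)}$.

```python
# Naive Monte Carlo estimate of p(x) = sum_z p_z(z) p_theta(x|z)
# on a toy model with a D-dimensional binary latent variable.
import numpy as np

rng = np.random.default_rng(0)

D = 20           # latent dimensionality (0.5**20 is already tiny)
K = 10_000       # number of Monte Carlo samples
z_true = rng.integers(0, 2, size=D)   # the latent code that "explains" x

def p_x_given_z(z):
    # Toy likelihood: appreciable only near z_true, decaying with Hamming distance.
    hamming = np.sum(z != z_true)
    return np.exp(-5.0 * hamming)

# Sample z_k ~ p_z(z), the uniform prior over {0, 1}^D
z_samples = rng.integers(0, 2, size=(K, D))
likelihoods = np.array([p_x_given_z(z) for z in z_samples])

print("fraction of samples with non-negligible likelihood:",
      np.mean(likelihoods > 1e-6))
print("naive estimate of p(x):", likelihoods.mean())
```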

To counteract this, we can use importance sampling:

$$ \sum_i\log\frac{1}{K}\sum_{k=1}^K\frac{p_z\left(z_k^{(i)}\right)}{q\left(z_k^{(i)}\right)}p_\theta(x^{(i)}|z_k^{(i)}) $$

where $q(z) = p_\theta(z|x^{(i)})$ (i.e., $q$ is good at predicting which latent variable represents the sample $x^{(i)}$).

Of course there are problems with calculating $q$, but let's ignore that for now.
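Continuing the toy setup above, here is a sketch of the importance-sampled estimator with a hand-made proposal $q$ that concentrates near `z_true` (standing in for $p_\theta(z|x^{(i)})$). Again, every name and number here is illustrative, not the lecture's code.

```python
# Importance-sampled estimate of p(x) using a proposal q(z) that puts
# most of its mass near z_true. Both p_z and q factorize over dimensions.
import numpy as np

rng = np.random.default_rng(0)

D = 20
K = 10_000
z_true = rng.integers(0, 2, size=D)

def p_x_given_z(z):
    hamming = np.sum(z != z_true)
    return np.exp(-5.0 * hamming)

# Proposal q: each bit of z matches z_true with probability 0.95
q_match = 0.95
matches = rng.random((K, D)) < q_match
z_samples = np.where(matches, z_true, 1 - z_true)

# Per-sample log densities under the prior (uniform) and the proposal
log_p_z = np.full(K, D * np.log(0.5))
log_q_z = (matches * np.log(q_match) + (~matches) * np.log(1 - q_match)).sum(axis=1)

weights = np.exp(log_p_z - log_q_z)                  # p_z(z_k) / q(z_k)
likelihoods = np.array([p_x_given_z(z) for z in z_samples])

print("importance-sampled estimate of p(x):", np.mean(weights * likelihoods))
```

Both sketches target the same quantity $\sum_z p_z(z)p_\theta(x^{(i)}|z)$; the proposal only changes where the $K$ samples land.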

The thing I don't understand is this: sure, $q$ might be good at proposing values of $z$ for which $p_\theta(x^{(i)}|z)$ is non-zero, but we are scaling all of those terms by $p_z(z_k^{(i)})$, which will be very small. So the end result is that the final sum will still be very small for a given $x^{(i)}$. (I think) this means that when back-propagating through $p_\theta(x^{(i)}|z_k^{(i)})$, the gradient update will be very small, which is exactly the problem we were trying to avoid in the first place.

Can someone explain to me what I'm missing?
