Caveat: importance sampling doesn't actually work for variational auto-encoders, but the question makes sense regardless
In "L4 Latent Variable Models (VAE) -- CS294-158-SP20 Deep Unsupervised Learning", we see the following.
We want to optimize the following:
$$ \sum_i\log\sum_zp_z(z)p_\theta(x^{(i)}|z) $$
But this is hard, since we cannot enumerate all possible values of $z$. So we could instead approximate the inner sum by sampling:
$$ \sum_i\log\frac{1}{K}\sum_{k=1}^Kp_\theta(x^{(i)}|z_k^{(i)}) $$
where $z_k^{(i)} \sim p_z(z)$.
However, this doesn't work because for most sampled values of $z_k^{(i)}$, $p_\theta(x^{(i)}|z_k^{(i)})$ (and its gradient) will be near $0$. For example, consider a 100-dimensional binary latent variable space: the probability of sampling the correct $z$ to produce a given $x^{(i)}$ is $0.5^{100}$.
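To convince myself of this, here is a minimal numpy sketch of the naive estimator. The toy linear-Gaussian "decoder", the dimensions, and all names (`W`, `sigma`, `log_p_x_given_z`, ...) are my own illustration, not the lecture's model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "decoder": p(x | z) = N(x; W z, sigma^2 I).
d_z, d_x, sigma = 20, 20, 0.1
W = rng.normal(size=(d_x, d_z))

def log_p_x_given_z(x, z):
    # log N(x; W z, sigma^2 I), evaluated for each row of z
    resid = x - z @ W.T                                   # shape (K, d_x)
    return (-0.5 * np.sum(resid**2, axis=-1) / sigma**2
            - 0.5 * d_x * np.log(2 * np.pi * sigma**2))

# One observation generated from a "true" latent
z_true = rng.normal(size=(1, d_z))
x = (z_true @ W.T + sigma * rng.normal(size=(1, d_x)))[0]

# Naive Monte Carlo: draw z_k ~ p_z(z) = N(0, I) and average p(x | z_k)
K = 10_000
z_prior = rng.normal(size=(K, d_z))
log_p = log_p_x_given_z(x, z_prior)                       # log p(x | z_k)

print("largest log p(x|z_k) among prior samples:", log_p.max())
# Even the largest term is hugely negative, so exp(.) underflows: every term
# in the average is numerically zero and the estimate of log p(x) is useless.
print("naive estimate of log p(x):", np.log(np.mean(np.exp(log_p))))
```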
To counteract this, we can use importance sampling:
$$ \sum_i\log\frac{1}{K}\sum_{k=1}^K\frac{p_z\left(z_k^{(i)}\right)}{q\left(z_k^{(i)}\right)}p_\theta(x^{(i)}|z_k^{(i)}) $$
where $q(z) = p_\theta(z|x^{(i)})$, i.e. $q$ is good at predicting which latent variable produced the sample $x^{(i)}$.
Of course there are problems with calculating $q$, but let's ignore that for now.
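Just to make sure I'm reading the estimator correctly, here is the importance-sampled version for the same kind of toy linear-Gaussian model (again my own construction, not the lecture's). Because the model is linear-Gaussian, the exact posterior is available in closed form and can serve as the proposal $q$, and the exact $\log p(x)$ is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-Gaussian "decoder" (my own illustration):
# prior z ~ N(0, I), likelihood p(x | z) = N(x; W z, sigma^2 I).
d_z, d_x, sigma = 20, 20, 0.1
W = rng.normal(size=(d_x, d_z))
z_true = rng.normal(size=(1, d_z))
x = (z_true @ W.T + sigma * rng.normal(size=(1, d_x)))[0]

def log_gauss(v, mean, cov):
    # log N(v; mean, cov) for a batch of row vectors v
    d = v.shape[-1]
    diff = v - mean
    sol = np.linalg.solve(cov, diff.T).T
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (np.sum(diff * sol, axis=-1) + logdet + d * np.log(2 * np.pi))

# For this model the exact posterior p(z | x) is Gaussian, so we can use it
# as the proposal q (the ideal choice described in the lecture).
post_cov = np.linalg.inv(np.eye(d_z) + W.T @ W / sigma**2)
post_mean = post_cov @ W.T @ x / sigma**2

K = 10_000
z_q = rng.multivariate_normal(post_mean, post_cov, size=K)    # z_k ~ q(z)

log_prior = log_gauss(z_q, np.zeros(d_z), np.eye(d_z))        # log p_z(z_k)
log_q     = log_gauss(z_q, post_mean, post_cov)               # log q(z_k)
resid = x - z_q @ W.T
log_lik = (-0.5 * np.sum(resid**2, axis=-1) / sigma**2
           - 0.5 * d_x * np.log(2 * np.pi * sigma**2))        # log p(x | z_k)

# log of (1/K) sum_k  p_z(z_k)/q(z_k) * p(x | z_k), computed in log space
log_w = log_prior - log_q + log_lik
m = log_w.max()
is_estimate = m + np.log(np.mean(np.exp(log_w - m)))

# Exact log p(x) for this model: x ~ N(0, W W^T + sigma^2 I)
exact = log_gauss(x[None, :], np.zeros(d_x), W @ W.T + sigma**2 * np.eye(d_x))[0]
print("importance-sampled log p(x):", is_estimate)
print("exact log p(x):             ", exact)
```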
What I don't understand is this: sure, $q$ might be good at proposing values of $z$ for which $p_\theta(x^{(i)}|z)$ is non-negligible, but each of those terms is scaled by $p_z(z_k^{(i)})$, which will be very small. So the final sum will still be very small for a given $x^{(i)}$, and (I think) this means that when back-propagating through $p_\theta(x^{(i)}|z_k^{(i)})$, the gradient update will be tiny, which is exactly the problem we were trying to avoid in the first place.
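To put a number on this (using the 100-dimensional binary latent and uniform prior from the example above):
$$ p_z\left(z_k^{(i)}\right) = 0.5^{100} \approx 7.9\times10^{-31}, $$
so it looks to me as if every term in the importance-weighted sum gets multiplied by something astronomically small.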
Can someone explain to me what I'm missing?