Caveat: importance sampling doesn't actually work for variational auto-encoders, but the question makes sense regardless
In "L4 Latent Variable Models (VAE) -- CS294-158-SP20 Deep Unsupervised Learning", we see the following.
We want to optimize the following:
$$ \sum_i\log\sum_zp_z(z)p_\theta(x^{(i)}|z) $$
But this is hard, since we cannot enumerate all possible values of $z$. So we could instead approximate the inner sum by sampling:
$$ \sum_i\log\frac{1}{K}\sum_{k=1}^Kp_\theta(x^{(i)}|z_k^{(i)}) $$
where $z_k^{(i)} \sim p_z(z)$.
However, this doesn't work because for most sampled values of $z_k^{(i)}$, $p_\theta(x^{(i)}|z_k^{(i)})$ (and its gradient) will be near $0$. For example, consider a 100-dimensional binary latent variable space: the probability of sampling the correct $z$ to produce a given $x^{(i)}$ is $0.5^{100}$.
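To convince myself of this, here is a minimal numpy sketch of the naive estimator. The toy linear-Gaussian "decoder", the dimensions, and all names (`W`, `sigma`, `log_p_x_given_z`, ...) are my own illustration, not the lecture's model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "decoder": p(x | z) = N(x; W z, sigma^2 I).
d_z, d_x, sigma = 20, 20, 0.1
W = rng.normal(size=(d_x, d_z))

def log_p_x_given_z(x, z):
    # log N(x; W z, sigma^2 I), evaluated for each row of z
    resid = x - z @ W.T                                   # shape (K, d_x)
    return (-0.5 * np.sum(resid**2, axis=-1) / sigma**2
            - 0.5 * d_x * np.log(2 * np.pi * sigma**2))

# One observation generated from a "true" latent
z_true = rng.normal(size=(1, d_z))
x = (z_true @ W.T + sigma * rng.normal(size=(1, d_x)))[0]

# Naive Monte Carlo: draw z_k ~ p_z(z) = N(0, I) and average p(x | z_k)
K = 10_000
z_prior = rng.normal(size=(K, d_z))
log_p = log_p_x_given_z(x, z_prior)                       # log p(x | z_k)

print("largest log p(x|z_k) among prior samples:", log_p.max())
# Even the largest term is hugely negative, so exp(.) underflows: every term
# in the average is numerically zero and the estimate of log p(x) is useless.
print("naive estimate of log p(x):", np.log(np.mean(np.exp(log_p))))
```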
To counteract this, we can use importance sampling:
$$ \sum_i\log\frac{1}{K}\sum_{k=1}^K\frac{p_z\left(z_k^{(i)}\right)}{q\left(z_k^{(i)}\right)}p_\theta(x^{(i)}|z_k^{(i)}) $$
where $q(z) = p_\theta(z|x^{(i)})$, i.e. $q$ is good at predicting which latent variable produced the sample $x^{(i)}$.
Of course there are problems with calculating $q$, but let's ignore that for now.
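Just to make sure I'm reading the estimator correctly, here is the importance-sampled version for the same kind of toy linear-Gaussian model (again my own construction, not the lecture's). Because the model is linear-Gaussian, the exact posterior is available in closed form and can serve as the proposal $q$, and the exact $\log p(x)$ is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-Gaussian "decoder" (my own illustration):
# prior z ~ N(0, I), likelihood p(x | z) = N(x; W z, sigma^2 I).
d_z, d_x, sigma = 20, 20, 0.1
W = rng.normal(size=(d_x, d_z))
z_true = rng.normal(size=(1, d_z))
x = (z_true @ W.T + sigma * rng.normal(size=(1, d_x)))[0]

def log_gauss(v, mean, cov):
    # log N(v; mean, cov) for a batch of row vectors v
    d = v.shape[-1]
    diff = v - mean
    sol = np.linalg.solve(cov, diff.T).T
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (np.sum(diff * sol, axis=-1) + logdet + d * np.log(2 * np.pi))

# For this model the exact posterior p(z | x) is Gaussian, so we can use it
# as the proposal q (the ideal choice described in the lecture).
post_cov = np.linalg.inv(np.eye(d_z) + W.T @ W / sigma**2)
post_mean = post_cov @ W.T @ x / sigma**2

K = 10_000
z_q = rng.multivariate_normal(post_mean, post_cov, size=K)    # z_k ~ q(z)

log_prior = log_gauss(z_q, np.zeros(d_z), np.eye(d_z))        # log p_z(z_k)
log_q     = log_gauss(z_q, post_mean, post_cov)               # log q(z_k)
resid = x - z_q @ W.T
log_lik = (-0.5 * np.sum(resid**2, axis=-1) / sigma**2
           - 0.5 * d_x * np.log(2 * np.pi * sigma**2))        # log p(x | z_k)

# log of (1/K) sum_k  p_z(z_k)/q(z_k) * p(x | z_k), computed in log space
log_w = log_prior - log_q + log_lik
m = log_w.max()
is_estimate = m + np.log(np.mean(np.exp(log_w - m)))

# Exact log p(x) for this model: x ~ N(0, W W^T + sigma^2 I)
exact = log_gauss(x[None, :], np.zeros(d_x), W @ W.T + sigma**2 * np.eye(d_x))[0]
print("importance-sampled log p(x):", is_estimate)
print("exact log p(x):             ", exact)
```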
What I don't understand is this: sure, $q$ might be good at proposing values of $z$ for which $p_\theta(x^{(i)}|z)$ is non-negligible, but each of those terms is scaled by $p_z(z_k^{(i)})$, which will be very small. So the final sum will still be very small for a given $x^{(i)}$, and (I think) this means that when back-propagating through $p_\theta(x^{(i)}|z_k^{(i)})$, the gradient update will be tiny, which is exactly the problem we were trying to avoid in the first place.
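To put a number on this (using the 100-dimensional binary latent and uniform prior from the example above):
$$ p_z\left(z_k^{(i)}\right) = 0.5^{100} \approx 7.9\times10^{-31}, $$
so it looks to me as if every term in the importance-weighted sum gets multiplied by something astronomically small.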
Can someone explain to me what I'm missing?