
I am confused as to when to hold certain parameters constant in a VAE. I will explain with a concrete example.

We can write $\operatorname{ELBO}(\phi, \theta) = \mathbb{E}_{q_{\phi}(z)}\left[\log p_{\theta}(x|z)\right] - D_{\operatorname{KL}}[q_{\phi}(z) \,\|\, p(z)]$, and we wish to find $\nabla_{\phi, \theta}\operatorname{ELBO}(\phi, \theta)$. Taking the gradient of the KL term is easy, since it can be computed analytically.
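
For concreteness, under the usual assumptions of a diagonal-Gaussian encoder $q_{\phi}(z) = \mathcal{N}\left(\mu_{\phi}(x), \operatorname{diag}\left(\sigma_{\phi}^{2}(x)\right)\right)$ and a standard-normal prior $p(z) = \mathcal{N}(0, I)$, the KL term has the closed form $$ D_{\operatorname{KL}}[q_{\phi}(z) \,\|\, p(z)] = \frac{1}{2}\sum_{j=1}^{J}\left(\mu_{j}^{2} + \sigma_{j}^{2} - \log \sigma_{j}^{2} - 1\right), $$ where $J$ is the dimensionality of $z$; this is an explicit, differentiable function of $\phi$ through $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$.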

My issue is with the gradient $\nabla_{\phi, \theta}\mathbb{E}_{q_{\phi}(z)}\left[\log p_{\theta}(x|z)\right]$. I am assuming that the expectation is intractable, so we use a Monte Carlo (MC) approximation and instead compute $$ \nabla_{\phi, \theta}\left(\frac{1}{L}\sum_{l=1}^{L}\log p_{\theta}\left(x|z^{(l)}\right)\right), $$ where we use $L$ samples $z^{(l)} \sim q_{\phi}(z)$ for the MC approximation. From my understanding, the gradient of this term w.r.t. $\phi$ should be non-zero, since changing $\phi$ changes $z$, which changes the term above. However, when I look at derivations of gradient estimators such as the score function estimator, I see that the term $\log P_{\theta}(x, h)$ is treated as a constant w.r.t. $\phi$. Example from Appendix B of the linked paper:

$$ \nabla_{\phi}\sum_{h}Q_{\phi}(h|x)\log P_{\theta}(x, h) = \sum_{h}\log P_{\theta}(x, h)\,\nabla_{\phi}Q_{\phi}(h|x) $$

I am not sure how to reconcile these two views. One potential explanation is that the latent $h$ is being treated as fixed in the above equation; however, I don't see why this should be the case, since $h$ is a function of the parameters $\phi$, so changing $\phi$ would in turn change $h$ and thus the value of $\log P_{\theta}(x, h)$?
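
Writing the score-function estimator out in full (as I understand it), the chain of equalities is $$ \nabla_{\phi}\sum_{h}Q_{\phi}(h|x)\log P_{\theta}(x, h) = \sum_{h}\log P_{\theta}(x, h)\,\nabla_{\phi}Q_{\phi}(h|x) = \mathbb{E}_{Q_{\phi}(h|x)}\left[\log P_{\theta}(x, h)\,\nabla_{\phi}\log Q_{\phi}(h|x)\right], $$ with the final expectation then approximated by samples $h^{(l)} \sim Q_{\phi}(h|x)$. It is the first step, where $\log P_{\theta}(x, h)$ passes through $\nabla_{\phi}$ untouched, that I cannot justify.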

Joel

1 Answer


When $h$ is conditioned on $x$, there is a function $Q_\phi$ that maps $x$ onto a distribution over $h$. However, when $h$ is not conditioned on $x$, there is just a prior $p(h)$, which is independent of $\theta$ and $\phi$.

This is easy to see when you factorise the joint as $P_\theta(x, h) = P_\theta(x|h)\,P(h)$.
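
In symbols (a sketch, using the question's notation): for any particular value of $h$, $$ \log P_{\theta}(x, h) = \log P_{\theta}(x|h) + \log P(h) $$ depends on $\theta$ but not on $\phi$, so in $\nabla_{\phi}\sum_{h}Q_{\phi}(h|x)\log P_{\theta}(x, h)$ the only $\phi$-dependence sits in the weight $Q_{\phi}(h|x)$; the sum itself ranges over all values of $h$, so $h$ is not a function of $\phi$ there.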

quest ions
  • Thanks, but I don't think this answers my question per se. I am not entirely sure why the mapping $Q_{\theta}$ allows the VAE to approximate arbitrary distributions using the Gaussian latent space. – Joel Aug 16 '23 at 19:38
  • The likelihood $P(x)$ is typically complex and has no closed form, but when conditioned on the latent variable $h$ it usually takes a simpler form, written $P_\theta(x|h)$. Similarly, we don't know the distribution $P(h|x)$ and instead approximate it as $Q(h|x)$. Updating $\phi$ causes $Q(h|x)$ to tend towards $P(h|x)$ by standard VAE theory, but this is limited by how flexible the mapping $Q$ is. – quest ions Aug 17 '23 at 15:20
  • Also, if this is your main question, you could make that a bit clearer in the post. – quest ions Aug 17 '23 at 15:24
  • Sorry, this is my bad; I got confused with a different question that I posted on the site, so please ignore my first comment. That said, I still don't understand your response. My issue is that when we backpropagate through the VAE, $\nabla_{\phi} P_{\theta}(x|h)$ is non-zero, since the gradient flows through the network; however, when moving the gradient inside the expectation (as in the summation from Appendix B shown in my question), we treat $P_{\theta}(x|h)$ as a constant w.r.t. $\phi$, which seems to contradict our backpropagation step. – Joel Aug 17 '23 at 16:48
  • The $h$ in $P_\theta(x|h)$ comes from a prior distribution, $p(h)$, which is independent of $\phi$. They use the form $p_\theta(x,h)$, which can make things a bit confusing, but if they factored it as $p(h|x)p(x)$ we would just get $p(x)$ back, which would be pointless. Also note that $p(h|x) \ne q(h|x)$, since $q$ is an approximation. It might also help to understand that $Q$ isn't a true mapping from $x$ to $h$; it is an approximation. – quest ions Aug 17 '23 at 17:25
  • I see. So, for example, if we are finding $\nabla_{\phi} \int_{\Omega_{h}} p_{\theta}(x|h)\, q_{\phi}(h|x)\, \mathrm{d}h$, can we treat $p_{\theta}(x|h)$ as a constant w.r.t. $\phi$ because we are integrating over the latent space $\Omega_{h}$? I don't see how this implies that we are sampling from the prior distribution, since in the actual network we sample $h$ using the parameters that the encoder $q_{\phi}(h|x)$ outputs; so I don't see how we can just assume that $p_{\theta}(x|h)$ is constant w.r.t. $\phi$, since ultimately it is a function of $h \sim q_{\phi}(h|x)$. – Joel Aug 17 '23 at 23:37
  • Ok, so when doing gradient descent, samples $z$ are taken from the distribution $q_\phi(h|x)$. This makes $z$ a constant that can't be propagated through implicitly. The other term is a function expressed explicitly in terms of $\phi$, hence the derivative can be taken there (a toy sketch contrasting this with the reparameterised gradient is appended after the comments). Also note that, although the decoder section of the VAE is trained on samples from $q(h|x)$, in practice it can be used with noise from the prior $p(h)$ as input for image synthesis. – quest ions Aug 21 '23 at 14:34
  • @Joel does this answer your question? – quest ions Aug 25 '23 at 07:58
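
To make the comment above concrete, here is a toy sketch (not from the paper or the original post; PyTorch, with $q_\phi = \mathcal{N}(\mu, \sigma^2)$, $\phi = (\mu, \log\sigma)$, and $f_\theta(z) = -(z-\theta)^2$ standing in for $\log p_\theta(x|z)$; all names are illustrative). The score-function estimator detaches the sample and differentiates only $\log q_\phi$, whereas the reparameterised estimator writes $z = \mu + \sigma\varepsilon$ and lets the gradient flow through $f$; both approximate the same gradient.

```python
# Toy comparison of two estimators of grad_mu E_{q_phi(z)}[f_theta(z)].
# Here q_phi = N(mu, sigma^2) with phi = (mu, log_sigma), and f_theta(z) = -(z - theta)^2
# stands in for log p_theta(x|z); the true gradient w.r.t. mu is 2*(theta - mu).
import torch

torch.manual_seed(0)
mu = torch.tensor(0.5, requires_grad=True)          # "encoder" parameter
log_sigma = torch.tensor(-1.0, requires_grad=True)  # "encoder" parameter
theta = torch.tensor(2.0, requires_grad=True)       # "decoder" parameter

def f(z):
    # surrogate for log p_theta(x | z)
    return -(z - theta) ** 2

L = 100_000            # Monte Carlo samples
eps = torch.randn(L)   # noise shared by both estimators

# --- Score-function (REINFORCE) estimator ---------------------------------
# The sample z is detached: f(z) is treated as a constant w.r.t. phi, and
# phi enters only through log q_phi(z), as in the Appendix B identity.
z_sf = (mu + log_sigma.exp() * eps).detach()
log_q = -0.5 * ((z_sf - mu) / log_sigma.exp()) ** 2 - log_sigma  # log N(z; mu, sigma^2) + const
surrogate = (f(z_sf).detach() * log_q).mean()
(g_mu_sf,) = torch.autograd.grad(surrogate, mu)

# --- Reparameterised estimator ---------------------------------------------
# z = mu + sigma * eps is an explicit function of phi, so the gradient flows
# through f(z) back into mu and log_sigma (the usual VAE backprop picture).
z_rp = mu + log_sigma.exp() * eps
(g_mu_rp,) = torch.autograd.grad(f(z_rp).mean(), mu)

print(g_mu_sf.item(), g_mu_rp.item())  # both approach 2*(theta - mu) = 3.0
```

In the first estimator the only place $\phi$ appears is in $\log q_\phi$, which is the algebraic counterpart of pulling $\nabla_\phi$ past $\log P_\theta(x, h)$ in the sum; in the second, the reparameterisation makes $z$ an explicit function of $\phi$, which is the "gradient flows through the network" picture from the comments.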