
I know the encoder is the variational posterior $q_{\phi}(\mathbf{z} \mid \mathbf{x})$.

I also know that the decoder represents the likelihood: $p_{\theta}(\mathbf{x} \mid \mathbf{z})$.

My question is about the prior $p(\mathbf{z})$.

I know the ELBO can be written as:

$$\mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\theta}(\mathbf{x} \mid \mathbf{z})\right] - D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right) \leq \log p_{\theta}(\mathbf{x})$$

And for the VAE, the variational posterior is

$$q_{\phi}(\mathbf{z} \mid \mathbf{x}^{(i)}) = \mathcal{N}\left(\boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)} \mathbf{I}\right),$$

and prior is

$$p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

So

$$D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right) = -\frac{1}{2}\sum_{j=1}^{J}\left(1 + \log(\sigma_j^2) - \sigma_j^2 - \mu_j^2\right)$$

That's one way I know the prior plays a role: it helps determine part of the loss function.
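To make that concrete, here is a minimal sketch (my own illustration, not from any particular source) of how the two ELBO terms are typically computed, assuming PyTorch, binary pixel data, and encoder outputs `mu` and `log_var` where `log_var` $= \log \boldsymbol{\sigma}^2$:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, log_var):
    """Per-batch negative ELBO for Gaussian q(z|x) and standard-normal prior p(z).

    x, x_recon: (batch, D) original pixels and decoder outputs (Bernoulli means).
    mu, log_var: (batch, J) encoder outputs, with log_var = log(sigma^2).
    """
    # E_q[log p(x|z)], approximated with the single reparameterized sample that produced x_recon
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) -- the term in the question
    kl = -0.5 * torch.sum(1 + log_var - log_var.exp() - mu.pow(2))
    return recon + kl

# Quick numerical check: if q(z|x) = N(0, I) exactly, the KL term is zero.
mu = torch.zeros(4, 2)
log_var = torch.zeros(4, 2)
print(-0.5 * torch.sum(1 + log_var - log_var.exp() - mu.pow(2)))  # tensor(0.)
```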

Is there any other role that the prior plays for the VAE?

1 Answer


The prior $p(z)$ is assumed as part of the problem formulation. A typical case is where $z$ is a vector of iid normal random variables. The ELBO involves a regularization term which encourages $q(z \, | \, x)$ to have a similar distribution to $p(z)$ (the way you've written it, that's the KL term). Thus $q(z \, | \, x)$ will end up having a similar shape to $p(z)$. For example, again assuming $z$ is a vector of iid normals, if you plot samples of $z$ drawn from $q(z \, | \, x)$ you will find it has a roughly spherical shape. If you scroll down to the [16] code block and look at the figure you'll see what I mean. The figure is plotting samples of $z$, colored according to what $x$ is (MNIST example). This is just some random figure I found, and I don't endorse this code, but the image is what you'd expect to see.
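To show roughly how a figure like that is produced, here is a sketch of my own (not the code from that notebook); `encoder`, `x_test`, and `y_test` are untrained stand-ins for a trained 2-D-latent encoder and the MNIST test set, so the sketch runs on its own:

```python
import torch
import matplotlib.pyplot as plt

# Stand-ins for illustration: in practice `encoder` is a trained network returning
# (mu, log_var), and x_test / y_test are MNIST test images and labels.
encoder = torch.nn.Linear(784, 4)                # first 2 outputs = mu, last 2 = log_var
x_test = torch.randn(1000, 784)
y_test = torch.randint(0, 10, (1000,))

with torch.no_grad():
    out = encoder(x_test)
    mu, log_var = out[:, :2], out[:, 2:]
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # samples from q(z | x)

plt.scatter(z[:, 0], z[:, 1], c=y_test, cmap="tab10", s=4)
plt.xlabel("z1")
plt.ylabel("z2")
plt.title("Samples of z from q(z|x), colored by digit class")
plt.show()
```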

The way we end up with a distribution $p(x, z)$ is by using the prior. We sample $z$ according to $p(z)$; we've trained the decoder $p(x \, | \, z)$, and by definition $p(x, z) = p(x \, | \, z) p(z)$.
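In code, generation via the prior looks roughly like this (a sketch with an untrained stand-in network; a real VAE would use its trained decoder here):

```python
import torch

# Stand-in for a trained decoder p(x | z) mapping a 2-D latent code to 784 Bernoulli means
decoder = torch.nn.Sequential(
    torch.nn.Linear(2, 400), torch.nn.ReLU(),
    torch.nn.Linear(400, 784), torch.nn.Sigmoid(),
)

with torch.no_grad():
    z = torch.randn(16, 2)   # z ~ p(z) = N(0, I), i.e. sampled from the prior
    x_new = decoder(z)       # parameters of p(x | z); reshape to (16, 28, 28) to view as images
```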
