
In $$\log p_{\theta}(x^1,...,x^N)=D_{KL}(q_{\theta}(z|x^i)||p_{\phi}(z|x^i))+\mathbb{L}(\phi,\theta;x^i),$$ why do $p(x^1,...,x^N)$ and $q(z|x^i)$ have the same parameter $\theta$?

Given that $p$ is just the probability of the observed data and $q$ is the approximation of the posterior, shouldn't they be different distributions and thus their parameters different?

nbro
user8714896

1 Answer


I will try to answer your questions directly; otherwise, this can become quite confusing, given the inconsistencies that can be found across different sources.

In $\log p_{\theta}(x^1,...,x^N)=D_{KL}(q_{\theta}(z|x^i)||p_{\phi}(z|x^i))+\mathbb{L}(\phi,\theta;x^i)$, why is $\theta$ a parameter for both $p$ and $q$?

In a few words, your equation is wrong because it uses the letters $\phi$ and $\theta$ inconsistently.

If you look more carefully at the right-hand side of your equation, you will notice that $q_{\theta}$ has different parameters, i.e. $\theta$, than $p_{\phi}$, which has parameters $\phi$, so $p$ and $q$ have different parameters, and this should be the case, because they are represented by different neural networks in the case of the VAE. However, the left-hand side uses $\theta$ as the parameters of $p$ (while the right-hand side uses $\phi$ to index $p$), so this should already suggest that the equation is not correct (as you correctly thought).

In the case of the VAE, $\phi$ usually represents the parameters (or weights) of the encoder neural network (NN), while $\theta$ usually represents the parameters of the decoder NN (or vice-versa; the important thing is to be consistent, which is not the case in your equation). In fact, in the VAE paper, in equation 3, the authors use $\phi$ to represent the parameters of the encoder $q$, while $\theta$ is used to denote the parameters of the decoder $p$.

So, if you follow the notation in the VAE paper, the ELBO can be written as

\begin{align} \mathcal{L}(\phi,\theta; \mathbf{x}) &= \mathbb{E}_{\tilde{z} \sim q_{\phi}(\mathbf{z} \mid \mathbf{x})} \left[ \log p_{\theta} (\mathbf{x} \mid \mathbf{z}) \right] - \operatorname{KL} \left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \| p_{\theta}(\mathbf{z}) \right) \tag{1} \label{1} \end{align}

The ELBO loss $\mathcal{L}(\phi,\theta; \mathbf{x})$ has both parameters (of the encoder and decoder), which will be optimized jointly. Note that I have ignored the indices in the observations $\mathbf{x}$ (for simplicity), while, in the VAE paper, they are present. Furthermore, note that, both in \ref{1} and in the VAE paper, we use bold letters (because these objects are usually vectors), i.e. $\mathbf{x}$ and $\mathbf{z}$, rather than $x$ and $z$ (like in your equation).
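To make the roles of $\phi$ and $\theta$ concrete, here is a minimal PyTorch-style sketch of equation \ref{1}: the encoder and decoder are two separate networks (their weights play the roles of $\phi$ and $\theta$, respectively), and a single optimizer updates both jointly. The architecture, layer sizes and names below are illustrative assumptions, not taken from the VAE paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Bernoulli, kl_divergence

x_dim, z_dim, h_dim = 784, 20, 400  # illustrative sizes

# Encoder q_phi(z | x): phi corresponds to this network's weights.
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        return Normal(self.mu(h), torch.exp(0.5 * self.log_var(h)))

# Decoder p_theta(x | z): theta corresponds to this network's weights.
class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, z):
        return Bernoulli(logits=self.net(z))

encoder, decoder = Encoder(), Decoder()
prior = Normal(torch.zeros(z_dim), torch.ones(z_dim))  # p(z) = N(0, I), no trainable parameters

def elbo(x):
    q_z = encoder(x)                             # q_phi(z | x)
    z = q_z.rsample()                            # reparameterized sample of z
    log_px = decoder(z).log_prob(x).sum(-1)      # 1-sample estimate of E_q[log p_theta(x | z)]
    kl = kl_divergence(q_z, prior).sum(-1)       # KL(q_phi(z | x) || p(z))
    return log_px - kl

# phi and theta are optimized jointly by maximizing the ELBO (minimizing its negative).
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(64, x_dim).round()                # dummy binary data, just to run the sketch
loss = -elbo(x).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that the prior is created once and has no trainable parameters, which connects to the next point.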

Note also that, even though $p_{\theta}(\mathbf{z})$ is indexed by $\theta$, in reality, this may be an un-parametrized distribution (e.g. a Gaussian with mean $0$ and variance $1$), i.e. not a family of distributions. The use of the index $\theta$ in $p_{\theta}(\mathbf{z})$ comes from the (implicit) assumption that both $p_{\theta}(\mathbf{z})$ and $p_{\theta} (\mathbf{x} \mid \mathbf{z})$ come from the same family of distributions (e.g. a family of Gaussians). In fact, if you consider the family of all Gaussian distributions, then $p_{\theta}(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ also belongs to that family. But $\theta$ and $\phi$ are also used to denote the parameters (or weights) of the networks, so this becomes understandably confusing. (To understand equation 10 of the VAE paper, see this answer.)
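As a concrete instance, when $q_{\phi}(\mathbf{z} \mid \mathbf{x})$ is a diagonal Gaussian $\mathcal{N}(\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2))$ and the prior is the fixed $\mathcal{N}(\mathbf{0}, \mathbf{I})$, the KL term in \ref{1} has the closed form given in appendix B of the VAE paper

$$\operatorname{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right),$$

where $J$ is the dimensionality of $\mathbf{z}$; notice that no parameter of the prior appears in it.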

Why do $p(x^1,...,x^N)$ and $q(z|x^i)$ have the same parameter $\theta$?

This is wrong, in fact. If you look at equation 1 of the VAE paper, they use $\theta$ to denote the parameters of $p(\mathbf{x})$, i.e. $p_{\theta}(\mathbf{x})$, while the parameters of the encoder are $\phi$, i.e. $q_{\phi}(\mathbf{z} \mid \mathbf{x})$.

Given that $p$ is just the probability of the observed data and $q$ is the approximation of the posterior, shouldn't they be different distributions and thus their parameters different?

Yes.

nbro