I don't quite understand why, in a Conditional Variational Autoencoder (CVAE), we concatenate the conditioning vector twice, once at the encoder and once at the decoder. After we concatenate it at the encoder, isn't the latent distribution already going to incorporate knowledge about the label?
I know that, from a practical perspective, we need to concatenate it at the decoder as well, since we want to be able to generate new instances for a specific label at sampling time; but I'm missing the mathematical motivation for concatenating it twice.
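For concreteness, here is a minimal PyTorch sketch of the kind of architecture I have in mind (the dimensions assume an MNIST-like setup with a one-hot label as the conditioning vector; the layer sizes and names are just illustrative):

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Minimal CVAE sketch: the conditioning vector c is concatenated
    twice, once at the encoder input and once at the decoder input."""

    def __init__(self, x_dim=784, c_dim=10, h_dim=256, z_dim=20):
        super().__init__()
        # Encoder sees [x, c] and outputs the parameters of q(z | x, c).
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        # Decoder sees [z, c] and parameterizes p(x | z, c).
        self.dec = nn.Sequential(
            nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),
        )

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=1))      # first concatenation
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        x_hat = self.dec(torch.cat([z, c], dim=1))  # second concatenation
        return x_hat, mu, logvar
```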
To be more precise, let's consider how the objective function of the CVAE is defined:
$$\mathcal{L}_{\text{CVAE}} = \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x},\mathbf{c})}\left[\log p_\theta(\mathbf{x}|\mathbf{z},\mathbf{c})\right] - D_{KL}\left[q_\phi(\mathbf{z}|\mathbf{x},\mathbf{c}) \,\|\, p(\mathbf{z}|\mathbf{c})\right]$$
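In code, with the common simplification that the prior $p(\mathbf{z}|\mathbf{c})$ is a fixed standard normal $\mathcal{N}(\mathbf{0}, I)$ independent of $\mathbf{c}$ (so the KL term has a closed form and $\mathbf{c}$ enters it only through the encoder), a sketch of this objective, reusing the model above and assuming a Bernoulli decoder likelihood, would be:

```python
import torch
import torch.nn.functional as F

def cvae_loss(x_hat, x, mu, logvar):
    """Negative ELBO (the quantity to minimize) under the assumptions
    above: standard-normal prior, Bernoulli decoder likelihood."""
    # -E_z[log p(x | z, c)]: c entered through the decoder that produced x_hat.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # D_KL[q(z | x, c) || N(0, I)]: c entered through the encoder that
    # produced mu and logvar (closed form between two Gaussians).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```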
My question: Why do we condition on $\mathbf{c}$ in both terms?