This is the analytical form of the KL divergence between two multivariate Gaussian densities with diagonal covariance matrices (i.e. the components are assumed to be independent). More precisely, it's the KL divergence between the variational distribution
$$
q_{\boldsymbol{\phi}}(\mathbf{z})
=
\mathcal{N}\left(\mathbf{z} ; \boldsymbol{\mu}, \mathbf{\Sigma} = \boldsymbol{\sigma}^{2}\mathbf{I}\right)
=
\frac{\exp \left(-\frac{1}{2}\left(\mathbf{z} - \boldsymbol{\mu}\right)^{\mathrm{T}} \mathbf{\Sigma}^{-1}\left(\mathbf{z}-\boldsymbol{\mu} \right)\right)}{\sqrt{(2 \pi)^{J}\left|\mathbf{\Sigma}\right|}}
\tag{1}\label{1}
$$
and the prior (the same density as above, but with the zero vector as mean and the identity matrix as covariance)
$$
p(\mathbf{z})=\mathcal{N}(\mathbf{z} ;\boldsymbol{0}, \mathbf{I})
=
\frac{\exp \left(-\frac{1}{2}\mathbf{z}^{\mathrm{T}}\mathbf{z}\right)}{\sqrt{(2 \pi)^{J}}}
\tag{2}\label{2}
$$
where
- $\boldsymbol{\mu} \in \mathbb{R}^J$ is the mean vector (we assume column vectors, so $\boldsymbol{\mu}^{\mathrm{T}}$ is a row vector)
- $\mathbf{\Sigma} = \boldsymbol{\sigma}^{2}\mathbf{I} \in \mathbb{R}^{J \times J}$ is a diagonal covariance matrix, whose diagonal entries are the components of the variance vector $\boldsymbol{\sigma}^{2}$
- $\mathbf{z} \in \mathbb{R}^J$ is a latent vector of dimensionality $J$: the point at which the density is evaluated or, equivalently, a sample from these Gaussians
- $\left|\mathbf{\Sigma}\right| = \operatorname{det} \mathbf{\Sigma}$ is the determinant (a scalar) of the covariance matrix; for a diagonal matrix, it is just the product of the diagonal elements, so the determinant of the identity is $1$
- $\boldsymbol{0} \in \mathbb{R}^J$ is the zero vector
- $\mathbf{I} \in \mathbb{R}^{J \times J}$ is an identity matrix
- $\mathbf{z}^{\mathrm{T}}\mathbf{z} = \sum_{i=1}^J z_i^2 \in \mathbb{R}$ is a dot product (hence a scalar)
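As a quick sanity check of equation \ref{1} (not part of the derivation), here is a small PyTorch sketch that evaluates $\log q_{\boldsymbol{\phi}}(\mathbf{z})$ with the closed form and compares it to `torch.distributions.Normal`; the concrete values of `mu`, `var` and `z` are arbitrary, and `var` plays the role of $\boldsymbol{\sigma}^{2}$.
```python
import math

import torch
from torch.distributions import Normal

torch.manual_seed(0)
J = 4
mu = torch.randn(J)        # mean vector
var = torch.rand(J) + 0.1  # variance vector sigma^2 (diagonal of Sigma)
z = torch.randn(J)         # point at which the density is evaluated

# Closed form of log q(z) from equation (1), using |Sigma| = prod_i sigma_ii
log_q_manual = (-0.5 * ((z - mu) ** 2 / var).sum()
                - 0.5 * J * math.log(2 * math.pi)
                - 0.5 * torch.log(var).sum())

# A diagonal Gaussian factorizes, so the joint log density is the sum of the
# univariate ones; Normal is parameterized by the standard deviation, hence var.sqrt()
log_q_torch = Normal(mu, var.sqrt()).log_prob(z).sum()

print(torch.allclose(log_q_manual, log_q_torch))  # True
```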
Now, the negative of the KL divergence can be written as follows
\begin{align}
-D_{K L}\left(q_{\boldsymbol{\phi}}(\mathbf{z}) \| p(\mathbf{z})\right)
&=
\int q_{\boldsymbol{\phi}}(\mathbf{z})\left(\log p(\mathbf{z})-\log q_{\boldsymbol{\phi}}(\mathbf{z})\right) d \mathbf{z} \\
&=
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})} \left[ \log p(\mathbf{z})-\log q_{\boldsymbol{\phi}}(\mathbf{z})\right] \label{3}\tag{3}
\end{align}
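As an aside, before evaluating this expectation analytically, note that it can be approximated by Monte Carlo, which gives a useful numerical check of the closed form derived below. This is only an illustrative sketch, with arbitrary `mu` and `var`.
```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
J, S = 4, 500_000          # latent dimensionality, number of MC samples
mu = torch.randn(J)
var = torch.rand(J) + 0.1  # variances sigma^2

q = Normal(mu, var.sqrt())                 # q_phi(z), diagonal Gaussian
p = Normal(torch.zeros(J), torch.ones(J))  # p(z) = N(0, I)

# -D_KL(q || p) = E_q[log p(z) - log q(z)], estimated with samples z ~ q
z = q.sample((S,))  # shape (S, J)
neg_kl_mc = (p.log_prob(z) - q.log_prob(z)).sum(dim=1).mean()
print(neg_kl_mc.item())
```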
Given that we have logarithms here, let's compute the logarithms of the densities in equations \ref{1} and \ref{2}
\begin{align}
\log
\left(
\mathcal{N}\left(\mathbf{z} ; \boldsymbol{\mu}, \mathbf{\Sigma} \right)
\right)
&=
\dots \\
&=
-\frac{1}{2}(\mathbf{z}-\boldsymbol{\mu})^{\mathrm{T}} \mathbf{\Sigma}^{-1}(\mathbf{z}-\boldsymbol{\mu})-\frac{J}{2} \log (2 \pi)-\frac{1}{2} \log |\mathbf{\Sigma} |
\end{align}
and
\begin{align}
\log
\left(
\mathcal{N}(\mathbf{z} ;\boldsymbol{0}, \mathbf{I})
\right)
&=
\dots \\
&=
-\frac{1}{2}\mathbf{z}^{\mathrm{T}} \mathbf{z}-\frac{J}{2} \log (2 \pi)
\end{align}
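In both cases, the elided steps are the same: the logarithm cancels the exponential and turns the normalizing constant into a sum, since $\log \sqrt{(2 \pi)^{J}\left|\mathbf{\Sigma}\right|} = \frac{J}{2} \log (2 \pi) + \frac{1}{2} \log \left|\mathbf{\Sigma}\right|$; for the prior, $\left|\mathbf{I}\right| = 1$, so the log-determinant term vanishes.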
We can now substitute these into equation \ref{3} (below, I have already performed some simplifications to reduce verbosity; they are spelled out right after, but you can also check them yourself!)
\begin{align}
-D_{K L}\left(q_{\boldsymbol{\phi}}(\mathbf{z}) \| p(\mathbf{z})\right)
&=
\frac{1}{2}
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})} \left[
-\mathbf{z}^{\mathrm{T}} \mathbf{z}
+ (\mathbf{z}-\boldsymbol{\mu})^{\mathrm{T}} \mathbf{\Sigma}^{-1}(\mathbf{z}-\boldsymbol{\mu})
+ \log |\mathbf{\Sigma} |
\right]
\tag{4}\label{4}
\end{align}
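To make the simplifications explicit: substituting the two logarithms into equation \ref{3}, the $-\frac{J}{2} \log (2 \pi)$ terms cancel,
$$
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})} \left[ \log p(\mathbf{z})-\log q_{\boldsymbol{\phi}}(\mathbf{z})\right]
=
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})} \left[
-\frac{1}{2}\mathbf{z}^{\mathrm{T}} \mathbf{z}
+\frac{1}{2}(\mathbf{z}-\boldsymbol{\mu})^{\mathrm{T}} \mathbf{\Sigma}^{-1}(\mathbf{z}-\boldsymbol{\mu})
+\frac{1}{2}\log |\mathbf{\Sigma}|
\right]
$$
and the factor $\frac{1}{2}$ then comes out by linearity of expectation, which gives equation \ref{4}.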
Now, given that $\mathbf{\Sigma}$ is diagonal and the logarithm of a product is the sum of the logarithms, we have $\log |\mathbf{\Sigma} | = \sum_{i=1}^J \log \sigma_{ii}$, where $\sigma_{ii}$ denotes the $i$-th diagonal entry of $\mathbf{\Sigma}$ (i.e. the $i$-th variance), so we can continue
\begin{align}
\frac{1}{2}
\left(
-
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}
\left[
\mathbf{z}^{\mathrm{T}} \mathbf{z}
\right]
+
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})} \left[
(\mathbf{z}-\boldsymbol{\mu})^{\mathrm{T}} \mathbf{\Sigma}^{-1}(\mathbf{z}-\boldsymbol{\mu})
\right]
+
\sum_{i=1}^J \log \sigma_{ii}
\right)
\\
&=
\frac{1}{2}
\left(
-
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}
\left[
\mathbf{z}^{\mathrm{T}} \mathbf{z}
\right]
+
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})} \left[
\operatorname{tr} \left(
\mathbf{\Sigma}^{-1}(\mathbf{z}-\boldsymbol{\mu})
(\mathbf{z}-\boldsymbol{\mu})^{\mathrm{T}}
\right)
\right]
+
\sum_{i=1}^J \log \sigma_{ii}
\right)
\\
&=
\frac{1}{2}
\left(
-
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}
\left[
\mathbf{z}^{\mathrm{T}} \mathbf{z}
\right]
+
\operatorname{tr} \left(
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})} \left[
\mathbf{\Sigma}^{-1}(\mathbf{z}-\boldsymbol{\mu})
(\mathbf{z}-\boldsymbol{\mu})^{\mathrm{T}}
\right]
\right)
+
\sum_{i=1}^J \log \sigma_{ii}
\right)
\\
&=
\frac{1}{2}
\left(
-
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}
\left[
\mathbf{z}^{\mathrm{T}} \mathbf{z}
\right]
+
\operatorname{tr} \left(
\mathbf{\Sigma}^{-1}
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})} \left[
(\mathbf{z}-\boldsymbol{\mu})
(\mathbf{z}-\boldsymbol{\mu})^{\mathrm{T}}
\right]
\right)
+
\sum_{i=1}^J \log \sigma_{ii}
\right)
\\
&=
\frac{1}{2}
\left(
-
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}
\left[
\mathbf{z}^{\mathrm{T}} \mathbf{z}
\right]
+
\operatorname{tr} \left(
\mathbf{\Sigma}^{-1}
\mathbf{\Sigma}
\right)
+
\sum_{i=1}^J \log \sigma_{ii}
\right)
\\
&=
\frac{1}{2}
\left(
-
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}
\left[
\mathbf{z}^{\mathrm{T}} \mathbf{z}
\right]
+
J
+
\sum_{i=1}^J \log \sigma_{ii}
\right)
\\
&=
\frac{1}{2}
\left(
-
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}
\left[
\operatorname{tr}
\left(
\mathbf{z} \mathbf{z}^{\mathrm{T}}
\right)
\right]
+
\sum_{i=1}^J 1
+
\sum_{i=1}^J \log \sigma_{ii}
\right)
\\
&=
\frac{1}{2}
\left(
-
\operatorname{tr}
\left(
\mathbf{\Sigma}\right)
-
\operatorname{tr}
\left(
\boldsymbol{\mu} \boldsymbol{\mu}^T
\right)
+
\sum_{i=1}^J 1
+
\sum_{i=1}^J \log \sigma_{ii}
\right)
\\
&=
\frac{1}{2}
\left(
-
\sum_{i=1}^J \sigma_{ii}
-
\sum_{i=1}^J \mu_{i}^2
+
\sum_{i=1}^J 1
+
\sum_{i=1}^J \log \sigma_{ii}
\right)
\\
&=
\frac{1}{2}
\sum_{i=1}^J
\left(
1
+
\log \sigma_{ii}
-
\sigma_{ii}
-
\mu_{i}^2
\right)
\end{align}
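Flipping the sign gives the (non-negative) KL divergence itself:
$$
D_{K L}\left(q_{\boldsymbol{\phi}}(\mathbf{z}) \| p(\mathbf{z})\right)
=
\frac{1}{2}
\sum_{i=1}^J
\left(
\sigma_{ii}
+
\mu_{i}^2
-
1
-
\log \sigma_{ii}
\right)
$$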
In the above simplifications, I also applied the following rules (the trace trick is checked numerically in the sketch below):
- the trace trick $\mathbf{x}^{\mathrm{T}} \mathbf{A} \mathbf{x} = \operatorname{tr}\left(\mathbf{A} \mathbf{x} \mathbf{x}^{\mathrm{T}}\right)$, a consequence of the cyclic property of the trace, applied to both quadratic forms
- linearity of the expectation and of the trace, which lets us swap $\mathbb{E}$ and $\operatorname{tr}$
- $\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}\left[(\mathbf{z}-\boldsymbol{\mu})(\mathbf{z}-\boldsymbol{\mu})^{\mathrm{T}}\right] = \mathbf{\Sigma}$, which is the definition of the covariance matrix
- $\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}\left[\mathbf{z} \mathbf{z}^{\mathrm{T}}\right] = \mathbf{\Sigma} + \boldsymbol{\mu} \boldsymbol{\mu}^{\mathrm{T}}$, i.e. the second moment of a Gaussian
- $\operatorname{tr}\left(\mathbf{\Sigma}^{-1}\mathbf{\Sigma}\right) = \operatorname{tr}(\mathbf{I}) = J$ and $\operatorname{tr}\left(\boldsymbol{\mu} \boldsymbol{\mu}^{\mathrm{T}}\right) = \boldsymbol{\mu}^{\mathrm{T}}\boldsymbol{\mu} = \sum_{i=1}^J \mu_i^2$
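Here is a small numerical illustration of the trace trick (with arbitrary `x` and `A`):
```python
import torch

torch.manual_seed(0)
J = 5
x = torch.randn(J)
A = torch.randn(J, J)

# x^T A x = tr(A x x^T), by the cyclic property of the trace
lhs = x @ A @ x
rhs = torch.trace(A @ torch.outer(x, x))
print(torch.allclose(lhs, rhs))  # True
```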
The official PyTorch implementation of the VAE (the VAE example in the pytorch/examples repository) also uses this formula. The formula can also be found in Appendix B of the VAE paper (Kingma and Welling, Auto-Encoding Variational Bayes), but the long proof I've just written above is not given there. Note that, in my proof above, $\sigma_{ii}$ denotes the variance, which the paper writes as $\sigma_i^2$ (as the variance is usually denoted by the square of the standard deviation $\sigma$).
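To close the loop, here is a minimal sketch (again with arbitrary `mu` and `var`, where `var` stands for the variances $\sigma_{ii}$) that implements the final formula and checks it against `torch.distributions.kl_divergence`:
```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
J = 4
mu = torch.randn(J)
var = torch.rand(J) + 0.1  # variances (the sigma_ii above)

# D_KL(q || p) = -1/2 * sum_i (1 + log(sigma_ii) - sigma_ii - mu_i^2)
kld_formula = -0.5 * torch.sum(1 + torch.log(var) - var - mu.pow(2))

# Reference value: Normal is parameterized by the standard deviation
kld_torch = kl_divergence(Normal(mu, var.sqrt()),
                          Normal(torch.zeros(J), torch.ones(J))).sum()

print(torch.allclose(kld_formula, kld_torch))  # True
```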