
I'm trying to implement a variational auto-encoder (as seen in Section 3.1 here: https://arxiv.org/pdf/2004.06271.pdf).

It differs from a traditional VAE because it encodes its input images to three-dimensional latent feature maps. In other words, the latent feature maps have a width, height and channel dimension rather than just a channel dimension like a traditional VAE.

When calculating the Kullback-Leibler divergence as part of the loss function, I need the mean and covariance that the encoder outputs. However, if the latent feature maps are three-dimensional, then the output of the encoder is three-dimensional as well, and each latent feature is therefore a 2D matrix.

How can I derive a mean and covariance from a 2D matrix to calculate the KL divergence?

1 Answer


Your three-dimensional latent representation consists of two feature maps, one of per-pixel means and one of per-pixel variances, as shown in Fig. 3. Together they represent a Gaussian distribution with a mean and variance for each pixel of the latent representation, so each pixel value is a random variable.

Now, have a close look at the KL-loss in Eq. 3 and its corresponding description in the paper:

$$\mathcal{L}_{KL} = \frac{1}{2 \times (\frac{W}{16} \times \frac{H}{16}) } \sum^M_{m = 1}[\mu^2_m + \sigma^2_m - \log(\sigma^2_m) - 1]$$

Finally, $M$ is the dimensionality of the latent features $\theta \in \mathbb{R}^M$ with mean $\mu = [\mu_1,...,\mu_M]$ and covariance matrix $\Sigma = \text{diag}(\sigma_1^2,...,\sigma_M^2)$, [...].

The covariance matrix is diagonal, so all pixel values are independent of each other. That is the reason why we have this nice analytical form for the KL divergence in Eq. 3. You can therefore treat your 2D random matrix simply as a random vector of size $M = \frac{W}{16} \times \frac{H}{16}$ ($\times 3$ if you want to include the channel dimension). The third (channel) dimension can be considered independent as well, so it can also be flattened and appended to the vector. Indeed, this is what is done in the paper, as indicated by the second half of the sentence quoted above:

that are reparameterized via sampling from a standard multivariate Gaussian $\epsilon \sim \mathcal{N}(0,I_M)$, i.e. $\theta = \mu + \Sigma^{\frac{1}{2}}\epsilon$.
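Putting the two pieces together, here is a minimal PyTorch-style sketch (my own, not code from the paper): it assumes the encoder outputs mean and log-variance feature maps of shape (B, C, H/16, W/16), flattens them into vectors, and evaluates Eq. 3 together with the reparameterization step. The function names `kl_loss` and `reparameterize` are hypothetical.

```python
import torch

def kl_loss(mu, logvar):
    """Closed-form KL divergence between N(mu, diag(sigma^2)) and N(0, I), as in Eq. 3.

    mu, logvar: encoder outputs of shape (B, C, H/16, W/16).
    Because the covariance is diagonal, the 3D feature maps can simply be
    flattened into vectors of length M = C * H/16 * W/16 before summing.
    """
    b = mu.shape[0]
    n_spatial = mu.shape[2] * mu.shape[3]            # W/16 * H/16, the normalizer in Eq. 3
    mu_flat = mu.reshape(b, -1)                      # (B, M)
    logvar_flat = logvar.reshape(b, -1)              # (B, M)
    # sum_m [ mu_m^2 + sigma_m^2 - log(sigma_m^2) - 1 ]
    kl = torch.sum(mu_flat.pow(2) + logvar_flat.exp() - logvar_flat - 1, dim=1)
    kl = kl / (2.0 * n_spatial)                      # 1 / (2 * (W/16 * H/16)) factor
    return kl.mean()                                 # average over the batch


def reparameterize(mu, logvar):
    """theta = mu + Sigma^{1/2} * eps with eps ~ N(0, I_M); element-wise, so shapes are kept."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```

Because everything is element-wise, you never have to reshape the feature maps for the reparameterization itself; only the KL term needs (or benefits from) the flattened view.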

Tinu
  • To clarify, will flattening the W/16 x H/16 matrix into a vector of size M maintain the spatial correspondence _because_ the pixel values are independent of each other? When I say spatial correspondence, I refer to this part of the paper: "we use three-dimensional latent feature maps, i.e., channel, height and width dimensions, rather than one-dimensional latent vectors with only channel dimension, for improving the reconstruction quality and preserve more spatial information." – magmacollaris Feb 15 '21 at 13:23
  • As an additional point of clarification, does this mean that each mean pixel will get its own covariance? Since the covariance is diagonal, would this mean that the covariance matrix is (H/16 x W/16) x (H/16 x W/16) in size? – magmacollaris Feb 16 '21 at 13:16