I'm trying to implement a variational autoencoder (as described in Section 3.1 of this paper: https://arxiv.org/pdf/2004.06271.pdf).
It differs from a traditional VAE in that it encodes its input images into three-dimensional latent feature maps. In other words, the latent feature maps have width, height and channel dimensions, rather than just a channel dimension as in a traditional VAE.
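For concreteness, here is a minimal PyTorch sketch of the encoder shape I'm describing (the layer sizes are placeholders I made up, not taken from the paper):

```python
import torch
import torch.nn as nn

# Hypothetical encoder: maps an image to a 3D latent feature map of
# shape (batch, latent_channels, H, W) instead of a flat (batch, latent_dim).
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 16, 1),  # latent head: 16 latent channels, spatial dims kept
)

x = torch.randn(8, 3, 64, 64)
z = encoder(x)
print(z.shape)  # torch.Size([8, 16, 16, 16]) -- each latent feature is a 16x16 matrix
```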
When calculating the Kullback-Leibler divergence as part of the loss function, I need the mean and covariance output by the encoder. However, since the latent feature maps are three-dimensional, the output of the encoder is three-dimensional as well, so each latent feature is a 2D matrix rather than a scalar.
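In a traditional VAE, where `mu` and `logvar` are flat vectors of shape `(batch, latent_dim)`, I would compute the standard closed-form KL divergence against a standard normal prior like this:

```python
def kl_divergence(mu, logvar):
    # KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over the latent dimension
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
```

I don't see how to apply this when each latent feature is a 2D matrix instead of a scalar.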
How can I derive a mean and covariance from a 2D matrix to calculate the KL divergence?