Questions tagged [kl-divergence]

For questions related to the Kullback–Leibler (KL) divergence, a measure of how one probability distribution (density or mass function) diverges from another. It is not a metric (it is sometimes called a pre-metric): it is not symmetric and does not satisfy the triangle inequality. It is commonly used in many machine learning settings, e.g. in the context of variational autoencoders (VAEs).

25 questions
19
votes
1 answer

Why has the cross-entropy become the classification standard loss function and not Kullback-Leibler divergence?

The cross-entropy is identical to the KL divergence plus the entropy of the target distribution. The KL divergence equals zero when the two distributions are the same, which seems more intuitive to me than the entropy of the target distribution,…
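The identity behind this question is $H(p, q) = H(p) + D_{KL}(p \| q)$; a quick NumPy check with made-up distributions:

```python
import numpy as np

# Two made-up discrete distributions over 4 outcomes.
p = np.array([0.1, 0.4, 0.3, 0.2])      # target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # predicted distribution

cross_entropy = -np.sum(p * np.log(q))
entropy_p = -np.sum(p * np.log(p))
kl_pq = np.sum(p * np.log(p / q))

# H(p, q) = H(p) + D_KL(p || q), so the two sides should match.
print(cross_entropy, entropy_p + kl_pq)  # both ≈ 1.3863
```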
7
votes
2 answers

How is this Pytorch expression equivalent to the KL divergence?

I found the following PyTorch code (from this link): -0.5 * torch.sum(1 + sigma - mu.pow(2) - sigma.exp()), where mu is the mean parameter output by the model and sigma is the variance parameter output by the encoder. This expression is apparently…
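For context, the snippet matches the closed-form KL between a diagonal Gaussian and $\mathcal{N}(0, I)$ when the variable it calls sigma holds the log-variance. A sketch with hypothetical values that checks the formula against torch.distributions:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(8)      # hypothetical encoder means
logvar = torch.randn(8)  # hypothetical log-variances ('sigma' in the snippet)

# Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.
kl_closed = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

# The same quantity via torch.distributions, as a sanity check.
std = (0.5 * logvar).exp()
kl_lib = kl_divergence(Normal(mu, std), Normal(0.0, 1.0)).sum()

print(kl_closed.item(), kl_lib.item())  # the two numbers agree
```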
7
votes
2 answers

Why is KL divergence used so often in Machine Learning?

The KL divergence is quite easy to compute in closed form for simple distributions (such as Gaussians), but it has some not-very-nice properties. For example, it is not symmetric (thus it is not a metric) and it does not respect the triangular…
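A short NumPy illustration of the asymmetry mentioned in this question, with made-up distributions:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q) in nats."""
    return np.sum(p * np.log(p / q))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl(p, q))  # ≈ 0.368
print(kl(q, p))  # ≈ 0.511  ->  D_KL(p||q) != D_KL(q||p)
```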
6
votes
1 answer

Why is the evidence equal to the KL divergence plus the loss?

Why is the equation $$\log p_{\theta}(x^1,...,x^N)=D_{KL}(q_{\theta}(z|x^i)||p_{\phi}(z|x^i))+\mathbb{L}(\phi,\theta;x^i)$$ true, where $x^i$ are data points and $z$ are latent variables? I was reading the original variational autoencoder paper and I…
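For reference, the per-datapoint decomposition in the VAE paper (note it places $\phi$ on $q$ and $\theta$ on $p$, the reverse of the excerpt) follows in one step from Bayes' rule applied inside the expectation:

$$\log p_{\theta}(x^{(i)}) = \underbrace{\mathbb{E}_{q_{\phi}(z|x^{(i)})}\left[\log\frac{p_{\theta}(x^{(i)},z)}{q_{\phi}(z|x^{(i)})}\right]}_{\mathcal{L}(\theta,\phi;x^{(i)})} + \underbrace{D_{KL}\left(q_{\phi}(z|x^{(i)})\,\|\,p_{\theta}(z|x^{(i)})\right)}_{\ge 0}$$

Since the KL term is non-negative, the first term (the ELBO $\mathcal{L}$) is a lower bound on the log-evidence.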
5
votes
1 answer

Why is the Jensen-Shannon divergence preferred over the KL divergence in measuring the performance of a generative network?

I have read articles on how the Jensen-Shannon divergence is preferred over the Kullback-Leibler divergence for measuring how well a distribution mapping is learned in a generative network, because the JS divergence better measures distribution similarity…
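A minimal NumPy sketch of the JS divergence, which symmetrizes KL via the mixture $m = (p+q)/2$ and stays finite even on disjoint supports:

```python
import numpy as np

def kl(p, q):
    # Treat 0 * log(0/q) as 0 to stay safe at the support boundary.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL via the mixture m."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])

# KL blows up on disjoint supports, while JS is bounded by log 2.
print(js(p, q))  # = log(2) ≈ 0.693
```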
5
votes
2 answers

What are the advantages of the Kullback-Leibler divergence over the MSE/RMSE?

I've recently encountered several articles recommending the KL divergence instead of the MSE/RMSE as the loss function when trying to learn a probability distribution, but none of them gives a clear reasoning why…
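One commonly cited reason, sketched numerically with made-up values: KL punishes putting vanishing probability on outcomes the target considers possible, while the MSE saturates.

```python
import numpy as np

target = np.array([0.5, 0.5])

for eps in [0.1, 0.01, 0.001]:
    pred = np.array([1.0 - eps, eps])
    mse = np.mean((target - pred) ** 2)
    kl = np.sum(target * np.log(target / pred))
    print(f"eps={eps}: MSE={mse:.4f}  KL={kl:.4f}")

# MSE saturates near 0.25 while KL keeps growing (-> infinity as eps -> 0):
# the KL loss keeps pushing mass onto outcomes the target says are possible.
```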
4
votes
1 answer

What is the impact of scaling the KL divergence and reconstruction loss in the VAE objective function?

Variational autoencoders have two components in their loss function. The first component is the reconstruction loss, which, for image data, is the pixel-wise difference between the input image and the output image. The second component is the…
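A minimal sketch of a $\beta$-weighted VAE objective in the style of $\beta$-VAE (hypothetical tensors; the function name and the MSE reconstruction term are my choices):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Reconstruction term plus a beta-scaled KL term. beta=1 is the plain
    VAE; beta>1 emphasizes matching the prior, beta<1 emphasizes
    reconstruction quality."""
    recon = F.mse_loss(x_recon, x, reduction="sum")  # pixel-wise difference
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```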
3
votes
1 answer

How do you calculate KL divergence on a three-dimensional space for a Variational Autoencoder?

I'm trying to implement a variational auto-encoder (as seen in Section 3.1 here: https://arxiv.org/pdf/2004.06271.pdf). It differs from a traditional VAE because it encodes its input images to three-dimensional latent feature maps. In other words,…
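Assuming a diagonal Gaussian at every latent location (my reading of the usual setup), the closed form applies elementwise; you then sum over channel and spatial dimensions. A sketch with hypothetical shapes:

```python
import torch

batch, c, h, w = 4, 16, 8, 8          # hypothetical 3D latent shape
mu = torch.randn(batch, c, h, w)      # per-location means from the encoder
logvar = torch.randn(batch, c, h, w)  # per-location log-variances

# Elementwise closed-form KL to N(0, I); sum over channels and spatial
# dimensions, then average over the batch.
kl_map = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
kl = kl_map.sum(dim=(1, 2, 3)).mean()
print(kl)
```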
3
votes
1 answer

Are there some notions of distance between two policies?

I want to determine some distance between two policies $\pi_1 (a \mid s)$ and $\pi_2 (a \mid s)$, i.e. something like $\vert \vert \pi_1 (a \mid s) - \pi_2(a \mid s) \vert \vert$, where $\pi_i (a\mid s)$ is the vector $(\pi_i (a_1 \mid s), \dots,…
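One common choice (a sketch, not the only option): the KL between the two action distributions at each state, averaged over states. Made-up tabular policies:

```python
import numpy as np

# Hypothetical tabular policies: rows are states, columns are action probs.
pi1 = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1]])
pi2 = np.array([[0.6, 0.3, 0.1],
                [0.3, 0.4, 0.3]])

# Per-state KL( pi1(.|s) || pi2(.|s) ), then a (uniform) average over states.
kl_per_state = np.sum(pi1 * np.log(pi1 / pi2), axis=1)
print(kl_per_state)         # one value per state
print(kl_per_state.mean())  # a scalar "distance" between the policies
```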
3
votes
2 answers

When should one prefer using total variation divergence over KL divergence in RL?

In RL, both the KL divergence ($D_{KL}$) and the total variation divergence ($D_{TV}$) are used to measure the distance between two policies. I'm most familiar with using $D_{KL}$ as an early-stopping metric during policy updates to ensure the new policy doesn't…
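For reference, $D_{TV}$ is half the $L^1$ distance and is controlled by $D_{KL}$ through Pinsker's inequality, $D_{TV} \le \sqrt{D_{KL}/2}$; a quick check:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / q))
tv = 0.5 * np.sum(np.abs(p - q))

# Pinsker's inequality: TV <= sqrt(KL / 2), so small KL forces small TV,
# but TV stays bounded (<= 1) even when KL blows up.
print(tv, np.sqrt(kl / 2))  # ≈ 0.300 <= 0.303
```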
3
votes
1 answer

What is the reason for mode collapse in GAN as opposed to WGAN?

In this article I am reading: $D_{KL}$ gives us infinity when the two distributions are disjoint. The value of $D_{JS}$ has a sudden jump and is not differentiable at $\theta=0$. Only the Wasserstein metric provides a smooth measure, which is super helpful for a…
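A toy illustration of the article's point, using point masses at $0$ and $\theta$ (disjoint supports whenever $\theta \neq 0$) and scipy's empirical 1-Wasserstein distance:

```python
import numpy as np
from scipy.stats import wasserstein_distance

for theta in [2.0, 1.0, 0.5, 0.0]:
    # KL( delta_0 || delta_theta ) is +inf whenever the supports are
    # disjoint (theta != 0), and 0 at theta = 0.
    kl = 0.0 if theta == 0.0 else np.inf
    # The 1-Wasserstein distance decays smoothly with |theta|.
    w = wasserstein_distance([0.0], [theta])
    print(f"theta={theta}: KL={kl}, W1={w}")
```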
3
votes
1 answer

Why does the KL divergence not satisfy the triangle inequality?

The KL divergence is defined as $$D_{KL}=\sum_i p(x_i)\log\left(\frac{p(x_i)}{q(x_i)}\right)$$ Why does $D_{KL}$ not satisfy the triangle inequality? Also, can't you make it satisfy the triangle inequality by taking the absolute value of the…
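A concrete counterexample with three Bernoulli distributions, showing $D_{KL}(p\|r) > D_{KL}(p\|q) + D_{KL}(q\|r)$:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Three Bernoulli distributions written as [P(0), P(1)].
p = np.array([0.5, 0.5])
q = np.array([0.1, 0.9])
r = np.array([0.01, 0.99])

lhs = kl(p, r)               # ≈ 1.61
rhs = kl(p, q) + kl(q, r)    # ≈ 0.51 + 0.14 ≈ 0.66
print(lhs, rhs, lhs <= rhs)  # False: the triangle inequality fails
```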
2
votes
1 answer

How is this statement from a TensorFlow implementation of a certain KL-divergence formula related to the corresponding formula?

I am trying to understand a certain KL-divergence formula (which can be found on page 6 of the paper Evidential Deep Learning to Quantify Classification Uncertainty) and found a TensorFlow implementation for it. I understand most parts of the…
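For reference, the KL term in that paper is between $\text{Dir}(\tilde{\alpha})$ and the uniform Dirichlet $\text{Dir}(1,\dots,1)$. A NumPy/SciPy transcription of that closed form (my reading; worth double-checking against page 6 of the paper):

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_dirichlet_uniform(alpha):
    """D_KL( Dir(alpha) || Dir(1,...,1) ) in closed form (my transcription)."""
    K = alpha.shape[-1]
    a0 = alpha.sum(axis=-1, keepdims=True)  # Dirichlet strength
    return (gammaln(a0.squeeze(-1)) - gammaln(K)
            - gammaln(alpha).sum(axis=-1)
            + ((alpha - 1.0) * (digamma(alpha) - digamma(a0))).sum(axis=-1))

print(kl_dirichlet_uniform(np.array([1.0, 1.0, 1.0])))  # 0: same distribution
print(kl_dirichlet_uniform(np.array([5.0, 1.0, 1.0])))  # ≈ 1.24
```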
2
votes
1 answer

How does the Kullback-Leibler divergence give "knowledge gained"?

I'm reading about the KL divergence on Wikipedia. I don't understand how the equation gives "information gained", as it says in the "Interpretations" section: Expressed in the language of Bayesian inference, $D_{\text{KL}}(P\parallel…
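A toy Bayesian update illustrating that interpretation: $D_{KL}(\text{posterior}\,\|\,\text{prior})$ in bits is the information gained from the observation. Made-up numbers:

```python
import numpy as np

# Toy Bayesian update: which of two coins (fair vs biased) are we holding?
prior = np.array([0.5, 0.5])       # P(fair), P(biased)
like_heads = np.array([0.5, 0.9])  # P(heads | coin)

# Observe heads; apply Bayes' rule.
posterior = prior * like_heads
posterior /= posterior.sum()

# D_KL(posterior || prior) in bits: information gained by seeing heads.
kl_bits = np.sum(posterior * np.log2(posterior / prior))
print(posterior, kl_bits)  # posterior ≈ [0.357, 0.643], gain ≈ 0.06 bits
```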
1
vote
0 answers

How to compare different trajectories in a Markov Decision Process?

I realize that my question is a bit fuzzy and I am sorry for that. If needed, I will try to make it more rigorous and precise. Let $\mathcal{M}$ be a Markov Decision Process, with state space $\mathcal{S}$ and action space $\mathcal{A}$. Let $\tau =…