
I have read articles claiming that Jensen-Shannon divergence is preferred over Kullback-Leibler divergence for measuring how well a generative network has learned a distribution mapping, because JS-divergence handles distribution similarity better when either distribution has zero values.

I am unable to understand how the mathematical formulation of JS-divergence takes care of this, and what qualitative advantage it holds apart from this edge case.

Could anyone explain or link me to an explanation that could answer this satisfactorily?


1 Answer


Let's start with question 1: how does JS-divergence handle zeros?

By definition,
\begin{align} D_{JS}(p||q) &= \frac{1}{2}\left[D_{KL}\left(p\,\Big\|\,\frac{p+q}{2}\right) + D_{KL}\left(q\,\Big\|\,\frac{p+q}{2}\right)\right] \\ &= \frac{1}{2}\sum_{x\in\Omega} \left[p(x)\log\left(\frac{2 p(x)}{p(x)+q(x)}\right) + q(x)\log\left(\frac{2 q(x)}{p(x)+q(x)}\right)\right], \end{align}
where $\Omega$ is the union of the supports of $p$ and $q$. Now assume one distribution is zero at a point where the other is not; without loss of generality (by symmetry) say $p(x_i) = 0$ and $q(x_i) \neq 0$. The $p(x_i)$ part of that term vanishes under the usual convention $0\log 0 = 0$, so the term in the sum becomes
$$\frac{1}{2}q(x_i)\log\left(\frac{2q(x_i)}{q(x_i)}\right) = q(x_i)\frac{\log(2)}{2},$$
which is finite, whereas the corresponding KL-divergence term $q(x_i)\log\left(\frac{q(x_i)}{p(x_i)}\right)$ would be infinite.
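As a quick numerical check (my own sketch, not part of the original answer; the `kl` and `js` helpers are ad-hoc implementations of the definitions above):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions; 0*log(0) := 0, and +inf if q(x)=0 while p(x)>0."""
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """D_JS(p || q) per the definition above, using the mixture m = (p + q) / 2."""
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.0, 0.5, 0.5])   # p(x_0) = 0
q = np.array([0.2, 0.4, 0.4])   # q(x_0) > 0

print(kl(q, p))   # inf: the x_0 term is 0.2 * log(0.2 / 0)
print(js(p, q))   # finite: the x_0 term contributes 0.5 * 0.2 * log(2), as derived above
```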

Now onto question 2: why does JS-divergence produce better results than KL-divergence in GANs?

The asymmetry of KL-divergence gives one distribution an unfair advantage over the other, which is not ideal from an optimization perspective. Additionally, KL-divergence's inability to handle non-overlapping distributions is crippling here, because in a GAN both distributions are only approximated through sampling, so there is no guarantee their supports overlap. JS-divergence solves both of these issues and leads to a smoother loss landscape, which is why it is generally preferred. A good resource is this paper, where the authors investigate this in more detail.
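For a concrete (hypothetical) illustration of both points, here is a small sketch using scipy: KL-divergence is asymmetric and blows up to infinity when the sampled supports don't overlap, while JS-divergence stays symmetric and bounded:

```python
import numpy as np
from scipy.stats import entropy                    # entropy(p, q) computes D_KL(p || q)
from scipy.spatial.distance import jensenshannon   # returns sqrt of D_JS (natural log)

# Overlapping supports: KL is asymmetric, JS is symmetric.
p = np.array([0.1, 0.6, 0.3])
q = np.array([0.4, 0.2, 0.4])
print(entropy(p, q), entropy(q, p))                        # two different values
print(jensenshannon(p, q) ** 2, jensenshannon(q, p) ** 2)  # identical

# Disjoint supports: both KL directions are infinite, while JS stays bounded at log(2).
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])
print(entropy(p, q), entropy(q, p))   # inf, inf
print(jensenshannon(p, q) ** 2)       # log(2) ~= 0.693
```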
