
I'm struggling to understand the GAN loss function as provided in Understanding Generative Adversarial Networks (a blog post written by Daniel Seita).

In the standard cross-entropy loss, we have an output that has been run through a sigmoid function and a resulting binary classification.

Seita states

Thus, for [each] data point $x_1$ and its label, we get the following loss function ...

$$ H((x_1, y_1), D) = -y_1 \log D(x_1) - (1 - y_1) \log (1 - D(x_1)) $$

This is just the expectation of the negative log, which makes sense. However, according to this formulation of the GAN loss, how can we process the data from both the true distribution and the generator in the same iteration?
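Plugging the two possible labels into the formula makes my confusion concrete: a real sample and a generated sample seem to need different terms, and it is unclear to me how both are computed in one pass:

$$ H((x_1, 1), D) = -\log D(x_1), \qquad H((x_1, 0), D) = -\log\big(1 - D(x_1)\big) $$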

tryingtolearn

3 Answers


The Focus of This Question

"How can ... we process the data from the true distribution and the data from the generative model in the same iteration?

Analyzing the Foundational Publication

In the referenced post, Understanding Generative Adversarial Networks (2017), doctoral candidate Daniel Seita correctly references Generative Adversarial Networks (Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio, June 2014). Its abstract states, "We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models ..." The original paper defines two models, both multilayer perceptrons (MLPs).

  • Generative model, G
  • Discriminative model, D

These two models are trained so that each provides a form of negative feedback to the other, hence the term adversarial.

  • G is trained to capture the data distribution of a set of examples well enough to fool D.
  • D is trained to discover whether its inputs are G's mocks or the set of examples given to the GAN system.

(The set of examples for the GAN system are sometimes referred to as the real samples, but they may be no more real than the generated ones. Both are numerical arrays in a computer, one set with an internal origin and the other with an external origin. Whether the external ones are from a camera pointed at some physical scene is not relevant to GAN operation.)

Probabilistically, fooling D is synonymous with maximizing the probability that D produces as many false positives and false negatives as it does correct categorizations, 50% each. In information-theoretic terms, the information D has about G approaches 0 as training time t approaches infinity. It is a process of maximizing the entropy of G's output from D's perspective, thus the term cross-entropy.
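To make the 50% figure concrete: when D has been fully fooled it outputs $D(x) = \tfrac{1}{2}$ everywhere, and the per-sample loss from the question takes the same value on both classes. This is consistent with the 2014 paper's result that the optimal discriminator against a perfect generator is $D(x) = \tfrac{1}{2}$, at which point the value of the minimax game is $-\log 4$:

$$ H\big((x, y), D\big)\Big|_{D(x) = \frac{1}{2}} = -\log \tfrac{1}{2} = \log 2 \quad \text{for } y \in \{0, 1\} $$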

How Convergence is Accomplished

The loss function reproduced from Seita's 2017 post in the question is that of D, designed to minimize the cross-entropy between the two distributions when applied to the full set of points for a given training state.

$$ H\big((x_i, y_i)_{i=1}^N, D\big) = -\sum_{i=1}^N y_i \log D(x_i) - \sum_{i=1}^N (1 - y_i) \log\big(1 - D(x_i)\big) $$

There is a separate loss function for G, designed to maximize the cross-entropy. Notice that there are two levels of training granularity in the system.

  • That of game moves in a two-player game
  • That of the training samples

These produce nested iteration, with the outer iteration as follows (a code sketch follows the list).

  • Training of G proceeds using the loss function of G.
  • Mock input patterns are generated from G at its current state of training.
  • Training of D proceeds using the loss function of D.
  • Repeat while the cross-entropy is not yet sufficiently maximized, that is, while D can still discriminate.
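
A minimal sketch of one outer iteration, assuming PyTorch; the toy 2-D networks, noise dimension, and learning rates here are illustrative choices, not taken from the 2014 paper:

```python
import torch
import torch.nn as nn

# Hypothetical 2-D toy networks; any G/D pair with compatible shapes would do.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def outer_iteration(real_batch):
    n = real_batch.size(0)

    # 1. Train G using G's loss function (non-saturating form:
    #    G is rewarded when D labels its mocks as real).
    fake = G(torch.randn(n, 64))
    loss_g = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # 2. Generate mock input patterns from G at its current training state.
    fake = G(torch.randn(n, 64)).detach()  # detach: G is frozen while D trains

    # 3. Train D using D's loss function on real and mock samples.
    loss_d = bce(D(real_batch), torch.ones(n, 1)) + \
             bce(D(fake), torch.zeros(n, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 4. The caller repeats while D can still discriminate, i.e. while
    #    loss_d stays meaningfully below 2 * log(2), its value at D = 1/2.
    return loss_g.item(), loss_d.item()
```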

When D finally loses the game, we have achieved our goal.

  • G has recovered the training data distribution
  • D has been reduced to ineffectiveness ("1/2 probability everywhere")

Why Concurrent Training is Necessary

If the two models were not trained in a back-and-forth manner to simulate concurrency, convergence in the adversarial plane (the outer iteration) would not occur on the unique solution claimed in the 2014 paper.

More Information

Beyond the question, the next item of interest in Seita's post is that "poor design of the generator's loss function" can lead to gradient values insufficient to guide descent, producing what is sometimes called saturation. Saturation is simply the reduction of the feedback signal that guides descent in back-propagation to chaotic noise arising from floating-point rounding. The term comes from signal theory.
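The saturation can be seen directly in the gradients. Writing $a = D(G(z)) = \sigma(o)$ for the discriminator's sigmoid output on a mock with logit $o$, and using only the identity $\sigma'(o) = \sigma(o)(1 - \sigma(o))$, the minimax generator loss $\log(1 - \sigma(o))$ and the non-saturating alternative $-\log \sigma(o)$ suggested in the 2014 paper have logit-gradients

$$ \frac{\partial}{\partial o} \log\big(1 - \sigma(o)\big) = -\sigma(o), \qquad \frac{\partial}{\partial o} \big(-\log \sigma(o)\big) = \sigma(o) - 1. $$

Early in training, D confidently rejects G's mocks, so $\sigma(o) \approx 0$: the first gradient collapses toward the rounding-noise floor described above, while the second stays near $-1$ and keeps guiding descent.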

I suggest studying the 2014 paper by Goodfellow et al. (the seasoned researchers) to learn about GAN technology rather than the 2017 post.

Douglas Daseeco

Let's start at the beginning. GANs are models that can learn to create data that is similar to the data that we give them.

When training a generative model other than a GAN, the easiest loss function to come up with is probably the Mean Squared Error (MSE).

Let me give you an example (Trickot L 2017):

Now suppose you want to generate cats; you might give your model examples of specific cats in photos. Your choice of loss function means that your model has to reproduce each cat exactly in order to avoid being punished.

But that's not necessarily what we want! We just want the model to generate cats; any cat will do as long as it's a plausible cat. So, we need to change the loss function.

But which loss function could disregard concrete pixels and focus on detecting cats in a photo?

That's a neural network. This is the role of the discriminator in the GAN. The discriminator's job is to evaluate how plausible an image is.
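As a sketch of that shift in loss functions, assuming PyTorch; the image shapes and the one-layer `discriminator` here are hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F

generated = torch.rand(8, 3, 64, 64)   # batch of generated cat images
reference = torch.rand(8, 3, 64, 64)   # batch of real cat photos

# MSE: punishes every pixel that differs from one specific reference cat.
pixel_loss = F.mse_loss(generated, reference)

# Adversarial: a (hypothetical) discriminator scores plausibility in (0, 1);
# the generator is punished only for producing implausible cats.
discriminator = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1), torch.nn.Sigmoid()
)
plausibility = discriminator(generated)
adversarial_loss = F.binary_cross_entropy(
    plausibility, torch.ones_like(plausibility)
)
```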

The post that you cite, Understanding Generative Adversarial Networks (Daniel Seita, 2017), lists two major insights.

Major Insight 1: the discriminator’s loss function is the cross entropy loss function.

Major Insight 2: understanding how gradient saturation may or may not adversely affect training. Gradient saturation is a general problem when gradients are too small (i.e. zero) to perform any learning.

To answer your question we need to elaborate further on the second major insight.

In the context of GANs, gradient saturation may happen due to poor design of the generator’s loss function, so this “major insight” ... is based on understanding the tradeoffs among different loss functions for the generator.

The design implemented in the paper resolves the loss function problem by giving the discriminator a very specific task (to discriminate between two classes). The best way of doing this is by using cross-entropy (Insight 1). As the blog post says:

The cross-entropy is a great loss function since it is designed in part to accelerate learning and avoid gradient saturation only up to when the classifier is correct.

As clarified in the blog post's comments:

The expectation [in the cross entropy function] comes from the sums. If you look at the definition of expectation for a discrete random variable, you'll see that you need to sum over different possible values of the random variable, weighing each of them by their probability. Here, the probabilities are just 1/2 for each, and we can treat them as coming from the generator or discriminator.
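Putting that together: with half the samples drawn from the data (label $y = 1$) and half produced by the generator (label $y = 0$), each weighted by its probability $\tfrac{1}{2}$, the summed per-sample losses become the two-expectation form given in the blog post. This is how both distributions are processed in the same iteration, inside one loss:

$$ H = -\tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] \;-\; \tfrac{1}{2}\,\mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big] $$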

Seth Simba
    The question didn't ask what was easiest to use as a loss function for the ANNs. The specifics of the math were unclear to @tryingtolearn and quoting the grad student without providing any clarification doesn't clarify. – Douglas Daseeco Aug 05 '18 at 04:47

You can treat the combination of the z input and the x input as a single batch of samples, and you evaluate how well the discriminator performed the classification of each of these.

This is why the post later separates the single $y$ into $\mathbb{E}_{x \sim p_{\text{data}}}$ and $\mathbb{E}_{z}$: basically, you have a different expected label ($y$) for each of the discriminator's input sources, and you need to measure both at the same time to evaluate how well the discriminator is performing.

That's why the loss function is conceived as a combination of the positive classification of the real input and the negative classification of the generated input, as the sketch below shows.
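A minimal numeric sketch of measuring both at the same time, using NumPy; the function `D` here is a hypothetical stand-in for a trained discriminator:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Stand-in discriminator: squashes a score into (0, 1) with a sigmoid.
    return 1.0 / (1.0 + np.exp(-x.sum(axis=1)))

x_real = rng.normal(loc=2.0, size=(4, 2))   # samples from the data
x_fake = rng.normal(loc=-2.0, size=(4, 2))  # samples produced from z by G

# Positive classification of the real input, negative of the generated input:
loss = -np.mean(np.log(D(x_real))) - np.mean(np.log(1.0 - D(x_fake)))
print(loss)  # one scalar measuring both expectations in the same iteration
```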

Alpha