The Focus of This Question
"How can ... we process the data from the true distribution and the data from the generative model in the same iteration?
Analyzing the Foundational Publication
In the referenced page, Understanding Generative Adversarial Networks (2017), doctoral candidate Daniel Sieta correctly references Generative Adversarial Networks (Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio, June 2014). Its abstract states, "We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models ..." The original paper defines two models, both MLPs (multilayer perceptrons).
- Generative model, G
- Discriminative model, D
These two models are trained so that each provides a form of negative feedback to the other, hence the term adversarial.
- G is trained to capture the data distribution of a set of examples well enough to fool D.
- D is trained to discover whether its inputs are G's mocks or examples from the GAN system's training set.
(The set of examples for the GAN system is sometimes referred to as the real samples, but those examples may be no more real than the generated ones. Both are numerical arrays in a computer, one set with an internal origin and the other with an external origin. Whether the external ones come from a camera pointed at some physical scene is not relevant to GAN operation.)
Probabilistically, fooling D is synonymous with maximizing the probability that D produces as many false positives and false negatives as it does correct categorizations, 50% each. In information-science terms, the information D has about G approaches 0 as training time t approaches infinity. It is a process of maximizing the entropy of G from D's perspective, thus the term cross-entropy.
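As a quick check of the 50% figure (my own arithmetic, not a quotation from either source): a binary decision made with probability 1/2 carries maximal entropy and therefore no usable information about a sample's origin.

$H = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2$

This is the maximum entropy a binary variable can have, so a D that outputs 1/2 for every input conveys nothing about whether a sample came from G or from the training set.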
How Convergence is Accomplished
The loss function reproduced from Sieta's 2017 writing in the question is that of D. It is designed to minimize the cross-entropy (or correlation) between the two distributions when applied to the full set of points for a given training state.
$H((x_1, y_1), D) = -y_1 \log D(x_1) - (1 - y_1) \log\bigl(1 - D(x_1)\bigr)$
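Applied to the full set of N labeled points for a given training state (the notation is mine, extending the per-sample form above; labeling training samples with $y_i = 1$ and G's mocks with $y_i = 0$ is the usual convention), the same loss reads

$H\bigl((x_i, y_i)_{i=1}^{N}, D\bigr) = -\sum_{i=1}^{N} y_i \log D(x_i) - \sum_{i=1}^{N} (1 - y_i) \log\bigl(1 - D(x_i)\bigr)$

which D's training seeks to minimize.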
There is a separate loss function for G, designed to maximize the cross entropy. Notice that there are TWO levels of training granularity in the system.
- That of game moves in a two-player game
- That of the training samples
These two levels produce a nested iteration, with the outer iteration proceeding as follows (a code sketch of this alternation appears after the list).
- Training of G proceeds using the loss function of G.
- Mock input patterns are generated from G at its current state of training.
- Training of D proceeds using the loss function of D.
- Repeat if the cross-entropy is not yet sufficiently maximized, that is, if D can still discriminate.
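Here is a minimal sketch of that outer loop, assuming PyTorch, a toy two-dimensional data set, and the small network sizes shown; the framework, the layer widths, and the real_batch helper are illustrative assumptions, not anything prescribed by the 2014 paper or the 2017 page.

```python
# Sketch of the nested iteration: the outer loop is the two-player game,
# the inner work is ordinary gradient descent over batches of samples.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 2, 64      # illustrative sizes only

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()                           # the cross-entropy loss discussed above

def real_batch(n=batch):
    # Stand-in for drawing a batch from the GAN system's set of examples.
    return torch.randn(n, data_dim) * 0.5 + 2.0

ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

for move in range(10_000):                   # outer iteration: the game moves
    # 1. Train G with G's own loss: push D's output on mocks toward "real".
    #    (This is the non-saturating form; the pure minimax form would
    #    minimize log(1 - D(fake)) instead.)
    fake = G(torch.randn(batch, latent_dim))
    loss_G = bce(D(fake), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    # 2. Generate mock input patterns from G at its current state of training.
    mocks = G(torch.randn(batch, latent_dim)).detach()

    # 3. Train D with D's own loss: minimize cross-entropy on real vs. mock.
    loss_D = bce(D(real_batch()), ones) + bce(D(mocks), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 4. Repeat while D can still discriminate (outputs far from 1/2).
```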
When D finally loses the game, we have achieved our goal.
- G has recovered the training data distribution
- D has been reduced to ineffectiveness ("1/2 probability everywhere"; the result reproduced below makes this precise)
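The "1/2 everywhere" figure is a result shown in the 2014 paper: for a fixed G, the optimal discriminator is

$D^{*}_{G}(x) = \dfrac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)}$

so once G has recovered the data distribution, $p_g = p_{\text{data}}$ and $D^{*}_{G}(x) = 1/2$ for every input, which is exactly the ineffectiveness described above.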
Why Concurrent Training is Necessary
If the two models were not trained in a back-and-forth manner to simulate concurrency, convergence in the adversarial plane (the outer iteration) to the unique solution claimed in the 2014 paper would not occur.
More Information
Beyond the question, the next item of interest in Sieta's paper is that "poor design of the generator's loss function" can lead to gradient values too small to guide descent, producing what is sometimes called saturation. Saturation is the reduction of the feedback signal that guides descent in back-propagation to chaotic noise arising from floating-point rounding. The term comes from signal theory.
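One concrete way to see the saturation (my own illustration in terms of D's logit $o$, writing $D(G(z)) = \sigma(o)$; the remedy is the one proposed in the 2014 paper): with the minimax generator loss $\log(1 - D(G(z)))$, the gradient with respect to the logit is

$\dfrac{\partial}{\partial o}\,\log\bigl(1 - \sigma(o)\bigr) = -\sigma(o)$

which vanishes precisely when D confidently rejects G's mocks ($\sigma(o) \approx 0$), as happens early in training. Maximizing $\log D(G(z))$ instead gives

$\dfrac{\partial}{\partial o}\,\log \sigma(o) = 1 - \sigma(o) \approx 1$

so the feedback signal survives.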
I suggest studying the 2014 paper by Goodfellow et al. (the seasoned researchers) to learn about GAN technology, rather than the 2017 page.