
The value function for which convergence was proved in the original GAN paper is

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

and the loss functions used in training are

$$\max L(D) = \frac{1}{m} \sum_{i=1}^{m}\left[\log D\left(\boldsymbol{x}^{(i)}\right)+\log \left(1-D\left(G\left(\boldsymbol{z}^{(i)}\right)\right)\right)\right]$$

$$\min L(G) = \frac{1}{m} \sum_{i=1}^{m}\left[\log \left(1-D\left(G\left(\boldsymbol{z}^{(i)}\right)\right)\right)\right]$$

where $\{z^{(1)}, z^{(2)}, z^{(3)}, \cdots, z^{(m)}\}$ and $\{x^{(1)}, x^{(2)}, x^{(3)}, \cdots, x^{(m)}\}$ are the noise samples and data samples for a mini-batch, respectively.

After analyzing some questions (1, 2) on our main site, I found that the loss functions used for training are just an approximation of the value function and are not the same in a formal sense.

Is it true? If yes, what is the reason behind the disparity? Does the loss function used for implementation also ensure convergence?

hanugm
  • What do you mean by "an approximation of the value function and are not same in formal sense"? The original loss is nothing more than the BCE loss. The generator's loss is more complex: D plays the role of some kind of dynamic loss function. – Aray Karjauv Aug 02 '21 at 10:56
  • @ArayKarjauv I mean, they are not the same, but the latter is inspired by the former. What do you mean by the original loss function? Is it the value function? – hanugm Aug 02 '21 at 11:02
  • Do you mean the difference between the loss functions for D and G? – Aray Karjauv Aug 02 '21 at 11:10
  • @ArayKarjauv No, the difference between the actual value function and the loss function used for implementation. – hanugm Aug 02 '21 at 11:11
  • They are identical. The notation is slightly different. The only difference is the loss for G. – Aray Karjauv Aug 02 '21 at 11:14
  • @ArayKarjauv How can they be the same? The value function is using the actual probabilities and in implementation, we are using just 1/m! – hanugm Aug 02 '21 at 11:15
  • @ArayKarjauv The log part is the same. But when it comes to the probability of the samples, they are different, right? – hanugm Aug 02 '21 at 11:16
  • @ArayKarjauv Are you saying that both are BCE? If yes then I may be facing difficulty in deriving the loss functions used for implementation from the theoretical value function used... – hanugm Aug 02 '21 at 11:18
  • I decided to write a complete answer. Feel free to leave your feedback. – Aray Karjauv Aug 02 '21 at 13:45
  • Sure, thanks @ArayKarjauv – hanugm Aug 02 '21 at 13:45

1 Answer


Expected value can be thought of as a weighted average of outcomes. Expectation and mean are therefore the same thing if each outcome has the same probability (here $\frac{1}{m}$), so we can replace the expectation with a sum divided by $m$. We start from the value function: $$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

First, we sample a minibatch of size $m$ with $\boldsymbol{x} \sim P_{data}$ and $\boldsymbol{z} \sim \mathcal{N}(0, 1)$. Now we can replace the expectations with sums:

$$ \begin{align*} \min_G \max_D V(D, G) &= \sum_{i=1}^{m}\left[p(\boldsymbol{x}^{(i)})\log D(\boldsymbol{x}^{(i)})\right] + \sum_{i=1}^{m}\left[p(\boldsymbol{z}^{(i)})\log (1 - D(G(\boldsymbol{z}^{(i)})))\right] \\ &= \sum_{i=1}^{m}\left[\frac{1}{m}\log D(\boldsymbol{x}^{(i)})\right] + \sum_{i=1}^{m}\left[\frac{1}{m}\log (1 - D(G(\boldsymbol{z}^{(i)})))\right]\\ &=\frac{1}{m}\sum_{i=1}^{m}\left[\log D(\boldsymbol{x}^{(i)}) + \log (1 - D(G(\boldsymbol{z}^{(i)})))\right] \end{align*} $$
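The step of swapping the expectation for a minibatch average is a Monte Carlo estimate. A quick numerical sketch (the "discriminator" here is just an arbitrary sigmoid, purely for illustration) shows the $\frac{1}{m}\sum$ form tracking the true expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a discriminator: a fixed sigmoid of the input.
def d(x):
    return 1.0 / (1.0 + np.exp(-x))

# "True" expectation of log D(x) under x ~ N(0, 1), via a very large sample.
true_value = np.mean(np.log(d(rng.normal(size=1_000_000))))

# Minibatch estimate: (1/m) * sum log D(x_i), exactly the 1/m form above.
m = 10_000
batch = rng.normal(size=m)
estimate = np.sum(np.log(d(batch))) / m

# The minibatch mean closely tracks the expectation.
print(abs(estimate - true_value) < 0.05)
```

The larger $m$ is, the tighter the estimate, which is one reason batches rather than single samples are used in practice.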


Binary cross entropy is defined as follows:

$$H(p, q) = \operatorname{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \| q)=-\sum_x p(x)\log q(x)$$
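The identity $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ can be verified numerically; here is a small sketch with two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # "true" distribution (toy example)
q = np.array([0.1, 0.6, 0.3])   # "predicted" distribution (toy example)

cross_entropy = -np.sum(p * np.log(q))   # H(p, q) = -sum_x p(x) log q(x)
entropy = -np.sum(p * np.log(p))         # H(p)
kl = np.sum(p * np.log(p / q))           # D_KL(p || q)

print(np.isclose(cross_entropy, entropy + kl))  # True: H(p,q) = H(p) + KL
```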

Since we have a binary classification problem (fake/real), we can define $p \in \{y,1-y\}$ and $q \in \{\hat{y}, 1-\hat{y}\}$ and rewrite the cross entropy as follows:

$$H(p, q)=-\sum_x p_x \log q_x =-y\log \hat{y}-(1-y)\log (1-\hat{y})$$

which is nothing but the logistic loss. Since we know the source of each sample (real or fake), we can set the label $y = 1$ for real images and $y = 0$ for generated ones, so in each case exactly one term survives with coefficient 1. We then get: $$\min_G\max_D L = \frac{1}{m} \sum_{i=1}^{m}\left[1\cdot\log D\left(\boldsymbol{x}^{(i)}\right)+1\cdot\log \left(1-D\left(G\left(\boldsymbol{z}^{(i)}\right)\right)\right)\right] $$

This is the original loss. The first term always receives real images, while the second receives only generated ones, so each term carries the correct true label. Read this article for more details.
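To see the label substitution concretely, here is a small sketch (the discriminator outputs 0.9 and 0.2 are made-up values) showing that with $y = 1$ only the $\log \hat{y}$ term of the BCE survives, and with $y = 0$ only the $\log(1 - \hat{y})$ term:

```python
import math

def bce(y, y_hat):
    # -y log(y_hat) - (1 - y) log(1 - y_hat)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

d_real = 0.9   # hypothetical D(x) on a real image
d_fake = 0.2   # hypothetical D(G(z)) on a generated image

# Real sample, label y = 1: only the first term survives.
print(math.isclose(bce(1, d_real), -math.log(d_real)))      # True
# Fake sample, label y = 0: only the second term survives.
print(math.isclose(bce(0, d_fake), -math.log(1 - d_fake)))  # True
```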

Since the first term does not depend on $G$, we can split the objective into the two training losses:

$$\max L(D) = \frac{1}{m} \sum_{i=1}^{m}\left[\log D\left(\boldsymbol{x}^{(i)}\right)+\log \left(1-D\left(G\left(\boldsymbol{z}^{(i)}\right)\right)\right)\right]$$

$$\min L(G) = \frac{1}{m} \sum_{i=1}^{m}\left[\log \left(1-D\left(G\left(\boldsymbol{z}^{(i)}\right)\right)\right)\right]$$
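These two minibatch losses can be written down directly; a minimal NumPy sketch (the discriminator outputs below are made-up values, not produced by a trained model) shows that a better discriminator increases $L(D)$, while fakes that fool the discriminator decrease $L(G)$:

```python
import numpy as np

def loss_d(d_real, d_fake):
    """(1/m) * sum[log D(x_i) + log(1 - D(G(z_i)))] -- maximized by D."""
    return np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def loss_g(d_fake):
    """(1/m) * sum[log(1 - D(G(z_i)))] -- minimized by G."""
    return np.mean(np.log(1.0 - d_fake))

# Hypothetical discriminator outputs for a minibatch of m = 4 samples.
d_real = np.array([0.9, 0.8, 0.95, 0.7])   # D(x) on real images
d_fake = np.array([0.1, 0.3, 0.2, 0.25])   # D(G(z)) on generated images

# Raising D(x) on real images raises L(D) (D's objective improves) ...
print(loss_d(d_real, d_fake) > loss_d(d_real * 0.5, d_fake))
# ... and fakes that fool D (larger D(G(z))) lower L(G) (G's objective improves).
print(loss_g(np.array([0.6, 0.7, 0.65, 0.8])) < loss_g(d_fake))
```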

Aray Karjauv
  • My doubt is about fixing the probabilities of the noise, generated, and real samples as $\dfrac{1}{m}$, where $m$ is just a hyperparameter we select and does not depend on the actual probability. So how can we take that? – hanugm Aug 02 '21 at 23:09
  • I guess I [answered](https://ai.stackexchange.com/questions/29953/what-are-the-iid-random-variables-for-a-dataset-in-the-gan-framework/29958#29958) the question about $m$ - we assume each image has the same probability (we can also sample images with certain probabilities). This is not so important. We could have made a batch size 1, but as far as I know, batch optimization is less prone to getting stuck in local minima. As for the noise, it is only used to generate images, so we can replace $G(z)$ with $\hat{x} \sim P_g$ – Aray Karjauv Aug 03 '21 at 01:15
  • Yeah, I understood. But I am not sure whether choosing our own probability is theoretically correct or not. So I am still confused. – hanugm Aug 03 '21 at 02:19
  • @hanugm I don't think the mean is related to the probabilities of the data in this case. It has more to do with the importance of the gradient in the batch. The gradient for all data points is equally important. You can update your model using one data point at a time, so you get rid of the mean. The likelihood of the data is encoded in your dataset. If your dataset is unbalanced, you can sample a specific class with a certain probability, for example, using [WeightedRandomSampler](https://pytorch.org/docs/stable/data.html) in PyTorch. – Aray Karjauv Aug 03 '21 at 08:45