
I was learning about GANs when the term "label smoothing" surfaced. In the video tutorial that I watched, they used "label smoothing" to describe changing the binary labels when calculating the loss of the discriminator network: instead of using 1, they use 0.9 for the label. What is the main purpose of this label smoothing?
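If it helps, the change in the tutorial looked roughly like this (a PyTorch-style sketch from memory; the variable names are mine):

```python
import torch
import torch.nn.functional as F

# Discriminator outputs (probabilities) for a batch of real images;
# a random placeholder stands in for discriminator(real_images) here.
d_out = torch.rand(64, 1)

# Instead of hard labels of 1.0 ...
hard_labels = torch.ones_like(d_out)
# ... the tutorial uses smoothed labels of 0.9:
smooth_labels = torch.full_like(d_out, 0.9)

loss_real = F.binary_cross_entropy(d_out, smooth_labels)
```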

I've skimmed through the original paper, and there is a lot of maths that, honestly, I have difficulty understanding. But I notice this paragraph in there:

We propose a mechanism for encouraging the model to be less confident. While this may not be desired if the goal is to maximize the log-likelihood of training labels, it does regularize the model and makes it more adaptable
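If I read it correctly, the mechanism they propose replaces the one-hot target distribution over the $K$ classes with a mixture of it and the uniform distribution:

$$ q'(k) = (1-\epsilon)\,\delta_{k,y} + \frac{\epsilon}{K}, $$

where $y$ is the ground-truth class, $\delta_{k,y}$ is 1 when $k = y$ and 0 otherwise, and $\epsilon$ is the smoothing strength. In the binary GAN case this seems to be the same idea as replacing the hard label 1 with a softer value such as 0.9.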

And it gives me another question:

  • why "this may not be desired if the goal is to maximize the log-likelihood of training labels"?

  • what do they mean by "adaptable"?

1 Answer


From the previous paragraph:

"This, however, can cause two problems. First, it may result in over-fitting: if the model learns to assign full probability to the groundtruth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient ∂` ∂zk , reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions."

In other words, label smoothing increases the entropy of your model's output distribution, which, in turn, makes it less confident in its predictions. As the authors point out, this may lead to a decrease in training performance w.r.t. the log-likelihood of the training labels. In particular, without label smoothing the model's capacity may be high enough to fit a complex function that assigns full probability to the training label of every input example. Such a function, however, may not generalise well, making the model less adaptable to unseen data.

In the GAN space, there are other benefits to label smoothing as well, such as preventing the discriminator from passing an excessively large gradient signal to the generator (see here: https://www.kth.se/social/files/59086d09f2765460c378ca73/GANs.pdf). Large gradients are known to make the training process more unstable and may prevent convergence.

For a deeper understanding of label smoothing and entropy, I recommend this paper: https://arxiv.org/pdf/2005.00820.pdf. Also, check out this related question: https://datascience.stackexchange.com/questions/28764/one-sided-label-smoothing-in-gans
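To tie this back to the GAN setting from your question, here is a minimal PyTorch-style sketch of one-sided label smoothing in the discriminator loss (my own illustration; `D`, `real` and `fake` are placeholder names). Only the real labels are softened to 0.9; the fake labels stay at exactly 0, since smoothing both sides can end up reinforcing the generator's current samples:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake, real_label=0.9):
    """One-sided label smoothing: soften only the real labels.

    D is any discriminator returning probabilities; real/fake are
    batches of real and generated samples (placeholder names).
    """
    d_real = D(real)
    d_fake = D(fake.detach())  # don't backprop into the generator here

    # Real targets at 0.9 instead of 1.0 keep the discriminator from
    # becoming overconfident and passing extreme gradients to the generator.
    loss_real = F.binary_cross_entropy(d_real, torch.full_like(d_real, real_label))
    # Fake targets stay at exactly 0 (one-sided smoothing).
    loss_fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return loss_real + loss_fake
```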