
There are instances in the literature where we need to change the loss function in order to escape gradient problems.

Let $L_f$ be the loss function for a model I need to train. Sometimes $L_f$ leads to problems due to its gradient, so I reformulate it as $L_g$ and can then apply the optimization successfully. Most of the time, the new loss function is obtained by making a small adjustment to $L_f$.


For example, consider the following excerpt from the paper titled Evolutionary Generative Adversarial Networks:

In the original GAN, training the generator was equal to minimizing the JSD between the data distribution and the generated distribution, which easily resulted in the vanishing gradient problem. To solve this issue, a nonsaturating heuristic objective (i.e., “$− \log D$ trick”) replaced the minimax objective function to penalize the generator
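
For reference, and as far as I understand the original GAN formulation, the two generator objectives being compared in the excerpt are
$$
\min_G \; \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big]
\quad\text{(minimax)}
\qquad\text{and}\qquad
\min_G \; \mathbb{E}_{z}\big[-\log D(G(z))\big]
\quad\text{(non-saturating)}.
$$
Writing $D(G(z)) = \sigma(a)$, where $a$ is the discriminator's logit on a generated sample, the per-sample gradients are
$$
\frac{d}{da}\log\big(1 - \sigma(a)\big) = -\sigma(a) \;\longrightarrow\; 0
\qquad\text{and}\qquad
\frac{d}{da}\big(-\log \sigma(a)\big) = \sigma(a) - 1 \;\longrightarrow\; -1
$$
as $a \to -\infty$, i.e., when the discriminator confidently rejects the sample. As far as I can tell, this is what the excerpt means: the minimax objective saturates exactly where a poorly performing generator needs a gradient, while the $-\log D$ trick does not.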


How can one understand these facts geometrically? Are there any simple examples, in 2D or 3D, that show two types of curves: one that gives no gradient issues and another that does, yet both express the same objective?
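
For concreteness, here is a rough 2D sketch of the kind of picture I currently have in mind (my own illustration, not taken from the paper). It plots both generator objectives above as functions of the discriminator logit $a$, together with their gradients: both curves are minimized by pushing $a$ (and hence $D(G(z))$) upward, but only the non-saturating one stays steep in the region $a \ll 0$ where a freshly initialized generator typically lives.

```python
# Rough sketch (my own, not from the EGAN paper): compare the two generator
# objectives as functions of the discriminator logit a, where D(G(z)) = sigmoid(a).
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-8.0, 8.0, 400)      # discriminator logit on a generated sample
d = sigmoid(a)                       # D(G(z)) in (0, 1)

minimax = np.log(1.0 - d)            # log(1 - D): flattens out for a << 0
nonsat = -np.log(d)                  # "-log D trick": slope stays near -1 for a << 0

minimax_grad = -d                    # d/da log(1 - sigmoid(a)) = -sigmoid(a)  -> 0
nonsat_grad = d - 1.0                # d/da (-log sigmoid(a))   = sigmoid(a)-1 -> -1

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(a, minimax, label="log(1 - D(G(z)))  (minimax)")
ax1.plot(a, nonsat, label="-log D(G(z))  (non-saturating)")
ax1.set_xlabel("discriminator logit a")
ax1.set_ylabel("generator loss")
ax1.legend()

ax2.plot(a, minimax_grad, label="minimax gradient")
ax2.plot(a, nonsat_grad, label="non-saturating gradient")
ax2.set_xlabel("discriminator logit a")
ax2.set_ylabel("d(loss)/da")
ax2.legend()
plt.tight_layout()
plt.show()
```

Is this flat-versus-steep picture the right way to interpret the excerpt, and does the same intuition carry over to other reformulations of the kind $L_f \to L_g$ described above?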

hanugm
  • Could you link one or more examples that you say you have read in the literature? This is not something I have heard of, or I do not understand properly what you are asking. The original text would help – Neil Slater Aug 08 '21 at 07:27
  • @NeilSlater https://ai.stackexchange.com/questions/29985/what-does-it-mean-by-strong-or-sufficient-gradient-for-training-in-this-context – hanugm Aug 08 '21 at 07:49
  • That doesn't appear to mention vanishing gradient? Could you link or clarify about vanishing gradient as per this question? Or is it possible that you are not asking about vanishing gradient, but about gradient issues discussed in the paper? – Neil Slater Aug 08 '21 at 11:56
  • @NeilSlater I think the latter one. I thought vanishing gradient is mostly the problem related to gradients... – hanugm Aug 08 '21 at 13:40
  • Vanishing gradient is specifically the problem associated with gradient signals decaying through multiple layers or multiple time steps. I think your linked paper is more to do with an original signal that can get weaker when it is a certain distance away from the objective – Neil Slater Aug 08 '21 at 14:31
  • It is more common to talk about derivatives and the vanishing gradient problem when discussing SGD. When discussing loss issues, it is more common to discuss regularization methods. – codecypher Aug 09 '21 at 20:01

0 Answers