There are instances in the literature where we need to change the loss function in order to escape gradient problems.
Let $L_f$ be the loss function of a model I need to train. Sometimes $L_f$ leads to gradient problems (e.g., vanishing gradients), so I reformulate it as $L_g$ and can then apply the optimization successfully. Most of the time the new loss function is obtained by making a small adjustment to $L_f$.
For example, consider the following excerpt from the paper titled *Evolutionary Generative Adversarial Networks*:
> In the original GAN, training the generator was equal to minimizing the JSD between the data distribution and the generated distribution, which easily resulted in the vanishing gradient problem. To solve this issue, a nonsaturating heuristic objective (i.e., "$-\log D$ trick") replaced the minimax objective function to penalize the generator
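To make the quoted passage concrete for myself, here is a small sketch I put together (not from the paper): it compares the saturating generator objective $\log(1 - D(G(z)))$ with the non-saturating $-\log D(G(z))$ objective, together with their gradients with respect to the discriminator logit, assuming $D$ is a sigmoid of that logit. When $D(G(z)) \approx 0$ (the discriminator confidently rejects a generated sample), the saturating objective's gradient w.r.t. the logit goes to zero, while the non-saturating one stays close to $-1$, even though both objectives are minimized by pushing $D(G(z))$ toward 1.

```python
# Sketch (not from the paper): compare the saturating GAN generator loss
# log(1 - D(G(z))) with the non-saturating "-log D" loss, both viewed as
# functions of the discriminator output D(G(z)) = sigmoid(a).
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Logits from "confidently fake" (a << 0, D ~ 0) to "confidently real" (a >> 0).
a = np.linspace(-8, 8, 400)
D = sigmoid(a)

# Generator objectives (both are minimized by pushing D(G(z)) towards 1).
loss_saturating = np.log(1.0 - D)      # original minimax objective
loss_nonsaturating = -np.log(D)        # "-log D" heuristic

# Gradients with respect to the logit a (derived analytically):
# d/da log(1 - sigmoid(a))  = -sigmoid(a)        -> 0 when D ~ 0 (vanishes)
# d/da [-log sigmoid(a)]    = -(1 - sigmoid(a))  -> -1 when D ~ 0 (stays useful)
grad_saturating = -D
grad_nonsaturating = -(1.0 - D)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(D, loss_saturating, label=r"$\log(1 - D)$ (saturating)")
ax1.plot(D, loss_nonsaturating, label=r"$-\log D$ (non-saturating)")
ax1.set_xlabel("D(G(z))"); ax1.set_ylabel("generator loss"); ax1.legend()

ax2.plot(D, grad_saturating, label="grad of saturating loss")
ax2.plot(D, grad_nonsaturating, label="grad of non-saturating loss")
ax2.set_xlabel("D(G(z))"); ax2.set_ylabel("d loss / d logit"); ax2.legend()

plt.tight_layout()
plt.show()
```

This shows the effect numerically in one dimension, but I am looking for a more geometric picture.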
How can one understand these facts geometrically? Are there any simple 2D or 3D examples showing two types of curves, one that causes gradient issues and one that does not, yet both attain the same objective?