
The beginner Colab example for TensorFlow states:

Note: It is possible to bake this tf.nn.softmax in as the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.

My question is, then: why? What do they mean by "impossible to provide an exact and numerically stable loss calculation"?

galah92
    I did some quick research, and by "numerically stable gradient" they basically mean the vanishing gradient in exponential functions. As the inputs grow, 32/64-bit precision overflows and the gradient becomes 0 even when it is actually non-zero. –  Apr 13 '20 at 08:04
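The overflow the comment describes is easy to reproduce; here is a minimal numpy sketch (hypothetical values, not from the thread) showing `exp()` saturating to `inf` for ordinary-sized logits, and how subtracting the max first avoids it:

```python
import numpy as np

# exp() overflows well within the range of plausible logits
# (float32 overflows near exp(88.7), float64 near exp(709.8)).
print(np.exp(np.float32(89.0)))     # inf (emits an overflow warning)
print(np.exp(np.float64(710.0)))    # inf

def softmax_naive(z):
    """Naive softmax: overflows for large logits."""
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    """Shift by the max first (log-sum-exp trick); exponents are then <= 0."""
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1000.0, 0.0])
print(softmax_naive(z))             # [nan  0.]  (inf / inf)
print(softmax_stable(z))            # [1. 0.]
```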

2 Answers


It's because of gradient computations: automatic differentiation computes the gradient module by module. If you have a standalone cross-entropy module, the overall loss is unstable (its gradient behaves like $\sim 1/x$, so it diverges for small input values), whereas if you use a fused softmax + cross-entropy module, the gradient simplifies to $(y - p)$ and the computation becomes numerically stable.
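A small numpy sketch of the two pipelines (my own hypothetical example, not from the slides): a standalone cross-entropy applied to softmax probabilities blows up because `log()` is taken of an already-underflowed probability, while the fused form simplifies algebraically to `logsumexp(z) - z[target]` and stays exact:

```python
import numpy as np

def ce_on_probs(z, target):
    """Standalone CE module: log() is applied to a rounded probability."""
    p = np.exp(z - z.max())
    p /= p.sum()
    return -np.log(p[target])   # p[target] can underflow to exactly 0

def ce_on_logits(z, target):
    """Fused softmax + CE: -log(softmax(z)[t]) = logsumexp(z) - z[t]."""
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) - z[target]

z = np.array([800.0, 0.0])      # very confident (wrong) logits
print(ce_on_probs(z, target=1))   # inf  (log of an underflowed 0)
print(ce_on_logits(z, target=1))  # 800.0, exact and finite
```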

Slides from Simon Osindero's (DeepMind) lecture at UCL in 2016:

[Slide: CE module]

[Slide: CE + Softmax module]

ted
  • You say "if you have a standalone crossentropy module the over all loss will be unstable (~1/x)", but you don't really explain WHY. Maybe you can explain why using softmax is unstable. That's the actual question. Also, what does the CE have to do with the softmax? – nbro Apr 14 '20 at 00:16
  • The slides actually do explain *why*. If the question were about mathematical gradient derivations I'd agree with your comment, but it's not. I'll add a few words though – ted Apr 14 '20 at 01:50
  • There is a typo in the gradient on the second slide. Check here: https://xeonqq.github.io/machine%20learning/softmax/ The $i$-th element in the gradient should be $\frac{e^{x_i}}{\sum_m e^{x_m}} - p_i$. – Carlos H. Mendoza-Cardenas Jul 11 '22 at 15:18
2


This is also a question I stumbled upon; thanks to ted for the explanation, it is very helpful, and I will try to elaborate a little. Let's still use DeepMind's Simon Osindero's slide: the grey block on the left is only a cross-entropy operation. The input $x$ (a vector) could be the softmax output from the previous layer (not the input to the neural network), and $y$ (a scalar) is the cross-entropy result of $x$. To propagate the gradient back, we need to calculate $dy/dx_i$, which is $-p_i/x_i$ for each element of $x$. Since the softmax function scales the logits into the range $[0, 1]$, if in one training step the neural network becomes super confident and predicts one of the probabilities $x_i$ to be 0, then we have a numerical problem in calculating $dy/dx_i$.
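To illustrate the divergence above with some made-up numbers: the backward pass of a standalone cross-entropy needs $dy/dx_i = -p_i/x_i$, which blows up as the predicted probability for the true class approaches 0:

```python
import numpy as np

p = np.array([1.0, 0.0])                  # one-hot target
for x_true in [1e-2, 1e-8, 0.0]:          # network grows confident in the wrong class
    x = np.array([x_true, 1.0 - x_true])  # softmax output from the previous layer
    with np.errstate(divide="ignore"):
        grad = -p / x                     # dy/dx_i = -p_i / x_i
    print(x_true, grad)                   # magnitudes: 1e2, 1e8, then inf
```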

In the other case, where we take the logits and compute the softmax and cross-entropy in one shot (the XentLogits function), we don't have this problem, because the derivative of XentLogits is $dy/dx_i = \frac{e^{x_i}}{\sum_m e^{x_m}} - p_i$, i.e. the softmax output minus the target, which is bounded. A more elaborate derivation can be found here.
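A quick numpy check of this (using my own hypothetical logits): the fused gradient $\mathrm{softmax}(x) - p$ stays in $[-1, 1]$ no matter how extreme the logits get:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift by the max for stability
    return e / e.sum()

x = np.array([1000.0, -1000.0, 0.0])   # wildly over-confident logits
p = np.array([0.0, 1.0, 0.0])          # one-hot target
grad = softmax(x) - p                  # fused softmax + CE gradient
print(grad)                            # [ 1. -1.  0.]  -- finite everywhere
```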

xeonqq