5

Just wondering why a softmax is typically used in practice on the outputs of most neural nets, rather than simply dividing each activation by the sum of the activations. I know it's roughly the same thing, but what is the mathematical reasoning behind a softmax over plain sum-normalization? Is it better in some way?

user8714896
  • 717
  • 1
  • 4
  • 21

1 Answer

2

There are probably multiple explanations and reasonings, but I can offer you one. If your output vector contains negative values, you cannot do what you suggested to get something that behaves like probabilities (all components positive, summing to $1$), because you can end up with a negative "probability", which doesn't make sense. A good property of the exponential function used in softmax is that it never returns negative values, so regardless of your output vector you will never get a negative probability.
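
Here is a quick NumPy sketch of that point (just an illustration, with a made-up logit vector): with a negative activation, dividing by the sum produces a negative "probability", while softmax stays positive and sums to $1$.

```python
import numpy as np

def naive_normalize(x):
    """Divide each activation by the sum of activations (the suggestion in the question)."""
    return x / np.sum(x)

def softmax(x):
    """Exponentiate, then normalize; every output is strictly positive."""
    e = np.exp(x - np.max(x))  # subtracting the max is safe because softmax is shift invariant (see below)
    return e / np.sum(e)

logits = np.array([-1.0, 0.5, 2.0])  # hypothetical network outputs with a negative component

print(naive_normalize(logits))  # approx [-0.667  0.333  1.333] -- a negative "probability"
print(softmax(logits))          # approx [ 0.039  0.175  0.786] -- all positive, sums to 1
```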

You could suggest adding some positive offset vector $\mathbf{d}$ to your output vector to get rid of the negative values, but there are a couple of problems. First, you cannot know in advance the range of negative output values, so you cannot choose an offset vector that covers all possible cases. Second, with such a strategy you can get unrealistic results. For example, assume the output vector is $[-0.1, 0.2, 0.3]^T$ and the offset vector is $[0.1, 0.1, 0.1]^T$. Adding the two gives $[0, 0.3, 0.4]^T$, so the probability of the first class would be $0$, since the numerator is $0$. That is a very overconfident result, and we probably wouldn't want to assign exactly $0$ to this class. The result also depends on the offset vector: if the offset vector is instead $[0.3, 0.3, 0.3]^T$, the sum is $[0.2, 0.5, 0.6]^T$ and the probability of the first class becomes $0.2/(0.2 + 0.5 + 0.6) \approx 0.15$. So changing the offset vector changes the probabilities, and as the components of the offset vector $\rightarrow \infty$ the probabilities of all classes $\rightarrow 1/3$. We would like to get the same result regardless of which constant is added to the outputs; only their relative relationships should matter.

Another good property of softmax is that it is shift invariant:
\begin{align}
p_i &= \frac{e^{x_i + d}}{\sum_{j=1}^n e^{x_j + d}}\\
&= \frac{e^d \cdot e^{x_i}}{e^d \cdot \sum_{j=1}^n e^{x_j}}\\
&= \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}},
\end{align}
so the probability of the $i$-th component is independent of the offset $d$. Softmax ignores a common shift of the values; it only captures the relative relationships between the components.
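
As a quick numerical check of those two points (a small sketch reusing the $[-0.1, 0.2, 0.3]^T$ vector from above): softmax gives the same probabilities after any constant shift, while the offset-and-normalize trick changes with the offset and drifts toward $1/3$ as it grows.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

x = np.array([-0.1, 0.2, 0.3])  # output vector from the example above

# Softmax is shift invariant: adding a constant to every component changes nothing.
print(softmax(x))        # approx [0.260 0.351 0.388]
print(softmax(x + 5.0))  # identical (up to floating point)

# The "add an offset, then divide by the sum" trick is NOT shift invariant:
for c in (0.1, 0.3, 10.0):
    shifted = x + c
    print(c, shifted / shifted.sum())
# c = 0.1  -> approx [0.000 0.429 0.571]  first class gets exactly 0
# c = 0.3  -> approx [0.154 0.385 0.462]
# c = 10.0 -> approx [0.326 0.336 0.339]  everything drifts toward 1/3
```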

Brale
  • 2,306
  • 1
  • 5
  • 14