
Previously I learned that softmax as the output layer, coupled with the log-likelihood cost function (the same as nll_loss in PyTorch), can solve the learning slowdown problem.

However, while working through the PyTorch MNIST tutorial, I'm confused about why the combination of log_softmax as the output layer and nll_loss (the negative log-likelihood loss) as the loss function was used (L26 and L34).

I found that with log_softmax + nll_loss the test accuracy was 99%, while with softmax + nll_loss it was 97%.
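Here is a minimal sketch of the two combinations I compared (toy logits and labels rather than the actual tutorial model):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)          # a batch of 4 raw score vectors over 10 classes
target = torch.tensor([3, 7, 0, 1])  # ground-truth class indices

# the tutorial's combination: log_softmax output + nll_loss
loss_tutorial = F.nll_loss(F.log_softmax(logits, dim=1), target)

# the combination I tried instead: softmax output + nll_loss
loss_mine = F.nll_loss(F.softmax(logits, dim=1), target)

print(loss_tutorial.item(), loss_mine.item())
```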

What is the advantage of log_softmax over softmax? How can we explain the performance gap between them? Is log_softmax + nll_loss always better than softmax + nll_loss?

user1024

1 Answer


The short answer is yes, log_softmax + nll_loss will work better.

I don’t know the implementation details under the hood in PyTorch, but see the screenshot below from the documentation:

[screenshot of the PyTorch documentation]
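Here's a quick sketch (toy numbers of my own, not from the documentation) showing why working in log space matters, and that log_softmax + nll_loss is what cross_entropy computes in one call:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.0, -1000.0, -2000.0]])  # widely spread raw scores
target = torch.tensor([1])

# softmax then log: the tiny probabilities underflow to 0, and log(0) = -inf
print(torch.log(F.softmax(logits, dim=1)))  # tensor([[0., -inf, -inf]])

# log_softmax stays in log space, so the values remain finite
print(F.log_softmax(logits, dim=1))         # tensor([[0., -1000., -2000.]])

# the resulting losses: inf (no usable gradient) vs. a finite number
print(F.nll_loss(torch.log(F.softmax(logits, dim=1)), target))  # inf
print(F.nll_loss(F.log_softmax(logits, dim=1), target))         # 1000.

# log_softmax + nll_loss matches cross_entropy applied to the raw logits
print(torch.allclose(F.nll_loss(F.log_softmax(logits, dim=1), target),
                     F.cross_entropy(logits, target)))          # True
```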

Hanzy
  • Yes, I know that `log_softmax` + `nll_loss` will work better, but I want to know why... – user1024 May 01 '19 at 03:54
  • @user1024 this seems like a question for the development team, since it could depend heavily on their implementation. But you mentioned you used softmax + nll_loss together. Note that since log is monotonically increasing, maximizing the log of the probability with respect to theta is the same as maximizing the probability with respect to theta; that is, they both have the same argmax. But the negative log-likelihood loss expects log probabilities, which is why log_softmax is used. Also note that log(a/b) = log(a) - log(b), and since b is the same for all classes, it's (probably) omitted in their implementation (see the sketch after these comments). – Hanzy May 01 '19 at 04:33
  • @user1024 It's omitted because it wouldn't affect the maximization, so I assume dropping it makes things faster and more numerically stable. But I also assume they have further optimized the routine's implementation in less obvious ways, purely for speed and stability. – Hanzy May 01 '19 at 04:34
  • @user1024 And assuming we are using a Gaussian prior, maximizing the likelihood requires exponentiation, but taking the log cancels out the exponentiation, which would also make it faster and more stable. – Hanzy May 01 '19 at 04:36
  • Thank you @Hanzy, your comments helped explain the advantages of `log_softmax` in speed and numerical stability. As a beginner in AI, I get allodoxaphobia from having to choose so many hyperparameters. – user1024 May 04 '19 at 03:21
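A small sketch of the identity behind the log(a/b) = log(a) - log(b) point in the comments (an illustration, not PyTorch's actual implementation):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 5)  # a batch of raw scores

# log(softmax(x)_i) = log( exp(x_i) / sum_j exp(x_j) )
#                   = x_i - log( sum_j exp(x_j) )
# The log cancels the exp in the numerator, and the denominator is shared
# by every class, so it collapses into a single logsumexp per row.
manual = x - torch.logsumexp(x, dim=1, keepdim=True)

print(torch.allclose(manual, F.log_softmax(x, dim=1)))  # True
```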