I have previously learned that a `softmax` output layer coupled with the log-likelihood cost function (the same as `nll_loss` in PyTorch) can solve the learning slowdown problem.
However, while working through the PyTorch MNIST tutorial, I'm confused about why the combination of `log_softmax` as the output layer and `nll_loss` (the negative log likelihood loss) as the loss function was used (L26 and L34).
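For context, the pattern I'm asking about looks roughly like this (my own paraphrase with a placeholder network body, not the tutorial's exact code):

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # placeholder body; the real tutorial uses a small conv net
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        x = self.fc(x.view(x.size(0), -1))
        # output layer returns log-probabilities, not probabilities
        return F.log_softmax(x, dim=1)

# in the training loop, the loss is then computed as:
# loss = F.nll_loss(output, target)
```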
I found that when `log_softmax` + `nll_loss` was used, the test accuracy was 99%, while when `softmax` + `nll_loss` was used, the test accuracy was 97%.
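To isolate the difference, here is a minimal, self-contained sketch of the two loss computations I compared (the logits here are made up, not from the MNIST model):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw scores for one sample, 3 classes
target = torch.tensor([0])                 # true class index

# variant used in the tutorial: log_softmax feeding nll_loss
loss_log = F.nll_loss(F.log_softmax(logits, dim=1), target)

# variant I tried instead: softmax feeding nll_loss
loss_plain = F.nll_loss(F.softmax(logits, dim=1), target)

print(loss_log.item(), loss_plain.item())  # the two losses clearly differ
```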
So I'm confused: what's the advantage of `log_softmax` over `softmax`? How can we explain the performance gap between them? Is `log_softmax` + `nll_loss` always better than `softmax` + `nll_loss`?