AFAIK, momentum is quite useful when training CNNs and can speed up training substantially without any drop in validation accuracy.
I've recently learned that it is not as helpful for RNNs, where plain SGD is preferred.
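For reference, by momentum I mean the usual heavy-ball update (standard formulation, with $\alpha$ the momentum coefficient and $\epsilon$ the learning rate):

$$v \leftarrow \alpha v - \epsilon \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta + v,$$

which reduces to plain SGD when $\alpha = 0$.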
For example, Deep Learning by Goodfellow et al. says (section 10.11, page 401):
Both of these approaches have largely been replaced by simply using SGD (even without momentum) applied to LSTMs.
The passage is about LSTMs, and as I understand it, "both of these approaches" refers to second-order methods and first-order methods with momentum, respectively.
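Concretely, the two settings I'm comparing differ only in one optimizer flag. Here's a minimal PyTorch sketch of what I mean (the model, sizes, loss, and hyperparameters are placeholders of my own, not anything from the book):

```python
import torch
import torch.nn as nn

# Placeholder recurrent model and batch; hypothetical sizes.
model = nn.LSTM(input_size=16, hidden_size=32)
x = torch.randn(5, 8, 16)  # (seq_len, batch, features)

# Setting 1: plain SGD, as the book recommends for LSTMs.
opt_plain = torch.optim.SGD(model.parameters(), lr=0.1)

# Setting 2: SGD with momentum, as commonly used for CNNs.
opt_momentum = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

out, _ = model(x)
loss = out.pow(2).mean()  # dummy loss, just to produce gradients
loss.backward()
opt_momentum.step()       # one momentum step (PyTorch uses v = mu*v + g; p -= lr*v)
```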
What causes this discrepancy?