From what I know, the most widely used optimizer in practice is Adam, which in essence is just mini-batch gradient descent with momentum (to avoid getting stuck in saddle points) plus per-parameter damping of the step size (to avoid wiggling back and forth where the conditioning of the loss surface is bad).
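For concreteness, here's roughly how I understand a single Adam step (a rough NumPy sketch of the standard update rule; the function and argument names are just illustrative, not from any particular framework):

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of the gradients (the "momentum" part).
    m = beta1 * m + (1 - beta1) * grads
    # Second moment: exponential moving average of the squared gradients,
    # used to scale the step per parameter (the "damping" part).
    v = beta2 * v + (1 - beta2) * grads**2
    # Bias correction, since m and v start at zero.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Update: move in the momentum direction, with a per-parameter step size.
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```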
Not that any of this is easy in absolute terms, but after a few days I think I understood most of it. When I look into the field of mathematical (non-linear) optimization, however, I'm completely overwhelmed.
What are the possible reasons that optimization algorithms for neural networks aren't more intricate?
- There are just more important things to improve?
- Just not possible?
- Are Adam and the others already so good that researchers just don't care?