An article on https://spell.ml says:

Because Adam manages learning rates internally, it's incompatible with most learning rate schedulers. Anything more complicated than simple learning rate warmup and/or decay will cause the Adam optimizer to "compete" with the learning rate scheduler over its internal LR, worsening model convergence.
I have found the same convergence issues in my own work when using both Adam and a StepLR scheduler.
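For reference, my setup looks roughly like the following (a minimal sketch with a placeholder model and dummy data, not my actual training code):

```python
import torch
import torch.nn as nn

# Placeholder model -- stands in for my actual network.
model = nn.Linear(10, 1)

# Adam with a base learning rate of 1e-3.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# StepLR multiplies the current lr by gamma every step_size epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # rescales the base lr that Adam's update starts from
```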
I understand that Adam adjusts the learning rate on a per-parameter basis, which perhaps negates the need for a learning rate scheduler in the first place, but why would combining the two lead to convergence issues?
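To make that concrete, my understanding of the Adam update (standard formulation from the Adam paper) is

$$
\theta_{t+1} = \theta_t - \alpha_t \,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\qquad
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t},
$$

where $\alpha_t$ is the global step size that a scheduler such as StepLR rescales, and $\hat{m}_t$, $\hat{v}_t$ are the bias-corrected first and second moment estimates. As far as I can tell, the effective per-parameter step is $\alpha_t / (\sqrt{\hat{v}_t} + \epsilon)$ times the momentum term, so the scheduler's scaling simply multiplies on top of Adam's own adaptive scaling.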
Is there any mathematical reason/proof why using both the Adam optimiser and a learning rate scheduler causes convergence issues?
Is it true that they really "compete" with each other?