I've been working on Neural Collaborative Filtering (NCF) recently to build a recommender system with TensorFlow Recommenders. While doing some hyperparameter tuning with the different optimizers available in tf.keras.optimizers, I found that Adam and its variants, such as Adamax and Nadam, run much slower than seemingly less advanced optimizers like Adagrad, Adadelta, and SGD: with Adam and its variants, each training epoch takes about 30x longer.
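For context, here is a minimal sketch of roughly the kind of setup I mean. It is plain Keras rather than my full TFRS pipeline, and the architecture, table sizes, and synthetic data are placeholders, but it shows how I swap optimizers on an embedding-heavy NCF-style model and time one epoch per optimizer:

```python
# Minimal sketch (not my exact code): an NCF-style model with user/item
# embeddings, timed with different optimizers on synthetic interactions.
# All sizes and the data are placeholders.
import time
import numpy as np
import tensorflow as tf

NUM_USERS, NUM_ITEMS, EMB_DIM = 100_000, 50_000, 64

def build_model():
    user_in = tf.keras.Input(shape=(), dtype=tf.int32)
    item_in = tf.keras.Input(shape=(), dtype=tf.int32)
    user_emb = tf.keras.layers.Embedding(NUM_USERS, EMB_DIM)(user_in)
    item_emb = tf.keras.layers.Embedding(NUM_ITEMS, EMB_DIM)(item_in)
    x = tf.keras.layers.Concatenate()([user_emb, item_emb])
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model([user_in, item_in], out)

# Synthetic interactions, just to make the comparison runnable.
users = np.random.randint(0, NUM_USERS, size=500_000)
items = np.random.randint(0, NUM_ITEMS, size=500_000)
labels = np.random.randint(0, 2, size=500_000).astype("float32")

for opt_name in ["adam", "nadam", "adamax", "adagrad", "adadelta", "sgd"]:
    model = build_model()
    model.compile(optimizer=opt_name, loss="binary_crossentropy")
    start = time.time()
    model.fit([users, items], labels, batch_size=1024, epochs=1, verbose=0)
    print(f"{opt_name}: {time.time() - start:.1f}s per epoch")
```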
This came as a surprise to me, since one of the most celebrated properties of the Adam optimizer is its convergence speed, especially compared to SGD, yet here it is the per-epoch wall-clock time that blows up. What could be the reason for such a significant difference in computation speed?