I've been working on Neural Collaborative Filtering (NCF) recently to build a recommender system with TensorFlow Recommenders. While doing some hyperparameter tuning with the different optimizers available in tf.keras.optimizers, I found that Adam and its variants, such as Adamax and Nadam, run much slower than seemingly less advanced optimizers like Adagrad, Adadelta, and SGD: with Adam and its variants, each training epoch takes about 30x longer.
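For context, here is a minimal sketch of roughly the kind of setup I mean. It is plain Keras rather than my full TFRS pipeline, and the architecture, table sizes, and synthetic data are placeholders, but it shows how I swap optimizers on an embedding-heavy NCF-style model and time one epoch per optimizer:

```python
# Minimal sketch (not my exact code): an NCF-style model with user/item
# embeddings, timed with different optimizers on synthetic interactions.
# All sizes and the data are placeholders.
import time
import numpy as np
import tensorflow as tf

NUM_USERS, NUM_ITEMS, EMB_DIM = 100_000, 50_000, 64

def build_model():
    user_in = tf.keras.Input(shape=(), dtype=tf.int32)
    item_in = tf.keras.Input(shape=(), dtype=tf.int32)
    user_emb = tf.keras.layers.Embedding(NUM_USERS, EMB_DIM)(user_in)
    item_emb = tf.keras.layers.Embedding(NUM_ITEMS, EMB_DIM)(item_in)
    x = tf.keras.layers.Concatenate()([user_emb, item_emb])
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model([user_in, item_in], out)

# Synthetic interactions, just to make the comparison runnable.
users = np.random.randint(0, NUM_USERS, size=500_000)
items = np.random.randint(0, NUM_ITEMS, size=500_000)
labels = np.random.randint(0, 2, size=500_000).astype("float32")

for opt_name in ["adam", "nadam", "adamax", "adagrad", "adadelta", "sgd"]:
    model = build_model()
    model.compile(optimizer=opt_name, loss="binary_crossentropy")
    start = time.time()
    model.fit([users, items], labels, batch_size=1024, epochs=1, verbose=0)
    print(f"{opt_name}: {time.time() - start:.1f}s per epoch")
```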
This came as a surprise to me, since one of the most celebrated properties of the Adam optimizer is its convergence speed, especially compared to SGD, yet here it is the per-epoch wall-clock time that blows up. What could be the reason for such a significant difference in computation speed?