For questions about optimization methods/algorithms (also known as optimizers) in the context of machine learning and other AI subfields. Examples of optimizers are plain (stochastic) gradient descent, Adam, SGD with momentum, Adagrad, and RMSprop.
Questions tagged [optimizers]
9 questions
9 votes, 1 answer
What is the formula for the momentum and Adam optimisers?
In the gradient descent algorithm, the formula to update the weight $w$, where $g$ is the partial derivative of the loss function with respect to $w$, is:
$$w \leftarrow w - r \times g$$
where $r$ is the learning rate.
What should be the formula for momentum…

Dee (1,283)
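For reference, one common formulation of these update rules (keeping the excerpt's notation of weight $w$, gradient $g$ and learning rate $r$; the hyperparameters $\gamma$, $\beta_1$, $\beta_2$ and $\epsilon$ are additions not mentioned in the excerpt) is the following. Momentum keeps a running velocity $v_t$:
$$v_t = \gamma v_{t-1} + r\, g_t, \qquad w \leftarrow w - v_t$$
Adam keeps exponential moving averages of the gradient and of its element-wise square, with bias correction:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad w \leftarrow w - r\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$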
2 votes, 1 answer
How do I use machine learning to create an optimization algorithm?
Let's say that I want to create an optimization algorithm that is supposed to find an optimum value for a given objective function. Creating an optimization algorithm to explore the search space can be quite challenging.
My question is:…

sherl.lol (23)
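One narrow, hypothetical reading of this question is to fix a parameterized update rule and tune ("learn") its settings over a distribution of training objectives, then reuse the tuned rule on new objectives. The NumPy sketch below only illustrates that reading (the random quadratic tasks and the random-search outer loop are made-up choices, not the broader learning-to-optimize literature):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task():
    """Sample a random quadratic objective f(x) = 0.5 * (x - c)^T A (x - c)."""
    c = rng.normal(size=2)
    A = np.diag(rng.uniform(0.5, 5.0, size=2))
    loss = lambda x: 0.5 * (x - c) @ A @ (x - c)
    grad = lambda x: A @ (x - c)
    return loss, grad

def run_optimizer(lr, momentum, grad, steps=50):
    """A parameterized update rule (SGD with momentum) whose settings are to be 'learned'."""
    x, v = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        v = momentum * v + lr * grad(x)
        x = x - v
    return x

# Outer loop: "learn" the optimizer by searching its hyperparameters so that it
# minimizes the average final loss over a distribution of training tasks.
train_tasks = [make_task() for _ in range(20)]
best, best_score = None, np.inf
for _ in range(200):
    lr, mom = rng.uniform(0.01, 1.0), rng.uniform(0.0, 0.99)
    score = np.mean([loss(run_optimizer(lr, mom, grad)) for loss, grad in train_tasks])
    if score < best_score:
        best, best_score = (lr, mom), score

print("learned (lr, momentum):", best)

# Apply the "learned" optimizer to a new, unseen objective.
test_loss, test_grad = make_task()
print("final loss on a new task:", test_loss(run_optimizer(*best, test_grad)))
```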
2 votes, 1 answer
Joined vs Separate optimizer for Actor-Critic
Say that I have a simple Actor-Critic architecture. (I am not familiar with TensorFlow, but) in PyTorch we need to specify the parameters when defining an optimizer (SGD, Adam, etc.), and therefore we can define 2 separate optimizers for the Actor and…

Sanyou (165)
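For the wiring the excerpt describes, a minimal PyTorch sketch of the options (the network sizes here are made up) could look like this:

```python
import itertools
import torch
import torch.nn as nn

# Hypothetical actor and critic networks, only to illustrate the optimizer wiring.
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

# Option 1: two separate optimizers, possibly with different learning rates.
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Option 2: a single joint optimizer over both parameter sets.
joint_opt = torch.optim.Adam(
    itertools.chain(actor.parameters(), critic.parameters()), lr=3e-4
)

# Option 3: a single optimizer with separate parameter groups,
# which still allows per-network settings such as the learning rate.
grouped_opt = torch.optim.Adam([
    {"params": actor.parameters(), "lr": 3e-4},
    {"params": critic.parameters(), "lr": 1e-3},
])
```

With separate optimizers, each `step()` only touches its own network's parameters; the joint variants update everything in one call.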
2 votes, 1 answer
What do we mean by "infrequent features"?
I am reading this blog post: https://ruder.io/optimizing-gradient-descent/index.html. In the section about AdaGrad, it says:
It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters…

ava_punksmash (133)
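A toy NumPy illustration of the "infrequent features" point: a parameter whose gradient is usually zero accumulates a small squared-gradient sum, so AdaGrad's per-parameter effective learning rate stays large for it. The every-20-steps schedule below is an arbitrary stand-in for a rarely occurring feature:

```python
import numpy as np

lr, eps, steps = 0.1, 1e-8, 100
G = np.zeros(2)                      # AdaGrad's per-parameter sum of squared gradients

for t in range(steps):
    g_frequent = 1.0                              # feature fires every step
    g_infrequent = 1.0 if t % 20 == 0 else 0.0    # feature fires rarely
    g = np.array([g_frequent, g_infrequent])
    G += g ** 2
    effective_lr = lr / (np.sqrt(G) + eps)        # per-parameter step size
    # the actual update would be: w -= effective_lr * g

print("effective step sizes after", steps, "steps:", effective_lr)
# The rarely updated parameter keeps a much larger effective learning rate.
```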
2 votes, 3 answers
What kind of optimizer is suggested for binary classification of similar images?
I have spent some time searching Google and wasn't able to find out what kind of optimization algorithm is best for binary classification when images are similar to one another.
I'd like to read some theoretical proofs (if any) to convince myself…

bit_scientist (241)
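There is generally no theoretical proof that one optimizer is best for a given image dataset; the usual practice is to compare a few candidates empirically. A minimal TensorFlow/Keras sketch of such a comparison (the tiny CNN and the random stand-in data are placeholders, not a recommendation):

```python
import numpy as np
import tensorflow as tf

# Random stand-in data: replace with the real (similar-looking) images.
x = np.random.rand(512, 32, 32, 1).astype("float32")
y = np.random.randint(0, 2, size=(512, 1))

def make_model():
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

for name, opt in [
    ("adam", tf.keras.optimizers.Adam(1e-3)),
    ("sgd+momentum", tf.keras.optimizers.SGD(1e-2, momentum=0.9)),
    ("rmsprop", tf.keras.optimizers.RMSprop(1e-3)),
]:
    model = make_model()
    model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])
    hist = model.fit(x, y, validation_split=0.2, epochs=3, verbose=0)
    print(name, "val_accuracy:", hist.history["val_accuracy"][-1])
```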
1 vote, 1 answer
What is uncentered variance and how does it become equal to the mean square in Adam?
I have been reading about Adam and AdamW (Here). The author mentioned that in "uncentered variance" we don't consider subtracting the mean.
In this statement, the author is talking about uncentered variance and how it becomes equal to the square of the…

learner (151)
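For the quantity the question asks about: the (centered) variance subtracts the mean, while the uncentered variance does not, which makes it exactly the mean of the squares:
$$\operatorname{Var}[g] = \mathbb{E}\big[(g - \mathbb{E}[g])^2\big] = \mathbb{E}[g^2] - (\mathbb{E}[g])^2, \qquad \text{uncentered variance} = \mathbb{E}[g^2].$$
Adam's second-moment estimate $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$ is an exponential moving average of $g_t^2$, i.e. an estimate of this uncentered quantity (the mean square of the gradient).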
1 vote, 0 answers
Why does the Adam optimizer work slower than Adagrad, Adadelta, and SGD for Neural Collaborative Filtering (NCF)?
I've been working on Neural Collaborative Filtering (NCF) recently to build a recommender system using TensorFlow Recommenders. Doing some hyperparameter tuning with different optimizers available in the module tf.keras.optimizers, I found out that…

bkaankuguoglu (111)
1 vote, 1 answer
In the update rule of RMSprop, do we divide by a matrix?
I've been trying to understand RMSprop for a long time, but there's something that keeps eluding me.
Here is a screenshot from this video by Andrew Ng.
From the element-wise comment, from what I understand, $dW$ and $db$ are matrices, so that must…

Uriyasama (11)
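A minimal NumPy sketch of the RMSprop update in that notation, with made-up shapes: $dW$ and $S_{dW}$ are matrices of the same shape as $W$, and every operation, including the division, is element-wise (entry by entry), not a matrix inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(3, 4))          # weight matrix
S_dW = np.zeros_like(W)              # running average of squared gradients
lr, beta, eps = 0.01, 0.9, 1e-8

for _ in range(10):
    dW = rng.normal(size=W.shape)                  # stand-in for a real gradient
    S_dW = beta * S_dW + (1 - beta) * dW ** 2      # element-wise square
    W -= lr * dW / (np.sqrt(S_dW) + eps)           # element-wise divide

# No matrix is ever inverted or divided by in the linear-algebra sense.
```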
0 votes, 2 answers
When training a DNN on infinite samples, do ADAM or other popular optimization algorithms still work as intended?
When training a DNN on infinite samples, do ADAM or other popular optimization algorithms still work as intended?
I have a DNN training on an infinite stream of samples that most likely won't repeat, so there is no real notion of an "epoch".
Now I…

dronus (101)
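The setting in this question can be sketched as a purely step-based loop: with an endless stream of fresh samples there are no epochs, so training and evaluation are organised around step counts. The PyTorch sketch below uses a made-up data generator as a stand-in for the real stream:

```python
import torch
import torch.nn as nn

def sample_batch(batch_size=64):
    """Hypothetical infinite stream: every batch is freshly generated, never repeated."""
    x = torch.randn(batch_size, 10)
    y = (x.sum(dim=1, keepdim=True) > 0).float()
    return x, y

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(1, 5001):
    x, y = sample_batch()
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:                 # periodic evaluation instead of per-epoch metrics
        with torch.no_grad():
            xv, yv = sample_batch(1024)
            print(f"step {step}: loss on fresh samples {loss_fn(model(xv), yv).item():.4f}")
```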