For questions about Adam, a gradient-based optimization algorithm widely used to train neural networks. It was proposed in the paper "Adam: A Method for Stochastic Optimization" (2014) by Diederik P. Kingma and Jimmy Ba.
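In outline, Adam keeps exponentially decaying averages of the gradient and of its square, and divides each step by the square root of the latter. A minimal plain-Python sketch of one update for a single scalar parameter, using the paper's default hyperparameters (illustrative only; real frameworks apply this element-wise to tensors):

```python
def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w with gradient g.

    m, v are the running first- and second-moment estimates;
    t is the 1-based step count, used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v
```

For example, iterating `adam_step` on $f(w) = w^2$ (gradient $2w$) drives $w$ toward 0, with each parameter's effective step size scaled by its own gradient history.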
Questions tagged [adam]
17 questions
9
votes
1 answer
What is the formula for the momentum and Adam optimisers?
In the gradient descent algorithm, the update for a weight $w$, where $g$ is the partial derivative of the loss function with respect to $w$, is:
$$w \leftarrow w - r \times g$$
where $r$ is the learning rate.
What should be the formula for momentum…

Dee
- 1,283
- 1
- 11
- 35
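For the classical momentum variant the question above asks about, the usual formulation adds a velocity term to the plain update (same notation: $r$ the learning rate, $g$ the gradient). A minimal illustrative sketch, not any particular framework's implementation:

```python
def sgd_momentum_step(w, g, velocity, lr=0.01, mu=0.9):
    """One SGD-with-momentum update for a scalar parameter w.

    velocity accumulates past gradients, decayed by the momentum
    coefficient mu, so steps build up along consistent directions.
    """
    velocity = mu * velocity - lr * g  # decay old velocity, add new gradient step
    w = w + velocity                   # move along the accumulated velocity
    return w, velocity
```

Running this on $f(w) = w^2$ (gradient $2w$) from $w = 1$ converges toward 0, typically with some overshoot and oscillation, which is characteristic of momentum.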
4
votes
1 answer
How does OpenAI-ES use Adam?
I just read that OpenAI's ES uses Adam: "OpenAI's ES is denoted as 'OptimES' (since it uses the Adam optimizer)". I verified this is correct using the link they posted (see es_distributed/Optimizers.py), but I don't understand how, because the paper…

profPlum
- 360
- 1
- 9
2
votes
0 answers
Do learning rate schedulers conflict with or prevent convergence of the Adam optimiser?
An article on https://spell.ml says
Because Adam manages learning rates internally, it's incompatible with most learning rate schedulers. Anything more complicated than simple learning warmup and/or decay will put the Adam optimizer to "complete"…

Jack G
- 21
- 3
2
votes
0 answers
How to decide if gradients are vanishing?
I am trying to debug a convolutional neural network. I am seeing gradients close to zero.
How can I decide whether these gradients are vanishing or not? Is there some threshold on the gradient values for deciding that they are vanishing?
I am getting…

pramesh
- 121
- 4
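There is no established absolute threshold for "vanishing"; one common heuristic is to log each layer's gradient norm per step and check whether early layers are orders of magnitude smaller than later ones. A framework-agnostic sketch, assuming you can extract each layer's gradient as a flat list of floats (the `1e-4` ratio below is an illustrative choice, not an established cutoff):

```python
import math

def grad_norms(named_grads):
    """Return the L2 norm of each layer's gradient.

    named_grads: dict mapping layer name -> flat list of gradient values.
    """
    return {name: math.sqrt(sum(g * g for g in grads))
            for name, grads in named_grads.items()}

def looks_vanishing(norms, ratio=1e-4):
    """Flag layers whose gradient norm is tiny relative to the largest norm."""
    largest = max(norms.values())
    return [name for name, n in norms.items() if n < ratio * largest]
```

Watching these ratios over training (rather than a single snapshot) is usually more informative, since healthy gradients also shrink as the loss flattens.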
2
votes
1 answer
How long should the state-dependent baseline for policy gradient methods be trained at each iteration?
How long should the state-dependent baseline be trained at each iteration? Or what baseline loss should we target at each iteration for use with policy gradient methods?
I'm using this equation to compute the policy gradient:
$$
\nabla_{\theta}…

junior-flight
- 33
- 3
2
votes
2 answers
Is the choice of the optimiser relevant when doing object detection?
Suppose that we have 4 types of dogs that we want to detect (Golden Retriever, Black Labrador, Cocker Spaniel, and Pit Bull). The training data consists of png images of a data set of dogs along with their annotations. We want to train a model using…

neeraj
- 31
- 3
1
vote
0 answers
Why are optimization algorithms for deep learning so simple?
From my knowledge, the most used optimizer in practice is Adam, which in essence is just mini-batch gradient descent with momentum to combat getting stuck in saddle points and with some damping to avoid wiggling back and forth if the conditioning of…

Moritz Groß
- 133
- 5
1
vote
1 answer
How many iterations of the optimisation algorithm are performed on each mini-batch in mini-batch gradient descent?
I understand the idea of mini-batch gradient descent for neural networks in that we calculate the gradient of the loss function using one mini-batch at a time and use this gradient to adjust the parameters.
My question is: how many times do we…

user50018
- 13
- 2
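In standard mini-batch gradient descent, each mini-batch produces exactly one parameter update: compute the average gradient over the batch, take one step, then move on. A small plain-Python sketch of one epoch (the `grad_fn` interface is hypothetical, standing in for backpropagation):

```python
def minibatch_gd_epoch(w, data, grad_fn, lr=0.1, batch_size=2):
    """One epoch of mini-batch GD: exactly ONE update per mini-batch.

    grad_fn(w, x) returns the gradient of the per-sample loss at w.
    """
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        g = sum(grad_fn(w, x) for x in batch) / len(batch)  # average gradient
        w = w - lr * g  # single step, then proceed to the next batch
    return w
```

For instance, with the per-sample loss $(w - x)^2$ (gradient $2(w - x)$) and all samples equal to 1, repeated epochs drive $w$ toward 1.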
1
vote
1 answer
Why does my model not improve when training with mini-batch gradient descent, while it does with Adam?
I am currently experimenting with the U-Net. I am doing semantic segmentation on the 2018 Data Science Bowl dataset from Kaggle without any data augmentation.
In my experiments, I am trying different hyper-parameters, like using Adam, mini-batch GD…

Bert Gayus
- 545
- 3
- 12
1
vote
1 answer
Why is Adam trapped in bad/suspicious local optima after the first few updates?
In the paper On the Variance of the Adaptive Learning Rate and Beyond, in section 2, the authors write
To further analyze this phenomenon, we visualize the histogram of the absolute value of gradients on a log scale in Figure 2. We observe that,…

AmorFati
- 11
- 2
1
vote
0 answers
What is the equation of the learning rate decay in the Adam optimiser?
Adam is known as an algorithm that has an adaptive learning rate for each parameter. I believe this is due to the division by the term $$v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2 $$ Hence, each weight will get updated differently based…

calveeen
- 1,251
- 7
- 17
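For reference, the full update from the Adam paper, consistent with the $v_t$ term quoted in the question above (with $\alpha$ the base step size and $\epsilon$ a small constant):
$$m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t, \qquad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
The division by $\sqrt{\hat{v}_t}$ is what gives each parameter its own effective step size; $\alpha$ itself is not decayed by the algorithm.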
1
vote
0 answers
Are there optimizers that schedule their learning rate, momentum etc. autonomously?
I'm aware there are some optimizers, such as Adam, that adjust the learning rate for each dimension during training. However, as far as I know, the maximum learning rate they can have is still determined by the user's input.
So, I wonder if there are optimizers…

SpiderRico
- 960
- 8
- 18
0
votes
0 answers
Longer DNN training times when using evolutionary algorithms
I am comparing my deep neural network (DNN) performance when using 2 types of optimizers: gradient-based Adam (properly tuned) and a population-based optimization algorithm (e.g., genetic algorithm (GA), PSO, etc.). My training dataset is of size…

knowledge_seeker
- 97
- 7
0
votes
0 answers
What's wrong with our loss and PyTorch?
Given the samples $\vec{x}_i \in \mathbb{R}^d, i \in \{1,\dots,l\}$, where $l$ is the number of training samples and $d$ is the number of input features, the related target values $y_i \in \mathbb{R}$, and the $l \times l$ matrix defined below:
$$ S_{i,j} = …

Filippo Portera
- 3
- 1
- 2
0
votes
2 answers
When training a DNN on infinite samples, do ADAM or other popular optimization algorithms still work as intended?
I have a DNN training on an infinite stream of samples that most likely won't repeat, so there is no real notion of an "epoch".
Now I…

dronus
- 101