For questions about Adam, a gradient-based optimization algorithm widely used to train neural networks. It was proposed in the paper "Adam: A Method for Stochastic Optimization" (2014) by Diederik P. Kingma and Jimmy Ba.
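In outline, Adam keeps exponentially decaying averages of the gradient and of its square, and divides each step by the square root of the latter. A minimal plain-Python sketch of one update for a single scalar parameter, using the paper's default hyperparameters (illustrative only; real frameworks apply this element-wise to tensors):

```python
def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w with gradient g.

    m, v are the running first- and second-moment estimates;
    t is the 1-based step count, used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v
```

For example, iterating `adam_step` on $f(w) = w^2$ (gradient $2w$) drives $w$ toward 0, with each parameter's effective step size scaled by its own gradient history.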
Questions tagged [adam]
17 questions
9
votes
1 answer
What is the formula for the momentum and Adam optimisers?
In the gradient descent algorithm, the update for a weight $w$, where $g$ is the partial derivative of the loss function with respect to $w$, is:
$$w \leftarrow w - r \times g$$
where $r$ is the learning rate.
What should be the formula for momentum…

Dee
- 1,283
- 1
- 11
- 35
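For the classical momentum variant the question above asks about, the usual formulation adds a velocity term to the plain update (same notation: $r$ the learning rate, $g$ the gradient). A minimal illustrative sketch, not any particular framework's implementation:

```python
def sgd_momentum_step(w, g, velocity, lr=0.01, mu=0.9):
    """One SGD-with-momentum update for a scalar parameter w.

    velocity accumulates past gradients, decayed by the momentum
    coefficient mu, so steps build up along consistent directions.
    """
    velocity = mu * velocity - lr * g  # decay old velocity, add new gradient step
    w = w + velocity                   # move along the accumulated velocity
    return w, velocity
```

Running this on $f(w) = w^2$ (gradient $2w$) from $w = 1$ converges toward 0, typically with some overshoot and oscillation, which is characteristic of momentum.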
4
votes
1 answer
How does OpenAI-ES use Adam?
I just read that OpenAI's ES uses Adam: "OpenAI's ES is denoted as 'OptimES' (since it uses the Adam optimizer)". I verified this is correct using the link they posted (see es_distributed/Optimizers.py), but I don't understand how, because the paper…

profPlum
- 360
- 1
- 9
2
votes
0 answers
Do learning rate schedulers conflict with or prevent convergence of the Adam optimiser?
An article on https://spell.ml says
Because Adam manages learning rates internally, it's incompatible with most learning rate schedulers. Anything more complicated than simple learning warmup and/or decay will put the Adam optimizer to "complete"…

Jack G
- 21
- 3
2
votes
0 answers
How to decide if gradients are vanishing?
I am trying to debug a convolutional neural network. I am seeing gradients close to zero.
How can I decide whether these gradients are vanishing or not? Is there some threshold on the gradient values for deciding that they are vanishing?
I am getting…

pramesh
- 121
- 4
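There is no established absolute threshold for "vanishing"; one common heuristic is to log each layer's gradient norm per step and check whether early layers are orders of magnitude smaller than later ones. A framework-agnostic sketch, assuming you can extract each layer's gradient as a flat list of floats (the `1e-4` ratio below is an illustrative choice, not an established cutoff):

```python
import math

def grad_norms(named_grads):
    """Return the L2 norm of each layer's gradient.

    named_grads: dict mapping layer name -> flat list of gradient values.
    """
    return {name: math.sqrt(sum(g * g for g in grads))
            for name, grads in named_grads.items()}

def looks_vanishing(norms, ratio=1e-4):
    """Flag layers whose gradient norm is tiny relative to the largest norm."""
    largest = max(norms.values())
    return [name for name, n in norms.items() if n < ratio * largest]
```

Watching these ratios over training (rather than a single snapshot) is usually more informative, since healthy gradients also shrink as the loss flattens.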
2
votes
1 answer
How long should the state-dependent baseline for policy gradient methods be trained at each iteration?
How long should the state-dependent baseline be trained at each iteration? Or what baseline loss should we target at each iteration for use with policy gradient methods?
I'm using this equation to compute the policy gradient:
$$
\nabla_{\theta}…

junior-flight
- 33
- 3
2
votes
2 answers
Is the choice of the optimiser relevant when doing object detection?
Suppose that we have 4 types of dogs that we want to detect (Golden Retriever, Black Labrador, Cocker Spaniel, and Pit Bull). The training data consists of png images of a data set of dogs along with their annotations. We want to train a model using…

neeraj
- 31
- 3
1
vote
0 answers
Why are optimization algorithms for deep learning so simple?
From my knowledge, the most used optimizer in practice is Adam, which in essence is just mini-batch gradient descent with momentum to combat getting stuck in saddle points and with some damping to avoid wiggling back and forth if the conditioning of…

Moritz Groß
- 133
- 5
1
vote
1 answer
How many iterations of the optimisation algorithm are performed on each mini-batch in mini-batch gradient descent?
I understand the idea of mini-batch gradient descent for neural networks in that we calculate the gradient of the loss function using one mini-batch at a time and use this gradient to adjust the parameters.
My question is: how many times do we…

user50018
- 13
- 2
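In standard mini-batch gradient descent, each mini-batch produces exactly one parameter update: compute the average gradient over the batch, take one step, then move on. A small plain-Python sketch of one epoch (the `grad_fn` interface is hypothetical, standing in for backpropagation):

```python
def minibatch_gd_epoch(w, data, grad_fn, lr=0.1, batch_size=2):
    """One epoch of mini-batch GD: exactly ONE update per mini-batch.

    grad_fn(w, x) returns the gradient of the per-sample loss at w.
    """
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        g = sum(grad_fn(w, x) for x in batch) / len(batch)  # average gradient
        w = w - lr * g  # single step, then proceed to the next batch
    return w
```

For instance, with the per-sample loss $(w - x)^2$ (gradient $2(w - x)$) and all samples equal to 1, repeated epochs drive $w$ toward 1.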
1
vote
1 answer
Why does my model not improve when training with mini-batch gradient descent, while it does with Adam?
I am currently experimenting with the U-Net. I am doing semantic segmentation on the 2018 Data Science Bowl dataset from Kaggle without any data augmentation.
In my experiments, I am trying different hyper-parameters, like using Adam, mini-batch GD…

Bert Gayus
- 545
- 3
- 12
1
vote
1 answer
Why is Adam trapped in bad/suspicious local optima after the first few updates?
In the paper On the Variance of the Adaptive Learning Rate and Beyond, in section 2, the authors write
To further analyze this phenomenon, we visualize the histogram of the absolute value of gradients on a log scale in Figure 2. We observe that,…

AmorFati
- 11
- 2
1
vote
0 answers
What is the equation of the learning rate decay in the Adam optimiser?
Adam is known as an algorithm that has an adaptive learning rate for each parameter. I believe this is due to the division by the term $$v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2 $$ Hence, each weight will get updated differently based…

calveeen
- 1,251
- 7
- 17
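For reference, the full update from the Adam paper, consistent with the $v_t$ term quoted in the question above (with $\alpha$ the base step size and $\epsilon$ a small constant):
$$m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t, \qquad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
The division by $\sqrt{\hat{v}_t}$ is what gives each parameter its own effective step size; $\alpha$ itself is not decayed by the algorithm.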
1
vote
0 answers
Are there optimizers that schedule their learning rate, momentum etc. autonomously?
I'm aware there are some optimizers, such as Adam, that adjust the learning rate for each dimension during training. However, as far as I know, the maximum learning rate they can have is still determined by the user's input.
So, I wonder if there are optimizers…

SpiderRico
- 960
- 8
- 18
0
votes
0 answers
Longer DNN training times when using evolutionary algorithms
I am comparing my deep neural network (DNN) performance when using 2 types of optimizers: gradient-based Adam (properly tuned) and a population-based optimization algorithm (e.g., genetic algorithm (GA), PSO, etc.). My training dataset is of size…

knowledge_seeker
- 97
- 7
0
votes
0 answers
What's wrong with our loss and PyTorch?
Given the samples $\vec{x}_i \in \mathbb{R}^d, i \in \{1,\dots,l\}$, where $l$ is the number of training samples and $d$ is the number of input features, the related target values $y_i \in \mathbb{R}$, and the $l \times l$ matrix defined below:
$$ S_{i,j} = …

Filippo Portera
- 3
- 1
- 2
0
votes
2 answers
When training a DNN on infinite samples, do ADAM or other popular optimization algorithms still work as intended?
I have a DNN training on an infinite stream of samples that most likely won't repeat, so there is no real notion of an "epoch".
Now I…

dronus
- 101