
I understand the idea of mini-batch gradient descent for neural networks: we calculate the gradient of the loss function using one mini-batch at a time and use this gradient to adjust the parameters.

My question is: how many times do we adjust the parameters per mini-batch, i.e. how many optimisation iterations are performed on a mini-batch?

The fact that I can't find anything about this in the TensorFlow documentation suggests to me that the answer is just 1 iteration per mini-batch. If this assumption is correct, then how does an optimisation algorithm like Adam, which uses past gradient information, work? It seems strange that gradients from past mini-batches would be used to minimise the loss of the current mini-batch.

user50018

1 Answer


How many optimisation iterations are performed on a mini-batch?

Just one, as you suspected.
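
For concreteness, here is a minimal sketch of a custom TensorFlow training loop (the tiny `Dense` model, random data and mini-batch size below are placeholders, not anything from the question). Each mini-batch results in exactly one gradient computation and one call to `optimizer.apply_gradients`:

```python
import tensorflow as tf

# Toy stand-ins: a one-layer model and random data split into mini-batches of 32.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 4]), tf.random.normal([256, 1]))).batch(32)

for x_batch, y_batch in dataset:                      # loop over mini-batches
    with tf.GradientTape() as tape:
        loss = loss_fn(y_batch, model(x_batch, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Exactly one parameter update per mini-batch:
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```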

then how does an optimisation algorithm like Adam, which uses past gradient information, work?

It uses the gradient estimates from each mini-batch as its input sequence.

It seems strange that gradients from past mini-batches would be used to minimise the loss of the current mini-batch.

Each mini-batch generates a gradient which is an estimate of the true gradient of the loss function over the whole dataset. There can be a lot of randomness in this, depending on the problem and the size of the mini-batch, but the expected value of the gradient for any fairly sampled mini-batch is the mean gradient over the whole dataset.
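
As a quick illustration of that claim (a toy linear model with a squared-error loss and a fixed parameter vector, purely for demonstration), the gradients from equally sized mini-batches that partition the data average exactly to the full-dataset gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(1000, 5)), rng.normal(size=1000), np.zeros(5)

def grad(Xb, yb, w):
    # Gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_gradient = grad(X, y, w)
minibatch_gradients = [grad(X[i:i + 100], y[i:i + 100], w) for i in range(0, 1000, 100)]
# The mean of the mini-batch gradients equals the full-dataset gradient
print(np.allclose(np.mean(minibatch_gradients, axis=0), full_gradient))  # True
```

With randomly sampled (rather than partitioned) mini-batches the equality only holds in expectation, which is the "fairly sampled" condition above.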

As there is an update between each mini-batch, the expected gradient changes as well: the new parameters produce a new loss value at a new location in the parameter space that is being searched for optimal values.

Adam is not intended to calculate a single true gradient for the whole dataset or population. If you wanted that outcome, then you could use the entire dataset for each step instead of mini-batches. Instead, Adam applies some extra processing, based on running estimates of the gradient's first and second moments, to each update step:

  • Adam does not take a direct steepest-descent step, but normalises the step for each parameter against a rolling average of the squared gradients seen so far (see the sketch below).

  • Adam applies a form of momentum to update steps.

These manipulations of gradient data appear to work well in practice when optimising parameters over complex loss function spaces with many ridges, valleys and saddle points.
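
A minimal sketch of the standard Adam update (with the usual published default hyperparameters) makes both bullet points concrete: `m` is the rolling average of gradients that provides momentum, and `v` is the rolling average of squared gradients used to normalise the step for each parameter:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `grad` is the current mini-batch gradient and `t` counts updates from 1."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: rolling average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # rolling average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero-initialised averages
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # normalised, momentum-smoothed step
    return theta, m, v
```

Nothing in this update distinguishes a mini-batch gradient from a full-dataset gradient; the rolling averages simply smooth whatever sequence of gradient estimates they are fed, one per parameter update.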

Neil Slater