
When gradients are aggregated over mini-batches, I sometimes see formulations like the following, e.g., in the "Deep Learning" book by Goodfellow et al.:

$$\mathbf{g} = \frac{1}{m} \nabla_{\mathbf{w}} \left( \sum\limits_{i=1}^{m} L \left( f \left( \mathbf{x}^{(i)}, \mathbf{w} \right), y^{(i)} \right) \right)$$

This is mathematically equivalent to

$$\mathbf{g} = \frac{1}{m} \left( \sum\limits_{i=1}^{m} \nabla_{\mathbf{w}} L \left( f \left( \mathbf{x}^{(i)}, \mathbf{w} \right), y^{(i)} \right) \right)$$

obtained by moving the gradient operator inside/outside the sum, which is valid because differentiation is linear.
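
To convince myself of the equivalence, here is a quick numerical check (my own toy example, using JAX and a linear model with squared loss; none of these names come from the book):

```python
import jax
import jax.numpy as jnp

# Toy setup: f(x, w) = w . x with squared loss L(f, y) = (f - y)^2.
# (All names here are my own choices for illustration.)
def loss(w, x, y):
    return (jnp.dot(w, x) - y) ** 2

m, d = 8, 3                                    # batch size, input dimension
kx, ky, kw = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(kx, (m, d))              # inputs x^(i), stacked as rows
Y = jax.random.normal(ky, (m,))                # targets y^(i)
w = jax.random.normal(kw, (d,))                # parameters

# First equation: differentiate the *summed* loss once, then scale by 1/m.
g_first = jax.grad(lambda w: jnp.sum((X @ w - Y) ** 2))(w) / m

# Second equation: per-example gradients, averaged afterwards.
g_second = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))(w, X, Y).mean(axis=0)

print(jnp.allclose(g_first, g_second))         # True: both give the same vector
```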

But I was wondering: why would one prefer the first representation?

My thoughts, and please correct me if I am wrong:

  • When performing mini-batch gradient descent in practice, we process one example after the other and compute the corresponding gradient every time. So the second equation above better represents what is really happening (see the short loop sketched after this list)

  • Even more, there seems to be no alternative in practice: I cannot compute all the losses first, sum them up, and obtain just one gradient afterwards. So the first equation (although mathematically correct and equivalent) might even be somewhat misleading?

  • The only reason I can imagine for choosing the first form is to express more clearly that there is one (aggregated, mean) gradient based on one data set
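
For completeness, the loop I have in mind in the first bullet would look something like this (continuing the same toy loss; again, all names are my own illustration):

```python
import jax
import jax.numpy as jnp

# Same toy loss as above: f(x, w) = w . x with squared loss.
def loss(w, x, y):
    return (jnp.dot(w, x) - y) ** 2

m, d = 8, 3
kx, ky, kw = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(kx, (m, d))
Y = jax.random.normal(ky, (m,))
w = jax.random.normal(kw, (d,))

# Process one example after the other, accumulate the per-example
# gradients, and average at the end: the second equation, read literally.
g = jnp.zeros_like(w)
for i in range(m):
    g = g + jax.grad(loss)(w, X[i], Y[i])
g = g / m
```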

Any mistakes here on my side?
