
In backpropagation, the gradients are used to update the weights via $$w = w - \alpha \frac{dL}{dw},$$ and the gradient of the loss with respect to the weights is $$\frac{dL}{dw} = \frac{dL}{dz} \frac{dz}{dw} = \left(\frac{dL}{da} \frac{da}{dz} \frac{1}{m}\right) \frac{dz}{dw}.$$

Why is the $\frac{1}{m}$ term there? Does the batch size matter, and what if it is 1?
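
For concreteness, here is a minimal NumPy sketch of the setup I have in mind (the variable names, the sigmoid activation, and the mean cross-entropy loss are my own illustration, not taken from the video), where the loss is averaged over m examples and that average is where I think the $\frac{1}{m}$ might come from:

```python
import numpy as np

# Minimal sketch: one linear layer + sigmoid + mean binary cross-entropy.
# Assumption (mine, not from the video): the 1/m factor comes from averaging
# the loss over the m examples in the batch.

rng = np.random.default_rng(0)
m, n_features = 4, 3                    # batch size and input size (arbitrary)
X = rng.normal(size=(n_features, m))    # inputs, one column per example
y = rng.integers(0, 2, size=(1, m))     # binary targets
w = rng.normal(size=(1, n_features))
b = 0.0

# Forward pass
z = w @ X + b                           # pre-activations, shape (1, m)
a = 1.0 / (1.0 + np.exp(-z))            # sigmoid activations

# Loss averaged over the batch: L = -(1/m) * sum(y*log(a) + (1-y)*log(1-a))
# Backward pass: dL/dz = (a - y) / m  <- the 1/m shows up here because the
# loss is a mean, so each example contributes 1/m of the total gradient.
dz = (a - y) / m
dw = dz @ X.T                           # dL/dw, shape (1, n_features)

# Gradient-descent update, matching w = w - alpha * dL/dw
alpha = 0.1
w = w - alpha * dw
```

With m = 1 the averaging factor is just 1, so, if this reading is right, the formula reduces to the plain single-example gradient.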

rkuang25
  • Could you post a link to the book/blog post from which you got the formula? m should just be the total number of weights, but the notation is not consistent across the literature, so to be sure I should check your source. – Edoardo Guerriero Jul 25 '22 at 08:19
  • Sorry, I forgot to mention that m is the number of features in the last layer's output. Here's the video link: https://youtu.be/yXcQ4B-YSjQ?list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&t=780 but he writes it in an abbreviated way. – rkuang25 Jul 25 '22 at 22:43

0 Answers