I am differentiating the cross-entropy cost function with respect to different variables, $Z$, $W$ and $b$, at different points. Please refer to the image below for the calculation.

[image]

As far as I can tell, my derivation is correct for $dZ, dW, db$ and $dA$. However, in Andrew Ng's Coursera material I see an extra $\frac{1}{m}$ for $dW$ and $db$, but no $\frac{1}{m}$ in $dZ$. The left side of Andrew's slides shows the derivatives, whereas the right side shows the corresponding NumPy implementation.

[image]

Can someone please explain why there is:

1) a $\frac{1}{m}$ factor in $dW^{[2]}$ and $db^{[2]}$ in the NumPy representation in Andrew's slides

2) no $\frac{1}{m}$ factor in $dZ^{[2]}$ in either the normal or the NumPy representation in Andrew's slides.

Am I missing something or doing it in the wrong way?

nbro
learner

1 Answer

TL;DR: This comes down to the way A. Ng has defined backpropagation for the course.

Left Column

The left column is written with respect to a single input example, so $m = 1$ and the $\frac{1}{m}$ factor reduces to 1 and can be omitted. He uses lower case for quantities belonging to one input example (e.g. a vector $dz$) and upper case for a (mini-)batch (e.g. a matrix $dZ$).

The $\frac{1}{m}$ factors in $dW,db$

In this definition of backprop, he "defers" multiplying by the $\frac{1}{m}$ factor until $dW,db$ rather than "absorbing" it into $dZ^{[2]}$. That is, the $dZ^{[2]}$ term is defined in a way that it does not have $\frac{1}{m}$.
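As a sketch of why this works, assume the cost is the average of per-example losses, $J = \frac{1}{m}\sum_{i=1}^{m} L^{(i)}$, with a sigmoid output layer and cross-entropy loss (these assumptions match the usual setup of the course, but are not visible in the text of the question). Each column of $dZ^{[2]}$ is then the gradient of one example's loss $L^{(i)}$, not of $J$, and the $\frac{1}{m}$ only appears once the columns are averaged:

$$
dz^{[2](i)} = \frac{\partial L^{(i)}}{\partial z^{[2](i)}} = a^{[2](i)} - y^{(i)},
\qquad
dZ^{[2]} = \left[\, dz^{[2](1)} \;\cdots\; dz^{[2](m)} \,\right] = A^{[2]} - Y
$$

$$
dW^{[2]} = \frac{\partial J}{\partial W^{[2]}}
= \frac{1}{m}\sum_{i=1}^{m} dz^{[2](i)}\, a^{[1](i)T}
= \frac{1}{m}\, dZ^{[2]} A^{[1]T},
\qquad
db^{[2]} = \frac{\partial J}{\partial b^{[2]}}
= \frac{1}{m}\sum_{i=1}^{m} dz^{[2](i)}
$$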

Observe that if you move the $\frac{1}{m}$ factor into the definition of $dZ^{[2]}$ and remove it from the definitions of $dW, db$, you still come out with the same values for all $dW, db$.
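The equivalence can be checked numerically. The following is a minimal sketch, not code from the course: the array sizes, the random data and the variable names (`dZ2_a`, `dZ2_b`, and so on) are made up for illustration, and a sigmoid output with cross-entropy loss is assumed so that $dZ^{[2]} = A^{[2]} - Y$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_h = 5, 4  # hypothetical sizes: 5 examples, 4 hidden units

A1 = rng.standard_normal((n_h, m))                 # hidden activations, one column per example
A2 = rng.uniform(0.05, 0.95, size=(1, m))          # stand-in for sigmoid outputs
Y = rng.integers(0, 2, size=(1, m)).astype(float)  # binary labels

# Convention A (as on the slides): dZ2 carries no 1/m,
# the factor is applied when forming dW2 and db2.
dZ2_a = A2 - Y
dW2_a = (1 / m) * dZ2_a @ A1.T
db2_a = (1 / m) * np.sum(dZ2_a, axis=1, keepdims=True)

# Convention B: fold 1/m into dZ2 and drop it from dW2 and db2.
dZ2_b = (A2 - Y) / m
dW2_b = dZ2_b @ A1.T
db2_b = np.sum(dZ2_b, axis=1, keepdims=True)

# Both conventions give the same parameter gradients.
same_dW = np.allclose(dW2_a, dW2_b)
same_db = np.allclose(db2_a, db2_b)
print(same_dW, same_db)
```

Note that under convention B the factor also propagates backwards: $dZ^{[1]}$ is linear in $dZ^{[2]}$, so it already contains the $\frac{1}{m}$, and $dW^{[1]}, db^{[1]}$ must then drop their own $\frac{1}{m}$ as well.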

Speculation

This "deferred" multiplication might have to do with numerical stability, or it might simply be a stylistic choice made by A. Ng. It might also prevent one from "accidentally" multiplying by $\frac{1}{m}$ more than once.

respectful
  • Thanks for clearing up the confusion. Can you please let me know where I can read about numerical stability, and why following my approach (listed in image 1) would cause numerical instability? – learner Mar 19 '20 at 04:56
  • @user110244 I was only speculating as to why A. Ng presents backprop in this way. To better understand numerical stability I'd recommend searching around on the Mathematics Stack Exchange; surely there will be an answer better than any I could provide here. – respectful Mar 19 '20 at 23:43