Why is my derivation of the back-propagation equations inconsistent with Andrew Ng's slides from Coursera?

Question

I am differentiating the cross-entropy cost function with respect to $Z$, $W$ and $b$ at different layers. Please refer to the image below for the calculation.

As far as I can tell, my derivation is correct for $dZ, dW, db$ and $dA$. However, in Andrew Ng's Coursera material I see an extra $\frac{1}{m}$ in $dW$ and $db$, but no $\frac{1}{m}$ in $dZ$. The left side of Andrew's slides shows the derivatives, whereas the right side shows the corresponding NumPy implementation.

Can someone please explain why there is:

1) a $\frac{1}{m}$ in $dW^{[2]}$ and $db^{[2]}$ in the NumPy representation on Andrew's slides

2) no $\frac{1}{m}$ in $dZ^{[2]}$ on Andrew's slides, in either the mathematical or the NumPy representation.

Am I missing something or doing it in the wrong way?

respectful · Answer 1 · 2020-03-19T02:06:31.893

TL;DR: This has to do with the way A. Ng has defined back propagation for the course.

Left Column

The left column is written with respect to a single input example, so $m = 1$, the $\frac{1}{m}$ factor reduces to 1, and it can be omitted. He uses lower case for quantities that refer to one input example (e.g. a vector $dz$) and upper case for quantities that refer to a (mini-)batch (e.g. a matrix $dZ$).

The $\frac{1}{m}$ factors in $dW,db$

In this definition of backprop, he "defers" multiplying by the $\frac{1}{m}$ factor until $dW,db$ rather than "absorbing" it into $dZ^{[2]}$. That is, the $dZ^{[2]}$ term is defined in a way that it does not have $\frac{1}{m}$.
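To make the convention concrete, here is a sketch of the layer-2 quantities. It assumes a sigmoid output unit and the binary cross-entropy loss, which is the usual setup in the course; the symbol $\mathcal{L}^{(i)}$ for the loss on example $i$ is notation introduced here for illustration.

$$J = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}^{(i)}, \qquad \mathcal{L}^{(i)} = -\left(y^{(i)}\log a^{[2](i)} + \left(1-y^{(i)}\right)\log\left(1-a^{[2](i)}\right)\right)$$

$$dZ^{[2]} = A^{[2]} - Y, \qquad dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}, \qquad db^{[2]} = \frac{1}{m}\sum_{i=1}^{m} dz^{[2](i)}$$

Under these assumptions, column $i$ of $dZ^{[2]}$ is $\frac{\partial \mathcal{L}^{(i)}}{\partial z^{[2](i)}}$, the gradient of the per-example loss, not $\frac{\partial J}{\partial z^{[2](i)}}$; the two differ by exactly the factor $\frac{1}{m}$. The averaging over examples is applied only once the per-example terms are summed into $dW^{[2]}$ and $db^{[2]}$, which are true gradients of $J$.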

Observe that if you move the $\frac{1}{m}$ factor into the definition of $dZ^{[2]}$ and remove it from the definitions of $dW,db$, you still get the same values for all $dW,db$.
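The equivalence can be checked numerically. The following is a minimal sketch, not the course's code: it assumes a two-layer network with a tanh hidden layer and a sigmoid output, and the sizes, random data and variable suffixes (`_a`, `_b`) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5                       # hypothetical sizes: inputs, hidden units, examples

X = rng.standard_normal((n_x, m))           # one column per example
Y = rng.integers(0, 2, (1, m)).astype(float)
W1 = rng.standard_normal((n_h, n_x))
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h))
b2 = np.zeros((1, 1))

# Forward pass (assumed architecture: tanh hidden layer, sigmoid output).
A1 = np.tanh(W1 @ X + b1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

# Convention A: dZ carries no 1/m; the factor is applied in dW and db.
dZ2_a = A2 - Y
dW2_a = (1 / m) * (dZ2_a @ A1.T)
db2_a = (1 / m) * np.sum(dZ2_a, axis=1, keepdims=True)
dZ1_a = (W2.T @ dZ2_a) * (1 - A1 ** 2)      # tanh'(z) = 1 - tanh(z)^2
dW1_a = (1 / m) * (dZ1_a @ X.T)
db1_a = (1 / m) * np.sum(dZ1_a, axis=1, keepdims=True)

# Convention B: 1/m is absorbed into dZ2 once and never applied again.
dZ2_b = (1 / m) * (A2 - Y)
dW2_b = dZ2_b @ A1.T
db2_b = np.sum(dZ2_b, axis=1, keepdims=True)
dZ1_b = (W2.T @ dZ2_b) * (1 - A1 ** 2)      # the factor propagates through dZ1
dW1_b = dZ1_b @ X.T
db1_b = np.sum(dZ1_b, axis=1, keepdims=True)

pairs = [(dW2_a, dW2_b), (db2_a, db2_b), (dW1_a, dW1_b), (db1_a, db1_b)]
all_equal = all(np.allclose(a, b) for a, b in pairs)
print(all_equal)
```

In this sketch only the intermediate $dZ$ matrices differ between the two conventions (by the factor $\frac{1}{m}$); the parameter gradients used in the update step agree up to floating-point rounding.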

Speculation

This "deferred" multiplication might have to do with numerical stability, or it might simply be a stylistic choice made by A. Ng. It might also prevent one from "accidentally" multiplying by $\frac{1}{m}$ more than once.

Thanks for clearing up the confusion. Can you please let me know where I can read about this numerical stability issue, and why following my approach (listed in image 1) would cause numerical stability problems? — learner, Mar 19 '20 at 04:56
@user110244 I was only speculating as to why A. Ng presents back prop in this way. To better understand numerical stability I'd recommend searching around on the math stack exchange - surely there will be an answer better than any I could provide here. — respectful, Mar 19 '20 at 23:43
