
In the regularized cost function, an L2 regularization term has been added. [image: the regularized cost function]

Here we have already calculated the derivatives of the cross-entropy cost w.r.t. $A$ and $W$.

As mentioned in the regularization notebook (see below), when deriving the gradient of the regularized cost $J$, the changes only concern $dW^{[1]}$, $dW^{[2]}$ and $dW^{[3]}$; for each, you have to add the regularization term's gradient. (So there is no impact on $dA^{[2]}$, $db^{[2]}$, $dA^{[1]}$ and $db^{[1]}$?)

[image: the notebook's regularized backpropagation equations]
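For reference, here is my own working (not from the notebook) of the gradient of the L2 term alone, writing the penalty in the standard form with regularization strength $\lambda$ and batch size $m$:

$$J_{L2} = \frac{\lambda}{2m} \sum_{l} \left\lVert W^{[l]} \right\rVert_F^2 \quad\implies\quad \frac{\partial J_{L2}}{\partial W^{[l]}} = \frac{\lambda}{m} W^{[l]}, \qquad \frac{\partial J_{L2}}{\partial A^{[l]}} = \frac{\partial J_{L2}}{\partial b^{[l]}} = 0,$$

since $J_{L2}$ is a function of the weights only and does not depend on any activation or bias.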

But when I derive it using the chain rule, I get changed values for $dA^{[2]}$, $dZ^{[2]}$, $dA^{[1]}$, $dW^{[1]}$ and $db^{[1]}$.

Please see below how I calculated this: [image: my chain-rule derivation]

Can someone explain why I am getting different results?

What is the derivative of the L2 regularization term w.r.t. $A^{[2]}$, i.e. its contribution to $dA^{[2]}$? (in equation 1)

So my questions are:

1) What is the derivative of the L2 regularization cost w.r.t. $A^{[2]}$ (its contribution to $dA^{[2]}$)?

2) Why does adding the regularization term not affect $dA^{[2]}$, $db^{[2]}$, $dA^{[1]}$ and $db^{[1]}$ (i.e. the $dA$'s and $db$'s), but change the $dW$'s?
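To convince myself, I wrote a small numerical check (my own sketch, not from the course notebook): differentiate the L2 penalty $\frac{\lambda}{2m}\sum_l \lVert W^{[l]}\rVert_F^2$ alone and compare the claimed analytic gradient $\frac{\lambda}{m}W$ against a finite-difference gradient. The values of `m`, `lambd`, and the weight shape are arbitrary choices for illustration.

```python
import numpy as np

# The L2 penalty depends only on the weights, so its gradient w.r.t. W
# should be (lambda / m) * W, and it contributes nothing to dA or db.
rng = np.random.default_rng(0)
m, lambd = 5, 0.7          # arbitrary batch size and regularization strength
W = rng.standard_normal((3, 4))

def l2_cost(W):
    # L2 penalty for a single weight matrix: (lambda / (2m)) * ||W||_F^2
    return (lambd / (2 * m)) * np.sum(W ** 2)

# Claimed analytic gradient of the L2 term alone
dW_analytic = (lambd / m) * W

# Central-difference numerical gradient, entry by entry
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_numeric[i, j] = (l2_cost(Wp) - l2_cost(Wm)) / (2 * eps)

# Maximum discrepancy between analytic and numerical gradients (tiny)
print(np.max(np.abs(dW_analytic - dW_numeric)))
```

Since `l2_cost` never touches any activation $A$ or bias $b$, perturbing those would leave the cost unchanged, which is why only the $dW$ formulas pick up an extra term.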

learner
  • It's a good thing you are trying to derive the equations yourself, but it is quite difficult for someone to debug your equations for you. So I would suggest you try to find the mistake yourself; it'll prepare you for future mistakes you are going to make. As for your question, the $L2$ term is w.r.t. the weights and hence its impact is directly on the weights; you don't have to go through the chain rule of activations to reach a weight. All weights get the same weightage in the L2 term. A better way to derive this would be to forget the cross-entropy term and make your cost just the L2 part. –  Mar 24 '20 at 09:30
  • This will result in a cost that depends only on the squares of the weights, whose derivative w.r.t. the weights you can easily take... there are no interfering terms of $dA$ and other stuff. The cost function $J_{reg}$ is modularized/independent. So try to find the derivatives of the cross-entropy and L2 parts individually like I suggested above and just add them. This is the most cited tutorial (as far as I have seen) for backprop: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ –  Mar 24 '20 at 09:32
  • @DuttaA One of my questions is why we need to avoid the chain rule when taking the derivative of the regularized cost function, especially for the regularization part. I see that if I separate the cost function into two parts, where part_1 is the cross_entropy cost and part_2 is the l2_regularization cost, and then calculate the derivatives, for part_1 we can use the chain rule, and for part_2 we don't need to, since the regularization's impact on $dA$ is zero; then we can easily arrive at the derivative. – learner Mar 24 '20 at 09:46

0 Answers