The MSE can be defined as $(\hat{y} - y)^2$, which should be equal to $(y - \hat{y})^2$, but I think their derivatives are different, so I am confused about which derivative to use when computing my gradient. Can someone explain which term to use?
3 Answers
The derivative of $\mathcal{L_1}(y, x) = (\hat{y} - y)^2 = (f(x) - y)^2$ with respect to $\hat{y}$, where $f$ is the model and $\hat{y} = f(x)$ is the output of the model, is
\begin{align} \frac{d}{d \hat{y}} \mathcal{L_1} &= \frac{d}{d \hat{y}} (\hat{y} - y)^2 \\ &= 2(\hat{y} - y) \frac{d}{d \hat{y}} (\hat{y} - y) \\ &= 2(\hat{y} - y) (1) \\ &= 2(\hat{y} - y) \end{align}
The derivative of $\mathcal{L_2}(y, x) = (y - \hat{y})^2 = (y - f(x))^2$ w.r.t $\hat{y}$ is
\begin{align} \frac{d}{d \hat{y}} \mathcal{L_2} &= \frac{d}{d \hat{y}} (y - \hat{y})^2 \\ &= 2(y -\hat{y}) \frac{d}{d \hat{y}} (y -\hat{y}) \\ &= 2(y - \hat{y})(-1)\\ &= -2(y - \hat{y})\\ &= 2(\hat{y} - y) \end{align}
So, the derivatives of $\mathcal{L_1}$ and $\mathcal{L_2}$ are the same.
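As a quick sanity check of the result above, here is a minimal sketch that compares a central finite-difference estimate of both derivatives against the analytic value $2(\hat{y} - y)$. The function names (`loss_1`, `loss_2`, `numerical_derivative`) and the sample values are assumptions chosen for illustration only.

```python
def loss_1(y_hat, y):
    # L1 = (y_hat - y)^2
    return (y_hat - y) ** 2


def loss_2(y_hat, y):
    # L2 = (y - y_hat)^2
    return (y - y_hat) ** 2


def numerical_derivative(loss, y_hat, y, eps=1e-6):
    # Central finite difference with respect to y_hat, holding y fixed.
    return (loss(y_hat + eps, y) - loss(y_hat - eps, y)) / (2 * eps)


# Arbitrary sample values (hypothetical prediction and target).
y_hat, y = 3.0, 5.0

analytic = 2 * (y_hat - y)  # the closed form derived above
d1 = numerical_derivative(loss_1, y_hat, y)
d2 = numerical_derivative(loss_2, y_hat, y)

print(analytic, d1, d2)
```

All three printed numbers should agree up to floating-point error, whichever form of the loss is used.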

The MSE can be defined as $(\hat{y} - y)^2$, which should be equivalent to $(y - \hat{y})^2$
They are not just "equivalent". They are actually the exact same function, written in two different ways.
$$(\hat{y} - y)^2 = (\hat{y} - y)(\hat{y} - y) = \hat{y}^2 -2\hat{y}y + y^2$$
$$(y - \hat{y})^2 = (y -\hat{y})(y - \hat{y}) = y^2 -2y\hat{y} + \hat{y}^2$$
These are exactly the same function. Not just "equivalent" or "equivalent everywhere", but actually the same function. It is therefore no surprise that any derivative is also the same - including the partial derivative with respect to $\hat{y}$ which is what you typically use to drive gradient descent.
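To make that concrete, differentiating the shared expanded form term by term, treating $y$ as fixed, gives one result that covers both ways of writing the square:

$$\frac{\partial}{\partial \hat{y}} \left( \hat{y}^2 - 2\hat{y}y + y^2 \right) = 2\hat{y} - 2y = 2(\hat{y} - y)$$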
There are two ways of writing the function because it is a square and thus has two factorisations. When you write it as a square, you can choose which form to use for the inner term.
Which function [form] should I use to compute the gradient?
You can use either form; it does not matter. They represent the same function and have the same gradient.
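To illustrate, here is a minimal sketch of gradient descent run twice, once with the gradient derived from each form. It assumes a hypothetical one-parameter model $\hat{y} = wx$ and a small made-up dataset; the names `grad_form_1`, `grad_form_2`, and `fit` are illustrative, not from any library.

```python
def grad_form_1(w, x, y):
    # Gradient of (y_hat - y)^2 w.r.t. w, with y_hat = w * x (chain rule).
    y_hat = w * x
    return 2 * (y_hat - y) * x


def grad_form_2(w, x, y):
    # Gradient of (y - y_hat)^2 w.r.t. w, with y_hat = w * x (chain rule).
    y_hat = w * x
    return -2 * (y - y_hat) * x


def fit(grad_fn, xs, ys, lr=0.01, steps=200):
    # Plain batch gradient descent on the mean of the per-sample gradients.
    w = 0.0
    for _ in range(steps):
        g = sum(grad_fn(w, x, y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * g
    return w


# Made-up data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w1 = fit(grad_form_1, xs, ys)
w2 = fit(grad_form_2, xs, ys)

print(w1, w2)
```

Both runs should end at essentially the same weight, close to 2, because the two gradient expressions are algebraically identical.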

The derivative is the same as far as I understand it.
If $y$ is constant and $\hat{y}$ is the variable, the result will be:
$((\hat{y} - y)^2)' = 2(\hat{y} - y)$
and for the other formula:
$((y - \hat{y})^2)' = -2(y - \hat{y}) = 2(\hat{y} - y)$
which is the same.

I can see that the second equation can only be derived if you have taken the correct route for the partial derivative (I commented earlier that it looked wrong, but actually I was wrong to say that). Usually tutorials don't consider that $y$ is a "constant", but that this is a *partial* derivative, where we only care about the gradient w.r.t. $\hat{y}$. The result is much the same, but either way it may help to show a step of expansion – Neil Slater May 31 '19 at 16:09