
I understand both terms, linear regression and maximum likelihood, but, when it comes to the math, I am totally lost. So I am reading the article The Principle of Maximum Likelihood (by Suriyadeepan Ramamoorthy). It is really well written, but, as I said, I don't get the math.

The joint probability distribution of $y, \theta, \sigma$ is given by (assuming $y$ is normally distributed):

$$p(y \mid \theta, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)$$

This is equivalent to maximizing the log-likelihood:

$$l(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y - X\theta)^T(Y - X\theta)$$

The maximum can then be found by setting the derivative of $l(\theta)$ to zero:

$$\frac{dl(\theta)}{d\theta} = 0 = -\frac{1}{2\sigma^2}(0 - 2X^TY + X^TX\theta)$$

I get everything up to this point, but I don't understand how the following equation follows from the previous one:

$$\hat{\theta} = (X^TX)^{-1}X^TY$$
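To check that I at least have the final formula right, here is a small numpy sketch (my own, not from the article) that applies it to made-up data and compares it with numpy's least-squares solver:

```python
import numpy as np

# made-up toy data: 5 observations, 2 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = rng.normal(size=5)

# the final equation: theta_hat = (X^T X)^{-1} X^T Y
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# cross-check against numpy's built-in least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(theta_hat, theta_lstsq))  # True
```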

  • This kind of question goes well on Maths Stack Exchange. – it's a hire car baby Nov 27 '18 at 12:02
  • 1
    @RobertFrost I think this type of questions can be asked here, given the topic (maximum likelihood and linear regression). AI is still a mathematical-based field, so this type of questions are normal in the AI field too. Honestly, I would like to see more questions of this kind here. – nbro Nov 27 '18 at 12:16
  • @nbro me too, I was just saying in case you wanted more. – it's a hire car baby Nov 27 '18 at 15:24

1 Answer


Note first that the first $=$ (equals) in $\frac{dl(\theta)}{d\theta} = 0 = -\frac{1}{2\sigma^2}(0 - 2X^TY + X^TX \theta)$ should be read as "is set to": we set $\frac{dl(\theta)}{d\theta} = 0$. Given that (apparently) $\frac{dl(\theta)}{d\theta} = -\frac{1}{2\sigma^2}(0 - 2X^TY + X^TX \theta)$, the condition $\frac{dl(\theta)}{d\theta} = 0$ is equivalent to $0 = -\frac{1}{2\sigma^2}(0 - 2X^TY + X^TX \theta)$.

Now, let's apply some basic linear algebra:

\begin{align} 0 &= -\frac{1}{2\sigma^2}(0 - 2X^TY + X^TX \theta) \iff \\ 0 &= -(0 - 2X^TY + X^TX \theta) \iff \\ 0 &= -0 + 2X^TY - X^TX \theta \iff \\ 0 &= 2X^TY - X^TX \theta \iff \\ X^TX \theta &= 2X^TY \iff \\ (X^TX)^{-1}(X^TX) \theta &= (X^TX)^{-1}2X^TY \iff \\ \theta &= (X^TX)^{-1}2X^TY \end{align}

Now, the $2$ disappears. As discussed in the comments below, the original article most likely dropped a factor of $2$ when differentiating $(Y - X\theta)^T(Y - X\theta)$: carried through correctly, every addend of the derivative has a factor of $2$, which cancels with the $2$ in the denominator, so we are left with $\hat{\theta} = (X^TX)^{-1}X^TY$.
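As a quick sanity check of that cancellation, here is a sympy sketch (my own addition, for the one-dimensional case, not part of the original derivation):

```python
import sympy as sp

theta, sigma, x, y = sp.symbols('theta sigma x y')

# log-likelihood contribution of a single data point;
# the constant term is dropped since it does not depend on theta
l = -(y - x * theta)**2 / (2 * sigma**2)

# the chain rule brings down a factor of 2 that cancels the 2 in the denominator
dl = sp.simplify(sp.diff(l, theta))
print(dl)  # x*(y - theta*x)/sigma**2 -- no factor of 2 left
```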

Note that using $\hat{\theta}$ instead of $\theta$ is just to indicate that what we get is an "estimate" of the real $\theta$: it is computed from a finite, noisy sample, so in general it will not coincide exactly with the underlying parameter.
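To illustrate this, here is a short numpy sketch (my own, with made-up numbers): we simulate data from a known $\theta$, and the closed-form $\hat{\theta}$ comes out close to, but not exactly equal to, the true value.

```python
import numpy as np

rng = np.random.default_rng(42)
theta_true = np.array([2.0, -1.0])  # the "real" theta we pretend to know

# simulate y = X @ theta_true + Gaussian noise, as the model assumes
X = rng.normal(size=(200, 2))
Y = X @ theta_true + rng.normal(scale=0.5, size=200)

# closed-form maximum-likelihood estimate
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
print(theta_true)  # [ 2. -1.]
print(theta_hat)   # close to theta_true, but not exactly equal
```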

  • I think it would be better to clarify under what circumstances, and why, constants can be ignored. At face value, it looks like doing so will halve the value of theta. – it's a hire car baby Nov 27 '18 at 03:26
  • 1
    @RobertFrost I think there's a wrong derivation in the original article and that 2 would not even occur in the calculations above anymore, because it would cancel out with the 2 in the denominator. Essentially, $(Y - X\theta)^T(Y - X\theta)$ would have a term $X^TX\theta^2$ and the partial derivative of $l$ with respect to $\theta$ would produce the term $2X^TX\theta$ (but the original author of the article left out the 2 in front). In that case, after the partial derivation, all addends would have a $2$ in front, which would cancel out with the 2 at the denominator. – nbro Nov 27 '18 at 10:32
  • Maybe that's what confused the OP. – it's a hire car baby Nov 27 '18 at 12:01
  • @RobertFrost I think what confused the OP was that the "$= 0$" is actually "setting to $0$", but only the OP can tell us. Maybe more than one thing confused him/her. – nbro Nov 27 '18 at 12:14
  • 2
    Well , in fact, I was confused about the disappearance of 2 in the last equation. thank you very much for the detailed answer! – xava Nov 27 '18 at 13:25