
The gradient descent step is the following

\begin{align} \mathbf{W}_i = \mathbf{W}_{i-1} - \alpha \nabla L(\mathbf{W}_{i-1}) \end{align}

where $L(\mathbf{W}_{i-1})$ is the loss value, $\alpha$ is the learning rate, and $\nabla L(\mathbf{W}_{i-1})$ is the gradient of the loss.
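
For concreteness, here is a minimal NumPy sketch of this update step (my own illustration; the names `W`, `grad_L`, and `alpha` are just placeholders, and obtaining `grad_L` is exactly what I am asking about):

```python
import numpy as np

# Sketch of one gradient descent step: W_i = W_{i-1} - alpha * grad L(W_{i-1}).
# grad_L is assumed to be already computed somehow (this is my question).
def gradient_descent_step(W, grad_L, alpha=0.1):
    return W - alpha * grad_L
```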

So, how do we get to $L(\mathbf{W}_{i-1})$ in order to calculate the gradient of $L(\mathbf{W}_{i-1})$? As an example, suppose we initialize all the weights in $\mathbf{W}$ to 0.5. Can you explain how it works from there?

  • Are you asking how to calculate the gradient of $f$ or how to calculate the loss value itself? – nbro Dec 29 '19 at 18:39
  • Wow, thanks for your edit. Before I can calculate the gradient of $f(W)$, I need to have the function $f$. So, how can I get the function $f$ in the first place? – Mahdi Amrollahi Dec 29 '19 at 18:43

1 Answer


In your case, $L$ is the loss (or cost) function, which can be, for example, the mean squared error (MSE) or the cross-entropy, depending on the problem you want to solve. Given one training example $(\mathbf{x}_i, y_i) \in D$, where $\mathbf{x}_i \in \mathbb{R}^d$ is the input (for example, an image) and $y_i \in \mathbb{R}$ can either be a label (aka class) or a numerical value, and $D$ is your training dataset, then the MSE is defined as follows

$$L(\mathbf{W}) = \frac{1}{2} \left(f(\mathbf{x}_i) - y_i \right)^2,$$

where $f(\mathbf{x}_i) \in \mathbb{R}$ is the output of the neural network $f$ given the input $\mathbf{x}_i$.
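
As a minimal sketch (my illustration, not part of the original answer), assume the network is just a linear model $f(\mathbf{x}) = \mathbf{W} \mathbf{x}$; then the loss above for a single example can be computed as follows:

```python
import numpy as np

def f(W, x):
    # Output of the (illustrative) linear "network" for input x: a single number.
    return float(W @ x)

def loss(W, x, y):
    # L(W) = 1/2 * (f(x_i) - y_i)^2
    return 0.5 * (f(W, x) - y) ** 2

W = np.full(3, 0.5)              # all weights initialized to 0.5, as in the question
x_i = np.array([1.0, 2.0, 3.0])  # input vector (d = 3)
y_i = 2.0                        # target value
print(loss(W, x_i, y_i))         # squared-error loss for this single example
```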

If you have a mini-batch of $M$ training examples $\{(\mathbf{x}_i, y_i) \}_{i=1}^M$, then the loss will be an average of the MSE for each training example. For more info, have a look at this answer: https://ai.stackexchange.com/a/11675/2444. This related answer may also be useful: https://ai.stackexchange.com/a/8985/2444.
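
Under the same illustrative linear-model assumption, the mini-batch loss is just the average of the per-example losses:

```python
import numpy as np

def batch_loss(W, X, Y):
    # X has one example per row (shape (M, d)), Y holds the M targets.
    preds = X @ W                           # predictions for all M examples
    return 0.5 * np.mean((preds - Y) ** 2)  # average squared error over the batch

W = np.full(3, 0.5)
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.5]])
Y = np.array([2.0, 1.0])
print(batch_loss(W, X, Y))
```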

See the article Loss and Loss Functions for Training Deep Learning Neural Networks for more info regarding different losses used in deep learning and how to choose the appropriate loss for your problem.

  • Is it correct to say: $L(w) = \frac{1}{2}(f(x, w) - y)^2$? – Mahdi Amrollahi Dec 30 '19 at 05:54
  • I did not completely get my answer; however, your answer was very helpful. You said in this answer (https://ai.stackexchange.com/questions/11667/is-back-propagation-applied-for-each-data-point-or-for-a-batch-of-data-points/11675#11675): "For simplicity, assume that we are able to calculate the gradient of $L$, that is, $\nabla L$". So, how can I calculate the gradient of $L$ with respect to the set $\mathbf{W}$, given that I need to update my $\mathbf{W}$ set? I mean, do we have a function like $f(x) = 2x$, and then calculate $L(w) = \frac{1}{2}(2x - y)^2$ and $\nabla L = 2(2x - y)$? – Mahdi Amrollahi Dec 30 '19 at 06:50
  • $f(x, w)$ or $f(x)$ are "just" notations, in the sense that $w$ is clearly a parameter (or weight) of the neural network $f$, but $w$ is not an input to the neural network: $x$ is the input! The way you calculate the gradient of $L$ with respect to any of the parameters is with **back-propagation**, and it depends on the architecture of the neural network and the loss function you choose. To understand back-propagation, you need to understand the basics of calculus (see the sketch after these comments). – nbro Dec 30 '19 at 12:12
  • Thanks and I got the point : https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931 – Mahdi Amrollahi Dec 30 '19 at 13:53
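
To illustrate the point made in the comments above (a sketch of my own, not from the answer): for the toy model $f(x) = wx$ with $L(w) = \frac{1}{2}(wx - y)^2$, the chain rule gives $\frac{\partial L}{\partial w} = (wx - y)\,x$, i.e. the derivative is taken with respect to the parameter $w$, not the input $x$. Back-propagation generalizes this to networks with many layers.

```python
# Toy model f(x) = w * x with loss L(w) = 1/2 * (w*x - y)**2.
# By the chain rule, dL/dw = (w*x - y) * x  (derivative w.r.t. the parameter w).
def grad_L(w, x, y):
    return (w * x - y) * x

# One gradient descent step on this toy example.
w, x, y, alpha = 0.5, 2.0, 2.0, 0.1
w = w - alpha * grad_L(w, x, y)
print(w)  # 0.5 - 0.1 * (0.5*2 - 2)*2 = 0.7
```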