Questions tagged [gradient-descent]

For questions surrounding gradient descent, an iterative method for finding the optimal parameters of a function with respect to another function, often called the loss or error function. It descends the loss surface toward a minimum by repeatedly adjusting the parameters in the direction opposite the gradient (the vector of partial derivatives), scaled by a learning rate.

The loss function is sometimes called an error function. Its negation is sometimes called a fitness, utility, or value function.

The intention of each iteration is to decrease the value of the loss function. This is done by computing the gradient of the loss with respect to the parameters and using it to determine an incremental parameter change likely to reduce the loss. Gradient descent is often used in conjunction with backpropagation, which distributes the corrective signal across a sequence of layers, each of which is parameterized.

To avoid overshooting the optimum, which leads to oscillation or divergence, the corrective signal is attenuated by a factor called the learning rate. Too low a learning rate, however, slows convergence.
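The update rule and the effect of the learning rate can be sketched in a few lines of Python. The quadratic loss $L(w) = (w - 3)^2$ and the learning-rate values below are illustrative choices, not part of any particular question on this page:

```python
# Minimal sketch of gradient descent on a 1-D quadratic loss
# L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3).
# The target value 3 and the learning rates are arbitrary illustrations.

def gradient_descent(lr, steps=100, w=0.0):
    for _ in range(steps):
        g = 2 * (w - 3)   # gradient of the loss at the current w
        w -= lr * g       # update step, attenuated by the learning rate
    return w

print(gradient_descent(lr=0.1))   # small learning rate: converges toward w = 3
print(gradient_descent(lr=1.1))   # too large: each step overshoots and diverges
```

With `lr=0.1` each step contracts the distance to the minimum by a constant factor, while with `lr=1.1` each step amplifies it, which is the oscillation/divergence behavior described above.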

Several strategies exist to choose the loss function, tune the hyperparameters governing the back-propagated corrective signal, or integrate other search strategies, in order to improve reliability, speed, or accuracy.
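One such strategy, classical momentum, accumulates an exponentially decaying average of past gradients to damp oscillation. The sketch below reuses an illustrative quadratic loss $L(w) = (w - 3)^2$; the hyperparameter values are assumptions for demonstration only:

```python
# Sketch of gradient descent with classical momentum on L(w) = (w - 3)^2.
# beta controls how strongly past gradients persist in the velocity term.

def momentum_descent(lr=0.1, beta=0.9, steps=300, w=0.0):
    v = 0.0
    for _ in range(steps):
        g = 2 * (w - 3)    # gradient of the quadratic loss
        v = beta * v + g   # velocity: decaying accumulation of gradients
        w -= lr * v        # the step uses the velocity, not the raw gradient
    return w
```

Setting `beta=0` recovers plain gradient descent; larger values let the optimizer build up speed along consistently downhill directions.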

196 questions
14
votes
2 answers

Is the mean-squared error always convex in the context of neural networks?

Multiple resources I referred to mention that MSE is great because it's convex. But I don't get how, especially in the context of neural networks. Let's say we have the following: $X$: training dataset $Y$: targets $\Theta$: the set of parameters…
10
votes
1 answer

Can non-differentiable layer be used in a neural network, if it's not learned?

For example, AFAIK, the pooling layer in a CNN is not differentiable, but it can be used because it's not learning. Is it always true?
10
votes
1 answer

What is the relationship between gradient accumulation and batch size?

I am currently training some models using gradient accumulation since the model batches do not fit in GPU memory. Since I am using gradient accumulation, I had to tweak the training configuration a bit. There are two parameters that I tweaked: the…
10
votes
2 answers

Is neural networks training done one-by-one?

I'm trying to learn neural networks by watching this series of videos and implementing a simple neural network in Python. Here's one of the things I'm wondering about: I'm training the neural network on sample data, and I've got 1,000 samples. The…
9
votes
2 answers

Is there an ideal range of learning rate which always gives a good result almost in all problems?

I once read somewhere that there is a range of learning rate within which learning is optimal in almost all the cases, but I can't find any literature about it. All I could get is the following graph from the paper: The need for small learning rates…
9
votes
2 answers

What exactly is averaged when doing batch gradient descent?

I have a question about how the averaging works when doing mini-batch gradient descent. I think I now understood the general gradient descent algorithm, but only for online learning. When doing mini-batch gradient descent, do I have to: forward…
9
votes
1 answer

What is the formula for the momentum and Adam optimisers?

In the gradient descent algorithm, the formula to update the weight $w$, which has $g$ as the partial gradient of the loss function with respect to it, is: $$w\ -= r \times g$$ where $r$ is the learning rate. What should be the formula for momentum…
Dee
9
votes
3 answers

How is it possible that the MSE used to train neural networks with gradient descent has multiple local minima?

We often train neural networks by optimizing the mean squared error (MSE), which is an equation of a parabola $y=x^2$, with gradient descent. We also say that weight adjustment in a neural network by the gradient descent algorithm can hit a local…
9
votes
1 answer

Is back-propagation applied for each data point or for a batch of data points?

I am new to deep learning and trying to understand the concept of back-propagation. I have a doubt about when the back-propagation is applied. Assume that I have a training data set of 1000 images for handwritten letters, Is back-propagation…
8
votes
2 answers

Why is the perceptron criterion function differentiable?

I'm reading chapter one of the book called Neural Networks and Deep Learning from Aggarwal. In section 1.2.1.1 of the book, I'm learning about the perceptron. One thing that book says is, if we use the sign function for the following loss function:…
8
votes
1 answer

Why is the learning rate generally beneath 1?

In all examples I've ever seen, the learning rate of an optimisation method is always less than $1$. However, I've never found an explanation as to why this is. In addition to that, there are some cases where having a learning rate bigger than 1 is…
7
votes
4 answers

Can the mean squared error be negative?

I'm new to machine learning. I was watching a Prof. Andrew Ng's video about gradient descent from the machine learning online course. It said that we want our cost function (in this case, the mean squared error) to have the minimum value, but that…
7
votes
3 answers

How to compute the derivative of the error with respect to the input of a convolutional layer when the stride is bigger than 1?

I read that to compute the derivative of the error with respect to the input of a convolution layer is the same to make of a convolution between deltas of the next layer and the weight matrix rotated by $180°$, i.e. something…
7
votes
1 answer

How is the gradient calculated for the middle layer's weights?

I am trying to understand backpropagation. I used a simple neural network with one input $x$, one hidden layer $h$ and one output layer $y$, with weight $w_1$ connecting $x$ to $h$, and $w_2$ connecting $h$ to $y$ $$ x \rightarrow (w_1) \rightarrow…
Eka
7
votes
2 answers

Why is gradient descent used over the conjugate gradient method?

Based on some preliminary research, the conjugate gradient method is almost exactly the same as gradient descent, except the search direction must be orthogonal to the previous step. From what I've read, the idea tends to be that the conjugate…
Recessive