For questions related to stochastic gradient descent (SGD), which is gradient descent that uses stochastic (or noisy) estimates of the gradient.
Questions tagged [stochastic-gradient-descent]
33 questions
23
votes
3 answers
How do I choose the optimal batch size?
Batch size is a term used in machine learning and refers to the number of training examples utilised in one iteration. The batch size
can be one of three options:
batch mode: where the batch size is equal to the total dataset thus making the…

Sebastian Nielsen · 363 · 1 · 2 · 10
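A minimal sketch of the three regimes the excerpt alludes to (batch, mini-batch, and stochastic), written as a single training-epoch helper; `grad_fn`, the data arrays, and all parameter names here are illustrative, not taken from the question:

```python
import numpy as np

def gd_epoch(params, X, y, grad_fn, lr=0.01, batch_size=None):
    """One epoch of gradient descent.
    batch_size=None -> batch mode (whole dataset, one update per epoch),
    batch_size=1    -> stochastic mode (one update per example),
    otherwise       -> mini-batch mode."""
    n = len(X)
    batch_size = n if batch_size is None else batch_size
    order = np.random.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        # grad_fn is assumed to return the average gradient over the chosen examples
        params = params - lr * grad_fn(params, X[idx], y[idx])
    return params
```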
10
votes
2 answers
How do I handle negative rewards in policy gradients with the cross-entropy loss function?
I am using policy gradients in my reinforcement learning algorithm, and occasionally my environment provides a severe penalty (i.e. negative reward) when a wrong move is made. I'm using a neural network with stochastic gradient descent to learn the…

jstaker7 · 209 · 1 · 2 · 5
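For context, a minimal REINFORCE-style surrogate loss (a sketch, not the asker's network or loss function) shows that a negative reward simply flips the sign of the log-probability term, so the update pushes the penalised action's probability down without any special handling:

```python
import numpy as np

def policy_gradient_loss(log_probs, rewards):
    """Surrogate loss -mean(reward * log pi(a|s)); minimising it performs
    the policy-gradient update. A negative reward just reverses the sign."""
    return -np.mean(np.asarray(rewards) * np.asarray(log_probs))

# A severe penalty (-10) on the second action enlarges that action's loss term,
# so gradient descent lowers its probability.
loss = policy_gradient_loss(log_probs=[-0.2, -1.5], rewards=[1.0, -10.0])
```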
10
votes
1 answer
What is the relationship between gradient accumulation and batch size?
I am currently training some models using gradient accumulation since the model batches do not fit in GPU memory. Since I am using gradient accumulation, I had to tweak the training configuration a bit. There are two parameters that I tweaked: the…

JVGD · 1,088 · 1 · 6 · 14
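A small self-contained sketch of gradient accumulation with PyTorch (toy model and data, not the asker's configuration): gradients from several micro-batches are summed before a single optimiser step, so the effective batch size is `micro_batch * accum_steps`:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
X, y = torch.randn(64, 10), torch.randn(64, 1)

micro_batch, accum_steps = 8, 4              # behaves like a batch of 32
optimizer.zero_grad()
for step, start in enumerate(range(0, len(X), micro_batch)):
    xb, yb = X[start:start + micro_batch], y[start:start + micro_batch]
    loss = loss_fn(model(xb), yb) / accum_steps  # rescale so the sum is an average
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per accumulated batch
        optimizer.zero_grad()
```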
10
votes
2 answers
Is neural network training done one-by-one?
I'm trying to learn neural networks by watching this series of videos and implementing a simple neural network in Python.
Here's one of the things I'm wondering about: I'm training the neural network on sample data, and I've got 1,000 samples. The…

Ram Rachum · 261 · 1 · 9
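A quick back-of-the-envelope sketch for the 1,000-sample case (the numbers are illustrative): every sample is seen either way, but the number of weight updates per epoch depends on the batch size:

```python
n_samples = 1000
for batch_size in (1, 10, n_samples):          # one-by-one, mini-batch, full batch
    updates_per_epoch = n_samples // batch_size
    print(f"batch_size={batch_size:>4}: {updates_per_epoch} updates per epoch")
```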
9
votes
2 answers
What exactly is averaged when doing batch gradient descent?
I have a question about how the averaging works when doing mini-batch gradient descent.
I think I now understood the general gradient descent algorithm, but only for online learning. When doing mini-batch gradient descent, do I have to:
forward…

Ben · 425 · 3 · 10
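A toy sketch of the averaging step (a linear model with squared error, not the asker's network): each example in the mini-batch contributes one gradient, those gradients are averaged, and only then is a single weight update applied:

```python
import numpy as np

def per_example_grad(w, x, y):
    """Gradient of 0.5 * (w·x - y)^2 with respect to w, for one example."""
    return (w @ x - y) * x

rng = np.random.default_rng(0)
w = np.zeros(3)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)    # one mini-batch of 8 examples
grads = np.stack([per_example_grad(w, xi, yi) for xi, yi in zip(X, y)])
w -= 0.1 * grads.mean(axis=0)                          # average the gradients, update once
```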
9
votes
1 answer
Is back-propagation applied for each data point or for a batch of data points?
I am new to deep learning and trying to understand the concept of back-propagation. I am unsure about when back-propagation is applied. Assume that I have a training data set of 1000 images of handwritten letters.
Is back-propagation…

Maanu · 235 · 2 · 6
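In the usual mini-batch convention (assumed here, since the excerpt is cut off), back-propagation produces a per-example gradient, but the weights are updated once per batch with the average:

$$
w \leftarrow w - \frac{\eta}{|B|} \sum_{i \in B} \nabla_w L(x_i, y_i; w),
$$

so with 1000 images and a batch size of 100 there are 10 weight updates per epoch, even though every image passes through back-propagation.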
8
votes
1 answer
Why is the learning rate generally beneath 1?
In all examples I've ever seen, the learning rate of an optimisation method is always less than $1$. However, I've never found an explanation as to why this is. In addition to that, there are some cases where having a learning rate bigger than 1 is…

Recessive · 1,346 · 8 · 21
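A one-dimensional example (illustrative, not from the question) shows where the "less than $1$" rule of thumb comes from. For a quadratic loss with curvature $L$,

$$
f(x) = \tfrac{L}{2}x^{2}, \qquad x_{t+1} = x_t - \alpha \nabla f(x_t) = (1 - \alpha L)\,x_t,
$$

the iterates shrink only when $|1 - \alpha L| < 1$, i.e. $\alpha < 2/L$; for sharply curved losses (large $L$) that bound sits well below $1$, while for flat losses a learning rate above $1$ can still converge.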
4
votes
0 answers
How does SGD escape local minima?
SGD is able to jump out of local minima that would otherwise trap BGD
I don't really understand the above statement. Could someone please provide a mathematical explanation for why SGD (Stochastic Gradient Descent) is able to escape local minima,…

stoic-santiago · 1,121 · 5 · 18
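One common way to write the intuition (a sketch of the standard argument, not tied to any particular source): the mini-batch gradient is the full-batch gradient plus zero-mean noise,

$$
x_{t+1} = x_t - \alpha\big(\nabla F(x_t) + \xi_t\big), \qquad \mathbb{E}[\xi_t] = 0,
$$

so at a local minimum of $F$, where $\nabla F(x_t) = 0$ and batch gradient descent stops, the noise term $\xi_t$ is generally non-zero and can carry the iterate over shallow barriers.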
3
votes
1 answer
How are these equations of SGD with momentum equivalent?
I know this question may be silly, but I cannot prove it.
In the Stanford slides (page 17), they define the formula of SGD with momentum like this:
$$
v_{t}=\rho v_{t-1}+\nabla f(x_{t-1})
\\
x_{t}=x_{t-1}-\alpha v_{t},
$$
where:
$v_{t}$ is the…

CuCaRot · 892 · 3 · 15
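For reference, the other widely used form absorbs the learning rate into the velocity (assuming that is the variant being compared, since the excerpt is cut off):

$$
\tilde v_{t} = \rho\,\tilde v_{t-1} + \alpha \nabla f(x_{t-1}), \qquad x_{t} = x_{t-1} - \tilde v_{t},
$$

which reproduces the slide's updates under the substitution $\tilde v_t = \alpha v_t$, provided $\alpha$ is held constant.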
3
votes
1 answer
Should we also shuffle the test dataset when training with SGD?
When training machine learning models (e.g. neural networks) with stochastic gradient descent, it is common practice to (uniformly) shuffle the training data into batches/sets of different samples from different classes. Should we also shuffle the…

SpiderRico · 960 · 8 · 18
3
votes
2 answers
What is the difference between batch and mini-batch gradient descent?
I am learning deep learning from Andrew Ng's tutorial Mini-batch Gradient Descent.
Can anyone explain the similarities and dissimilarities between batch GD and mini-batch GD?

DRV · 1,573 · 2 · 11 · 18
2
votes
2 answers
What's the rationale behind mini-batch gradient descent?
I am reading a book that states
As the mini-batch size increases, the gradient computed is closer to the 'true' gradient
So, I assume that they are saying that mini-batch training only focuses on decreasing the cost function in a certain 'plane',…

ngc1300 · 133 · 5
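A small numerical sketch of the quoted claim (toy linear-regression data, illustrative only): mini-batch gradients are unbiased estimates of the full-batch gradient, and their spread around it shrinks as the batch size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
X, y = rng.normal(size=(10_000, 3)), rng.normal(size=10_000)

def batch_grad(idx):
    """Average squared-error gradient over the rows selected by idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

full = batch_grad(np.arange(len(X)))            # the "true" full-batch gradient
for m in (1, 10, 100, 1000):
    estimates = [batch_grad(rng.choice(len(X), m, replace=False)) for _ in range(200)]
    spread = np.mean([np.linalg.norm(g - full) for g in estimates])
    print(f"batch size {m:>5}: mean distance to full gradient = {spread:.3f}")
```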
2
votes
1 answer
Is there any way to train a neural network without using gradients?
The only algorithm I know for updating the weights of a neural network is based on gradients. The update equation can be roughly written as
$$w \leftarrow w - \nabla_{w}L$$
where $\nabla_{w}L$ is the gradient of loss function with respect to…

hanugm · 3,571 · 3 · 18 · 50
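One gradient-free family that answers the title, sketched minimally (simple random search over the weight vector; `loss_fn` and all names are illustrative, and this is just one alternative among evolutionary and other zeroth-order methods):

```python
import numpy as np

def random_search_step(w, loss_fn, sigma=0.1, n_candidates=20, rng=None):
    """Try noisy copies of w and keep the best one; no gradient of loss_fn is needed."""
    rng = rng or np.random.default_rng()
    candidates = [w + sigma * rng.normal(size=w.shape) for _ in range(n_candidates)]
    best = min(candidates, key=loss_fn)
    return best if loss_fn(best) < loss_fn(w) else w

# Toy usage: minimise a quadratic "loss" over a 5-dimensional weight vector.
w = np.zeros(5)
for _ in range(100):
    w = random_search_step(w, loss_fn=lambda v: float(np.sum((v - 3.0) ** 2)))
```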
2
votes
1 answer
Why is it called "batch" gradient descent if it consumes the full dataset before calculating the gradient?
While training a neural network, we can follow three methods: batch gradient descent, mini-batch gradient descent and stochastic gradient descent.
For this question, assume that your dataset has $n$ training samples and we divided it into $k$…

hanugm · 3,571 · 3 · 18 · 50
2
votes
0 answers
Methodologies for passing the best samples for a neural network to learn
Just an idea I am sure I read in a book some time ago, but I can't remember the name.
Given a very large dataset and a neural network (or anything that can learn via something like stochastic gradient descent, passing a subset of samples to modify…

user4052054 · 121 · 1