For questions about mini-batch (or batch) gradient descent, i.e. gradient descent in which each parameter update is computed from more than one input-label pair at a time.
Questions tagged [mini-batch-gradient-descent]
22 questions
10
votes
2 answers
Is neural network training done one-by-one?
I'm trying to learn neural networks by watching this series of videos and implementing a simple neural network in Python.
Here's one of the things I'm wondering about: I'm training the neural network on sample data, and I've got 1,000 samples. The…

Ram Rachum
- 261
- 1
- 9
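
Whether the weights are updated one sample at a time or once per group of samples is easiest to see in a toy training loop. Below is a minimal NumPy sketch contrasting the two schedules; the linear model, learning rate, and data are illustrative assumptions, not details from the question above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # 1,000 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])      # targets from a known linear rule

def grad(w, xb, yb):
    """Gradient of the mean squared error over the rows in xb."""
    return 2 * xb.T @ (xb @ w - yb) / len(xb)

lr = 0.1

# One-by-one (online / stochastic): one update per sample.
w = np.zeros(3)
for i in range(len(X)):
    w -= lr * grad(w, X[i:i+1], y[i:i+1])

# Mini-batch: one update per group of 32 samples.
w_mb = np.zeros(3)
for start in range(0, len(X), 32):
    xb, yb = X[start:start+32], y[start:start+32]
    w_mb -= lr * grad(w_mb, xb, yb)

print(w, w_mb)   # both approach [1.0, -2.0, 0.5]
```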
9
votes
2 answers
What exactly is averaged when doing batch gradient descent?
I have a question about how the averaging works when doing mini-batch gradient descent.
I think I now understand the general gradient descent algorithm, but only for online learning. When doing mini-batch gradient descent, do I have to:
forward…

Ben
- 425
- 3
- 10
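
What gets averaged is the per-sample gradients; equivalently, one can take the gradient of the averaged loss, and by linearity the two are the same. A small NumPy check with an illustrative squared-error model (the shapes and data are assumptions, not taken from the question):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))              # a mini-batch of 8 samples
y = rng.normal(size=8)
w = rng.normal(size=4)

def sample_grad(w, x, t):
    """Gradient of the squared error (x.w - t)^2 for a single sample."""
    return 2 * (x @ w - t) * x

# Option 1: average the per-sample gradients.
g_avg = np.mean([sample_grad(w, X[i], y[i]) for i in range(len(X))], axis=0)

# Option 2: take the gradient of the mean loss directly.
g_mean_loss = 2 * X.T @ (X @ w - y) / len(X)

print(np.allclose(g_avg, g_mean_loss))   # True: the two are identical
```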
9
votes
1 answer
Is back-propagation applied for each data point or for a batch of data points?
I am new to deep learning and am trying to understand the concept of back-propagation. I have a question about when back-propagation is applied. Assume that I have a training data set of 1,000 images of handwritten letters.
Is back-propagation…

Maanu
- 235
- 2
- 6
3
votes
1 answer
When using experience replay, do we update the parameters for all samples of the mini-batch or for each sample in the mini-batch separately?
I've been reading Google's DeepMind Atari paper and I'm trying to understand how to implement experience replay.
Do we update the parameters $\theta$ of function $Q$ once for all the samples of the minibatch, or do we do that for each sample of the…

user491626
- 241
- 1
- 4
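
In the DQN setting the usual reading is a single update of $\theta$ per sampled mini-batch: compute the TD targets for every transition in the batch, form one (mean) loss, and take one gradient step. The sketch below uses a linear Q-function and a hand-written gradient purely for illustration; the replay buffer contents, shapes, and discount factor are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
state_dim, n_actions, gamma, lr = 4, 2, 0.99, 0.01

W = rng.normal(size=(n_actions, state_dim)) * 0.1   # online Q-network (linear, for illustration)
W_target = W.copy()                                 # target network

# A toy replay buffer of (state, action, reward, next_state, done) tuples.
buffer = [(rng.normal(size=state_dim), rng.integers(n_actions),
           rng.normal(), rng.normal(size=state_dim), False)
          for _ in range(500)]

# Sample one mini-batch and perform ONE update of W for the whole batch.
idx = rng.choice(len(buffer), size=32, replace=False)
grad = np.zeros_like(W)
for j in idx:
    s, a, r, s_next, done = buffer[j]
    target = r if done else r + gamma * np.max(W_target @ s_next)
    td_error = W[a] @ s - target                    # Q(s, a; W) - y_j
    grad[a] += 2 * td_error * s                     # d/dW[a] of the squared TD error
grad /= len(idx)                                    # mean over the mini-batch
W -= lr * grad                                      # single parameter update for all samples
```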
3
votes
2 answers
What is the difference between batch and mini-batch gradient descent?
I am learning deep learning from Andrew Ng's tutorial Mini-batch Gradient Descent.
Can anyone explain the similarities and differences between batch GD and mini-batch GD?

DRV
- 1,573
- 2
- 11
- 18
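
One way to see the relationship: both follow the same update rule and differ only in how much data feeds each update. Batch GD uses the whole training set per update (one update per epoch), mini-batch GD uses a small slice (many updates per epoch), and batch GD is simply mini-batch GD with the batch size set to the dataset size. A self-contained NumPy sketch under those assumptions:

```python
import numpy as np

def run_epoch(w, X, y, lr, batch_size):
    """One pass over the data; one gradient update per batch of `batch_size` rows."""
    n_updates = 0
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        w = w - lr * 2 * xb.T @ (xb @ w - yb) / len(xb)   # mean-squared-error gradient
        n_updates += 1
    return w, n_updates

rng = np.random.default_rng(3)
X, y = rng.normal(size=(1000, 3)), rng.normal(size=1000)
w0 = np.zeros(3)

_, k = run_epoch(w0, X, y, 0.01, batch_size=32)        # mini-batch GD: many updates per epoch
_, one = run_epoch(w0, X, y, 0.01, batch_size=len(X))  # batch GD: exactly 1 update per epoch
print(k, one)   # 32, 1
```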
2
votes
2 answers
What's the rationale behind mini-batch gradient descent?
I am reading a book that states
As the mini-batch size increases, the gradient computed is closer to the 'true' gradient
So, I assume that they are saying that mini-batch training only focuses on decreasing the cost function in a certain 'plane',…

ngc1300
- 133
- 5
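
The book's claim can be checked numerically: the mini-batch gradient is an unbiased estimate of the full-dataset ("true") gradient, and its spread around that gradient shrinks as the batch size grows. A small NumPy experiment with an illustrative least-squares objective (none of the numbers come from the book):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10_000)
w = np.zeros(5)

def mse_grad(xb, yb, w):
    return 2 * xb.T @ (xb @ w - yb) / len(xb)

true_grad = mse_grad(X, y, w)                     # gradient over the full dataset

for batch_size in (1, 10, 100, 1000):
    errs = []
    for _ in range(200):                          # 200 random mini-batches per size
        idx = rng.choice(len(X), size=batch_size, replace=False)
        errs.append(np.linalg.norm(mse_grad(X[idx], y[idx], w) - true_grad))
    print(batch_size, np.mean(errs))              # error shrinks roughly like 1/sqrt(batch_size)
```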
2
votes
1 answer
Why is it called "batch" gradient descent if it consumes the full dataset before calculating the gradient?
While training a neural network, we can follow three methods: batch gradient descent, mini-batch gradient descent and stochastic gradient descent.
For this question, assume that your dataset has $n$ training samples and we divided it into $k$…

hanugm
- 3,571
- 3
- 18
- 50
2
votes
1 answer
When is the loss calculated, and when does the back-propagation take place?
I read different articles and keep getting confused about this point. I am not sure whether the literature is giving mixed information or I am interpreting it incorrectly.
So from reading articles my understanding (loosely) for the following terms are as…

Hazzaldo
- 279
- 2
- 9
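
For the standard mini-batch setting, the order inside the training loop is: forward pass on the batch, compute one (mean) loss for that batch, back-propagate that single loss, then update the parameters. The runnable NumPy sketch below uses a single linear layer purely for illustration; the data, shapes, and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
X, y = rng.normal(size=(256, 4)), rng.normal(size=(256, 1))
W = np.zeros((4, 1))                               # a single linear layer, for illustration
lr, batch_size = 0.05, 32

for epoch in range(5):
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        preds = xb @ W                              # 1. forward pass on the whole mini-batch
        loss = np.mean((preds - yb) ** 2)           # 2. one scalar loss for the batch
        grad_W = 2 * xb.T @ (preds - yb) / len(xb)  # 3. back-propagation of that single loss
        W -= lr * grad_W                            # 4. one parameter update per mini-batch
```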
1
vote
1 answer
What is the order of execution of steps in back-propagation algorithm in a neural network?
I am a machine learning newbie. I am trying to understand the back-propagation algorithm. I have a training dataset of 60 instances/records.
What is the correct order of the process? This one?
Forward pass of the first instance. Calculate the…

gokul
- 53
- 4
1
vote
0 answers
Why would one prefer the gradient of the sum rather than the sum of the gradients?
When gradients are aggregated over mini-batches, I sometimes see formulations like this, e.g., in the "Deep Learning" book by Goodfellow et al.
$$\mathbf{g} = \frac{1}{m} \nabla_{\mathbf{w}} \left( \sum\limits_{i=1}^{m} L \left( f \left(…

Eddie C
- 11
- 1
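
The two formulations agree because the gradient is linear: differentiating the (scaled) sum of the per-sample losses gives exactly the average of the per-sample gradients, so the choice between them is one of implementation (accumulate the loss and differentiate once, or differentiate each term and accumulate) rather than of result. Under that reading, and writing $L_i(\mathbf{w})$ as shorthand for the loss on the $i$-th sample (an assumption about the notation, not a quote from the book):
$$\mathbf{g} \;=\; \frac{1}{m} \nabla_{\mathbf{w}} \sum_{i=1}^{m} L_i(\mathbf{w}) \;=\; \frac{1}{m} \sum_{i=1}^{m} \nabla_{\mathbf{w}} L_i(\mathbf{w}).$$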
1
vote
1 answer
Is it possible to use stochastic gradient descent at the beginning, then switch to batch gradient descent with only a few training examples?
Batch gradient descent is extremely slow for large datasets, but it can find the lowest possible value for the cost function. Stochastic gradient descent is relatively fast, but it kind of finds the general area where convergence happens and it kind…

Robo
- 121
- 3
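
Mechanically, such a switch is just a change of batch size partway through training: batch size 1 for the early epochs, then the full dataset for the final updates. A minimal NumPy sketch of that schedule on an illustrative least-squares objective (whether the switch is actually worthwhile is exactly what the question asks):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5])
w, lr = np.zeros(3), 0.05

def step(w, xb, yb):
    return w - lr * 2 * xb.T @ (xb @ w - yb) / len(xb)

# Phase 1: stochastic GD (batch size 1) to get near the minimum quickly.
for epoch in range(3):
    for i in rng.permutation(len(X)):
        w = step(w, X[i:i+1], y[i:i+1])

# Phase 2: batch GD (the full dataset per update) to refine the solution.
for _ in range(50):
    w = step(w, X, y)

print(w)   # close to [2.0, -1.0, 0.5]
```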
1
vote
2 answers
When would it make sense to perform a gradient descent step for each term of a loss function with multiple terms?
I am training a neural network using a mini-batch gradient descent algorithm.
Now, consider the following loss function, which is composed of 2 terms.
$$L = L_{\text{MSE}} + L_{\text{regularization}} \label{1}\tag{1}$$
As far as I understand,…

hanugm
- 3,571
- 3
- 18
- 50
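
Because the gradient of a sum is the sum of the gradients, a single step on the combined loss $L = L_{\text{MSE}} + L_{\text{regularization}}$ moves the weights by the sum of the two per-term gradients, whereas taking a separate step per term evaluates the second term's gradient at an already-updated point, so the two procedures are not identical (though often close for small learning rates). A small NumPy comparison on an illustrative ridge-regression loss, with assumed data and hyper-parameters:

```python
import numpy as np

rng = np.random.default_rng(7)
X, y = rng.normal(size=(64, 3)), rng.normal(size=64)
w0, lr, lam = rng.normal(size=3), 0.01, 0.1

mse_grad = lambda w: 2 * X.T @ (X @ w - y) / len(X)   # gradient of L_MSE
reg_grad = lambda w: 2 * lam * w                      # gradient of the L2 regularization term

# One step on the combined loss L = L_MSE + L_reg.
w_combined = w0 - lr * (mse_grad(w0) + reg_grad(w0))

# Two alternating steps, one per term: the second gradient is taken at a moved point.
w_tmp = w0 - lr * mse_grad(w0)
w_alternating = w_tmp - lr * reg_grad(w_tmp)

print(np.linalg.norm(w_combined - w_alternating))     # small but nonzero difference
```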
1
vote
1 answer
How many iterations of the optimisation algorithm are performed on each mini-batch in mini-batch gradient descent?
I understand the idea of mini-batch gradient descent for neural networks in that we calculate the gradient of the loss function using one mini-batch at a time and use this gradient to adjust the parameters.
My question is: how many times do we…

user50018
- 13
- 2
1
vote
2 answers
In mini-batch gradient descent, do we pass each input in the batch individually or all inputs at the same time through the layer?
In the stochastic gradient descent algorithm, the weight update happens for every training sample.
In the mini-batch gradient descent algorithm, the weight update happens for every batch of training samples.
In the batch gradient descent algorithm,…

hanugm
- 3,571
- 3
- 18
- 50
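
Mathematically the result is the same either way; in practice the whole mini-batch is usually stacked into one matrix and pushed through the layer in a single matrix multiplication, which is what makes mini-batching fast on vectorized hardware. A NumPy illustration with an assumed single dense layer (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(8)
batch, d_in, d_out = 32, 10, 5
X = rng.normal(size=(batch, d_in))            # one mini-batch of 32 inputs
W, b = rng.normal(size=(d_in, d_out)), rng.normal(size=d_out)

# Option 1: pass the samples through the layer one at a time.
out_loop = np.stack([X[i] @ W + b for i in range(batch)])

# Option 2: pass the whole mini-batch at once as a single matrix product.
out_batched = X @ W + b                       # shape (32, 5)

print(np.allclose(out_loop, out_batched))     # True: same activations, far fewer kernel calls
```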
1
vote
1 answer
Why does my model not improve when training with mini-batch gradient descent, while it does with Adam?
I am currently experimenting with the U-Net. I am doing semantic segmentation on the 2018 Data Science Bowl dataset from Kaggle without any data augmentation.
In my experiments, I am trying different hyper-parameters, like using Adam, mini-batch GD…

Bert Gayus
- 545
- 3
- 12
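
A common reason plain mini-batch GD stalls where Adam does not is the update rule itself: vanilla SGD scales the raw gradient by one global learning rate, while Adam rescales each parameter using running estimates of the gradient's first and second moments. The NumPy sketch below shows the two update rules side by side; the hyper-parameters are the commonly used defaults, not values from the experiment in the question.

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    """Plain mini-batch gradient descent: one global step size for every parameter."""
    return w - lr * grad

def adam_update(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from running first/second moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])        # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.zeros(3)
state = {"m": np.zeros(3), "v": np.zeros(3), "t": 0}
grad = np.array([1.0, 0.01, 100.0])                       # very differently scaled gradients
print(sgd_update(w, grad))                                # step proportional to the raw gradient
print(adam_update(w, grad, state))                        # roughly equal-sized steps per parameter
```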