
I've read different articles and keep getting confused on this point. I'm not sure whether the literature is giving mixed information or I'm interpreting it incorrectly.

From reading those articles, my (loose) understanding of the following terms is as follows:

Epoch: One epoch is when the ENTIRE dataset is passed forward and backward through the neural network only ONCE.

Batch Size: The total number of training examples present in a single batch. In real-life scenarios of using neural nets, the dataset needs to be as large as possible for the network to learn well, so you can't pass the entire dataset through the neural net at once (due to computational power limitations). Instead, you divide the dataset into a number of batches.

Iterations: The number of batches needed to complete one epoch. If we divide a dataset of 2000 examples into batches of 500, it will take 4 iterations to complete 1 epoch.
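
In code terms (the variable names are just mine), that example is:

```python
dataset_size = 2000
batch_size = 500
iterations_per_epoch = dataset_size // batch_size   # 2000 / 500 = 4 iterations per epoch
```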

So, if all of that is correct, my question is: at what point do the loss/cost calculation and the subsequent backprop take place (my understanding being that backprop happens straight after the loss/cost is calculated)? Is the cost/loss function calculated:

  1. At the end of each batch, once the data samples in that batch have been forward-fed through the network (i.e. at each iteration, not each epoch)? If so, the loss/cost function gives the average of the losses of all the data samples in that batch, correct?

  2. At the end of each epoch? Meaning all the data samples of all the batches are forward-fed first, before a cost/loss function is calculated.

My understanding is that it's the first option, i.e. at the end of each batch passed to the network, hence at each iteration (not epoch), at least for SGD optimisation. As I understand it, the whole point is that you calculate the loss/cost and backprop for each batch, so that you're not calculating the average loss over the entire set of data samples. Otherwise you would get a single, "universal" minimum value on the cost graph, rather than lower-cost local minima from each batch you train on separately. Once all iterations have taken place, that counts as 1 epoch. But then I watched a YouTube video explaining neural nets which said that the cost/loss function is calculated at the end of each epoch, which confused me. Any clarification would be really appreciated.
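
To make my understanding concrete, here is a minimal toy example of what I think happens (plain NumPy linear regression; the data and numbers are made up, just mirroring the 2000/500 example above): the loss and the weight update happen once per batch, i.e. per iteration.

```python
import numpy as np

# Toy illustration of my understanding: loss + gradient step per BATCH (per iteration),
# not once per epoch. Plain linear regression with squared-error loss and made-up data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))                      # 2000 examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=2000)

w = np.zeros(3)
batch_size, lr, num_epochs = 500, 0.1, 5

for epoch in range(num_epochs):
    for start in range(0, len(X), batch_size):                # one iteration per batch
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        preds = xb @ w
        loss = np.mean((preds - yb) ** 2)                      # average loss over THIS batch only
        grad = 2 * xb.T @ (preds - yb) / len(xb)               # gradient of the batch loss w.r.t. w
        w -= lr * grad                                         # weight update per iteration, not per epoch
    print(f"epoch {epoch}: last batch loss = {loss:.4f}")
```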


1 Answer

Epoch: One epoch is when the ENTIRE dataset is passed forward and backward through the neural network only ONCE.

Batch Size: The total number of training examples present in a single batch. In real-life scenarios of using neural nets, the dataset needs to be as large as possible for the network to learn well, so you can't pass the entire dataset through the neural net at once (due to computational power limitations). Instead, you divide the dataset into a number of batches.

Iterations: The number of batches needed to complete one epoch. If we divide a dataset of 2000 examples into batches of 500, it will take 4 iterations to complete 1 epoch.

This is for the most part correct, except that there are other reasons you would sometimes want to use batches (even if you could fit the whole thing in memory). One is that it's less likely to overfit in a stochastic setting than in the full setting. Another is that it can achieve similar extrema with faster convergence.

Now, regarding your question: yes, you apply the gradient descent step at the end of each batch, or each desired batch (what I mean by "desired batch" is that if you want to use a batch size of 24 but your device can only process 8, you may use gradient accumulation over 3 pseudo-batches to achieve an emulated batch of 24).
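
To illustrate gradient accumulation, here is a minimal self-contained sketch (assuming PyTorch, which is not mentioned in the question; the model, data and numbers are made up):

```python
import torch
from torch import nn

# Sketch of gradient accumulation: the device handles mini-batches of 8, but we want
# the UPDATE to behave like a batch of 24, so we accumulate gradients over 3 small
# batches before calling optimizer.step().
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X, y = torch.randn(2000, 10), torch.randn(2000, 1)   # toy data
micro_batch, accumulation_steps = 8, 3               # 8 * 3 = emulated batch of 24

optimizer.zero_grad()
for i in range(0, len(X), micro_batch):
    xb, yb = X[i:i + micro_batch], y[i:i + micro_batch]
    loss = criterion(model(xb), yb) / accumulation_steps   # scale so the summed gradient matches a batch-of-24 average
    loss.backward()                                        # gradients accumulate in the .grad buffers
    if (i // micro_batch + 1) % accumulation_steps == 0:
        optimizer.step()                                   # one weight update per emulated batch of 24
        optimizer.zero_grad()
```

(In this sketch any leftover micro-batches at the end of the loop that don't fill a full group of 3 are simply never applied; a real implementation would handle that remainder.)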

Though I do think it's worth mentioning that your goal is still to find the global minimum. Even if your batch size is the entire dataset, that does not mean you will not fall into a local minimum; in most cases that is actually still the most probable outcome.

  • Thanks for the clarification. So my takeaways from this (correct me if I got this wrong) are that feeding the entire dataset, in one go, to a neural network isn't recommended for the following reasons: 1) today's computers' memory simply couldn't handle the huge datasets that neural nets need to train on if the whole dataset is fed in at once; 2) it will likely reach a local minimum (rather than a global one), since we're looking at an average gradient of the loss function over the entire dataset in one go (with respect to weights and biases). Hence SGD solves this by ... – Hazzaldo Aug 09 '19 at 00:12
  • ... splitting the data into batches, so that we calculate the gradient of the loss for each batch (wrt weights & biases), rather than the average over the entire dataset, and are therefore able to find a much better minimum point of the loss function, which we can treat as the global minimum. Thus, we should always split our data into batches, and work out the loss and backprop for each batch. – Hazzaldo Aug 09 '19 at 00:20
  • This leads me to my final takeaway: given what we discussed, the loss and backprop therefore take place for each batch, NOT at the end of an epoch. Once all batches have done their loss calculations and backprops, that counts as one epoch. Again, feel free to correct me if I got any of this wrong. – Hazzaldo Aug 09 '19 at 00:21
  • One point I wasn't clear on: you mentioned "Another is that it can achieve similar extrema with faster convergence". Sorry, could you briefly explain what you mean by this? Also, regarding the point "if you want to use a batch size of 24 but your device can only process 8, you may use gradient accumulation over 3 pseudo-batches to achieve an emulated batch of 24", do you know of a good tutorial explaining how this is done? Many thanks for your help and clear explanation. – Hazzaldo Aug 09 '19 at 00:24
  • @Hazzaldo No, point 2 of your takeaway is wrong. Both strategies will lead to **a** minimum, with no guarantee on either. – mshlis Aug 09 '19 at 02:31
  • @Hazzaldo The first part means that because SGD takes more backprop steps while still approximating the gradient, in a single epoch you generally go much farther; the second part just says that in practice some people use gradient accumulation to make up for a lack of GPU memory. – mshlis Aug 09 '19 at 02:32
  • Thank you. The last 2 points make sense. The only part I'm not sure about is where you said: "point 2 of your takeaway is wrong. Both strategies will lead to a minimum, with no guarantee on either". From other sources, I thought one major reason for SGD calculating the loss gradient and the subsequent backprop per batch, rather than for the entire dataset in one go, is so that you don't end up falling into a local minimum: calculating the gradient and updating per batch, rather than averaging over the entire set, thus finding separate local minimum points and taking steps to approximate to the best local ... – Hazzaldo Aug 09 '19 at 21:51
  • In theory, in a locally convex region and with some constraints on the learning rate, they will actually find the exact same minimum. – mshlis Aug 09 '19 at 21:52
  • ... minimum point, to then achieve a better global minimum point. I know there's no guarantee that this will take place, but that is the goal/purpose, isn't it? – Hazzaldo Aug 09 '19 at 21:52
  • OK, you mentioned: "One is that it's less likely to overfit in a stochastic setting than in the full setting". In what sense is SGD less likely to overfit? I'm just trying to clarify the clear advantages of SGD over pushing the full dataset through in one go (apart from memory and CPU limitations). In one course the explanation had to do with local and global minima, but you've pointed out that that's not the case here. – Hazzaldo Aug 09 '19 at 21:59
  • Imagine SGD essentially as GD with dropout on the loss, which may help by optimizing a smaller goal and seeing whether it generalizes to the rest, rather than solving the whole thing at once (there is no guarantee of this, though). – mshlis Aug 09 '19 at 22:01
  • Also, I recommend double-checking the point you mention, because I suspect they did not actually say that; most courses are checked for their material. – mshlis Aug 09 '19 at 22:02
  • Your point about dropout makes sense. Going back to the main point about SGD and local/global minima, here's the exact video (part of one of the best-selling courses on Udemy), which explains exactly what I mentioned: one of the main reasons for SGD is to avoid falling into a local minimum when your loss is not convex, because you're not calculating the gradient and updating the weights for the entire dataset, but for separate batches, which helps find more separate local minima and pick the best one as the global minimum: – Hazzaldo Aug 09 '19 at 22:18
  • https://www.udemy.com/deeplearning/learn/lecture/6753756#overview – Hazzaldo Aug 09 '19 at 22:18
  • I'm not saying the video and course must be right; I'm just quoting exactly what the course taught. – Hazzaldo Aug 09 '19 at 22:19