
I am reading a book that states

As the mini-batch size increases, the gradient computed is closer to the 'true' gradient

So, I assume that they are saying that mini-batch training only focuses on decreasing the cost function in a certain 'plane', sacrificing accuracy for speed. Is that correct?

ngc1300

2 Answers


The basic idea behind mini-batch training is rooted in the exploration / exploitation tradeoff in local search and optimization algorithms.

You can view training an ANN as a local search through the space of possible parameters. The most common search method is to move all the parameters in the direction that reduces error the most (gradient descent).

However, ANN parameter spaces do not usually have a smooth topology. There are many shallow local optima. Following the global gradient will usually cause the search to become trapped in one of these optima, preventing convergence to a good solution.

Stochastic gradient descent solves this problem in much the same way as older algorithms like simulated annealing: you can escape from a shallow local optimum because you will eventually (with high probability) pick a sequence of updates based on a single point that "bubbles" you out. The problem is that you'll also tend to waste a lot of time moving in the wrong directions.

Mini-batch training sits between these two extremes. Basically, you average the gradient across enough examples that you still have some global error signal, but not so many that you'll get trapped in a shallow local optimum for long.

Recent research by Masters and Luschi suggests that, in fact, most of the time you'd want to use smaller batch sizes than is common practice. If you set the learning rate carefully enough, you can use a big batch size to finish training faster, but the difficulty of picking the right learning rate increases with the size of the batch.
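To make this concrete, here is a minimal sketch (my own illustration, not from the answer; the synthetic data, loss, and batch sizes are all assumptions) showing, in NumPy, that the mini-batch gradient of a mean-squared-error loss gets closer to the full-batch "true" gradient as the batch size grows:

    # Minimal sketch: compare the full-batch ("true") gradient of an MSE loss
    # with mini-batch estimates of it on synthetic linear-regression data.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: y = X @ w_true + noise
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)

    w = np.zeros(d)  # current parameters

    def mse_gradient(X_batch, y_batch, w):
        """Gradient of 0.5 * mean((X w - y)^2) with respect to w."""
        residual = X_batch @ w - y_batch
        return X_batch.T @ residual / len(y_batch)

    full_grad = mse_gradient(X, y, w)  # the "true" gradient over all examples

    for batch_size in (1, 10, 100, 1000):
        idx = rng.choice(n, size=batch_size, replace=False)
        mini_grad = mse_gradient(X[idx], y[idx], w)
        # As batch_size grows, the mini-batch estimate approaches the full gradient.
        gap = np.linalg.norm(mini_grad - full_grad)
        print(f"batch_size={batch_size:5d}  ||mini_grad - full_grad|| = {gap:.4f}")

The printed gap typically shrinks as the batch size increases, which is the sense in which a larger mini-batch is "closer to the true gradient", while a smaller one injects the noise that helps escape shallow optima.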

John Doucette
  • So, in short, my interpretation of what they meant by 'true' gradient was incorrect? Not that the gradient isn't calculated w.r.t. all the weights, but not 'true' in the sense that it isn't averaged over all the training examples? – ngc1300 Aug 09 '18 at 04:46
  • @pmac I'm not sure I understand what you mean. The true gradient is the average of the gradient over all training data points. Mini-batches approximate this. The key is that, in many problems, approximating it might lead to better movement through the parameter space than computing it exactly. – John Doucette Aug 09 '18 at 05:32
  • Let me use an example. If the cost function only depends on two weights, C(w1,w2), then it can be visualized as a surface over a 2D plane. From the explanation in the book, I was thinking that mini-batch training somehow would try to take a step 'downhill' in only the w1 or w2 direction after each batch. It's less direct than taking a 'diagonal' step that is a linear combination of those two directions, but less of a computational burden. Am I making sense? – ngc1300 Aug 11 '18 at 04:44
  • Ah, no, what you're describing is the difference between "coordinate" and "conjugate" gradient descent. Mini-batch training is more like "instead of computing the shape of the plane by averaging together all of the data, let's just average together part of it". However, depending on the method you use to adjust weights, you can still move directly along the steepest gradient of the resulting surface. – John Doucette Aug 13 '18 at 14:19
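A minimal sketch of the distinction drawn in the comment above (illustrative only; the toy data, learning rate, and batch are my own assumptions): with a two-weight cost C(w1, w2), a mini-batch step still moves both weights along the batch's steepest-descent direction, whereas a coordinate-descent-style step changes only one of them.

    # Illustrative only: mini-batch gradient step vs. coordinate-descent-style step
    # for a two-weight linear model fitted with an MSE loss.
    import numpy as np

    rng = np.random.default_rng(1)

    # Toy data for y ≈ 2*x1 - 3*x2
    X = rng.normal(size=(200, 2))
    y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=200)

    def grad(w, idx):
        """Gradient of 0.5 * mean((X w - y)^2) over the rows in idx."""
        Xb, yb = X[idx], y[idx]
        return Xb.T @ (Xb @ w - yb) / len(idx)

    w = np.zeros(2)
    lr = 0.1
    batch = rng.choice(len(y), size=20, replace=False)
    g = grad(w, batch)

    # Mini-batch step: a "diagonal" move, both weights change at once.
    w_minibatch = w - lr * g

    # Coordinate-descent-style step: only w1 is updated.
    w_coordinate = w.copy()
    w_coordinate[0] -= lr * g[0]

    print("mini-batch step :", w_minibatch)   # both components move
    print("coordinate step :", w_coordinate)  # only the first component moves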

It's like you have a class of 1000 children, and you, being the teacher, want all of them to learn something at the same time. This is difficult because they are not all the same; they have different adaptability and reasoning strength. So you can take alternative strategies for the same task.

1) Take one child at a time and train them. This is a good approach, but it takes a long time. Here your batch size is 1.

2) Take a group of 10 children and train them. This can be a good compromise between time and learning; in a smaller group, you can handle the naughty ones better. Here your batch size is 10.

3) If you take all 1000 children and teach them at once, it takes a very short time, but you will not be able to give proper attention to the mischievous ones. Here your batch size is 1000.

It is the same with machine learning: take a reasonable batch size and tune the weights accordingly. I hope this analogy clears up your doubt.
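A rough mapping of the analogy onto training mechanics (my own illustration, not part of the answer): for a fixed dataset of 1000 examples, the batch size decides how many weight updates you make per epoch, which is the time-versus-attention trade-off above.

    # Illustrative only: updates per epoch for the three "classroom" strategies,
    # assuming a dataset of 1000 examples.
    n_examples = 1000
    for batch_size in (1, 10, 1000):
        updates_per_epoch = n_examples // batch_size
        print(f"batch_size={batch_size:4d} -> {updates_per_epoch:4d} updates per epoch")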

Patel Sunil
  • Thanks Patel. This is a good analogy. What is your input on the comment I made to the other answer? – ngc1300 Aug 10 '18 at 23:30