The basic idea behind mini-batch training is rooted in the exploration / exploitation tradeoff in local search and optimization algorithms.
You can view training an ANN as a local search through the space of possible parameter values. The most common search method is to move all of the parameters in the direction that reduces the error the most (gradient descent).
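As a concrete (if simplified) illustration, here is a minimal sketch of full-batch gradient descent on a toy linear regression problem with a squared-error loss. The data, loss, and learning rate are hypothetical choices made only for illustration; a real ANN has a far more complicated, non-convex error surface.

```python
# Minimal sketch: full-batch gradient descent on a toy linear regression
# problem (hypothetical data and learning rate, chosen only for illustration).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # 1000 examples, 5 features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                                # parameters we are searching over
lr = 0.1                                       # learning rate (step size)

for step in range(200):
    # Gradient of the mean squared error over the *entire* dataset.
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad                             # move downhill along the full gradient
```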
However, the error surface of an ANN is usually not smooth: it contains many shallow local optima. Following the gradient computed over the full training set will usually cause the search to become trapped in one of these optima, preventing convergence to a good solution.
Stochastic gradient descent solves this problem in much the same way as older algorithms like simulated annealing: you can escape from a shallow local optimum because, with high probability, you will eventually pick a sequence of single-example updates that "bubbles" you out. The drawback is that you also tend to waste a lot of time moving in the wrong direction.
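Here is a minimal sketch of that single-example flavour of SGD on the same kind of toy problem. Again, the data and learning rate are hypothetical, and the toy loss is convex, so there are no shallow optima to escape here; the point is just the noisy, one-example-at-a-time update.

```python
# Minimal sketch: stochastic gradient descent with one example per update
# (hypothetical toy setup; a real ANN would use a framework's autograd).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
w, lr = np.zeros(5), 0.01

for epoch in range(20):
    for i in rng.permutation(len(y)):          # visit examples in random order
        xi, yi = X[i], y[i]
        # Gradient estimated from a single example: cheap, but very noisy.
        grad = 2.0 * xi * (xi @ w - yi)
        # The noise in each step is what can knock the search out of a
        # shallow local optimum (and also what wastes time in wrong directions).
        w -= lr * grad
```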
Mini-batch training sits between these two extremes. Basically, you average the gradient over enough examples that you still have some global error signal, but not over so many that you get trapped in a shallow local optimum for long.
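A minimal sketch of the mini-batch version, assuming the same kind of toy setup and an illustrative batch size of 32; the only change from the previous sketches is that the gradient is averaged over a small slice of the data at each step.

```python
# Minimal sketch: mini-batch gradient descent (hypothetical toy setup;
# batch size and learning rate are illustrative, not recommendations).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
w, lr, batch_size = np.zeros(5), 0.1, 32

for epoch in range(20):
    order = rng.permutation(len(y))            # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Average the per-example gradients over the mini-batch: noisier than
        # the full-batch gradient, but far less noisy than a single example.
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad
```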
Recent research by Masters and Luschi suggests that, in practice, you usually want smaller batch sizes than the ones commonly used today. If you tune the learning rate carefully, you can use a large batch size to finish training faster, but the difficulty of picking a good learning rate grows with the batch size.