
In a related question on Computer Science SE, a user said:

Neural networks typically require a large training set.

Is there a way to define the boundaries of the "optimal" size of a training set in the general case?

When I was learning about fuzzy logic, I heard some rules of thumb that involved examining the mathematical composition of the problem and using that to define the number of fuzzy sets.

Is there a similar method that can be applied to an already defined neural network architecture?

Zoltán Schmidt

2 Answers


For a finite amount of training data to be 'optimal,' you typically need some benefit from adding more paired with some cost of adding more, and eventually the two curves cross because the benefit decreases while the cost increases.

Most models see a reduction in error with more training data that asymptotically approaches the best the model can do. See this image (from here) as an example:

Decreasing error with increasing training set size
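One way to see this curve for a particular architecture is to measure a learning curve empirically, i.e. validation error as a function of training set size. Here is a minimal sketch with scikit-learn; the dataset, the small MLP, and the grid of training sizes are illustrative assumptions, not something from this answer:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

# Illustrative setup: a small built-in dataset and a small MLP.
X, y = load_digits(return_X_y=True)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)

# Cross-validated scores at increasing training-set sizes.
train_sizes, _, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5
)

# Validation error usually drops quickly at first, then flattens out.
val_error = 1.0 - val_scores.mean(axis=1)
for n, err in zip(train_sizes, val_error):
    print(f"{int(n):5d} training examples -> validation error {err:.3f}")
```

Plotting val_error against train_sizes reproduces the shape of the figure above: steep gains early, diminishing returns later.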

The costs of training data are also somewhat obvious; data is costly to obtain, to store, and to move. (Assuming model complexity stays constant, the actual cost of storing, moving, and using the model remains the same, since the weights in the model are just being tuned.)

So at some point the slope of the error-reduction curve flattens enough that additional data points cost more than they're worth, and that is the optimal amount of training data.
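As a toy illustration of that crossover (the error curve's shape, the cost per example, and the value placed on an error reduction are all numbers assumed for the sketch, not figures from this answer), you can model the error as an asymptotic function of n and stop once the marginal error reduction is worth less than the marginal cost of one more example:

```python
import math

# Hypothetical learning curve: error(n) = FLOOR + K / sqrt(n).
FLOOR, K = 0.05, 2.0

def error(n):
    """Assumed validation error after training on n examples."""
    return FLOOR + K / math.sqrt(n)

COST_PER_EXAMPLE = 0.01      # assumed cost of collecting one more example
VALUE_PER_ERROR_UNIT = 50.0  # assumed value of reducing error by 1.0

n = 100
# Keep adding data while the marginal benefit exceeds the marginal cost.
while (error(n) - error(n + 1)) * VALUE_PER_ERROR_UNIT >= COST_PER_EXAMPLE:
    n += 1

print(f"Under these assumptions, more data stops paying off around n = {n}")
```

Past that point the curve is flat enough that each additional example costs more than the accuracy it buys, which is exactly the crossing described above.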

Matthew Gray

In general, the larger the training set, the better. See The Unreasonable Effectiveness of Data, though that article is quite dated (written in 2009). Xavier Amatriain, a researcher at Netflix, has a Quora answer in which he discusses how more data can sometimes hurt algorithms.

For deep neural networks in particular, it does not seem that we have hit these limits yet.

Harsh