
Let's say I've got a training sample set of 1 million records, which I pull batches of 100 from to train a basic regression model using gradient descent and MSE as a loss function. Assume test and cross validation samples have already been withheld from the training set, so we have 1 million entries to train with.

Consider following cases:

  • Run 2 epochs (I'm guessing this one is potentially bad as it's basically 2 separate training sets)
    • In the first epoch, train over records 1–500K
    • In the second epoch, train over records 500K–1M
  • Run 4 epochs
    • In the first and third epochs, train over records 1–500K
    • In the second and fourth epochs, train over records 500K–1M
  • Run X epochs, where each epoch trains on a random 250K-sample subset drawn from the training set

Should every epoch use the exact same samples? Is there any benefit or drawback to doing so? My intuition is that any deviation in samples changes the 'topography' of the surface you're descending, but I'm not sure whether that matters if the samples are drawn from the same population.
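The three schemes above amount to different index-selection strategies. A minimal sketch in NumPy (the set size, batch size, and subset size come from the question; the helper names and `rng` seed are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                # training-set size from the question
indices = np.arange(n)

def epoch_batches(idx, batch_size=100):
    """Yield mini-batches of indices for one epoch."""
    for start in range(0, len(idx), batch_size):
        yield idx[start:start + batch_size]

# Cases 1 and 2: fixed halves, alternated across epochs
first_half, second_half = indices[:n // 2], indices[n // 2:]

# Case 3: a fresh random 250K subset each epoch (no repeats within an epoch)
def random_subset():
    return rng.choice(indices, size=250_000, replace=False)
```

Each epoch then iterates `epoch_batches(...)` over whichever index set the scheme prescribes, so the only thing that differs between the cases is which rows the optimizer ever sees together.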

This relates to a SO question: https://stackoverflow.com/questions/39001104/in-keras-if-samples-per-epoch-is-less-than-the-end-of-the-generator-when-it

Ray

1 Answer


Your goal in regression should be to obtain the factors that yield the best-fit model without over-fitting. The more data you have in the training set, the better your regression will be. Thus you want to train on as much data as possible, but you also want some data held out to validate that your model is not over-fit. This is where you should split your data into, say, an 80/20 training/validation set. If data is scarce, or you want that 20% to contribute to the model as well, you could do 5-fold cross-validation.
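A sketch of that split-and-validate workflow using only NumPy (the synthetic data, the closed-form least-squares fit, and all helper names are hypothetical stand-ins for the poster's gradient-descent regression):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

def fit(Xs, ys):
    # ordinary least squares stand-in for the trained regression model
    w, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return w

def mse(w, Xs, ys):
    return float(np.mean((Xs @ w - ys) ** 2))

# 80/20 train/validation split
perm = rng.permutation(n)
cut = int(0.8 * n)
train_idx, val_idx = perm[:cut], perm[cut:]
val_mse = mse(fit(X[train_idx], y[train_idx]), X[val_idx], y[val_idx])

# 5-fold cross-validation: every sample is validated exactly once
folds = np.array_split(perm, 5)
cv_mses = []
for k in range(5):
    va = folds[k]
    tr = np.concatenate([folds[j] for j in range(5) if j != k])
    cv_mses.append(mse(fit(X[tr], y[tr]), X[va], y[va]))
```

The validation MSE estimates generalization from one held-out slice; the five CV scores average that estimate over every slice, which is why CV is the usual fallback when data is scarce.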

In the spirit of research perhaps you should try both of these routes, and report your findings.

Snives
  • The 1 million training set in the post doesn't include the samples withheld for testing or cross-validation--assume those have already been set aside. I'm only concerned with whether the final training set should be identical across epochs or whether it can vary. – Ray Jan 06 '17 at 15:23
  • I suppose the assumption here is that the two training data sets you wish to train on in these different ways have the same distribution, variance, and homoscedasticity, in which case it shouldn't make a difference in the result. However, if there is a difference, then you would expect a better result from training on all the training data combined. – Snives Jan 08 '17 at 17:49
  • Now that I know a little more about descriptive statistics, I understand your comment. I see that if multiple sufficiently sized samples are pulled from the same population, then by the central limit theorem they should share a similar distribution and variance, so my intuition is that it shouldn't make a difference. So I'm upvoting you, but I'm not yet fully convinced. If you can add more details/proof or point me to where I could find more supporting info, I'd be appreciative. – Ray Feb 20 '17 at 20:35
  • **WARNING: Amateur Statistician Cadet Navigation Attempt** Just to clarify my point, the central limit theorem implies their means should follow a normal distribution (meaning, no pun, they tend to be close to one another) with the standard error decreasing as you increase the sample size. I'm guessing this also implies their own variance will deviate less with sufficient sample size. – Ray Feb 20 '17 at 20:41
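The claim in these comments is easy to check numerically: repeatedly draw samples of a fixed size from one (deliberately skewed) population and watch the standard error of their means shrink roughly as 1/sqrt(n). A sketch (the population, sizes, and function name are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# a skewed population, so normality of the means is not "baked in"
population = rng.exponential(scale=2.0, size=1_000_000)

def std_error_of_means(sample_size, n_samples=500):
    """Std. deviation of the means of n_samples random samples."""
    means = [rng.choice(population, size=sample_size).mean()
             for _ in range(n_samples)]
    return float(np.std(means))

se_small = std_error_of_means(100)      # ~ sigma / sqrt(100)
se_large = std_error_of_means(10_000)   # ~ sigma / sqrt(10_000)
```

With the population standard deviation near 2, `se_small` should sit near 0.2 and `se_large` near 0.02, i.e. a 100-fold larger sample cuts the standard error by about a factor of 10, which is the comment's point about sufficiently sized samples looking alike.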