
Let's say we use an MLE estimator (the implementation doesn't matter) and we have a dataset that we assume has been sampled from a Gaussian distribution $\mathcal N(\mu, \sigma^2)$.

Now, we split the dataset into training, validation and test sets. Fitting each split by maximum likelihood gives three Gaussian distributions: $\mathcal N(\mu_{training}, \sigma^2_{training})$, $\mathcal N(\mu_{validation}, \sigma^2_{validation})$ and $\mathcal N(\mu_{test}, \sigma^2_{test})$.
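For concreteness, here is a minimal sketch of those three fits, assuming NumPy and arbitrary choices of $\mu = 5$, $\sigma = 2$ and a 60/20/20 split:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" distribution N(mu, sigma^2) and a sample drawn from it.
mu, sigma = 5.0, 2.0
data = rng.normal(mu, sigma, size=1000)

# 60/20/20 split into training, validation and test sets.
train, validation, test = np.split(rng.permutation(data), [600, 800])

# The Gaussian MLE is just the sample mean and the (biased) sample variance.
for name, split in [("training", train), ("validation", validation), ("test", test)]:
    print(f"{name}: mu_hat = {split.mean():.3f}, sigma2_hat = {split.var():.3f}")
```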

Now, let's assume the case where $\mu_{validation}<\mu_{training}<\mu_{test}$ and $\mu_{training}<\mu<\mu_{test}$.

Clearly, if we perform validation using this split, then the model that gets selected will be pulled towards $\mu_{validation}$, which will worsen the performance on the actual data, whereas if we only used the training set, the performance could actually be better (this is the simplest case, ignoring the effect of variance).

So there are $4!$ possible orderings of the four means $\mu$, $\mu_{training}$, $\mu_{validation}$ and $\mu_{test}$, and each one might either improve or worsen the performance (in roughly $50\%$ of the cases performance will be worsened, assuming symmetry).
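To make the symmetry argument concrete, here is a small simulation sketch (assuming NumPy, an arbitrary 60/20/20 split, and $\mu = 0$, $\sigma = 1$) that counts how often the validation mean lands on the opposite side of the true $\mu$ from the test mean, i.e. the situation where fitting towards $\mu_{validation}$ moves us away from $\mu_{test}$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 0.0, 1.0, 300, 10_000

opposite_sides = 0
for _ in range(trials):
    data = rng.normal(mu, sigma, size=n)
    train, validation, test = np.split(rng.permutation(data), [180, 240])
    # Does the validation mean land on the opposite side of the true mu
    # from the test mean?  If so, pulling the model towards mu_validation
    # pulls it away from mu_test.
    if (validation.mean() - mu) * (test.mean() - mu) < 0:
        opposite_sides += 1

print(f"validation and test means on opposite sides of mu: {opposite_sides / trials:.1%}")
```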

So, what am I missing here? Were my assumptions above wrong? Or does the validation set have a completely different purpose?


1 Answer


I think Cross-Validation serves a completely different purpose.

From your post, it looks like you think we would use CV to get a better estimate of the parameters of our model (i.e. the model parameters after cross validation are closer to the parameters of the test data).

In fact, we use CV to get an estimate of generalization error while keeping our test set outside the training process. That is, we use it to answer the question "What is the size of the difference between my training and testing performance likely to be?". If you have an estimate of this that you are confident in, you can be confident that when you deploy a model to your customers, the model will actually work as you expect.

If you're only going to build a single model, then you don't need cross validation. You just train the model on the training data, and test it on the test data. Then you have an unbiased estimate of generalization error.
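As a minimal sketch of that single-model case (assuming scikit-learn, a synthetic regression dataset and a ridge model; any dataset and model would do):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical setup: a synthetic regression problem stands in for real data.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
# The gap between these two numbers is the unbiased estimate of generalization
# error for this single, fixed model.
print(f"train MSE: {train_mse:.1f}, test MSE: {test_mse:.1f}")
```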

However, we might want to try out many different kinds of models, and many different parameters (broadly, we might want to do hyperparameter tuning). To do this, we need to understand how generalization error changes as we change our hyperparameters, and then use this information to pick hyperparameter values that we think will minimize the actual error when we deploy the model.

You could do this by training different models on the training set, and then testing them on the test set, recording the difference in model performance on the two sets. If you use this as a basis to pick a model though, you have effectively pulled the test set inside your training process (model parameters were implicitly selected using the test set, since you picked the parameters with the lowest test error). This bias will make your true generalization error much larger than what you observed.

As a stopgap, you could split your training set into a 'real training' set and a validation set. You could train models on the 'real training' set, and then measure their performance on the validation set. The difference would be a biased (but hopefully still useful) estimate of generalization error. You could then test against the test set just once (at the end) to get an unbiased estimate that you can use to decide whether or not to deploy the model.
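A minimal sketch of that stopgap workflow, again assuming scikit-learn, a ridge model and a hypothetical grid of `alpha` values:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Carve a validation set out of the training data ('real training' vs validation).
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_alpha, best_val_mse = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:   # hypothetical hyperparameter grid
    model = Ridge(alpha=alpha).fit(X_fit, y_fit)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_alpha, best_val_mse = alpha, val_mse

# Touch the test set exactly once, after the hyperparameter has been chosen.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(f"chosen alpha: {best_alpha}, validation MSE: {best_val_mse:.1f}, test MSE: {test_mse:.1f}")
```

Note that the validation score guides the choice of `alpha`, while the test set is only used once at the very end.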

A better workflow is to use CV on the training set to get an estimate of generalization error during hyperparameter optimization. You get $k$ samples from $k$-fold cross-validation, so you can do statistical testing to see whether one model truly has better generalization error than another, or whether it's just a fluke. This decreases the degree of bias in your estimates of generalization error. Then, once you've completed hyperparameter optimization, you can run your final model against the test set once to obtain a truly unbiased estimate of your final generalization error.
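A sketch of that CV workflow, assuming scikit-learn and SciPy, with two hypothetical `alpha` settings standing in for two candidate models (a paired t-test over the folds is just one possible choice of test):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Two candidate hyperparameter settings, scored on the same folds.
scores_a = cross_val_score(Ridge(alpha=0.1), X_train, y_train, cv=cv,
                           scoring="neg_mean_squared_error")
scores_b = cross_val_score(Ridge(alpha=10.0), X_train, y_train, cv=cv,
                           scoring="neg_mean_squared_error")

# Paired test over the k folds: is the difference more than a fluke?
t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"mean CV MSE: {-scores_a.mean():.1f} vs {-scores_b.mean():.1f}, p = {p_value:.3f}")

# Pick the winner, refit on all training data, and touch the test set exactly once.
best_alpha = 0.1 if scores_a.mean() > scores_b.mean() else 10.0
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(f"final test MSE: {mean_squared_error(y_test, final_model.predict(X_test)):.1f}")
```

The test set is only touched in the last two lines, which is what keeps that final estimate unbiased.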

John Doucette
  • So the thing I am skeptical about is the part 'hopefully'. If the CV set is farther from the test set or true distribution in terms of mean and variance, then it'll portray the wrong model/hyperparameter choices as correct. – DuttaA Apr 14 '20 at 00:23
  • @DuttaA Ah, I think I see what you mean. This is actually a matter of some debate. I think the current consensus is that bootstrapping has better theoretical support than CV, but CV works fine in practice. I think I can write up a better version at a later date. – John Doucette Apr 14 '20 at 01:39