
It has been asked here if we should repeat lengthy experiments.

Let's say I can repeat them; how should I present the results? For instance, if I am measuring the accuracy of a model on test data over some training epochs, and I repeat this training several times, I will get different values of test accuracy. I can average them to take all the experiments into account. Can I then calculate a sort of confidence interval to say that the accuracy will most likely be within some interval? Does this make sense? If it does, what formula should I use?

It says here that we can use $\hat{x} \pm 1.96 \frac{\hat{\sigma}}{\sqrt{n}}$, but I don't quite understand the theory behind it.

1 Answer


It says here that we can use $\hat{x} \pm 1.96 \frac{\hat{\sigma}}{\sqrt{n}}$, but I don't quite understand the theory behind it.

Assuming the sample mean $\hat{x}$ is approximately Gaussian, $1.96$ is the critical value by which we multiply the standard error $\frac{\hat{\sigma}}{\sqrt{n}}$ (the sample standard deviation $\hat{\sigma}$ divided by $\sqrt{n}$) to get a $95\%$ confidence interval for the unknown true score $x$. That is, about $95\%$ of the intervals $[\hat{x} - 1.96\frac{\hat{\sigma}}{\sqrt{n}}, \hat{x}+1.96\frac{\hat{\sigma}}{\sqrt{n}}]$ constructed from different experiments and their corresponding lists of test scores will contain the true value $x$.

I guess this makes sense for cross-validation with $k \geq 10$ folds or repeats, although this issue baffles me too, and in my experience, practitioners either report $\text{mean}(x) \pm \text{std}(x)$ or just leave the details out.
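For concreteness, here is a minimal sketch of both reporting conventions (mean ± std and the 95% confidence interval), computed from $k$ repeated runs; the accuracy values below are placeholders, not taken from the question:

```python
import numpy as np
from scipy import stats

# Test accuracies from k repeated runs of the same training procedure
# (placeholder numbers, just for illustration).
accs = np.array([0.912, 0.905, 0.921, 0.899, 0.917, 0.908, 0.914, 0.903, 0.910, 0.918])

k = len(accs)
mean = accs.mean()
std = accs.std(ddof=1)           # sample standard deviation
sem = std / np.sqrt(k)           # standard error of the mean

# Convention 1: mean ± std
print(f"{mean:.3f} ± {std:.3f}  (mean ± std)")

# Convention 2: approximate 95% confidence interval for the mean,
# i.e. mean ± 1.96 * std / sqrt(k)
print(f"95% CI: [{mean - 1.96 * sem:.3f}, {mean + 1.96 * sem:.3f}]")

# For small k, the Student t critical value is more appropriate than 1.96:
t = stats.t.ppf(0.975, df=k - 1)
print(f"95% t-based CI: [{mean - t * sem:.3f}, {mean + t * sem:.3f}]")
```

The $t$-based interval approaches the $1.96$ rule as $k$ grows, which is one way to see why the plain $1.96$ formula is only reasonable once you have enough repeats.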

  • You don't need k-fold cv. If you are using a test set, then report the mean error and standard error based on the size of the test set. – Neil Slater May 19 '22 at 19:34
  • @NeilSlater What if we calculate accuracy, AUROC, etc.? – Sanjar Adilov May 20 '22 at 17:57
  • Accuracy is just fine: you can treat it as a binary variable per example and calculate the mean/sd from a single test set (in fact the accuracy measure is already the mean of this variable); see the sketch after these comments. With AUROC, yes, you would need to do it differently, but I am not sure it is common to report AUROC with bounds. The problem with quoting CV-based results is that, most often, k-fold CV is used for model selection, resulting in biased measurements. – Neil Slater May 20 '22 at 20:05
  • @NeilSlater Sorry, I'm not sure I get your point. There are benchmarks with no explicit train-test splits, so researchers usually propose their own CV strategies. Of course, they should not search for the best models based on test errors; they should split train sets into train-valid sets, explore test sets only after the best models acc. to val. sets are obtained, and immediately finalize the results. Pseudocode would be `cv_errors = cross_val_scores(GridSearchCV(MyModel()), X, y)`. In this case, how should we present the distribution of `cv_errors`? In my answer, I assume that 95% CI makes sense. – Sanjar Adilov May 21 '22 at 08:13
  • My point is that it would be more usual to use a hold-out test set for reporting the mean and confidence interval on results, and that this works. Your last paragraph declares that the confidence bounds on metrics may only make sense for k-fold CV. But for metrics such as MSE and accuracy that is not the case; they will have confidence bounds based on the size of a single test set. – Neil Slater May 21 '22 at 10:21
  • @NeilSlater I do not say CIs make sense *only* for k-fold cv. OP asked about summarizing repeated experiments - i.e., $k > 1$-times validation/testing (not necessarily $k$-*fold*, btw), and I wanted to emphasize that there might be a rule of thumb for minimum $k$. And I think your solution with one hold-out test set also makes sense. – Sanjar Adilov May 21 '22 at 10:55
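To make the single-test-set point from the comments concrete, here is a minimal sketch assuming a per-example 0/1 correctness indicator on one held-out test set (the array below is randomly generated as a placeholder; in practice it would come from comparing predictions with labels):

```python
import numpy as np

# Placeholder 0/1 indicator of whether each test example was classified correctly;
# in practice: correct = (y_pred == y_test).astype(int)
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=2000)

n = len(correct)
acc = correct.mean()                       # accuracy is the mean of this 0/1 variable
se = np.sqrt(acc * (1 - acc) / n)          # binomial (Wald) standard error

print(f"accuracy = {acc:.3f}, approx. 95% CI = [{acc - 1.96 * se:.3f}, {acc + 1.96 * se:.3f}]")
```

This is the normal-approximation interval for a proportion, so it needs only a single test set and no repeated training runs, which is exactly the point made in the comments above.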