It has been asked here whether we should repeat lengthy experiments.
Suppose I can repeat them: how should I present the results? For instance, if I measure the accuracy of a model on test data over some training epochs, and I repeat this training several times, I will get different values of test accuracy. I can average them to take all the runs into account. Can I then calculate some kind of confidence interval to say that the accuracy will most likely fall within a given range? Does this make sense? If it does, what formula should I use?
It says here that we can use $\hat{x} \pm 1.96 \frac{\hat{\sigma}}{\sqrt{n}}$, but I don't quite understand the theory behind it.
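To make the question concrete, here is a minimal sketch of how I would apply that formula, assuming $\hat{x}$ is the mean of the $n$ repeated runs and $\hat{\sigma}$ is their sample standard deviation. The function name `mean_confidence_interval` and the accuracy values are made up for illustration; they are not real results.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Return (mean, lower, upper) for the interval mean +/- z * sigma / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate the spread")
    mean = statistics.mean(values)
    # Sample standard deviation (divides by n - 1, not n).
    sigma = statistics.stdev(values)
    half_width = z * sigma / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Hypothetical final test accuracies from five repeated training runs.
accuracies = [0.91, 0.93, 0.92, 0.90, 0.94]
mean, lower, upper = mean_confidence_interval(accuracies)
print(f"accuracy = {mean:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

As far as I understand, the factor 1.96 comes from the normal approximation, so with only a handful of runs the interval is probably too narrow, and a Student's t quantile with $n - 1$ degrees of freedom would give a wider one.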