
I was reading a paper in which the authors state the following:

> We also use the T-Test to test the significance of GMAN in 1 hour ahead prediction compared to Graph WaveNet. The p-value is less than 0.01, which demonstrates that GMAN statistically outperforms Graph WaveNet.

What does "Model A statistically outperforms Model B" mean in this context? And how should the p-value threshold be selected?

You should read about hypothesis testing, null hypotheses, and p-values. These are basic statistical concepts. Maybe [Khan Academy](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample) is a good place to start. The threshold for $p$ is typically a hyper-parameter and needs to be chosen before the experiment. To understand why, you should first understand the meaning of the p-value, which actually takes a little bit of time to grasp. – nbro Jul 03 '20 at 22:58

1 Answer


Most model-fitting is stochastic, so you get different parameters every time you train, and you usually can't say that one algorithm will always give you a better-performing model.

However, since you can retrain many times to get a distribution of models, you can use a statistical test like the T-Test to say "algorithm A usually produces a better model than algorithm B," which is what they mean by "statistically outperforms."
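As a concrete sketch of this idea: retrain each model several times, record a performance metric per run, and run a two-sample t-test on the two sets of scores. The error values below are made-up numbers purely for illustration (they are not from the paper), and the test uses SciPy's `scipy.stats.ttest_ind`:

```python
# Hypothetical example: comparing two training algorithms via a t-test.
# The MAE values are invented for illustration; lower is better.
from scipy import stats

mae_a = [3.01, 2.98, 3.05, 2.99, 3.02, 3.00, 2.97, 3.04, 3.01, 2.96]  # algorithm A
mae_b = [3.20, 3.15, 3.22, 3.18, 3.25, 3.19, 3.17, 3.24, 3.21, 3.16]  # algorithm B

# Null hypothesis: the two algorithms produce models with the same mean error.
t_stat, p_value = stats.ttest_ind(mae_a, mae_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")

# The threshold (significance level) must be chosen before the experiment.
alpha = 0.01
if p_value < alpha:
    print("Reject the null hypothesis: A's mean error differs significantly from B's.")
```

With clearly separated score distributions like these, the p-value comes out far below 0.01, which is the sense in which one algorithm "statistically outperforms" the other.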

The p-value threshold (the significance level, often written as α) is usually set by consensus in the field. The higher the threshold, the weaker the evidence required to conclude that there's a statistical difference between the distributions being compared. A threshold of 0.1 might be normal in a field where data is very expensive to collect (like risky, long-term studies of humans), but in machine learning it's usually easy enough to retrain a model that 0.01 is common, and demonstrates very high confidence. To learn more about selecting and interpreting these values, I recommend Wikipedia's page on statistical significance.

alltom