pass@k is a metric for evaluating models that generate code; it was used, for example, to evaluate Codex. To compute pass@k, you take a dataset of natural-language/code pairs and feed each natural-language prompt to the model. For each prompt, the model generates k code samples. If at least one of those samples is correct, the model is said to have solved that prompt within k samples. pass@k is the fraction of prompts solved in this sense.
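For concreteness, here is a minimal sketch of that naive computation as I understand it (the `sample` and `is_correct` callables are placeholders standing in for the model and the test harness, not the actual Codex evaluation code):

```python
from typing import Callable, Sequence

def pass_at_k(
    prompts: Sequence[str],
    sample: Callable[[str], str],            # draws one code snippet for a prompt
    is_correct: Callable[[str, str], bool],  # e.g. "does the snippet pass the tests?"
    k: int,
) -> float:
    """Naive pass@k: fraction of prompts solved by at least one of k samples."""
    solved = 0
    for prompt in prompts:
        if any(is_correct(prompt, sample(prompt)) for _ in range(k)):
            solved += 1
    return solved / len(prompts)
```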
The samples for each prompt are drawn by some stochastic decoding procedure based on the model's output probability distribution over the vocabulary, e.g. picking each next token at random according to that distribution.
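As a toy illustration of that kind of sampling, here is plain temperature sampling over a single logits vector (the actual decoding settings used for Codex may differ):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8) -> int:
    """Sample a token index from a logits vector via temperature sampling."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```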
So in the Codex paper for example, we see these figures for the largest Codex model:
- pass@1: 28%
- pass@100: 72%
These numbers make no sense to me. Every time we sample the model's output for a given prompt, we get back a random code snippet that is either correct or incorrect, and each trial is independent. So if the probability of a single sample being correct is p, the probability that at least one of 100 samples is correct should be 1 - (1 - p)^100. With p = 0.28, that gives a pass@100 of essentially 100%, not 72%.
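To make my expectation explicit, this is the back-of-the-envelope calculation I'm doing (assuming every sample, for every prompt, is an independent Bernoulli trial with the same success probability p):

```python
p = 0.28                          # assumed per-sample success probability
pass_at_100 = 1 - (1 - p) ** 100
print(pass_at_100)                # ~0.9999999999999946, i.e. essentially 100%
```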
Are the trials not independent? What's going on?