
pass@k is a metric used to evaluate models that generate code; it was used, for example, to evaluate Codex. To evaluate pass@k, you have a dataset of natural-language/code pairs, and you pass each NL prompt to the model. For each prompt, it generates k code snippets. If at least one of the k snippets is correct, then the model succeeded at that prompt in k samples. pass@k is the fraction of prompts for which the model succeeded in this sense.
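
In code, my understanding of the metric is roughly this (the names here are mine, not from the Codex paper):

```python
def pass_at_k(per_prompt_correct_flags):
    """per_prompt_correct_flags: one list of booleans per prompt,
    one flag per sampled snippet (True = that snippet was correct)."""
    solved = sum(1 for flags in per_prompt_correct_flags if any(flags))
    return solved / len(per_prompt_correct_flags)

# Example: 3 prompts, k = 2 samples each; prompts 1 and 3 are solved.
print(pass_at_k([[True, False], [False, False], [False, True]]))  # 0.666...
```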

The samples generated for each prompt are obtained via some stochastic procedure based on the model's output probability distribution over the vocabulary, for example by sampling the next token at random from that distribution.
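
Something like plain multinomial sampling from the output distribution, I assume (in practice there is presumably a temperature or nucleus-sampling step on top, but this is the basic idea):

```python
import numpy as np

# Toy next-token sampler: draw one token index at random from the
# model's probability distribution over the vocabulary.
def sample_next_token(probs, rng=np.random.default_rng(0)):
    return rng.choice(len(probs), p=probs)

print(sample_next_token(np.array([0.7, 0.2, 0.1])))  # usually 0, sometimes 1 or 2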

So in the Codex paper for example, we see these figures for the largest Codex model:

  • pass@1: 28%
  • pass@100: 72%

These numbers make no sense to me. Every time we sample the model's prediction for a given prompt, we get back a random code snippet. That output is either correct or incorrect. Each trial is independent. So if the probability of one output being correct is p, the probability of at least one of 100 being correct should be 1 - (1 - p)^100. Here p=0.28, so the pass@100 should be like 99.999999%.
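
Just to show the arithmetic behind that claim (under my assumption that every sample has the same fixed success probability p):

```python
p = 0.28
print(1 - (1 - p) ** 100)  # prints roughly 0.999999999999995 -- effectively 1
```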

Are the trials not independent? What's going on?

1 Answer


Very late to this question, but pass@k doesn't behave the way you're describing because the per-sample success probability is not the same for every prompt. For any single prompt, the k samples are indeed (roughly) independent, so the chance that at least one of them is correct is 1 - (1 - p)^k with that prompt's own p. But pass@k averages this quantity over prompts whose success probabilities can be wildly different, and averaging 1 - (1 - p_i)^k over prompts is not the same as plugging the averaged p into 1 - (1 - p)^100.

Here's an illustration. Say our dataset contains two prompts, such that the model always solves prompt 1 and never solves prompt 2. Then pass@1 = 50%. Now draw 100 samples for each prompt: we get 100 correct completions on prompt 1 (counts as solved) and 0 correct completions on prompt 2 (counts as unsolved), so pass@100 = 50% as well. In fact pass@k = 50% for every k in this toy setting, no matter how large k gets.
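
Here is that toy example in code, using the formula from the question per prompt and assuming the k samples for a given prompt are independent with that prompt's own success probability (just an illustration, not the estimator actually used in the Codex paper):

```python
def expected_pass_at_k(per_prompt_success_probs, k):
    """Expected pass@k when prompt i has per-sample success probability p_i
    and the k samples for a prompt are independent."""
    return sum(1 - (1 - p) ** k for p in per_prompt_success_probs) / len(per_prompt_success_probs)

# Toy dataset from above: one prompt the model always solves, one it never solves.
print(expected_pass_at_k([1.0, 0.0], k=1))    # 0.5
print(expected_pass_at_k([1.0, 0.0], k=100))  # still 0.5 -- extra samples can't help the impossible prompt

# Contrast with the question's assumption of one fixed p = 0.28 for every prompt:
print(expected_pass_at_k([0.28, 0.28], k=100))  # ~0.999999999999995
```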