I have an NLP model for answer extraction: given a paragraph and a question as input, the model extracts the span of the paragraph that answers the question.

I need to know how to compute the F1 score for such models. It is the standard metric (along with Exact Match) used in the literature to evaluate question-answering systems.

2 Answers

In QA, the F1 score is computed over the individual words in the prediction against those in the ground-truth answer. The number of words shared between the prediction and the truth is the basis of the score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth.

Source
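
For concreteness, here is a minimal sketch of this token-overlap computation, in the spirit of the official SQuAD evaluation script (the normalization details are illustrative of that script's conventions, not a drop-in replacement for it):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace
    (mirroring the normalization used by the SQuAD eval script)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def qa_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted span and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    # Shared words, counted with multiplicity.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)  # shared / words in prediction
    recall = num_same / len(gold_tokens)     # shared / words in ground truth
    return 2 * precision * recall / (precision + recall)

print(qa_f1("in the 10th century", "the 10th century"))  # 0.8
```

When a question has several gold answers, the SQuAD evaluation takes the maximum F1 over all of them; Exact Match is simply whether the normalized prediction equals a normalized gold answer.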

It really depends on what you want your model to do. For example, are false negatives or false positives particularly costly for your research (or your business)? It is also very important to consider your label (class) distribution.

If you just want to achieve the highest accuracy, and you don't have any issue with your class distribution (which, I believe, you probably don't in your case), then accuracy works pretty well.

The F1 score might be a better option if you need to strike a balance between precision and recall, especially when the class distribution is uneven.
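
For reference, the F1 score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$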

  • Thanks for your answer. In my case, I need to compute F1 to compare my model to another one, so I need a clear definition of F1 for question-answering tasks like those on the SQuAD dataset. In other words, I need to know how the F1 score is calculated for a model trained on the SQuAD dataset. – HLeb Aug 02 '20 at 11:40