For questions related to the concept of a metric, i.e. a function that defines the distance between pairs of elements in a set. Note, however, that the word "metric" often refers to an evaluation measure that is not necessarily a metric in the mathematical sense.
Questions tagged [metric]
42 questions
8
votes
2 answers
Why is perplexity a good evaluation metric for chatbots?
A few papers I have come across say that BLEU is not an appropriate evaluation metric for chatbots, so they use perplexity instead.
First of all, what is perplexity? How is it calculated? And why is it a good evaluation metric for chatbots?

RuiZhang1993
- 89
- 2
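For reference, perplexity is the exponentiated average negative log-likelihood that the model assigns to a held-out sequence. A minimal sketch (the helper name and inputs are illustrative):

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: natural-log probabilities the model assigned
    # to each token of the held-out sequence.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy check: assigning probability 0.25 to each of 4 tokens
# gives perplexity exp(log 4) = 4.
print(perplexity([math.log(0.25)] * 4))  # -> 4.0
```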
7
votes
2 answers
How is the F1 score calculated in a question-answering system?
I have an NLP model for answer extraction: basically, I have a paragraph and a question as input, and my model extracts the span of the paragraph that corresponds to the answer to the question.
I need to know how to compute the F1 score for such…

HLeb
- 549
- 5
- 10
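For reference, the convention popularized by the SQuAD evaluation script is a token-overlap F1 between the predicted span and the gold answer. A minimal sketch, assuming whitespace tokenization and no answer normalization:

```python
from collections import Counter

def span_f1(prediction, ground_truth):
    # Token-overlap F1: precision and recall over the bag-of-tokens overlap.
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(span_f1("the red fox", "a red fox"))  # 2 of 3 tokens overlap -> ~0.67
```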
6
votes
2 answers
What evaluation metrics are used for sequence-to-sequence prediction problems?
I am solving many sequence-to-sequence prediction problems using RNN/LSTM.
What type of evaluation metrics can be used for sequence prediction problems?
One metric is the mean squared error (MSE), which we can give as a parameter during the training…

Asif Khan
- 181
- 1
- 6
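For real-valued sequence outputs, MSE is simply averaged elementwise over the predicted and target sequences. A toy illustration with fabricated values:

```python
import numpy as np

y_pred = np.array([0.9, 1.8, 3.2])  # fabricated model outputs
y_true = np.array([1.0, 2.0, 3.0])  # fabricated targets
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # (0.01 + 0.04 + 0.04) / 3 = 0.03
```

For discrete output sequences (e.g. text), overlap-based metrics such as BLEU or ROUGE are more common than MSE.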
4
votes
2 answers
Which metric should I use to assess the quality of the clusters?
I have a model that outputs a latent N-dimensional embedding for all data points, trained so that data points from the same class cluster together while staying separated from the clusters of other classes.
The…

jaeger6
- 308
- 1
- 7
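One widely used internal measure for this setting is the silhouette coefficient, which rewards tight, well-separated clusters. A minimal sketch, with toy 2-D points standing in for the learned embeddings:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated toy clusters; in the question's setting,
# X would be the N-dimensional embeddings and labels the class labels.
X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
print(silhouette_score(X, labels))  # close to 1.0 here
```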
4
votes
0 answers
When computing the ROC-AUC score for multi-class classification problems, when should we use One-vs-Rest and One-vs-One?
The sklearn documentation for the method roc_auc_score states that the parameter multi_class can take the value 'OvR' (which stands for One-vs-Rest) or 'OvO' (which stands for One-vs-One). These values are only applicable for multi-class…

Leockl
- 151
- 1
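For reference, sklearn's actual API takes the lowercase strings 'ovr' and 'ovo', and for multi-class input expects per-class probabilities whose rows sum to 1. A minimal sketch with fabricated scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 2, 2, 1, 0])
y_score = np.array([  # fabricated per-class probabilities, rows sum to 1
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
])
print(roc_auc_score(y_true, y_score, multi_class="ovr"))  # One-vs-Rest
print(roc_auc_score(y_true, y_score, multi_class="ovo"))  # One-vs-One
```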
4
votes
2 answers
How do you measure multi-label classification accuracy?
Multi-label assignment is the machine learning task of assigning to each input value a set of categories from a fixed vocabulary, where the categories need not be statistically independent, precluding building a set of independent classifiers each…

Nick
- 251
- 1
- 5
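Two standard conventions are subset (exact-match) accuracy, which only credits a sample when all of its labels match, and Hamming loss, which scores each label slot independently. A minimal sketch with fabricated label-indicator matrices:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Rows are samples, columns are labels (1 = label assigned).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0]])

print(accuracy_score(y_true, y_pred))  # subset accuracy: 1/3 rows match exactly
print(hamming_loss(y_true, y_pred))    # 2 wrong label slots out of 9
```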
3
votes
1 answer
How should we interpret all the different metrics in reinforcement learning?
I'm trying to train some deep RL agents using policy gradient methods like AC and PPO. While training, I monitor a ton of different metrics.
I understand that the ultimate goal is to maximize the reward or return per episode.
But there…

bluekaterpillar
- 51
- 2
3
votes
1 answer
What is meant by the expected BLEU cost when training with BLEU and SIMILE?
Recently, I was reading a paper that introduces a new evaluation metric, SIMILE. In one section, a validation loss comparison is made between SIMILE and BLEU. The plot shows the expected BLEU cost when training with BLEU and with SIMILE.
What I'm unable to…

develop97
- 31
- 2
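In minimum-risk-style training, the "expected BLEU cost" usually denotes the model-probability-weighted average of (1 - BLEU) over a set of sampled candidate outputs; whether the SIMILE paper uses exactly this formulation should be checked against the paper itself. A hedged sketch with fabricated candidates and probabilities:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]
candidates = [  # (candidate tokens, renormalized model probability)
    (["the", "cat", "sat", "on", "the", "mat"], 0.6),
    (["a", "cat", "sat", "on", "a", "mat"], 0.4),
]
smooth = SmoothingFunction().method1
expected_cost = sum(
    p * (1.0 - sentence_bleu([reference], hyp, smoothing_function=smooth))
    for hyp, p in candidates
)
print(expected_cost)  # lower when probability mass sits on high-BLEU candidates
```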
3
votes
1 answer
Why is there more than one way of calculating the accuracy?
Some sources consider the true negatives (TN) when computing the accuracy, while others don't.
Source 1:
https://medium.com/greyatom/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b
Source…

Stephen Philip
- 317
- 2
- 9
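For reference, the textbook definition of accuracy does include the true negatives; a formula without TN is typically recall (TP / (TP + FN)) or precision (TP / (TP + FP)) being presented under a different name. A worked comparison with fabricated counts:

```python
# Fabricated confusion-matrix counts.
TP, TN, FP, FN = 40, 50, 5, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)  # standard accuracy
recall = TP / (TP + FN)                     # the TN-free quantity: recall
print(accuracy, recall)  # 0.9 vs ~0.889: two different metrics
```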
3
votes
1 answer
Using True Positive as a Cost Function
I wanted to use the True Positive (and True Negative) counts in my cost function, to modify the ROC shape of my classifier. I was told, and have also read, that these counts are not differentiable and therefore not usable as a cost function for a neural network.
In the…

Léonard Barras
- 31
- 2
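The hard TP/TN counts are indeed non-differentiable, but a common workaround is to replace them with "soft" expected counts computed from the predicted probabilities, which do admit gradients. A hedged PyTorch sketch (the soft-F1 loss assembled here is one illustrative choice, not the only one):

```python
import torch

def soft_count_loss(probs, targets):
    # probs: sigmoid outputs in (0, 1); targets: 0/1 labels.
    soft_tp = (probs * targets).sum()        # expected true positives
    soft_fn = ((1 - probs) * targets).sum()  # expected false negatives
    soft_fp = (probs * (1 - targets)).sum()  # expected false positives
    soft_f1 = 2 * soft_tp / (2 * soft_tp + soft_fp + soft_fn + 1e-8)
    return 1.0 - soft_f1  # minimize (1 - soft F1)

logits = torch.randn(8, requires_grad=True)
targets = torch.tensor([1., 0., 1., 1., 0., 0., 1., 0.])
loss = soft_count_loss(torch.sigmoid(logits), targets)
loss.backward()  # gradients flow through the soft counts
```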
2
votes
2 answers
How can we compare, in terms of similarity, two pieces of text?
How can we compare, in terms of similarity (and/or meaning), two pieces of text (or documents)?
For example, let's say that I want to determine whether a document is a plagiarized version of another document. Which approach should I use? Could I use…

cuong tran
- 33
- 1
- 5
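A common lexical baseline is TF-IDF vectors compared with cosine similarity; it only captures word overlap, which is often sufficient for near-duplicate (plagiarism-style) detection, while meaning-level similarity generally calls for embeddings. A minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "A quick brown fox jumped over a lazy dog.",
]
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))  # near 1 for near-duplicates
```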
2
votes
0 answers
Which evaluation metrics should be used in training, validation and testing of a model?
Which specific performance evaluation metrics are used in training, validation, and testing, and why? I am thinking that error metrics (RMSE, MAE, MSE) are used in validation, while testing should use a wider variety of metrics. I don't think performance is…

user9645302
- 53
- 3
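For what it's worth, the error metrics named in the question are one-liners to compute on a held-out split; a minimal sketch with fabricated values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5])  # fabricated validation targets
y_pred = np.array([2.5, 5.0, 3.0])  # fabricated predictions

mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse), mean_absolute_error(y_true, y_pred))  # MSE, RMSE, MAE
```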
2
votes
1 answer
Why does the pass@k metric not "behave like" probability?
pass@k is a metric for evaluating models that generate code; it was used, for example, to evaluate Codex. To compute pass@k, you have a dataset of natural language/code pairs, and you pass each NL prompt to the model. For each prompt, it generates k code…

Jack M
- 242
- 1
- 8
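For reference, the Codex paper estimates pass@k with an unbiased estimator computed from n ≥ k generated samples, of which c pass the unit tests, rather than from a single size-k draw. The paper's numerically stable formulation:

```python
import numpy as np

def pass_at_k(n, c, k):
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k), computed as a
    # stable running product. n: samples generated, c: samples that
    # pass the tests, k: budget (k <= n).
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(200, 10, 10))  # 200 samples, 10 correct
```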
2
votes
1 answer
How to calculate a meaningful distance between multidimensional tensors
TLDR: given two tensors $t_1$ and $t_2$, both with shape $(c, h, w)$, how should the distance between them be measured?
More Info: I'm working on a project in which I'm trying to distinguish between an anomalous sample (specifically from MNIST) and a…

Hadar Sharvit
- 371
- 1
- 12
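Two common starting points are flattening the tensors and taking an Lp (e.g. Euclidean) distance, or using cosine distance when scale should not matter; whether either is "meaningful" for anomaly detection depends on the representation the tensors live in. A minimal PyTorch sketch:

```python
import torch

t1 = torch.randn(3, 28, 28)  # fabricated (c, h, w) tensors
t2 = torch.randn(3, 28, 28)

l2 = torch.norm(t1 - t2)  # Euclidean distance over all elements
cos = torch.nn.functional.cosine_similarity(t1.flatten(), t2.flatten(), dim=0)
print(l2.item(), (1 - cos).item())  # cosine *distance* is 1 - similarity
```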
2
votes
2 answers
Is it possible that every class has a higher recall than precision for multi-class classification?
I am a student who recently started learning machine learning, and one thing keeps confusing me; I have tried multiple sources but failed to find an answer.
As the following table shows (this is from some paper):
Is it possible that every class has a higher…

Cheleeger Ken
- 73
- 5
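A quick way to explore this is to compute per-class precision and recall directly from a confusion matrix. One counting fact for single-label multi-class problems: every misclassified sample is simultaneously a false negative for its true class and a false positive for the predicted class, so the FP and FN totals across classes are equal; since recall exceeds precision for a class (with at least one true positive) exactly when its FP count exceeds its FN count, the excess cannot go the same way for every class. A fabricated example:

```python
import numpy as np

# Fabricated 3-class confusion matrix: rows = true class, cols = predicted.
cm = np.array([[8, 1, 1],
               [2, 7, 1],
               [1, 2, 7]])
tp = np.diag(cm)
recall = tp / cm.sum(axis=1)     # TP / (TP + FN) per class
precision = tp / cm.sum(axis=0)  # TP / (TP + FP) per class
print(recall)     # [0.8   0.7  0.7  ]
print(precision)  # [0.727 0.7  0.778]
```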