For questions about BLEU (BiLingual Evaluation Understudy), a metric for evaluating the quality of text that has been machine-translated from one natural language to another. The metric was proposed in the paper "BLEU: a Method for Automatic Evaluation of Machine Translation" (2002) by Kishore Papineni et al.
Questions tagged [bleu]
5 questions
8 votes · 2 answers
Why is perplexity a good evaluation metric for chatbots?
A few papers I have come across say that BLEU is not an appropriate evaluation metric for chatbots, so they use perplexity instead.
First of all, what is perplexity? How is it calculated? And why is perplexity a good evaluation metric for chatbots?

RuiZhang1993
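Perplexity, as asked about above, is the exponential of the average negative log-probability a model assigns to each token it is evaluated on; lower is better (a perplexity of k means the model is, on average, as uncertain as a uniform choice over k tokens). A minimal sketch, assuming we already have the per-token probabilities the model produced:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    the model assigned to each token in the sequence."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 words, so its perplexity is ~4:
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

In practice the probabilities come from the chatbot's language model evaluated on held-out dialogue, but the formula is the same.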
5 votes · 2 answers
What is the difference between a language model and a word embedding?
I am self-studying applications of deep learning to NLP and machine translation.
I am confused about the concepts of "language model", "word embedding", and "BLEU score".
It appears to me that a language model is a way to predict the next word given…

Exploring
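The distinction the question above is after can be shown with toy data (the tables below are illustrative stand-ins, not a real library): a word embedding is a fixed mapping from a word to a vector, while a language model maps a context to a probability distribution over the next word.

```python
# Word embedding: word -> vector (learned offline in practice).
embedding = {
    "cat": [0.2, 0.8],
    "dog": [0.3, 0.7],
}

# Toy bigram language model: previous word -> P(next word).
bigram_lm = {
    "the": {"cat": 0.6, "dog": 0.4},
}

def next_word(prev):
    """Return the most probable next word under the toy bigram model."""
    dist = bigram_lm[prev]
    return max(dist, key=dist.get)

print(embedding["cat"])   # a vector representation of one word
print(next_word("the"))   # a prediction conditioned on context
```

In modern systems the two are related but distinct: a neural language model usually *contains* an embedding layer as its first step, and BLEU is neither of these, just a metric for scoring the model's output against references.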
2 votes · 2 answers
What are the differences between BLEU and METEOR?
I am trying to understand machine translation evaluation scores.
I understand what the BLEU score is trying to achieve. It looks at different n-gram orders, like BLEU-1, BLEU-2, BLEU-3, and BLEU-4, and tries to match them against the human…

Exploring
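The n-gram matching mentioned in the question above rests on BLEU's *modified (clipped) precision*: each candidate n-gram is counted at most as many times as it appears in the reference, which stops a candidate from scoring well by repeating one matching word. A minimal sketch for a single reference:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(1, sum(cand_counts.values()))

candidate = "the the the the".split()
reference = "the cat is on the mat".split()
# Plain unigram precision would be 4/4 = 1.0; clipping caps "the"
# at its 2 reference occurrences, giving 2/4 = 0.5:
print(modified_precision(candidate, reference, 1))
```

Full BLEU combines these precisions for n = 1..4 via a geometric mean and multiplies by the brevity penalty; METEOR differs mainly in using unigram alignments with stemming/synonym matching and an explicit recall term.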
2 votes · 1 answer
What happens when the output length in the brevity penalty is zero?
The brevity penalty is defined as
$$bp = e^{(1- r/c)},$$
where $r$ is the reference length and $c$ is the output length.
But what happens if the output length is zero? Is there any standard way of coping with that issue?

ScientiaEtVeritas
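The question above can be made concrete in code. The full definition in the BLEU paper sets $bp = 1$ when $c > r$ and $e^{1 - r/c}$ otherwise; at $c = 0$ the formula divides by zero, but since $e^{1 - r/c} \to 0$ as $c \to 0^+$, returning 0 for an empty output is the natural convention (and, to my knowledge, what common implementations such as sacreBLEU do). A minimal sketch with that guard:

```python
import math

def brevity_penalty(r, c):
    """Brevity penalty from the BLEU paper, with an explicit guard:
    1.0 when the candidate is longer than the reference,
    exp(1 - r/c) otherwise, and 0.0 for an empty candidate
    (the limit of exp(1 - r/c) as c -> 0+, avoiding division by zero)."""
    if c == 0:
        return 0.0
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

print(brevity_penalty(10, 10))  # equal lengths: penalty of 1.0
print(brevity_penalty(10, 5))   # short output: penalized by exp(-1)
print(brevity_penalty(10, 0))   # empty output: 0.0, no ZeroDivisionError
```

Since BLEU multiplies the n-gram precision score by this penalty, an empty output then receives a BLEU score of 0, which matches the intuition that producing nothing is a worthless translation.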
1 vote · 1 answer
Does it make sense to use BLEU or ROUGE for any machine translation task?
Many machine translation metrics, such as BLEU or ROUGE, are used to evaluate sequence-to-sequence models where, usually, the sequences are pieces of natural language.
Is it possible to use these metrics when the dataset is not composed of natural…

Blencer
- 73
- 5