BLEU Performance

BLEU has frequently been reported as correlating well with human judgement,[1][2][3] and remains a benchmark for the assessment of any new evaluation metric. There are, however, a number of criticisms that have been voiced. It has been noted that although …

BLEU

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: “the …

METEOR

The METEOR metric is designed to address some of the deficiencies inherent in the BLEU metric. The metric is based on the weighted harmonic mean of unigram precision and unigram recall. The metric was designed after research by Lavie (2004) …
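
As a rough illustration of that weighted harmonic mean, the Python sketch below (illustrative, not the official METEOR implementation) computes unigram precision and recall over plain whitespace tokens and combines them with recall weighted 9:1 over precision, as in the original metric; the stemming and synonymy matching stages and the fragmentation penalty are omitted.

```python
from collections import Counter

def unigram_precision_recall(candidate, reference):
    """Clipped unigram matches over whitespace tokens (no stemming or synonymy)."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    matches = sum(min(count, ref_counts[w]) for w, count in cand_counts.items())
    precision = matches / len(cand) if cand else 0.0
    recall = matches / len(ref) if ref else 0.0
    return precision, recall

def meteor_fmean(precision, recall):
    """Harmonic mean weighted 9:1 toward recall, as in the original METEOR paper."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return 10 * precision * recall / (recall + 9 * precision)

p, r = unigram_precision_recall("the cat sat on the mat", "the cat is on the mat")
print(round(meteor_fmean(p, r), 3))  # 0.833
```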

Word error rate

The word error rate (WER) is a metric based on the Levenshtein distance; whereas the Levenshtein distance works at the character level, WER works at the word level. It was originally used for measuring the performance of speech recognition systems, …
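
For concreteness, a minimal sketch of the word-level computation, assuming simple whitespace tokenisation: the Levenshtein distance is computed over word tokens and divided by the number of words in the reference.

```python
def word_error_rate(hypothesis, reference):
    """Levenshtein distance over word tokens, normalised by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] is the edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                          # all reference words deleted
    for j in range(len(hyp) + 1):
        dp[0][j] = j                          # all hypothesis words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)       # match/substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(round(word_error_rate("the cat sat on mat", "the cat sat on the mat"), 3))
# 0.167 — one deletion out of six reference words
```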

NIST

The NIST metric is based on the BLEU metric, but with some alterations. Where BLEU simply calculates n-gram precision, adding equal weight to each one, NIST also calculates how informative a particular n-gram is. That is to say, when a …
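
A hedged sketch of that information weight, following the definition in Doddington (2002): an n-gram's weight is the log ratio of the count of its (n−1)-word prefix to the count of the full n-gram over the reference data, so rarer n-grams count for more. The full NIST score, which sums these weights over matched n-grams and applies its own brevity penalty, is not reproduced here.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def nist_info_weight(ngram, ref_tokens):
    """Information weight of an n-gram over the reference data:
    log2( count(first n-1 words) / count(full n-gram) )."""
    n = len(ngram)
    full = ngram_counts(ref_tokens, n)[ngram]
    if full == 0:
        return 0.0
    if n == 1:
        prefix = len(ref_tokens)      # empty prefix counted as the corpus size
    else:
        prefix = ngram_counts(ref_tokens, n - 1)[ngram[:-1]]
    return math.log2(prefix / full)

refs = "the cat sat on the mat".split()
print(nist_info_weight(("the", "cat"), refs))  # log2(2 / 1) = 1.0
```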

BLEU

BLEU was one of the first metrics to report high correlation with human judgements of quality. The metric is currently one of the most popular in the field. The central idea behind the metric is that “the closer a machine …
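
To make that idea concrete, the toy sketch below scores a single candidate against a single reference: clipped (modified) n-gram precisions up to 4-grams are combined as a geometric mean and scaled by a brevity penalty. Real implementations handle multiple references and smoothing; this is only an illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy single-reference BLEU: clipped n-gram precisions plus brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_ngrams.values())
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if total == 0 or clipped == 0:
            return 0.0                    # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: 1 if the candidate is longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the rug", "the cat sat on the mat"), 3))  # 0.76
```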

Automatic evaluation

In the context of this article, a metric will be understood as a measurement. A metric for the evaluation of machine translation output is a measurement of the quality of the output. The quality of a translation is inherently subjective, …

Advanced Research Projects Agency (ARPA)

As part of the Human Language Technologies Program, the Advanced Research Projects Agency (ARPA) created a methodology to evaluate machine translation systems, and continues to perform evaluations based on this methodology. The evaluation programme was instigated in 1991, and continues …

Automatic Language Processing Advisory Committee (ALPAC)

One of the constituent parts of the ALPAC report was a study comparing different levels of human translation with machine translation output, using human subjects as judges. The human judges were specially trained for the purpose. The evaluation study compared …

Round-trip translation

Although this may intuitively be a good method of evaluation, it has been shown that round-trip translation is a “poor predictor of quality”. The reason why it is such a poor predictor of quality is reasonably intuitive. When a round-trip …

Evaluation of machine translation

Various methods for the evaluation of machine translation have been employed. This article will focus on the evaluation of the output of machine translation, rather than on performance or usability evaluation. Before covering the large scale studies, a brief comment …