BLEU Performance

BLEU has frequently been reported as correlating well with human judgement,[1][2][3] and remains a benchmark for the assessment of any new evaluation metric. A number of criticisms have nevertheless been voiced. Although BLEU is in principle capable of evaluating translations in any language, it has been noted that in its present form it cannot deal with languages that lack word boundaries.[4]
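The word-boundary problem can be illustrated with a minimal sketch of the clipped unigram precision that BLEU builds on. The sentence pair is hypothetical, the helper names `ngrams` and `clipped_precision` are made up for this illustration, and splitting into characters is only one possible workaround, assumed here for the sake of comparison.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision for one candidate against one reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

# Hypothetical Chinese sentence pair that differs in a single character.
candidate = "我喜欢猫"
reference = "我喜欢狗"

# Whitespace splitting yields one "word" per sentence, so nothing matches.
word_level = clipped_precision(candidate.split(), reference.split(), 1)

# Assumed workaround: treat every character as a token.
char_level = clipped_precision(list(candidate), list(reference), 1)

print(word_level, char_level)
```

With whitespace tokenization the whole sentence is a single token, so the precision is zero even though the two sentences share three of four characters; the character-level variant gives partial credit for the overlap.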

Callison-Burch, Osborne and Koehn argue that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score indicates improved translation quality.[5] They highlight two instances where BLEU seriously underperformed: the 2005 NIST evaluations,[6] in which a number of different machine translation systems were tested, and their own study comparing the SYSTRAN engine with two engines using statistical machine translation (SMT) techniques.[7]

In the 2005 NIST MT evaluation, the scores generated by BLEU reportedly failed to correspond to the scores produced in the human evaluations: the system ranked highest by the human judges was ranked only 6th by BLEU. In their own study, Callison-Burch et al. compared SMT systems with SYSTRAN, a knowledge-based system. The BLEU scores for SYSTRAN were substantially worse than the scores given to SYSTRAN by the human judges. The authors note that the SMT systems were trained using BLEU minimum error rate training,[8] and point out that this could be one of the reasons for the difference. They conclude by recommending that BLEU be used in a more restricted manner: for comparing the results from two similar systems, and for tracking “broad, incremental changes to a single system”.[9]
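One common way to quantify how far a metric's ranking departs from a human ranking is Spearman's rank correlation. The sketch below uses entirely hypothetical rankings for six systems, chosen only to mimic the situation in which the human judges' top system is placed 6th by the metric; these are not the NIST results, and the function name `spearman_rho` is an assumption of this example.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the
    difference between the two ranks assigned to the same system.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rankings must cover the same systems")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("need at least two systems")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data: position i holds the rank of system i.
human_ranks = [1, 2, 3, 4, 5, 6]
# The humans' best system drops to 6th; every other system moves up one place.
metric_ranks = [6, 1, 2, 3, 4, 5]

rho = spearman_rho(human_ranks, metric_ranks)
print(round(rho, 3))
```

In this made-up case a single badly misplaced system is enough to pull the correlation far below 1, even though the other five systems keep their relative order.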

Notes

  1. Papineni, K., et al. (2002)
  2. Coughlin, D. (2003)
  3. Doddington, G. (2002)
  4. Denoual, E. and Lepage, Y. (2005)
  5. Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
  6. Lee, A. and Przybocki, M. (2005)
  7. Callison-Burch, C., Osborne, M. and Koehn, P. (2006)
  8. Lin, C. and Och, F. (2004)
  9. Callison-Burch, C., Osborne, M. and Koehn, P. (2006)

This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia.
