Automatic Filtering of Bilingual Corpora for Statistical Machine Translation

  • Shahram Khadivi
  • Hermann Ney
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)


In many applications, such as machine translation and bilingual information retrieval, bilingual corpora play an important role in training the system. Because they are obtained through automatic or semi-automatic methods, they usually contain noise: sentence pairs that are worthless or even harmful for training. We study the effect of different levels of corpus noise on an end-to-end statistical machine translation system. We also propose an efficient method for corpus filtering, which removes the noisy part of a corpus using state-of-the-art word alignment models. We show the efficiency of this method on the basis of the sentence misalignment rate of the filtered corpus and its positive effect on translation quality.
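
To make the idea of alignment-based filtering concrete, the following is a minimal sketch and not the authors' actual method, which the abstract does not specify. It assumes an IBM Model 1 style lexical table is already available, scores each sentence pair by its length-normalized log-probability, and keeps pairs above a threshold. The function names, the toy probability table, the `<NULL>` token, the smoothing floor, and the threshold value are all hypothetical choices made for illustration.

```python
import math

def model1_score(src, tgt, t_table, floor=1e-6):
    """Length-normalized IBM Model 1 style log-probability of tgt given src.

    t_table maps (target_word, source_word) to a lexical probability.
    Unseen word pairs receive the small `floor` probability (an assumption).
    """
    if not src or not tgt:
        return float("-inf")  # treat empty sides as noise
    src_with_null = ["<NULL>"] + src  # empty word, as in Model 1
    log_prob = 0.0
    for f in tgt:
        # Average the lexical probability over all source positions.
        total = sum(t_table.get((f, e), floor) for e in src_with_null)
        log_prob += math.log(total / len(src_with_null))
    return log_prob / len(tgt)

def filter_corpus(pairs, t_table, threshold):
    """Keep only the sentence pairs whose score reaches the threshold."""
    return [(s, t) for s, t in pairs if model1_score(s, t, t_table) >= threshold]

# Toy lexical table and corpus (hypothetical values, not from the paper).
t_table = {("das", "the"): 0.8, ("haus", "house"): 0.9}
pairs = [
    (["the", "house"], ["das", "haus"]),      # plausible translation
    (["the", "house"], ["guten", "morgen"]),  # misaligned pair
]
kept = filter_corpus(pairs, t_table, threshold=-5.0)
print(kept)
```

In practice the lexical table would be estimated from the corpus itself with a word alignment toolkit, and the threshold would be tuned on held-out data; both are left out of this sketch.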


Machine Translation; Target Sentence; Statistical Machine Translation; Sentence Pair; Translation Quality
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Shahram Khadivi (1)
  • Hermann Ney (1)
  1. Lehrstuhl für Informatik VI – Computer Science Department, RWTH Aachen University, Aachen, Germany
