A Language-Independent Method for Detection and Correction of Alignment Errors in Parallel Corpora

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9103)

Abstract

We present a method for detecting alignment errors in parallel corpora. The method is designed to be language-independent and was tested on language pairs drawn from English, Polish and Spanish. It uses automatically obtained dictionaries to perform the detection. The origin of the errors is discussed. An approach to correcting one class of errors is also described and tested. The proposed method proved effective in improving the quality of parallel corpora. The conclusions of this study may be useful when dealing with errors in existing parallel data sources, as well as when aligning new parallel corpora.

Author information

Corresponding author

Correspondence to Katarzyna Niżałowska.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Niżałowska, K., Markowska-Kaczmar, U. (2015). A Language-Independent Method for Detection and Correction of Alignment Errors in Parallel Corpora. In: Biemann, C., Handschuh, S., Freitas, A., Meziane, F., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2015. Lecture Notes in Computer Science, vol 9103. Springer, Cham. https://doi.org/10.1007/978-3-319-19581-0_30

  • DOI: https://doi.org/10.1007/978-3-319-19581-0_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19580-3

  • Online ISBN: 978-3-319-19581-0

  • eBook Packages: Computer Science, Computer Science (R0)
