Abstract
We present a method for detecting alignment errors in parallel corpora. The method is designed to be language-independent and was tested on language pairs drawn from English, Polish, and Spanish. It relies on automatically obtained dictionaries to perform the detection. We discuss the origin of the errors, and we describe and test an approach to correcting one class of errors. The proposed method proved effective in improving the quality of parallel corpora. The conclusions of this study may be useful both when dealing with errors in existing parallel data sources and when aligning new parallel corpora.
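The abstract does not spell out the detection procedure, so the following is only a minimal sketch of one way a bilingual dictionary could be used to flag suspicious sentence pairs, not the authors' method. The function names `coverage` and `flag_suspects`, the whitespace tokenisation, the toy dictionary, and the threshold of 0.3 are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def coverage(src_tokens: List[str], tgt_tokens: List[str],
             dictionary: Dict[str, Set[str]]) -> Optional[float]:
    """Fraction of dictionary-known source tokens that have at least one
    listed translation in the target sentence. Returns None when no source
    token is in the dictionary, so the pair cannot be judged."""
    tgt = set(tgt_tokens)
    known = [t for t in src_tokens if t in dictionary]
    if not known:
        return None
    hits = sum(1 for t in known if dictionary[t] & tgt)
    return hits / len(known)


def flag_suspects(pairs: List[Tuple[str, str]],
                  dictionary: Dict[str, Set[str]],
                  threshold: float = 0.3) -> List[int]:
    """Indices of sentence pairs whose coverage falls below the
    (hypothetical) threshold."""
    suspects = []
    for i, (src, tgt) in enumerate(pairs):
        # Naive tokenisation: lowercase and split on whitespace (assumption).
        score = coverage(src.lower().split(), tgt.lower().split(), dictionary)
        if score is not None and score < threshold:
            suspects.append(i)
    return suspects


# Toy English-to-Polish dictionary, hand-written for this example only.
toy_dictionary = {"cat": {"kot"}, "dog": {"pies"}, "sleeps": {"śpi"}}
toy_pairs = [
    ("The cat sleeps", "Kot śpi"),          # plausible alignment
    ("The dog sleeps", "Dziękuję bardzo"),  # likely misalignment
]
suspect_indices = flag_suspects(toy_pairs, toy_dictionary)
```

A real system would need morphological normalisation for inflected languages such as Polish and a threshold tuned on held-out data; both are left out of this sketch.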
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Niżałowska, K., Markowska-Kaczmar, U. (2015). A Language-Independent Method for Detection and Correction of Alignment Errors in Parallel Corpora. In: Biemann, C., Handschuh, S., Freitas, A., Meziane, F., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2015. Lecture Notes in Computer Science, vol 9103. Springer, Cham. https://doi.org/10.1007/978-3-319-19581-0_30
DOI: https://doi.org/10.1007/978-3-319-19581-0_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19580-3
Online ISBN: 978-3-319-19581-0
eBook Packages: Computer Science (R0)