Unsupervised Construction of Quasi-comparable Corpora and Probing for Parallel Textual Data

Wołk, Krzysztof; Marasek, Krzysztof

doi:10.1007/978-3-319-43982-2_27

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 506)

Abstract

The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely available resource, but they are limited and do not provide enough coverage for good-quality translation because of out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which unfortunately depend on the quantity and quality of their training data. Such systems have very limited availability, especially for some languages and very narrow text domains. In this research we present our improvements to current quasi-comparable corpora mining methodologies: we re-implement the comparison algorithms, introduce a tuning script, and improve performance using GPU acceleration. The experiments are conducted on the lecture text domain, and bilingual data is extracted from a crawl of the WWW. The modifications had a positive impact on the quality and quantity of the mined data and on translation quality, as measured with the BLEU, NIST and TER metrics. By defining proper translation parameters for morphologically rich languages, we improve translation quality and draw conclusions.

Notes

1. https://www.ted.com/
2. http://www.fbk.eu/

Acknowledgments

This research was supported by the statutory resources of the Polish-Japanese Academy of Information Technology (ST/MUL/2016), resources for young researchers at PJATK, and the CLARIN ERIC research program.

Author information

Authors and Affiliations

Polish-Japanese Academy of Information Technology, ul. Koszykowa 86, 02-008, Warsaw, Poland
Krzysztof Wołk & Krzysztof Marasek

Corresponding author

Correspondence to Krzysztof Wołk.

Editor information

Editors and Affiliations

Department of Information Systems, Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
Aleksander Zgrzywa
Department of Information Systems, Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
Kazimierz Choroś
Department of Information Systems, Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
Andrzej Siemiński

About this paper

Cite this paper

Wołk, K., Marasek, K. (2017). Unsupervised Construction of Quasi-comparable Corpora and Probing for Parallel Textual Data. In: Zgrzywa, A., Choroś, K., Siemiński, A. (eds) Multimedia and Network Information Systems. Advances in Intelligent Systems and Computing, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-319-43982-2_27

DOI: https://doi.org/10.1007/978-3-319-43982-2_27
Published: 17 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43981-5
Online ISBN: 978-3-319-43982-2
eBook Packages: Engineering, Engineering (R0)
