Unsupervised Construction of Quasi-comparable Corpora and Probing for Parallel Textual Data

  • Conference paper
Multimedia and Network Information Systems

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 506)

Abstract

The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely available resource, but they are limited and do not provide enough coverage for good-quality translation because of out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which unfortunately depend on the quantity and quality of their training data. Such data are in very limited supply, especially for some languages and for very narrow text domains. In this research we present our improvements to current quasi-comparable corpora mining methodologies: we re-implement the comparison algorithms, introduce a tuning script, and improve performance using GPU acceleration. The experiments are conducted on the lecture text domain, and bi-data are extracted from a crawl of the WWW. The modifications had a positive impact on the quality and quantity of the mined data and on translation quality, which we evaluated using the BLEU, NIST and TER metrics. By defining proper translation parameters for morphologically rich languages, we improve translation quality and draw conclusions.

Notes

  1. https://www.ted.com/

  2. http://www.fbk.eu/

Acknowledgments

This research was supported by Polish-Japanese Academy of Information Technology statutory resources (ST/MUL/2016), resources for young researchers at PJATK and CLARIN ERIC research program.

Author information

Correspondence to Krzysztof Wołk.

Copyright information

© 2017 Springer International Publishing Switzerland

About this paper

Cite this paper

Wołk, K., Marasek, K. (2017). Unsupervised Construction of Quasi-comparable Corpora and Probing for Parallel Textual Data. In: Zgrzywa, A., Choroś, K., Siemiński, A. (eds) Multimedia and Network Information Systems. Advances in Intelligent Systems and Computing, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-319-43982-2_27

  • DOI: https://doi.org/10.1007/978-3-319-43982-2_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43981-5

  • Online ISBN: 978-3-319-43982-2

  • eBook Packages: Engineering, Engineering (R0)
