Harvest Uyghur-Chinese Aligned-Sentences Bitexts from Multilingual Sites Based on Word Embedding

  • ShaoLin Zhu
  • Xiao Li
  • YaTing Yang (email author)
  • Lei Wang
  • ChengGang Mi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10565)

Abstract

Obtaining bilingual parallel data from multilingual websites is a long-standing research problem, and solving it is especially beneficial for resource-scarce languages. In this paper, we present an approach for obtaining parallel data based on word embedding; our model relies only on a small-scale bilingual lexicon. Our approach benefits from recent advances in continuous word representations, which can reveal more context information than traditional methods. Our experiments show that high-precision and sizable parallel Uyghur-Chinese data can be obtained even when the available bilingual lexicon is limited.

Keywords

Bilingual parallel data; Word embedding; Resource-scarce languages

Notes

Acknowledgments

This work is supported by the Xinjiang Fund (Grant No. 2015KL031), the West Light Foundation of the Chinese Academy of Sciences (No. 2015-XBQN-B-10), the Xinjiang Science and Technology Major Project (No. 2016A03007-3), and the Natural Science Foundation of Xinjiang (No. 2015211B034).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • ShaoLin Zhu (1, 2, 3)
  • Xiao Li (1, 2)
  • YaTing Yang (1, 2; email author)
  • Lei Wang (1, 2)
  • ChengGang Mi (1, 2)

  1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi, China
  2. Key Laboratory of Speech Language Information Processing of Xinjiang, Urumqi, China
  3. University of Chinese Academy of Sciences, Beijing, China
