Mining Parallel Documents Using Low Bandwidth and High Precision CLIR from the Heterogeneous Web

Building and Using Comparable Corpora

Abstract

We propose a content-based method of mining bilingual parallel documents from websites that are not necessarily structurally related to each other. There are two existing approaches for automatically mining parallel documents from the web. Structure-based methods work only for parallel websites, and most content-based methods either require large-scale computational facilities and network bandwidth or are not applicable to the heterogeneous web. We propose a novel content-based method that uses cross-lingual information retrieval (CLIR) with query feedback and verification, supplemented with structural information, to mine parallel resources from the entire web using search engine APIs. The method goes beyond structural information to find parallel documents on non-parallel websites. We obtained very high mining precision, and the extracted parallel sentences improved SMT performance: the resulting BLEU score is higher and is comparable to that obtained with high-quality manually translated parallel sentences, illustrating the excellent quality of the mined parallel material.
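The abstract gives no implementation details, so the following is only a toy sketch of what CLIR-style mining with query feedback and verification might look like, not the authors' method. The dictionary entries, the in-memory `search` stand-in for a search engine API, the term-dropping feedback rule, and the overlap threshold are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy bilingual dictionary (English -> French), for illustration only.
TOY_DICT = {"bank": "banque", "interest": "intérêt", "rate": "taux"}


def extract_keywords(text, k=3):
    """Return up to k most frequent tokens that have a dictionary translation."""
    tokens = [t for t in text.lower().split() if t in TOY_DICT]
    return [word for word, _ in Counter(tokens).most_common(k)]


def search(query_terms, collection):
    """Stand-in for a search engine API: ids of documents containing every term."""
    return [
        doc_id
        for doc_id, text in collection.items()
        if all(term in text.lower().split() for term in query_terms)
    ]


def mine_candidates(source_text, collection, k=3):
    """Translate keywords into a query; if nothing is retrieved, drop the
    least frequent term and retry (a simple form of query feedback)."""
    query = [TOY_DICT[w] for w in extract_keywords(source_text, k)]
    while query:
        hits = search(query, collection)
        if hits:
            return query, hits
        query = query[:-1]
    return [], []


def verify(source_text, candidate_text, threshold=0.5):
    """Accept a candidate if enough translated source terms occur in it."""
    translated = {TOY_DICT[t] for t in source_text.lower().split() if t in TOY_DICT}
    if not translated:
        return False
    found = translated & set(candidate_text.lower().split())
    return len(found) / len(translated) >= threshold


source = "the bank raised the rate and the bank said the rate of interest is set by the bank"
collection = {
    "fr1": "la banque a relevé le taux",
    "fr2": "le match de football",
}
query, hits = mine_candidates(source, collection)
accepted = [doc_id for doc_id in hits if verify(source, collection[doc_id])]
```

In this sketch the full three-term query retrieves nothing, so the least frequent term is dropped and the shorter query finds `fr1`, which then passes the overlap check. Only short queries and the retrieved candidates cross the network, which is one plausible reading of "low bandwidth" in the title.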

Notes

  1. "We knew the web was big..." on the Official Google Blog. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.

  2. Source: http://cn.reuters.com/article/CNTechNews/idCNCHINA-3233720101027 on May 10, 2011.

  3. http://www.elias.cn/En/ExtMainText/

  4. LDC Catalog Number: LDC2002L27.

Acknowledgments

This project is partially funded by a subcontract from BBN, under the DARPA GALE project.

Author information

Correspondence to Simon Shi.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Shi, S., Fung, P. (2013). Mining Parallel Documents Using Low Bandwidth and High Precision CLIR from the Heterogeneous Web. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_2

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science (R0)
