Abstract
Considerable attention is being paid to methods for gathering and evaluating comparable corpora, not only to improve Statistical Machine Translation (SMT) but also for other applications such as the extraction of paraphrases. Realising the potential value of such corpora requires efficient and effective methods for gathering and evaluating them. Most of these methods have been tested on retrieving document pairs for well-resourced languages; however, there is a lack of work on less popular (under-resourced) languages and domains. This chapter describes work on developing methods for automatically gathering comparable corpora from the Web, specifically for under-resourced languages. Different online sources are investigated, and an evaluation method is developed to assess the quality of the retrieved documents.
Acknowledgments
The project has received funding from the ACCURAT Project, European Community’s Seventh Framework Programme (FP7/2007-2013) under Grant Agreement no 248347.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Paramita, M.L., Guthrie, D., Kanoulas, E., Gaizauskas, R., Clough, P., Sanderson, M. (2013). Methods for Collection and Evaluation of Comparable Documents. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_5
DOI: https://doi.org/10.1007/978-3-642-20128-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)