Abstract
In this article we describe two tools we have built: one for compiling comparable corpora from the Internet, and another for extracting bilingual terminology from comparable corpora. We also report an evaluation of both tools. Bilingual terminology was extracted from automatically collected domain-comparable web corpora in Basque and English, and the resulting terminology lists were validated automatically against a specialized dictionary to assess their quality. The evaluation thus measures how useful it is to combine these two automatic tools in a real-world task, namely specialized dictionary making.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Gurrutxaga, A., Leturia, I., Saralegi, X., Vicente, I.S. (2013). Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_3
DOI: https://doi.org/10.1007/978-3-642-20128-8_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science (R0)