Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making

Gurrutxaga, Antton; Leturia, Igor; Saralegi, Xabier; San Vicente, Iñaki

doi:10.1007/978-3-642-20128-8_3

Abstract

In this article we describe two tools we have built: one compiles comparable corpora from the Internet, and the other extracts bilingual terminology from comparable corpora. We also describe an evaluation of the two tools working together. Bilingual terminology was extracted from automatically collected domain-comparable web corpora in Basque and English, and the resulting terminology lists were validated automatically against a specialized dictionary to assess their quality. The evaluation thus measures how useful it is to combine these two automatic tools in a real-world task, namely specialized dictionary making.
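The automatic validation step mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the lowercase matching, and the toy Basque-English entries are all assumptions made for illustration. It only shows the general idea of checking extracted term pairs against a reference dictionary and reporting the share that is confirmed.

```python
# Illustrative sketch only: names, matching rule, and data are assumptions,
# not taken from the chapter.

def validate_term_pairs(extracted_pairs, reference_dictionary):
    """Split (source, target) pairs into those the dictionary confirms and the rest."""
    confirmed, unconfirmed = [], []
    for source_term, target_term in extracted_pairs:
        # Look up the known translations of the source term (case-insensitive).
        known_translations = reference_dictionary.get(source_term.lower(), set())
        if target_term.lower() in known_translations:
            confirmed.append((source_term, target_term))
        else:
            unconfirmed.append((source_term, target_term))
    return confirmed, unconfirmed


def precision(confirmed, unconfirmed):
    """Share of extracted pairs confirmed by the dictionary (0.0 for an empty list)."""
    total = len(confirmed) + len(unconfirmed)
    return len(confirmed) / total if total else 0.0


# Hypothetical toy data: a tiny Basque-English reference dictionary
# and a short list of automatically extracted candidate pairs.
reference = {
    "zelula": {"cell"},
    "energia": {"energy"},
}
extracted = [
    ("zelula", "cell"),       # present in the dictionary
    ("energia", "power"),     # source term known, translation not listed
    ("proteina", "protein"),  # source term missing from the dictionary
]

confirmed, unconfirmed = validate_term_pairs(extracted, reference)
print(f"confirmed pairs: {confirmed}")
print(f"precision against the dictionary: {precision(confirmed, unconfirmed):.2f}")
```

In a sketch like this, an unconfirmed pair is not necessarily wrong: it may simply be missing from the reference dictionary, so the resulting figure is best read as a lower bound on the quality of the list.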

Author information

Authors and Affiliations

Elhuyar Foundation, Zelai Haundi kalea 3, Osinalde Industrialdea, 20170 Usurbil, Spain
Antton Gurrutxaga, Igor Leturia, Xabier Saralegi & Iñaki San Vicente

Corresponding author

Correspondence to Antton Gurrutxaga.

Editor information

Editors and Affiliations

Centre for Translation Studies, University of Leeds, Leeds, United Kingdom
Serge Sharoff
University of Mainz, Mainz, Germany
Reinhard Rapp
Université de Paris-Sud LIMSI-CNRS, Orsay, France
Pierre Zweigenbaum
Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Pascale Fung

About this chapter

Cite this chapter

Gurrutxaga, A., Leturia, I., Saralegi, X., Vicente, I.S. (2013). Automatic Comparable Web Corpora Collection and Bilingual Terminology Extraction for Specialized Dictionary Making. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_3

DOI: https://doi.org/10.1007/978-3-642-20128-8_3
Published: 14 December 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)