
Extracting Data from Comparable Corpora

  • Mārcis Pinnis
  • Nikola Ljubešić
  • Dan Ştefănescu
  • Inguna Skadiņa
  • Marko Tadić
  • Tatjana Gornostaja
  • Špela Vintar
  • Darja Fišer
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

Comparable corpora may contain single-word and multi-word phrases of different types that can be regarded as reciprocal translations, and such phrases are useful for many natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project that use comparable corpora in order to (1) identify terms, named entities (NEs), and other lexical units in comparable corpora, and (2) cross-lingually map the identified single-word and multi-word phrases to create automatically extracted bilingual dictionaries. These dictionaries can be further utilised in machine translation, question answering, indexing, and other areas where bilingual dictionaries are useful.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mārcis Pinnis (1)
  • Nikola Ljubešić (2)
  • Dan Ştefănescu (3)
  • Inguna Skadiņa (1)
  • Marko Tadić (2)
  • Tatjana Gornostaja (1)
  • Špela Vintar (4)
  • Darja Fišer (4)
  1. Tilde, Riga, Latvia
  2. Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia
  3. Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
  4. Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia
