Language Resources and Evaluation, Volume 46, Issue 2, pp 155–176

A survey of methods to ease the development of highly multilingual text mining applications

Original paper


Abstract

Multilingual text processing is useful because the information content found in different languages is complementary, regarding both facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because the development effort per language is large. Self-training tools obviously alleviate the problem, but even the effort of providing training data and of manually tuning the results is usually considerable. In this paper, we gather insights from various multilingual system developers on how to minimise the effort of developing natural language processing applications for many languages. We also explain the main guidelines underlying our own effort to develop complex text mining software for tens of languages. While these guidelines, most of all extreme simplicity, can be very restrictive and limiting, we believe we have shown the feasibility of the approach through the development of the Europe Media Monitor (EMM) family of applications. EMM is a set of complex media monitoring tools that process and analyse up to 100,000 online news articles per day in between twenty and fifty languages. We will also touch upon the kind of language resources that would make it easier for all to develop highly multilingual text mining applications. We will argue that, to achieve this, the most needed resources would be freely available, simple, parallel and uniform multilingual dictionaries, corpora and software tools.


Keywords

Text mining; Information extraction; Multilinguality; Saving effort; Rule-based; Machine learning; Cross-lingual projection; Methods; Algorithms; Sentiment analysis; Summarisation; Quotation recognition; String similarity calculation; Media monitoring



Abbreviations

CoNLL: Conference on Computational Natural Language Learning
EC: European Commission
ELDA: Evaluations and Language Resources Distribution Agency
EMM: Europe Media Monitor
EU: European Union
GATE: General Architecture for Text Engineering
JRC: Joint Research Centre
LDC: Linguistic Data Consortium
LREC: Language Resources and Evaluation Conference
ML: Machine learning
MT: Machine translation
NER: Named entity recognition
NLP: Natural language processing
SProUT: Shallow Processing with Unification and Typed feature structures
TAC: Text Analysis Conference



Acknowledgments

I would like to thank the following persons for having shared their own multilingual grammar writing experience with me, or their views on linguistic resources: Kalina Bontcheva (Sheffield University) on GATE; Frédérique Segond, Caroline Hagège and Claude Roux (Xerox Research Centre Europe) on the Xerox Incremental Parser; Aarne Ranta (Gothenburg University) on the Grammatical Framework; Jacques Vergne (Caen University) on sentence chunking using extremely light-weight methods; Eric Wehrli (Geneva University) on his deep-linguistic parser; Gregory Grefenstette (Exalead) and Gregor Thurmair (Linguatec) on their respective multilingual products; Khalid Choukri (ELRA/ELDA) and Gregory Grefenstette on linguistic resources; and my JRC colleagues Maud Ehrmann, Hristo Tanev, Vanni Zavarella and Marco Turchi for sharing their experiences and for their feedback on earlier versions of the paper. The ultimate responsibility for any errors, however, lies with me.

I would furthermore like to thank my superiors Erik van der Goot and Delilah Al Khudhairy for their support, and my colleagues in the OPTIMA group at the JRC for the fruitful and efficient collaboration over the past years, and for so reliably providing large amounts of clean multilingual news data, which allowed us to run many multilingual experiments. Building the complex EMM applications was a successful team effort that also included many less rewarding and less visible tasks. My specific thanks go to my former colleague Bruno Pouliquen (now at WIPO in Geneva). We developed most ideas together, and he very efficiently implemented many of them and integrated the many tools with each other.



Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. European Commission, Joint Research Centre (JRC), Ispra, Italy
