A survey of methods to ease the development of highly multilingual text mining applications

  • Original paper
Language Resources and Evaluation

Abstract

Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because the development effort per language is large. Self-training tools obviously alleviate the problem, but even the effort of providing training data and of manually tuning the results is usually considerable. In this paper, we gather insights by various multilingual system developers on how to minimise the effort of developing natural language processing applications for many languages. We also explain the main guidelines underlying our own effort to develop complex text mining software for tens of languages. While these guidelines (most of all, extreme simplicity) can be very restrictive and limiting, we believe we have shown the feasibility of the approach through the development of the Europe Media Monitor (EMM) family of applications (http://emm.newsbrief.eu/overview.html). EMM is a set of complex media monitoring tools that process and analyse up to 100,000 online news articles per day in between twenty and fifty languages. We will also touch upon the kind of language resources that would make it easier for all to develop highly multilingual text mining applications. We will argue that, to achieve this, the most needed resources would be freely available, simple, parallel and uniform multilingual dictionaries, corpora and software tools.

Notes

  1. See http://news.google.com. All websites mentioned here were last visited in the week of 15 February 2011.

  2. See http://news.yahoo.com/.

  3. See http://emm.newsbrief.eu/.

  4. See http://www.silobreaker.com/.

  5. See http://www.newsvine.com/.

  6. See http://www.daylife.com/.

  7. See http://www.newstin.com/.

  8. See http://emm.newsexplorer.eu/. NewsExplorer processes news articles in Arabic, Bulgarian, Danish, Dutch, English, Estonian, Farsi, French, German, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovene, Spanish, Swahili, Swedish and Turkish.

  9. As of February 2011, the website actually lists 54 languages, but some of them are extremely low-volume and EMM may not capture news in these languages every day.

  10. See http://europa.eu/eurovoc/. Automatic Eurovoc indexing has been trained for 22 EU languages.

  11. See http://translate.google.com/.

  12. The event extraction results are accessible at http://emm.newsbrief.eu/geo?type=event&format=html&language=all.

  13. See the NewsExplorer entity page http://emm.newsexplorer.eu/NewsExplorer/entities/en/7472.html.

  14. See, for example, Barack Obama’s page at http://emm.newsexplorer.eu/NewsExplorer/entities/en/1510.html.

  15. See http://www.nlm.nih.gov/mesh/. The multilingual MeSH term recognition software was developed by Health-on-the-Net (HON, http://www.hon.ch/).

  16. See http://www.flarenet.eu/?q=node/347.

  17. See http://www.issco.unige.ch/en/research/projects/MULTEXT.html.

  18. See http://nl.ijs.si/ME/.

  19. See http://www.geonames.org/.

  20. See http://www.globalwordnet.org/.

  21. See http://europa.eu/eurovoc/.

  22. See http://www.elda.org/.

  23. See http://www.ldc.upenn.edu/.

  24. See http://www.flarenet.eu/.

  25. See http://www.clarin.eu/.

  26. See http://www.clef-campaign.org/.

  27. See http://www.enabler-network.org/.

  28. See http://www.meta-net.eu/meta-share.

  29. See http://projects.ldc.upenn.edu/LCTL/.

  30. See http://www.globalwordnet.org/.

Abbreviations

CoNLL: Conference on Computational Natural Language Learning
EC: European Commission
ELDA: Evaluations and Language resources Distribution Agency
EMM: Europe Media Monitor
EU: European Union
GATE: General Architecture for Text Engineering
JRC: Joint Research Centre
LDC: Linguistic Data Consortium
LREC: Language Resources and Evaluation Conference
ML: Machine learning
MT: Machine translation
NER: Named entity recognition
NLP: Natural language processing
SProUT: Shallow Processing with Unification and Typed Feature Structures
TAC: Text Analysis Conference

References

  • Balahur-Dobrescu, A., Steinberger, R., Kabadjov, M., Zavarella, V., van der Goot, E., Halkia, M., et al. (2010). Sentiment analysis in the news. In Proceedings of LREC. Valletta, Malta.

  • Bender, E., & Flickinger, D. (2005). Rapid prototyping of scalable grammars: Towards modularity in extensions to a language-independent core. In Proceedings of IJCNLP. Jeju Island, Korea.

  • Bentivogli, L., Forner, P., & Pianta, E. (2004). Evaluating cross-lingual annotation transfer in the MultiSemCor corpus (pp. 364–370). Geneva, Switzerland: CoLing.

  • Bering, C., Drożdżyński, W., Erbach, G., Guasch, L., Homola, P., Lehmann, S., et al. (2003). Corpora and evaluation tools for multilingual named entity grammar development. Proceedings of the multilingual corpora workshop at corpus linguistics (pp. 42–52). UK: Lancaster.

  • Carenini, M., Whyte, A., Bertorello, L., & Vanocchi, M. (2007). Improving communication in E-democracy using natural language processing. IEEE Intelligent Systems, 22(1), 20–27.

  • Ehrmann, M., & Turchi, M. (2010). Building multilingual named entity-annotated corpora exploiting parallel corpora. In Proceedings of the workshop on annotation and exploitation of parallel corpora (AEPC) (pp. 24–33). Tartu, Estonia.

  • Gamon, M., Lozano, C., Pinkham, J., & Reutter, T. (1997). Practical experience with grammar sharing in multilingual NLP. In Proceedings of ACL/EACL, Madrid, Spain.

  • Gong, Y., & Liu, X. (2002). Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of ACM SIGIR. New Orleans, USA.

  • Goutte, C., Cancedda, N., Dymetman, M., & Foster, G. (2009). Learning machine translation. Cambridge, USA: MIT Press.

  • Grefenstette, G. (2010). Proposition for a web 2.0 version of linguistic resource creation. Presentation at FLaReNet Forum 2010 in Barcelona on 12.02.2010.

  • Ignat, C., Pouliquen, B., Ribeiro, A., & Steinberger, R. (2003). Extending an information extraction tool set to central and eastern European languages. In Proceedings of the workshop information extraction for Slavonic and other central and eastern European languages (IESL), held at RANLP. Borovets, Bulgaria, September 8–9, 2003.

  • Kabadjov, M., Atkinson, M., Steinberger, J., Steinberger, R., & van der Goot, E. (2010). NewsGist: A multilingual statistical news summarizer. In J. L. Balcázar, F. Bonchi, A. Gionis, & M. Sebag (Eds.), Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML-PKDD). Barcelona, Spain, September 20–24, 2010. Lecture Notes in Computer Science (Vol. 6323, pp. 591–594). Berlin: Springer.

  • Larkey, L., Feng, F., Connell, M., & Lavrenko, V. (2004). Language-specific models in multilingual topic tracking. In Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval (pp. 402–409).

  • Lee, C.-J., Chang, J. S., & Jang, J.-S. R. (2006). Extraction of transliteration pairs from parallel corpora using a statistical transliteration model. Information Sciences, 176(1), 67–90.

  • Linge, J., Steinberger, R., Weber, T., Yangarber, R., van der Goot, E., Al Khudhairy, D., et al. (2009). Internet surveillance systems for early alerting of health threats. Euro Surveillance, 14(13). Stockholm, April 2, 2009.

  • Maynard, D., Tablan, V., & Cunningham, H. (2003). NE Recognition without training data on a language you don’t speak. In Proceedings of the ACL workshop on multilingual and mixed-language NER: Combining statistical and symbolic methods. Sapporo, Japan.

  • Maynard, D., Tablan, V., Cunningham, H., Ursu, C., Saggion, H., Bontcheva, K., et al. (2002). Architectural elements of language engineering robustness. Journal of Natural Language Engineering, 8(3), 257–274. Special issue on robust methods in analysis of natural language data.

  • Nadeau, D., & Sekine, S. (2009). A survey of entity recognition and classification. In S. Sekine & E. Ranchhod (Eds.), Named entities—recognition, classification and use. Amsterdam/Philadelphia: John Benjamins Publishing Company.

  • Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., et al. (2007). The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL (pp. 915–932). Prague, Czech Republic.

  • Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135.

  • Pastra, K., Maynard, D., Hamza, O., Cunningham, H., & Wilks, Y. (2002). How feasible is the reuse of grammars for named entity recognition? In Proceedings of LREC, Las Palmas, Spain.

  • Pouliquen, B., Steinberger, R., & Best, C. (2007a). Automatic detection of quotations in multilingual news. In Proceedings of the international conference recent advances in natural language processing (RANLP) (pp. 487–492). Borovets, Bulgaria, September 27–29, 2007.

  • Pouliquen, B., Steinberger, R., & Belyaeva, J. (2007b). Multilingual multi-document continuously updated social networks. In Proceedings of the workshop multi-source multilingual information extraction and summarization (MMIES) held at RANLP (pp. 25–32). Borovets, Bulgaria, September 26, 2007.

  • Ranta, A. (2009). The GF resource grammar library. In Linguistic issues in language technology LiLT 2:2. December 2009.

  • Rayner, M., & Bouillon, P. (1996). Adapting the core language engine to French and Spanish. In Proceedings of the international conference NLP+IA (pp. 224–232), Moncton, Canada.

  • Rosen, A. (2010). Mediating between incompatible tagsets. In Proceedings of the workshop on annotation and exploitation of parallel corpora (pp. 53–62), Tartu, Estonia.

  • Shinyama, Y., & Sekine, S. (2004). Named entity discovery using comparable news articles. In Proceedings of the 20th international conference on computational linguistics (CoLing) (pp. 848–853). Geneva, Switzerland.

  • Spreyer, K., & Frank, A. (2008). Projection-based acquisition of a temporal labeller. In Proceedings of the 3rd international joint conference on natural language processing (IJCNLP) (pp. 489–496). Hyderabad, India.

  • Steinberger, J., Kabadjov, M., Pouliquen, B., Steinberger, R., & Poesio, M. (2009). WB-JRC-UT’s participation in TAC 2009: Update summarization and AESOP tasks. In Proceedings of the text analysis conference 2009 (TAC’2009). National Institute of Standards and Technology, Gaithersburg, Maryland USA, November 16–17, 2009.

  • Steinberger, J., Lenkova, P., Ebrahim, M., Ehrmann, M., Vázquez, S., Hürriyetoğlu, A., Kabadjov, M., Steinberger, R., Tanev, H., & Zavarella, V. (2011). Creating sentiment dictionaries via triangulation. In Proceedings of the 2nd workshop on computational approaches to subjectivity and sentiment analysis, WASSA, held at the ACL-HLT conference (pp. 28–36). Portland, Oregon, USA, 24 June 2011.

  • Steinberger, R., & Pouliquen, B. (2007). Cross-lingual named entity recognition. In S. Sekine & E. Ranchhod (Eds.), Journal Linguisticae Investigationes, Special issue on named entity recognition and categorisation. LI, 30(1), 135–162. Amsterdam: John Benjamins Publishing Company.

  • Steinberger, R., Pouliquen, B., & Ignat, C. (2008). Using language-independent rules to achieve high multilinguality in text mining. In F. Fogelman-Soulié, D. Perrotta, J. Piskorski, & R. Steinberger (Eds.), Mining massive data sets for security (pp. 217–240). Amsterdam, The Netherlands: IOS Press.

  • Steinberger, R., Pouliquen, B., & van der Goot, E. (2009). An introduction to the Europe Media Monitor family of applications. In F. Gey, N. Kando, & J. Karlgren (Eds.), Information access in a multilingual world: Proceedings of the SIGIR 2009 Workshop (SIGIR-CLIR) (pp. 1–8). Boston, USA, July 23, 2009.

  • Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., et al. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th international conference on language resources and evaluation (LREC) (pp. 2142–2147). Genoa, Italy, May 24–26, 2006.

  • Steinberger, R., Ombuya, S., Kabadjov, M., Pouliquen, B., Della Rocca, L., Belyaeva, J., De Paola, M., & van der Goot, E. (2011). Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili. Language Resources and Evaluation Journal, 45(3), 311–330.

  • Tanev, H., Zavarella, V., Linge, J., Kabadjov, M., Piskorski, J., Atkinson, M., et al. (2009). Exploiting machine learning techniques to build an event extraction system for Portuguese and Spanish. In linguaMÁTICA: Revista para o Processamento Automático das Línguas Ibéricas (Vol. 2, pp. 55–67).

  • Turchi, M., Steinberger, J., Kabadjov, M., & Steinberger, R. (2010). Using parallel corpora for multilingual (multi-document) summarisation evaluation. In Conference on multilingual and multimodal information access evaluation (CLEF). Padua, Italy, September 20–23, 2010. Springer Lecture Notes for Computer Science LNCS.

  • Vergne, J. (2002). Une méthode pour l’analyse descendante et calculatoire de corpus multilingues: Application au calcul des relations sujet-verbe. In Proceedings of TALN. Nancy, France.

  • Vergne, J. (2009). Defining the chunk as the period of the functions length and frequency of words on the syntagmatic axis. In Proceedings of the language technology conference LTC. Poznan, Poland.

  • Wehrli, E. (2007). Fips, a “Deep” linguistic multilingual parser. In Proceedings of the ACL workshop on deep linguistic processing (pp. 120–127). Prague, Czech Republic.

  • Yarowsky, D., Ngai, G., & Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the 1st international conference on Human Language Technology research (HLT) (pp. 1–8). Stroudsburg, PA, USA.

  • Zaghouani, W., Pouliquen, B., Ibrahim, M., & Steinberger, R. (2010). Adapting a resource-light highly multilingual Named Entity Recognition system to Arabic. In Proceedings of LREC, Valletta, Malta.

Acknowledgments

I would like to thank the following persons for having shared their own multilingual grammar writing experience with us, or their views on linguistic resources: Kalina Bontcheva (Sheffield University) on GATE; Frédérique Segond, Caroline Hagège and Claude Roux (Xerox Research Centre Europe) on the Xerox Incremental Parser; Aarne Ranta (Gothenburg University) on the Grammatical Framework; Jacques Vergne (Caen University) on sentence chunking using extremely light-weight methods; Eric Wehrli (Geneva University) on his deep-linguistic parser; Gregory Grefenstette (Exalead) and Gregor Thurmair (Linguatec) on their respective multilingual products; Khalid Choukri (ELRA/ELDA) and Gregory Grefenstette on linguistic resources; and my JRC colleagues Maud Ehrmann, Hristo Tanev, Vanni Zavarella and Marco Turchi for sharing their experiences and for their feedback on earlier versions of the paper. The ultimate responsibility for any errors, however, lies with me. I would furthermore like to thank my superiors Erik van der Goot and Delilah Al Khudhairy for their support, and my colleagues in the OPTIMA group at the JRC for the fruitful and efficient collaboration over the past years, and for so reliably providing large amounts of clean multilingual news data, which allowed us to run many multilingual experiments. Building the complex EMM applications was a successful team effort that also includes many less rewarding and less visible tasks. My specific thanks go to my former colleague Bruno Pouliquen (now at WIPO in Geneva). We developed most ideas together, and he very efficiently implemented many ideas and integrated the many tools with each other.

Author information

Corresponding author

Correspondence to Ralf Steinberger.

About this article

Cite this article

Steinberger, R. A survey of methods to ease the development of highly multilingual text mining applications. Lang Resources & Evaluation 46, 155–176 (2012). https://doi.org/10.1007/s10579-011-9165-9

  • DOI: https://doi.org/10.1007/s10579-011-9165-9
