Abstract
This study aims to develop a news surveillance system that can handle multilingual web corpora. As an example of a domain where multilingual capacity is crucial, we focus on epidemic surveillance. This task requires worldwide news coverage in order to detect new events as quickly as possible, anywhere, whatever language they are first reported in. In this study, text-genre analysis is used rather than sentence analysis. The properties of the news genre allow us to assess the thematic relevance of news items, which are filtered with the help of a specialised lexicon automatically collected from Wikipedia. A more detailed analysis of text-specific properties is then applied to the relevant documents to better characterize the epidemic event (i.e., which disease is spreading where?). Results on 400 documents per language demonstrate the value of this multilingual approach, which relies on light resources. DAnIEL achieves an F1-measure of around 85%. Two issues are addressed: the first is morphologically rich languages, e.g. Greek, Polish and Russian, as compared to English; the second is event location detection as related to disease detection. The system provides a reliable alternative to the generic IE architecture, which is constrained by the lack of many of its components in many languages.
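The abstract does not spell out the filtering procedure, so the following is only a minimal sketch of how a character-based, genre-driven relevance filter of this kind might look. It assumes (this is an assumption, not a statement from the abstract) that a document counts as relevant when a lexicon term, or a sufficiently long character substring of it, occurs both in the head of the article (title and lead) and in its body. The function names, the `min_ratio` threshold and the toy lexicon are hypothetical and chosen for illustration only.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def term_matches(term: str, segment: str, min_ratio: float = 0.8) -> bool:
    """True if a long enough character substring of term occurs in segment.

    Matching on substrings rather than whole words is meant to tolerate
    inflected forms without any language-specific morphological analysis.
    min_ratio is a hypothetical threshold, not a value from the paper.
    """
    term, segment = term.lower(), segment.lower()
    if not term:
        return False
    shared = longest_common_substring(term, segment)
    return len(shared) / len(term) >= min_ratio


def relevant_diseases(head: str, body: str, lexicon: list[str],
                      min_ratio: float = 0.8) -> list[str]:
    """Lexicon terms found both in the head and in the body of an article.

    Assumption: repetition across these two positions signals the topic.
    """
    return [
        term for term in lexicon
        if term_matches(term, head, min_ratio)
        and term_matches(term, body, min_ratio)
    ]


if __name__ == "__main__":
    lexicon = ["cholera", "grypa"]  # toy lexicon, illustrative only
    head = "Epidemia cholery na Haiti"
    body = "W stolicy potwierdzono nowe przypadki cholery."
    print(relevant_diseases(head, body, lexicon))
```

In this toy Polish example the lexicon entry "cholera" is matched through the shared substring "choler" even though the text only contains the inflected form "cholery", which illustrates why a character-level criterion can help with morphologically rich languages. The quadratic substring search is used here for brevity; an index such as a suffix array would be a more scalable choice on large corpora.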
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lejeune, G., Brixtel, R., Doucet, A., Lucas, N. (2012). DAnIEL: Language Independent Character-Based News Surveillance. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_7
DOI: https://doi.org/10.1007/978-3-642-33983-7_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33982-0
Online ISBN: 978-3-642-33983-7
eBook Packages: Computer Science, Computer Science (R0)