Abstract
Despite the spread of Electronic Health Records (EHRs) in Spanish hospitals, and despite Spanish ranking second among languages by number of speakers, to the best of our knowledge there are no natural language processing tools for medical texts written in Spanish.
This paper presents an approach based on OpenNLP for processing natural language texts written in Spanish for information extraction. The main goal is to integrate our development with cTAKES. Because cTAKES has been trained specifically for the clinical domain, we train the main modules on two corpora, a general-purpose annotated Spanish corpus and an in-house corpus built from medical documents, and test both on a set of medical documents. The best performance of the individual components when tested on medical documents was: sentence boundary detector accuracy = 0.872; part-of-speech tagger accuracy = 0.946; chunker = 0.909.
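The accuracy figures above can be read as the share of correctly labelled items in the test set. The following is a minimal sketch of that calculation for a part-of-speech tagger, assuming accuracy is defined as correct labels divided by total labels; the abstract does not state the exact evaluation protocol. The function name `token_accuracy`, the tag names, and the example sentence are hypothetical and do not come from the paper.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical tags for the sentence "El paciente presenta fiebre ."
# (illustrative only; not taken from the corpora used in the paper).
gold_tags = ["DET", "NOUN", "VERB", "NOUN", "PUNCT"]
predicted_tags = ["DET", "NOUN", "VERB", "ADJ", "PUNCT"]

# Four of the five predicted tags match the gold tags.
print(token_accuracy(gold_tags, predicted_tags))
```

The same function would apply to sentence boundary detection if each candidate boundary is treated as one labelled item, which is again an assumption about the setup rather than something the abstract confirms.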
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Costumero, R., García-Pedrero, Á., Gonzalo-Martín, C., Menasalvas, E., Millan, S. (2014). Text Analysis and Information Extraction from Spanish Written Documents. In: Ślȩzak, D., Tan, AH., Peters, J.F., Schwabe, L. (eds) Brain Informatics and Health. BIH 2014. Lecture Notes in Computer Science, vol 8609. Springer, Cham. https://doi.org/10.1007/978-3-319-09891-3_18
DOI: https://doi.org/10.1007/978-3-319-09891-3_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09890-6
Online ISBN: 978-3-319-09891-3
eBook Packages: Computer Science, Computer Science (R0)