Text Analysis and Information Extraction from Spanish Written Documents

Costumero, Roberto; García-Pedrero, Ángel; Gonzalo-Martín, Consuelo; Menasalvas, Ernestina; Millan, Socorro

doi:10.1007/978-3-319-09891-3_18

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8609)

Abstract

Despite the spread of Electronic Health Records (EHRs) in Spanish hospitals, and although Spanish ranks second among languages by number of speakers, to the best of our knowledge there are no natural language processing tools for medical texts written in Spanish.

This paper presents an OpenNLP-based approach to processing natural language texts written in Spanish for information extraction. The main goal is to integrate our development with cTAKES. Because cTAKES has been trained specifically for the clinical domain, we train the main modules on two corpora, a general-purpose annotated Spanish corpus and an in-house corpus built from medical documents, and test both on a set of medical documents. The best performance of the individual components when tested on medical documents was: sentence boundary detector accuracy = 0.872; part-of-speech tagger accuracy = 0.946; chunker = 0.909.
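The abstract does not state how the reported accuracies were computed. As a minimal sketch, assuming the common per-token definition (the share of tokens whose predicted tag matches the gold annotation), a part-of-speech accuracy figure could be obtained as follows. The function name `tagging_accuracy`, the example sentence, and the tag labels are hypothetical and are not taken from the paper or from OpenNLP.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[Sequence[str]],
                     predicted: Sequence[Sequence[str]]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence needs one predicted tag per gold tag")
        # Count positions where the two tag sequences agree.
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total


# Hypothetical tags for "El paciente presenta fiebre ." (illustrative only).
gold = [["DET", "NOUN", "VERB", "NOUN", "PUNCT"]]
predicted = [["DET", "NOUN", "VERB", "ADJ", "PUNCT"]]
accuracy = tagging_accuracy(gold, predicted)
print(f"accuracy = {accuracy:.3f}")  # prints "accuracy = 0.800"
```

Under this assumption, a tagger accuracy of 0.946 would mean that roughly 95 of every 100 tokens in the medical test documents received the gold tag.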

Author information

Authors and Affiliations

Universidad Politécnica de Madrid - Centro de Tecnología Biomédica, Madrid, Spain
Roberto Costumero, Ángel García-Pedrero, Consuelo Gonzalo-Martín & Ernestina Menasalvas
Universidad del Valle, Colombia
Socorro Millan

Editor information

Editors and Affiliations

University of Warsaw and Infobright Inc., Poland
Dominik Ślęzak
School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore
Ah-Hwee Tan
Computational Intelligence Laboratory, ECE Department, University of Manitoba, R3T 5V6, Winnipeg, MB, Canada
James F. Peters
Institute of Computer Science, University of Rostock, Rostock, Germany
Lars Schwabe

About this paper

Cite this paper

Costumero, R., García-Pedrero, Á., Gonzalo-Martín, C., Menasalvas, E., Millan, S. (2014). Text Analysis and Information Extraction from Spanish Written Documents. In: Ślęzak, D., Tan, AH., Peters, J.F., Schwabe, L. (eds) Brain Informatics and Health. BIH 2014. Lecture Notes in Computer Science, vol 8609. Springer, Cham. https://doi.org/10.1007/978-3-319-09891-3_18

DOI: https://doi.org/10.1007/978-3-319-09891-3_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09890-6
Online ISBN: 978-3-319-09891-3
eBook Packages: Computer Science, Computer Science (R0)