Text Analysis and Information Extraction from Spanish Written Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8609)

Abstract

Despite the spread of Electronic Health Records (EHRs) in Spanish hospitals, and although Spanish ranks second among languages by number of speakers, to the best of our knowledge there are no natural language processing tools for medical texts written in Spanish.

This paper presents an OpenNLP-based approach to processing natural language texts written in Spanish for information extraction. The main goal is to integrate our development with cTAKES. Because cTAKES has been trained specifically for the clinical domain, we train the main modules on a general-purpose annotated Spanish corpus and on an in-house corpus built from medical documents, and test both on a set of medical documents. Best performance of the individual components when tested on medical documents: sentence boundary detector accuracy = 0.872; part-of-speech tagger accuracy = 0.946; chunker = 0.909.
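As a reading aid for the accuracy figures above, the following is a minimal sketch of how a token-level accuracy score for a part-of-speech tagger is commonly computed: the fraction of tokens whose predicted tag matches the gold tag. The abstract does not state the exact scoring procedure, so this is an assumption. The function name `token_accuracy`, the example sentence, and the tag labels are hypothetical and do not come from the paper.

```python
from typing import Sequence


def token_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # Booleans sum as 0/1, so this counts the matching positions.
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical tags for the tokens of "El paciente presenta fiebre ."
# (illustrative labels only, not the tag set used in the paper).
gold_tags = ["DET", "NOUN", "VERB", "NOUN", "PUNCT"]
predicted_tags = ["DET", "NOUN", "VERB", "ADJ", "PUNCT"]

score = token_accuracy(gold_tags, predicted_tags)
print(f"accuracy = {score:.3f}")
```

In this toy case four of the five tags match. A reported value such as 0.946 would, under the same assumption, mean that roughly 95 of every 100 test tokens received the correct tag.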




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Costumero, R., García-Pedrero, Á., Gonzalo-Martín, C., Menasalvas, E., Millan, S. (2014). Text Analysis and Information Extraction from Spanish Written Documents. In: Ślȩzak, D., Tan, AH., Peters, J.F., Schwabe, L. (eds) Brain Informatics and Health. BIH 2014. Lecture Notes in Computer Science, vol 8609. Springer, Cham. https://doi.org/10.1007/978-3-319-09891-3_18

  • DOI: https://doi.org/10.1007/978-3-319-09891-3_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09890-6

  • Online ISBN: 978-3-319-09891-3

  • eBook Packages: Computer Science, Computer Science (R0)
