Natural Language Processing – The Basics

Pestian, John P.; Deleger, Louise; Savova, Guergana K.; Dexheimer, Judith W.; Solti, Imre

doi:10.1007/978-94-007-5149-1_9

John P. Pestian Ph.D.^2,3,
Louise Deleger Ph.D.^3,
Guergana K. Savova Ph.D.^4,
Judith W. Dexheimer Ph.D.^2,5 &
Imre Solti M.D., Ph.D., M.A.^2,3

Part of the book series: Translational Bioinformatics (TRBIO, volume 2)

Abstract

Natural language processing (NLP) emerged in the 1900s to support wartime efforts. Its dubious performance, however, slowed research initiatives until the 1960s, when advances in machine learning provided novel approaches to text analysis. Increased processing speed and the widespread availability of digital text accelerated this trend in the late 1990s. At present, there are extensive efforts to use NLP on clinical text and to incorporate this technology into software applications that support clinical care. In this chapter, the first of two about NLP, we present the basic principles of NLP, the lexical resources required to produce high-quality output from clinical text, the process (called annotation) of creating an NLP gold standard, the statistical methods used for evaluation, and the role of shared tasks in evaluating and facilitating standardization in the field. Subsequent chapters will discuss ongoing research dedicated to improving the quality and utility of NLP in the clinical setting.

Acknowledgements

Dr. Savova’s work was supported in part by NIH grants U54LM008748 and 1U01HG006828. Drs. Deleger’s and Solti’s work was supported in part by NIH grant 5R00LM010227.

Author information

Authors and Affiliations

Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH, USA
John P. Pestian Ph.D., Judith W. Dexheimer Ph.D. & Imre Solti M.D., Ph.D., M.A.
Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, 3333 Burnet Avenue, ML-7024, Cincinnati, OH, 45229-3039, USA
John P. Pestian Ph.D., Louise Deleger Ph.D. & Imre Solti M.D., Ph.D., M.A.
Boston Children’s Hospital and Harvard Medical School, Harvard University, 300 Longwood Avenue, Enders 138, Boston, MA, 02115, USA
Guergana K. Savova Ph.D.
Divisions of Emergency Medicine and Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, 3333 Burnet Avenue, ML-7024, Cincinnati, OH, 45229-3039, USA
Judith W. Dexheimer Ph.D.

Corresponding author

Correspondence to John P. Pestian Ph.D.

Editor information

Editors and Affiliations

Burnet Ave 3333, Cincinnati, 45229-3026, Ohio, USA
John J. Hutton

About this chapter

Cite this chapter

Pestian, J.P., Deleger, L., Savova, G.K., Dexheimer, J.W., Solti, I. (2012). Natural Language Processing – The Basics. In: Hutton, J. (eds) Pediatric Biomedical Informatics. Translational Bioinformatics, vol 2. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-5149-1_9

DOI: https://doi.org/10.1007/978-94-007-5149-1_9
Published: 24 September 2012
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-5148-4
Online ISBN: 978-94-007-5149-1
eBook Packages: Biomedical and Life Sciences, Biomedical and Life Sciences (R0)
