Abstract
The complexity and wide range of current biomedical research are reflected in the number and scope of biomedical publications. Because of this abundance, scientists often can no longer keep up with publications in their own areas of research, let alone find, read, and analyze potentially related scientific publications. Real advances in research, however, can be achieved only if a researcher can obtain an overview of the state of a given research question in a timely manner. This chapter presents methods to help researchers access the content of the biomedical literature. Information Retrieval (IR) identifies, in a large document database, the documents that are most relevant to a search topic provided by a user. Natural Language Processing (NLP) affords finer-grained access to more precise information contained in texts, which opens up a range of data analysis and knowledge synthesis functionalities. Powerful tools have been designed to exploit these techniques for the benefit of biomedical researchers, extracting millions of facts from the published literature and assisting Literature-Based Discovery. This chapter is organized as follows. It first describes the current capacities of IR from the Medline® bibliographic database. A short introduction to the main concepts of Natural Language Processing follows. Tasks that build on Natural Language Processing are then presented: Information Extraction and its derivatives, and Literature-Based Discovery. A review of some existing applications closes the chapter. The references cited in the text are supplemented by a list of textbooks and Web resources.
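The IR task described in the abstract, ranking the documents of a collection by relevance to a user's search topic, can be illustrated with a minimal sketch. It assumes a TF-IDF weighting with cosine similarity, a common textbook choice and not necessarily what any Medline search engine uses. The function names, the crude tokenizer, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on any non-alphanumeric character (crude on purpose)."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def rank(query, docs):
    """Return (score, doc_id) pairs sorted by decreasing TF-IDF cosine similarity."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    def idf(term):
        # Smoothed inverse document frequency; terms absent from the
        # collection get a finite weight instead of a division by zero.
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    def vector(tokens):
        # Sparse TF-IDF vector: term -> raw count times idf.
        return {t: count * idf(t) for t, count in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q_vec = vector(tokenize(query))
    scored = [(cosine(q_vec, vector(toks)), doc_id)
              for doc_id, toks in tokenized.items()]
    return sorted(scored, reverse=True)


# Toy collection; identifiers and texts are hypothetical.
DOCS = {
    "doc1": "Fish oil reduces blood viscosity in patients.",
    "doc2": "Raynaud's syndrome is associated with high blood viscosity.",
    "doc3": "A part-of-speech tagger for biomedical text.",
}

RANKED = rank("blood viscosity", DOCS)
for score, doc_id in RANKED:
    print(f"{doc_id}\t{score:.3f}")
```

In this toy run the two documents that mention both query terms rank above the unrelated one, and the shorter of the two ranks first because cosine similarity normalizes for document length. A production system would add stemming, concept mapping, and field weighting on top of such a baseline.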
Notes
- 1.
- 2.
- 3.
- 4.
- 5. The complete term here is monostotic fibrous dysplasia; it was obtained through the analysis of adjective coordination in a noun phrase.
- 6. MetaMap 2008 expands the abbreviation within the span of submitted text.
- 7. Although these may actually be hypotheses, observations, results, etc., we refer to them uniformly as “facts.”
- 8.
- 9. iHOP – http://www.ihop-net.org/
- 10. Chilibot – http://www.chilibot.net/
- 11.
- 12. BITOLA – http://www.mf.uni-lj.si/bitola/
- 13. Semantic MEDLINE – http://skr3.nlm.nih.gov/SemMedDemo/
References
Ahlers CB, Fiszman M, Demner-Fushman D, Lang FM, Rindflesch TC (2007) Extracting semantic predications from MEDLINE citations for pharmacogenomics. In: Pac Symp Biocomput 12. Maui, Hawaii, pp 209–220
Airola A, Pyysalo S, Björne J, Pahikkala T, Ginter F, Salakoski T (2008) A graph kernel for protein–protein interaction extraction. In: Proceedings of the workshop on current trends in biomedical natural language processing (BioNLP’08). Association for Computational Linguistics, pp 1–9
Alex B, Grover C, Haddow B, Kabadjov M, Klein E, Matthews M et al (2008) Assisted curation: Does text mining really help? In: Pac Symp Biocomput 13. Big Island, Hawaii, pp 556–567
Ananiadou S, Friedman C, Tsujii JI (eds) (2004) Named entity recognition in biomedicine. J Biomed Inform 37(6):393–528
Aronson AR (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In: Proc AMIA Symp, pp 17–21
Bodenreider O (2008) Biomedical ontologies in action: Role in knowledge management, data integration and decision support. IMIA Yearbook of Medical Informatics, pp 67–79
Branco A, McEnery T, Mitkov R (eds) (2005) Anaphora processing: Linguistic, cognitive and computational modelling. Current Issues in Linguistic Theory, vol 263. John Benjamins, Amsterdam and Philadelphia
Bruza P, Weeber M (eds) (2008) Literature-based discovery. Information Science and Knowledge Management, vol 15, Springer, Berlin Heidelberg New York
Carreras X, Màrquez L (2005) Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In: Proc 9th CoNLL. ACL, pp 152–164
Chang JT, Altman RB (2004) Extracting and characterizing gene-drug relationships from literature. Pharmacogenetics 14(9):577–586
Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG (2001) A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 34(5):301–310
Chen H, Sharp BM (2004) Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinform 5:147
Dee CR (2007) The development of the Medical Literature Analysis and Retrieval System (MEDLARS). J Med Libr Assoc 95(4):416–425. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2000779
Demner-Fushman D, Humphrey SM, Ide NC, Loane RF, Mork JG, Ruch P et al (2007) Combining resources to find answers to biomedical questions. In: The sixteenth text retrieval conference TREC-2007, Gaithersburg, MD, pp 205–215
Eaton AD (2006) HubMed: a web-based biomedical literature search interface. Nucleic Acids Res 34(Web Server issue):W745–W747
Firth JR (1957) Papers in linguistics, 1934–1951. Oxford University Press, London
Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A (2001) GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17(Suppl 1):S74–S82
Grishman R, Sundheim B (1996) Message understanding conference – 6: A brief history. In: Proc 16th COLING. ACL, pp 466–471
Habert B, Zweigenbaum P (2002) Contextual acquisition of information categories: what has been done and what can be done automatically? In: Nevin BE, Johnson SM (eds) The Legacy of Zellig Harris: Language and information into the 21st Century, Mathematics and computability of language, vol 2. John Benjamins, Amsterdam, pp 203–231
Hersh W, Cohen AM, Ruslen L, Roberts P (2007) TREC 2007 genomics track overview. In: The Sixteenth Text Retrieval Conference – TREC 2007. NIST
Hersh WR, Greenes RA (1990) SAPHIRE: An information retrieval system featuring concept matching, automatic indexing, probabilistic retrieval, and hierarchical relationships. Comput Biomed Res 23(5):410–425
Hirschman L (2007) The second biocreative evaluation: Lessons learned and future directions. In: Fifth Fraunhofer-symposium on text mining. Bonn, Germany, http://www.scai.fraunhofer.de/fileadmin/download/vortraege/tms_07/Lynette_Hirschmann.pdf
Hliaoutakis A, Varelas G, Petrakis EGM, Milios EE (2006) MedSearch: A retrieval system for medical information based on semantic similarity. In: Proc ECDL, Lecture Notes in Computer Science 4172. Springer, Berlin Heidelberg New York
Hoffmann R, Valencia A (2004) A gene network for navigating the literature. Nat Genet 36:664
Hristovski D, Friedman C, Rindflesch TC, Peterlin B (2006) Exploiting semantic relations for literature-based discovery. In: AMIA Annu Symp Proc, pp 349–353
Hristovski D, Peterlin B, Mitchell JA, Humphrey SM (2005) Using literature-based discovery to identify disease candidate genes. Int J Med Inform 74(2–4):289–298
Ide NC, Loane RF, Demner-Fushman D (2007) Essie: A concept based search engine for structured biomedical text. J Am Med Inform Assoc 14(3):253–263
Jacquemin C (2001) Spotting and discovering terms through NLP. MIT Press, Cambridge, MA
Jelier R, Jenster G, Dorssers LCJ, Wouters BJ, Hendriksen PJM, Mons B et al (2007) Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation. BMC Bioinform 8:14
Jelier R, Schuemie MJ, Veldhoven A, Dorssers LCJ, Jenster G, Kors JA (2008) Anni 2.0: A multipurpose text-mining tool for the life sciences. Genome Biol 9(6):R96
Karamanis N, Lewin I, Seal R, Drysdale R, Briscoe E (2007) Integrating natural language processing with flybase curation. In: Pac Symp Biocomput 12. Maui, Hawaii, pp 245–256
Krallinger M, Leitner F, Valencia A (2007) Assessment of the second BioCreative PPI task: Automatic extraction of protein–protein interactions. In: Proceedings of the BioCreAtIvE II Workshop, Madrid, pp 41–54
Lee LC, Horn F, Cohen FE (2007) Automatic extraction of protein point mutations using a graph bigram association. PLoS Comput Biol 3(2):e16
Leroy G, Chen H (2005) Genescene: An ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts. J Am Soc Inf Sci Technol (JASIST) 56(5):457–468
Lewis J, Ossowski S, Hicks J, Errami M, Garner HR (2006) Text similarity: an alternative way to search MEDLINE. Bioinformatics 22(18):2298–2304
Lindberg DA, Humphreys BL, McCray AT (1993) The unified medical language system. Methods Inf Med 32(4):281–291
Lussier Y, Borlawsky T, Rappaport D, Liu Y, Friedman C (2006) PhenoGO: Assigning phenotypic context to Gene Ontology annotations with natural language processing. In: Pac Symp Biocomput 11. Maui, Hawaii, pp 64–75
Miles WD (1992) A history of the national library of medicine: The Nation’s Treasury of Medical Knowledge. Bernan Assoc. http://www.nlm.nih.gov/hmd/manuscripts/miles/miles.pdf. Accessed 4 August 2008
Miyao Y, Ohta T, Masuda K, Tsuruoka Y, Yoshida K, Ninomiya T et al (2006) Semantic retrieval for the accurate identification of relational concepts in massive textbases. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006). Sydney, Australia, pp 1017–1024
Morgan AA, Hirschman L (2007) Overview of BioCreative II gene normalisation. In: Proceedings of the BioCreAtIvE II Workshop, Madrid, pp 17–27
Müller HM, Kenny EE, Sternberg PW (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2(11):e309
Mutalik PG, Deshpande A, Nadkarni PM (2001) Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. J Am Med Inform Assoc 8(6):598–609
Nakov P, Hearst M (2006) Using verbs to characterize noun–noun relations. In: Proceedings of the twelfth international conference on artificial intelligence: Methodology, systems, applications (AIMSA), Bulgaria
Namer F, Baud R (2007) Defining and relating biomedical terms: Towards a cross-language morphosemantics-based system. Int J Med Inform 76(2–3):226–233
Palakal M, Bright J, Sebastian T, Hartanto S (2007) A comparative study of cells in inflammation, EAE and MS using biomedical literature data mining. J Biomed Sci 14(1):67–85
Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
Pratt W, Hearst M, Fagan L (1999) A knowledge-based approach to organizing retrieved documents. In: Proceedings of 16th annual conference on artificial intelligence (AAAI-99). Orlando, FL, pp 80–85
Rebholz-Schuhmann D, Kirsch H, Arregui M, Gaudan S, Riethoven M, Stoehr P (2007) EBIMed–text crunching to gather facts for proteins from Medline. Bioinformatics 23(2):e237–e244
Rebholz-Schuhmann D, Kirsch H, Couto F (2005) Facts from text – is text mining ready to deliver? PLoS Biol 3(2):e65
Rindflesch TC, Tanabe L, Weinstein JN, Hunter L (2000) EDGAR: Extraction of drugs, genes and relations from the biomedical literature. Pac Symp Biocomput 517–528
Sanchez-Graillet O, Poesio M (2007) Negation of protein–protein interactions: Analysis and extraction. Bioinformatics 23(13):i424–i432
Sandler T, Schein AI, Ungar LH (2006) Automatic term list generation for entity tagging. Bioinformatics 22(6):651–657
Smalheiser NR, Torvik VI, Bischoff-Grethe A, Burhans LB, Gabriel M, Homayouni R et al (2006) Collaborative development of the Arrowsmith two node search interface designed for laboratory investigators. J Biomed Discov Collab 1:8. http://www.j-biomed-discovery.com/content/1/1/8
Smith L, Rindflesch T, Wilbur WJ (2004) MedPost: A part-of-speech tagger for bioMedical text. Bioinformatics 20(14):2320–2321
Srinivasan P, Libbus B (2004) Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics 20(Suppl 1):i290–i296
Surdeanu M, Harabagiu S, Williams J, Aarseth P (2003) Using predicate-argument structures for information extraction. In: Proc 41st ACL. Sapporo, Japan, pp 8–15
Swanson DR (1986) Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect Biol Med 30:7–18
Swanson DR, Smalheiser NR, Torvik VI (2006) Ranking indirect connections in literature-based discovery: The role of medical subject headings. J Am Soc Inf Sci Technol 57(11):1427–1439
Szarvas G, Vincze V, Farkas R, Csirik J (2008) The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts. In: Proceedings of the 2008 workshop on biomedical natural language processing (BioNLP’08). Columbus, Ohio, pp 38–45
Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN (1999) MedMiner: An internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques 27:1210–1217
The Unicode Standard, Version 5.0 (2007) Addison-Wesley, Boston, MA
Tsai RTH, Chou WC, Lin YC, Sung CL, Ku W, Su YS et al (2006) BIOSMILE: Adapting semantic role labeling for biomedical verbs: An exponential model coupled with automatically generated template features. In: HLT-NAACL BioNLP, pp 57–64
Wang X (2007) Rule-based protein term identification with help from automatic species tagging. In: Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2007), Lecture Notes in Computer Science 4394, Springer, Berlin Heidelberg New York, pp 288–298
Wilbur J, Smith L, Tanabe L (2007) BioCreative 2. Gene Mention Task. In: Proceedings of the BioCreAtIvE II Workshop 2007, Madrid, Spain, pp 7–16
Wren JD, Garner HR (2004) Shared relationship analysis: Ranking set cohesion and commonalities within a literature-derived relationship network. Bioinformatics 20(2):191–198
Yeh A, Morgan A, Colosimo M, Hirschman L (2005) BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinform 6(Suppl 1)
Textbooks and Introductions to NLP and BioNLP
Allen JF (1995) Natural language understanding, 2nd edn. Benjamin/Cummings, Menlo Park, CA
Ananiadou S, Kell DB, Tsujii JI (2006) Text mining and its potential applications in systems biology. Trends Biotechnol 24(12):571–579
Ananiadou S, McNaught J (2006) Text mining for biology and biomedicine. Artech House Publishers, Norwood, Massachusetts, USA
de Bruijn B, Martin J (2002) Getting to the (c)ore of knowledge: Mining biomedical literature. Int J Med Inform 67:7–18
Cohen AM, Hersh W (2005) A survey of current work in biomedical text mining. Brief Bioinform 6(1):57–71
Cohen KB, Hunter L (2004) Natural language processing and systems biology. In: Dubitzky W, Azuaje F (eds) Artificial intelligence methods and tools for systems biology. Springer, Norwell, MA, pp 147–174
Cohen KB, Hunter L (2008) Getting started in text mining. PLoS Comput Biol 4(1):e20
Hearst MA (2003) What is text mining? Available online at http://www.ischool.berkeley.edu/~hearst/text-mining.html
Hunter L, Cohen KB (2006) Biomedical language processing: what’s beyond PubMed? Mol Cell 21:589–594
Jackson P, Moulinier I (2002) Natural language processing for online applications: text retrieval, extraction, and categorization. John Benjamins Publishing Company
Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7:119–129
Jurafsky D, Martin JH (2000) Speech and language processing: An introduction to natural language processing, computational linguistics and speech recognition. Prentice Hall, Lebanon, Indiana
Krallinger M, Valencia A (2005) Text-mining and information-retrieval services for molecular biology. Genome Biol 6(7):224
Mitkov R (ed) (2003) The Oxford handbook of computational linguistics. Oxford University Press, New York
Shatkay H (2005) Hairpins in bookstacks: Information retrieval from biomedical text. Brief Bioinform 6(3):222–238
Shatkay H, Craven M (2007) Biomedical text mining. MIT Press, Cambridge, MA
Spasic I, Ananiadou S, McNaught J, Kumar A (2005) Text mining and ontologies in biomedicine: Making sense of raw text. Brief Bioinform 6(3):239–251
Weeber M, Kors JA, Mons B (2005) Online tools to support literature-based discovery in the life sciences. Brief Bioinform 6(3):277–286
Appendices
Appendix
Lists of Tools and Services
State of the art NLP tools:
http://aclweb.org/aclwiki/index.php?title=State_of_the_art
Resource list compiled by Kevin Bretonnel Cohen:
http://compbio.uchsc.edu/corpora/bcresources.html
Resource list compiled by Robert Futrelle:
BioCreAtIvE bio-NLP tools:
http://biocreative.sourceforge.net/bionlp_tools_links.html
NLP and Text Mining Research list at NaCTeM:
http://www.nactem.ac.uk/research.php?view=4
Arrowsmith:
http://arrowsmith.psych.uic.edu/arrowsmith_uic/tools.html
The Open Directory Project:
http://www.dmoz.org/Science/Biology/Bioinformatics/Software/
The National Centers for Biomedical Computing (NCBC) funded under the NIH Roadmap for Bioinformatics and Computational Biology:
Gene and Protein Name Resources
Entrez Gene:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
FlyBase:
HUGO Gene:
http://www.genenames.org/index.html
Model organisms:
http://www.nih.gov/science/models/
Mouse Genome Informatics:
http://www.informatics.jax.org/
Saccharomyces genome database:
http://www.yeastgenome.org/gene_list.shtml
The Worldwide Protein Data Bank:
UniProt:
Biomedical Terminologies
The national center for biomedical ontology:
The open biomedical ontologies:
Resources for Biomedical Terminology and Ontology:
http://www.ldc.upenn.edu/mamandel/itre/term.html#Dictionaries
Unified Medical Language System
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Zweigenbaum, P., Demner-Fushman, D. (2009). Advanced Literature-Mining Tools. In: Edwards, D., Stajich, J., Hansen, D. (eds) Bioinformatics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92738-1_17
DOI: https://doi.org/10.1007/978-0-387-92738-1_17
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-92737-4
Online ISBN: 978-0-387-92738-1
eBook Packages: Biomedical and Life Sciences (R0)