Advanced Literature-Mining Tools

Bioinformatics

Abstract

The complexity and wide range of current biomedical research is reflected in the number and scope of biomedical publications. Due to this abundance, scientists are often no longer able to keep up with publications in their specific areas of research, let alone find, read, and analyze potentially related scientific publications. Real advances in research, however, can be achieved only if a researcher can obtain an overview of the state of a given research question in a timely manner. This chapter presents methods to help researchers access the content of the biomedical literature. Information Retrieval (IR) identifies, in a large document database, the documents that are most relevant to a search topic provided by a user. Natural Language Processing (NLP) affords finer-grained access to more precise information contained in texts, which opens up a range of data analysis and knowledge synthesis functionalities. Powerful tools have been designed to exploit these techniques for the benefit of biomedical researchers, extracting millions of facts from the published literature and assisting Literature-Based Discovery. This chapter is organized as follows. It first describes the current capabilities of IR over the Medline® bibliographic database. A short introduction to the main concepts of Natural Language Processing follows. Tasks that build on Natural Language Processing are then presented: Information Extraction and its derivatives, and Literature-Based Discovery. A review of some existing applications closes the chapter. The references cited in the text are supplemented by a list of textbooks and Web resources.
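
As a rough illustration of the IR task described in the abstract (ranking the documents of a collection by relevance to a user's search topic), the following minimal sketch scores a toy collection with TF-IDF weights and cosine similarity. This is a generic textbook technique chosen for illustration, not the ranking method of Medline/PubMed or of any tool listed in the notes; the function names `tokenize` and `rank` and the three document titles are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return (doc_index, score) pairs sorted by TF-IDF cosine similarity, best first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency, so every weight stays positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vector(tokens):
        # Sparse TF-IDF vector; terms unseen in the collection are ignored.
        tf = Counter(tokens)
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vector(tokenize(query))
    scores = [(i, cosine(q, vector(toks))) for i, toks in enumerate(tokenized)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy "database" of three invented titles (illustrative only, not real citations).
docs = [
    "fish oil and blood viscosity",
    "gene expression profiling in leukemia",
    "raynaud syndrome and blood viscosity",
]
ranking = rank("raynaud blood viscosity", docs)
for index, score in ranking:
    print(f"{score:.3f}  {docs[index]}")
```

In this sketch the document sharing the most (and rarest) terms with the query is ranked first, and a document sharing no terms receives a score of zero. Production systems add many refinements on top of this idea, such as stemming, controlled-vocabulary indexing, and field weighting.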

Notes

  1. http://lucene.apache.org/

  2. http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed

  3. http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

  4. http://vivisimo.com/

  5. The complete term here is monostotic fibrous dysplasia; it was obtained through the analysis of adjective coordination in a noun phrase.

  6. MetaMap 2008 expands the abbreviation within the span of submitted text.

  7. Although these may actually be hypotheses, observations, results, etc., we refer to them uniformly as “facts.”

  8. EBIMed – http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp

  9. iHOP – http://www.ihop-net.org/

  10. Chilibot – http://www.chilibot.net/

  11. ARROWSMITH – http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi

  12. BITOLA – http://www.mf.uni-lj.si/bitola/

  13. Semantic MEDLINE – http://skr3.nlm.nih.gov/SemMedDemo/

References

  • Ahlers CB, Fiszman M, Demner-Fushman D, Lang FM, Rindflesch TC (2007) Extracting semantic predications from MEDLINE citations for pharmacogenomics. In: Pac Symp Biocomput 12. Maui, Hawaii, pp 209–220

  • Airola A, Pyysalo S, Bjorne J, Pahikkala T, Ginter F, Salakoski T (2008) A graph kernel for protein–protein interaction extraction. In: Proceedings of the workshop on current trends in biomedical natural language processing (BioNLP’08). Association for Computational Linguistics, pp 1–9

  • Alex B, Grover C, Haddow B, Kabadjov M, Klein E, Matthews M et al (2008) Assisted curation: Does text mining really help? In: Pac Symp Biocomput 13. Big Island, Hawaii, pp 556–567

  • Ananiadou S, Friedman C, Tsujii JI (eds) (2004) Named entity recognition in biomedicine. J Biomed Inform 37(6):393–528

  • Aronson AR (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In: Proc AMIA Symp, pp 17–21

  • Bodenreider O (2008) Biomedical ontologies in action: Role in knowledge management, data integration and decision support. IMIA Yearbook of Medical Informatics, pp 67–79

  • Branco A, McEnery T, Mitkov R (eds) (2005) Anaphora processing: Linguistic, cognitive and computational modelling. Current Issues in Linguistic Theory, vol 263. John Benjamins, Amsterdam and Philadelphia

  • Bruza P, Weeber M (eds) (2008) Literature-based discovery. Information Science and Knowledge Management, vol 15, Springer, Berlin Heidelberg New York

  • Carreras X, Màrquez L (2005) Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In: Proc 9th CoNLL. ACL, pp 152–164

  • Chang JT, Altman RB (2004) Extracting and characterizing gene-drug relationships from literature. Pharmacogenetics 14(9):577–586

  • Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG (2001) A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 34(5):301–310

  • Chen H, Sharp BM (2004) Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinform 5:147

  • Dee CR (2007) The development of the Medical Literature Analysis and Retrieval System (MEDLARS). J Med Libr Assoc 95(4):416–425. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2000779

  • Demner-Fushman D, Humphrey SM, Ide NC, Loane RF, Mork JG, Ruch P et al (2007) Combining resources to find answers to biomedical questions. In: The sixteenth text retrieval conference TREC-2007, Gaithersburg, MD, pp 205–215

  • Eaton AD (2006) HubMed: a web-based biomedical literature search interface. Nucleic Acids Res 34(Web Server issue):W745–W747

  • Firth JR (1957) Papers in linguistics, 1934–1951. Oxford University Press, London

  • Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A (2001) GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17(Suppl 1):S74–S82

  • Grishman R, Sundheim B (1996) Message understanding conference – 6: A brief history. In: Proc 16th COLING. ACL, pp 466–471

  • Habert B, Zweigenbaum P (2002) Contextual acquisition of information categories: what has been done and what can be done automatically? In: Nevin BE, Johnson SM (eds) The Legacy of Zellig Harris: Language and information into the 21st Century, Mathematics and computability of language, vol 2. John Benjamins, Amsterdam, pp 203–231

  • Hersh W, Cohen AM, Ruslen L, Roberts P (2007) TREC 2007 genomics track overview. In: The Sixteenth Text Retrieval Conference – TREC 2007. NIST

  • Hersh WR, Greenes RA (1990) SAPHIRE: An information retrieval system featuring concept matching, automatic indexing, probabilistic retrieval, and hierarchical relationships. Comput Biomed Res 23(5):410–425

  • Hirschman L (2007) The second biocreative evaluation: Lessons learned and future directions. In: Fifth Fraunhofer-symposium on text mining. Bonn, Germany, http://www.scai.fraunhofer.de/fileadmin/download/vortraege/tms_07/Lynette_Hirschmann.pdf

  • Hliaoutakis A, Varelas G, Petrakis EGM, Milios EE (2006) MedSearch: A retrieval system for medical information based on semantic similarity. In: Proc ECDL, Lecture Notes in Computer Science 4172. Springer, Berlin Heidelberg New York

  • Hoffmann R, Valencia A (2004) A gene network for navigating the literature. Nat Genet 36:664

  • Hristovski D, Friedman C, Rindflesch TC, Peterlin B (2006) Exploiting semantic relations for literature-based discovery. In: AMIA Annu Symp Proc, pp 349–353

  • Hristovski D, Peterlin B, Mitchell JA, Humphrey SM (2005) Using literature-based discovery to identify disease candidate genes. Int J Med Inform 74(2–4):289–298

  • Ide NC, Loane RF, Demner-Fushman D (2007) Essie: A concept based search engine for structured biomedical text. J Am Med Inform Assoc 14(3):253–263

  • Jacquemin C (2001) Spotting and discovering terms through NLP. MIT Press, Cambridge, MA

  • Jelier R, Jenster G, Dorssers LCJ, Wouters BJ, Hendriksen PJM, Mons B et al (2007) Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation. BMC Bioinform 8:14

  • Jelier R, Schuemie MJ, Veldhoven A, Dorssers LCJ, Jenster G, Kors JA (2008) Anni 2.0: A multipurpose text-mining tool for the life sciences. Genome Biol 9(6):R96

  • Karamanis N, Lewin I, Seal R, Drysdale R, Briscoe E (2007) Integrating natural language processing with flybase curation. In: Pac Symp Biocomput 12. Maui, Hawaii, pp 245–256

  • Krallinger M, Leitner F, Valencia A (2007) Assessment of the second BioCreative PPI task: Automatic extraction of protein–protein interactions. In: Proceedings of the BioCreAtIvE II Workshop, Madrid, pp 41–54

  • Lee LC, Horn F, Cohen FE (2007) Automatic extraction of protein point mutations using a graph bigram association. PLoS Comput Biol 3(2):e16

  • Leroy G, Chen H (2005) Genescene: An ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts. J Am Soc Inf Sci Technol (JASIST) 56(5):457–468

  • Lewis J, Ossowski S, Hicks J, Errami M, Garner HR (2006) Text similarity: an alternative way to search MEDLINE. Bioinformatics 22(18):2298–2304

  • Lindberg DA, Humphreys BL, McCray AT (1993) The unified medical language system. Methods Inf Med 32(4):281–291

  • Lussier Y, Borlawsky T, Rappaport D, Liu Y, Friedman C (2006) PhenoGO: Assigning phenotypic context to Gene Ontology annotations with natural language processing. In: Pac Symp Biocomput 11. Maui, Hawaii, pp 64–75

  • Miles WD (1992) A history of the national library of medicine: The Nation’s Treasury of Medical Knowledge. Bernan Assoc. http://www.nlm.nih.gov/hmd/manuscripts/miles/miles.pdf. Accessed 4 August 2008

  • Miyao Y, Ohta T, Masuda K, Tsuruoka Y, Yoshida K, Ninomiya T, et al (2006) Semantic retrieval for the accurate identification of relational concepts in massive textbases. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006). Sydney, Australia, pp 1017–1024

  • Morgan AA, Hirschman L (2007) Overview of BioCreative II gene normalisation. In: Proceedings of the BioCreAtIvE II Workshop, Madrid, pp 17–27

  • Müller HM, Kenny EE, Sternberg PW (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2(11):e309

  • Mutalik PG, Deshpande A, Nadkarni PM (2001) Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. J Am Med Inform Assoc 8(6):598–609

  • Nakov P, Hearst M (2006) Using verbs to characterize noun–noun relations. In: Proceedings of the twelfth international conference on artificial intelligence: Methodology, systems, applications (AIMSA), Bulgaria

  • Namer F, Baud R (2007) Defining and relating biomedical terms: Towards a cross-language morphosemantics-based system. Int J Med Inform 76(2–3):226–233

  • Palakal M, Bright J, Sebastian T, Hartanto S (2007) A comparative study of cells in inflammation, EAE and MS using biomedical literature data mining. J Biomed Sci 14(1):67–85

  • Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137

  • Pratt W, Hearst M, Fagan L (1999) A knowledge-based approach to organizing retrieved documents. In: Proceedings of 16th annual conference on artificial intelligence (AAAI-99). Orlando, FL, pp 80–85

  • Rebholz-Schuhmann D, Kirsch H, Arregui M, Gaudan S, Riethoven M, Stoehr P (2007) EBIMed–text crunching to gather facts for proteins from Medline. Bioinformatics 23(2):e237–e244

  • Rebholz-Schuhmann D, Kirsch H, Couto F (2005) Facts from text – is text mining ready to deliver? PLoS Biol 3(2):e65

  • Rindflesch TC, Tanabe L, Weinstein JN, Hunter L (2000) EDGAR: Extraction of drugs, genes and relations from the biomedical literature. Pac Symp Biocomput 517–528

  • Sanchez-Graillet O, Poesio M (2007) Negation of protein–protein interactions: Analysis and extraction. Bioinformatics 23(13):i424–i432

  • Sandler T, Schein AI, Ungar LH (2006) Automatic term list generation for entity tagging. Bioinformatics 22(6):651–657

  • Smalheiser NR, Torvik VI, Bischoff-Grethe A, Burhans LB, Gabriel M, Homayouni R, et al (2006) Collaborative development of the Arrowsmith two node search interface designed for laboratory investigators. J Biomed Discov Collab 1(8). http://www.j-biomed-discovery.com/content/1/1/8

  • Smith L, Rindflesch T, Wilbur WJ (2004) MedPost: A part-of-speech tagger for bioMedical text. Bioinformatics 20(14):2320–2321

  • Srinivasan P, Libbus B (2004) Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics 20(Suppl 1):i290–i296

  • Surdeanu M, Harabagiu S, Williams J, Aarseth P (2003) Using predicate-argument structures for information extraction. In: Proc 41st ACL. Sapporo, Japan, pp 8–15

  • Swanson DR (1986) Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in biology and medicine 30:7–18

  • Swanson DR, Smalheiser NR, Torvik VI (2006) Ranking indirect connections in literature-based discovery: The role of medical subject headings. J Am Soc Inf Sci Technol 57(11):1427–1439

  • Szarvas G, Vincze V, Farkas R, Csirik J (2008) The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts. In: Proceedings of the workshop on current trends in biomedical natural language processing (BioNLP’08). Columbus, Ohio, pp 38–45

  • Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN (1999) MedMiner: An internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques 27:1210–1217

  • The Unicode Consortium (2007) The Unicode Standard, Version 5.0. Addison-Wesley, Boston, MA

  • Tsai RTH, Chou WC, Lin YC, Sung CL, Ku W, Su YS, et al (2006) BIOSMILE: Adapting semantic role labeling for biomedical verbs: An exponential model coupled with automatically generated template features. In: HLT-NAACL BioNLP, pp 57–64

  • Wang X (2007) Rule-based protein term identification with help from automatic species tagging. In: Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2007), Lecture Notes in Computer Science 4394, Springer, Berlin Heidelberg New York, pp 288–298

  • Wilbur J, Smith L, Tanabe L (2007) BioCreative 2. Gene Mention Task. In: Proceedings of the BioCreAtIvE II Workshop 2007, Madrid, Spain, pp 7–16

  • Wren JD, Garner HR (2004) Shared relationship analysis: Ranking set cohesion and commonalities within a literature-derived relationship network. Bioinformatics 20(2):191–198

  • Yeh A, Morgan A, Colosimo M, Hirschman L (2005) BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinform 6(Suppl 1)

Textbooks and Introductions to NLP and BioNLP

  • Allen JF (1995) Natural language understanding, 2nd edn. Benjamin/Cummings, Menlo Park, CA

  • Ananiadou S, Kell DB, Tsujii JI (2006) Text mining and its potential applications in systems biology. Trends Biotechnol 24(12):571–579

  • Ananiadou S, McNaught J (2006) Text mining for biology and biomedicine. Artech House Publishers, Norwood, Massachusetts, USA

  • de Bruijn B, Martin J (2002) Getting to the (c)ore of knowledge: Mining biomedical literature. Int J Med Inform 67:7–18

  • Cohen AM, Hersh W (2005) A survey of current work in biomedical text mining. Brief Bioinform 6(1):57–71

  • Cohen KB, Hunter L (2004) Natural language processing and systems biology. In: Dubitzky W, Azuaje F (eds) Artificial intelligence methods and tools for systems biology. Springer, Norwell, MA, pp 147–174

  • Cohen KB, Hunter L (2008) Getting started in text mining. PLoS Comput Biol 4(1):e20

  • Hearst MA (2003) What is text mining? Available online at http://www.ischool.berkeley.edu/~hearst/text-mining.html

  • Hunter L, Cohen KB (2006) Biomedical language processing: what’s beyond PubMed? Mol Cell 21:589–594

  • Jackson P, Moulinier I (2002) Natural language processing for online applications: text retrieval, extraction, and categorization. John Benjamins Publishing Company

  • Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7:119–129

  • Jurafsky D, Martin JH (2000) Speech and language processing: An introduction to natural language processing, computational linguistics and speech recognition. Prentice Hall, Lebanon, Indiana

  • Krallinger M, Valencia A (2005) Text-mining and information-retrieval services for molecular biology. Genome Biol 6(7):224

  • Mitkov R (ed) (2003) The Oxford handbook of computational linguistics. Oxford University Press, New York

  • Shatkay H (2005) Hairpins in bookstacks: Information retrieval from biomedical text. Brief Bioinform 6(3):222–238

  • Shatkay H, Craven M (2007) Biomedical text mining, MIT Press, Cambridge

  • Spasic I, Ananiadou S, McNaught J, Kumar A (2005) Text mining and ontologies in biomedicine: Making sense of raw text. Brief Bioinform 6(3):239–251

  • Weeber M, Kors JA, Mons B (2005) Online tools to support literature-based discovery in the life sciences. Brief Bioinform 6(3):277–286

Author information

Corresponding author

Correspondence to Pierre Zweigenbaum.

Appendix

Lists of Tools and Services

State-of-the-art NLP tools:

http://aclweb.org/aclwiki/index.php?title=State_of_the_art

Resource list compiled by Kevin Bretonnel Cohen:

http://compbio.uchsc.edu/corpora/bcresources.html

Resource list compiled by Robert Futrelle:

http://www.bionlp.org/

BioCreAtIvE bio-NLP tools:

http://biocreative.sourceforge.net/bionlp_tools_links.html

NLP and Text Mining Research list at NaCTeM:

http://www.nactem.ac.uk/research.php?view=4

Arrowsmith:

http://arrowsmith.psych.uic.edu/arrowsmith_uic/tools.html

The Open Directory Project:

http://www.dmoz.org/Science/Biology/Bioinformatics/Software/

The National Centers for Biomedical Computing (NCBC) funded under the NIH Roadmap for Bioinformatics and Computational Biology:

http://www.ncbcs.org/

Gene and Protein Name Resources

Entrez Gene:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene

FlyBase:

http://flybase.org/

HUGO Gene:

http://www.genenames.org/index.html

Model organisms:

http://www.nih.gov/science/models/

Mouse Genome Informatics:

http://www.informatics.jax.org/

Saccharomyces Genome Database:

http://www.yeastgenome.org/gene_list.shtml

The Worldwide Protein Data Bank:

http://www.wwpdb.org/

UniProt:

http://www.ebi.ac.uk/uniprot/

Biomedical Terminologies

The National Center for Biomedical Ontology:

http://bioontology.org/

The Open Biomedical Ontologies:

http://www.obofoundry.org/

Resources for Biomedical Terminology and Ontology:

http://www.ldc.upenn.edu/mamandel/itre/term.html#Dictionaries

Unified Medical Language System:

http://www.nlm.nih.gov/research/umls/

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Zweigenbaum, P., Demner-Fushman, D. (2009). Advanced Literature-Mining Tools. In: Edwards, D., Stajich, J., Hansen, D. (eds) Bioinformatics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92738-1_17
