Bioinformatics pp 347-380

Advanced Literature-Mining Tools

  • Pierre Zweigenbaum
  • Dina Demner-Fushman


The complexity and wide range of current biomedical research are reflected in the number and scope of biomedical publications. Because of this abundance, scientists often can no longer keep up with publications in their specific areas of research, let alone find, read, and analyze potentially related scientific publications. Real advances in research, however, can be achieved only if a researcher can obtain an overview of the state of a given research question in a timely manner. This chapter presents methods to help researchers access the content of the biomedical literature. Information Retrieval (IR) identifies, in a large document database, the documents that are most relevant to a search topic provided by a user. Natural Language Processing (NLP) affords finer-grained access to more precise information contained in texts, which opens up a range of data analysis and knowledge synthesis functionalities. Powerful tools have been designed to exploit these techniques for the benefit of biomedical researchers, extracting millions of facts from the published literature and assisting Literature-Based Discovery. The chapter is organized as follows. It first describes the current capacities of IR from the Medline® bibliographic database. A short introduction to the main concepts of NLP follows. Tasks that build on NLP are then presented: Information Extraction and its derivatives, and Literature-Based Discovery. A review of some existing applications closes the chapter. The references cited in the text are supplemented by a list of textbooks and Web resources.


Keywords: Noun Phrase; Natural Language Processing; Name Entity Recognition; Relation Extraction; Annotate Corpus



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. LIMSI-CNRS, Orsay, France
