Advanced Literature-Mining Tools

Zweigenbaum, Pierre; Demner-Fushman, Dina

doi:10.1007/978-0-387-92738-1_17

Abstract

The complexity and wide range of current biomedical research is reflected in the number and scope of biomedical publications. Due to this abundance, scientists are often no longer capable of keeping up with publications in their specific areas of research, let alone finding, reading, and analyzing potentially related scientific publications. Real advances in research, however, can be achieved only if a researcher can obtain an overview of the state of a given research question in a timely manner. This chapter presents methods to help researchers access the content of the biomedical literature. Information Retrieval (IR) identifies, in a large document database, the documents that are most relevant to a search topic provided by a user. Natural Language Processing (NLP) affords finer-grained access to more precise information contained in texts, which opens up a range of data analysis and knowledge synthesis functionalities. Powerful tools have been designed to exploit these techniques for the benefit of biomedical researchers, extracting millions of facts from the published literature and assisting Literature-Based Discovery. This chapter is organized as follows. It first describes the current capacities of IR from the Medline® bibliographic database. A short introduction to the main concepts of Natural Language Processing follows. Tasks which build on Natural Language Processing are then presented: Information Extraction and its derivatives and Literature-Based Discovery. A review of some existing applications closes the chapter. The references cited in the text are supplemented by a list of textbooks and Web resources.
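To make the IR task described in the abstract concrete, the following is a minimal sketch of relevance ranking, assuming a TF-IDF weighting with cosine similarity. The chapter preview does not specify a ranking method, so this technique, the function names (`tokenize`, `rank`), and the toy documents are illustrative assumptions rather than the authors' approach or the way Medline is actually searched.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return (doc_index, score) pairs sorted by decreasing similarity.

    Illustrative assumption: documents and query are weighted with TF-IDF
    and compared by cosine similarity; real search engines differ.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (always positive).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        # Terms unseen in the collection carry no weight.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)

    q = vectorize(tokenize(query))
    scores = [(i, cosine(q, vectorize(toks))) for i, toks in enumerate(tokenized)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy collection, invented for illustration only.
    docs = [
        "Fish oil reduces blood viscosity in patients",
        "Raynaud syndrome is associated with high blood viscosity",
        "Gene expression profiling of acute myeloid leukemia",
    ]
    for index, score in rank("leukemia gene expression", docs):
        print(index, round(score, 3))
```

A document that shares no term with the query scores zero, which shows why purely lexical matching misses related documents and why the finer-grained NLP techniques mentioned in the abstract are needed.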

Notes

1. http://lucene.apache.org/
2. http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed
3. http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
4. http://vivisimo.com/
5. The complete term here is monostotic fibrous dysplasia; it was obtained through the analysis of adjective coordination in a noun phrase.
6. MetaMap 2008 expands the abbreviation within the span of submitted text.
7. Although these may actually be hypotheses, observations, results, etc., we refer to them uniformly as “facts.”
8. EbiMed – http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
9. iHOP – http://www.ihop-net.org/
10. Chilibot – http://www.chilibot.net/
11. ARROWSMITH – http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi
12. BITOLA – http://www.mf.uni-lj.si/bitola/
13. Semantic MEDLINE – http://skr3.nlm.nih.gov/SemMedDemo/

References

Ahlers CB, Fiszman M, Demner-Fushman D, Lang FM, Rindflesch TC (2007) Extracting semantic predications from MEDLINE citations for pharmacogenomics. In: Pac Symp Biocomput 12. Maui, Hawaii, pp 209–220
Airola A, Pyysalo S, Bjorne J, Pahikkala T, Ginter F, Salakoski T (2008) A graph kernel for protein–protein interaction extraction. In: Proceedings of the workshop on current trends in biomedical natural language processing (BioNLP’08). Association for Computational Linguistics, pp 1–9
Alex B, Grover C, Haddow B, Kabadjov M, Klein E, Matthews M et al (2008) Assisted curation: Does text mining really help? In: Pac Symp Biocomput 13. Big Island, Hawaii, pp 556–567
Ananiadou S, Friedman C, Tsujii JI (eds) (2004) Named entity recognition in biomedicine. J Biomed Inform 37(6):393–528
Aronson AR (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In: Proc AMIA Symp, pp 17–21
Bodenreider O (2008) Biomedical ontologies in action: Role in knowledge management, data integration and decision support. IMIA Yearbook of Medical Informatics, pp 67–79
Branco A, McEnery T, Mitkov R (eds) (2005) Anaphora processing: Linguistic, cognitive and computational modelling. Current Issues in Linguistic Theory, vol 263. John Benjamins, Amsterdam and Philadelphia
Bruza P, Weeber M (eds) (2008) Literature-based discovery. Information Science and Knowledge Management, vol 15, Springer, Berlin Heidelberg New York
Carreras X, Màrquez L (2005) Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In: Proc 9th CoNLL. ACL, pp 152–164
Chang JT, Altman RB (2004) Extracting and characterizing gene-drug relationships from literature. Pharmacogenetics 14(9):577–586
Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG (2001) A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 34(5):301–310
Chen H, Sharp BM (2004) Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinform 5:147
Dee CR (2007) The development of the Medical Literature Analysis and Retrieval System (MEDLARS). J Med Libr Assoc 95(4):416–425. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2000779
Demner-Fushman D, Humphrey SM, Ide NC, Loane RF, Mork JG, Ruch P et al (2007) Combining resources to find answers to biomedical questions. In: The sixteenth text retrieval conference TREC-2007, Gaithersburg, MD, pp 205–215
Eaton AD (2006) HubMed: a web-based biomedical literature search interface. Nucleic Acids Res 34(Web Server issue):W745–W747
Firth JR (1957) Papers in linguistics, 1934–1951. Oxford University Press, London
Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A (2001) GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17(Suppl 1):S74–S82
Grishman R, Sundheim B (1996) Message understanding conference – 6: A brief history. In: Proc 16th COLING. ACL, pp 466–471
Habert B, Zweigenbaum P (2002) Contextual acquisition of information categories: what has been done and what can be done automatically? In: Nevin BE, Johnson SM (eds) The Legacy of Zellig Harris: Language and information into the 21st Century, Mathematics and computability of language, vol 2. John Benjamins, Amsterdam, pp 203–231
Hersh W, Cohen AM, Ruslen L, Roberts P (2007) TREC 2007 genomics track overview. In: The Sixteenth Text Retrieval Conference – TREC 2007. NIST
Hersh WR, Greenes RA (1990) SAPHIRE: An information retrieval system featuring concept matching, automatic indexing, probabilistic retrieval, and hierarchical relationships. Comput Biomed Res 23(5):410–425
Hirschman L (2007) The second biocreative evaluation: Lessons learned and future directions. In: Fifth Fraunhofer-symposium on text mining. Bonn, Germany, http://www.scai.fraunhofer.de/fileadmin/download/vortraege/tms_07/Lynette_Hirschmann.pdf
Hliaoutakis A, Varelas G, Petrakis EGM, Milios EE (2006) MedSearch: A retrieval system for medical information based on semantic similarity. In: Proc ECDL, Lecture Notes in Computer Science 4172. Springer, Berlin Heidelberg New York
Hoffmann R, Valencia A (2004) A gene network for navigating the literature. Nat Genet 36:664
Hristovski D, Friedman C, Rindflesch TC, Peterlin B (2006) Exploiting semantic relations for literature-based discovery. In: AMIA Annu Symp Proc, pp 349–353
Hristovski D, Peterlin B, Mitchell JA, Humphrey SM (2005) Using literature-based discovery to identify disease candidate genes. Int J Med Inform 74(2–4):289–298
Ide NC, Loane RF, Demner-Fushman D (2007) Essie: A concept based search engine for structured biomedical text. J Am Med Inform Assoc 14(3):253–263
Jacquemin C (2001) Spotting and discovering terms through NLP. MIT Press, Cambridge, MA
Jelier R, Jenster G, Dorssers LCJ, Wouters BJ, Hendriksen PJM, Mons B et al (2007) Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation. BMC Bioinform 8:14
Jelier R, Schuemie MJ, Veldhoven A, Dorssers LCJ, Jenster G, Kors JA (2008) Anni 2.0: A multipurpose text-mining tool for the life sciences. Genome Biol 9(6):R96
Karamanis N, Lewin I, Seal R, Drysdale R, Briscoe E (2007) Integrating natural language processing with flybase curation. In: Pac Symp Biocomput 12. Maui, Hawaii, pp 245–256
Krallinger M, Leitner F, Valencia A (2007) Assessment of the second BioCreative PPI task: Automatic extraction of protein–protein interactions. In: Proceedings of the BioCreAtIvE II Workshop, Madrid, pp 41–54
Lee LC, Horn F, Cohen FE (2007) Automatic extraction of protein point mutations using a graph bigram association. PLoS Comput Biol 3(2):e16
Leroy G, Chen H (2005) Genescene: An ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts. J Am Soc Inf Sci Technol (JASIST) 56(5):457–468
Lewis J, Ossowski S, Hicks J, Errami M, Garner HR (2006) Text similarity: an alternative way to search MEDLINE. Bioinformatics 22(18):2298–2304
Lindberg DA, Humphreys BL, McCray AT (1993) The unified medical language system. Methods Inf Med 32(4):281–291
Lussier Y, Borlawsky T, Rappaport D, Liu Y, Friedman C (2006) PhenoGO: Assigning phenotypic context to Gene Ontology annotations with natural language processing. In: Pac Symp Biocomput 11. Maui, Hawaii, pp 64–75
Miles WD (1992) A history of the national library of medicine: The Nation’s Treasury of Medical Knowledge. Bernan Assoc. http://www.nlm.nih.gov/hmd/manuscripts/miles/miles.pdf. Accessed 4 August 2008
Miyao Y, Ohta T, Masuda K, Tsuruoka Y, Yoshida K, Ninomiya T, et al (2006) Semantic retrieval for the accurate identification of relational concepts in massive textbases. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006). Sydney, Australia, pp 1017–1024
Morgan AA, Hirschman L (2007) Overview of BioCreative II gene normalisation. In: Proceedings of the BioCreAtIvE II Workshop, Madrid, pp 17–27
Müller HM, Kenny EE, Sternberg PW (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2(11):e309
Mutalik PG, Deshpande A, Nadkarni PM (2001) Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. J Am Med Inform Assoc 8(6):598–609
Nakov P, Hearst M (2006) Using verbs to characterize noun–noun relations. In: Proceedings of the twelfth international conference on artificial intelligence: Methodology, systems, applications (AIMSA), Bulgaria
Namer F, Baud R (2007) Defining and relating biomedical terms: Towards a cross-language morphosemantics-based system. Int J Med Inform 76(2–3):226–233
Palakal M, Bright J, Sebastian T, Hartanto S (2007) A comparative study of cells in inflammation, EAE and MS using biomedical literature data mining. J Biomed Sci 14(1):67–85
Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
Pratt W, Hearst M, Fagan L (1999) A knowledge-based approach to organizing retrieved documents. In: Proceedings of 16th annual conference on artificial intelligence (AAAI-99). Orlando, FL, pp 80–85
Rebholz-Schuhmann D, Kirsch H, Arregui M, Gaudan S, Riethoven M, Stoehr P (2007) EBIMed–text crunching to gather facts for proteins from Medline. Bioinformatics 23(2):e237–e244
Rebholz-Schuhmann D, Kirsch H, Couto F (2005) Facts from text – is text mining ready to deliver? PLoS Biol 3(2):e65
Rindflesch TC, Tanabe L, Weinstein JN, Hunter L (2000) EDGAR: Extraction of drugs, genes and relations from the biomedical literature. Pac Symp Biocomput 517–528
Sanchez-Graillet O, Poesio M (2007) Negation of protein–protein interactions: Analysis and extraction. Bioinformatics 23(13):i424–i432
Sandler T, Schein AI, Ungar LH (2006) Automatic term list generation for entity tagging. Bioinformatics 22(6):651–657
Smalheiser NR, Torvik VI, Bischoff-Grethe A, Burhans LB, Gabriel M, Homayouni R, et al (2006) Collaborative development of the Arrowsmith two node search interface designed for laboratory investigators. J Biomed Discov Collab 1(8). http://www.j-biomed-discovery.com/content/1/1/8
Smith L, Rindflesch T, Wilbur WJ (2004) MedPost: A part-of-speech tagger for bioMedical text. Bioinformatics 20(14):2320–2321
Srinivasan P, Libbus B (2004) Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics 20(Suppl 1):i290–i296
Surdeanu M, Harabagiu S, Williams J, Aarseth P (2003) Using predicate-argument structures for information extraction. In: Proc 41st ACL. Sapporo, Japan, pp 8–15
Swanson DR (1986) Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in biology and medicine 30:7–18
Swanson DR, Smalheiser NR, Torvik VI (2006) Ranking indirect connections in literature-based discovery: The role of medical subject headings. J Am Soc Inf Sci Technol 57(11):1427–1439
Szarvas G, Vincze V, Farkas R, Csirik J (2008) The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts. In: Proceedings of the 2008 Workshop on Biomedical Natural Language Processing (BioNLP’08), Columbus, Ohio, pp 38–45
Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN (1999) MedMiner: An internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques 27:1210–1217
The Unicode Standard (2007), Version 5.0 Addison-Wesley, Boston, MA
Tsai RTH, Chou WC, Lin YC, Sung CL, Ku W, Su YS, et al (2006) BIOSMILE: Adapting semantic role labeling for biomedical verbs: An exponential model coupled with automatically generated template features. In: HLT-NAACL BioNLP, pp 57–64
Wang X (2007) Rule-based protein term identification with help from automatic species tagging. In: Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2007), Lecture Notes in Computer Science 4394, Springer, Berlin Heidelberg New York, pp 288–298
Wilbur J, Smith L, Tanabe L (2007) BioCreative 2. Gene Mention Task. In: Proceedings of the BioCreAtIvE II Workshop 2007, Madrid, Spain, pp 7–16
Wren JD, Garner HR (2004) Shared relationship analysis: Ranking set cohesion and commonalities within a literature-derived relationship network. Bioinformatics 20(2):191–198
Yeh A, Morgan A, Colosimo M, Hirschman L (2005) BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinform 6(Suppl 1)

Textbooks and Introductions to NLP and BioNLP

Allen JF (1995) Natural language understanding, 2nd edn. Benjamin/Cummings, Menlo Park, CA
Ananiadou S, Kell DB, Tsujii JI (2006) Text mining and its potential applications in systems biology. Trends Biotechnol 24(12):571–579
Ananiadou S, McNaught J (2006) Text mining for biology and biomedicine. Artech House Publishers, Norwood, Massachusetts, USA
de Bruijn B, Martin J (2002) Getting to the (c)ore of knowledge: Mining biomedical literature. Int J Med Inform 67:7–18
Cohen AM, Hersh W (2005) A survey of current work in biomedical text mining. Brief Bioinform 6(1):57–71
Cohen KB, Hunter L (2004) Natural language processing and systems biology. In: Dubitzky W, Azuaje F (eds) Artificial intelligence methods and tools for systems biology. Springer, Norwell, MA, pp 147–174
Cohen KB, Hunter L (2008) Getting started in text mining. PLoS Comput Biol 4(1):e20
Hearst MA (2003) What is text mining? Available online at http://www.ischool.berkeley.edu/~hearst/text-mining.html
Hunter L, Cohen KB (2006) Biomedical language processing: what’s beyond PubMed? Mol Cell 21:589–594
Jackson P, Moulinier I (2002) Natural language processing for online applications: text retrieval, extraction, and categorization. John Benjamins Publishing Company
Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7:119–129
Jurafsky D, Martin JH (2000) Speech and language processing: An introduction to natural language processing, computational linguistics and speech recognition. Prentice Hall, Lebanon, Indiana
Krallinger M, Valencia A (2005) Text-mining and information-retrieval services for molecular biology. Genome Biol 6(7):224
Mitkov R (ed) (2003) The Oxford handbook of computational linguistics. Oxford University Press, New York
Shatkay H (2005) Hairpins in bookstacks: Information retrieval from biomedical text. Brief Bioinform 6(3):222–238
Shatkay H, Craven M (2007) Biomedical text mining, MIT Press, Cambridge
Spasic I, Ananiadou S, McNaught J, Kumar A (2005) Text mining and ontologies in biomedicine: Making sense of raw text. Brief Bioinform 6(3):239–251
Weeber M, Kors JA, Mons B (2005) Online tools to support literature-based discovery in the life sciences. Brief Bioinform 6(3):277–286

Author information

Authors and Affiliations

LIMSI-CNRS, Orsay, France
Pierre Zweigenbaum

Authors

Pierre Zweigenbaum
Dina Demner-Fushman

Corresponding author

Correspondence to Pierre Zweigenbaum.

Editor information

Editors and Affiliations

Inst. Molecular Bioscience, University of Queensland, St. Lucia, 4072, Australia
David Edwards
Dept. Plant & Microbial Biology, University of California, Berkeley, Koshland Hall 111, Berkeley, 94720, U.S.A.
Jason Stajich
e-Health Research Centre, Adelaide St. 300, Brisbane, 4000, Australia
David Hansen

Appendices

Appendix

Lists of Tools and Services

State of the art NLP tools:

http://aclweb.org/aclwiki/index.php?title=State_of_the_art

Resource list compiled by Kevin Bretonnel Cohen:

http://compbio.uchsc.edu/corpora/bcresources.html

Resource list compiled by Robert Futrelle:

http://www.bionlp.org/

BioCreAtIvE bio-NLP tools:

http://biocreative.sourceforge.net/bionlp_tools_links.html

NLP and Text Mining Research list at NaCTeM:

http://www.nactem.ac.uk/research.php?view=4

Arrowsmith:

http://arrowsmith.psych.uic.edu/arrowsmith_uic/tools.html

The Open Directory Project:

http://www.dmoz.org/Science/Biology/Bioinformatics/Software/

The National Centers for Biomedical Computing (NCBC) funded under the NIH Roadmap for Bioinformatics and Computational Biology:

http://www.ncbcs.org/

Gene and Protein Name Resources

Entrez Gene:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene

FlyBase:

http://flybase.org/

HUGO Gene:

http://www.genenames.org/index.html

Model organisms:

http://www.nih.gov/science/models/

Mouse Genome Informatics:

http://www.informatics.jax.org/

Saccharomyces Genome Database:

http://www.yeastgenome.org/gene_list.shtml

The Worldwide Protein Data Bank:

http://www.wwpdb.org/

UniProt:

http://www.ebi.ac.uk/uniprot/

Biomedical Terminologies

The National Center for Biomedical Ontology:

http://bioontology.org/

The Open Biomedical Ontologies:

http://www.obofoundry.org/

Resources for Biomedical Terminology and Ontology:

http://www.ldc.upenn.edu/mamandel/itre/term.html#Dictionaries

Unified Medical Language System:

http://www.nlm.nih.gov/research/umls/

About this chapter

Cite this chapter

Zweigenbaum, P., Demner-Fushman, D. (2009). Advanced Literature-Mining Tools. In: Edwards, D., Stajich, J., Hansen, D. (eds) Bioinformatics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92738-1_17

DOI: https://doi.org/10.1007/978-0-387-92738-1_17
Published: 05 August 2009
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-92737-4
Online ISBN: 978-0-387-92738-1
eBook Packages: Biomedical and Life Sciences (R0)