Text Mining

  • Protocol
Bioinformatics

Part of the book series: Methods in Molecular Biology™ (MIMB, volume 453)

Abstract

One of the fastest-growing fields in bioinformatics is text mining: the application of natural language processing techniques to problems of knowledge management and discovery, using large collections of biological or biomedical text such as MEDLINE. The techniques used in text mining range from the very simple (e.g., inferring relationships between genes from their frequent proximity in documents) to the complex and computationally intensive (e.g., analyzing sentence structures with parsers in order to extract facts about protein-protein interactions from statements in the text).
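
As a rough illustration of the simple end of that range, the sketch below counts how often pairs of gene names appear in the same document. It is a minimal example under stated assumptions, not the method of any tool discussed in the chapter: the function name `cooccurrence_counts`, the whitespace tokenization, and the toy sentences are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, gene_names):
    """Count how many documents mention each unordered pair of gene names.

    Assumes single-token names and naive whitespace tokenization,
    both simplifications made for this sketch.
    """
    names = {g.lower() for g in gene_names}
    counts = Counter()
    for text in documents:
        tokens = set(text.lower().split())
        # Sort so each pair is always stored in the same order
        present = sorted(names & tokens)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Toy documents, invented for illustration only
docs = [
    "brat interacts with pum in the embryo",
    "pum and nos repress translation",
    "brat and pum form a complex",
]
counts = cooccurrence_counts(docs, ["brat", "pum", "nos"])
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are candidates for a relationship; a real system would also need to normalize for how often each name occurs on its own.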

This chapter presents a general introduction to some of the key principles and challenges of natural language processing, and introduces some of the tools available to end-users and developers. A case study describes the construction and testing of a simple tool designed to tackle a task that is crucial to almost any application of text mining in bioinformatics: identifying gene/protein names in text and mapping them onto records in an external database.
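
The name-identification task can be sketched as a dictionary lookup: scan the text for known names and return the database record each one maps to. The example below is a hedged illustration only. It uses a plain regular-expression scan rather than the case-study tool's own implementation, and the helper names (`build_matcher`, `tag_genes`), the record identifiers, and the toy dictionary are hypothetical.

```python
import re


def build_matcher(dictionary):
    """Compile one case-insensitive pattern covering every name in the dictionary."""
    # Longest names first, so multi-word names win over shorter alternatives
    names = sorted(dictionary, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def tag_genes(text, dictionary):
    """Return (start, end, matched_text, record_id) for each dictionary hit."""
    lookup = {name.lower(): rec for name, rec in dictionary.items()}
    matcher = build_matcher(dictionary)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


# Hypothetical dictionary: synonyms map to the same made-up record ID
toy_dictionary = {
    "brain tumor": "DB:0001",
    "brat": "DB:0001",
    "pumilio": "DB:0002",
}
hits = tag_genes("Brat binds Pumilio; brain tumor mutants overgrow.", toy_dictionary)
for hit in hits:
    print(hit)
```

Exact matching like this misses spelling variants and cannot tell a gene name from an identical common word, which is part of what makes the task hard in practice.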

Acknowledgments

This work was supported by the Biotechnology and Biological Sciences Research Council and AstraZeneca. The authors thank Mark Halling-Brown for supplying the dictionary and A. G. McDowell for implementing (and advising on) the Aho-Corasick algorithm.

Copyright information

© 2008 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Clegg, A.B., Shepherd, A.J. (2008). Text Mining. In: Keith, J.M. (ed.) Bioinformatics. Methods in Molecular Biology™, vol 453. Humana Press. https://doi.org/10.1007/978-1-60327-429-6_25

  • DOI: https://doi.org/10.1007/978-1-60327-429-6_25

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-60327-428-9

  • Online ISBN: 978-1-60327-429-6

  • eBook Packages: Springer Protocols
