Combining Biological Databases and Text Mining to Support New Bioinformatics Applications

  • René Witte
  • Christopher J. O. Baker
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)


A large amount of biological knowledge today is only available from full-text research papers. Since neither manual database curators nor users can keep up with the rapidly expanding volume of scientific literature, natural language processing approaches are becoming increasingly important for bioinformatics projects.

In this paper, we go beyond simply extracting information from full-text articles by describing an architecture that supports targeted access to information from biological databases using the results derived from text mining of research papers, thereby integrating information from both sources within a biological application.

The described architecture is currently being used to extract information about protein mutations from full-text research papers. Text mining results drive the retrieval of sequence information from protein databases and the use of algorithmic sequence analysis tools, which in turn facilitate further data access from protein structure databases. A complex mapping of NLP-derived text annotations to protein structures allows information that is not available in databases of mutation annotations to be rendered with 3D structure visualization.
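As a rough illustration of the first step in such a pipeline, the sketch below extracts point-mutation mentions from a sentence and checks them against a protein sequence before any structure lookup would take place. It is a minimal sketch under stated assumptions, not the authors' implementation: the regular expression, the `Mutation` class, the helper names, and the example sentence and sequence are all hypothetical, and only one-letter mentions of the form `N35D` are handled.

```python
import re
from dataclasses import dataclass

# One-letter point-mutation mentions such as "N35D":
# wild-type residue, 1-based position, mutant residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b")


@dataclass(frozen=True)
class Mutation:
    wild_type: str
    position: int  # 1-based position as written in the paper
    mutant: str


def extract_mutations(text):
    """Return all one-letter point-mutation mentions found in the text."""
    return [Mutation(w, int(p), m) for w, p, m in MUTATION_RE.findall(text)]


def matches_sequence(mutation, sequence, offset=0):
    """Check that the sequence carries the wild-type residue at the
    mentioned position. The offset is an assumption that allows for
    numbering differences between a paper and a database entry."""
    index = mutation.position - 1 + offset
    return 0 <= index < len(sequence) and sequence[index] == mutation.wild_type


# Hypothetical sentence and sequence, for illustration only.
text = "The N35D and E78Q variants showed reduced activity."
sequence = "M" * 34 + "N" + "A" * 42 + "E" + "G" * 10

mutations = extract_mutations(text)
grounded = [m for m in mutations if matches_sequence(m, sequence)]
```

Only mutations whose wild-type residue agrees with the retrieved sequence would be passed on to alignment and structure mapping; mentions that fail the check may point to a numbering offset or to a different protein.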


Keywords (added by machine, not by the authors): Protein Data Bank; Noun Phrase; Text Mining; Biological Database; Protein Engineer





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • René Witte (1)
  • Christopher J. O. Baker (2)
  1. Institute for Program Structures and Data Organization (IPD), Universität Karlsruhe (TH), Germany
  2. Department of Computer Science and Software Engineering, Concordia University, Montréal, Canada
