Intelligent Extraction Versus Advanced Query: Recognize Transcription Factors from Databases

  • Zhuo Zhang
  • Merlin Veronika
  • See-Kiong Ng
  • Vladimir B Bajic
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4146)


Many entries in major biological databases have incomplete functional annotation, which often makes it difficult to identify the entries that belong to a specific functional category. We combined information on protein functional domains and Gene Ontology descriptions to identify transcription factor (TF) entries in the Swiss-Prot and Entrez Gene databases with high accuracy. Our method uses support vector machines and efficiently separates TF entries from non-TF entries. In 10-fold cross-validation, the predictions achieved on average a positive predictive value of 97.5% and a sensitivity of 93.4%. Using this method, we scanned the entire Swiss-Prot and Entrez Gene databases and extracted 13826 unique TF entries. In a separate manual check of 500 randomly chosen extracted TF entries, non-TF (erroneous) entries were present in 2% of the cases.
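
For readers less familiar with the two metrics reported above, the following minimal Python sketch shows how positive predictive value and sensitivity are computed from confusion counts and averaged over cross-validation folds. The function names and the per-fold counts are hypothetical and chosen only for illustration; they are not taken from the paper and do not reproduce its results.

```python
from typing import List, Tuple


def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of entries predicted as TF that really are TFs: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no positive predictions; PPV is undefined")
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true TF entries that were recovered: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no positive examples; sensitivity is undefined")
    return tp / (tp + fn)


def average_over_folds(folds: List[Tuple[int, int, int]]) -> Tuple[float, float]:
    """Average PPV and sensitivity over folds given as (tp, fp, fn) counts."""
    ppvs = [positive_predictive_value(tp, fp) for tp, fp, _ in folds]
    sens = [sensitivity(tp, fn) for tp, _, fn in folds]
    return sum(ppvs) / len(ppvs), sum(sens) / len(sens)


# Hypothetical (tp, fp, fn) counts for two folds; illustrative values only.
example_folds = [(3, 1, 1), (9, 1, 3)]
mean_ppv, mean_sensitivity = average_over_folds(example_folds)
```

Averaging per-fold values, as done here, is one common convention; pooling the counts across folds before computing the ratios is another, and the abstract does not state which one was used.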


Gene Ontology; Transcription Factor Activity; Pfam Domain; Gene Ontology Annotation; Human Protein Reference Database
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Zhuo Zhang (1)
  • Merlin Veronika (1)
  • See-Kiong Ng (1)
  • Vladimir B Bajic (2)

  1. Institute for Infocomm Research, Singapore
  2. South African National Bioinformatics Institute, Bellville, South Africa
