ATDD: An Algorithmic Tool for Domain Discovery in Protein Sequences

  • Stanislav Angelov
  • Sanjeev Khanna
  • Li Li
  • Fernando Pereira
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3240)


Identifying sequence domains is essential for understanding protein function. Most current methods for protein domain identification rely on prior knowledge of homologous domains and on the construction of high-quality multiple sequence alignments. With the rapid accumulation of data from genome sequencing, it is important to be able to determine domain regions automatically from a set of proteins, based solely on sequence information.

We describe a new algorithm for automatic protein domain detection that does not require multiple sequence alignment. It differs from alignment-based methods by allowing arbitrary rearrangements of the domains (both in relative ordering and in distance) within the set of proteins under study. Moreover, our algorithm extracts domains purely through a comparative analysis of a given set of sequences; no auxiliary information is required. The method views protein sequences as collections of overlapping fixed-length blocks. A pair of blocks within a sequence gets a “vote of confidence” to be part of a domain if several other sequences have similar pairs of blocks at roughly the same distance from each other. Candidate domains are then identified by discovering regions in each protein sequence where most block pairs get strong votes of confidence. We applied our method to several test data sets with a fixed choice of parameters. To evaluate the results, we computed sensitivity and specificity measures using SMART-derived domain annotations as a reference.
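The block-pair voting idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how block similarity is measured, how "roughly the same distance" is defined, or how regions are extracted, so the Hamming-distance test, the `slack` and `max_dist` parameters, the `min_votes` threshold, and all function names are assumptions made only for illustration.

```python
def blocks(seq, k):
    """Return all overlapping length-k blocks of seq, indexed by start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def similar(a, b, max_mismatch):
    """Assumed similarity test: at most max_mismatch differing residues.
    (A substitution-matrix score could be used instead.)"""
    return sum(x != y for x, y in zip(a, b)) <= max_mismatch


def has_similar_pair(other, b1, b2, dist, k, max_mismatch, slack):
    """True if `other` contains a block similar to b1 and a block similar to b2
    whose start positions differ by dist, give or take `slack`."""
    obs = blocks(other, k)
    for p, ob in enumerate(obs):
        if not similar(ob, b1, max_mismatch):
            continue
        lo = max(0, p + dist - slack)
        hi = min(len(obs) - 1, p + dist + slack)
        for q in range(lo, hi + 1):
            if similar(obs[q], b2, max_mismatch):
                return True
    return False


def block_pair_votes(seqs, target, k=3, max_mismatch=0, slack=1, max_dist=10):
    """For each block pair (i, j) in seqs[target] with j - i <= max_dist,
    count how many OTHER sequences contain a similar pair at a similar distance."""
    bs = blocks(seqs[target], k)
    votes = {}
    for i in range(len(bs)):
        for j in range(i + 1, min(len(bs), i + max_dist + 1)):
            votes[(i, j)] = sum(
                has_similar_pair(o, bs[i], bs[j], j - i, k, max_mismatch, slack)
                for idx, o in enumerate(seqs)
                if idx != target
            )
    return votes


def supported_positions(votes, k, min_votes):
    """Simplified region step (an assumption): collect residue positions spanned
    by any block pair that received at least min_votes votes."""
    covered = set()
    for (i, j), v in votes.items():
        if v >= min_votes:
            covered.update(range(i, j + k))
    return sorted(covered)
```

As a usage example, calling `block_pair_votes(["MKCDEFGHW", "CDEFGHPPP", "PPCDEFGHP", "QQQQQQQQQ"], 0, slack=0)` and passing the result to `supported_positions(votes, 3, 2)` marks the shared segment `CDEFGH` in the first sequence, even though it sits at a different offset in each of the other sequences. The sketch scans every block pair against every other sequence, so it is meant to convey the voting scheme, not to be efficient.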


Keywords: Amino Acid Position; Substitution Matrix; Block Pair; Domain Position; Domain Annotation





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Stanislav Angelov (1)
  • Sanjeev Khanna (1)
  • Li Li (2)
  • Fernando Pereira (1)

  1. Department of Computer and Information Science, School of Engineering, University of Pennsylvania, USA
  2. Department of Biology, School of Arts and Sciences, University of Pennsylvania, Philadelphia, USA
