Protein Structure Prediction Based on Sequence Similarity

  • Lukasz Jaroszewski
Part of the Methods in Molecular Biology™ book series (MIMB, volume 569)


The observation that similar protein sequences fold into similar three-dimensional structures provides the basis for methods that predict structural features of a novel protein from the similarity between its sequence and the sequences of proteins with known structures. Similarity over an entire sequence or over large sequence fragments enables prediction and modeling of entire structural domains, while statistics derived from the distributions of local features in known protein structures make it possible to predict such features in proteins of unknown structure. The accuracy of protein structure models is sufficient for many practical purposes, such as the analysis of point mutation effects, enzymatic reactions, interaction interfaces of protein complexes, and active sites. Protein models are also used for phasing crystallographic data and, in some cases, for drug design. By using models, one can avoid the costly and time-consuming process of experimental structure determination. This chapter gives a practical review of the most popular protein structure prediction methods based on sequence similarity and outlines a practical approach to protein structure prediction. Although the main focus is on template-based protein structure prediction, the chapter also provides references to other methods and programs that play an important role in protein structure prediction.
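As a rough illustration of the idea that sequence similarity guides template choice, the sketch below computes percent identity between a query and candidate templates from alignments that are assumed to exist already. It is not the chapter's method: the function names `percent_identity` and `pick_template`, the template labels, and the toy sequences are all hypothetical, and counting only columns without gaps is just one of several conventions for defining identity. In practice the alignments would come from a sequence or profile comparison program rather than being typed by hand.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Both inputs are rows of one pairwise alignment, with '-' marking gaps.
    This gap-excluding definition is an assumption; other conventions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def pick_template(alignments: dict[str, tuple[str, str]]) -> tuple[str, float]:
    """Return the (hypothetical) template name with the highest percent identity.

    `alignments` maps a template name to (aligned query, aligned template).
    """
    if not alignments:
        raise ValueError("no candidate templates supplied")
    scored = {
        name: percent_identity(query, template)
        for name, (query, template) in alignments.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]


if __name__ == "__main__":
    # Toy alignments; template names and sequences are made up for illustration.
    candidates = {
        "template_A": ("MKV-LAAGIV", "MKVALSAGIV"),
        "template_B": ("MKVLAAGIV-", "MRTLQAGLVK"),
    }
    name, identity = pick_template(candidates)
    print(f"closest template: {name} ({identity:.1f}% identity)")
```

Percent identity alone is a coarse guide. At low identity, alignment quality and the statistical significance reported by the search method matter more than the raw percentage.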

Key words

Protein structure prediction; Sequence homology; Protein sequence alignment; Comparative modeling; Fold recognition



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Lukasz Jaroszewski, The Burnham Institute, La Jolla, USA
