Data Mining for Protein Secondary Structure Prediction

  • Haitao Cheng
  • Taner Z. Sen
  • Robert L. Jernigan
  • Andrzej Kloczkowski
Chapter
Part of the Structure and Bonding book series (STRUCTURE, volume 134)

Abstract

Accurate prediction of protein secondary structure from the amino acid sequence is essential for almost all theoretical and experimental studies of protein structure and function. After a brief discussion of the application of data mining to the optimization of crystallization conditions for target proteins, we show that data mining of structural fragments from known structures in the Protein Data Bank (PDB) significantly improves the accuracy of secondary structure predictions. We proposed the original method a few years ago and termed it fragment database mining (FDM) (Cheng H, Sen TZ, Kloczkowski A, Margaritis D, Jernigan RL (2005) Prediction of protein secondary structure by mining structural fragment database. Polymer 46:4314–4321). FDM mines the structural segments of the PDB and uses structural information from the matching sequence fragments to predict protein secondary structure. It gives excellent accuracy when similar sequence fragments are available in our library of structural fragments, but is less successful when such fragments are absent from the fragment database. Recently we have improved secondary structure predictions further by combining FDM with classical GOR V predictions (Kloczkowski A, Ting KL, Jernigan RL, Garnier J (2002a) Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence. Proteins 49:154–66; Sen TZ, Jernigan RL, Garnier J, Kloczkowski A (2005) GOR V server for protein secondary structure prediction. Bioinformatics 21:2787–8) to form a combined method called consensus database mining (CDM) (Sen TZ, Cheng H, Kloczkowski A, Jernigan RL (2006) A Consensus Data Mining secondary structure prediction by combining GOR V and Fragment Database Mining. Protein Sci 15:2499–506).
The GOR V secondary structure prediction method is based on information theory and Bayesian statistics, coupled with evolutionary information from multiple sequence alignments (MSA). By combining FDM with GOR V, our CDM method achieves improved prediction accuracy. Additionally, with the constant growth in the number of new protein structures and folds in the PDB, the accuracy of the CDM method is expected to increase in the future. We have developed a publicly available CDM server (Cheng H, Sen TZ, Jernigan RL, Kloczkowski A (2007) Consensus Data Mining (CDM) Protein Secondary Structure Prediction Server: combining GOR V and Fragment Database Mining (FDM). Bioinformatics 23:2628–30) (http://gor.bb.iastate.edu/cdm) for protein secondary structure prediction.
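The idea of falling back on a second predictor where no similar fragment exists can be illustrated with a small sketch. The code below is not the published FDM or CDM algorithm: the toy fragment library, the identity-based scoring, the window length, the 0.7 threshold, and the function names `fdm_predict` and `consensus_predict` are all assumptions made for illustration, and the fallback string merely stands in for a GOR V prediction supplied by the caller.

```python
from collections import Counter

# Hypothetical toy fragment library: (sequence fragment, secondary structure).
# H = helix, E = strand, C = coil. A real library would be derived from the PDB.
FRAGMENT_LIBRARY = [
    ("AEELLKK", "HHHHHHH"),
    ("VKVTVEV", "EEEEEEE"),
    ("GSPDGSG", "CCCCCCC"),
]


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def fdm_predict(seq, library, window=7, min_identity=0.7):
    """Fragment-matching sketch (assumed scoring, not the published one).

    Every window of the query is compared with every library fragment of the
    same length. Fragments that are similar enough vote, weighted by their
    identity score, for the state of each residue they cover. Returns one
    state per residue, or None where no fragment matched.
    """
    votes = [Counter() for _ in seq]
    for start in range(len(seq) - window + 1):
        query = seq[start:start + window]
        for frag_seq, frag_ss in library:
            if len(frag_seq) != window:
                continue
            score = identity(query, frag_seq)
            if score >= min_identity:
                for offset, state in enumerate(frag_ss):
                    votes[start + offset][state] += score
    return [v.most_common(1)[0][0] if v else None for v in votes]


def consensus_predict(seq, library, fallback_prediction, **kwargs):
    """Use the fragment-based state where one exists, else the fallback.

    fallback_prediction stands in for a per-residue prediction from another
    method (for example GOR V) and must have the same length as seq.
    """
    if len(fallback_prediction) != len(seq):
        raise ValueError("fallback prediction must match sequence length")
    fragment_states = fdm_predict(seq, library, **kwargs)
    return "".join(
        frag if frag is not None else fallback
        for frag, fallback in zip(fragment_states, fallback_prediction)
    )


if __name__ == "__main__":
    print(consensus_predict("AEELLKKGGGG", FRAGMENT_LIBRARY, "EEEEEEEEEEE"))
```

In this sketch the first seven residues match a helical library fragment exactly, so they take the fragment-derived state, while the remaining residues have no sufficiently similar fragment and keep the fallback state. This mirrors the observation in the abstract that fragment mining works well only when similar fragments are present in the library.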

Keywords

  • Data mining
  • Protein structure prediction
  • Secondary structure prediction
  • Protein crystallography
  • Structural fragments
  • Fragment database mining
  • Consensus database mining
  • Crystallization data mining

Notes

Acknowledgements

It is a pleasure to acknowledge the financial support provided by NIH grants 1R01GM073095, 1R01GM072014, and 1R01GM081680.

References

  1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–42
  2. Pauling L, Corey RB (1951) Configuration of polypeptide chains. Nature 168:550–1
  3. Pauling L, Corey RB, Branson HR (1951) The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain. Proc Natl Acad Sci USA 37:205–11
  4. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–637
  5. Frishman D, Argos P (1995) Knowledge-based protein secondary structure assignment. Proteins 23:566–79
  6. Moult J, Pedersen JT, Judson R, Fidelis K (1995) A large-scale experiment to assess protein structure prediction methods. Proteins 23:ii–v
  7. Biou V, Gibrat JF, Levin JM, Robson B, Garnier J (1988) Secondary structure prediction: combination of three different methods. Protein Eng 2:185–91
  8. Salamov AA, Solovyev VV (1995) Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J Mol Biol 247:11–5
  9. Rost B, Sander C (2000) Third generation prediction of secondary structures. Methods Mol Biol 143:71–95
  10. Jancarik J, Kim S (1991) Sparse matrix sampling: a screening method for crystallization of proteins. J Appl Crystallogr 24:409–411
  11. Kingston RL, Baker HM, Baker EN (1994) Search designs for protein crystallization based on orthogonal arrays. Acta Crystallogr D Biol Crystallogr 50:429–40
  12. McPherson A (1999) Crystallization of Biological Macromolecules. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, p 586
  13. Saridakis E, Chayen NE (2000) Improving protein crystal quality by decoupling nucleation and growth in vapor diffusion. Protein Sci 9:755–7
  14. Scott WG, Finch JT, Grenfell R, Fogg J, Smith T, Gait MJ, Klug A (1995) Rapid crystallization of chemically synthesized hammerhead RNAs using a double screening procedure. J Mol Biol 250:327–32
  15. Gilliland GL, Tung M, Ladner J (1996) The Biological Macromolecule Crystallization Database and NASA Protein Crystal Growth Archive. J Res Natl Inst Stand Technol 101:309–20
  16. Gilliland GL, Tung M, Ladner JE (2002) The Biological Macromolecule Crystallization Database: crystallization procedures and strategies. Acta Crystallogr D Biol Crystallogr 58:916–20
  17. Jurisica I, Rogers P, Glasgow JI, Fortier S, Luft JR, Wolfley JR, Bianca MA, Weeks DR, DeTitta GT (2001) Intelligent decision support for protein crystal growth. IBM Syst J 40:394–409
  18. Kimber MS, Vallee F, Houston S, Necakov A, Skarina T, Evdokimova E, Beasley S, Christendat D, Savchenko A, Arrowsmith CH, Vedadi M, Gerstein M, Edwards AM (2003) Data mining crystallization databases: knowledge-based approaches to optimize protein crystal screens. Proteins 51:562–8
  19. Page R, Grzechnik SK, Canaves JM, Spraggon G, Kreusch A, Kuhn P, Stevens RC, Lesley SA (2003) Shotgun crystallization strategy for structural genomics: an optimized two-tiered crystallization screen against the Thermotoga maritima proteome. Acta Crystallogr D Biol Crystallogr 59:1028–37
  20. Page R, Stevens RC (2004) Crystallization data mining in structural genomics: using positive and negative results to optimize protein crystallization screens. Methods 34:373–89
  21. Segelke B (2001) Efficiency Analysis of Sampling Protocols Used in Protein Crystallization Screening. J Cryst Growth 232:553–562
  22. Rupp B (2003) Maximum-likelihood crystallization. J Struct Biol 142:162–9
  23. DeLucas LJ, Bray TL, Nagy L, McCombs D, Chernov N, Hamrick D, Cosenza L, Belgovskiy A, Stoops B, Chait A (2003) Efficient protein crystallization. J Struct Biol 142:188–206
  24. Oldfield TJ (2001) Creating structure features by data mining the PDB to use as molecular-replacement models. Acta Crystallogr D57:1421–1427
  25. Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405:442–51
  26. Rost B, Sander C, Schneider R (1994b) Redefining the goals of protein secondary structure prediction. J Mol Biol 235:13–26
  27. Zemla A, Venclovas C, Fidelis K, Rost B (1999) A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins 34:220–3
  28. Chou PY, Fasman GD (1974) Prediction of protein conformation. Biochemistry 13:222–45
  29. Lim VI (1974a) Algorithms for prediction of alpha-helical and beta-structural regions in globular proteins. J Mol Biol 88:873–94
  30. Lim VI (1974b) Structural principles of the globular organization of protein chains. A stereochemical theory of globular protein secondary structure. J Mol Biol 88:857–72
  31. Garnier J, Osguthorpe DJ, Robson B (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 120:97–120
  32. Gibrat JF, Garnier J, Robson B (1987) Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J Mol Biol 198:425–43
  33. Garnier J, Gibrat JF, Robson B (1996) GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol 266:540–53
  34. Frishman D, Argos P (1997) Seventy-five percent accuracy in protein secondary structure prediction. Proteins 27:329–35
  35. Holley LH, Karplus M (1989) Protein secondary structure prediction with a neural network. Proc Natl Acad Sci USA 86:152–6
  36. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202
  37. Petersen TN, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O (2000) Prediction of protein secondary structure at 80% accuracy. Proteins 41:17–20
  38. Qian N, Sejnowski TJ (1988) Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 202:865–84
  39. Rost B, Sander C (1993) Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 232:584–99
  40. Rost B, Sander C, Schneider R (1994a) PHD–an automatic mail server for protein secondary structure prediction. Comput Appl Biosci 10:53–60
  41. Stolorz P, Lapedes A, Xia Y (1992) Predicting protein secondary structure using neural net and statistical methods. J Mol Biol 225:363–77
  42. Levin JM, Garnier J (1988) Improvements in a secondary structure prediction method based on a search for local sequence homologies and its use as a model building tool. Biochim Biophys Acta 955:283–95
  43. Levin JM, Robson B, Garnier J (1986) An algorithm for secondary structure determination in proteins based on sequence similarity. FEBS Lett 205:303–8
  44. Salamov AA, Solovyev VV (1997) Protein secondary structure prediction using local alignments. J Mol Biol 268:31–6
  45. Salzberg S, Cost S (1992) Predicting protein secondary structure with a nearest-neighbor algorithm. J Mol Biol 227:371–4
  46. Yi TM, Lander ES (1993) Protein secondary structure prediction using nearest-neighbor methods. J Mol Biol 232:1117–29
  47. Barton GJ (1995) Protein secondary structure prediction. Curr Opin Struct Biol 5:372–6
  48. Cuff JA, Barton GJ (1999) Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 34:508–19
  49. Cuff JA, Barton GJ (2000) Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 40:502–11
  50. Karplus K, Barrett C, Hughey R (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14:846–56
  51. King RD, Sternberg MJ (1990) Machine learning approach for the prediction of protein secondary structure. J Mol Biol 216:441–57
  52. Ouali M, King RD (2000) Cascaded multiple classifiers for secondary structure prediction. Protein Sci 9:1162–76
  53. Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJ (1987) Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J Mol Biol 195:957–61
  54. Levin JM, Pascarella S, Argos P, Garnier J (1993) Quantification of secondary structure prediction improvement using multiple alignments. Protein Eng 6:849–54
  55. Rost B (1996) PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol 266:525–39
  56. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–402
  57. Di Francesco V, Garnier J, Munson PJ (1996) Improving protein secondary structure prediction with aligned homologous sequences. Protein Sci 5:106–13
  58. Lecompte O, Thompson JD, Plewniak F, Thierry J, Poch O (2001) Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene 270:17–30
  59. Rost B (2001) Review: protein secondary structure prediction continues to rise. J Struct Biol 134:204–18
  60. Russell RB, Barton GJ (1993) The limits of protein secondary structure prediction accuracy from multiple sequence alignment. J Mol Biol 234:951–7
  61. Hua S, Sun Z (2001) A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol 308:397–407
  62. Nguyen MN, Rajapakse JC (2005) Two-stage multi-class support vector machines to protein secondary structure prediction. Pac Symp Biocomput 346–57
  63. Huang X, Huang DS, Zhang GZ, Zhu YP, Li YX (2005) Prediction of protein secondary structure using improved two-level neural network architecture. Protein Pept Lett 12:805–11
  64. Wood MJ, Hirst JD (2005) Protein secondary structure prediction with dihedral angles. Proteins 59:476–81
  65. Lin K, Simossis VA, Taylor WR, Heringa J (2005) A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics 21:152–9
  66. Krissinel E, Henrick K (2004) Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 60:2256–68
  67. Wray LV Jr, Fisher SH (2007) Functional analysis of the carboxy-terminal region of Bacillus subtilis TnrA, a MerR family protein. J Bacteriol 189:20–7
  68. Kashlan OB, Maarouf AB, Kussius C, Denshaw RM, Blumenthal KM, Kleyman TR (2006) Distinct structural elements in the first membrane-spanning segment of the epithelial sodium channel. J Biol Chem 281:30455–62
  69. Jayaram B, Bhushan K, Shenoy SR, Narang P, Bose S, Agrawal P, Sahu D, Pandey V (2006) Bhageerath: an energy based web enabled computer software suite for limiting the search space of tertiary structures of small globular proteins. Nucleic Acids Res 34:6195–204
  70. Meiler J, Baker D (2003) Coupled prediction of protein secondary and tertiary structure. Proc Natl Acad Sci USA 100:12105–10
  71. Moult J (2006) Rigorous performance evaluation in protein structure modelling and implications for computational biology. Philos Trans R Soc Lond B Biol Sci 361:453–8
  72. Kihara D (2005) The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 14:1955–63
  73. Tsai CJ, Nussinov R (2005) The implications of higher (or lower) success in secondary structure prediction of chain fragments. Protein Sci 14:1943–4
  74. Garnier J, Robson B (1989) The GOR method for predicting secondary structures in proteins. In: Fasman GD (ed) Prediction of protein structure and the principles of protein conformation. Plenum, New York, pp 417–465
  75. Kloczkowski A, Ting KL, Jernigan RL, Garnier J (2002b) Protein secondary structure prediction based on the GOR algorithm incorporating multiple sequence alignment information. Polymer 43:441–449
  76. Simossis VA, Heringa J (2004) Integrating protein secondary structure prediction and multiple sequence alignment. Curr Protein Pept Sci 5:249–66
  77. Kloczkowski A, Ting KL, Jernigan RL, Garnier J (2002a) Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence. Proteins 49:154–66
  78. Sen TZ, Jernigan RL, Garnier J, Kloczkowski A (2005) GOR V server for protein secondary structure prediction. Bioinformatics 21:2787–8
  79. Simons KT, Kooperberg C, Huang E, Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209–25
  80. Simons KT, Ruczinski I, Kooperberg C, Fox BA, Bystroff C, Baker D (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34:82–95
  81. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–9
  82. Dayhoff MO, Schwartz RM, Orcutt BC (1978) Atlas Protein Seq Struct, Suppl., 345–352
  83. Cheng H, Sen TZ, Kloczkowski A, Margaritis D, Jernigan RL (2005) Prediction of protein secondary structure by mining structural fragment database. Polymer 46:4314–4321

Copyright information

© Springer Heidelberg 2009

Authors and Affiliations

  • Haitao Cheng (1, 2)
  • Taner Z. Sen (3, 4)
  • Robert L. Jernigan (1, 2)
  • Andrzej Kloczkowski (1, 2)

  1. Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, USA
  2. L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, USA
  3. 1025 Crop Genome Informatics Laboratory, Ames, USA
  4. Department of Genetics, Development and Cell Biology, Iowa State University, Ames, USA
