Machine Learning Framework: Predicting Protein Structural Features

  • Pramod Kumar
  • Vandana Mishra
  • Subarna Roy


Structural biology is a challenging scientific discipline that aims to uncover the topologies and shapes of biological macromolecules such as DNA, RNA, and proteins. Proteins are large macromolecules consisting of one or more chains of amino acids joined linearly by peptide bonds. Proteins are essential to organisms and take part in virtually all biological processes of cells. They catalyze biochemical reactions (enzymes) and act as structural constituents, signaling molecules, and molecular machines in every biological system. They are responsible for immune responses, can store molecules (e.g., casein and ovalbumin store amino acids), and are even responsible for cell mechanics (e.g., actin and myosin). Protein structure prediction is a difficult task and a fundamental problem in computational biology and structural biology. It can be divided into four levels: (1) one-dimensional (1D) prediction of structural features along the linear chain of amino acids; (2) two-dimensional (2D) prediction of spatial arrangements between amino acids; (3) three-dimensional (3D) prediction of the tertiary structure of a protein; and (4) four-dimensional (4D) prediction of the quaternary structure of multiprotein complexes. Researchers have recently applied a wide range of data mining methods, script-based tools, and machine learning tools to protein structure prediction. In this chapter, we provide a comprehensive overview of protein structure and apply different data mining and machine learning algorithms to protein structure prediction.
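As an illustration of the first two levels, the sketch below shows one common way to prepare data for them: a sliding-window one-hot encoding of the sequence as input for 1D predictors, and a binary residue contact map as a 2D target. The function names, the window size, and the 8 Å cutoff between representative atoms are assumptions made for this example, not details taken from the chapter, and the coordinates are toy values.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def one_hot_windows(sequence, window=3):
    """1D input sketch: for each residue, concatenate one-hot vectors of the
    residues in a window centred on it. Positions past either end of the
    chain are encoded as all-zero padding. Non-standard letters raise ValueError."""
    half = window // 2
    features = []
    for i in range(len(sequence)):
        row = []
        for j in range(i - half, i + half + 1):
            vec = [0] * len(AMINO_ACIDS)
            if 0 <= j < len(sequence):
                vec[AMINO_ACIDS.index(sequence[j])] = 1
            row.extend(vec)
        features.append(row)
    return features


def contact_map(coords, threshold=8.0):
    """2D target sketch: entry (i, j) is 1 when residues i and j lie closer
    than `threshold` angstroms (assumed cutoff), else 0."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    feats = one_hot_windows("ACD", window=3)
    print(len(feats), len(feats[0]))  # residues, features per residue

    # Toy coordinates (angstroms), one representative atom per residue.
    toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
    for row in contact_map(toy_coords):
        print(row)
```

A 1D predictor (for example, of secondary structure) would be trained on rows like those returned by `one_hot_windows`, while a 2D predictor would be trained to reproduce matrices like the one returned by `contact_map`.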


Keywords: Algorithms; Machine learning; Proteins; Structure prediction



We are pleased to acknowledge the contributions of the authors from Biomedical Informatics Centre of ICMR-NITM, Belagavi and CSIR-URDIP, Pune. We also gratefully acknowledge the Indian Council of Medical Research, Dept. of Health Research, New Delhi, Government of India, for providing infrastructure and funding for the facility.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Pramod Kumar (1)
  • Vandana Mishra (2)
  • Subarna Roy (1)
  1. Biomedical Informatics Centre, ICMR-National Institute of Traditional Medicine (Formerly Regional Medical Research Centre), Department of Health Research, Belagavi, India
  2. CSIR-Unit for Research and Development of Information Products, Pune, India
