Machine Learning Methods for the Protein Fold Recognition Problem

  • Katarzyna Stapor
  • Irena Roterman-Konieczna
  • Piotr Fabian
Part of the Intelligent Systems Reference Library book series (ISRL, volume 149)


The protein fold recognition problem is crucial in bioinformatics. It is usually solved using sequence comparison methods, but these fail when proteins with similar structures share little sequence homology; in such cases, machine learning methods are used to predict the structure of the protein. Imbalanced data sets, numerous outliers, and a large number of classes make the task very complex. We aim to explain the methodology for building classifiers for protein fold recognition and to cover all the major results in this field.


Supervised learning algorithm, Classifier, Features, Protein fold recognition



Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  • Katarzyna Stapor (1, corresponding author)
  • Irena Roterman-Konieczna (2)
  • Piotr Fabian (1)
  1. Silesian University of Technology, Gliwice, Poland
  2. Jagiellonian University, Kraków, Poland
