Machine Learning Methods for the Protein Fold Recognition Problem

  • Chapter
  • Book: Machine Learning Paradigms

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 149)

Abstract

The protein fold recognition problem is crucial in bioinformatics. It is usually solved using sequence comparison methods, but these methods fail when proteins that are similar in structure share little sequence homology; in such cases, machine learning methods are used to predict the structure of the protein. The imbalance of the data sets, the number of outliers, and the large number of classes make the task very complex. We aim to explain the methodology for building classifiers for protein fold recognition and to cover the major results in this field.




Corresponding author

Correspondence to Katarzyna Stapor.


Copyright information

© 2019 Springer International Publishing AG, part of Springer Nature

About this chapter


Cite this chapter

Stapor, K., Roterman-Konieczna, I., Fabian, P. (2019). Machine Learning Methods for the Protein Fold Recognition Problem. In: Tsihrintzis, G., Sotiropoulos, D., Jain, L. (eds) Machine Learning Paradigms. Intelligent Systems Reference Library, vol 149. Springer, Cham. https://doi.org/10.1007/978-3-319-94030-4_5
