Machine Learning Methods for the Protein Fold Recognition Problem

Stapor, Katarzyna; Roterman-Konieczna, Irena; Fabian, Piotr

doi:10.1007/978-3-319-94030-4_5

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 149)

Abstract

Protein fold recognition is a crucial problem in bioinformatics. It is usually solved with sequence comparison methods, but these fail when proteins that are similar in structure share little sequence homology; in such cases, machine learning methods are used to predict the structure of the protein. Imbalanced data sets, numerous outliers, and a large number of classes make the task very complex. We aim to explain the methodology for building classifiers for protein fold recognition and to cover all the major results in this field.

Author information

Authors and Affiliations

Silesian University of Technology, Gliwice, Poland
Katarzyna Stapor & Piotr Fabian
Jagiellonian University, Kraków, Poland
Irena Roterman-Konieczna

Corresponding author

Correspondence to Katarzyna Stapor.

Editor information

Editors and Affiliations

University of Piraeus, Piraeus, Greece
George A. Tsihrintzis
University of Piraeus, Piraeus, Greece
Dionisios N. Sotiropoulos
Faculty of Engineering and Information Technology, Centre for Artificial Intelligence, University of Technology, Sydney, New South Wales, Australia
Lakhmi C. Jain

About this chapter

Cite this chapter

Stapor, K., Roterman-Konieczna, I., Fabian, P. (2019). Machine Learning Methods for the Protein Fold Recognition Problem. In: Tsihrintzis, G., Sotiropoulos, D., Jain, L. (eds) Machine Learning Paradigms. Intelligent Systems Reference Library, vol 149. Springer, Cham. https://doi.org/10.1007/978-3-319-94030-4_5

DOI: https://doi.org/10.1007/978-3-319-94030-4_5
Published: 04 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94029-8
Online ISBN: 978-3-319-94030-4
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)