Prediction of Secondary Protein Structure Content from Primary Sequence Alone – A Feature Selection Based Approach

Kurgan, Lukasz; Homaeian, Leila

doi:10.1007/11510888_33


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3587)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

Research on protein structure and function is one of the most important subjects in modern bioinformatics and computational biology. It often uses advanced data mining and machine learning methodologies to perform prediction or pattern recognition tasks. This paper describes a new method for predicting protein secondary structure content based on feature selection and multiple linear regression. The method develops a novel representation of primary protein sequences based on a large set of 495 features. Feature selection was performed on a very large set of nearly 6,000 proteins, and tests on standard non-homologous protein sets confirm the high quality of the developed solution. The application of feature selection and the novel representation resulted in a 14-15% error rate reduction compared with results achieved using the standard representation. The prediction tests also show that a small set of 5-25 features is sufficient to achieve accurate prediction of both helix and strand content for non-homologous proteins.
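The abstract does not list the 495 features or name the selection procedure, so the following is only a minimal sketch of the general idea: choose a small feature subset and fit a multiple linear regression on it to predict a content value. Greedy forward selection, the amino acid composition helper, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(seq):
    """Fraction of each standard amino acid in a sequence (illustrative feature)."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(seq), 1)

def fit_mlr(X, y):
    """Least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Apply a fitted intercept-plus-slopes model to new rows."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def forward_select(X_tr, y_tr, X_va, y_va, max_features):
    """Greedy forward selection (an assumed procedure, not the paper's).

    At each step, add the feature that most lowers the mean absolute
    error on the validation rows; stop when no feature improves it.
    """
    selected, best_err = [], np.inf
    limit = min(max_features, X_tr.shape[1])
    while len(selected) < limit:
        candidates = []
        for j in range(X_tr.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            coef = fit_mlr(X_tr[:, cols], y_tr)
            pred = predict_mlr(coef, X_va[:, cols])
            candidates.append((float(np.mean(np.abs(pred - y_va))), j))
        err, j = min(candidates)
        if err >= best_err:
            break
        selected.append(j)
        best_err = err
    return selected, best_err

# Synthetic demonstration: only columns 3 and 7 carry signal.
rng = np.random.default_rng(0)
X = rng.random((200, 30))
y = 0.5 * X[:, 3] - 0.3 * X[:, 7] + 0.01 * rng.standard_normal(200)
selected, err = forward_select(X[:120], y[:120], X[120:], y[120:], max_features=5)
print(selected, err)
```

On the synthetic data, the two informative columns should be among the selected features. With real proteins, the rows of `X` would be per-sequence feature vectors and `y` the observed helix or strand content.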

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, T6G 2V4, Canada
Lukasz Kurgan & Leila Homaeian

Editor information

Editors and Affiliations

Institute of Computer Vision and applied Computer Sciences, IBaI, Germany
Petra Perner
Institute of Media and Information Technology, Chiba University, Japan
Atsushi Imiya

About this paper

Cite this paper

Kurgan, L., Homaeian, L. (2005). Prediction of Secondary Protein Structure Content from Primary Sequence Alone – A Feature Selection Based Approach. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_33

DOI: https://doi.org/10.1007/11510888_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26923-6
Online ISBN: 978-3-540-31891-0
eBook Packages: Computer Science, Computer Science (R0)
