Prediction of Secondary Protein Structure Content from Primary Sequence Alone – A Feature Selection Based Approach

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3587)

Abstract

Research in protein structure and function is one of the most important subjects in modern bioinformatics and computational biology. It often uses advanced data mining and machine learning methodologies to perform prediction or pattern recognition tasks. This paper describes a new method for predicting protein secondary structure content based on feature selection and multiple linear regression. The method develops a novel representation of primary protein sequences based on a large set of 495 features. Feature selection performed on a very large set of nearly 6,000 proteins, together with tests on standard non-homologous protein sets, confirms the high quality of the developed solution. Applying feature selection and the novel representation reduced the error rate by 14-15% compared with results achieved using the standard representation. The prediction tests also show that a small set of 5-25 features is sufficient to achieve accurate prediction of both helix and strand content for non-homologous proteins.
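The general idea of combining feature selection with multiple linear regression can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the 495 features or name the selection procedure, so this example assumes plain amino acid composition as the feature set and uses greedy forward selection as a stand-in. All function names (`composition`, `fit`, `forward_select`) and the small `ridge` stabilizer are hypothetical choices made for illustration.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq: str) -> List[float]:
    """Fraction of each standard amino acid in seq (assumed feature set)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(X: Sequence[Sequence[float]], y: Sequence[float],
        ridge: float = 1e-8) -> List[float]:
    """Multiple linear regression via the normal equations.

    Returns [intercept, coef_1, ..., coef_p]. The tiny ridge term is an
    assumption added only to keep nearly collinear systems solvable.
    """
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef: Sequence[float], x: Sequence[float]) -> float:
    """Predicted content value for one feature vector."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def mean_abs_error(X, y, coef) -> float:
    """Mean absolute difference between predicted and actual content."""
    return sum(abs(predict(coef, r) - t) for r, t in zip(X, y)) / len(y)


def forward_select(X, y, k: int) -> List[int]:
    """Greedily pick k feature indices that minimize training error."""
    chosen: List[int] = []
    for _ in range(k):
        best = None
        for j in range(len(X[0])):
            if j in chosen:
                continue
            cols = chosen + [j]
            sub = [[r[c] for c in cols] for r in X]
            err = mean_abs_error(sub, y, fit(sub, y))
            if best is None or err < best[0]:
                best = (err, j)
        chosen.append(best[1])
    return chosen
```

In use, each protein sequence would be mapped to a feature vector with `composition`, the target `y` would hold the known helix (or strand) fraction of each protein, and `forward_select` would return the indices of a small feature subset on which `fit` is then run. A faithful evaluation would measure error on held-out non-homologous proteins rather than on the training set, as this sketch does for brevity.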

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kurgan, L., Homaeian, L. (2005). Prediction of Secondary Protein Structure Content from Primary Sequence Alone – A Feature Selection Based Approach. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_33

  • DOI: https://doi.org/10.1007/11510888_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26923-6

  • Online ISBN: 978-3-540-31891-0

  • eBook Packages: Computer Science (R0)
