Optimal subset selection of primary sequence features using the genetic algorithm for thermophilic proteins identification

  • Original Research Paper
Biotechnology Letters

Abstract

A genetic algorithm (GA) coupled with multiple linear regression (MLR) was used to extract useful features from amino acids and g-gap dipeptides for distinguishing between thermophilic and non-thermophilic proteins. The method was trained on a benchmark dataset of 915 thermophilic and 793 non-thermophilic proteins. It reached an overall accuracy of 95.4% in a jackknife test using nine amino acids, 38 0-gap dipeptides and 29 1-gap dipeptides. The accuracy as a function of protein size ranged between 85.8 and 96.9%. The overall accuracies of three independent tests were 93, 93.4 and 91.8%. These results suggest that the GA-MLR approach described herein should be a powerful method for selecting features that describe thermostable machines and an aid in the design of more stable proteins.
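To make the feature types named in the abstract concrete, the following Python sketch computes amino acid composition and g-gap dipeptide composition for a single sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a g-gap dipeptide is taken to be a residue pair separated by g intervening positions, and each count is normalised by the number of valid pairs, which is a common convention that the abstract itself does not specify. The GA-MLR selection step is not shown.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the relative frequency of each standard amino acid."""
    seq = seq.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def g_gap_dipeptide_composition(seq, g):
    """Return the relative frequency of each of the 400 g-gap dipeptides.

    Assumption: a g-gap dipeptide pairs residue i with residue i + g + 1,
    so g = 0 gives adjacent residues. Pairs containing a non-standard
    symbol are skipped.
    """
    seq = seq.upper()
    pairs = [seq[i] + seq[i + g + 1] for i in range(len(seq) - g - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(pairs)
    total = len(pairs)
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


def feature_vector(seq, gaps=(0, 1)):
    """Concatenate amino acid and g-gap dipeptide frequencies into one list.

    With gaps (0, 1) this sketch yields 20 + 400 + 400 = 820 values,
    from which a selection method could then pick a subset.
    """
    features = list(amino_acid_composition(seq).values())
    for g in gaps:
        features.extend(g_gap_dipeptide_composition(seq, g).values())
    return features
```

For example, `g_gap_dipeptide_composition("ACAC", 1)` considers the pairs `AA` and `CC` (positions 1 and 3, then 2 and 4), so each receives a frequency of 0.5 and all other dipeptides receive 0.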



Acknowledgments

We thank Songyot Nakariyakul and Luonan Chen for kindly providing the dataset. This work was financially supported by the National Natural Science Foundation of China (No. 30871614) and the Tianjin Natural Science Foundation (No. 08JCYBJC04100).

Author information

Corresponding author

Correspondence to CuiFeng Li.

About this article

Cite this article

Wang, L., Li, C. Optimal subset selection of primary sequence features using the genetic algorithm for thermophilic proteins identification. Biotechnol Lett 36, 1963–1969 (2014). https://doi.org/10.1007/s10529-014-1577-3

  • DOI: https://doi.org/10.1007/s10529-014-1577-3
