Abstract
A genetic algorithm (GA) coupled with multiple linear regression (MLR) was used to extract useful features from amino acids and g-gap dipeptides for distinguishing between thermophilic and non-thermophilic proteins. The method was trained on a benchmark dataset of 915 thermophilic and 793 non-thermophilic proteins. It reached an overall accuracy of 95.4% in a jackknife test using nine amino acids, 38 0-gap dipeptides and 29 1-gap dipeptides. The accuracy as a function of protein size ranged between 85.8 and 96.9%. The overall accuracies of three independent tests were 93, 93.4 and 91.8%. These results suggest that the GA-MLR approach described herein should be a powerful method for selecting features that describe thermostable machines and an aid in the design of more stable proteins.
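To make the g-gap dipeptide features concrete, the following is a minimal sketch of how such a composition vector might be computed from a primary sequence. It is an illustration under stated assumptions, not the authors' implementation: the function name `g_gap_dipeptide_composition`, the restriction to the 20 standard residues, and the normalization by the number of valid pairs are all choices made here for the example. A 0-gap dipeptide is a pair of adjacent residues; a 1-gap dipeptide is a pair separated by one intervening residue.

```python
from itertools import product

# Assumption: only the 20 standard amino acids are counted; pairs that
# contain any other symbol (e.g. X, B, Z) are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(sequence, gap=0):
    """Return the frequency of each residue pair separated by `gap` residues.

    The result is a dict with 400 entries (20 x 20 pairs). Frequencies are
    normalized by the number of valid pairs found (an assumption of this
    sketch), so they sum to 1 unless the sequence yields no valid pair.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    # Position i is paired with position i + gap + 1.
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return {pair: 0.0 for pair in pairs}
    return {pair: n / total for pair, n in counts.items()}
```

Concatenating the 20 amino acid frequencies with the 400 0-gap and 400 1-gap dipeptide frequencies would give a candidate pool of 820 features per protein, from which a selection procedure such as GA-MLR could pick a small subset like the 76 features reported above.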
Acknowledgments
We thank Songyot Nakariyakul and Luonan Chen for kindly providing the dataset. This work was financially supported by the National Natural Science Foundation of China (No. 30871614) and the Tianjin Natural Science Foundation (No. 08JCYBJC04100).
Cite this article
Wang, L., Li, C. Optimal subset selection of primary sequence features using the genetic algorithm for thermophilic proteins identification. Biotechnol Lett 36, 1963–1969 (2014). https://doi.org/10.1007/s10529-014-1577-3
DOI: https://doi.org/10.1007/s10529-014-1577-3