Abstract
Accurately predicting whether a protein will crystallize is crucial to improving the success rate of protein crystallization projects. A common and critical problem in developing machine-learning-based protein crystallization predictors is how to effectively utilize protein features extracted from different views. In this study, we aimed to improve the efficiency of fusing multi-view protein features by proposing a new two-layered SVM (2L-SVM), which converts the feature-level fusion problem into a decision-level fusion problem. Each SVM in the first layer of the 2L-SVM is trained on one of the multi-view feature sets; the outputs of the first-layer SVMs, which are the "intermediate" decisions made on the basis of the respective feature sets, are then combined by a second-layer SVM. Based on the proposed 2L-SVM, we implemented a sequence-based protein crystallization predictor called TargetCrys. Experimental results on several benchmark datasets demonstrated the efficacy of the proposed 2L-SVM for fusing multi-view features. We also compared TargetCrys with existing sequence-based protein crystallization predictors and found that it outperformed most of them and is competitive with the state-of-the-art predictors. The TargetCrys webserver and the datasets used in this study are freely available for academic use at: http://csbio.njust.edu.cn/bioinf/TargetCrys.
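The two-layer scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TargetCrys implementation: the bias-free linear SVM trained by subgradient descent, the toy two-view data, and all function names (`train_linear_svm`, `train_two_layer`, `predict_two_layer`, `make_toy`) are hypothetical choices made for the example. The second layer here is trained directly on the first-layer outputs for the training set, a detail the abstract does not specify.

```python
import random

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=50, seed=0):
    """Linear SVM (hinge loss + L2 penalty) fitted by subgradient descent.
    Assumption for brevity: no bias term, so features should be roughly centered."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [wj * (1.0 - eta * lam) for wj in w]      # L2 shrinkage
            if margin < 1.0:                              # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def decision_value(w, x):
    """Signed score of sample x under weight vector w."""
    return sum(wj * xj for wj, xj in zip(w, x))

def train_two_layer(views, y):
    """views: one feature matrix per view, rows aligned with the labels y."""
    # Layer 1: one SVM per feature view.
    layer1 = [train_linear_svm(Xv, y) for Xv in views]
    # The "intermediate" decisions become the input features of layer 2.
    Z = [[decision_value(w, Xv[i]) for w, Xv in zip(layer1, views)]
         for i in range(len(y))]
    layer2 = train_linear_svm(Z, y)
    return layer1, layer2

def predict_two_layer(model, sample_views):
    """sample_views: one feature vector per view for a single protein."""
    layer1, layer2 = model
    z = [decision_value(w, xv) for w, xv in zip(layer1, sample_views)]
    return 1 if decision_value(layer2, z) >= 0 else -1

def make_toy(n_per_class, seed=1):
    """Hypothetical two-view toy data: labels are +1 / -1, classes are separable."""
    rng = random.Random(seed)
    view_a, view_b, labels = [], [], []
    for label in (1, -1):
        for _ in range(n_per_class):
            view_a.append([label * 1.0 + rng.uniform(-0.5, 0.5) for _ in range(2)])
            view_b.append([label * 2.0 + rng.uniform(-0.5, 0.5)])
            labels.append(label)
    return [view_a, view_b], labels

views, y = make_toy(20)
model = train_two_layer(views, y)
preds = [predict_two_layer(model, [v[i] for v in views]) for i in range(len(y))]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(accuracy)
```

In practice, a second-layer classifier is more commonly trained on held-out (for example, cross-validated) first-layer outputs to limit overfitting; whether TargetCrys does so is not stated in this section.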
Notes
In this study, multi-view features are features extracted from different sources, such as amino acid composition and the protein evolutionary profile.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 61373062, 61175024, 61222306, and 61233011), the Natural Science Foundation of Jiangsu (No. BK20141403), the Jiangsu University Graduate Research and Innovation Project (No. KYZZ_0123), Jiangsu Postdoctoral Science Foundation (No. 1201027C), the Science and Technology Commission of Shanghai Municipality (No. 16JC1404300), “The Six Top Talents” of Jiangsu Province (No. 2013-XXRJ-022), and the Fundamental Research Funds for the Central Universities (No. 30916011327). D. J. Yu is the corresponding author for this paper.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Handling Editor: S. C. E. Tosatto.
Appendix A: Parameters of TargetCrys
See Table 8.
Cite this article
Hu, J., Han, K., Li, Y. et al. TargetCrys: protein crystallization prediction by fusing multi-view features with two-layered SVM. Amino Acids 48, 2533–2547 (2016). https://doi.org/10.1007/s00726-016-2274-4
DOI: https://doi.org/10.1007/s00726-016-2274-4