NClassG+: A classifier for non-classically secreted Gram-positive bacterial proteins
Most predictive methods currently available for the identification of protein secretion mechanisms have focused on classically secreted proteins. In fact, only two methods have been reported for predicting non-classically secreted proteins of Gram-positive bacteria. This study describes the implementation of a sequence-based classifier, denoted as NClassG+, for identifying non-classically secreted Gram-positive bacterial proteins.
Several feature-based classifiers were trained using different sequence transformation vectors (frequencies, dipeptides, physicochemical factors and PSSM) and Support Vector Machines (SVMs) with Linear, Polynomial and Gaussian kernel functions. Nested k-fold cross-validation (CV) was applied to select the best models, using the inner CV loop to tune the model parameters and the outer CV loop to compute the error. The parameters, the Kernel functions and all possible combinations of feature vectors were optimized using grid search.
The final model was tested against an independent set not previously seen by the model, obtaining better predictive performance than SecretomeP V2.0 and SecretP V2.0 for the identification of non-classically secreted proteins. NClassG+ is freely available on the web at http://www.biolisi.unal.edu.co/web-servers/nclassgpositive/
Keywords: Support Vector Machine; Dipeptide; Matthews Correlation Coefficient; Gaussian Kernel Function; Dipeptide Composition
Machine Learning (ML) tools have been successfully applied to a variety of biological problems, such as the classification of proteins according to their subcellular localization and secretion mechanism. Different computational methods have been used to obtain reliable subcellular localization predictions, such as Artificial Neural Networks (ANNs), Hidden Markov Models (HMMs) and Support Vector Machines (SVMs) [1, 2, 3, 4].
The simplest way of addressing classification problems is to follow a binary approach, trying to discriminate objects according to two categories: positive (+) and negative (-). SVMs rely on two concepts in order to solve this type of problem: the first one is known as the large-margin separation principle, which is motivated by the idea of classifying points in two dimensions; and the second one is known as Kernel methods.
The Kernel methods that have been applied to bioinformatics fall mainly into three categories: Kernels for real-valued data, Kernels for sequences and Kernels developed for specific purposes such as the Position-Specific Scoring Matrix (PSSM)-Kernel. In the first case, the examples that represent a data set can usually be expressed as feature vectors of a given dimensionality. Linear, polynomial and Gaussian Kernels are some of the most commonly used functions for real-valued data, and they were used in the implementation of NClassG+. In the second case, the most frequently used Kernels for sequences are the Spectrum Kernels describing l-mer content, positional Weighted Degree (WD) Kernels that use positional information, and other Kernels for sequences such as the Local Alignment Kernel [5, 9].
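As an illustration of a Kernel for sequences, the following minimal Python sketch computes a spectrum-style kernel as the inner product of l-mer (here called k-mer) count vectors. The function names `kmer_counts` and `spectrum_kernel` and the example sequences are hypothetical; this kernel was not used in NClassG+ and the sketch is only meant to make the idea of "l-mer content" concrete.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer (l-mer) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute to the dot product.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items() if kmer in counts_b)


# Toy sequences (hypothetical): the shared 2-mers are MK, KK, KL and LA.
print(spectrum_kernel("MKKLLA", "MKKLAA", k=2))
```

Because the kernel is a plain dot product of count vectors, it is symmetric and returns zero for sequences that share no k-mer.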
The use of Kernels for exploring real-valued biological data such as proteins usually involves two steps. In the first step, amino acid sequences are transformed into fixed-length vectors; in the second step, these vectors are used to feed ML tools so that they can learn to make predictions [10, 11]. Among the techniques based on Kernel learning, the SVM classification method stands out: it searches for an optimal separating hyperplane in the feature space and determines the optimal data separation margin, maximizing the generalization capacity of the detected pattern. This separating hyperplane is trained by means of quadratic programming. SVMs and Kernel functions are very effective for solving classification problems because they are based on probability theory, can handle large data sets of high dimensionality, and have great flexibility to model diverse data sources.
Table 1. Comparison of the evaluation measurements of NClassG+, SecretomeP 2.0 and SecretP 2.0 for the classification of Gram-positive bacterial proteins
Over the last 20 years, the use of the ML techniques mentioned above has made it possible to propose novel solutions for the identification of protein secretion and post-translational modifications. The validation of the different methods available for predicting protein secretion [19, 20], as well as the use of such algorithmic methods for the identification of potential drug and vaccine target proteins, followed by the experimental validation of such predictions [21, 22], has proven to be a consistent approach for obtaining novel biological findings that are supported by computational processes and are directly applicable to protein secretion problems.
ML tools used in the identification of secreted proteins have been developed taking into account the biological principles of protein subcellular localization, which is essential for the correct functioning of these proteins. The localization of secreted proteins in their appropriate cellular compartments involves diverse processes that range from the transport of small molecules to highly complex routes with intrinsic sequence signaling processes. Much of the current effort in understanding protein secretion has focused on how such protein transport systems work and on the identification of membrane proteins, in order to drive drug development toward products that have specific effects on such proteins [23, 24].
In Gram-positive bacteria, proteins might localize in at least four different locations: the cytoplasm, cytoplasmic membrane, cell wall and extracellular milieu. Since protein synthesis takes place in the cytoplasm, secreted proteins have to be transported across the cell membrane so that they can fulfill their function effectively [25, 26, 27]. Given the complexity of such secretion systems, it is not surprising that new mechanisms of secretion are constantly being discovered. Thus, there is a considerable number of proteins that have been experimentally identified as secreted but whose mechanism or route of secretion has not yet been identified, and which are therefore said to be secreted via non-classical or alternative means.
Many of the proteins that are secreted via alternative pathways are directly associated with pathogenic processes, thus their identification is of key importance. In the case of protein secretion in Gram-positive bacteria, six secretion systems that transport proteins across the cytoplasmic membrane have been reported to date: secretion (Sec), twin-arginine translocation (Tat), flagella export apparatus (FEA), fimbrilin-protein exporter (FPE), holin (hole-forming) and the WXG100 secretion system [30, 31]. However, it is important to emphasize that non-classical protein secretion should not be considered as a single mechanism but rather as a range of secretion systems that differ from classical secretion but are still not clearly characterized. This reveals limitations in both the experimental and the computational strategies currently used to identify new secretory mechanisms and highlights the importance of developing new strategies to study non-classical secretion.
This work focused on the identification of non-classically secreted proteins. It is worth noting that for some of these secreted proteins a known function has also been reported in the cytoplasm, leading to their classification as "moonlighting" or multi-functional proteins. NClassG+ identifies proteins that are secreted through signal-peptide independent pathways and was validated here based on a compiled list of extracellular proteins lacking a signal peptide. NClassG+ was compared to the two available algorithms for classifying non-classically secreted Gram-positive proteins, named SecretomeP 2.0 and SecretP 2.0.
A training set and a split set were built from a learning data set containing 420 positive proteins and 433 negative proteins with thoroughly adjusted parameters. Independently, a test set containing 82 positive examples of non-classically secreted proteins and 263 negative examples was constructed for comparing NClassG+ to the other classifiers of non-classical secretion. These data sets were the result of removing redundant proteins with more than 25% identity. Linear, polynomial and Gaussian Kernel functions were selected for building the classifiers, as a literature review indicated that these are very well explored Kernel functions. The data sets were supported by experimental reports and the necessary vector transformations were applied to them during the learning process.
A nested k-fold CV procedure was used to tune the model and compute the error separately. This was done with the aim of finding the best parameters to train the complete data set. The exploration was optimized using a grid search approach and led to proposing a classifier, which was trained independently on frequencies, dipeptides, factors and PSSM vectors as well as on all possible combinations of such vectors. The predictive behavior of NClassG+ was analyzed and contrasted against SecretomeP 2.0 and SecretP 2.0 on two occasions: once with the split set during the training process and once with the test set during a separate testing step.
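The separation between tuning and error estimation in a nested k-fold CV can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the NClassG+ pipeline: the toy one-dimensional data, the threshold learner `train_threshold`, the parameter grid and the fold counts (5 outer, 3 inner) are all hypothetical stand-ins for the SVM, its parameters and the values of k used in the study.

```python
import random


def k_fold_indices(n_samples, k, seed):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def split(X, y, held_out):
    """Return (X_train, y_train, X_test, y_test) for one held-out fold."""
    held = set(held_out)
    train = [i for i in range(len(X)) if i not in held]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in held_out], [y[i] for i in held_out])


def accuracy(model, X, y):
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)


def cv_accuracy(train_fn, X, y, param, k, seed):
    """Plain k-fold CV accuracy for one parameter value (the inner loop)."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        X_tr, y_tr, X_te, y_te = split(X, y, held_out)
        scores.append(accuracy(train_fn(X_tr, y_tr, param), X_te, y_te))
    return sum(scores) / k


def nested_cv(train_fn, X, y, param_grid, outer_k=5, inner_k=3, seed=0):
    """Outer loop estimates the error; inner loop tunes on outer-training data only."""
    outer_scores, chosen = [], []
    for held_out in k_fold_indices(len(X), outer_k, seed):
        X_tr, y_tr, X_te, y_te = split(X, y, held_out)
        best = max(param_grid,
                   key=lambda p: cv_accuracy(train_fn, X_tr, y_tr, p, inner_k, seed + 1))
        chosen.append(best)
        outer_scores.append(accuracy(train_fn(X_tr, y_tr, best), X_te, y_te))
    return sum(outer_scores) / outer_k, chosen


def train_threshold(X, y, shift):
    """Hypothetical toy learner: threshold at the midpoint of the class means, plus a shift."""
    neg = [x for x, label in zip(X, y) if label == 0]
    pos = [x for x, label in zip(X, y) if label == 1]
    threshold = (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2 + shift
    return lambda x: int(x > threshold)


# Toy, well-separated data: 30 negatives (0..29) and 30 positives (100..129).
X = list(range(30)) + list(range(100, 130))
y = [0] * 30 + [1] * 30
score, chosen = nested_cv(train_threshold, X, y, param_grid=[-200, 0, 200])
print(score, chosen)
```

The key design point is that the held-out outer fold never takes part in parameter selection, so the averaged outer accuracy is not inflated by the tuning step.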
About 15 000 hyperparameter combinations comprising feature vectors, SVM C values, and Kernel functions and their parameters were explored to select the best classifier. The optimized exploration of combinations pointed to a linear classifier combining factors, dipeptides and PSSM vectors as the one that yielded the highest accuracy in the inner loop of the nested CV procedure. The C parameter of the classifier was equal to 64. The average accuracy of the outer folds in the nested k-fold cross-validation was 0.93.
Compared to SecretomeP and SecretP, NClassG+ showed better performance both in the test with the split set after the training process and in the independent test with the test set, as indicated by its higher accuracy and MCC. The correct identification of non-classically secreted and non-secreted proteins, understood in terms of the tools' sensitivity and specificity, was notably high for NClassG+ (both values were above 0.84), thus indicating that this tool recognizes a similar proportion of both protein types, in contrast to SecretomeP 2.0 and SecretP 2.0, in which such relationships were unbalanced (Table 1).
One of the most complex areas of ML is directly associated with finding and constructing training and exploration data sets. In this study, a positive training set containing 3 794 protein sequences and a negative training set comprising 21 459 protein sequences were obtained by screening the SwissProt database. Both protein sets were balanced by adjusting the percentage of identity in each set.
In this study, the prediction of non-classically secreted proteins is based on a modification of classically secreted proteins, as proposed by Bendtsen and colleagues [29, 34]. However, here we propose novel training and exploration data sets that were stringently adjusted, as well as innovative data transformations and methods not previously used in the classification of non-classically secreted proteins.
It is important to highlight that the input data for the construction of NClassG+, SecretomeP 2.0 and SecretP 2.0 were all extracted from SwissProt (version 53.1 for NClassG+, version 44.1 for SecretomeP and version 57.7 for SecretP); therefore, there is probably some overlap between the training data sets of the three tools. Nevertheless, the diversity of protein prediction methods, the constant increase of protein data and the identification of new problems stress the importance of analyzing and extracting data to construct new hypotheses in terms of protein localization.
Different pre-processing techniques were used in the construction of the feature vectors that represented each of the sequences in the input data set. These techniques have some intrinsic computational details that can result in comparatively more expressive vectors. In the specific case of dipeptide and PSSM vectors, both types of vectors use 400 features to represent each amino acid sequence, but PSSM is evidently the vector that represents each protein more effectively. PSSM vectors have been reported to be one of the most efficient ways of representing proteins in statistical learning [16, 17, 36, 37, 38, 39, 40], but the strategy of mixing different vectors yielded even better evaluation measurements.
It is worth noting that NClassG+, SecretomeP 2.0 and SecretP 2.0 use data from two biological classes of Gram-positive bacteria (Firmicutes and Actinobacteria). However, part of the features used in SecretomeP 2.0 come from prediction methods that were trained with protein sequences belonging to biological groups other than Gram-positive bacteria, which suggests that there are common secretion mechanisms among the different biological entities. This hypothesis should nevertheless be experimentally validated, in the same way as has been done for classical secretion in Gram-positive bacteria [22, 25, 27, 41, 42, 43, 44, 45, 46].
Although both NClassG+ and SecretP 2.0 use an SVM algorithm, there are deep differences in the methodological approach followed by the two tools. They use different techniques to build their vector representations, and SecretP 2.0 performs a smaller exploration to obtain its final classifier. Yu et al. reported a lower ability of SecretP 2.0 to predict non-classically secreted Gram-positive proteins compared to SecretomeP 2.0, which also agrees with the results of NClassG+ (Table 1). However, it is particularly interesting that SecretP 2.0 was built to classify 3 protein categories (classically secreted proteins, non-classically secreted proteins and non-secreted proteins) but was validated using classical measures (sensitivity, specificity, accuracy and MCC), which are basically adequate to evaluate binary results.
For NClassG+ in particular, the linear, polynomial and Gaussian Kernel functions were explored under equal conditions during optimization. The best results were obtained using the linear function, which is consistent with reports by Ben-Hur and colleagues stating that the linear kernel provides a useful baseline and is hard to beat in many bioinformatics applications, especially when the dimensionality of the input set is large and there is a small number of samples, as occurred with NClassG+.
In order to select the best classifier, the results were optimized according to parameters, exploring different vector combinations as well as different Kernel functions. In the case of the function exploration, it is important to mention that the Gaussian function poses fewer numerical difficulties than the polynomial function because 0 < Kij ≤ 1, in contrast to the polynomial Kernel function, where values may tend to infinity as the degree of the polynomial increases. This is also reflected in the number of parameters of the polynomial function, which required a larger number of experiments than the other two functions (linear and Gaussian).
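The contrast between the bounded Gaussian Kernel and the unbounded polynomial Kernel can be checked numerically with a short sketch. The three functions below follow the usual textbook forms of these Kernels; the parameter names `gamma`, `coef0` and `degree`, their default values and the example vectors are assumptions chosen for illustration, not the settings explored for NClassG+.

```python
import math


def linear_kernel(x, y):
    """K(x, y) = <x, y>"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """K(x, y) = (gamma * <x, y> + coef0) ** degree"""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2), which lies in (0, 1]."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical feature vectors.
x, y = [1.0, 2.0], [3.0, 4.0]
print("linear:", linear_kernel(x, y))
print("gaussian:", gaussian_kernel(x, y))
for degree in (2, 5, 10):
    # The polynomial value grows quickly as the degree increases.
    print("polynomial, degree", degree, ":", polynomial_kernel(x, y, degree=degree))
```

The polynomial Kernel also carries more parameters (degree, scale and offset) than the linear or Gaussian Kernels, which is why a grid search over it involves more experiments.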
In the validation of the different classifiers proposed in this study, the results obtained by calculating the ROC showed good discrimination between false-positive and true-positive proteins. Nevertheless, it should be taken into account that ROCs characterize the potential ranges of the algorithm but not the performance of a given classifier.
This study reports the NClassG+ tool for the classification of Gram-positive bacterial proteins that are secreted independently of the classical secretory pathway. This tool has a novel training data set and is composed of a classifier based on a linear function that uses vectors built from dipeptides, frequencies and PSSM data.
Among the 4 types of vectors, the similarity-based PSSM vector was always present in the optimization process, which reflects the efficiency of this type of vector for representing protein sequences compared to the other 3 types of vectors. However, the combination of the different vector representations was a good approach to solve the classification problem, as the nested CV minimized the optimistic bias and made it possible to obtain a robust classifier.
There are still novel protein secretion and translocation mechanisms to be discovered, where the use of computational and ML methods can play a key role for elucidating new processes and discovering new biological mechanisms.
Learning and test data
The UniProtKB (version 15.5) protein database was used as reference for constructing NClassG+. This database includes several databases such as PIR-PSD, TrEMBL and SwissProt version 53.1. Among these databases, SwissProt was used for the construction of the learning and test data sets because it is publicly available and the protein sequences reported in it have gone through a careful annotation process. As of October 2009, a total of 10 424 881 proteins were reported in UniProtKB; 512 994 of these proteins had been manually annotated and reviewed, while the remaining proteins were under adjustment at that time.
Data set selection
Proteins were selected according to the systematic classification of Gram-positive bacteria reported in SwissProt version 53.1. Accordingly, bacterial proteins are classified into two large biological classes: Actinobacteria (19 897 curated proteins reported), which are characterized by a high G+C content, and Firmicutes, which have a low G+C content. As general data adjustment criteria, proteins had to be at least 50 and no more than 10 000 amino acids long. Sequences annotated as 'fragment', 'probable', 'probably', 'potential', 'hypothetical', 'putative', 'maybe' or 'likely' were excluded from the positive and negative sets.
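The general adjustment criteria above amount to a simple filter, sketched below. The function name `passes_filters`, the use of case-insensitive substring matching on a free-text annotation string and the example entries are assumptions made for illustration; the source does not state how the annotation terms were matched.

```python
# Terms and length limits taken from the criteria described in the text.
EXCLUDED_TERMS = ("fragment", "probable", "probably", "potential",
                  "hypothetical", "putative", "maybe", "likely")
MIN_LENGTH, MAX_LENGTH = 50, 10_000


def passes_filters(sequence, annotation):
    """Keep a protein only if its length is in range and no excluded term is annotated."""
    if not MIN_LENGTH <= len(sequence) <= MAX_LENGTH:
        return False
    text = annotation.lower()
    # Assumption: plain substring matching; a real pipeline might match whole words.
    return not any(term in text for term in EXCLUDED_TERMS)


# Hypothetical entries: (sequence, annotation).
entries = [
    ("M" * 120, "Secreted esterase"),
    ("M" * 120, "Putative secreted esterase"),
    ("M" * 30, "Secreted peptide"),
]
kept = [annotation for sequence, annotation in entries if passes_filters(sequence, annotation)]
print(kept)
```

Substring matching is deliberately strict: it would also reject annotations such as "unlikely", so whole-word matching may be preferable in practice.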
Adjustment of the learning and test data sets
The learning sets (training and split sets) and the test set (independent set) were adjusted using the PISCES algorithm [52, 53]. This algorithm reduces sequence redundancy based on an identity measure by making "all against all" comparisons of PSSM matrices obtained using PSI-BLAST (3 iterations, E-value: 0.0001, BLOSUM62 matrix). Only proteins with ≤25% identity were included in the learning and test data sets.
Learning and test data sets
The positive data set comprised only proteins whose annotation in SwissProt v.53.1 contained the words 'signal', 'secreted', 'extracellular', 'periplasmic', 'periplasm', 'plasma membrane', 'integral membrane' or 'single pass membrane'. This resulted in a set of 3 794 bacterial proteins that fulfilled all criteria. The sequence portion corresponding to the translocation mechanism (the N-terminal region from position 1 up to a variable point between amino acids 21 and 55) was manually removed based on the annotation reported in SwissProt [29, 34]. This procedure yielded a set of proteins that lacked a signal sequence and was applied only to this set; no other set was modified. The set was reduced to 420 proteins after adjusting its identity to ≤25%, as described above.
The negative protein set included proteins whose annotations contained the words 'cytoplasm' or 'cytoplasmic'. These selection criteria identified a total of 21 459 proteins. To obtain a negative set with experimental support, proteins were randomly divided into two sets. Ninety percent of the negative set was used for the learning process (training and split sets) of the classifiers and 10% of the negative set was used to complement the test data set. After adjusting the identity to ≤25%, the first set contained 433 proteins and the second one 263 proteins.
For the test set (independent set), an initial screening of SwissProt v.53.1 identified 178 curated, redundant proteins that are secreted despite lacking a signal sequence, which formed the positive data set. Proteins labeled with the word "secreted" in the keyword line and without the word "signal" in the feature table line were selected to construct the test set, as reported by Yu et al.; this set also included the test set reported by Bendtsen et al. 2005 for SecretomeP. The set was reduced to 82 proteins after adjusting its identity to ≤25% and was complemented with 10% of the negative set (263 proteins), which was built based on a random partition of the redundant negative set. This set was used for analyzing the predictive capacity of NClassG+ and contrasting its predictions with the results obtained with SecretomeP 2.0 and SecretP 2.0 [29, 32, 34].
Protein prediction models are frequently constructed using structural and physicochemical features extracted from amino acid sequences. Among the different types of data that can be used to construct feature-based vectors are amino acid composition or "frequencies" [36, 56], dipeptides [57, 58, 59], physicochemical features, and PSSM.
Construction and normalization
Because of methodological requirements, it is necessary to transform the variable-length protein sequences into fixed-length vectors. This step is of key importance for protein processing and classification with ML tools. All the transformations explained below produce fixed-length vectors.
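Amino acid composition vectors (frequencies)
These vectors represent each protein by the relative frequency of each of the 20 standard amino acids in its sequence, so that proteins are described using 20 features [36, 56].
Dipeptide composition vectors
These types of vectors are constructed based on the composition of dipeptides and have been extensively used to represent protein sequences [57, 58, 59]. Dipeptide composition vectors contain information regarding the frequency as well as the local order of amino acid pairs in a given sequence and describe proteins using 400 features [60, 61].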
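A dipeptide composition vector can be sketched in a few lines of Python. The alphabetical ordering of the 20 amino acids, the decision to skip pairs that contain non-standard residues and the function name `dipeptide_composition` are assumptions made for this illustration; the source only specifies that the vector has 400 features.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered amino acid pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400 relative frequencies of overlapping amino acid pairs."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues such as X
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


# Toy sequence (hypothetical): the overlapping pairs are AC, CA and AC.
vector = dipeptide_composition("ACAC")
print(len(vector), vector[DIPEPTIDES.index("AC")], vector[DIPEPTIDES.index("CA")])
```

Whatever the sequence length, the output always has 400 entries, which is what makes it usable as a fixed-length input for an SVM.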
Statistical factor vectors
In the study described by Atchley et al., a multivariate statistical analysis was carried out over the 494 physicochemical and biological attributes predetermined for each amino acid, as reported in the AAindex. That study defined a set of highly interpretable factors based on the characteristics contained in this database for representing amino acid variability. These high-dimensional attributes were summarized in the following 5 factors: (a) Factor I or polarity index, (b) Factor II or secondary structure factor, (c) Factor III, related to the molecular size or volume with high factor coefficients for bulkiness, (d) Factor IV, which reflects relative amino acid composition, and (e) Factor V, which refers to electrostatic charge with high coefficients on isoelectric point and net charge. Based on this method, proteins are represented as vectors of 100 features.
PSSM vectors (PSI-BLAST)
Profiles of biological data with evolutionary implications can be extracted using PSI-BLAST to construct profiles from the estimated PSSM [17, 64]. Basically, a PSI-BLAST search with three iterations is carried out for each protein against the non-redundant (NR) database, which contains the GenBank CDS translations, PDB, SwissProt, PIR and PRF databases. PSI-BLAST parameters have to be adjusted so that the E-value threshold corresponds to 0.001, and the BLOSUM62 substitution matrix is used. This results in a PSSM from which a vector of 400 features is obtained per sequence by collapsing rows over columns, as described in detail by Jones. The elements of these input vectors are subsequently divided by the length of the sequence and are then scaled to a range between "0" and "1" using the sigmoid function [39, 40, 65]. This method allows constructing vectors that describe proteins using 400 features. PSSMs were calculated locally using Blastpgp, after downloading the NR BLAST database, which contains 9 993 394 protein sequences.
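The collapse of an L x 20 PSSM into 400 features can be sketched as follows. This is one common reading of the procedure described above, stated as an assumption: rows are summed by the residue type found at each position, the sums are divided by the sequence length and then passed through the logistic sigmoid. The column order in `AMINO_ACIDS`, the function name `pssm_to_400` and the toy matrix are illustrative choices, not details confirmed by the source.

```python
import math

# Assumed column order, matching the usual PSI-BLAST ASCII PSSM layout.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_to_400(sequence, pssm):
    """Collapse an L x 20 PSSM into a 400-feature vector scaled to (0, 1).

    pssm[i] holds the 20 scores of sequence position i.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    # One accumulated 20-score row per residue type.
    sums = {aa: [0.0] * 20 for aa in AMINO_ACIDS}
    for residue, row in zip(sequence, pssm):
        if residue in sums:  # positions with non-standard residues are ignored
            for j, score in enumerate(row):
                sums[residue][j] += score
    length = len(sequence)
    features = []
    for aa in AMINO_ACIDS:
        for total in sums[aa]:
            # Divide by the sequence length, then scale with the sigmoid.
            features.append(1.0 / (1.0 + math.exp(-total / length)))
    return features


# Toy 2-residue example (hypothetical scores).
toy_pssm = [[2] + [0] * 19, [4] + [0] * 19]
features = pssm_to_400("AA", toy_pssm)
print(len(features), features[0], features[1])
```

Residue types that never occur in the sequence keep a sum of zero and therefore map to 0.5 after the sigmoid.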
Amino acid composition, dipeptide composition, factors and PSSM vector combinations were explored and optimized to identify which were more expressive. The output format of the vectors corresponds to the standard output of the LIBSVM software package.
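For readers unfamiliar with the LIBSVM data format, each example is written as one text line: a label followed by `index:value` pairs with 1-based feature indices. The helper below is a minimal sketch of that serialization; the name `to_libsvm_line`, the six-significant-digit formatting and the omission of zero-valued features are choices made for this illustration. Combining feature vectors, as explored in this study, then amounts to concatenating the lists before writing the line.

```python
def to_libsvm_line(label, features):
    """Serialize one example as '<label> <index>:<value> ...' (1-based indices)."""
    # Zero-valued features are omitted, which the sparse LIBSVM format allows.
    pairs = [f"{index}:{value:.6g}"
             for index, value in enumerate(features, start=1) if value != 0]
    return " ".join([str(label)] + pairs)


# Hypothetical feature vectors for one positive example.
dipeptide_part = [0.5, 0.0]
pssm_part = [0.25]
print(to_libsvm_line(1, dipeptide_part + pssm_part))
```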
Following the recommendations of Fan et al. for exploring Kernel function parameters and methods, the exploration should cover different user-defined conditions in order to obtain a broad view of the different behaviors of the classifier. Such recommendations are: (a) "Selection of parameters", which consists of cross-validating the models to be trained in order to find the set of parameters that best fits the data, the Kernel function and the type of SVM, so as to obtain the final model, and (b) "Final training", which consists of training the classifiers with the complete data set based on the best set of parameters. The linear, polynomial and Gaussian Kernel functions, as well as C-SVC for the SVMs, were explored in the construction of NClassG+.
ROC plot analysis
The final performance of NClassG+ was calculated as the total average over the subsets and was evaluated based on the standard parameters of sensitivity, specificity and accuracy [48, 68, 71].
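As a rough illustration of what a ROC plot analysis involves, the sketch below sweeps the decision threshold over classifier scores to obtain (false positive rate, true positive rate) points and integrates them with the trapezoidal rule. The function names `roc_points` and `auc`, and the toy scores and labels, are hypothetical; the source does not describe how its ROC curves were computed.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores are added together, giving a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy decision values and true labels (hypothetical).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```

The area summarizes ranking quality over all thresholds, whereas sensitivity, specificity and accuracy describe a classifier at one chosen threshold.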
Sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC)
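The parameters used to evaluate prediction methods can be threshold-dependent or threshold-independent, and each type has its own limitations. The performance of the CV and the ability of a method to predict novel sequences can be evaluated using four threshold-dependent parameters: sensitivity, specificity, accuracy and MCC. These measures were defined in terms of the following values: true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP), as follows:

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \]

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

\[ \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]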
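A minimal Python sketch of these four measures, using their standard definitions in terms of TP, FN, TN and FP, is given here. The function name `classification_metrics` and the confusion-matrix counts in the example are made up for illustration and are not results from this study.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=8, fn=2, tn=9, fp=1)
print(metrics)
```

Unlike accuracy, the MCC uses all four counts symmetrically, which makes it more informative when the positive and negative sets differ in size, as in the test set described earlier (82 positives versus 263 negatives).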
We would like to thank Nora Martinez for helping in the translation of this manuscript, and Professors Juan Carlos Galeano and Fabio Gonzalez for contributing to the construction of this method with important suggestions. We would also like to thank the Centro de Super Computación (CSC), Faculty of Engineering, Universidad Nacional de Colombia, for the computational services used to run NClassG+.
Funding: This project was supported by Asociación Investigación Solidaria SADAR, Caja Navarra (CAN) (Navarra, Spain) and the Spanish Agency for International Development Cooperation (AECID).
- 7. Leslie C, Eskin E, Noble WS: The spectrum kernel: A string kernel for SVM protein classification. Proceedings of the Pacific Symposium on Biocomputing 2002, 566–575.
- 8. Sonnenburg S, Ratsch G, Schafer C, Scholkopf B: Large scale multiple kernel learning. The Journal of Machine Learning Research 2006, 7: 1531–1565.
- 9. Vert JP, Saigo H, Akutsu T: Local Alignment Kernels for Biological Sequences. Kernel Methods in Computational Biology 2004, 131–154.
- 12. Cortes C, Vapnik V: Support-vector networks. Machine Learning 1995, 20(3):273–297.
- 19. Leversen NA, de Souza GA, Malen H, Prasad S, Jonassen I, Wiker HG: Evaluation of signal peptide prediction algorithms for identification of mycobacterial signal peptides using sequence data from proteomic methods. Microbiology 2009, 155(Pt 7):2375–2383. 10.1099/mic.0.025270-0
- 22. Vizcaino C, Restrepo-Montoya D, Rodriguez D, Nino LF, Ocampo M, Vanegas M, Reguero MT, Martinez NL, Patarroyo ME, Patarroyo MA: Computational prediction and experimental assessment of secreted/surface proteins from Mycobacterium tuberculosis H37Rv. PLoS Comput Biol 2010, 6(6):e1000824. 10.1371/journal.pcbi.1000824
- 30. Bendtsen JD, Wooldridge KG: Bacterial Secreted Proteins: Secretory Mechanisms and Role in Pathogenesis. Norfolk, UK: Caister Academic Press; 2009.
- 40. Ruchi V, Ajit T, Sukhwinder K, Grish V, Gajendra R: Identification of Proteins Secreted by Malaria Parasite into Erythrocyte using SVM and PSSM profiles. BMC Bioinformatics 2008, 9.
- 45. Tjalsma H, Antelmann H, Jongbloed JDH, Braun PG, Darmon E, Dorenbos R, Dubois JYF, Westers H, Zanen G, Quax WJ, et al.: Proteomics of protein secretion by Bacillus subtilis: separating the "secrets" of the secretome. Microbiology and Molecular Biology Reviews 2004, 68(2):207–233. 10.1128/MMBR.68.2.207-233.2004
- 59. Idicula-Thomas S, Kulkarni AJ, Kulkarni BD, Jayaraman VK, Balaji PV: A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli. Bioinformatics 2006, 22(3):278–284. 10.1093/bioinformatics/bti810
- 66. Tao T: Standalone PSI/PHI-BLAST: blastpgp. NCBI 2007. [http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastpgp.html]
- 68. Fan RE, Chen PH, Lin CJ: Working set selection using second order information for training support vector machines. The Journal of Machine Learning Research 2005, 6: 1918.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.