Introduction

In bioinformatics research there are many classification problems related to peptides/proteins (e.g. subcellular localization (Chou and Shen 2007) and protein–protein interactions (Nanni and Lumini 2006a)), so it is important to search for and evaluate methods for reliably extracting features from peptides/proteins (Brusic et al. 2002).

In the literature, most amino acid sequence descriptors are based on a vectorial representation. In Nanni and Lumini (2006b), for example, each amino acid is represented by a 20-dimensional vector that is all zeros except for the position corresponding to the considered amino acid, which takes on the value of a given physicochemical property; the feature vector that describes a peptide is obtained by concatenating the vectors that describe its amino acids. Another well-known method for extracting features from peptides/proteins is Chou’s pseudo amino acid (PseAA) composition (Chou and Cai 2006) and its many variants, e.g. physicochemical distance (Chou 2000), digital code (Gao et al. 2005) and digital signal (Xiao and Chou 2007).

A completely different class of descriptors has been proposed, based on kernels. Kernels used for amino acid representation include the Fisher kernel (Jaakkola et al. 1999), proposed for remote homology detection; the mismatch string kernel (Leslie et al. 2004), which performs similarly to the Fisher kernel at a lower computational cost; and a whole class of kernels specifically proposed for predicting protein subcellular localization (Lei and Dai 2005).

To design vaccines that are useful for a large population, it is very important to predict the peptides that bind multiple human leukocyte antigen (HLA) molecules (Brusic et al. 2002). Developing automatic systems for predicting whether a peptide binds multiple HLA molecules makes vaccine design more time-efficient. Examples of automatic systems proposed in the literature include those based on support vector machines (SVMs) (Bozic et al. 2005), artificial neural networks (ANNs) and hidden Markov models (HMMs) (Brusic et al. 2004). For a good survey of automatic systems that predict whether a peptide binds multiple HLA molecules, see Brusic et al. (2004).

Acquired immune deficiency syndrome (AIDS) is considered one of the most devastating diseases humankind has ever faced, with more than 60 million people currently infected. It is well known throughout the scientific community that the HIV-1 protease is essential for the replication of HIV, the virus that causes AIDS. A protease is an enzyme that cleaves proteins into their component peptides. The HIV-1 protease hydrolyzes viral polyproteins into functional protein products that are essential for viral assembly and subsequent activity. Inhibition of this protease prevents maturation of HIV particles and is thus a feasible way of blocking the viral life cycle. However, there is a major problem in synthesizing chemically modified inhibitors that can bind the active site of the HIV-1 protease: the protease cleaves at different sites with little or no sequence similarity.

HIV protease-susceptible sites in a given protein extend to an octapeptide region, the amino acid residues of which are sequentially symbolized by eight subsites, P4P3P2P1P1′P2′P3′P4′, and the counterparts in the HIV protease are symbolized by S4S3S2S1S1′S2′S3′S4′. According to the “lock and key” paradigm, if the amino acids in P (the “key”) fit the positions in S (the “lock”), then the protease will cleave the octamer between positions P1 and P1′. Recently, several algorithms based on machine learning have been employed to learn this “lock and key” rule from a set of experimental observations. Studies specifically proposed to approach the HIV-1 protease problem using ANNs (specifically, feed-forward multilayer perceptrons) include Thompson et al. (1995), Cai and Chou (1998) and Narayanan et al. (2002). Better results have been obtained using linear SVMs (Rögnvaldsson and You 2003). Recently, a Web server (available at http://www.csbio.sjtu.edu.cn/bioinf/HIV/) has been created that predicts HIV-1 protease cleavage sites given a protein sequence (Shen and Chou 2008).

A bottleneck in the early development of machine learning techniques for this problem has been the lack of a large and reliable dataset that could be used to evaluate and compare existing approaches. One of the most commonly used datasets of the last 10 years, the HIV-1 PR 362 dataset (Rögnvaldsson and You 2003), has recently been shown to be unreliable (Rögnvaldsson et al. 2007): rules about the features most important for protease classification turned out to hold only for HIV-1 PR 362, and some findings on the most selective physicochemical properties for classification, extracted by Nanni and Lumini (2006b) from HIV-1 PR 362, were found not to hold for a newly developed dataset, HIV-1 PR 1625 (Kontijevskis et al. 2007). Very recently, a new and larger dataset (HIV-1 PR 3261) (Schilling and Overall 2008) has been collected. The most important methods published in the bioinformatics literature are summarized in Table 1, together with brief notes on each.

Table 1 The most important methods are summarized

In this paper, we propose new techniques for using texture descriptors to represent a peptide. In particular, we compare two different methods for constructing a matrix representation of the peptide, and we use several variants of the discrete cosine transform (DCT) and local binary pattern (LBP) descriptors. We find that the best results are obtained using the 25 DCT coefficients with the highest variance in the training data and a random subspace (RS) ensemble of 50 dominant local ternary pattern (LTP) descriptors.

The remainder of this paper is organized as follows. In Sect. 2, we describe our system and introduce the peptide descriptors examined in this work. In Sect. 3, we report the experimental results. Finally, in Sect. 4 we draw a number of conclusions.

System description

The aim of this work is to study the behavior of different peptide descriptors and their combination for HIV-1 protease cleavage site prediction and for the prediction of peptides that bind HLA molecules. All the approaches evaluated in the experimental section can be roughly schematized as in Fig. 1. Given a peptide represented as a fixed sequence of amino acids, feature extraction is performed to obtain a fixed-length descriptor. Feature selection is then sometimes performed to optimize the feature extraction parameters. Finally, an SVM is used for classification. We describe steps 2 and 3 in more detail below.

Fig. 1 A general schema of the proposed approach

Feature extraction and selection

A peptide is a sequence of Len consecutive amino acid letters, each of which can be numbered 1, 2,…,20 according to the order of the 20 native amino acids: A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V. Fixed-length descriptors for peptide representation include the orthonormal representation (OR) (Rögnvaldsson and You 2003) and physicochemical encoding (PE) (Nanni and Lumini 2006b). We describe each of these below, as well as the texture descriptors (TD) extracted from matrix representations of the peptide (Nanni and Lumini 2010): LBP (and its variants dominant LBP and LTP) and DCT.

In OR, each amino acid is mapped into a sparse orthonormal vector space using a 20-bit vector with 19 bits set to zero and the bit related to the amino acid position set to one. The peptide is represented by the concatenation of these vectors, giving Len × 20 features (Rögnvaldsson and You 2003).

In PE, each amino acid of the sequence is coded using a 20-dimensional vector with 19 values set to zero and the value related to the amino acid position set to the value of a fixed physicochemical property of the given amino acid (see Fig. 2) (Nanni and Lumini 2006b). The set of physicochemical properties is obtained from the amino acid index database (Kawashima and Kanehisa 2000). An amino acid index is a set of 20 numerical values representing one of the physicochemical properties of amino acids. This database currently contains 544 such indices and 94 substitution matrices. The number of possible encoding representations of each amino acid is therefore NP = 544 + 94, the number of different physicochemical properties obtained from the amino acid index database (Kawashima and Kanehisa 2000). An index function index(j, d) can be defined as a function that returns the value of the property d for the amino acid j (d ∈ [1..NP], j ∈ [1..20]).
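As an illustrative sketch only (not the implementation used in the cited works), the PE descriptor can be computed as follows in Python/NumPy; the dictionary prop_d below is a hypothetical placeholder for one AAindex property, not a real index entry.

```python
import numpy as np

# Order of the 20 native amino acids used in this paper
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
AA_TO_POS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def physicochemical_encoding(peptide, prop):
    """PE: each residue becomes a 20-dim vector that is zero everywhere except
    at the residue's position, which holds the property value index(j, d); the
    peptide descriptor is the concatenation of these vectors (Len x 20 values)."""
    vectors = []
    for aa in peptide:
        v = np.zeros(len(AMINO_ACIDS))
        v[AA_TO_POS[aa]] = prop[aa]
        vectors.append(v)
    return np.concatenate(vectors)

# Placeholder values standing in for one AAindex property (hypothetical)
prop_d = {aa: (i + 1) / 20 for i, aa in enumerate(AMINO_ACIDS)}
print(physicochemical_encoding("SLNLRETN", prop_d).shape)  # (160,) for an octamer
```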

Fig. 2 Example of physicochemical encoding representation

In TD, a matrix representation of a peptide is used to represent the information related to both the positions of the amino acids in the sequence and their physicochemical properties (Nanni and Lumini 2010). A feature extraction method is then applied.

We have examined two methods for representing a peptide as a matrix. The first is based simply on the physicochemical properties, where each element (i, j) of the texture that represents a peptide (given a physicochemical property d) is the following:

$$ {\text{OM}}_{d} (i,j) = index(amin(i),d) + index(amin(j),d) $$

where OM_d ∈ ℜ^{Len×Len} is a square matrix whose dimension equals the peptide length, amin(i) returns the index of the amino acid in position i, and index(j, d) returns the value of the property d for the amino acid j.
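A minimal sketch of this first matrix construction (Python/NumPy; the dictionary prop is a hypothetical stand-in for the AAindex values of one property d) could be:

```python
import numpy as np

def om_matrix(peptide, prop):
    """OM_d(i, j) = index(amin(i), d) + index(amin(j), d), a Len x Len matrix.
    prop maps each amino acid letter to its value for property d."""
    vals = np.array([prop[aa] for aa in peptide])
    return vals[:, None] + vals[None, :]
```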

The second method for representing a peptide as a matrix is based on the Hasse matrix, reported in Nanni and Lumini (2010). The Hasse matrix representation can be calculated for each given physicochemical property d according to the following procedure. The 20 amino acids are sorted according to the value of d, and a “ranking value” is assigned to each (Feng and Wang 2008). The first amino acid has value 1/20, the second 2/20, and so on up to 1, provided that no two amino acids share the same value; otherwise, the ranks are divided by the number of distinct values of d in the sequence. For example, if the 20 amino acids are sorted in the following way according to d: N < K < R < Y < F = Q < S < H < M < W < G = L < V < E < I < A < D < T < P < C, then the corresponding ranking values are rank_d(N) = 1/18, rank_d(K) = 2/18,…,rank_d(C) = 1. Given a physicochemical property d and a peptide, its representation matrix OM_d ∈ ℜ^{Len×Len} is a square matrix whose dimension equals the peptide length, the elements of which are obtained as follows:

$$ {\text{OM}}_{d} (i,j) = \frac{1}{2}\left[ rank_{d} (amin(i)) + rank_{d} (amin(j)) \right],\quad i,j \in [1..Len] $$

where the function amin(i) returns the index of the amino acid in position i.
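A minimal sketch of the ranking step and of the resulting Hasse-style matrix (Python/NumPy; prop is again a hypothetical dictionary holding one property d):

```python
import numpy as np

def rank_values(prop):
    """Assign ranks k/m to the sorted distinct values of property d, where m is
    the number of distinct values; tied amino acids share the same rank."""
    distinct = sorted(set(prop.values()))
    m = len(distinct)
    rank_of_value = {v: (k + 1) / m for k, v in enumerate(distinct)}
    return {aa: rank_of_value[v] for aa, v in prop.items()}

def hasse_matrix(peptide, prop):
    """OM_d(i, j) = 0.5 * (rank_d(amin(i)) + rank_d(amin(j)))."""
    ranks = rank_values(prop)
    r = np.array([ranks[aa] for aa in peptide])
    return 0.5 * (r[:, None] + r[None, :])
```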

The texture descriptors are extracted from OM_d. The first texture descriptor used in our experiments, LTP (Tan and Triggs 2007), is a variant of LBP (Ojala et al. 2002), a texture operator defined by considering the binary difference between the value of a cell, x, in a matrix and the values of its P neighbors, placed on a circle of radius R. LBP is illustrated in Fig. 3. The resulting pattern associated with x is uniform if the number of transitions between “0” and “1” in the sequence is less than or equal to two. The LBP operator can be made rotation invariant by performing P-1 bitwise shift operations and selecting the smallest value. The final descriptor is the histogram that measures the occurrence of each type of uniform pattern and the number of non-uniform patterns contained in the whole matrix.
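For illustration, a rotation-invariant uniform LBP histogram of a peptide matrix can be obtained with scikit-image as sketched below; this is not the authors' implementation, and it assumes OM_d is available as a NumPy array.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(mat, P=8, R=1):
    """With method='uniform', scikit-image assigns one label to each of the
    P + 1 rotation-invariant uniform patterns and a single extra label to all
    non-uniform patterns, giving a (P + 2)-bin histogram."""
    codes = local_binary_pattern(mat, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
    return hist
```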

Fig. 3 LBP neighborhood sets for different (P, R), where P is the number of points and R is the radius of the neighborhood

A problem with conventional LBP is that it is sensitive to noise in near-uniform matrix regions. LTP overcomes this problem using a “trick” in the binarization process: in LTP the difference, bd(x, u), between a cell x and its neighbor u is encoded by three values according to a threshold τ:

$$ bd({\mathbf{x}},{\mathbf{u}}) = \begin{cases} 1 & {\mathbf{u}} \ge {\mathbf{x}} + \tau \\ 0 & {\mathbf{x}} - \tau \le {\mathbf{u}} < {\mathbf{x}} + \tau \\ -1 & \text{otherwise} \end{cases} $$
(1)

The ternary pattern is then split into two binary patterns by considering both its positive and negative components (see Fig. 4). Finally, the histograms that are computed from the binary patterns are concatenated to form the final descriptor. In this study, we have extracted and concatenated the histograms (as suggested in Ojala et al. (2002)) obtained using the following parameters: (P = 8; R = 1); (P = 16; R = 2), where P is the number of points and R is the radius of the neighborhood.
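A simplified sketch of the LTP split for a 3 × 3 neighborhood (P = 8, R = 1) is given below; for brevity it histograms the raw 256 binary codes instead of the uniform patterns used in the paper.

```python
import numpy as np

def ltp_descriptor(mat, tau=0.1):
    """Encode each 3x3 neighborhood with Eq. (1), split the ternary code into
    its positive and negative binary components, and concatenate the two
    256-bin code histograms."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = mat[1:-1, 1:-1]
    pos_code = np.zeros(center.shape, dtype=np.int32)
    neg_code = np.zeros(center.shape, dtype=np.int32)
    for bit, (di, dj) in enumerate(offsets):
        neigh = mat[1 + di:mat.shape[0] - 1 + di, 1 + dj:mat.shape[1] - 1 + dj]
        pos_code |= (neigh >= center + tau).astype(np.int32) << bit
        neg_code |= (neigh < center - tau).astype(np.int32) << bit
    h_pos = np.bincount(pos_code.ravel(), minlength=256)
    h_neg = np.bincount(neg_code.ravel(), minlength=256)
    return np.concatenate([h_pos, h_neg])
```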

Fig. 4 An example of splitting a ternary code into positive and negative LBP codes

Liao et al. (2009) propose a method they call dominant LBP that utilizes only the LBP patterns that account for 80% of all pattern occurrences in the training data. In our experiments, we combine both the dominant LBP and LTP descriptors.
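The dominant-pattern selection can be sketched as follows (an illustrative approximation of the idea of Liao et al. (2009), assuming the LBP codes of all training matrices have been pooled into one non-negative integer array):

```python
import numpy as np

def dominant_patterns(train_codes, coverage=0.80):
    """Return the most frequent LBP codes whose cumulative frequency over the
    training data reaches the requested coverage (80% by default)."""
    counts = np.bincount(train_codes)
    order = np.argsort(counts)[::-1]
    cum = np.cumsum(counts[order]) / counts.sum()
    n_dominant = int(np.searchsorted(cum, coverage)) + 1
    return order[:n_dominant]
```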

The second texture descriptor used in our experiments is the DCT, as used in Nanni and Lumini (2010). DCT (Pan et al. 2000) is a texture descriptor that expresses a sequence of points in terms of a sum of cosine functions oscillating at different frequencies. DCT is similar to the discrete Fourier transform, except that it uses real numbers only. It has good information packing ability since most DCT components are typically very small in magnitude. This is because most of the salient information exists in the coefficients with low frequencies. As a result, when compared to other input independent transforms, DCT has the advantage of packing the most useful information into the fewest coefficients. In our experiments, we retain the DCT coefficients with highest variance considering only the training data.
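A sketch of this variance-based selection of DCT coefficients (Python/SciPy; the normalization used by the authors is not specified, so an orthonormal 2-D DCT is assumed):

```python
import numpy as np
from scipy.fftpack import dct

def dct2(mat):
    """2-D type-II DCT with orthonormal scaling."""
    return dct(dct(mat, axis=0, norm="ortho"), axis=1, norm="ortho")

def select_coefficients(train_mats, n_keep=25):
    """Positions of the n_keep DCT coefficients with the highest variance,
    estimated on the training matrices only."""
    coeffs = np.array([dct2(m).ravel() for m in train_mats])
    return np.argsort(coeffs.var(axis=0))[::-1][:n_keep]

def sdc_features(mat, keep):
    """SDC descriptor: the selected DCT coefficients of one peptide matrix."""
    return dct2(mat).ravel()[keep]
```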

Classification

As illustrated in Fig. 1, the classifier used in our system is the SVM, the same classifier used in Nanni and Lumini (2006b). The aim of SVM (Duda and Hart 1973) is to find the equation of a hyperplane that divides the training set into two classes, leaving all the points of one class on the same side while maximizing the distance between the two classes and the hyperplane. The basic two-class SVM formulation takes as input an implicit embedding Φ and a labeled training set {x_i} and returns the hyperplane w^TΦ(x) + b = 0 that best separates the training samples of the two classes (see Fig. 5).

Fig. 5 A graphical representation of the SVM hyperplane

When the patterns that belong to the training set are not linearly separable, a different kernel [e.g. a polynomial kernel or radial basis function (RBF) kernel] can be used to map the input vectors into a higher dimensional feature space where an optimal hyperplane can be constructed.
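For illustration, training an RBF-kernel SVM on peptide descriptors might look as follows (scikit-learn, with toy random data standing in for real descriptors; the actual SVM implementation and parameter tuning used in this work are not specified here):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data standing in for peptide descriptors (e.g. 25 selected DCT coefficients)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 25))
y_train = rng.integers(0, 2, size=100)    # 1 = cleavable/binder, 0 = otherwise
X_test = rng.normal(size=(20, 25))

clf = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # continuous scores, later used for AUC and fusion
```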

Experimental results

In this section, we describe the results of our experiments using the proposed methods. We compare the performance of our peptide descriptors on two databases: the vaccine and the HIV-1 PR 1625 datasets.

The following two datasets are used for validating our proposed techniques:

  • Vaccine (VAC) (Bozic et al. 2005), which contains peptides from five HLA-A2 molecules that bind or do not bind multiple HLA molecules. The testing protocol suggested in Bozic et al. (2005) has been adopted, which is a “five-molecule” cross-validation method, where all the peptides related to a given molecule are used as the testing set and all the peptides related to the other four molecules form the training set (see Table 2 for details).

    Table 2 Number of binders (B) and non-binders (NB) in training and testing sets for HLA-A2
  • HIV-1 PR 1625 dataset (HIV) (Kontijevskis et al. 2007), which contains 1,625 octamer protein sequences: 374 HIV-1 protease cleavable sites and 1,251 uncleavable sites. In this dataset, tenfold cross-validation is used to assess the performance.

To compare the different approaches, we use the area under the receiver operating characteristic curve (AUC). This performance indicator is a scalar measure ranging between 0 (worst performance) and 1 (best performance). It can be interpreted as the probability that the classifier will assign a higher score to a randomly picked positive sample as opposed to a randomly picked negative sample (Qin 2006). The main advantage of using AUC instead of accuracy as the performance indicator is that it takes into account the scores of the classifiers.
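As a small worked example of this interpretation (scikit-learn; toy labels and scores):

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0]               # ground-truth labels
scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1]   # classifier scores
# 8 of the 9 positive/negative pairs are ranked correctly -> AUC = 8/9 ~= 0.89
print(roc_auc_score(y_true, scores))
```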

In Tables 3 and 4, we compare the following methods:

Table 3 AUC obtained using the matrix representation of the peptide proposed in Nanni and Lumini (2010), based on the Hasse matrix. LBP, DCT and DAU as reported in Nanni and Lumini (2010)
Table 4 AUC obtained using the matrix representation proposed in this paper
  • LBP/DCT/DAU, the methods based on LBP, DCT and Daubechies wavelet used in Nanni and Lumini (2010);

  • SDC, the 25 DCT coefficients with higher variance in the training data;

  • LBU, where all the uniform patterns (not only the rotation-invariant ones, as in LBP) are retained for training the SVM;

  • DLB, the dominant LBP;

  • DLT, the dominant LTP with τ = 0.1;

  • RST, a random subspace (RS) ensemble of 50 DLT descriptors.

In the following tables, average (avg) and maximum (max) performance are obtained on the pool of physicochemical properties.

Since LBP and LBU obtain similar performance with the matrix representation of the peptide based on the Hasse matrix and with the representation based on the substitution matrix, and considering that the substitution matrix is easier to implement, we have extended our experiments to include some of the other LBP variants using the substitution matrix representation.

From the results reported in Tables 3 and 4, the following observations can be made:

  • SDC, where the coefficients are selected using the variance, outperforms DCT.

  • The variants of LBP that we tested outperform the standard rotation invariant uniform LBP (notice the performance difference between LBP and RST).

  • The performance gap between LBP and RST, which are based on the same idea, shows how important it is to optimize the neighborhood-based descriptors. Notice that in this application problem the neighborhood-based descriptors permit studying the correlation among neighboring amino acids (see the definition of LBP in Sect. 2).

In the methods described below, a descriptor X is named H_X when it is extracted from the peptide representation proposed in Nanni and Lumini (2010), based on the Hasse matrix, and S_X when it is extracted from the matrix representation proposed in this paper.

In Table 5, we compare the following methods:

Table 5 Comparison among PE and some fusion approaches
  • PE, a standard method for extracting features from the amino acid sequence described in Sect. 2;

  • FUS1, fusion by sum rule between H_SDC and S_RST;

  • FUS2, fusion by sum rule among H_SDC, S_RST and PE;

  • FUS3, fusion by weighted sum rule among H_SDC, S_RST and PE, where the weights of H_SDC and S_RST are 1 while the weight of PE is 3.

In Table 6, we report the results obtained by combining different SVMs, each trained on one of 50 descriptors built by changing the physicochemical property used to construct the matrix representation of the peptide. For example, the column SDC reports the AUC obtained by combining 50 different SDC descriptors, each built using a different (randomly extracted) physicochemical property. The 50 SVMs (one for each SDC) are then combined using the sum rule. In Table 6, the results in parentheses are the average AUC obtained considering only one physicochemical property. As Table 6 shows, a simple random selection of 50 physicochemical properties improves the performance of the texture descriptor representation.
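A sketch of this sum-rule combination (scikit-learn; one SVM per randomly chosen physicochemical property, with predict_proba scores averaged; the exact score normalization used in this work is not specified):

```python
import numpy as np
from sklearn.svm import SVC

def sum_rule_ensemble(train_sets, y_train, test_sets):
    """train_sets[k] / test_sets[k]: training and test descriptors built from
    the k-th (randomly chosen) physicochemical property. One SVM is trained
    per property and the test scores are averaged (sum rule)."""
    fused = np.zeros(len(test_sets[0]))
    for X_tr, X_te in zip(train_sets, test_sets):
        clf = SVC(kernel="rbf", gamma="scale", probability=True)
        clf.fit(X_tr, y_train)
        fused += clf.predict_proba(X_te)[:, 1]
    return fused / len(train_sets)
```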

Table 6 AUC obtained by combining with sum rule 50 descriptors each extracted from a different physicochemical property

The most interesting results reported in Table 6 are:

  • Combining descriptors based on different physicochemical properties is very useful for the texture-based approaches (see the AUC improvement of H_SDC and S_RST when 50 descriptors are combined instead of using a single physicochemical property); in the standard approach (i.e. PE), by contrast, the improvement obtained by combining 50 descriptors is much less interesting. This is because different physicochemical properties lead to different matrices being extracted from the same peptide (see also Table 8, where the Q-statistic between H_SDC and S_RST descriptors based on different physicochemical properties is reported). Fig. 6 shows some samples of the textures extracted from the same peptide using different physicochemical properties.

    Fig. 6 Some samples of textures extracted from the same peptide (‘SLNLRETN’) using different physicochemical properties (reported under the figures)

  • In the VAC dataset, the texture-based approach H_SDC outperforms PE; the weighted sum rule method FUS3 obtains good performance on both the HIV and VAC datasets, with an accuracy of 97.0% in the HIV dataset and 82% in the VAC dataset.

Now we report, in Fig. 7, the specificity/sensitivity curve obtained by PE (the dark line) and our proposed method FUS3 (the gray line).

Fig. 7 Specificity/sensitivity curve obtained by PE (the dark line) and our proposed method FUS3 (the gray line)

These plots also confirm our previous conclusions about the usefulness of combining our new approaches with a standard method based on the amino acid sequence.

As a further experiment, we investigated the relationship among the different feature extraction approaches tested in this paper by evaluating the error independence between the classifiers trained using those features. Table 7 reports the average Yule’s Q-statistic (Kuncheva and Whitaker 2003) in the tested datasets for each pair of feature vectors. For two classifiers G_i and G_j, the a posteriori Q-statistic is defined as:

Table 7 Yule’s Q-statistic obtained using different peptide descriptors
$$ Q_{i,j} = {\frac{{N^{11} N^{00} - N^{01} N^{10} }}{{N^{11} N^{00} + N^{01} N^{10} }}} $$

where N^{ab} is the number of instances in the test set classified correctly (a = 1) or incorrectly (a = 0) by classifier G_i, and correctly (b = 1) or incorrectly (b = 0) by classifier G_j. Q varies between −1 and 1; Q_{i,j} = 0 for statistically independent classifiers. Classifiers that tend to recognize the same patterns correctly will have Q > 0, and those which commit errors on different patterns will have Q < 0. In this problem, the Q-statistic values are low enough (Kuncheva and Whitaker 2003) to validate the idea of combining the descriptors.
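A direct implementation of this measure (Python/NumPy; it assumes the denominator is non-zero):

```python
import numpy as np

def yule_q(correct_i, correct_j):
    """correct_i, correct_j: boolean arrays marking which test patterns are
    classified correctly by classifiers G_i and G_j, respectively."""
    correct_i = np.asarray(correct_i, dtype=bool)
    correct_j = np.asarray(correct_j, dtype=bool)
    n11 = np.sum(correct_i & correct_j)
    n00 = np.sum(~correct_i & ~correct_j)
    n10 = np.sum(correct_i & ~correct_j)
    n01 = np.sum(~correct_i & correct_j)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

print(yule_q([1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 1]))  # 0.5
```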

The results reported in Table 7 clearly show that both H_SDC and S_RST provide information different from that of PE. The main difference between PE and the texture-based approaches is that PE does not extract information on the correlation of neighboring amino acids (PE transforms the peptide into a vector representing each amino acid with its physicochemical value, see Sect. 2), while in the texture-based approaches each local neighborhood is given by a set of amino acids.

Moreover, in Table 8, we report the Q-statistic among H_SDC and S_RST descriptors obtained using different physicochemical properties. It is clear that different physicochemical properties yield slightly different descriptors; for this reason, their fusion improves the performance with respect to that obtained by a stand-alone descriptor (see Table 6).

Table 8 Yule’s Q-statistic among H-SDC and S-RST descriptors based on different physicochemical properties

As a further test, we report some results (see Table 9) showing how many observations are required to build a good predictor with the proposed encodings. We run the tests on the HIV dataset with different numbers of training patterns (75, 50 and 25%). Ten experiments are performed (the training patterns are randomly extracted) and the average results are reported. From these results, it is clear that the new encodings need the same number of training patterns as a standard method such as PE.

Table 9 AUC obtained in the HIV dataset with different numbers of training patterns

Conclusion

In this paper, we have proposed new methods for describing peptides, starting from a matrix representation of the peptide. Two different methods for constructing the matrix representation are compared, and different texture descriptors are extracted from these matrices.

Our tests on two different problems, HIV-1 protease cleavage site prediction and the prediction of peptides that bind HLA molecules, show that the proposed system outperforms previous methods based on a matrix representation of the peptides (Nanni and Lumini 2010). Moreover, fusion with the standard physicochemical encoding representation is studied. The most useful practical finding of this work is that texture descriptors extracted from the matrix representation of the peptides and standard amino acid descriptors should be combined to obtain a more reliable method. Moreover, several tests using the Q-statistic are performed to study the correlation among the tested methods. These results further confirm that texture-based and standard sequence-based amino acid methods provide different information.

In the future, we would like to study a number of different texture descriptors, as well as better methods for combining them. Moreover, we would like to test our best approach, FUS3, on other datasets to confirm the good performance of the weighted sum rule for combining texture-based and sequence-based peptide descriptors.