Graph pyramids for protein function prediction
Uncovering the hidden organizational characteristics and regularities among biological sequences is key to a detailed understanding of the underlying biological phenomena. Pattern recognition from nucleic acid sequences is therefore important for protein function prediction. Because proteins from the same family exhibit similar characteristics, homology-based approaches predict protein functions via protein classification. However, conventional classification approaches mostly rely on global features, considering only strong protein similarity matches, which leads to a significant loss of prediction accuracy.
Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers the local as well as the global features, by examining the interactions among 'weakly interacting proteins' in the PPS network and by using hierarchical graph analysis via the graph pyramid. Different underlying properties of the protein families are uncovered by operating the proposed graph based features at various pyramid levels.
Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using the graph pyramid improves computational efficiency as well as protein classification accuracy. Quantitatively, among 14,086 test sequences, the proposed method misclassified only 21.1 sequences on average, whereas the baseline BLAST-score-based global feature matching method misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. It thus achieved more than 96% protein classification accuracy using only 20% per class training data.
Keywords: Protein classification, Protein homology, Protein-Protein similarity network, Network biology
The life of an organism is encrypted in the sequence of its genome, but decrypting the genetic information depends on the functions of the proteins that it encodes. Assigning biological or biochemical roles to proteins poses many challenges. Knowing just the amino-acid sequence and structure of a protein does not guarantee that we can predict everything about that protein. However, these measures are a good starting point for quickly predicting protein functions with the help of known homology. Many proteins have totally unknown functions, and whole-genome sequencing projects are major sources of these. An approach based on protein homology is therefore a fast, approximate and primary way to tackle the daunting task of protein function prediction. The rationale is that two proteins with similar sequence or structure could have evolved from a common ancestor and thus have similar functions.
The homology of protein sequences is usually found by assessing the similarity between pairs of sequences. An optimal algorithm based on dynamic programming, like Needleman-Wunsch, is computationally inefficient for searching similar sequences in a large protein database. So most of the existing methods use suboptimal algorithms like BLAST for matching a pair of sequences. Searching for only the highest scoring match in a protein database amounts to looking only for the global feature in the sequence similarity space.
In attempts to overcome the above limitations, various matching methods have been developed that use features extracted from the multiple sequence alignment (MSA) of the protein family sequences. These methods use sequence templates and sequence profiles as features. They require an accurate MSA of related sequences with low residue identities, which requires some domain expertise. These profile-based methods use ad hoc scoring systems without any associated evolutionary meaning, unlike PAM or BLOSUM. These factors limit the use of these methods on large protein databases.
Pattern recognition based only on local features can be useful for analyzing large amounts of data in real time, as in abnormal event detection from video. Using only local features is a trade-off between speed and accuracy. Domains like biometrics and bioinformatics require high recognition accuracy as well as reliability, so methods that fuse local and global features have improved recognition performance.
The MSA of a protein family reveals selective pressures for the conservation of specific residues with evolutionary functional importance. Some MSA regions seem to tolerate insertions and deletions while others tend to remain conserved. So position-specific features from the MSA are desirable when searching databases for homologies. Profile HMM, a generative model used widely for protein sequence classification, uses these position-specific features. It uses global as well as local features by considering multiple sequences at the same time. But it lacks quick and incremental training functionality. After a test sequence has been classified by profile HMM, updating the training model requires the MSA to be recalculated and the new model parameters to be re-estimated from it. MSA is a time-consuming process and also limits the number of sequences that can be used for sequentially updating the training model. Unlike profile HMM, in the proposed graph-based method incremental training is performed easily and quickly. Test sequences can be added, either sequentially or all at once, to the original graph to produce the new trained graph. This puts no limit on the number of test sequences that can be added to update the training model. In fact, the more correct new sequences are added, the better the model becomes.
Methods based on similarity clustering, k-nearest neighbors, phylogenetic clustering and gene fusion analysis look for closely interacting sequences near the query sequence. They fail to account for interactions within the closely interacting neighborhood. This leaves room for further performance improvement.
Intermediate sequence search (ISS) has also been successfully used for detecting remote homology. For sequences whose homology cannot be established by a direct comparison, ISS attempts to relate them through a third sequence that interacts weakly with both. Thus, for the detection of remotely related protein sequences, the use of intermediate sequences is known to increase the predictive power significantly. However, the use of intermediate sequences can propagate errors dramatically when they do not share the same function. Excessive inclusion of false positives can be effectively controlled by using graph theory. For example, Kim and Lee used biconnectedness and articulation points to control the false positives effectively in an iterative manner. However, the relationships among sequences become very complicated as the number of sequences increases, so these relationships should be defined at multiple levels in a systematic manner. Our graph pyramid approach is therefore an important solution for tackling the above issues. The proposed method tightly controls false positives by considering strong interactions (global features) as well as all weak interactions (local features) in the graph in a hierarchical manner.
Protein-Protein Interaction (PPI) plays a critical role in many biological processes. A protein expresses its functions when it interacts with other proteins, so PPI is vital information for protein function prediction. Conversely, understanding protein functions is critical for understanding the various biological processes. PPI is modeled as a network in which protein sequences represent the nodes and biological protein interactions form the edges. Protein function prediction methods based on PPI are promising and achieve high performance, but the availability of high-throughput PPI data is an essential requirement for them [20, 21]. So we propose a graph-based protein classification method, which requires only amino-acid protein sequences as input data. An edge in the graph is constructed by using the protein sequence similarity measure instead of PPI, producing the new Protein-Protein Similarity (PPS) network.
Motivation and contributions
Relationships among biological sequences can be effectively represented by building a PPS network. However, in protein families, modularity, local clustering and scale-free topology coexist, so the use of a single graph for modeling them has not been efficient. Here we propose the graph pyramid approach, where multiple graph features are used at different levels for modeling protein families. In addition, the proposed algorithm, which uses a hierarchical voting scheme, tries to blend important characteristics of the PPI network and ISS methods for protein classification. This makes it possible to predict protein functions more objectively and reasonably, with high accuracy.
This paper is organized as follows. In the section 'Methods', we present the PPS network construction approach and its modeling via graph pyramid. The same section also discusses the important graph topological or graph structured features and the hierarchical protein classification algorithm. Experimental evaluation results of the proposed method, their comparison with existing prediction methods and the techniques of searching for the optimal algorithm parameters are presented in section 'Results'. We conclude our work with discussions about related issues and the direction of our future work.
The performance comparison given in the section 'Results' shows the importance of the EB-score over the Bitscore.
The proposed algorithm uses the graphical modeling of protein sequences from each protein family/class. Consider the large protein database having M classes. Let the set of class labels be given as . Consider one of the protein classes, , then let be the i th training sequence in that class. Similarly is the new query or test sequence and its class label c q is what we have to find.
These edges form a set . Now the graph of a protein class is given by G(V c , E c ). This is a weighted and an undirected graph. An edge weight is nothing but the strength of protein sequence similarity.
To construct the graph of each protein class, we only need to consider the interactions (EB-scores) among all protein sequences within that class. The number of proteins in a class is far smaller than that of the entire database. So the graphs of all protein classes can be easily and independently constructed by using protein-protein BLAST within the corresponding classes.
In protein similarity graphs, modularity, local clustering and scale-free topology coexist. Explaining this phenomenon requires a hierarchy, so the graphs are analyzed in a hierarchical manner. At each hierarchical level, the edges with weights lower than a certain threshold are pruned, and the surviving edges are then considered to be weightless. So the graph structure changes along the hierarchical levels, and the graph becomes unweighted while remaining undirected. This hierarchical analysis helps to extract different graph features for weakly similar hits (sequence matches) and thus captures the complex relationship between sequence similarity and protein function.
The corresponding graph is given as . For notational simplicity, let us represent it as G(V c , E c , t), instead of and note that G(V c , E c ) = G(V c , E c , 0).
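The level-wise pruning described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function name `pyramid_levels`, the representation of a class graph as a symmetric NumPy weight matrix, and the toy weights and thresholds are all assumptions made here. The sketch keeps an edge only when its weight is strictly greater than the threshold; how ties at exactly the threshold are handled is likewise an assumption.

```python
import numpy as np

def pyramid_levels(weights, thresholds):
    """Return {t: unweighted adjacency matrix} for each pruning threshold t.

    weights    : symmetric (n x n) array of pairwise similarity scores
                 (hypothetical stand-in for the edge weights of one class graph)
    thresholds : iterable of pruning thresholds, one per pyramid level
    """
    w = np.asarray(weights, dtype=float)
    levels = {}
    for t in thresholds:
        adj = (w > t).astype(int)   # prune weak edges, drop the weights
        np.fill_diagonal(adj, 0)    # no self-loops
        levels[t] = adj
    return levels

# Toy class graph with three sequences (illustrative weights only).
w = np.array([[0.0, 0.9, 0.2],
              [0.9, 0.0, 0.5],
              [0.2, 0.5, 0.0]])
levels = pyramid_levels(w, thresholds=[0.1, 0.4, 0.8])

# Each undirected edge appears twice in the symmetric matrix.
edge_counts = {t: int(a.sum()) // 2 for t, a in levels.items()}
print(edge_counts)
```

As the threshold rises, fewer edges survive, so the higher pyramid levels retain only the strongest similarities.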
and note that , thus added set contains repetitive elements when . For example, let and then . This operation obeys the associative and the commutative laws like numerical addition. As defined earlier, is the query sequence from the unknown class c q . Now , and edges among the vertices in the set is given by an edge set . After adding to the original graph G(V c , E c ), we will get the new graph .
Graph Structured Features (GSF)
Most real world and biological (scale-free) networks communicate via a few highly connected nodes known as hubs. These hubs determine the properties of the networks. In real world networks like airline route maps, the important cities form hubs. Proteins with high degrees of connectedness are more likely to be essential for survival than proteins with lesser degrees. Gene duplication leads to growth and preferential attachment in biological networks. This leads to translating the proteins having high similarity. This shows the possibility of hub formation in the protein family graphs, G(V c , E c ), in the similarity space.
Figure 2(b) shows the building of the PPS network with a varying threshold for one of the COG protein families. As the threshold is lowered, more edges are trivially formed, but most of them are associated with only particular nodes (hubs). Thus the hubs become stronger and more evident in the PPS network. We are not interested in a detailed assessment of whether or not the network is scale-free (i.e. has a power-law degree distribution), but the above analysis guides us in finding proper features which take the graph structure (i.e. the complex relationships among the protein sequences) into account. Also, different protein families have different characteristics, so the use of a single graph feature may not be effective. Features are selected such that they can extract different but vital network information.
Average Clustering coefficient (AC)
Rich Club coefficient (RC)
Star Motifs (SM)
Whether the graph is dense or sparse, TR can be computed in the same time, since the computation depends only on a matrix of the same size. TR inherently captures different network properties than SM; Figure 3(c) illustrates this difference. The query node q interacts with the nodes a, b, c and d, so they constitute an 'interacting neighborhood' of the query. TR can simultaneously assess the interactions within this interacting neighborhood. After the query interaction, one extra triangle is formed, since only the nodes b and c were previously interacting, whereas at node a the SM power has increased from 3 to 4. The formation of new triangles in a graph indicates that the query interacts simultaneously with nodes that were already interacting. The number of triangles newly formed due to the query interaction, ΔTR(c, t), is found similarly to equation (8).
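The triangle example above can be reproduced with a short sketch. It relies on the standard identity that an undirected, unweighted graph with adjacency matrix A contains trace(A^3)/6 triangles; whether the original implementation uses this identity is not stated in the text, so it is an assumption here. The helper names, the node indices and the three extra neighbours x, y and z (added only so that node a starts with degree 3, reading 'SM power' as the degree of a) are hypothetical.

```python
import numpy as np

def count_triangles(adj):
    """Number of triangles in an undirected, unweighted graph: trace(A^3) / 6."""
    a = np.asarray(adj, dtype=np.int64)
    return int(np.trace(a @ a @ a)) // 6

def add_query(adj, neighbours):
    """Return a new adjacency matrix with one extra node linked to `neighbours`."""
    n = len(adj)
    out = np.zeros((n + 1, n + 1), dtype=np.int64)
    out[:n, :n] = adj
    for v in neighbours:
        out[n, v] = out[v, n] = 1
    return out

# Nodes: a=0, b=1, c=2, d=3; x=4, y=5, z=6 are assumed extra neighbours of a.
adj = np.zeros((7, 7), dtype=np.int64)
for u, v in [(1, 2), (0, 4), (0, 5), (0, 6)]:   # only b-c interact in the neighbourhood
    adj[u, v] = adj[v, u] = 1

with_q = add_query(adj, neighbours=[0, 1, 2, 3])  # q interacts with a, b, c, d
delta_tr = count_triangles(with_q) - count_triangles(adj)
degree_a_before = int(adj[0].sum())
degree_a_after = int(with_q[0].sum())
print(delta_tr, degree_a_before, degree_a_after)
```

The single new triangle is q-b-c, while the degree of a grows by one regardless of how the neighbours of the query relate to each other.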
Graph Energy (GE)
Since GE depends only on the adjacency matrix, the density of the graph does not affect its computation time. The effect of the query interaction on the original graph structure is captured by the change in GE (i.e. ΔGE(c, t)), defined similarly to equation (8). We are dealing with graphs whose edge strength is the similarity between the connected nodes, which is the inverse of the usual edge length definition. So we need to look for the maximum ΔGE(c, t).
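As a rough illustration of selecting the class with the maximum ΔGE, the sketch below uses the standard definition of graph energy as the sum of the absolute eigenvalues of the adjacency matrix. The two toy class graphs, the class labels and the helper names are assumptions made for this example, not taken from the paper.

```python
import numpy as np

def graph_energy(adj):
    """Graph energy: sum of absolute eigenvalues of the adjacency matrix."""
    a = np.asarray(adj, dtype=float)
    return float(np.abs(np.linalg.eigvalsh(a)).sum())

def delta_ge(adj, neighbours):
    """Change in graph energy after linking a query node to `neighbours`."""
    n = len(adj)
    grown = np.zeros((n + 1, n + 1))
    grown[:n, :n] = adj
    for v in neighbours:
        grown[n, v] = grown[v, n] = 1.0
    return graph_energy(grown) - graph_energy(adj)

# Two hypothetical class graphs at one pyramid level, each a single edge.
single_edge = np.array([[0.0, 1.0],
                        [1.0, 0.0]])
deltas = {
    "class_1": delta_ge(single_edge, neighbours=[0, 1]),  # query closes a triangle
    "class_2": delta_ge(single_edge, neighbours=[0]),     # query extends a path
}
best = max(deltas, key=deltas.get)
print(deltas, best)
```

Here the query that attaches to both members of a class changes the energy more than the query that attaches to only one, so `class_1` is selected.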
Graphs are analyzed hierarchically (as discussed in section 'Graph analysis'), and the threshold plays an important role in forming the hierarchical graph structure. If the query can interact at a higher layer of the GP (see Figure 2(a)), the interaction is strong and accounts for the global feature. This is because the threshold increases as the GP level rises, and at every level a graph edge is formed only if its strength is greater than the given threshold. Here interactions are in the sequence similarity space, so a strong interaction means the highest similarity between the corresponding sequences, which occurs when both sequences match at most of their residues (i.e. match globally). Hence strong interactions account for the global feature and, similarly, weak interactions account for local features.
The maximum value of H(·) is 1, so all the arguments (classes c) are assigned to the candidate class set whenever it produces output 1. To reduce FP, the subroutine for SM and TR changes slightly (see algorithm 2). Here we look for the k classes in the candidate set that are maximally influenced by the query, so only these classes are assessed (voted on) again by the features SM and TR. The subroutine for GE finds the candidate class for which ΔGE(·) is maximum. This produces a hierarchical class voting scheme which helps to improve the classification accuracy and to reduce the computational load.
Each GSF can extract different information from different levels of the protein family GP. However, applying each GSF to the entire GP is computationally inefficient when dealing with large numbers of protein families. In addition, it may add FP when a decision is made at a GP level much lower than the level defined by t ∗ . To avoid these problems, the GP-based hierarchical voting scheme is necessary. The rationale behind placing different GSF at different GP levels is explained in the 'Results' section.
Algorithm 1 Graph pyramid search subroutine
input: ; secondary threshold T AC ; primary threshold , where t i >t j for i >j
2: for t ∈ t AC from t n to t1 do
3: if then
4: for all classes do
5: I AC = H (ΔAC(c,t)−(T AC ))
7: t* = t
8: end for
9: end if
10: end for
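Because several steps of the listing above did not survive, the following Python sketch shows only one plausible reading of the pyramid search: scan the levels from the highest threshold down and stop at the first level where at least one class has ΔAC above the secondary threshold T AC. The stopping rule, the strict inequality used for H(·), the use of the standard mean local clustering coefficient for AC, and all function and variable names are assumptions made for this illustration.

```python
import numpy as np

def average_clustering(adj):
    """Mean local clustering coefficient of an undirected, unweighted graph."""
    a = np.asarray(adj, dtype=float)
    deg = a.sum(axis=1)
    tri = np.diag(a @ a @ a) / 2.0      # triangles through each node
    pairs = deg * (deg - 1) / 2.0       # neighbour pairs at each node
    local = np.divide(tri, pairs, out=np.zeros_like(tri), where=pairs > 0)
    return float(local.mean())

def grow(weights, query_weights):
    """Append the query as a new last node of a weighted class graph."""
    w = np.asarray(weights, dtype=float)
    n = len(w)
    out = np.zeros((n + 1, n + 1))
    out[:n, :n] = w
    out[n, :n] = out[:n, n] = np.asarray(query_weights, dtype=float)
    return out

def pyramid_search(class_weights, query_weights, thresholds, t_ac):
    """Return (candidate classes, level threshold) for the first level,
    scanning from the top of the pyramid, where any class reacts to the query."""
    for t in sorted(thresholds, reverse=True):
        candidates = []
        for c, w in class_weights.items():
            before = average_clustering(np.asarray(w) > t)
            after = average_clustering(grow(w, query_weights[c]) > t)
            if after - before > t_ac:    # H(delta_AC - T_AC) equals 1
                candidates.append(c)
        if candidates:
            return candidates, t         # t plays the role of t*
    return [], None                      # no decision at any level

# Toy data: two classes of two sequences each (illustrative weights only).
class_weights = {
    "A": np.array([[0.0, 0.9], [0.9, 0.0]]),
    "B": np.array([[0.0, 0.9], [0.9, 0.0]]),
}
query_weights = {"A": [0.85, 0.8], "B": [0.3, 0.0]}
result = pyramid_search(class_weights, query_weights,
                        thresholds=[0.2, 0.5, 0.7], t_ac=0.1)
print(result)
```

In this toy case the query closes a triangle in class A already at the highest level, so the search stops there and class B is never examined at the lower levels.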
Dataset and evaluation details
The proposed method is evaluated on the entire COG database, the protein database of Clusters of Orthologous Groups (COG). It consists of 4,873 COGs (protein families), with a total of 138,458 proteins from 66 different genomes. Approximately 10% of the sequences from each COG are selected randomly, which produced 14,086 test sequences. This procedure is repeated a further 5 times to obtain the average performance. First, each GSF is tested independently for various thresholds, without using the hierarchical voting scheme. Here, the class which produces maximum change in the GSF for a given , is selected as the output. These output labels are produced with either correct decision (cd), wrong decision (wd) or no decision (nd, when ). Then let the performance measure be defined as, .
Rationale behind hierarchical voting
Algorithm 2 Classification by hierarchical voting
training: all G(Vc,Ec) are constructed
input: , primary threshold set , secondary thresholds T AC , T RC
output: c q
6: \∗ get the class label having the highest frequency ∗\
7: ψ q = mode
8: \∗ resolve the indecisive case step by step ∗\
9: if |ψ q | ≥ 2 then
13: if then
17: end if
19: c q = ψ q
20: end if
Sometimes, for a query having subtle interactions with many classes, it is difficult for all GSF to reach a unique agreement about c q . When this happens, the threshold will already have hit the bottom of its range. Thus, following the earlier reasoning, the solution is to rely only on GE to find c q , which is what step 14 of Algorithm 2 does.
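The final decision rule, taking the most frequent class label and falling back on GE when the features disagree, can be sketched as follows. This is a simplified illustration: it collapses the intermediate tie-resolution steps of Algorithm 2 (which are not fully recoverable from the listing) into a single fallback, and the function name, the dictionary layout and the class labels are hypothetical.

```python
from collections import Counter

def vote(feature_votes, ge_vote):
    """Pick the most frequent class label among the feature votes.

    feature_votes : dict mapping a feature name to the class it voted for,
                    or None when that feature made no decision
    ge_vote       : class chosen by graph energy alone, used to break ties
    """
    counts = Counter(v for v in feature_votes.values() if v is not None)
    if not counts:
        return ge_vote                  # nobody decided: rely on GE
    top = max(counts.values())
    modes = [c for c, n in counts.items() if n == top]
    return modes[0] if len(modes) == 1 else ge_vote

clear = vote({"AC": "COG1", "RC": "COG1", "SM": "COG2"}, ge_vote="COG2")
tied = vote({"AC": "COG1", "RC": "COG2"}, ge_vote="COG2")
print(clear, tied)
```

A clear majority is returned directly, and GE is consulted only when the mode is not unique.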
Deciding algorithmic parameters
Figure 5(b) shows the precision vs. threshold plots for SM and RC with different parameters. This helps to decide the optimal parameters (p ∗ , r ∗ ) for them at a particular threshold. In the implementation, p ∗ = 2 for all GP levels, while r ∗ is set to 3 for lower and 5 for higher GP levels. Other thresholds are set using the validation set by analyzing the maximum values for each GSF. And each set in the has the uniformly quantized numbers from 0 to maximum GSF value.
Figure 6 shows the normalized time taken by different classifiers for testing 14,086 sequences. In the majority voting scheme, all GSF first classify each sequence from the large pool of testing sequences, and only then does the voting begin, which slows down the scheme. In the proposed GP hierarchical scheme, on the other hand, the testing pool is gradually shrunk, so the subsequent GSF have to investigate only a small set of sequences, which is likely to contain the true protein class. This in turn speeds up the proposed algorithm while maintaining high accuracy.
Average number of protein sequences misclassified (out of 14,086 testing sequences from the COG database) by the different protein classification methods.
TR without GP
TR with GP hierarchy
Highest scoring BLAST match 
Boujenfa et al. (using ClustalW) 
Proposed majority voting of GSF (No Hie)
Proposed GP based hierarchical voting (GP Hie)
Discussion and conclusions
As discussed in the background section, we took an approach based on protein homology for protein function prediction. Under this approach the entire task boils down to protein classification, because two proteins with similar sequence or structure could have evolved from a common ancestor and thus have similar functions. So once we classify a protein into its true family, we can easily ascertain its probable functions from the characteristics of its family. We took this approach because it is a fast, approximate and primary way to tackle the daunting task of predicting the functions of a large number of proteins.
This paper proposes a novel protein classification method based on PPS network modeling using the proposed EB-scores. It tries to blend important characteristics of the PPI network and ISS methods for protein classification. The importance of the method is that it exploits the topological structural information of the PPS network, using hierarchical network analysis guided by the graph pyramid. This helps to analyze the different protein interactions at different pyramid levels. Thus the information necessary for protein classification that comes from weak interactions in the PPS network is not suppressed by the strong interactions, and the proposed features extract different network properties at various pyramid levels. This makes it possible to predict the protein class more objectively and reasonably.
The hierarchical voting algorithm helps to improve the computational efficiency while maintaining high classification accuracy. Some of the salient features of the proposed method are: protein sequences as the only input requirement; fast and easy incremental learning; the ability to show topologically how the query sequence interacts with the protein family; quick training; and high performance. The proposed graph-based modeling has the extra advantage that the relationships between protein families can also be found from the corresponding inter-graph similarities. The experimental evaluation on the COG database demonstrated the effectiveness of the proposed method.
This graph pyramid approach is also promising to use in the PPI network and various other graph based bioinformatics methods. Protein characteristics like 3D structure and presence of various domains, along with sequence similarity measure can be used for more efficient protein network construction. Our future work will try to address these issues.
Publication of this article has been funded by Next-Generation Information Computing Development Program through the National Research Foundation of South Korea (NRF), funded by the Ministry of Science, ICT and Future Planning (No.NRF-2012M3C4A7033341). It was also partially sponsored by BK 21 Plus program.
This article has been published as part of BMC Medical Genomics Volume 8 Supplement 2, 2015: Selected articles from the 4th Translational Bioinformatics Conference and the 8th International Conference on Systems Biology (TBC/ISB 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/8/S2.
- 2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology. 1990
- 3. Sandhan T, Sonowal S, Choi JY: Audio bank: A high-level acoustic signal representation for audio event recognition. Int Conf on Control, Automation and Systems (ICCAS). 2014, IEEE
- 4. Sandhan T, Choi JY: Sandhan, t and yoo, y and yoo, h and yun, s and byeon, m. Int Conf on Advanced Video and Signal-Based Surveillance (AVSS). 2014, IEEE
- 5. Sandhan T, Choi JY: Frequencygrams and multi-feature joint sparse representation for action and gesture recognition. Int Conf on Image Processing (ICIP). 2014, IEEE
- 6. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Research. 2001
- 8. Luthy R, Xenarios I, Bucher P: Improving the sensitivity of the sequence profile method. Protein Sci. 1994
- 9. Henikoff S: Scores for sequence searches and alignments. Current Opinion in Structural Biology. 1996, 353-360.
- 10. Wheeler D: Selecting the right protein-scoring matrix. Curr Protoc Bioinformatics. 2002
- 11. Sandhan T, Srivastava T, Sethi A, Choi JY: Unsupervised learning approach for abnormal event detection in surveillance video by revealing infrequent patterns. Int Conf on Image and Vision Computing New Zealand (IVCNZ). 2013, IEEE
- 12. Sandhan T, Chang HJ, Choi JY: Abstracted radon profiles for fingerprint recognition. Int Conf on Image Processing (ICIP). 2013, IEEE
- 13. Soding J: Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005
- 14. Pellegrini M, Marcotte E, Thompson M, Eisenberg D, Yeates T: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999
- 15. Enright A, Iliopoulos I, Kyrpides N, Ouzounis C: Protein interaction maps for complete genomes based on gene fusion events. Nature. 1999
- 16. John B, Sali A: Detection of homologous proteins by an intermediate sequence search. Protein Sci. 2004, 54-62.
- 17. Park J, Teichmann SA, Hubbard T, Chothia C: Intermediate sequences increase the detection of homology between sequences. J Mol Biol. 1997
- 18. Kim S, Lee J: BAG: a graph theoretic sequence clustering algorithm. Int J Data Min Bioinformatics. 2006
- 19. Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nature Reviews Genetics. 2004
- 20. Chua HN, Wong L: Predicting protein functions from protein interaction networks. Bio Data Mining in Protein Int Net. 2009
- 21. Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nature Biotech. 2000, 1257-1261.
- 22. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature. 2001, 411-412.
- 23. Bhan A, Galas DJ, Dewey TG: A duplication growth model of gene expression networks. Bioinformatics. 2002
- 24. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science. 2002
- 25. Colizza V, Flammini A, Serrano MA, Vespignani A: Detecting rich-club ordering in complex networks. Nature Physics. 2006
- 26. Gutman I: The energy of a graph. Steiermarkisches Mathematisches Symposium. 1978, 100-105.
- 27. Boujenfa K, Essoussi N, Limam M: Tree-kNN: A tree-based algorithm for protein sequence classification. Int Journal on Comp Sci and Engineering (IJCSE). 2011
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.