Robust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix
Selective pressures at the DNA level shape genes into profiles consisting of patterns of rapidly evolving sites and sites withstanding change. These profiles remain detectable even when protein sequences become extensively diverged. A common task in molecular biology is to infer functional, structural or evolutionary relationships by querying a database using an algorithm. However, problems arise when sequence similarity is low. This study presents an algorithm that uses the evolutionary rate at codon sites, the dN/dS (ω) parameter, coupled to a substitution matrix as an alignment metric for detecting distantly related proteins. The algorithm, called BLOSUM-FIRE, couples a newer and improved version of the original FIRE (Functional Inference using Rates of Evolution) algorithm with an amino acid substitution matrix in a dynamic scoring function. The enigmatic hepatitis B virus X protein was used as a test case for BLOSUM-FIRE and its associated database EvoDB.
The evolutionary rate based approach was coupled with a conventional BLOSUM substitution matrix. The two approaches are combined in a dynamic scoring function, which uses the selective pressure to score aligned residues. The dynamic scoring function is based on a coupled additive approach that scores aligned sites based on the level of conservation inferred from the ω values. Evaluation of the accuracy of this new implementation, BLOSUM-FIRE, using MAFFT alignments as reference alignments showed that it is more accurate than its predecessor, FIRE. Comparison of the alignment quality with widely used algorithms (MUSCLE, T-COFFEE, and CLUSTAL Omega) revealed that the BLOSUM-FIRE algorithm performs as well as conventional algorithms. Its main strength is that it provides greater potential for aligning divergent sequences and addresses the problem of low specificity inherent in the original FIRE algorithm. The utility of this algorithm is demonstrated using the Hepatitis B virus X (HBx) protein, a protein of unknown function, as a test case.
This study describes the utility of an evolutionary rate based approach coupled to the BLOSUM62 amino acid substitution matrix in inferring protein domain function. We demonstrate that such an approach is robust and performs as well as an array of conventional algorithms.
Keywords: Evolutionary Rate, Substitution Matrix, Identity Score, Codon Site, Reference Alignment
The initial steps when investigating phylogenetic relationships or protein functions usually rely on performing accurate sequence alignments. Typically, tools such as BLAST are employed to search a biological database like GenBank. Once a statistically significant match has been made with a protein of known function, hypotheses concerning the putative function or evolutionary history can then be generated for the query sequence. Challenges arise when sequence similarity is low. Conventional alignment approaches are not sufficiently robust to detect homology in rapidly evolving sequences, evolutionarily distant organisms or in sequences that have nucleotide and amino acid biases. Sequences may share important functional and evolutionary relationships in the range of low similarity, for example in the region of 20 to 30 % (the “twilight zone” of sequence alignment [4, 5]) or even when similarities are as low as 15 % [6, 7]. Structural data are frequently used as the standard of truth in these circumstances; however, structural comparisons are often computationally challenging and the available data are limited. In the absence of structural data, performance measures based on amino acid residue matches or percentage identity are used to compare the alignment quality of algorithms. However, residue based performance measures are flawed and biased, as they miss similarities that can only be detected by structural or evolutionary based approaches.
Selective pressure is used to indicate the effects of natural selection on genes and can be used as an indicator of molecular evolution. Evolutionary pressures resulting from natural selection at the DNA level have been found to mould genes into patterns of sites that are highly conserved (resistant to change) and those that are poorly conserved (evolving rapidly). The level of conservation can be inferred from the ratio of the non-synonymous substitution rate (dN or Ka) to the synonymous substitution rate (dS or Ks), corrected for opportunity. This ratio is loosely referred to as the evolutionary rate and is represented by the parameter ω or dN/dS. It has been demonstrated that these patterns of evolutionary rates, or “evolutionary fingerprints”, can be used as a similarity metric in a sequence alignment algorithm. A novel alignment algorithm, Functional Inference using Rates of Evolution (FIRE), was developed to address the low similarity challenge. FIRE uses the evolutionary rate (ω = dN/dS) at codon sites, rather than individual amino acid residue identities, to align sequences, thus circumventing the problem of low sequence similarity. FIRE alignments provided a proof of concept that the evolutionary rate could be used as an alignment metric and that, in some cases at least, sequences with similar selection pressure profiles at codon sites have functional similarity. These findings supported the hypothesis that protein domains under similar selective pressures, measured through the evolutionary rate, may be responsible for similar functions. It has also been suggested that the distribution of the evolutionary rates on a gene could be used in an approach that is analogous to homology searching using a query sequence. Aligning sequences based on their evolutionary rate profiles could therefore offer an additional method for testing functional and evolutionary relationships.
One of the major limitations of the FIRE approach, however, was the finding of numerous false positives, particularly when two unrelated highly conserved domains were aligned . This study evaluates the current version of the algorithm (FIRE) and describes the implementation and evaluation of the new, more robust algorithm, which we call BLOSUM-FIRE.
In this study, the evolutionary rate was coupled with a standard BLOSUM substitution matrix in a dynamic scoring function. In doing so the problem of false positives was addressed. The new algorithm (BLOSUM-FIRE) performs as well as the MAFFT, T-COFFEE, MUSCLE and CLUSTAL Omega algorithms. An evolutionary rates database (EvoDB) is reported and described elsewhere (manuscript accepted in Database) and can be queried with FIRE data. As a test case, the enigmatic hepatitis B virus X protein (HBx) was examined with BLOSUM-FIRE. The Hepatitis B Virus (HBV) has been implicated in diseases such as hepatocellular carcinoma (HCC), chronic hepatitis and liver cirrhosis, affecting millions of people worldwide. One of the challenges in understanding the biology of HBV has been the failure to elucidate the numerous functions of the HBx protein. Experiments have failed to conclusively identify the role played by the protein in the hepadnavirus life cycle. Consequently, its structure and function have sparked controversy and speculation. This controversy is attributed to a lack of homology to any known protein in biological databases, exacerbated by the fact that the structure has defied conventional structure determination methods. Here we provide alignment evidence that the protein may harbour viral endopeptidase functions.
The FIRE algorithm
The FIRE algorithm is implemented in the Python programming language. It finds the optimal alignment of two amino acid sequences using the Bayes Empirical Bayes (BEB) maximum likelihood estimates (MLEs) of the evolutionary rate (ω = dN/dS) at codon sites. FIRE requires two multiple sequence alignments (MSAs) of nucleotide sequences with their corresponding phylogenetic trees to generate a pairwise amino acid alignment. A modified Needleman-Wunsch algorithm determines the global alignment between the sequences based on a dynamic programming (DP) approach. The MSA files and their corresponding phylogenetic tree files are used as input for the CODEML program found in the Phylogenetic Analysis by Maximum Likelihood (PAML) suite of software to produce the BEB MLEs of ω at codon sites. This pre-processing using CODEML provides the rst output files, which are subsequently used as input for the FIRE algorithm to generate alignments.
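The dynamic programming step described above can be illustrated with a minimal sketch of a Needleman-Wunsch global alignment over two lists of per-site ω values. The site score `omega_similarity`, the single linear gap penalty and all function names are assumptions made for this example only; they are not the FIRE scoring function, which is not reproduced in this text.

```python
def omega_similarity(a, b):
    """Hypothetical site score: 1.0 for identical rates, approaching -1.0 as they diverge."""
    diff = abs(a - b)
    return 1.0 - 2.0 * diff / (diff + 1.0)


def global_alignment(profile_a, profile_b, gap=0.5):
    """Needleman-Wunsch over two omega profiles with a linear gap penalty (assumed).

    Returns the optimal score and a list of (index_a, index_b) pairs,
    where None marks a gap in that profile.
    """
    n, m = len(profile_a), len(profile_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Borders are filled cumulatively so the traceback comparisons are exact.
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + omega_similarity(profile_a[i - 1], profile_b[j - 1])
            up = score[i - 1][j] - gap
            left = score[i][j - 1] - gap
            score[i][j] = max(diag, up, left)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + omega_similarity(profile_a[i - 1], profile_b[j - 1])
        ):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] - gap):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs
```

For instance, `global_alignment([0.1, 0.1, 2.0], [0.1, 2.0])` pairs the two sites with ω = 2.0 and places one of the conserved sites opposite a gap.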
For the initial evaluation of the algorithm, data sets used in the concept paper were utilised. To evaluate the BLOSUM-FIRE algorithm, evolutionary rate profiles of the Pfam (Protein family) database were compiled into the evolutionary rates database (EvoDB), described elsewhere (manuscript accepted in Database).
To demonstrate the utility of the database, evolutionary profiles of the enigmatic HBx protein were used. HBx was chosen because functional inference has been elusive. HBx curated nucleotide sequences were obtained from . An alignment of 20 nucleotide sequences of HBx was generated using the CLUSTAL Omega (ver. 1.2.0) program and phylogenetic trees were inferred using the FastTree program (ver. 2.1.7). MLEs of ω were determined using the CODEML program in the PAML (ver. 4.4) suite of software.
To infer the domain functions of HBx the EvoDB database was used. EvoDB (www.bioinf.wits.ac.za/software/fire/evodb) is a database of 98 % of the gapped nucleotide sequence alignments for the PFAM-A database. It provides the evolutionary rate (ω = dN/dS) profiles determined under the M2a model (CODEML algorithm in the PAML suite) for 97 % of the PFAM domains. The clustering of proteins into families in PFAM using domain functions provided a suitable framework for implementing a searchable database for inferring domain functions. The database was compiled for use by BLOSUM-FIRE. Briefly, BEB ω MLEs at codon sites were calculated using the CODEML program (PAML ver. 4.4) under the M2a model (NSsites = 2). This parameter assumes one ratio for all the branches and allows for the detection of positive selection at codon sites. The ω MLEs were extracted from the rst CODEML output file and used to compile the ω MLE profiles for each dataset. This rst file contains supplemental results including: Naive Empirical Bayes (NEB) probabilities for the site classes of ω, a list of positively selected sites, log likelihood values and the BEBs. These rst files for PFAM domain analysis using CODEML under the M1a and M2a models are available for download on EvoDB. Individual domains form functional units and are less variable than multi-domain proteins. The evolutionary rates at codon sites across discrete domains provide signature profiles of ω values, which may be used for homology detection. An independent study of RNA viruses based on a novel model of sequence evolution also demonstrated that protein-coding regions were moulded into sites of varying selective pressure.
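To make the M2a set-up described above more concrete, the following sketch renders a minimal CODEML control file as text. Only the site model (NSsites = 2) and the single ω ratio across branches (model = 0) come from the description above. The file names, the codon frequency choice, the starting ω and the helper name `codeml_control` are placeholders assumed for illustration, and a real analysis would typically set further options.

```python
def codeml_control(seqfile, treefile, outfile, nssites=2):
    """Render a minimal codeml control file for a site-model run (illustrative)."""
    settings = {
        "seqfile": seqfile,    # codon alignment (placeholder name)
        "treefile": treefile,  # phylogenetic tree (placeholder name)
        "outfile": outfile,    # main results file (placeholder name)
        "runmode": 0,          # use the user-supplied tree
        "seqtype": 1,          # codon sequences
        "CodonFreq": 2,        # F3x4 codon frequencies (assumed choice)
        "model": 0,            # one omega ratio for all branches
        "NSsites": nssites,    # 2 selects the M2a site model
        "fix_omega": 0,        # estimate omega rather than fixing it
        "omega": 1,            # starting value for omega (assumed)
    }
    lines = [f"{key:>10} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(codeml_control("domain.phy", "domain.tre", "domain.out"))
```

Passing `nssites=1` instead would select the M1a model, which is the other model for which rst files are made available.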
To evaluate the accuracy of the new BLOSUM-FIRE algorithm against the widely used CLUSTAL Omega, MUSCLE and T-COFFEE algorithms, with MAFFT as a reference, 20 datasets were used. The datasets consisted of ω profiles of unrelated and related domains using nucleotide sequence data for the PFAM-A domains obtained from EvoDB.
Evaluation of the FIRE algorithm with real data
The limitation of the FIRE algorithm identified in the concept paper was that the algorithm produced false positive results in certain datasets. Those data sets where the FIRE algorithm produces false positives were identified and analysed, and the correlation between the statistical distribution of the ω values and the quality of alignments produced was investigated. To further explore the problem posed by false positives, 100 alignments of evolutionary rate profiles of the PFAM-A database were investigated. Alignments were generated using the FIRE algorithm and the number of residues aligned was counted as a measure of the quality of alignment. Alignments were then generated using the CLUSTAL Omega algorithm (ver. 1.2.0) and the MAFFT algorithm (ver. 7.130b) using default parameters. Each alignment was scored using the number of matched residues normalised by the maximum sequence length, reported as the identity score. Our identity score is equivalent to the percentage identity score (PID) normalised between 0 and 1 for intuitive comparison to a FIRE algorithm score. To assess the false positive rate we defined a false positive as any alignment with a FIRE score above 0.6 for functionally unrelated domains. This threshold was adopted from the concept study, where domains with this score or higher were inferred to share similar domain functions. In such scenarios, structural data could provide a standard of truth for evaluating alignment quality; however, the requirement for MSAs in BLOSUM-FIRE and the limited availability of domain structures made this impossible in this study.
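The identity score and the false positive criterion described above can be written down in a few lines. This is a sketch under stated assumptions: "matched residues" is taken to mean identical, non-gap residues in the same column, "maximum sequence length" is taken to be the ungapped length of the longer sequence, and the function names are hypothetical.

```python
def identity_score(aligned_a, aligned_b, gap="-"):
    """Matched residues divided by the longer ungapped sequence length (0 to 1)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same number of columns")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    length_a = sum(1 for x in aligned_a if x != gap)
    length_b = sum(1 for y in aligned_b if y != gap)
    longest = max(length_a, length_b)
    return matches / longest if longest else 0.0


def is_false_positive(fire_score, functionally_related, threshold=0.6):
    """Flag a FIRE score above the threshold for functionally unrelated domains."""
    return (not functionally_related) and fire_score > threshold
```

With the toy alignment `identity_score("MK-LV", "MKALI")`, three columns match and the longer sequence has five residues, so the score is 3/5.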
Evaluation of FIRE using simulated data
The FIRE algorithm had reduced specificity in highly conserved data sets resulting in false positives. Simulated data sets were therefore created to investigate false positives further. Highly conserved datasets were created such that the ω values were in the range [0,0.02] across all coding sites. Simulations were carried out using real and truncated datasets generated by a combination of custom scripts and manual editing of input files to insert sites of positive selection. Alignments were generated using FIRE default parameters for gap open and gap extension penalties of 0.5 and 0.1, respectively. The simulated datasets were then aligned using the MAFFT and CLUSTAL Omega algorithms.
Coupling FIRE with a BLOSUM substitution matrix
Evaluation of the false positive results identified in the concept paper revealed that the homogeneity of highly conserved data, where the variance of ω values was low, resulted in reduced specificity. To address this challenge, the new algorithm called BLOSUM-FIRE was implemented by incorporating the identity of the amino acids using the BLOSUM62 substitution matrix. Recently, the efficacy of substitution matrices has been questioned with the development of search approaches that do not utilise standard amino acid substitution matrices. For example, the CS-BLAST tool is a search approach in which context-specific (CS) substitution matrices were incorporated into the BLAST algorithm. At the same time, MIQS (Matrix to Improve Quality in Similarity search), a more robust matrix, has been developed from numerous matrices using principal component analysis for detecting remote homologies, for example, in transmembrane regions.
The BLOSUM-FIRE algorithm proceeds through the following steps:
1. Determine the BLOSUM amino acid score s(i,j) from the BLOSUM matrix using the identity of the aligned residues.
2. Determine the selective pressure on the amino acids using the values of ω_i and ω_j.
3. Scale the BLOSUM score using the following principles of selective pressure: ω > 1 is positive selection, ω < 1 is negative or purifying selection and ω = 1 indicates neutral selection.
4. Determine the similarity of the ω_i and ω_j values, normalised between 0 and 1, to give a score called similarity.
5. Use the selective pressure to scale BLOSUM values to a BLOSUM comparable score called selection.
6. Combine the similarity score and the selection score to obtain the BLOSUM-FIRE score.
7. Use a modified Needleman-Wunsch DP algorithm to obtain the global alignment of the amino acid sequences.
Therefore, the BLOSUM-FIRE scoring function scores amino acid residues using BLOSUM scores, which are subsequently adjusted according to selective pressure, and this is added to the similarity score of the ω values aligned at codon sites.
Approximating selective pressure and scaling the BLOSUM score
Either Equation (3) or Equation (4) can be used to calculate the similarity sim_ab from the absolute difference normalised in the range [0,1]. Therefore, sim_ab maximises the absolute normalised differences at codon sites.
Coupling the BLOSUM matrix approach with the FIRE approach
where F is a BLOSUM comparable score, K is the proportionality constant obtained from the substitution matrix and sim_ab is described in Equations (3) and (4). To calculate K, we adopted the assumption that an identical amino acid match is analogous to a match at codon sites where the evolutionary rates are close to zero. The proportionality constant (K) is the mean of the identical amino acid match scores determined from the BLOSUM62 matrix, and it was found that K = 5.64. Using this proportionality framework we adopt the BLOSUM approach and weigh ambiguity characters in the same manner as regular amino acids.
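As a worked check of the proportionality constant, the sketch below averages the identical-match (diagonal) scores of BLOSUM62. The diagonal values are the standard BLOSUM62 entries. Which residues enter the mean is not stated in the text: averaging the 20 standard amino acids together with the ambiguity codes B and Z reproduces the reported K = 5.64, so that grouping is used here as an assumption of the example. The names `BLOSUM62_DIAGONAL` and `mean_identity_score` are hypothetical.

```python
# Identical-match scores from the BLOSUM62 diagonal, including ambiguity codes B and Z.
BLOSUM62_DIAGONAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6,
    "H": 8, "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4,
    "T": 5, "W": 11, "Y": 7, "V": 4, "B": 4, "Z": 4,
}

STANDARD_RESIDUES = "ARNDCQEGHILKMFPSTWYV"


def mean_identity_score(diagonal, residues):
    """Mean of the identical-match scores over the chosen residues."""
    return sum(diagonal[r] for r in residues) / len(residues)


# Mean over the 20 standard residues only.
k_standard = mean_identity_score(BLOSUM62_DIAGONAL, STANDARD_RESIDUES)
# Mean when B and Z are weighed like regular amino acids (assumed grouping).
k_with_ambiguity = mean_identity_score(BLOSUM62_DIAGONAL, STANDARD_RESIDUES + "BZ")

if __name__ == "__main__":
    print(round(k_standard, 2), round(k_with_ambiguity, 2))
```

The 20 standard residues alone give a mean of 116/20 = 5.8, whereas including B and Z gives 124/22, which rounds to 5.64.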
The BLOSUM-FIRE scoring function
The strength of the FIRE approach is its ability to detect shared functions using ω values at low sequence similarity without using the amino acid identity. The BLOSUM approach is effective for aligning sequences using amino acid identity. The FIRE algorithm produced false positives when sequence conservation was high. This is because at high conservation the high similarity of the ω values creates “noise” for the scoring function resulting in poor alignments. Under such conditions the power of aligning sequences based on dN/dS values at amino acid sites becomes weak. Therefore, at high conservation the BLOSUM score is more acceptable than a FIRE score.
Conversely, the approach recognises that when conservation is low the BLOSUM score may not be strong evidence. It has not escaped our attention that determining evolutionary rates requires nucleotide MSAs, which can be challenging to obtain. Therefore, to mitigate this challenge the database EvoDB is provided (described above) and BLOSUM-FIRE allows for the comparison of protein sequences with evolutionary rate profiles in a one-sided comparison. However, this comparison may not be as robust as a pairwise evolutionary rates comparison. To allow for this one-sided comparison, the algorithm assumes that all the residue sites of the protein sequence without ω MLEs are highly conserved, such that ω = 0.
A variable gap penalty scheme
where P is the total gap open penalty, g is the initial gap open penalty, t is the initial gap extension penalty, l is the length of the gap, e^(−a) is the approximated selective pressure at the codon site where the gap is opened (a is the ω value at that site) and, similarly, e^(−x) is the approximated selection pressure where x is the ω value for the site where the gap is extended. Therefore, gaps are extended based on the level of conservation at each site: the gap open penalty and its extension vary as a function of the evolutionary rate at each site. According to this penalty scheme, variable sites are more likely to be gap insertion or extension sites than conserved sites.
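The behaviour of such a variable penalty can be illustrated with a small sketch. The penalty equation itself is not reproduced in this text, so the way the terms are combined below (the opening penalty scaled by e^(−a), plus one extension penalty scaled by e^(−x) for each further gapped site) is purely an assumed form chosen to match the symbol definitions above. The function name and argument layout are likewise hypothetical.

```python
import math


def variable_gap_penalty(gap_open, gap_extend, omegas):
    """Illustrative gap penalty weighted by exp(-omega) at each gapped site.

    omegas holds the omega value at every site spanned by the gap; the first
    entry is the site where the gap opens. The additive form is an assumption.
    """
    if not omegas:
        return 0.0
    opening = gap_open * math.exp(-omegas[0])
    extension = sum(gap_extend * math.exp(-x) for x in omegas[1:])
    return opening + extension
```

Under this assumed form, a gap of length three over fully conserved sites (ω = 0) costs the full 0.5 + 2 × 0.1 with the FIRE default penalties, while the same gap over sites with ω = 2 costs less, which is consistent with variable sites being favoured as gap sites.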
Determining optimal gap penalties for the new BLOSUM-FIRE algorithm
Theoretical guidelines for the determination of gap penalties are scarce. In this study an empirical approach similar to a previous investigation was adopted. The datasets consisted of 50 amino acid sequences of varying lengths, with their ω MLEs, randomly selected from the Pfam-A seed alignments database (ver. 27.0). The BLOSUM-FIRE score was used as a proxy for the quality of the alignments; the score is the mean of the sequence identity and the normalised evolutionary rate score (old FIRE score) for that alignment. We aligned the Pfam sequences against each other for iterations of the gap penalties. Iterations were carried out using gap open penalties in the range [0,11] and gap extension penalties in the range [0,6]. Analysis was only carried out on the gap penalties and the BLOSUM-FIRE score for that alignment.
Evaluating the effect of MSA quality on the final alignment
We evaluated the effect of sequence number and MSA accuracy on the final alignment. To evaluate the effect of the number of sequences on the final alignment, 10 Pfam families were randomly selected and, using between 3 and 20 taxa, we investigated how shuffling the order of sequences in the MSA affected the BLOSUM-FIRE score. The domains with the varied number of sequences were aligned with the same domain with 20 sequences. The effect of MSA accuracy on the final alignment was investigated using the heat shock hsp70 (PF00012) domain, with nucleotide sequence data corresponding to the manually curated Pfam alignments from EvoDB. Using the MAFFT algorithm and iteratively increasing the gap open penalty from 0 to 3 in increments of 0.1, we obtained alignments of varying quality and accuracy, using the comparison against the alignment generated with default parameters as a performance measure. These alignments were then aligned to the reference alignment to evaluate the effect on the final alignment. Phylogenetic trees for the above experiments were determined using the CLUSTALO program and ω profiles were determined under the M2a model using CODEML.
The resource requirement of an algorithm is an important consideration as some algorithms may require prohibitive resources or running times. While BLOSUM-FIRE has not been optimised for performance, we assessed the time and resource requirements for generating alignments from the MSAs to the final pairwise alignment. For this analysis, the heat shock hsp70 domain was used; we trimmed the sequence to 900 nucleotides to approximate the average length of a protein. Alignments were then de-aligned and iteratively added from 3 sequences to 30. Resources were measured using the Unix time command. Execution times were measured for the generation of alignments and calculating the phylogenetic tree and determining the ω MLEs. Furthermore, the time taken to generate the final alignment using BLOSUM-FIRE was also measured. Resources were measured on one of the nodes on our cluster (WITS-CORE). This is an Intel (R) Xeon (R) machine with 15 E5630 processors at 2.53GHz running the Scientific Linux operating system. The machine has a cache size of 12M and 23GB RAM. All experiments were carried out on a single core using default algorithm parameters.
Evaluation of BLOSUM-FIRE performance
The conventional approach when evaluating the performance of a sequence alignment algorithm is to make use of a sequence alignment benchmark. However, the reliance on pre-processed data makes it a challenge to evaluate the BLOSUM-FIRE approach. The MAFFT algorithm is well known for its accuracy and speed and was therefore chosen as a reference aligner for evaluating the new implementation. Datasets comprised 10 unrelated and 10 related domains obtained from EvoDB. To demonstrate the utility of our approach and EvoDB, we simulated the 10 related datasets by randomly selecting and generating an alignment of 10 taxa from the selected families, with total sequence numbers in the range [14,187], and these were then aligned against their full alignments. The quality of alignments was measured using the Sum of Pairs Score (SPS) and Total-Column score (TC) implemented in the bali_score tool provided with the BAliBASE benchmark database. The SPS is the ratio of the number of aligned pairs in the test alignment to the number in the reference alignment; it evaluates the quality of the alignment produced. The TC score is a binary score comparing each column of the test alignment with the reference alignment. The SPS was measured with MAFFT as the reference alignment.
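The two measures can be sketched for small alignments given as lists of equal-length gapped strings. This is a simplified illustration of the definitions above and not the bali_score implementation: it scores every reference column rather than annotated core blocks, and the function names are hypothetical.

```python
def _columns(alignment, gap="-"):
    """Describe each column as a tuple of residue indices (None where a gap occurs)."""
    counters = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        entry = []
        for seq_index, char in enumerate(column):
            if char == gap:
                entry.append(None)
            else:
                entry.append(counters[seq_index])
                counters[seq_index] += 1
        columns.append(tuple(entry))
    return columns


def _aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for column in columns:
        for i in range(len(column)):
            for j in range(i + 1, len(column)):
                if column[i] is not None and column[j] is not None:
                    pairs.add((i, column[i], j, column[j]))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that are also aligned in the test."""
    ref_pairs = _aligned_pairs(_columns(reference))
    test_pairs = _aligned_pairs(_columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


def total_column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test (1 or 0 per column)."""
    ref_columns = _columns(reference)
    test_columns = set(_columns(test))
    if not ref_columns:
        return 0.0
    return sum(1 for column in ref_columns if column in test_columns) / len(ref_columns)
```

For the toy reference `["MKLV", "MK-V"]` and test `["MKLV", "M-KV"]`, two of the three reference residue pairs and two of the four reference columns are recovered.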
The statistical significance of the alignments
To evaluate the significance of the results, a statistical framework was required. The inference of homology requires assessment of the statistical significance of an alignment to distinguish real alignments from those due to chance. A p-value framework was implemented to assess the statistical significance of the alignments in EvoDB. The HBx ω MLEs were shuffled and used to query the EvoDB database. In this framework, the fraction of shuffled alignments scoring higher than the actual alignment was determined and this provided the p-value for that alignment. The functions of those proteins that were statistically significant were assessed by analysing the alignments produced. The BLAST algorithm finds statistically significant regions of local similarity between two sequences. The most statistically significant results found by BLOSUM-FIRE were then assessed using the BLAST algorithm to determine their statistical significance using a local alignment approach.
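The shuffling framework described above amounts to an empirical p-value, which can be sketched as follows. The function names, the fixed random seed and the generator interface are assumptions for illustration; scoring each shuffled profile against a database entry is left to whatever alignment routine is in use.

```python
import random


def shuffled_profiles(profile, n_shuffles, seed=0):
    """Yield random permutations of an omega profile (seeded for repeatability)."""
    rng = random.Random(seed)
    for _ in range(n_shuffles):
        permuted = list(profile)
        rng.shuffle(permuted)
        yield permuted


def empirical_p_value(actual_score, shuffled_scores):
    """Fraction of shuffled alignment scores that exceed the actual score."""
    scores = list(shuffled_scores)
    if not scores:
        raise ValueError("at least one shuffled score is required")
    higher = sum(1 for score in scores if score > actual_score)
    return higher / len(scores)
```

For example, if one of four shuffled alignments scores higher than the real alignment, `empirical_p_value` returns 0.25. The smallest non-zero p-value this estimate can resolve is one divided by the number of shuffles, so small p-values require many shuffles.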
Results and discussion
Evaluation of the FIRE algorithm using real and simulated data
Approximating selective pressure at codon sites and scaling the BLOSUM score
Resources and accuracy
Performance of the BLOSUM-FIRE algorithm
EvoDB results for the HBx query
The top 5 statistically significant results of HBx aligned against EvoDB PFAM-A profiles
Description of family | p-value
Trans-activation protein X family. |
Rubella virus endopeptidase family. | 7.62 × 10^−5
Hydrophobic abundant protein (HAP) family. | 2.28 × 10^−4
RimK-like ATP-grasp domain family. | 3.81 × 10^−4
Protein of unknown function (DUF2715). | 6.85 × 10^−4
Some of the inherent limitations of the BLOSUM-FIRE approach to sequence alignment are the assumptions made when measuring positive selection under the M2a model using the CODEML program in the PAML suite. One of these assumptions is that for those sites under positive selection there have been numerous substitutions at that site across the phylogeny; however, the selection pressure may vary across lineages, for example in HIV. Additionally, it is assumed that the non-synonymous substitution rate varies among codon sites while the same value is used for the synonymous substitution rate; this assumption can be violated in real data.
This work provides evidence for the efficacy of an evolutionary rate based approach to sequence alignment; we also address the challenge of low specificity. We show that coupling evolutionary rates with conventional amino acid substitution matrices produces robust algorithms comparable in performance to conventional approaches to sequence alignment. We note that the approach has inherent limitations, as accurately measuring the site by site evolutionary rate remains a challenging field. This work supports the hypothesis that proteins under similar selective pressures share similar functions. We provide a proof of concept that evolutionary rate profiles can be used as an alignment metric and that, in certain cases at least, the similarity of these evolutionary rate profiles can be used to infer domain functions. Additionally, we show that aligning sequences based on their evolutionary rate profiles could be used to extend the traditional alignment techniques in testing hypotheses in homology inference. The BLOSUM-FIRE software, user information and sample data files are freely available for download at http://www.bioinf.wits.ac.za/software/fire.
This work was partially supported by the following grants PMD: NASA (#NNX13AH41G) principal investigator Prof R.E Michod (University of Arizona); SH: NRF (IFR2011040500038); AN: NRF (#SFH13091742708). The financial assistance of the National Research Foundation (NRF) towards this research is hereby acknowledged. Opinions expressed and conclusions arrived at, are those of the author and are not necessarily to be attributed to the NRF. AN is supported by the Durand Foundation Scholarship for Evolutionary Biology and Phycology.
- 10. Kimura M. The neutral theory of molecular evolution. Cambridge: Cambridge University Press; 1984.
- 21. Thompson LJ. Recombinant expression and bioinformatic analysis of the Hepatitis B virus X protein. 2012.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.