Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences
- 6.1k Downloads
- 4 Citations
Abstract
Background
The estimation of the difference between two evolutionary distances within a triplet of homologs is a common operation that is used for example to determine which of two sequences is closer to a third one. The most accurate method is currently maximum likelihood over the entire triplet. However, this approach is relatively time consuming.
Results
We show that an alternative estimator, based on pairwise estimates and therefore much faster to compute, has almost the same statistical power as the maximum likelihood estimator. We also provide a numerical approximation for its variance, which could otherwise only be estimated through an expensive re-sampling approach such as bootstrapping. An extensive simulation demonstrates that the approximation delivers precise confidence intervals. To illustrate the possible applications of these results, we show how they improve the detection of asymmetric evolution, and the identification of the closest relative to a given sequence in a group of homologs.
Conclusion
The results presented in this paper constitute a basis for large-scale protein cross-comparisons of pairwise evolutionary distances.
Keywords
Evolutionary Distance Pairwise Distance Close Homolog Substitution Matrice Accurate Confidence IntervalBackground
The estimation of evolutionary distances between biological sequences is at the basis of many bioinformatics problems: it plays a particularly important role in phylogenetic tree inference [1, 2] and in an increasing number of comparative genomics analyses over large sets of genes or proteins (e.g. [3, 4, 5]). The most accurate way of estimating evolutionary distances is currently maximum likelihood, but the procedure is so time-consuming that is hardly practical when dealing with large datasets. In such cases, complexity is often tackled by working on the basis of individual pairs, such as in distance tree methods or in the "all-against-all" at the beginning of many comparative genomics analyses. However, by estimating an evolutionary distance for each pair individually, no knowledge about the covariance of distance estimates with common evolution can be directly obtained. Thus, when comparing pairwise distances among related sequences, for instance to infer which of two homologs is closer to a third one, confidence intervals cannot be derived directly from the pairwise estimates.
PAM model of sequence evolution
The evolutionary distance between two biological sequences is generally based on the assumption of a first-order Markovian process of amino acid evolution. This implies two biological assumptions, common to all standard models of evolution: no memory and position-independence. The substitutional processes are described in the form of substitution matrices, defining mutation probabilities from each character to every other character for a given evolutionary distance. These matrices are either parametrical models of sequence evolution or empirically based substitution matrices. Parametrical models are often employed for nucleotide substitution (e.g. Jukes-Cantor [6] or Hasegawa-Kishino-Yano [7]), while empirical matrices (based on counted substitutions of large sets of sequence alignments) are widely used for peptide replacements in proteins. Pioneered by Dayhoff in the 1970s [8], these models have been improved with more sequence data becoming available in the 1990s (e.g. the updated Dayhoff matrices by Gonnet-Cohen-Benner [9] or Jones-Taylor-Thornton (JTT) [10]). Codon substitutions have been described by parametrical (e.g. [11]) as well as empirical (e.g. [12]) matrices.
Because of the additivity of distances computed under the Markovian model of sequence evolution. substitution matrices for a wide range of evolutionary distances can be derived from a single substitution matrix M(d_{0}) through the equation M(d_{0})^{ x }= M(xd_{0}), which is a special form of the Chapman-Kolmogorov equation for Markov chains. It is common and computationally more efficient to formulate this process in terms of a rate matrix Q from which the probability matrices for distance d are derived as M(d) = e^{ dQ }. We normally measure d in PAM units [8], which completely defines Q.
Maximum likelihood estimation
Evolutionary distances are best estimated by maximum likelihood (ML). In case of a pair of sequences, the ML estimation is well known and practical (see Methods part). When more sequences are under consideration, the complexity of distance estimation by ML increases very steeply, mainly because it requires a multiple sequence alignment (MSA) and the inference of the phylogenetic tree topology, two difficult procedures for which the optimal solution can currently only be computed in exponential time with respect to the number of sequences. A common strategy for tackling this problem is to work on the basis of pairs, such as in distance tree methods. In this article, we focus on the specific problem of estimating, in a triplet of homologs X,Y,Z (Fig. 1). the difference ? between two distances d_{ XY }and d_{ XZ }. In such case, the multidimensional ML approach over the triplet is still practical. We call the estimator of ? obtained by this method $\widehat{?}$_{ triplet }. Alternatively, ? can be estimated by a simple algebraic relation over pairwise distances over X, Y, Z estimated individually. We call this alternative estimator $\widehat{?}$_{ pairwise }. Details about the computation of $\widehat{?}$_{ triplet }and $\widehat{?}$_{ pairwise }are provided in the Methods section.
Results and discussion
In the present section, we compare the estimators $\widehat{?}$_{ triplet }and $\widehat{?}$_{ pairwise }, and introduce a numerical approximation to estimate the variance of $\widehat{?}$_{ pairwise }, and show that it gives accurate confidence intervals. Finally, we describe two applications of the results.
Comparison between the two estimators
In terms of computational complexity, the two estimators differ significantly. Given m sequences of length n, $\widehat{?}$_{ triplet }requires the separate treatment of each O(m^{3}) triplet, and considering that an optimal 3-way alignment by dynamic programming (DP) is O(n^{3}), the time complexity is O(m^{3}n^{3}). In contrast, all $\widehat{?}$_{ pairwise }can be computed on the basis of O(m^{2}) pairs of sequences aligned by DP in O(n^{2}), yielding a time complexity of O(n^{2}m^{2}). Typically, whenever an analysis involves more than a few thousand proteins, millions of triplets have to be considered and $\widehat{?}$_{ pairwise }is the only practical approach of the two. In terms of accuracy, both estimators are asymptotically unbiased: in the case of $\widehat{?}$_{ triplet }, it is a property of the ML estimator, while in the case of $\widehat{?}$_{ pairwise }, it is the consequence of the linearity of the expected value (see Methods). We compared the two estimators by simulation over a large number of triplets (length: 300 AA), generated randomly according to the PAM model of evolution with different distances d_{ OX }, d_{ OY }, d_{ OZ }(Fig. 1). In each experiment, both estimators were converging toward the correct value for the difference, which confirms that the asymptotic behavior is a reasonable assumption for protein sequences of typical length. In terms of statistical power; surprisingly, the observed variance of the estimates obtained by $\widehat{?}$_{ pairwise }was on average less than 1% larger than the observed variance of the ML estimator over the triplet, suggesting that $\widehat{?}$_{ pairwise }, although much faster to compute, is on average almost as accurate as $\widehat{?}$_{ triplet }.
The variance of $\widehat{?}$_{ triplet }can be computed exactly (see Methods section). But there is no direct estimator of the variance of $\widehat{?}$_{ pairwise }, since it results from an algebraic relation over pairwise distances estimated individually, whose covariances are therefore unknown. There are indirect ways of estimating that variance, through the sampling distribution when doing simulation such as the one mentioned above, or bootstrapping when handling real data. However, such procedures are very time consuming. To overcome this problem, we devised a numerical approximation of s^{2}($\widehat{?}$_{ pairwise }) as function of the pairwise distance estimates.
Numerical approximation of s^{2}($\widehat{?}$_{pairwise})
In essence, the numerical approximation described here was obtained through regression over a large number of samples. We settled for this approach after discovering that the analytical solution to this problem, even when using a simpler model of evolution (all amino-acid mutations with equal probability). requires solving a polynomial of degree 23. The details of this investigation are reported in the Appendix. In view of this inherent complexity, the regression cannot be exact, but it turns out to be a surprisingly precise numerical approximation for comparisons that involve proteins that have an evolutionary distance smaller than 250 PAM units, which corresponds to percentage sequence identity greater or equal to 19.68%. We generated random triplets in the following way: a random-length (uniform 100..500) sequence was chosen as the origin O. Three random PAM distances (uniform 1..125) were selected for d_{ OX }, d_{ OY }and d_{ OZ }. The sequence O was mutated according to these distances to obtain X,Y and Z, our triplet. We generated about 30,000 triplets for three types of scoring matrix: updated Dayhoff matrices [9], DNA for coding genes and JTT [10]. The DNA scoring matrices were computed from a very large set of entire coding gene alignments from mammals. It is used in the OMA project [4] to align entire coding genes and is based on a 4-symbol alphabet. For each triplet, we computed pairwise distance estimates and their variances as input for the approximation. Given that $\widehat{?}$_{ pairwise }is almost as powerful as $\widehat{?}$_{ triplet }, we computed and used s^{2} ($\widehat{?}$_{ triplet }) as reference value for s^{2}($\widehat{?}$_{ pairwise }).
Coefficient of the approximation of s^{2}($\widehat{?}$_{ pairwise })
Type | $\widehat{d}$_{ XY }+ $\widehat{d}$_{ XZ } | s^{2}($\widehat{d}$_{ XY }) + s^{2}($\widehat{d}$_{ XZ }) | ${\widehat{d}}_{YZ}^{2}$ | s^{2}($\widehat{d}$_{ YZ }) | s^{2}($\widehat{d}$_{ XY }) s^{2}($\widehat{d}$_{ XZ }) | error | dim |
---|---|---|---|---|---|---|---|
Day | -1.3090 | 1.0435 | 0.6895 | -0.3339 | 0.1590 | 0.087 | 2.13 |
DNA | -1.2449 | 1.0933 | 0.6591 | -0.3026 | 0.1181 | 0.098 | 2.13 |
JTT | -1.2921 | 1.0978 | 0.6741 | -0.3065 | 0.1144 | 0.080 | 2.10 |
For example, the approximation for DNA variances is
Readers familiar with numerical analysis will find an analogy between the approximation presented here and standard approximations for transcendental functions. For example, it is customary to approximate exp(x) through a quotient of polynomials p(x)/q(x), for some limited range of x.
The relative error is in all the three cases less than 10%. Furthermore, since we normally use the square root of the variance, the relative error is in such cases half of the indicated. The last column indicates the dimension of the approximation which should be 2 in perfect conditions, and is indeed quite close.
The fact that very different matrices have very similar coefficients, the low error and the almost correct dimensionality reassures us of the quality of the approximation.
Verification of accuracy of confidence intervals
k = 1.960 | k = 2.576 | |
---|---|---|
|$\widehat{?}$_{ triplet }- ?| > k·s($\widehat{?}$_{ triplet }) | 0.95129 ± 0.00067 | 0.99062 ± 0.00030 |
|$\widehat{?}$_{ pairwise }- ?| > k·s_{ bootstrap }($\widehat{?}$_{ pairwise }) | 0.9511 ± 0.0020 | 0.99001 ± 0.00091 |
|$\widehat{?}$_{ pairwise }- ?| > k·s($\widehat{?}$_{ triplet }) | 0.94641 ± 0.00070 | 0.98896 ± 0.00032 |
|$\widehat{?}$_{ pairwise }- ?| > k·$\tilde{s}$($\widehat{?}$_{ pairwise }) | 0.94808 ± 0.00069 | 0.98953 ± 0.00032 |
|$\widehat{?}$_{ pairwise }- ?| > k·s_{ ind }($\widehat{?}$_{ pairwise }) | 0.98137 ± 0.00042 | 0.99774 ± 0.00015 |
As expected, the ML estimator over the entire triplet (first row) yields a precise variance estimate. On the other hand, we see that assuming independence for the estimation of the variance (last row) leads to very inaccurate confidence intervals. Estimating the variance of $\widehat{?}$_{ pairwise }by bootstrapping (10,000 re-samples) gives good confidence intervals, but the procedure is even more computationally intensive than $\widehat{?}$_{ triplet }, and therefore of little practical use in the present context. Using $\tilde{s}$^{2}($\widehat{?}$_{ pairwise }) in conjunction with the variance of the ML estimator works remarkably well (third and fourth row). And surprisingly, applying the numerical approximation (fourth row) happened to give slightly more accurate results than the exact triplet variance (third row).
Applications
In the following, we provide two examples of applications that benefit from the increase in statistical power of the estimator $\widehat{?}$_{ pairwise }enabled by the approximation: detection of asymmetric evolution and identification of the closest relative in a set of homologs. Furthermore, in [14], we show how our result can be used in the context of paralogy detection.
We first define three indicator functions that will be used in these comparisons. They decide whether the pair of proteins X, Y is significantly closer than X, Z at the confidence level expressed by the number of standard deviations k. The first and second ones both use the estimator $\widehat{?}$_{ pairwise }, but the first definition uses as variance of the estimate the upper bound that is obtained by assuming independence of $\widehat{d}$_{ XY }and $\widehat{d}$_{ XZ }(see Methods), whereas the second use the approximation $\tilde{s}$^{2}($\widehat{?}$_{ pairwise }) of the variance. The third indicator function uses the estimator $\widehat{?}$_{ triplet }.
Asymmetric evolution
After a gene duplication, the two copies can evolve independently. It has been suggested that in many cases, one duplicate maintains the ancestral function while the other is free to evolve and acquire novel functionality [15]. This scenario implies that the protein with conserved functionality will undergo less sequence evolution than the one exploring new functionalities.
Detecting this asymmetric evolution after duplication is an important factor not only for function prediction or orthologs assignment, but also for bringing new insights in our understanding of genome evolution in general (e.g. [16, 17, 18, 19]).
In order to identify cases of asymmetric evolution, one typically considers three sequences – the two duplicates (Y and Z)and an out-group (X). Several methods have been developed to test the significance of the unequal lengths of the branches leading from the common ancestor to the two duplicated sequences. Tests on simulated and real data from Arabidopsis thaliana for two of such methods have suggested very low statistical power to detect asymmetric evolution of duplicates [20].
The closer indicator function can be used to detect asymmetric evolution. With d_{ XY }being the distance from the out-group to the closer of the two duplicates and d_{ XZ }the distance to the other one, closer (X, Y, Z, k) decides if the two duplicated proteins have evolved at significantly different rates. The parameter k can be chosen to reflect the confidence level, e.g. 1.96 for the 95% level.
We tested the method using all three variants of closer (k = 1.96) on a protein set from a recent publication about whole genome duplication in S. cerevisiae [21]. From a set of 450 genes pairs that arose by whole genome duplication, they report 115 cases of one paralog evolving at least 50% faster than the other paralog. The position of the ancestral gene was determined by an out-group gene from K. waltii. Additionally, a set of 76 gene pairs is given where at least one of the S. cerevisiae genes evolved at least 50% faster than the K. waltii homolog.
If we now compare the predictions of Kellis and colleagues with our results, it appears that in 98 out of 115 cases, their prediction of asymmetric evolution could be confirmed by closer_{ app }, while with the remaining 17 pairs, our method did not support the asymmetry prediction. It is remarkable, however, that all these 17 pairs belong to the group of the 76 pairs with a fast evolving K. waltii homolog. It seems likely that the uncertainty in placing the origin of the triplet (arising from a longer branch to the out-group) causes rate-based methods as used in [21] to report asymmetric divergence despite the unclear situation. As opposed to that, the distance-based methods presented here, by incorporating the variance of the estimates explicitly, take the uncertainty about the point of origin into account, and therefore give more conservative predictions in these cases.
Furthermore, closer_{ app }found 134 additional cases of asymmetry among the remaining 335 gene pairs in the data set. Together with the 98 cases above, this results in 51.6% of all genes arising from the genome duplication event. This is clearly more than the 5% that could be expected from random chance and agrees with previous studies were significant amounts of asymmetrically evolving duplicates have been reported (e.g. [22, 23]).
Closest homolog without phylogenetic reconstruction
The identification of the closest relative of a protein (or gene) in a set of homologs traditionally requires the reconstruction of the corresponding phylogenetic tree. However, building gene trees remains a time consuming and error-prone task, thus methods based on pairwise evolutionary distance estimates are attractive. In this section, we show that using the variance approximation presented above can boost the statistical power of PAM distance comparisons to determine the closest homolog.
In simple contexts, or when accuracy is not a concern, the problem of identifying the closest relative can be solved reasonably well by coarse approaches, such as the top blast hit, or even the sequence with highest percentage identity. As the number of proteins grows larger and the number of homologs with similar distances increase, these methods show their limits. Indeed, it has been previously shown that the top blast hit is often not the closest relative [24]. At least two ideas lead to better results: the use of evolutionary distance estimates such as PAM distances, and accounting for confidence intervals, so that whenever there is not enough information to reliably discriminate among several distances, all of them are kept, presumably for further analysis.
For each protein X belonging to such a family, the closest homolog predictions using the following three criteria were compared to the actual closest homolog. The first one computes the subset of homologous sequences H that align with X with score higher than a particular fraction of the top score.
The second method computes the set of closest homologs, without using our variance approximation, formally
Set_{2} = {Y ? H | ? Z ? H, Z ? Y, closer_{ ind }(X, Z, Y, k_{2})}
The third method computes the set of closest homologs using our approximation, formally
Set_{3} = {Y ? H | ? Y ? H, Z ? Y, closer_{ app }(X, Z, Y, k_{3})}
The cut-off parameters k_{1}, k_{2}, k_{3} can be set according to the desired level of confidence. At k = 0, only the top score, respectively the shortest expected distance, is returned. Higher k values correspond to more conservative predictions, with increasing number of closest homolog candidates. For the evaluation of the methods, we vary k_{1} between 0 and 0.25, while k_{2}, k_{3} are varied between 0 and 3. Note that only k_{3} corresponds to the number of standard deviations from the expected value.
Conclusion
Computing the difference of two evolutionary distances that are not independent is a common operation in an increasing number of bioinformatics analyses. We presented and compared two estimators for the difference of two evolutionary distances in a triplet of homologs, one estimator based on pairwise distance estimates and the maximum likelihood estimator. Surprisingly, the estimator based on pairwise distance is almost as powerful as the ML estimator. But in terms of time complexity, it scales much better than the ML estimator and is therefore better suited at large-scale analyses. However, since its variance is not easy to estimate, we introduced a numerical approximation that allows the computation of accurate confidence intervals. Finally, we showed how these results can be used to test for asymmetrical evolution, and to identify the closest relative of a sequence in a group of homologs without phylogenetic reconstruction. As of future work, we plan to extend these results to models of evolution allowing rate variations, as well as insertion-deletions.
Methods
PAM distance estimator for a pair
The likelihood of an alignment A at an evolutionary distance d is defined [25, 26, 27] as
with x and y being aligned characters (e.g. amino acids, bases, but no deletions), and f(x) the background frequency of the character x. Maximizing L(A | d) in terms of d gives the ML estimator $\widehat{d}$ of the evolutionary distance. This is usually done numerically using the Newton-Raphson method. The variance of the ML estimator $\widehat{d}$ can be computed from the second derivative of the log-likelihood:
Notice that the variance is obtained for free as it is already computed in Newton's iteration.
PAM distance estimator for a triplet
Estimator based on pairwise distances
One can estimate ? by performing pairwise alignments between X and Y, and between X and Z. The ML method for pairs of homologs, which was described above, computes the estimates $\widehat{d}$_{ XY }and $\widehat{d}$_{ XZ }. By subtracting the first from the second, an estimator for the difference is obtained:
Since the two pairwise distance estimators are asymptotically unbiased and normally distributed, and considering the linearity of the expected value and the fact that the difference of two normally distributed variables is also normally distributed, the pairwise estimator $\widehat{?}$_{ pairwise }is also asymptotically unbiased and normally distributed, with variance
As described above, we obtain s^{2}($\widehat{d}$_{ XY }) and s^{2}($\widehat{d}$_{ XZ }) from the ML distance estimation, but the process does not say anything about their covariance. If the two distances are independent, which is only the case if d_{ OX }= 0, the covariance is zero and the variance ${s}_{ind}^{2}$($\widehat{?}$_{ pairwise }) = s^{2}($\widehat{d}$_{ XY }) + s^{2}($\widehat{d}$_{ XZ }) can be computed. In all other cases, $\widehat{d}$_{ XY }and $\widehat{d}$_{ XZ }covary and the variance of their difference is smaller than the sum of their variances. Therefore, we only have an upper bound for the variance of our estimator:
Note that previous work on covariance estimation (e.g. [7, 28]) do not apply here, because they require 3-way sequence alignments and are constrained to parametric models of evolution such as Jukes-Cantor and its generalizations.
Estimator based on triplet
Alternatively, we can estimate ? by subtracting estimates of the distances d_{ OY }and d_{ OZ }
The estimates $\widehat{d}$_{ OY }and $\widehat{d}$_{ OZ }can be obtained by maximum likelihood over the multiple sequence alignment of X, Y, Z [25], in a manner analogous to the ML estimation for a pair. The likelihood L of a multiple sequence alignment (MSA) is the product, over all positions of the MSA, of the probability of observing characters x, y, z at distance d_{ OX }, d_{ OY }, d_{ OZ }of the origin, where such a probability is obtained by marginalizing over every character o at the origin:
where C is the set of characters – the 20 amino-acids in the present case, and f(o) the background frequency of the character o. Consequently, the log-likelihood function l is
The log-likelihood is maximum where its gradient disappears:
There again, the problem can be solved efficiently by Newton's iteration
where (?^{2}l)^{-1} is the inverse of the Hessian (derivable in the same fashion as the gradient, not shown here). The inverse of the Hessian also yields the variance-covariance matrix of the estimates $\widehat{d}$_{ OX }, $\widehat{d}$_{ OY }, $\widehat{d}$_{ OZ }when multiplied by -1. A final use of the Hessian is to check that its complement is positive definite, a condition necessary to ensure that the solution found is indeed a maximum and not a minimum or a saddle point. Therefore, we obtain the variance of $\widehat{?}$_{ triplet }from the variance-covariance matrix:
Appendix
Complexity of the analytical solution of k-states model for triplets
In the following, we show that the analytical solution of the maximum-likelihood estimator for the distances of a triplet is very complex, even for a simplified model of mutation. The k-state model [29] is an idealized situation where each position has k possible states and the transition probabilities are all identical and only depend on the time t. For k = 4 this is equivalent to the Jukes-Cantor model [6]. Whatever is the initial state, the probability of a mutation after time t is given by
where r is
so that t is measured in PAM units. (Measuring in PAM units is proportional to any other measure, and it means that at t = 1 one percent of the characters are changed, i.e. p(1) = 1/100.) and that all transitions are equally likely, and only depend on the PAM distance. Under this model, the log-likelihood can be expressed in terms of the counts of matches/mismatches of the triplet (X, Y, Z), i.e. N_{ xxx }is the number of positions where all the characters are identical, N_{ xxz }is the number of positions where X and Y coincide but Z differs, etc.
where p_{ x }is the probability of mutating from the origin to X and similarly for p_{ y }and p_{ z }. Taking partial derivatives of the likelihood with respect to p_{ x }, p_{ y }and p_{ z }gives a system of 3 rational polynomial equations (all the logarithms disappear) in 3 unknowns and 6 parameters. Such a system of equations has a solution that will be an algebraic function of the parameters (a root of a polynomial, where the coefficients of the polynomial involve the parameters). Despite its simple appearance, this system of equations is beyond the capabilities of current computer algebra systems to resolve. And this is not a complete surprise, as the algebraic numbers/functions involved are at least of degree 23. The special case where two of the branches have the same length, has been solved exactly in [30], they find that their solution is an algebraic function of degree 11. This unfortunately is not applicable as we are interested in the cases where the branches away from the origin are of different lengths.
We have computed the exact solution for concrete values of the parameters, in particular N_{ xxx }= 10, N_{ xxz }= 5, N_{ xyx }= 4, N_{ xyy }= 3, N_{ xyz }= 2, k = 3 using Maple and the value of p_{ x }is a root of the irreducible polynomial
6582435840000 + 189590785228800 z - 2438333515038720 z^{2} + ...
... + 10304020514917800 z^{21} - 1635488137841976 z^{22} + 99990709180560 z^{23}
This means that the general solution will be an algebraic function of degree 23 or higher, it cannot be lower. If an instantiation of the polynomial with values gives this irreducible polynomial, then the general polynomial must be irreducible of degree 23 or higher (some terms could have simplified in the instantiation). This makes the usefulness of an exact solution inexistent. it is more difficult to solve the polynomial and select the right root than to maximize the likelihood and/or solve the system of equations by numerical methods.
Notes
Acknowledgements
The authors thank Dan Graur and two anonymous reviewers for helpful comments and ideas.
Supplementary material
References
- 1.Swofford DL, Olsen GL, Waddell PJ, Hillis DM: Phylogenetic inference. 2nd edition. Sunderland, Massachusetts: Sinauer Associates; 1996:407–514.Google Scholar
- 2.Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. University of Washington. Seattle., Department of Genome Sciences; 2004.Google Scholar
- 3.Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, (32 Database):277–280. 10.1093/nar/gkh063Google Scholar
- 4.Dessimoz C, Cannarozzi G, Gil M, Margadant D, Roth A, Schneider A, Gonnet G: OMA, A Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements. In RECOMB 2005 Workshop on Comparative Genomics, Volume LNBI 3678 of Lecture Notes in Bioinformatics. Edited by: McLysath A, Huson DH. Springer-Verlag; 2005:61–72.Google Scholar
- 5.DeLuca TF, Wu IH, Pu J, Monaghan T, Peshkin L, Singh S, Wall DP: Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics 2006, 22(16):2044–2046. 10.1093/bioinformatics/btl286CrossRefPubMedGoogle Scholar
- 6.Jukes T, Cantor C: Evolution of protein molecules. In Mammalian protein metabolism III. Edited by: Munro H. New York: Academic Press; 1969:21–132.CrossRefGoogle Scholar
- 7.Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 1985, 22: 160–174. 10.1007/BF02101694CrossRefPubMedGoogle Scholar
- 8.Dayhoff MO, Schwartz RM, Orcutt BC: A model for evolutionary change in proteins. In Atlas of Protein Sequence and Structure. Volume 5. Edited by: Dayhoff MO. National Biomedical Research Foundation; 1978:345–352.Google Scholar
- 9.Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256(5003):1443–1445. 10.1126/science.1604319CrossRefPubMedGoogle Scholar
- 10.Jones DT, Taylor WR, Thornton JM: The Rapid Generation of Mutation Data Matrices from Protein Sequences. Comput Applic Biosci 1992, 8: 275–282.Google Scholar
- 11.Goldman N, Yang Z: A Codon-based Model of Nucleotide Substitution for Protein-coding DNA Sequences. Mol Biol Evol 1994, 11(5):725–736.PubMedGoogle Scholar
- 12.Schneider A, Cannarozzi GM, Gonnet GH: Empirical codon substitution matrix. BMC Bioinformatics 2005., 6(134):Google Scholar
- 13.Gonnet GH, Hallett MT, Korostensky C, Bernardin L: Darwin v. 2.0: An Interpreted Computer Language for the Biosciences. Bioinformatics 2000, 16(2):101–103. 10.1093/bioinformatics/16.2.101CrossRefPubMedGoogle Scholar
- 14.Dessimoz C, Boeckmann B, Roth A, Gonnet GH: Detecting Non-Orthology in the COG Database and Other Approaches Grouping Orthologs Using Genome-Specific Best Hits. Nucleic Acids Res 2006, 34(11):3309–3316. 10.1093/nar/gkl433PubMedCentralCrossRefPubMedGoogle Scholar
- 15.Ohno S: Evolution by Gene Duplication. Springer-Verlag, New York; 1970.CrossRefGoogle Scholar
- 16.Van de Peer Y, Taylor JS, Braasch I, Meyer A: The Ghost of Selection Past: Rates of Evolution and Functional Divergence of Anciently Duplicated Genes. J Mol Evol 2001, 53(4):436–446. 10.1007/s002390010233CrossRefPubMedGoogle Scholar
- 17.Dermitzakis ET, Clark AG: Differential Selection After Duplication in Mammalian Developmental Genes. Mol Biol Evol 2001, 18: 557–562.CrossRefPubMedGoogle Scholar
- 18.Li YJ, Tsoi SCM: Phylogenetic analysis of vertebrate lactate dehydrogenase (LDH) multigene families. J Mol Evol 2002, 54(5):614–24. 10.1007/s00239-001-0058-1CrossRefPubMedGoogle Scholar
- 19.Wagner A: Asymmetric Functional Divergence of Duplicate Genes in Yeast. Mol Biol Evol 2002, 19: 1760–1768.CrossRefPubMedGoogle Scholar
- 20.Seoighe C, Scheffler K: Very Low Power to Detect Asymmetric Divergence of Duplicated Genes. In RECOMB 2005 Workshop on Comparative Genomics, Volume LNBI 3678 of Lecture Notes in Bioinformatics. Edited by: McLysath A, Huson DH. Springer-Verlag; 2005:142–152.Google Scholar
- 21.Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 2004, 428: 617–624. 10.1038/nature02424CrossRefPubMedGoogle Scholar
- 22.Blanc G, Barakat A, Guyot R, Cooke R, Delseny M: Extensive Duplication and Reshuffling in the Arabidopsis Genome. Plant Cell 2000, 12: 1093–1102. 10.1105/tpc.12.7.1093PubMedCentralCrossRefPubMedGoogle Scholar
- 23.Conant GC, Wagner A: Asymmetric Sequence Divergence of Duplicate Genes. Genome Res 2003, 13: 2052–2058. 10.1101/gr.1252603PubMedCentralCrossRefPubMedGoogle Scholar
- 24.Koski LB, Golding GB: The Closest BLAST Hit Is Often Not the Nearest Neighbor. J Mol Evol 2001, 52(6):540–542.CrossRefPubMedGoogle Scholar
- 25.Felsenstein J: Evolutionary trees from gene frequencies and quantitative characters: finding maximum likelihood estimates. Evolution 1981, 35: 1229–1242. 10.2307/2408134CrossRefGoogle Scholar
- 26.Gonnet GH: A Tutorial Introduction to Computational Biochemistry Using Darwin. Tech. rep., Informatik, ETH Zurich, Switzerland; 1994.Google Scholar
- 27.Muller T, Vingron M: Modeling amino acid replacement. J Comput Biol 2000, 7(6):761–776. 10.1089/10665270050514918CrossRefPubMedGoogle Scholar
- 28.Bulmer M: Use of the method of generalized least-squares in reconstructing phylogenies from sequence data. Mol Biol Evol 1991, 8(6):868–883.Google Scholar
- 29.Cannarozzi GM, Gonnet GH: Idealized Mutational Clocks.Tech. rep., Informatik, ETH, Zurich; 2005. [http://www.biorecipes.com/IdealMut/code.html]Google Scholar
- 30.Chor B, Hendy MD, Snir S: Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions. Mol Biol Evol 2006, 23(3):626–632. 10.1093/molbev/msj069CrossRefPubMedGoogle Scholar
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.