Utility of characters evolving at diverse rates of evolution to resolve quartet trees with unequal branch lengths: analytical predictions of long-branch effects
The detection and avoidance of “long-branch effects” in phylogenetic inference represents a longstanding challenge for molecular phylogenetic investigations. A consequence of parallelism and convergence, long-branch effects arise in phylogenetic inference when there is unequal molecular divergence among lineages, and they can positively mislead inference based on parsimony especially, but also inference based on maximum likelihood and Bayesian approaches. Long-branch effects have been exhaustively examined by simulation studies that have compared the performance of different inference methods in specific model trees and branch length spaces.
In this paper, by generalizing the phylogenetic signal and noise analysis to quartets with uneven subtending branches, we quantify the utility of molecular characters for resolution of quartet phylogenies via parsimony. Our quantification incorporates contributions toward the correct tree from either signal or homoplasy (i.e. “the right result for either the right reason or the wrong reason”). We also characterize a highly conservative lower bound of utility that incorporates contributions to the correct tree only when they correspond to true, unobscured parsimony-informative sites (i.e. “the right result for the right reason”). We apply the generalized signal and noise analysis to classic quartet phylogenies in which long-branch effects can arise due to unequal rates of evolution or an asymmetrical topology. Application of the analysis leads to identification of branch length conditions in which inference will be inconsistent and reveals insights regarding how to improve sampling of molecular loci and taxa in order to correctly resolve phylogenies in which long-branch effects are hypothesized to exist.
The generalized signal and noise analysis provides analytical prediction of utility of characters evolving at diverse rates of evolution to resolve quartet phylogenies with unequal branch lengths. The analysis can be applied to identifying characters evolving at appropriate rates to resolve phylogenies in which long-branch effects are hypothesized to occur.
Keywords: Long-branch effects; Felsenstein zone; Signal; Noise; Phylogenetic inference
GTR: General Time Reversible
HKY: Hasegawa, Kishino, and Yano
JC: Jukes and Cantor
RASA: Relative Apparent Synapomorphy Analysis
The detection and avoidance of long-branch effects in phylogenetic inference has been a longstanding challenge. Arising when there is unequal divergence among taxa, long-branch effects are caused by convergent and parallel changes that give rise to a systematic bias in the phylogenetic estimation procedure, producing one or more artefactual phylogenetic groupings of taxa [1-15]. While early investigations discussed long-branch effects as a significant problem for inference with parsimony, it has since been demonstrated that inference by maximum likelihood (ML) and Bayesian approaches can also be subject to long-branch effects [7-9,14-20], even when the correct model is specified exactly [11,21].
An extensive literature exists composed of simulation studies that have evaluated the performance of different inference methods on model trees, investigating the branch length conditions wherein long-branch effects lead to misleading results. For example, in what is classically termed the Felsenstein zone, two long-branched taxa are non-sisters in a four-taxon tree. Simulation studies have demonstrated that parsimony is more likely to group the long-branched non-sister taxa together (“long-branch attraction” [22-25]) than likelihood methods. Siddall referred to the converse zone, where two long-branched taxa are true sisters in a four-taxon tree, as the “Farris zone”. Simulations performed by Swofford et al. demonstrated that along a tree-axis that includes both the Felsenstein zone and the Farris zone, ML outperforms parsimony overall in recovering the correct quartet topology. Many subsequent simulation studies compared the performance of parsimony and ML in other model trees (e.g. [5,28-30]). As Bergsten pointed out, the conclusions of these comparative simulation studies have been highly dependent on the specific model tree and branch length conditions subjectively chosen for individual investigations. Analysis of these comparative simulation studies shows clearly that parsimony has a strong bias towards grouping long-branched taxa together, but also that ML and other probabilistic methods that in principle account for unequal branch lengths and correct for unobserved changes [27,28] can minimize but not eliminate the risks of long-branch effects [6,31].
In contrast to the extensive simulation studies comparing the performance of different inference methods, few analytical frameworks are available to quantify the phylogenetic utility of molecular loci for resolving specific phylogenies with unequal branch lengths. Theory provided by Felsenstein , Hendy and Penny , and Kim  has revealed general branch length conditions in which inference becomes inconsistent. But because these works assume a character with binary states with equal substitution rates, the inconsistency conditions identified by assuming such a simplistic model cannot be directly applied to real-life molecular loci, which typically follow much more complex molecular evolutionary models and vary in rates of evolution.
Post-hoc analytical methods have been developed that detect the presence of long-branch effects in molecular data. For example, split decomposition  with spectral analysis  has been utilized to plot split graphs to show where conflicting signal exists in a molecular data set [10,34-38], and Relative Apparent Synapomorphy Analysis (RASA [39,40]) has been developed to detect problematic long branches by examining the taxon-variance plot of a molecular data set [41-49]. The taxon-variance plot has attracted some zealous criticism in several studies that report false outcomes for identifying problematic long branches [50-54]. No such method is perfect for all examples. Even so, one issue with these post-hoc analytical methods is that the graphic outputs produced evaluate realized sequence data to convey a qualitative sense rather than quantification of phylogenetic utility.
Recently, progress has been made towards analytical prediction of the utility of sequence data for resolving phylogenies in which long-branch attraction bias may arise. Extending the work of Fischer and Steel , which evaluated the sequence length needed for accurately resolving a binary four-taxon phylogenetic tree with four long subtending branches and a short internode, Martyn and Steel  investigated the required sequence length to resolve a quartet in which just one subtending branch is long, rather than all four, in the presence and absence of a molecular clock. However, they also demonstrated that those results were critically dependent on the assumption that all sites are evolving at a single rate. Susko  advanced an analytical method based on Laplace approximations to provide simple corrections for long-branch attraction biases in Bayesian-based inference towards particular topologies; the effectiveness of the corrections was further demonstrated in simulations of four-taxon and five-taxon trees.
In this paper, we quantify an accurate prediction of utility of molecular characters for resolving a quartet phylogeny with uneven subtending branches as assessed by parsimony, by incorporating contributions toward the correct tree from any parsimony-informative sites that are consistent with the actual quartet topology (i.e. support for the correct quartet topology due to true, unobscured signal or homoplasy). We also characterize a highly conservative lower bound of utility by incorporating contributions toward the correct tree only from those true, unobscured parsimony-informative sites (i.e. support for the correct topology due to true, unobscured signal only). We build on the signal and noise framework of Townsend et al. , which uses the estimated substitution rates of individual molecular characters to estimate the power of a set of molecular sequences for resolving a four-taxon tree with equal subtending branch lengths. This result, applied to the Poisson model of molecular evolution, was subsequently generalized by Su et al.  to apply to all standard symmetric molecular evolutionary models of nucleotide substitution up to and including the General Time Reversible model (GTR [58,59]). Herein we further generalize the signal and noise analysis by relaxing the assumption of equal subtending branch lengths for the four-taxon tree. Further, we use the generalized signal and noise analysis to explore how varying branch length conditions and alternative model assumptions affect the predicted phylogenetic utility. We apply the generalized signal and noise analysis to four-taxon trees in which long-branch attraction bias arises as a consequence of unequal evolution rates or an asymmetrical topology. We demonstrate that the generalized signal and noise analysis can help identify for these example phylogenies branch length conditions in which inference is inconsistent.
Phylogenetic signal and noise
Note that although the derivation of Equations 3–8 above is presented for nucleotide characters, these equations are also applicable to amino acid characters by substituting an amino acid substitution rate matrix for the nucleotide substitution rate matrix Q(λ) in Equations 1 and 2, and could also be applied to morphological characters that evolve in accord with the Mk matrix [61,62].
Predicting phylogenetic utility
To simplify notation hereafter, we will suppress the routine but continuing functional dependencies on λ₀, λ₁, λ₂, λ₃, λ₄, t₀, T₁, T₂, T₃, and T₄. Because parsimony uses almost exclusively the AABB patterns to inform quartet topology reconstruction, evaluating y − Max(x₁, x₂) for a molecular character gives an accurate quantitative measure of the character’s phylogenetic utility for resolving a quartet phylogeny as assessed by parsimony. For a given character, if y − Max(x₁, x₂) > 0, the character has more support for the correct quartet topology than for either of the incorrect quartet topologies as assessed by parsimony, and thus by sampling more of such a character, inference via parsimony will converge to the correct topology. Conversely, if y − Max(x₁, x₂) < 0, the character has stronger support for an incorrect topology than for the correct topology as assessed by parsimony, and thus by sampling more of such a character, inference via parsimony will not converge to the correct topology. Therefore, evaluating y − Max(x₁, x₂) yields a quantitative measure of whether inference will be consistent under parsimony.
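As a minimal numerical sketch of this criterion, the following Python code computes y, x1, and x2 by brute-force summation over the states of the two internal nodes. It rests on several assumptions that are not stated in the text above: the JC model, a quartet in which taxa 1 and 2 are sisters (a guess at the labelling of Figure 1), and branch lengths d = λT given in expected substitutions per site. The function names are hypothetical, and this is not the authors' Mathematica implementation.

```python
import math
from itertools import product

def jc_prob(i, j, d):
    """JC probability that state i is replaced by state j over a branch of
    d expected substitutions per site (assumed parameterisation)."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def pattern_probs(d0, d1, d2, d3, d4):
    """Return (y, x1, x2) for the assumed quartet ((1,2),(3,4)).

    y  : probability of the AABB pattern (taxa 1,2 share a state; 3,4 share another)
    x1 : probability of the ABAB pattern (taxa 1,3 versus 2,4)
    x2 : probability of the ABBA pattern (taxa 1,4 versus 2,3)
    d0 is the internode; d1..d4 are the subtending branches of taxa 1..4.
    """
    y = x1 = x2 = 0.0
    for u, v in product(range(4), repeat=2):      # states at the two internal nodes
        base = 0.25 * jc_prob(u, v, d0)           # uniform stationary frequency at u
        for a, b in product(range(4), repeat=2):  # the two distinct tip states
            if a == b:
                continue
            y += base * jc_prob(u, a, d1) * jc_prob(u, a, d2) * jc_prob(v, b, d3) * jc_prob(v, b, d4)
            x1 += base * jc_prob(u, a, d1) * jc_prob(u, b, d2) * jc_prob(v, a, d3) * jc_prob(v, b, d4)
            x2 += base * jc_prob(u, a, d1) * jc_prob(u, b, d2) * jc_prob(v, b, d3) * jc_prob(v, a, d4)
    return y, x1, x2

def parsimony_utility(d0, d1, d2, d3, d4):
    """y - Max(x1, x2): positive values favour the correct topology under parsimony."""
    y, x1, x2 = pattern_probs(d0, d1, d2, d3, d4)
    return y - max(x1, x2)

# Equal short branches versus two long non-sister branches (taxa 2 and 4).
print(parsimony_utility(0.05, 0.05, 0.05, 0.05, 0.05))
print(parsimony_utility(0.05, 0.05, 1.5, 0.05, 1.5))
```

Under these assumptions the first call returns a positive value and the second a negative one, which corresponds to the consistent and inconsistent cases described above.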
However, evaluating y − Max(x 1, x 2) for predicting phylogenetic utility and consistency conditions under probabilistic inference methods such as ML and Bayesian methods faces two opposing biases. First, ML and Bayesian methods can obtain additional information to resolve a quartet phylogeny—albeit of markedly lower impact per character—from some non-AABB patterns. For example, given a non-AABB pattern observed at a character that resulted from a signal in the internode having then been partially masked by noise (i.e. randomizing state changes in subtending branches), a probabilistic inference method will attribute likelihood to the correct topology from this character if the state changes that occurred in subtending branches are consistent enough with the model and occurred slowly enough to provide useful information. On the other hand, unlike with parsimony-based inference, not every character showing an AABB pattern is interpreted by probabilistic methods to support a quartet topology. For instance, given a synapomorphic pattern observed at a character that actually arose from an absence of state change in the internode followed by parallel state changes in sister subtending branches, a probabilistic method that classifies the site as fast-evolving will rightfully obtain little support for the correct topology from this character.
Addressing the first bias as outlined in the preceding paragraph is not straightforward within the framework of signal and noise analysis, because tracking all non-AABB patterns that can have varying and ambiguous levels of support for the correct quartet topology as assessed by probabilistic inference methods is impractical and would render analysis highly cumbersome. However, the second bias as explained above can be addressed by evaluating an alternative measure of predicted utility that excludes support for the correct quartet topology due to apparent synapomorphy. Such a measure can be obtained by comparing the probability of true synapomorphy only, Π, to the probability of observing either homoplasious pattern consistent with an incorrect quartet topology, Max(x₁, x₂). The resultant measure, Π − Max(x₁, x₂), represents a conservative lower bound of utility, since it does not include support for the correct quartet topology due to partially masked signal, which parsimony typically does not recognize but probabilistic inference methods can recognize under ideal circumstances. Ultimately, because true synapomorphy represents unmasked, actual phylogenetic signal and provides unambiguous support for the correct quartet topology regardless of which inference method is concerned, in branch length conditions where Π − Max(x₁, x₂) > 0, the strength of unmasked actual signal is greater than the strength of homoplasy that supports an incorrect topology, and therefore correct inference can likely be achieved by both parsimony and probabilistic methods.
Example 1: predicted utility of a character in the Felsenstein and “Farris” zones
In demonstrating long-branch attraction by parsimony and “long-branch repulsion” by ML, Huelsenbeck and Hillis  and Siddall  performed simulations for two four-taxon model trees with different branch length conditions that encompass the Felsenstein zone and the Farris zone, respectively. In this example study, we apply the signal and noise analysis to these two model trees to predict the phylogenetic utility of a nucleotide character in the Felsenstein zone and the Farris zone.
From Equation 9, the length of a branch can range between 0 and 0.75 under the JC model.
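The bound of 0.75 can be illustrated with the standard JC relationship between expected substitutions per site and the expected proportion of sites that differ between the two ends of a branch. The sketch below assumes that this is the branch length measure the text refers to; the function names are hypothetical.

```python
import math

def jc_observed_difference(d):
    """Expected proportion of differing sites after d expected substitutions
    per site under JC; rises from 0 and saturates towards 0.75."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc_expected_substitutions(p):
    """Inverse of jc_observed_difference, defined for 0 <= p < 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) under the JC model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for d in (0.0, 0.1, 1.0, 10.0):
    print(d, jc_observed_difference(d))
```

With four equally frequent states, two fully randomized sequences still match at one quarter of sites by chance, which is why the proportion of differences cannot exceed 0.75.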
For the Huelsenbeck and Hillis model tree, the probability of true synapomorphy is greater than the probability of a character exhibiting either homoplasious pattern (i.e. Π ∕ Max(x₁, x₂) > 1) in an area that borders on the horizontal axis of the branch length space (Figure 3E). For the Siddall model tree, Π ∕ Max(x₁, x₂) > 1 is true in a similar but slightly more extended area that borders on both the horizontal and vertical axes of the branch length space (Figure 3F).
Example 2: predicted utility of a character with an identical rate across lineages for resolving an asymmetrical quartet tree
In this example, we assess the predicted utility of a nucleotide character for resolving a hypothetical four-taxon tree with an asymmetrical topology. For this analysis we consider a nucleotide character which follows the molecular clock assumption and has an equal substitution rate in the internode and four subtending branches in the four-taxon tree of interest (i.e. setting λ₀ = λ₁ = λ₂ = λ₃ = λ₄ = λ in Figure 1). We assume the JC model for the nucleotide character. The four-taxon tree in question has an internode with a length in an arbitrary time unit of t₀ = 0.1; two non-sister subtending branches have an equal length of 4t₀ = 0.4 (i.e. setting T₁ = T₃ = 0.4 in Figure 1), while the other two non-sister subtending branches both have a length of 0.4l (i.e. T₂ = T₄ = 0.4l in Figure 1), where l > 1.
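The setup of this example can be explored numerically with a short sketch. It assumes the JC model with branch length d = λT in expected substitutions per site, assumes that taxa 1 and 2 are sisters in Figure 1, and uses bisection under the further assumption that the utility y − Max(x1, x2) changes sign only once within the bracketed range of rates. All function names are hypothetical.

```python
import math
from itertools import product

def jc_prob(i, j, d):
    """JC transition probability over d expected substitutions per site."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def utility(d0, d1, d2, d3, d4):
    """y - Max(x1, x2) for the assumed quartet ((1,2),(3,4)) under JC."""
    y = x1 = x2 = 0.0
    for u, v, a, b in product(range(4), repeat=4):
        if a == b:
            continue
        base = 0.25 * jc_prob(u, v, d0)
        y += base * jc_prob(u, a, d1) * jc_prob(u, a, d2) * jc_prob(v, b, d3) * jc_prob(v, b, d4)
        x1 += base * jc_prob(u, a, d1) * jc_prob(u, b, d2) * jc_prob(v, a, d3) * jc_prob(v, b, d4)
        x2 += base * jc_prob(u, a, d1) * jc_prob(u, b, d2) * jc_prob(v, b, d3) * jc_prob(v, a, d4)
    return y - max(x1, x2)

def example2_utility(rate, l, t0=0.1):
    """Clock-like character: one rate everywhere, T1 = T3 = 4*t0, T2 = T4 = 4*t0*l."""
    short, long_ = 4 * t0 * rate, 4 * t0 * l * rate
    return utility(t0 * rate, short, long_, short, long_)

def threshold_rate(l, lo=0.01, hi=1.0, iterations=60):
    """Bisect for the rate at which the utility turns from positive to negative.
    Assumes a single sign change between lo and hi."""
    if not (example2_utility(lo, l) > 0 > example2_utility(hi, l)):
        raise ValueError("bracket does not straddle a sign change")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if example2_utility(mid, l) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for l in (3, 5):
    print(l, threshold_rate(l))
```

In this sketch a larger l, meaning a more asymmetrical tree, yields a lower threshold rate.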
Example 3: predicted utility of a character with a variable rate across lineages for resolving a symmetric quartet tree
In this example, we evaluate the predicted utility of a nucleotide character for resolving a hypothetical four-taxon tree with a symmetric topology. The four-taxon tree in question has an internode with a length (in time) of t₀ = 0.1 and four subtending branches with an equal length of 0.1l, where l > 1 (i.e. setting T₁ = T₂ = T₃ = T₄ = 0.1l in Figure 1). For this analysis, we again assume the JC model for the nucleotide character; however, the character does not necessarily follow a molecular clock across the quartet. We assign a fixed substitution rate of 1 (per unit time) to two non-sister subtending branches of the four-taxon tree (i.e. λ₂ = λ₄ = 1 in Figure 1), and a free substitution rate of λ in the internode and the other two non-sister subtending branches of the tree (i.e. setting λ₀ = λ₁ = λ₃ = λ in Figure 1).
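A rough way to see how the free rate λ affects utility in this setup is to scan a grid of rates and record where y − Max(x1, x2) is positive. The sketch below assumes the JC model with branch length d = λT in expected substitutions per site, assumes that taxa 1 and 2 are sisters in Figure 1, and uses an arbitrary value l = 5 and an arbitrary logarithmic grid. The function names are hypothetical.

```python
import math
from itertools import product

def jc_prob(i, j, d):
    """JC transition probability over d expected substitutions per site."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def utility(d0, d1, d2, d3, d4):
    """y - Max(x1, x2) for the assumed quartet ((1,2),(3,4)) under JC."""
    y = x1 = x2 = 0.0
    for u, v, a, b in product(range(4), repeat=4):
        if a == b:
            continue
        base = 0.25 * jc_prob(u, v, d0)
        y += base * jc_prob(u, a, d1) * jc_prob(u, a, d2) * jc_prob(v, b, d3) * jc_prob(v, b, d4)
        x1 += base * jc_prob(u, a, d1) * jc_prob(u, b, d2) * jc_prob(v, a, d3) * jc_prob(v, b, d4)
        x2 += base * jc_prob(u, a, d1) * jc_prob(u, b, d2) * jc_prob(v, b, d3) * jc_prob(v, a, d4)
    return y - max(x1, x2)

def example3_utility(rate, l, t0=0.1):
    """Free rate on the internode and branches 1 and 3; fixed rate 1 on
    branches 2 and 4; every subtending branch lasts t0 * l time units."""
    T = t0 * l
    return utility(t0 * rate, T * rate, T * 1.0, T * rate, T * 1.0)

def positive_rate_range(l, exponents=range(-40, 41)):
    """Smallest and largest grid rates (10**(k/20)) with positive utility, or None."""
    rates = [10 ** (k / 20) for k in exponents]
    positive = [r for r in rates if example3_utility(r, l) > 0]
    return (min(positive), max(positive)) if positive else None

print(positive_rate_range(5))
```

For l = 5 the positive range brackets λ = 1, the rate at which all lineages evolve equally; rates far below or far above 1 give negative utility in this sketch.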
Example 4: effects of alternative model assumptions on predicted utility
Estimated parameter values for the models for the actin (ACT1) marker
In this paper, we have relaxed an assumption of phylogenetic signal and noise analysis by allowing a four-taxon tree of unequal subtending branch lengths. Previous analyses [56,57] assumed a phylogenetic quartet with four subtending branches of equal lengths. Although any internode has an inherent quartet structure , not all internodes have subtending branches that have equal lengths, even without heterochrony. Furthermore, sampling additional taxa can effectively reduce branch lengths [67-72], rendering appropriate branch lengths to consider for phylogenetic informativeness shorter than the extracted quartet. While slight differences in branch lengths probably do not represent a significant violation of the theoretical assumption under the previous versions of signal and noise analysis, for internodes where all of the subtending branches have markedly different lengths, the assumption of equal branch lengths is no longer acceptable. The generality and the accuracy of the signal and noise analysis can therefore be improved by quantifying the probability of synapomorphic and homoplasious character state patterns in four subtending branches of unequal lengths. This improvement, if it could seamlessly incorporate increased taxon sampling in addition, would facilitate the application of signal and noise analysis freely and precisely to all describable internodes of phylogenetic interest.
We have also recast previous analysis so that it can characterize the probability of a true synapomorphy in a four-taxon tree, including only true synapomorphy as support for the correct quartet topology. Previous signal and noise analyses [56,57] have not distinguished true synapomorphy vs. apparent synapomorphy and include both as support for the correct quartet topology. While parsimony infers support for the correct quartet topology from both true synapomorphy and apparent synapomorphy, probabilistic inference methods can better discriminate against apparent synapomorphy by accounting for fast rates of evolution and correcting for unobserved changes [6,27,28,73]. In the meantime, however, the generalized signal and noise analysis does not quantify contributions from obscured signal at sites that are not parsimony-informative, even though probabilistic inference methods can recognize some support for the correct topology from these sites. Therefore, including support for the correct quartet topology only from true, unobscured parsimony-informative sites yields a conservative lower bound for predicting phylogenetic utility.
In the first example, based on the two model quartet trees with branch length conditions that correspond to the Felsenstein and “Farris” zones, our analysis has characterized the probability distributions of true synapomorphy, apparent synapomorphy, and homoplasy in support for an incorrect topology in those zones. These analysis results provide analytical predictions of the contrasting performances of parsimony and ML in the Felsenstein and Farris zones as shown by simulations of Huelsenbeck and Hillis and Siddall. In the Felsenstein zone, parsimony is likely to give incorrect inference of the quartet topology, because support for the correct quartet topology as assessed by parsimony (i.e. including both true and apparent synapomorphy) is less than support for an incorrect topology in the corresponding area of the branch length space. This observation is consistent with the expectation that parsimony-informative sites that are consistent with an incorrect quartet topology are more likely to occur and accumulate if the internode is short (i.e. there is a low probability of true signal occurring in the internode), the rate of evolution of the character is fast (i.e. there is a high probability of noise accumulating in the subtending branches), or the difference in the rate of evolution between branches is large (i.e. there is a high probability of convergent and parallel changes in the two non-sister branches with faster rates of evolution). In contrast, ML can perform better than parsimony by gathering additional support for the correct quartet topology from partially-informative non-AABB patterns, which are not tracked by our theory. In the Farris zone, parsimony is likely to yield correct inference of the quartet topology, since support for the correct quartet topology as assessed by parsimony is greater than support for either incorrect topology in the corresponding area of the branch length space.
However, the strong performance of parsimony in the Farris zone is in fact due to apparent synapomorphy; in the corresponding area of the branch length space, almost all support for the correct quartet topology is contributed by apparent synapomorphy. Since ML does not accrue likelihood for the correct quartet topology in the presence of apparent synapomorphy in the way that parsimony does, ML is not misled into performing as well as parsimony in the Farris zone in terms of recovering the correct quartet topology.
This generalized signal and noise analysis can be applied to diverse scenarios in which unequal branch lengths can arise and potentially introduce long-branch effects. Unequal branch lengths can be either caused by unequal evolution rates across lineages within the study group (i.e. relaxation of the molecular clock assumption), or due to an asymmetrical topology, which can arise as a result of differential speciation or extinction rates and/or incomplete taxon sampling . The signal and noise theory decouples the rate of substitution and time in characterizing the length of a branch. Thus, the theory can account for differences in both substitution rates and evolution times across lineages, and it can be applied to phylogenies in which unequal branch lengths occur due to unequal rates of evolution, asymmetrical topologies, or both.
In the second example, based on a four-taxon tree with an asymmetrical topology, results of the signal and noise analysis demonstrated that the chance of correctly resolving an asymmetrical quartet phylogeny can be increased by sampling slower-evolving molecular loci; the more asymmetrical the underlying topology is, the slower-evolving the sampled molecular loci should be. Rapidly-evolving molecular loci have poor predicted phylogenetic utility because at these loci, there is a higher probability of observing noise or homoplasy than actual signal. For the quartet tree used in this example study, the signal and noise analysis furthermore quantified the threshold substitution rate above which a nucleotide character may contribute a negative utility towards correct resolution of the quartet tree. In molecular phylogenetic investigations, a common practice to reduce long-branch effects is to exclude fast-evolving molecular loci—such as third codon positions—from inference analysis, based on the rationale that these loci are likely saturated or randomized [19,40,74-80]. On the other hand, third codon positions can contain a significant amount of information of the phylogenetic structure , and removing an excessive amount of rapidly-evolving loci can lead to a significant reduction in resolution [79,80,82]. Therefore, for an actual quartet phylogeny for which the inferred topology is suspected to result from long-branch effects, by applying the generalized signal and noise analysis to an alternative topology that is hypothesized to reflect the actual taxon relationship, one can estimate a threshold substitution rate for sampling molecular loci for overcoming the suspected long-branch effects while in the meantime minimizing the number of fast-evolving loci that are unnecessarily excluded from analysis.
In the third example, in which the substitution rate of a nucleotide character was variable across the four taxa within the study group, the signal and noise analysis demonstrated that in addition to sampling slower-evolving molecular loci, sampling loci with less variation in substitution rate across lineages is helpful for avoiding biases towards topologies that group faster-evolving non-sister branches together. The deeper the internode in question is, the more likely there is to be significant rate variation, and yet the deeper the internode is, the less variation in substitution rate across lineages the sampled molecular loci should have. At molecular loci with significant rate variation across lineages, convergent or parallel character state changes tend to accumulate along the lineages with faster substitution rates, thereby obscuring actual signal and reducing the phylogenetic utility of these loci. For the quartet tree assessed in this example, the signal and noise analysis has also quantified the range of rate variation across lineages within which a nucleotide character has a positive predicted utility towards correct quartet resolution. In phylogenetic studies, another proposed approach to reducing long-branch effects involves selecting only representative taxa with the lowest substitution rates and minimum rate variation across lineages [83-85]. However, numerous studies have suggested that increased taxonomic sampling generally leads to improved accuracy in phylogenetic inference ([67,68,75,86-90]; but see also [3,91]; as summarized in [6,7]), and excluding a large number of taxa may thus significantly decrease the accuracy of inference outcomes. 
Therefore, in an investigation in which the inferred topology is suspected to arise due to long-branch effects, by applying the generalized signal and noise analysis to an alternative topology hypothesized to reflect the actual taxon relationship, one may estimate the desirable range of rate variation across lineages to inform taxon sampling while at the same time avoiding removing an excessive number of taxa from analysis.
Lastly, in the fourth example, which compared utility prediction for the four-taxon tree in the previous example based on four alternative nucleotide substitution models (i.e. the JC, K2P, HKY, and GTR models), analysis results indicated that predictions of the signal and noise analysis are fairly robust to alternative model specifications, consistent with the finding of Su et al. in quartet trees with even subtending branches. In this example based on a four-taxon tree with unequal substitution rates across lineages, the predicted utility is higher under the JC model than under the other three more complex models; but as the model parameterization increases from the K2P model to the GTR model, the predicted utility remains largely unchanged. As explained by Su et al., in most realistic molecular data sets, there is always a certain degree of heterogeneity in model parameter values when the data are fitted to an optimal model. As the model grows in complexity, some character states, due to association with higher model parameter values, will begin to dominate the evolutionary process and thus effectively reduce the character state space. Analysis results of Su et al. also demonstrated that the predicted utility of a molecular character increases as the character state space increases (cf. Figure 6 in ). Thus, specifying an overly simple model can fail to adequately account for heterogeneity in the evolutionary process and hence cause an increase of the effective character state space. Consequently, the predicted utility based on an overly simple model is higher than actual. But once a model of sufficient complexity is fitted to the molecular data in question, the effective character state space is reduced closer to its actual size, and the predicted utility is more accurate. Therefore, specifying increasingly complex models will have progressively less impact on predictions of the signal and noise analysis.
In this paper, we have generalized phylogenetic signal and noise analysis by allowing a four-taxon tree of unequal subtending branch lengths. This generalized signal and noise analysis provides analytical prediction of utility of characters evolving at diverse rates of evolution to resolve quartet phylogenies in which unequal branch lengths arise due to unequal rates of evolution, asymmetrical topologies, or both.
Results and figures presented in the Results section were obtained by implementing the analytical calculations as outlined in the Theory section via Wolfram Mathematica 7 (Wolfram Research, Inc.).
Research ethical approval and consent are not applicable to this study, since the study involves no human subjects.
The authors sincerely thank Zheng Wang and Alex Dornburg for helpful discussion of the topic.
- 20. Farris JS. Likelihood and inconsistency. Cladistics. 1999;15:199–204.
- 36. Waddell PJ, Cao Y, Hauf J, Hasegawa M. Using novel phylogenetic methods to evaluate mammalian mtDNA, including amino acid invariant sites LogDet plus site stripping, to detect internal conflicts in the data, with special reference to the positions of hedgehog, armadillo, and elephant. Syst Biol. 1999;48:31–53.
- 58. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. In: Miura RM, editor. Some mathematical questions in biology: DNA sequence analysis (Lectures on mathematics in the life sciences). New York: American Mathematical Society; 1986. p. 57–86.
- 74. Swofford DL, Olsen GJ, Waddell PJ, Hillis DM. Phylogenetic inference. In: Hillis DM, Moritz C, Mable BK, editors. Phylogenetic Inference. Sunderland, MA, USA: Sinauer Associates; 1996. p. 407–514.
- 81. Källersjö M, Albert VA, Farris JS. Homoplasy increases phylogenetic structure. Cladistics. 1999;15:91–3.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.