Detecting recombination in evolving nucleotide sequences
- 6.4k Downloads
Genetic recombination can produce heterogeneous phylogenetic histories within a set of homologous genes. These recombination events can be obscured by subsequent residue substitutions, which consequently complicate their detection. While there are many algorithms for the identification of recombination events, little is known about the effects of subsequent substitutions on the accuracy of available recombination-detection approaches.
We assessed the effect of subsequent substitutions on the detection of simulated recombination events within sets of four nucleotide sequences under a homogeneous evolutionary model. The amount of subsequent substitutions per site, prior evolutionary history of the sequences, and reciprocality or non-reciprocality of the recombination event all affected the accuracy of the recombination-detecting programs examined. Bayesian phylogenetic-based approaches showed high accuracy in detecting evidence of recombination event and in identifying recombination breakpoints. These approaches were less sensitive to parameter settings than other methods we tested, making them easier to apply to various data sets in a consistent manner.
Post-recombination substitutions tend to diminish the predictive accuracy of recombination-detecting programs. The best method for detecting recombined regions is not necessarily the most accurate in identifying recombination breakpoints. For difficult detection problems involving highly divergent sequences or large data sets, different types of approach can be run in succession to increase efficiency, and can potentially yield better predictive accuracy than any single method used in isolation.
KeywordsPrediction Accuracy Markov Chain Monte Carlo Recombination Event Tree Topology Phylogenetic Signal
A homologous recombination event between two DNA sequences can be either reciprocal or non-reciprocal. In reciprocal recombination, genetic information is transferred or exchanged between two similar DNA sequences. In non-reciprocal recombination, a contiguous region of DNA is replaced by, rather than exchanged with, the transferred region. Both types of recombination are a consequence of the DNA mismatch repair mechanism which protects genetic information from damage. Gene conversion, for example, is a cross-over process between homologous sequences in which a DNA strand replaces a damaged partner DNA strand with a copy of its own sequence . A number of models describe the mechanisms of recombination, addressing issues of strand breakage, displacement and extension, and mismatch repair in double-stranded DNA [2, 3, 4]. Gene conversion events can lead to reshuffling of parental open reading frames, or of structural and functional motifs within protein domains, and these can generate a gene with novel functions [5, 6]. Therefore, reciprocal and non-reciprocal recombination events are important mechanisms in the creation of genetic diversity .
Recombination events have been inferred in prokaryotes [8, 9, 10], unicellular eukaryotes [11, 12] and multicellular eukaryotes [13, 14]. Homologous recombination has contributed to the evolution and functional divergence of multi-gene families such as the β-globin gene family , heat shock proteins [16, 17] and the major histocompatibility complex gene family [18, 19]. While most cases reflect recombination between DNA sequences within a genome, recombining DNA can also come from the external environment of a cell, which results in the acquisition of foreign DNA by a genome [20, 21]. If lateral genetic transfer via homologous recombination is a significant contributor to prokaryotic evolution [22, 23], the detection of recombination events will be essential in the inference of phylogenetic relationships among genomes .
Elucidating patterns of genetic transfer will enhance our understanding of the role selective forces play in shaping genomes. Homologous recombination events can produce genes with mosaic evolutionary histories in which the underlying evolutionary pattern is not a tree but a network [25, 26]; such an evolutionary pattern confounds analyses that assume a common evolutionary path for every component of a biological sequence. The task of delineating recombination events is hard for two major reasons. Firstly, if the recombining sequences are too similar, subsequent detection of the event may be impossible due to the lack of 'signal' to distinguish among sequences. Secondly, evolutionary events that occur after recombination will tend to obscure the true relationships between sequences. Homologous recombination events can overwrite previous such events, fragmenting the regions with consistent evolutionary histories until the events cannot be distinguished with confidence . Sequence substitutions after a recombination event will diminish the apparent similarity between fragments of a gene and their closest relatives in other sequences; this phenomenon has been shown to influence the accuracy of phylogenetic inference .
A number of approaches are available to detect evidence of recombination events and/or to identify the recombination breakpoints. They are classified into different categories based on the algorithms used . Distance-based methods generate statistics of genetic distances with the use of a sliding window along a set of aligned sequences [30, 31]. Abnormal inversions of distance patterns are detected without reference to the underlying phylogenetic relationship among the sequences. Phylogenetic-based methods are based on the detection of alignment partitions with discordant phylogenetic relationships . In compatibility-based methods, inference of recombination and phylogenetic incongruence is based on parsimoniously informative sites identified within an alignment [33, 34, 35]. Substitution distribution-based methods detect regions within a set of sequences that are significantly similar or clustered together, with the level of significance based on a modelled statistical distribution of nucleotide substitution . Computer simulations and empirical data have been used to evaluate the performance of a number of available methods in detecting and analysing recombination events based on the amount of recombination and sequence divergence [29, 37, 38]. A uniform outcome from these studies is that compatibility-based and substitution-based methods perform better than the phylogenetic-based approaches. Furthermore, Posada and Crandall  demonstrated that phylogeny reconstruction from sequences could be biased owing to the reciprocality and the age of a recombination event, as well as the parental divergence of the sequences involved in the event. These studies suggested that conclusions about recombination should not be drawn on the basis of a single method due to biases of different approaches to the nature of the dataset, e.g. some methods were found to detect far fewer recombination events and breakpoints than expected [38, 39]. Using simulated sequence data, we examined the effect of subsequent substitution after a recombination event on the prediction accuracy of different recombination-detecting programs, within the simplified framework of homogeneous substitution rate and nucleotide composition throughout the lineages.
The five programs compared in this study were GENECONV (substitution distribution-based) , RecPars (phylogenetic-based) , Reticulate (compatibility-based) , BARCE  and DualBrothers  (both Bayesian phylogenetic-based). These programs were selected based on their capabilities for large-scale automated analyses, their usage within the scientific community and/or strong performance within their class of algorithm in recent reviews [29, 37, 38]. A brief overview of our methods for calculating prediction accuracy is provided below, with further details in the Methods section. The phylogeny of the recombined region between points r 1 and r 2 in each sequence set was assessed by Bayesian inference using a Markov chain Monte Carlo (MCMC) approach, implemented in MRBAYES . As the amount of subsequent substitutions increases, the phylogenetic signal becomes weaker, and the Bayesian posterior probability (BPP) of the tree topology is expected to decrease. Failure to assign high BPP to the correct topology within the recombined region indicates a loss of phylogenetic signal, and suggests that statistical searches for recombination events or breakpoints, particularly those based on phylogenetic relationships, may be futile.
Substitution distribution- and phylogenetic-based approaches
The phylogenetic signal of the recombined region was stronger when the recombining sequences were more divergent, i.e. when the immediate preceding lineage length leading up to the recombination event was longer. The BPP values obtained with MRBAYES were > 0.90 in all cases when the immediate preceding lineage was at 0.50 substitutions per site, even when subsequent substitutions were high (BPP in L05/50: 0.99, L50/50: 0.94 and L50: 1.00 at λ = 0.50). The recombination signal and prediction accuracy decreased more rapidly in response to increasing λ when the recombining sequences were more similar to one another (L05/05, L50/05 and L05).
Within the reciprocal set (Figure 3a), when recombining sequences were more dissimilar (L50), Reticulate showed higher NSS (0.9950 at λ = 0.00) and higher proportions of sets with significant clustering (1.00 at λ = 0.25) compared to L05 (NSS 0.9760 at λ = 0.00; proportion 0.27 at λ = 0.25). In comparison to the reciprocal set, lower NSS were obtained within the non-reciprocal set even when no subsequent substitution was simulated e.g. NSS in L05/50: 0.8174 and L50/50: 0.8078 at λ = 0.00. When λ reached 0.50 and the immediate preceding lineage length (θ 2 ) was long, the proportion of significant clustering within the non-reciprocal set was low (e.g. 0.10 in L05/50 and 0.17 in L50/50). This can be explained by the resulting identical sequence fragments in a non-reciprocal event, which yielded fewer parsimoniously informative sites in the alignment. The measure of statistical significance within a set of data is lower when the sample size is smaller, because the probability of obtaining a result by chance is higher. Therefore, as fewer parsimoniously informative sites were being considered in a non-reciprocal event, the clustering efficiency of sites in the compatibility matrix was lower.
Bayesian phylogenetic-based approach
For the reciprocal sets (Figure 4), both BARCE and DualBrothers showed high accuracy when recombining sequences were more divergent (L50), with DualBrothers showing higher accuracy than BARCE; within the recombined region, DualBrothers showed a minimum accuracy of 0.98 and BARCE showed a minimum accuracy of 0.82 when λ = 0.50.
A slightly different trend was observed in the non-reciprocal set (Figure 5). Consistent with the observation in Figure 1, when the immediate preceding lineage length (θ 2 ) was short (L05/05 and L50/05), the phylogenetic signal of the recombined region was diminished to a greater extent by the resulting identical sequence fragments. BARCE and DualBrothers showed lower accuracy in recovering this region than the non-recombined region, especially when λ was high. For instance in L50/05, DualBrothers showed lower accuracy in recovering the recombined region (minimum accuracy 0.58; λ = 0.50) than in recovering the non-recombined region (minimum accuracy 0.99; λ = 0.50). The exact opposite trend was observed in L05/50 when recombining sequences were more divergent, with accuracy in recovering the recombined region higher (minimum accuracy 0.98; λ = 0.50) than in recovering the non-recombined region (minimum accuracy 0.56; λ = 0.50). Similar bias in recovering recombined or non-recombined region was observed with BARCE, although to a lesser extent. When both primary and secondary lineage lengths of a tree were long (L50/50) prior to recombination, both programs showed high accuracy in recovering both regions, DualBrothers (e.g. minimum accuracy 0.97 within recombined region; λ = 0.50) more so than BARCE (e.g. minimum accuracy 0.84 within recombined region; λ = 0.50).
The Markov chains generated by DualBrothers were always initiated from a random point in the space of trees, breakpoints and models. To assess the sensitivity of sampled breakpoint posterior probabilities to the choice of start point, a subset of our simulated datasets was used, in which DualBrothers was implemented on each simulated set ten times, each time starting at a random point (see Methods section). For the majority of parameter combinations, the posterior probability of breakpoints at each of the 1000 sites in the sequence was extremely stable (BPP range < 0.01) across all ten replicates, with higher variability observed at the 'shoulder' regions of breakpoints (BBP range < 0.10). Even higher variability was observed in parameter combinations that yielded the most difficult-to-detect breakpoints (BPP range reached 0.8 in the worst case, L05/05; λ = 0.50), showing that the dataset requires a much longer run to achieve stable BPPs. Nevertheless, when a given site was assigned a breakpoint BPP ≥ 0.5 in a replicate, the BPP assigned to the same site in each of the other replicates was always greater than each respective median BPP. Consequently, breakpoint identification was at worst still consistent across replicates, in spite of the high BPP variation seen for this combination of parameters.
Sensitivity to parameter settings
Relationship between simulation parameters and prediction accuracy
In this study, for each reciprocal and non-reciprocal set, the variation in prediction accuracy of recombination-detecting programs was related to a few major factors: (i) evolutionary history prior to recombination (θ), (ii) substitutions after the recombination event (λ), and (iii) parameter settings for certain programs i.e. GENECONV and RecPars. To examine the relationship between these factors and prediction accuracy, multiple linear regression (MLR) analysis was carried out for each program individually. Across all five detection programs in this study, all of the factors considered had significant effects on the prediction accuracy (p-values < 2.20 × 10-16). Since extremely low p-values could not be differentiated, comparison of the t-statistics associated with different factors was assessed to determine the relative strength of their effects on prediction accuracy. Of the factors examined, the extent of subsequent substitution had the strongest effect on prediction accuracy in all approaches, e.g. GENECONV with t-values -135.40 in the reciprocal set and -177.83 in the non-reciprocal set, consistent with the observed trends in Figures 2 through 6.
The overall F values obtained in the MLR analysis for each factor examined are associated with very small p-values (< 2.20 × 10-16), supporting the graphical evidence in Figures 2 to 7 that the effects of simulation parameters on prediction accuracy are statistically significant. The adjusted R2 value shows the percentage of the outcome that could be explained by the factors examined in the study. The highest R2 value (0.85; F value 6863; 3596 degrees of freedom in reciprocal set) was assigned to the substitution distribution-based approach (GENECONV), implying that approximately 85% of the outcome could be explained by the factors examined. The compatibility-based approach, Reticulate, showed a similarly high R2 value (0.83). The phylogenetic-based approaches RecPars (0.58), BARCE (0.68) and DualBrothers (0.28), had lower associated R2 values (values shown for the reciprocal set). Most of the MLR assumptions  were not violated by the data analysed, except for a slightly non-linear relationship observed between independent and dependent variables, and evidence for heteroscedasticity in some analyses (results not shown). Although the assumption of homoscedasticity was sometimes violated, MLR is robust to violations of this assumption when sample sizes are large, so the observed heteroscedasticity is unlikely to have a strong influence on our analysis.
Using simulated sequence data and multiple regression analysis, we have shown that the prediction accuracy of recombination-detecting programs is affected by the reciprocal and non-reciprocal nature of the recombination event, prior evolutionary history, subsequent substitutions after the recombination event, and the choice of parameter settings in certain programs.
Reciprocal versus non-reciprocal recombination events
The approaches showed higher accuracy in recovering reciprocal recombination events than non-reciprocal events, owing to the strength of the phylogenetic signal within the recombined region. In a reciprocal event, two lineages are exchanged, disrupting the phylogenetic relationship while retaining the original tree shape that has four terminal edges of equal length. In a non-reciprocal event, one sequence is over-written by another, reducing the genetic diversity in the set and producing a four-taxon tree in which two of the sequences are identical. This effect is stronger when all four sequences are more similar to each other at the point of recombination: while the region consisting two identical sequences was easily identified as a recombined region, a relatively small number of substitutions simulated afterward was sufficient to attenuate the phylogenetic signal of this region. Therefore, the phylogenetic signal of non-reciprocally recombined regions was more sensitive to subsequent substitutions than was that of the reciprocally recombined regions.
Evolutionary history prior to recombination
The prior evolutionary history of the four sequences also affects the accuracy of a program in assigning the correct phylogeny to the recombined and non-recombined regions. Two good examples that illustrate this point are L50/05 and L05/50 in the non-reciprocal set (Figure 5).
In the case of L50/05, BARCE and DualBrothers showed high accuracy in recovering the phylogeny of the non-recombined region, but not of the recombined region. The phyletic grouping of sequences (1, 2) and sequences (3, 4) is supported by more shared substitutions in L50/05 than in L05/50, so the program can recover the non-recombined region more easily. The lower accuracy within the recombined region is due to the effect of topological structure as mentioned above when the sequences are more similar to each other, i.e. when θ 2 is short. The exact opposite trend was observed in L05/50 (longer θ 2 ), where the phylogenetic signal of the non-reciprocal event was stronger. Since θ 1 was short in this case, the phyletic grouping of sequences (1, 2) and sequences (3, 4) is not as obvious as in the case of L50/05. Therefore, the programs showed low accuracy in recovering the non-recombined region, but high accuracy in recovering the recombined region. Similarly with the observation in L50 of the reciprocal set (Figure 4), recombination-detecting programs showed high accuracy in recovering the phylogeny of both the recombined and non-recombined regions when the recombining sequences are more divergent. This observation supports previous studies that suggested that recombination is easier to detect with increasing levels of sequence divergence [37, 38], and that phylogenetic analyses of non-reciprocal recombination events are more sensitive to the prior evolutionary history of the sequences than are similar analyses of reciprocal events .
Substitutions after the recombination event
The extent of substitution after the recombination event (λ) also plays an important role in determining the prediction accuracy of an approach. The phylogenetic signal of the recombined region becomes attenuated as subsequent evolution progresses, and the task of detecting evidence of ancient recombination events can be difficult or impossible. While all approaches in this study showed lower prediction accuracy with increasing amounts of subsequent substitution, the substitution distribution-based approach was found to be most sensitive. High rates of subsequent substitution independently on each sequence disrupt long contiguous matches in a local pairwise comparison, which the substitution distribution-based approach i.e. GENECONV is solely based on. By iteratively sampling phylogenies across the alignment and suppressing minor variations in the phylogenetic pattern across a set of sequences, the Bayesian phylogenetic-based approach was least sensitive to the effects of subsequent substitution. As RecPars is based on finding parsimonious topological changes across alignment columns, and GENECONV is based on significant pairwise alignments of fragments, these programs perform better when recombined regions are longer.
We demonstrated that the prediction accuracy of certain programs (e.g. RecPars) is highly sensitive to specific parameter settings. When dealing with real datasets, it is almost impossible to know in advance the optimal recombination cost to use in RecPars . This supports the finding from a previous study that the phylogenetic approach performs poorly in detecting recombination . A similar trend was observed in GENECONV, for which prediction accuracy was shown to be affected by the gscale setting, as described in the Results section.
Identification of recombination breakpoints
Previous studies suggest that Reticulate and GENECONV have similar detection power in detecting recombination events , and that the compatibility approach of Reticulate is one of the more reliable methods . Although these methods are computationally less intensive than the Bayesian phylogenetic-based approaches, some of them (e.g. Reticulate) were designed to detect the presence of recombination events, not to locate recombination breakpoints [29, 37].
The Bayesian phylogenetic-based approach proved to be the best in delineating recombination breakpoints, showing high accuracy. This approach has a great advantage over the other approaches in detecting ancient recombination events, as the prediction accuracy was the least dependent on the extent of subsequent substitution. In many instances, BARCE showed a pattern of gradual transition from one topology to another, with intervening sites that have no dominant topology. This pattern is a good indicator of a recombination event, but the exact location of the breakpoint is not obvious. Although an abrupt change of tree topology BPP proposed between two adjacent sites is a good indication of a recombination breakpoint in some cases, there is no explicit proposal of breakpoints in BARCE. While the whole alignment was assumed to be under the same evolutionary pressure in BARCE, the dual Multiple Change-Point (MCP) model in DualBrothers , designed specifically to identify recombination breakpoints, proposes change point within the alignment, independently based on changes of tree topology and evolutionary rate. Although more false positives in defining phylogenies (e.g. L50/05 and L05/50 in Figure 5) were observed with DualBrothers, the explicit proposal of breakpoints in the program can be analysed in a more systematic manner. As shown in Figure 6, the breakpoints were identified with two sharp peaks in the graphs.
As is the case with other iterative MCMC approaches, the accuracy of BARCE comes at the expense of runtime, and the program can only be applied to datasets having exactly four sequences. DualBrothers can be applied to datasets with more than four sequences, but the program can be very time-consuming because the dual MCP model is described by eight parameters related to location of breakpoints, tree topologies and evolutionary rates.
Of all approaches examined, the conventional phylogenetic-based approach (RecPars) is of least practical use due to dependency of the program on parameter settings. Based on our findings in this study, we recommend use of the Bayesian phylogenetic-based approaches in detecting recombination events and breakpoints. Since the approach is time-consuming, faster approaches based on compatibility or substitution distribution can be used in the first instance to suggest the occurrence of a recombination event. For example, a marginally significant or significant clustering of incompatible sites in Reticulate or an unusually similar fragment (e.g., with p-value ≤ 0.10) determined by GENECONV can be taken as suggesting a recombination event. Bayesian phylogenetic-based approaches, e.g. DualBrothers, can then be applied to confirm, in a more-accurate manner, the possible breakpoints in the recombination event. We assumed a homogeneous evolutionary regime throughout the course of sequence evolution in this study using simulated data. When the evolutionary scenario is more complicated e.g. with different rates and different evolutionary models within a tree [46, 47], and for difficult detection problems involving highly divergent sequences, the use of fundamentally different approaches can provide multiple lines of evidence in support of the observed results.
In this study, we highlighted the strengths and weaknesses of different classes of recombination detection programs. We demonstrated differences in phylogenetic signals within recombined and non-recombined regions, between a reciprocal and a non-reciprocal event, and how these signals affect prediction accuracy of different approaches in detecting occurrence and identifying breakpoints of a recombination event. Bayesian phylogenetic-based approaches showed high accuracy in identifying recombination breakpoints but are time-consuming due to the complexity of MCMC and the models used. The compatibility-based approach is fast and does not depend on specific parameter settings. The conventional phylogenetic-based approach, and to a lesser extent the substitution distribution-based approach, are sensitive to key parameter settings, and infer recombination events and breakpoints only when these settings are tuned to the data, which may be impossible to achieve with empirical data. In detecting recombination events, the negative dataset can be filtered out by a first-pass run using faster methods; the more-accurate (and slower) methods can then be used in delineating the recombination breakpoints among the positive dataset. The combinatorial approach is more time-efficient, especially when scanning through a large dataset. Since the methods applied here are different in principle, identification of an event by multiple methods may also increase our confidence that a recombination event has indeed occurred.
Simulation of sequence evolution
Seq-Gen  was used to generate four-taxon sequence sets of length 1000 nt using the HKY  model of substitution with nucleotide frequencies A = 0.20, C = 0.30, G = 0.30, T = 0.20, a transition/transversion ratio of 2, and a four-category discrete approximation to a gamma distribution of among-site rate variation with shape parameter alpha = 1.0.
The simulation process is illustrated in Figure 1. For simulating reciprocal recombination, the sequences were first evolved along the separate lineages, each with length θ 1 of 0.05 or 0.50. For simulating non-reciprocal recombination events, sequences were first simulated using different combinations of θ 1 and θ 2 lineage lengths (0.05 and 0.50), in which θ 1 and θ 2 represent pre- and post-speciation lineage respectively. A reciprocal recombination event was simulated by manually exchanging a defined region between sequence 1 and sequence 3. A non-reciprocal recombination event was simulated manually by replacing a defined region of sequence 3 with that of sequence 1 as shown in Figure 1b. In all simulations, the recombined region was centred in the middle of the sequence block with 50% (500 nt) of the total sequence length (1000 nt), creating recombination breakpoints r 1 (250/251) and r 2 (750/751). After recombination, subsequent substitutions (λ) were simulated independently for each sequence with 0.00, 0.25 or 0.50 substitutions per site in each set. In all, 100 replicates were simulated for each test set.
Detection of recombination events
GENECONV, a substitution distribution-based method, uses non-parametric statistics to rank possible recombination events in an alignment, in which pairwise polymorphic sites are compared and scored . Recombination is inferred when sub-sequences in a two-sequence region are significantly more similar to each other than in the other regions in the sequence alignment. GENECONV version 1.81 was run with the simulated sequences in PHYLIP format as input. The parameter gscale = 1 was applied in all cases. Other default settings were used, i.e. minimum fragment length = 1, minimum number of polymorphisms = 2 and minimum pairwise score = 2.
RecPars is based on a parsimony algorithm that infers phylogenies for different segments in a sequence alignment; a recombination event can be inferred where these phylogenies change . The assignment of incongruent topologies is affected by the recombination cost, which is the penalty associated with introducing a recombination breakpoint into the sequence. In this work, no recombined regions were detected when the recombination cost was set too high (e.g. when the default setting of 100 was used), while a great number of incongruent tree topologies were defined when little or no recombination cost (e.g. < 10) was applied. Therefore, for each analysis, recombination cost was initially set at c = 100, repeated with c – 1 and so forth, until the sequences were partitioned into three or more fragments. We took the maximum recombination cost that caused the sequences to be partitioned into three or more fragments to be the optimal recombination cost. RecPars was run with input sequences in RecPars format, with a uniform substitution cost of 1.
Reticulate is a compatibility-based method to detect phylogenetic incongruence within an alignment, based on parsimoniously informative sites . The goal is to identify the tree(s) that provide the most parsimonious explanation (minimal number of substitutions) for each pair of sites, and then determine whether these trees are compatible with one another. Informative sites are defined as sites that have at least two different nucleotides present in two or more sequences each. Uninformative sites are discarded prior to analysis, since they cannot distinguish among tree topologies. The informative sites are paired and compared with each other. Two informative sites are considered compatible if both can be explained using the same phylogenetic tree. A matrix is generated in which each cell corresponds to the compatibility of pairs of informative sites. A cluster of incompatible sites in the matrix signifies a possible recombined region. Reticulate  was run with input sequences in FASTA format. We modified the source code to output the matrix directly in encapsulated postscript format, for the ease of large-scale batch runs. Clustering of sites was determined using neighbour similarity score (NSS) statistics incorporated in the program, with the generation of 1,000 random matrices for each simulation set.
BARCE, or Bayesian Application for Recombination and gene Conversion Estimation, is a program for detecting recombination breakpoints in alignments of four sequences . Hidden Markov models are used to represent the patterns among different tree topologies assigned to each site (column) of the sequence alignment. Bayesian posterior probabilities are then assigned to all three possible tree topologies for each site. A proposed change of tree topologies between two adjacent sites suggests a recombination breakpoint. BARCE version 1.2 was run with sequences in PHYLIP format using the F84  model, equally distributed prior probability of tree topologies at 1/3 each, difficulty of changing trees = 0.9, burn-in period = 100,000 generations, and length of sampling period = 100,000 generations. Initial character frequencies and transition/transversion ratios were estimated from the data.
DualBrothers is a Bayesian phylogenetic-based method for defining recombination breakpoints using a dual multiple change-point (MCP) model . The spatial phylogenetic variation within the alignment is described by two independent change-point processes introduced by the dual MCP model based on changes in tree topology and evolutionary pressures across a set of sequences. The proposed changes are captured by a reversible-jump MCMC sampling algorithm  extended from a Metropolis-Hastings sampling scheme. DualBrothers samples parameters that define an HKY substitution model , and assumes that branch lengths are a priori independent across the tree and from site to site . The algorithm can be applied to alignments with more than four sequences, but the sampling process is slow owing to the number of parameters considered.
A modified version of DualBrothers 1.1 was kindly provided by Aaron Darling (University of Wisconsin). The program was run with MCMC chain length = 550,000 generations, burnin = 50,000 generations, window_length = 5, Peter Green's constant  C = 0.20 and start_tree = (0,(1,(2,3))). Other parameters were run with default settings.
To assess the sensitivity of sampled breakpoint posterior probabilities to the choice of start point, we performed a replicated analysis of a subset of our simulated data sets. A set of sequences was chosen at random from each of the 18 combinations of event type (reciprocal or non-reciprocal event with combinations of θ 1 , θ 2 and λ). The DualBrothers analysis as described above was performed ten times on each of these data sets, with each run commencing from a random starting point. The posterior probability of each site being proposed as a breakpoint was obtained for each replicate for comparison.
Analysis of program accuracy
For GENECONV and RecPars, the average accuracy, determined separately for the recombined and non-recombined regions, was defined as the number of sites (columns) in the alignment that are correctly assigned to the expected topological relationship, divided by the total number of sites considered within the region:
in which N R is the number of sites that are correctly assigned within the recombined region, N NR is the number of sites that are correctly assigned within the non-recombined region, T R is the total number of sites within the recombined region and T NR is the total number of sites within the non-recombined region. This measure assigns equal weight to the recombined and non-recombined regions, regardless of their relative lengths. If the program correctly assigned a tree topology to all sites within a region but not to those sites within the other (for instance, when no recombination was detected), an average accuracy of 0.5 would be obtained. The average accuracy was scaled to a range of [-1,1] to yield the prediction accuracy, with 1.0 indicating perfect correlation between prediction and simulated history, and -1.0 indicating perfect anti-correlation:
In RecPars, sites in an alignment that were assigned the expected topology were considered as correct assignments. In GENECONV, a full list of pairwise fragments was obtained for each simulated run, with a statistical significance assigned to each fragment based on the observed sequence similarity. Each site (column) within the original alignment was assigned the relationship implied by the pairwise fragment bearing the highest statistical significance. If the relationship assigned to an alignment site was consistent with the known topological relationship of the sequences, the site was recorded as having a correct assignment.
To examine the support for the topology of the recombined region in relation to the prediction accuracy of GENECONV and RecPars, the phylogeny of the recombined region in each sequence sets was determined by Bayesian inference using a Markov chain Monte Carlo (MCMC) approach implemented in MRBAYES , applying the K2P  model of substitution, and a discrete approximation to a gamma distribution with four categories. MCMC analysis was run for 1,100,000 generations, with burn-in = 100,000 generations, number of chains = 4 and temperature parameter for heating the chains = 0.5.
For Reticulate, clustering significance of pairwise comparison in the compatibility matrix via the neighbour similarity score (NSS) was used as the criterion of prediction accuracy. There are two possible colours in each cell, corresponding to the relationship between the alignment columns under consideration: incompatible (black) or compatible (white). The NSS ranged from 0.5 to 1.0, with higher NSS representing stronger clustering of similar tree topologies. Using a Monte Carlo approach in Reticulate, a total of 1000 random matrices was generated for each simulation set to test the non-randomness of clustering, i.e. its independence from the ordering of informative sites. If the clustering is non-random, the NSS values obtained from these random matrices should rarely be greater than the NSS obtained from the original matrix. The probability of a random matrix with an NSS greater than or equal to the NSS of the original matrix was interpreted as a p-value, in which clustering with p ≤ 0.05 was considered non-random, and therefore significant.
Both BARCE and DualBrothers assigned each site (column) of the four-sequence alignment a BPP for possible tree topologies: (a,b),(c,d); (a,c),(b,d); or (a,d),(b,c). The predictive accuracy of BARCE was assessed at a series of BPP thresholds ranging from 0.5 to 1.0. For each threshold level, if any of the three topologies was given a BPP ≥ threshold level, the site was considered in the calculation of prediction accuracy; otherwise the prediction for that site was ignored. Prediction accuracy, determined separately for the recombined and non-recombined regions, was defined as the number of correct assignments over the total number of sites considered. If the expected tree topology was given the highest BPP at the particular site, the site was treated as a correct assignment. At higher threshold levels, the prediction accuracy is expected to be higher, as fewer sites are considered.
Multiple linear regression analysis
Multiple linear regression (MLR) analysis was carried out using the statistical package R  to examine the significance of the relationship between the prediction accuracy of a program, and the simulation parameters that were varied in the analysis. The linear model used was:
Accuracy ~ Template + Brlen
Factors considered are: (i) starting tree topologies in non-reciprocal set or primary (ancestral) lineage length in reciprocal set (Template) and (ii) subsequent substitution after recombination event (Brlen). Analysis for GENECONV and RecPars also includes Gscale for gscale setting in GENECONV and Recost for recombination cost setting in RecPars. To test for violations of the assumptions in MLR analysis, the Kolmogorov-Smirnov test  for normality and Fligner-Killeen test  for homoscedasticity were carried out. Residuals versus fitted value and Cook's distance  plots were employed to test for linearity and influence of outliers.
This study is supported by an Australian Research Council (ARC) grant CE0348221. The authors are indebted to the advice and comments of David Posada, Ziheng Yang, Ingrid Jakobsen and Marc Suchard; Aaron Darling for a modified version of DualBrothers; and Frank Wright on the use of BARCE. CXC is supported by a UQIPRS scholarship for his postgraduate study.
- 44.Osborne JW, Waters E: Four assumptions of multiple regression that researchers should always test. Pract Assess Res Eval 2002, 8: 2.Google Scholar
- 54.The R project for statistical computing[http://www.r-project.org/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.