Effect of the assignment of ancestral CpG state on the estimation of nucleotide substitution rates in mammals
Molecular evolutionary studies in mammals often estimate nucleotide substitution rates within and outside CpG dinucleotides separately. Frequently, in alignments of two sequences, the division of sites into CpG and non-CpG classes is based simply on the presence or absence of a CpG dinucleotide in either sequence, a procedure that we refer to as CpG/non-CpG assignment. Although it is likely that this procedure is biased, it is generally assumed that the bias is negligible if species are very closely related.
Using simulations of DNA sequence evolution, we show that assignment of the ancestral CpG state based on the simple presence/absence of the CpG dinucleotide can seriously bias estimates of the substitution rate, because many true non-CpG changes are misassigned as CpG. Paradoxically, this bias is most severe between closely related species, because in a two-branch tree a minimum of two substitutions is required to misassign a true ancestral CpG site as non-CpG, whereas only a single substitution is required to misassign a true ancestral non-CpG site as CpG. We also show that CpG misassignment bias differentially affects fourfold degenerate and noncoding sites due to differences in base composition, such that fourfold degenerate sites can appear to be evolving more slowly than noncoding sites. We demonstrate that the effects predicted by our simulations occur in a real evolutionary setting by comparing substitution rates estimated from human-chimp coding and intronic sequence using CpG/non-CpG assignment with estimates derived from a method that is largely free from bias.
Our study demonstrates that a common method of assigning sites into CpG and non-CpG classes in pairwise alignments is seriously biased and recommends against the adoption of ad hoc methods of ancestral state assignment.
Keywords: Substitution Rate; Nucleotide Substitution Rate; Fourfold Degenerate Site; Intronic Site; Estimate Substitution Rate
In mammals, the methylated form of cytosine (5-methylcytosine or 5mC) is hypermutable. 5mC is formed by the enzyme DNA methyltransferase operating on a cytosine occurring immediately 5' of a guanine. One effect of methylation is to increase the rate of spontaneous deamination of 5mC to form thymine. It has been estimated that transitions in the methylated CpG dinucleotide occur 8–16 times faster than non-CpG transitions [2, 3, 4]. A smaller elevation of the rate of transversion mutation at the CpG dinucleotide has also been observed [3, 5, 6]. It has been suggested that CpG mutability in mammals underwent an abrupt elevation sometime around the mammalian radiation (~90 Myr; ref ), possibly in response to invasion by rapidly replicating transposable elements.
As a result of this hypermutability, molecular evolution studies have frequently attempted to estimate CpG and non-CpG substitution rates by separating observed substitutions into those that are inferred to have occurred within and outside CpG dinucleotides. In many previous studies [8, 9, 10, 11, 12, 13, 14, 15, 16], any nucleotide occurring within a CpG (i.e. either the constituent "C" or "G" of a CpG) or opposite a CpG (i.e. a site that, whilst not necessarily occurring within a CpG dinucleotide, is aligned with the "C" or "G" of a CpG dinucleotide in an orthologous sequence) has been inferred to have been an ancestral CpG site. We hereafter refer to this process as CpG/non-CpG assignment. Although CpG/non-CpG assignment is likely to be biased, it is generally assumed that this bias will be negligible if two sequences are closely related. Furthermore, some studies that employed this assignment procedure in the analysis of protein-coding sequence have suggested that while the overall rate of substitution is higher at synonymous sites, both CpG and non-CpG synonymous substitution rates are substantially lower than substitution rates in noncoding DNA [10, 14]. This has been interpreted as evidence of purifying selection at synonymous sites.
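The assignment rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in any of the cited studies: the function names are hypothetical, and the sketch assumes a gap-free pairwise alignment supplied as two equal-length strings.

```python
def cpg_positions(seq):
    """Return the indices that are the C or the G of a CpG dinucleotide."""
    seq = seq.upper()
    sites = set()
    for i in range(len(seq) - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            sites.update((i, i + 1))
    return sites


def assign_cpg(seq_a, seq_b):
    """Label each column 'CpG' if it lies within a CpG in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    flagged = cpg_positions(seq_a) | cpg_positions(seq_b)
    return ["CpG" if i in flagged else "non-CpG" for i in range(len(seq_a))]


def count_differences(seq_a, seq_b):
    """Count sites and observed differences in each assigned class."""
    labels = assign_cpg(seq_a, seq_b)
    counts = {"CpG": [0, 0], "non-CpG": [0, 0]}  # [sites, differences]
    for a, b, label in zip(seq_a.upper(), seq_b.upper(), labels):
        counts[label][0] += 1
        counts[label][1] += a != b
    return counts
```

For example, if a hypothetical ancestor ACAGT contains no CpG and one descendant becomes ACGGT through a single A to G change, `count_differences("ACAGT", "ACGGT")` places that change in the CpG class, even though it occurred at an ancestral non-CpG site.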
CpG/non-CpG assignment bias at mutational equilibrium
CpG/non-CpG assignment bias in coding and noncoding sequences
The difference in level of estimation bias at fourfold and noncoding sites results from a difference in CpG content. Because of the structure of the genetic code, fourfold degenerate sites are typically enriched for CpG dinucleotides, whereas noncoding sequences are depleted (0.032 vs 0.010 in our murid sequences). With changing CpG frequency, the numbers of misassigned sites will make up different proportions of the total lost or gained from the CpG and non-CpG classes. Thus with decreasing CpG content, the overestimation of the CpG substitution rate becomes greater whereas the underestimation of the non-CpG substitution rate becomes less. Paradoxically, this result implies that whenever CpG/non-CpG assignment is used to estimate substitution rates between closely related species, CpG-rich sequences will appear to be evolving more slowly than CpG-poor sequences.
CpG/non-CpG assignment bias in real sequence data
Nucleotide substitution rates estimated at non-CpG (K_nCG) and non-CpG-prone (K_nCGprone) sites at fourfold degenerate and intronic sites in 1470 human-chimpanzee orthologues, together with the ratio K_nCG/K_nCGprone.
Reliable, bias-free estimation of nucleotide substitution rates is a fundamental part of molecular evolutionary studies. Furthermore, in order to test hypotheses about natural selection, underlying neutral processes, such as mutational rate variation, must be accounted for and their effects quantified or removed as efficiently as possible. Our study has shown that a commonly used method of effecting such a removal, assigning nucleotide sites to CpG and non-CpG categories, systematically biases the estimation of nucleotide substitution rates. In particular, across small to moderate evolutionary distances, CpG/non-CpG assignment seriously upwardly biases the estimate of the number of CpG changes and downwardly biases the estimate of the number of non-CpG changes. This is due to a simple artefact that causes many more true non-CpG changes to be misassigned as CpG than vice versa. Clearly, more reliable removal of the effects of CpG mutation is required prior to any meaningful comparison of rates of nucleotide substitution among different regions of mammalian genomes. Our simulations indicated that exclusion of CpG-prone sites (sites preceded by "C" and/or followed by "G") is one simple and effective method of removing CpG-derived changes. One possible alternative to this is to employ an outgroup species to improve the accuracy of ancestral assignment. However, simulations showed that parsimony-based ancestral assignment can introduce additional biases into the estimation of substitution rates [see Additional File 1]. These effects are consistent with the known problems with parsimony when base composition is biased, namely an excess of common-to-rare changes (in this case an excess of non-CpG to CpG changes).
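As an illustration of restricting an analysis to sites that cannot have been part of a CpG, the following Python sketch returns the alignment columns that are neither preceded by "C" nor followed by "G". The function names are hypothetical, and applying the test to both sequences of a gap-free pairwise alignment is an assumption of this sketch rather than a documented detail of the method.

```python
def is_cpg_prone(seq, i):
    """True if site i is preceded by C or followed by G in seq."""
    preceded_by_c = i > 0 and seq[i - 1] == "C"
    followed_by_g = i + 1 < len(seq) and seq[i + 1] == "G"
    return preceded_by_c or followed_by_g


def non_cpg_prone_sites(seq_a, seq_b):
    """Indices of columns that are not CpG-prone in either sequence."""
    a, b = seq_a.upper(), seq_b.upper()
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        i for i in range(len(a))
        if not is_cpg_prone(a, i) and not is_cpg_prone(b, i)
    ]
```

Substitution rates would then be estimated from the retained columns only, at the cost of discarding a substantial fraction of sites.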
The bias we describe here also causes the substitution rate within CpG dinucleotides to be substantially overestimated between closely related species. Estimation of the rate of substitution at CpGs has also been attempted using CpG/non-CpG assignment by previous studies [8, 10]. It is likely that such studies will have overestimated the rate of substitution at CpG sites, again due to misassignment issues. When estimating the rate of substitution at CpG sites, it is clearly preferable to implement a method that explicitly models context-dependent evolution, such as those proposed in refs [3, 4, 22] or . Our results are supported by recent studies that have demonstrated that methods of "optimal" ancestral state assignment, such as the ad hoc method described here, can be seriously biased and misleading [24, 25].
The results presented here also have implications for previous comparisons of the substitution rate at fourfold and noncoding sites in mammalian genomes [10, 14]. As a result of CpG/non-CpG misassignment bias and differences in base composition, typical mammalian noncoding and fourfold degenerate sites are likely to be predisposed to misleading inferences about their relative rates of substitution. The net effect is that fourfold synonymous sites appear to be evolving more slowly than noncoding sites when substitutions are divided into those that have apparently occurred within and outside a CpG dinucleotide. Furthermore, fourfold sites will typically have a higher CpG frequency than noncoding sites both during the progression to, and at, mutational equilibrium. Because of this, an apparent depression of evolutionary rates at fourfold degenerate sites, when compared to noncoding DNA using CpG/non-CpG assignment, is likely to be a general feature of mammalian molecular evolution.
Our study shows that a commonly used method of assigning sites to CpG or non-CpG classes in pairwise alignments is seriously biased. We further note that the effects of CpG and non-CpG misassignment at fourfold and noncoding sites are dependent on differential CpG frequencies, and so these results will apply to any comparison of substitution rates where this is the case. Our results therefore recommend against the adoption of ad hoc methods of ancestral state assignment.
We studied a simple mutation model in which transitions occurred twice as frequently as transversions, and CpG mutations occurred at a different frequency to non-CpG mutations. In all our simulations, two sequences were copied from a single ancestral sequence and evolved independently. In molecular evolutionary studies, fourfold degenerate sites in codons that code for different amino acids in the two derived sequences are excluded. To simplify this, in our simulated coding sequences, all nonsynonymous substitutions were considered strongly deleterious and rejected. Qualitatively similar results were obtained when a given proportion of nonsynonymous changes were modelled as neutral (results not shown). In all cases, estimates of the numbers and rates of nucleotide substitution were corrected for multiple hits using the Jukes-Cantor model. This was to ensure simplicity in the interpretation of our results, given that more parameter-rich multiple hits corrections take base composition into account, and it is unclear whether this is appropriate when dividing sites into CpG and non-CpG classes. Our data were not simulated under a JC model, since we included a variable transition/transversion rate. However, varying the transition/transversion ratio did not qualitatively impact our results (data not shown).
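A simulation of this kind, together with the Jukes-Cantor correction K = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' simulation code: the function names and parameter values are hypothetical, "twice as frequently" is interpreted as a mutation being a transition with probability 2/3, the CpG multiplier is applied to all mutation types, and selection on nonsynonymous changes is not modelled.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def in_cpg(seq, i):
    """True if site i is the C or the G of a CpG in seq (a list of bases)."""
    if seq[i] == "C":
        return i + 1 < len(seq) and seq[i + 1] == "G"
    if seq[i] == "G":
        return i > 0 and seq[i - 1] == "C"
    return False


def evolve(ancestor, n_rounds, mu, cpg_factor, rng):
    """Return a mutated copy of ancestor.

    In each round, every site mutates with probability mu, or
    mu * cpg_factor if it currently lies within a CpG (assumed scheme).
    """
    seq = list(ancestor)
    for _ in range(n_rounds):
        for i in range(len(seq)):
            rate = mu * cpg_factor if in_cpg(seq, i) else mu
            if rng.random() < rate:
                if rng.random() < 2 / 3:  # assumed transition probability
                    seq[i] = TRANSITION[seq[i]]
                else:
                    seq[i] = rng.choice(TRANSVERSIONS[seq[i]])
    return "".join(seq)


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def pairwise_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return jukes_cantor(diffs / len(seq_a))


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    ancestor = "".join(rng.choice("ACGT") for _ in range(10000))
    seq_a = evolve(ancestor, 10, 0.001, 10, rng)
    seq_b = evolve(ancestor, 10, 0.001, 10, rng)
    print(pairwise_distance(seq_a, seq_b))
```

Because the true ancestor is known in such a simulation, the class assigned to each change from the two derived sequences can be compared directly with the ancestral CpG state.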
Ancestral sequences in our simulations were derived from two sources: randomly generated sequences that were evolved to reach approximate mutational equilibrium, and real sequences derived from the mouse genome. The former allowed us to quantify the effects of CpG misassignment bias free of the complicating effects of nonequilibrium processes. Real mouse coding and intronic sequences (collected in ref ) were used in order to capture the base composition and nonequilibrium evolution characteristic of mammalian genomes in our simulations [17, 28, 29, 30]. To investigate the level of bias in a real evolutionary scenario, we also collected a random sample of 2000 human RefSeq genes for which we extracted orthologous chimpanzee sequence from the UCSC whole genome alignments. All alignments that did not start with ATG, that included premature stop codons, or in which the sequence length was not a multiple of three in either species were removed, leaving a total of 1470 genes. We removed CpG islands from our intronic sequences using the criteria of .
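The three coding-sequence filters can be expressed as a short Python check. This is a hedged sketch with hypothetical function names: it assumes ungapped coding sequences, the standard genetic code's stop codons, and that a stop codon in the final position is not counted as premature.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def passes_cds_filter(cds):
    """True if cds starts with ATG, has a length divisible by three,
    and contains no stop codon before its final codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Any stop codon before the last codon is treated as premature.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


def keep_alignment(human_cds, chimp_cds):
    """Keep a gene only if the sequence from both species passes."""
    return passes_cds_filter(human_cds) and passes_cds_filter(chimp_cds)
```

A gene is discarded as soon as either species fails any one of the three criteria.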
This work was partly supported by FRSQ postdoctoral fellowship 11389 to DJG.
9. Hardison RC, Roskin KM, Yang S, Diekhans M, Kent WJ, Weber R, Elnitski L, Li J, O'Connor M, Kolbe D, Schwartz S, Furey TS, Whelan S, Goldman N, Smit A, Miller W, Chiaromonte F, Haussler D: Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 2003, 13: 13-26. 10.1101/gr.844103.
14. Mikkelsen TS, Hillier LW, Eichler EE, Zody MC, Jaffe DB, Yang SP, Enard W, Hellmann I, Lindblad-Toh K, Altheide TK, Archidiacono N, Bork P, Butler J, Chang JL, Cheng Z, Chinwalla AT, deJong P, Delehaunty KD, Fronick CC, Fulton LL, Gilad Y, Glusman G, Gnerre S, Graves TA, Hayakawa T, Hayden KE, Huang XQ, Ji HK, Kent WJ, King MC, Kulbokas EJ, Lee MK, Liu G, Lopez-Otin C, Makova KD, Man O, Mardis ER, Mauceli E, Miner TL, Nash WE, Nelson JO, Paabo S, Patterson NJ, Pohl CS, Pollard KS, Prufer K, Puente XS, Reich D, Rocchi M, Rosenbloom K, Ruvolo M, Richter DJ, Schaffner SF, Smit AFA, Smith SM, Suyama M, Taylor J, Torrents D, Tuzun E, Varki A, Velasco G, Ventura M, Wallis JW, Wendl MC, Wilson RK, Lander ES, Waterston RH: Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005, 437: 69-87. 10.1038/nature04072.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.