How and how much does RAD-seq bias genetic diversity estimates?
RAD-seq is a powerful tool, increasingly used in population genomics. However, earlier studies have raised red flags regarding possible biases associated with this technique. In particular, polymorphism on restriction sites results in preferential sampling of closely related haplotypes, so that RAD data tends to underestimate genetic diversity.
Here we (1) clarify the theoretical basis of this bias, highlighting the potential confounding effects of population structure and selection, (2) compare predictions with real data from in silico digestion of full genomes and (3) provide a proof of concept toward an ABC-based correction of the RAD-seq bias. Under a neutral and panmictic model, we confirm the previously established relationship between the true polymorphism and its RAD-based estimate, showing a more pronounced bias when polymorphism is high. Using more elaborate models, we show that selection, resulting in heterogeneous levels of polymorphism along the genome, exacerbates the bias and leads to a more pronounced underestimation. In contrast, spatial genetic structure tends to reduce the bias. We compare the neutral and panmictic model with “ideal” empirical data (in silico RAD-sequencing) using full genomes from natural populations of the fruit fly Drosophila melanogaster and the fungus Schizophyllum commune, harbouring moderate and high genetic diversity, respectively. In D. melanogaster, the observations fit the model predictions, but the small difference between the true and RAD polymorphism makes this comparison insensitive to deviations from the model. In the highly polymorphic fungus, the model captures a large part of the bias but makes inaccurate predictions. Accordingly, ABC corrections based on this model improve the estimates, albeit with some imprecision.
The RAD-seq underestimation of genetic diversity associated with polymorphism in restriction sites becomes more pronounced when polymorphism is high. In practice, this means that in many systems where polymorphism does not exceed 2 %, the bias is of minor importance in the face of other sources of uncertainty, such as heterogeneous base composition or technical artefacts. The neutral panmictic model provides a practical means to correct the bias through ABC, albeit with some imprecision. More elaborate ABC methods might integrate additional parameters, such as population structure and selection, but their opposite effects could hinder accurate corrections.
Keywords: Population genomics; Reduced representation genomics; Allele drop-out; ABC; Non-neutral model; Population structure
Restriction Associated DNA sequencing
Reduced representation genomics aims at sequencing particular parts of the genomes of many individuals, rather than full genomes of one or a few individuals, in a single sequencing reaction. One such approach, RAD-seq (and related protocols), makes use of restriction enzymes to target DNA regions flanking cut sites that are more or less randomly distributed throughout the genome [1, 2]. Among other applications, this technique can provide genome-wide estimates of population genetic diversity. Previous studies, however, have emphasized that RAD-seq diversity estimates can be systematically biased [3, 4, 5], impeding the use of RAD-seq as a standardised tool to measure and compare genetic diversity across study systems. First, heterogeneity in base composition along genomes implies that any particular cut site deviates to some extent from a random distribution across the genome. Because base composition and polymorphism can themselves be linked (e.g. lower GC content in neutral and thus more polymorphic regions), this can impact diversity estimates. Particular motifs present in the restriction site might also be enriched in some particular regions of the genome (e.g. motifs corresponding to protein domains).
Arguably, such biases probably exist for any kind of molecular marker, because of the inherent contradiction between “targeted” and “random” sequencing. But RAD-seq also presents an additional bias caused by polymorphism in restriction sites, just like its ancestor, the AFLP technique, although in a more subtle way [8, 9]. With AFLP, any loss of a restriction site turned a heterozygous genotype into a seemingly homozygous one. In RAD-seq, sequencing depth can be used to identify such cases, and the presence/absence of restriction sites is not the primary source of information. Nevertheless, this Allele Drop Out (ADO) leads to an underestimation of polymorphism, because of the linkage disequilibrium between the restriction sites and SNPs within the RAD sequences. In simpler terms, individuals or haplotypes that are more closely related than the population average tend to share the same state at the restriction site (presence or absence), and are thus over-represented in RAD-seq datasets.
Here we focus on this latter bias, hereafter simply referred to as “the RAD-seq bias”. The impact of ADO has been investigated in several earlier studies [10, 11], where differences in coverage between loci were proposed as a way to detect null alleles. Here we do not consider this option, which requires a high sequencing depth that is not always achieved. We rather aim at a better understanding of this bias, through the comparison of simulated and empirical data. Simulations were first performed under a Wright-Fisher neutral and panmictic model, in order to confirm the previously established relationship between the true polymorphism and its RAD-based estimate. We further explored the potential consequences of deviations from a neutral and panmictic model. We show that population structure tends to reduce the RAD-seq bias, because RAD-seq underestimates divergence within but not between populations. In contrast, variations in polymorphism along the genome, which are a typical signature of selective constraints, tend to intensify the RAD-seq bias, because regions of low polymorphism contribute disproportionately to the data.
We then compared theoretical predictions with ideal empirical data, that is, in silico digestions of full genomes from natural populations of the fruit fly Drosophila melanogaster (DPGP), harbouring moderate polymorphism, and the fungus Schizophyllum commune, harbouring high polymorphism. These two case studies generally confirm the expected relation between the level of polymorphism and the intensity of the RAD-seq bias: the bias is much stronger in the highly polymorphic species. In D. melanogaster, the bias is not intense enough to assess potential deviations from the neutral and panmictic model. In S. commune this model captures a large part of the bias, but the observed RAD polymorphism falls outside its predicted distribution. Accordingly, ABC corrections based on this model are satisfactory in D. melanogaster, but less accurate in S. commune. Although our results confirm those of previous studies that raised red flags regarding the RAD-seq bias [8, 9], we would argue that in many species, where polymorphism is low, this problem is of negligible importance in the face of other sources of uncertainty. In very polymorphic species, our ABC correction can mitigate the bias, although population structure, selection, or as yet unidentified additional factors introduce some imprecision in this correction.
Simulations and genetic diversity measures
To measure the theoretical impact of the RAD-seq bias, we simulated sequence data and retrieved RAD tags in silico. Each simulation consisted of the generation of coalescents for 1000 genetically unlinked loci, with complete linkage within loci, in four haploid lineages, using the ms programme. Seq-gen was then used to produce sequences of 10 kb for each locus. To generate RAD-seq data, we randomly merged the four haploid genomes in pairs to form two diploids, and searched for ten randomly defined restriction motifs of 8 bp (searching for more than one motif increases the number of RAD loci without increasing the alignment size and computing time). This yielded an average of 1500 RAD loci of 100 bp per genome.
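The in silico digestion described above can be illustrated with a minimal Python sketch. It is not the script used in this study: the function names, the choice of keeping a single tag downstream of each cut site, and the toy sequences are assumptions made for illustration. A haplotype contributes a tag only if its restriction motif is intact, which is how allele drop-out arises in a diploid. As a rough consistency check on the figures above, ten 8 bp motifs searched in 1000 loci of 10 kb are expected to occur about 10 × 10^7 / 4^8 ≈ 1500 times under equal base frequencies.

```python
def find_cut_sites(haplotype, motifs):
    """Map the start position of every motif occurrence to the motif length."""
    sites = {}
    for motif in motifs:
        start = haplotype.find(motif)
        while start != -1:
            sites[start] = len(motif)
            start = haplotype.find(motif, start + 1)
    return sites


def rad_tags(haplotype, motifs, tag_length=100):
    """Return {cut position: tag}, keeping the tag_length bases after each motif.

    Assumption: one tag per cut site, on the downstream side only.
    """
    tags = {}
    for pos, motif_length in find_cut_sites(haplotype, motifs).items():
        begin = pos + motif_length
        tag = haplotype[begin:begin + tag_length]
        if len(tag) == tag_length:  # skip tags truncated by the sequence end
            tags[pos] = tag
    return tags


def diploid_loci(hap_a, hap_b, motifs, tag_length=100):
    """Return {cut position: list of tags recovered from the two haplotypes}.

    A locus with a single tag is a case of allele drop-out: the restriction
    site is intact in only one haplotype. Positions are comparable between
    haplotypes only if the sequences contain no indels (assumed here).
    """
    tags_a = rad_tags(hap_a, motifs, tag_length)
    tags_b = rad_tags(hap_b, motifs, tag_length)
    loci = {}
    for pos in sorted(set(tags_a) | set(tags_b)):
        loci[pos] = [tags[pos] for tags in (tags_a, tags_b) if pos in tags]
    return loci


# Toy example with one hypothetical 8 bp motif and 4 bp tags.
motifs = ["ACGTACGG"]
hap_a = "TTACGTACGGCCCCAAAA"
hap_b = "TTACGTACGACCTCAAAA"  # mutation inside the restriction site
hap_c = "TTACGTACGGCCTCAAAA"  # intact site, one SNP inside the tag
dropout = diploid_loci(hap_a, hap_b, motifs, tag_length=4)
heterozygous = diploid_loci(hap_a, hap_c, motifs, tag_length=4)
```

In the first toy diploid the SNP carried by the second haplotype is never observed, because that haplotype has lost its restriction site; in the second diploid both tags are recovered and the SNP is visible.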
In the first model, we assumed the population was diploid, unstructured, and θ, the population mutation rate (4Neμ), was homogeneous along the genome. In a second model, we explored the potential consequences of selection by assigning different θ values to different groups of loci. Specifically, we assumed that 70 % of the genome had a given θ value, while θ was two times smaller in 20 % of the genome, and 10 times smaller in the remaining 10 % of the genome. To mimic variations in the fraction of coding regions and selection intensity, similar simulations were run with other proportions (50, 40 and 10 % instead of 70, 20 and 10 %) and even more heterogeneous θ (reduced 10-fold and 100-fold instead of 2-fold and 10-fold). In these simulations, πtrue is the mean of the θ values, weighted by their respective proportions in the genome. Finally, in a third model, we assumed θ was homogeneous along the genome but introduced spatial structure by sampling the two diploid genomes in two populations having diverged for a time t. For all simulations, log10(θ) was randomly sampled from a uniform distribution between −5 and −1, thus corresponding to θ values ranging from 10⁻⁵ to 10⁻¹ (program commands are provided in supplementary materials).
To measure the RAD-seq bias in real data, we performed in silico RAD-seq experiments, using full genome sequences from natural populations of Drosophila melanogaster and Schizophyllum commune. For both species, the available sequences correspond to haploid genomes. To mimic real RAD-seq experiments, which are generally performed on diploid individuals, we randomly selected pairs of haploid genomes coming from the same population to generate diploid individuals (Additional file 1: Table S1). RAD tags were then retrieved from each individual.
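The parameter choices of the heterogeneous-θ model can be made concrete with a short Python sketch. The function names are hypothetical, and pairing the alternative proportions (50, 40 and 10 %) with the stronger reductions (10-fold and 100-fold) in a single scenario is an assumption; the text does not state whether these settings were combined or varied separately.

```python
import random


def sample_theta(rng, low_exponent=-5.0, high_exponent=-1.0):
    """Draw theta so that log10(theta) is uniform on [low_exponent, high_exponent]."""
    return 10.0 ** rng.uniform(low_exponent, high_exponent)


def theta_classes(theta, proportions=(0.7, 0.2, 0.1), reductions=(1.0, 2.0, 10.0)):
    """Return (genome proportion, theta) pairs for each class of loci.

    Defaults follow the text: 70 % of the genome at theta, 20 % at theta / 2
    and 10 % at theta / 10.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return [(p, theta / r) for p, r in zip(proportions, reductions)]


def weighted_pi_true(classes):
    """pi_true as the mean of the theta values weighted by genome proportion."""
    return sum(p * t for p, t in classes)


rng = random.Random(1)  # arbitrary seed, for reproducibility only
theta = sample_theta(rng)

# With theta = 0.1: 0.7 * 0.1 + 0.2 * 0.05 + 0.1 * 0.01 = 0.081
moderate = weighted_pi_true(theta_classes(0.1))
# Assumed pairing of the alternative settings: 0.5 * 0.1 + 0.4 * 0.01 + 0.1 * 0.001
strong = weighted_pi_true(theta_classes(0.1, (0.5, 0.4, 0.1), (1.0, 10.0, 100.0)))
```

The weighted mean shows why πtrue is lower than the θ of the least constrained class: with the default settings it is 81 % of that value.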
In data simulated with spatial structure, the same values (also called π for simplicity) correspond to measures of the divergence between the two subpopulations.
ABC for the estimation of nucleotide diversity from RAD data
We used Approximate Bayesian Computation (ABC) to correct RAD-seq estimates of genetic polymorphism. In these simulations, as in our first model, we assumed the population was diploid, unstructured, and θ was homogeneous along the genome. We considered the following summary statistics: (1) πRADobs, the observed nucleotide diversity in RAD-seq data (average distance between individuals) and (2) the proportion of loci in each individual shared with the other one. Calculation of the posterior distribution of θ for each observed dataset was performed with functions from the R package abc. We used a tolerance rate of 0.05 and local linear regressions to adjust the accepted simulations to the observed data, and tested our approach by cross-validation.
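The principle of rejection ABC followed by a local linear regression adjustment can be sketched in a few lines of Python. This is an illustration only, not the R abc code used in the study: it handles a single summary statistic rather than the two listed above, the Epanechnikov weighting is an assumption, and the function names are hypothetical. The toy relation at the end is deliberately artificial and does not represent the simulated relation between θ and RAD diversity.

```python
def abc_loclinear(params, stats, observed, tolerance=0.05):
    """Rejection ABC with a local linear regression adjustment (one statistic).

    params: parameter value used for each simulation
    stats: summary statistic obtained from each simulation
    observed: summary statistic of the observed data
    Returns the adjusted parameter values of the accepted simulations and
    their kernel weights.
    """
    distances = [abs(s - observed) for s in stats]
    n_keep = max(2, int(round(tolerance * len(params))))
    accepted = sorted(range(len(params)), key=lambda i: distances[i])[:n_keep]
    bandwidth = max(distances[i] for i in accepted)
    values = [params[i] for i in accepted]
    if bandwidth == 0:
        return values, [1.0] * len(values)
    # Epanechnikov-type weights: closest simulations count most.
    weights = [1.0 - (distances[i] / bandwidth) ** 2 for i in accepted]
    offsets = [stats[i] - observed for i in accepted]
    w_sum = sum(weights)
    if w_sum == 0:
        return values, [1.0] * len(values)
    # Weighted least-squares slope of parameter on (statistic - observed).
    x_bar = sum(w * x for w, x in zip(weights, offsets)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, values)) / w_sum
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, offsets))
    s_xy = sum(w * (x - x_bar) * (y - y_bar)
               for w, x, y in zip(weights, offsets, values))
    slope = s_xy / s_xx if s_xx > 0 else 0.0
    # Project each accepted parameter to where the statistic equals the observation.
    adjusted = [y - slope * x for x, y in zip(offsets, values)]
    return adjusted, weights


def weighted_mean(values, weights):
    """Weighted mean, used here as a point estimate of the posterior."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Artificial, noise-free toy relation: statistic = 0.5 * parameter.
toy_params = [i / 1000 for i in range(1, 1001)]
toy_stats = [0.5 * p for p in toy_params]
adjusted, weights = abc_loclinear(toy_params, toy_stats, observed=0.1)
estimate = weighted_mean(adjusted, weights)
```

With the tolerance of 0.05, the 50 simulations closest to the observation are retained out of 1000, and the regression step then corrects each retained parameter value for the remaining distance between its simulated statistic and the observed one.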
The RAD-seq bias under a neutral and panmictic model
The RAD-seq bias under non-neutral and spatially structured models
Real world populations do not necessarily follow the neutral panmictic model. Some regions of the genome are subject to more frequent or intense episodes of selection than others, increasing the heterogeneity of the polymorphism along the genome. Population structure can also exist to various degrees, producing more or less pronounced genetic differentiation between sub-populations. In an attempt to provide a more general picture of the plausible scope of the RAD-seq bias, we thus explored the consequences of relaxing the assumptions of neutrality and panmixia.
Comparing empirical data with simulations
We used in silico digestions of full genomes (with phased haplotypes) from natural populations to assess the concordance between simulated and real data. Most species for which such data are available harbour a low to moderate genome-wide diversity (below 2 %), under which circumstances the RAD-seq bias is expected to be negligible. This is for example the case in the fruit fly D. melanogaster. In this species, πtrue ranged from 0.61 % to 0.73 % in the four populations studied. For such values, simulations under the neutral panmictic model predict that πRAD should only be 5 % lower than πtrue (Fig. 1b). The observed πRAD values fit this prediction, ranging from 0.59 % to 0.70 % in the four populations. However, with such small differences between πRAD and πtrue, this case study provides only limited power to assess the adequacy of the model.
We thus looked for full genome data from natural populations of more polymorphic species, which would provide more informative comparisons between simulated and real data. To our knowledge, the appropriate data exist only for the fungus Schizophyllum commune (NB: transcriptome data exist for several other species harbouring high polymorphism, but these sequences are not appropriate to mimic a RAD-seq experiment because i) these datasets do not provide phased haplotypes, and ii) the presence of introns in genes reduces the genetic linkage between sites within mRNAs, which mitigates the RAD-seq bias). Sequences available from this species originate from two populations, from North America and Russia, each characterised by very high polymorphism (πtrue = 9.7 % and 7.4 %, respectively). Accordingly, πRAD is substantially smaller than πtrue in both cases (πRAD = 6.2 % and 3.5 %, respectively).
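The size of the bias implied by these values can be expressed as a relative underestimation, (πtrue − πRAD) / πtrue. The short Python sketch below applies this arithmetic to the S. commune values reported above; the function name and dictionary layout are illustrative choices, not part of the original analysis.

```python
def relative_underestimation(pi_true, pi_rad):
    """Fraction of the true diversity that is missed by the RAD estimate."""
    return (pi_true - pi_rad) / pi_true


# (pi_true, pi_RAD) in percent, as reported in the text for S. commune.
reported = {
    "North America": (9.7, 6.2),
    "Russia": (7.4, 3.5),
}

bias = {
    population: relative_underestimation(pi_true, pi_rad)
    for population, (pi_true, pi_rad) in reported.items()
}

for population, value in bias.items():
    print(f"{population}: {100 * value:.1f} % underestimation")
```

This gives an underestimation of about 36 % for the North American population and about 53 % for the Russian one, compared with the roughly 5 % predicted for the diversity levels seen in D. melanogaster.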
Partial corrections of the RAD-seq bias through ABC under a neutral panmictic model
We explored the possibility of using simulations from the neutral panmictic model to correct, at least partially, the RAD-seq estimates of polymorphism, through Approximate Bayesian Computation. The principle of this approach is to use the simulated relation between the true polymorphism and some summary statistics (e.g. proportion of shared RAD loci between specimens, proportion of homozygous loci) to infer corrected polymorphism values from the values of these statistics in empirical data. We developed such an ABC approach and first performed a cross-validation test, that is, tested our ability to correctly infer the input parameter values of simulations using the simulated data itself (here the simulated data is treated as an observation and is thus called “pseudo-observed”).
Building on previous studies [8, 9], we further explored here the impact of allele drop-out on the estimation of genetic diversity from RAD data. We first confirmed earlier findings based on simulations in a neutral and panmictic model: RAD-based estimates of diversity are lower than the true polymorphism, and this bias becomes more pronounced as the true polymorphism increases. Using more elaborate models, we further showed that deviations from the neutral and panmictic model can have complex and contradictory outcomes. Assigning different degrees of polymorphism to different regions of the genome, which mimics the effects of natural selection on genomic variation, tends to exacerbate the RAD-seq bias. This probably results from an excessive contribution to the data of the least polymorphic genomic regions, subject to the most intense purifying selection. We also simulated sampling of specimens from more or less isolated subpopulations, and thus showed that population structure should mitigate the bias. In other words, RAD-seq data tends to underestimate divergence within but not between populations.
We used “ideal” empirical data, that is, RAD-seq data obtained from in silico digestion of full genomes from natural populations, to assess potential deviations from the neutral and panmictic model. Data from the fruit fly D. melanogaster confirmed that the bias is of negligible importance when the polymorphism is low, offering little power to assess the validity of the model. In contrast, in the fungus S. commune, where the true polymorphism approaches 10 %, the bias is severe, producing an underestimation of the diversity of roughly 36 % in the North American population (πRAD = 6.2 % versus πtrue = 9.7 %) and 53 % in the Russian population (3.5 % versus 7.4 %). The neutral and panmictic model captures most of this effect, but the observed RAD-based values nevertheless fall outside the model predictions.
We investigated whether these deviations might be due to selection or spatial structure. In the American population, where the bias was weaker than expected, one specimen (from Florida) was significantly differentiated from all others (from Michigan, Fig. 1 in ). However, excluding this specimen from the analysis only has a minor effect on the bias for this population (not shown), suggesting the discrepancy with our theoretical expectation is not necessarily explained by population structure. We also explored whether heterogeneity in θ along the genome might occur in this data set. We found that the distribution of the SNPs across RAD tags was indeed significantly more heterogeneous than expected under a Poisson process (Additional file 1: Figure S1). Moreover, distributions of RAD distances were always significantly more heterogeneous in the Russian population, which might contribute to the excessive bias observed in this population (Additional file 1: Figure S2).
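One simple way to quantify heterogeneity relative to a Poisson process is the index of dispersion, the ratio of the variance to the mean of the number of SNPs per RAD tag, which is expected to be close to 1 if SNPs fall on tags at a homogeneous rate. The Python sketch below illustrates this idea only; the test actually used for Figure S1 is not described here, so the statistic, the function names and the toy counts are all assumptions.

```python
from statistics import mean, variance


def dispersion_index(snp_counts):
    """Variance-to-mean ratio of SNP counts per RAD tag.

    Values well above 1 indicate more heterogeneity between tags than
    expected under a homogeneous Poisson process.
    """
    average = mean(snp_counts)
    if average == 0:
        raise ValueError("no SNPs observed: the index is undefined")
    return variance(snp_counts) / average


def dispersion_statistic(snp_counts):
    """(n - 1) * index; approximately chi-square with n - 1 degrees of freedom
    under the Poisson hypothesis (classical dispersion test)."""
    return (len(snp_counts) - 1) * dispersion_index(snp_counts)


# Hypothetical counts: SNPs concentrated on one tag versus evenly spread.
clustered = [0, 0, 0, 8]
even = [2, 2, 2, 2]
clustered_index = dispersion_index(clustered)
even_index = dispersion_index(even)
```

An index far above 1, as for the clustered toy counts, would be consistent with heterogeneity in θ along the genome, although it cannot by itself distinguish selection from other sources of rate variation.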
The fact that simulations can capture the RAD-seq bias, at least in part, opens the possibility of correcting estimations through an ABC approach. We developed such an approach based on simulations from the neutral panmictic model, where the number of parameters to be estimated is low enough. The results are encouraging: the corrected RAD-polymorphism values are much closer to the real polymorphism than the raw values. However, in accordance with the above-discussed deviations from the model, the corrections are inaccurate. It is clear that robust estimations of diversity measures from RAD-seq data would require more elaborate ABC models, including the potential effects of population structure and selection, or other, yet unidentified, relevant parameters. However, our simulations suggest that a given observed RAD polymorphism might be indicative of a certain θ value if the population is panmictic, a smaller θ if individuals were sampled from slightly divergent populations, or a larger θ if selection produced strong heterogeneity in θ along the genome. In other words, an excessive number of parameters, with contradictory effects, might prevent convergence of the model toward a single optimal solution.
Our analysis confirmed the tendency of RAD data to underestimate polymorphism. Regardless of the model used, simulations indicate this bias is of minor importance when the polymorphism is below 2 %, which is the case in most species, at least in animals. In silico RAD experiments on full genome data from natural populations confirm this prediction, which would undoubtedly be reinforced by more realistic RAD datasets, where all sorts of additional biases, from technical issues at the bench to downstream bioinformatics, introduce more important sources of uncertainty [3, 4, 5, 18]. Nevertheless, when the polymorphism is high, the RAD-seq bias becomes a significant concern and needs to be kept in mind. While ABC corrections based on a neutral and panmictic model can partially solve the problem, deviations from this model introduce some uncertainty in these corrections. Developing more robust corrections, although desirable, might face the difficulty of estimating too many parameters with insufficient data.
Once a bias has been found to affect a widely used technique such as RAD-seq, it seems crucial to understand its causes and evaluate its range, which was the purpose of the present study. That being said, one should also keep in mind that any set of molecular markers, from single genes to “random” shotgun sequencing, also presents various kinds of bias, because it is virtually impossible to randomly sample genomic data. Until full-genome sequencing becomes achievable at reasonable cost for population genomics studies, RAD-seq thus remains, in our opinion, an optimal compromise.
We thank Sylvain Mousset and Aline Muyle for helping with simulations and population genetics models. This work benefitted from the computing facilities of the CC LBBE/PRABI.
This study was supported by the "Centre National de la Recherche Scientifique" (CNRS, Institut Ecologie et Environnement, ATIP grant "SymbioCode" to S.C.), and by the "Agence Nationale de la Recherche" (ABS4NGS: ANR-11-BINF-0001-06).
Availability of data and materials
Scripts used to simulate sequences and to perform in silico RAD experiments on simulated data are available at http://pbil.univ-lyon1.fr/datasets/Cariou2016/
M.C., L.D. and S.C. conceived and designed the study; M.C. performed the simulations; M.C., L.D. and S.C. wrote the paper and approved the final version.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
- 18. Davey JW, Cezard T, Fuentes-Utrilla P, Eland C, Blaxter ML. Special features of RAD Sequencing data: implications for genotyping. Mol Biol Evol. 2013;22:3151–64.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.