A statistical framework for detecting mislabeled and contaminated samples using shallow-depth sequence data
Abstract
Background
Researchers typically sequence a given individual multiple times, either re-sequencing the same DNA sample (technical replication) or sequencing different DNA samples collected on the same individual (biological replication) or both. Before merging the data from these replicate sequence runs, it is important to verify that no errors, such as DNA contamination or mix-ups, occurred during the data collection pipeline. Methods to detect such errors exist but are often ad hoc and cannot handle missing data, and several require phased data. Because they require some combination of genotype calling, imputation, and haplotype phasing, these methods are unsuitable for error detection in low- to moderate-depth sequence data where such tasks are difficult to perform accurately. Additionally, because most existing methods employ a pairwise-comparison approach for error detection rather than joint analysis of the putative replicates, results may be difficult to interpret.
Results
We introduce a new method for error detection suitable for shallow-, moderate-, and high-depth sequence data. Using Bayes' theorem, we calculate the posterior probability distribution over the set of relations describing the putative replicates and infer which of the samples originated from an identical genotypic source.
Conclusions
Our method addresses key limitations of existing approaches and produced highly accurate results in simulation experiments. Our method is implemented as an R package called BIGRED (Bayes Inferred Genotype Replicate Error Detector), which is freely available for download: https://github.com/ac2278/BIGRED.
Keywords
Error detection, Biological replication, Technical replication, Shallow-depth sequence data, Mislabeled samples
Abbreviations
- AD: Allelic depth
- BIGRED: Bayes Inferred Genotype Replicate Error Detector
- CPU: Central processing unit
- GBS: Genotyping-by-sequencing
- HTS: High-throughput sequencing
- HWE: Hardy-Weinberg Equilibrium
- IITA: International Institute of Tropical Agriculture
- MAF: Minor allele frequency
- NaCRRI: National Crops Resources Research Institute
- NRCRI: National Root Crops Research Institute
- PCA: Principal component analysis
- REs: Restriction enzymes
- VCF: Variant call format
- WGS: Whole-genome sequence
Background
A researcher may choose, for a number of reasons, to sequence an individual multiple times, performing technical replication, biological replication, or both. Because sequencing experiments involve many steps and errors can occur during any part of the workflow, one motivation for sequencing an individual more than once is to allow researchers to compare these replicates, identify outlier samples, and evaluate how well a sequencing pipeline is executed. This is particularly important for plant breeders, as they require ongoing estimates of their program’s error rates. Further discussion of reasons for intentional replication appears elsewhere [1]. In short, the three aspects of replication (sequencing read depth, technical replication, and biological replication) each play different roles in mitigating errors that are introduced in the experimental pipeline. Increasing sequencing read depth allows for improved variant calling, while technical and biological replicates allow for optimization of bioinformatic filters [1]. Replication can also arise unintentionally as a result of human error or naming inconsistencies, and it is in a researcher’s best interest to make full use of the data, merging the replicate records rather than discarding them.
Before merging the data from biological or technical replicates or using them to inform quality filter thresholds, it is important to verify that no erroneous samples exist among the putative replicates (i.e. verify that all putative replicates derived from an identical individual). Existing methods for error detection include performing pairwise identity-by-state and -by-descent estimation [2], calculating the correlation between pairs of samples, and examining a heat map of a realized genomic relationship matrix. These approaches require some combination of genotype calling, imputation, and haplotype phasing, making them unsuitable for low- to moderate-depth, high-throughput sequence (HTS) data [3]. And because these methods employ a pairwise-comparison approach for error detection rather than joint analysis of the samples, results may be inconsistent when more than two replicates exist. To illustrate, the general protocol for heat map analysis involves starting off with some collection of sequenced samples (including the replicates of interest), calling genotypes, filtering based on percent missing, imputing missing genotypes, calculating the additive genomic relationship matrix, and finally plotting a heat map of the putative replicates. This method can work well on deeply sequenced samples, but complications arise when applying this method to shallow-depth sequence data. Firstly, it requires genotype calling, which is difficult to do accurately when we have low read depth. Secondly, it requires imputation, raising issues regarding reference panel and imputation method selection. Furthermore, results from imputation vary depending on which samples were jointly imputed, which, in turn, affects downstream analyses that use the imputed data. Finally, a third limitation of this method, common among existing error detection methods, is that it relies on pairwise comparisons of the putative replicates, rather than joint analysis of the replicates.
For example, suppose we have three putative replicates, A, B, and C. It is possible that A and B are highly correlated, A and C are highly correlated, but B and C are only moderately correlated. In situations such as this, deciding if all three samples are replicates is not straightforward.
Methods
The proposed method
For k = 3 putative replicates (samples d = 1, d = 2, and d = 3), five relations are possible:
1. All three samples originate from one source;
2. Samples d = 1 and d = 2 originate from one source while d = 3 originates from a different source;
3. Samples d = 1 and d = 3 originate from one source while d = 2 originates from a different source;
4. Samples d = 2 and d = 3 originate from one source while d = 1 originates from a different source;
5. All three samples originate from different sources.
We use “source vectors” to represent these relations and enumerate all possible source vectors for k = 3 on the right panel of Fig. 1. By convention: (1) source vectors are labeled vectors, e.g., the first, second, and third elements of a given source vector describe the status of sample d = 1, d = 2, and d = 3, respectively, and (2) the first element of a source vector always takes on the value 1. Elements with the same value indicate samples that originate from the same source.
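The two conventions above are enough to enumerate source vectors mechanically. The following is a minimal sketch, not BIGRED's own code; the function name `source_vectors` is hypothetical. It builds each vector left to right, letting every new element either reuse a label already present or take the next unused label.

```python
def source_vectors(k):
    """Return all source vectors of length k (k >= 1) as tuples.

    The first element is always 1. Each later element is either a label
    that already appears in the vector or the next unused label.
    """
    vectors = [(1,)]
    for _ in range(k - 1):
        extended = []
        for vec in vectors:
            # Allowed labels: every label used so far, plus one new label.
            for label in range(1, max(vec) + 2):
                extended.append(vec + (label,))
        vectors = extended
    return vectors


for vec in source_vectors(3):
    print(vec)
# (1, 1, 1)
# (1, 1, 2)
# (1, 2, 1)
# (1, 2, 2)
# (1, 2, 3)
```

The five tuples printed for k = 3 correspond to the five relations listed earlier in this section.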
BIGRED requires two inputs:
1. Estimates of population allele frequency at L randomly sampled biallelic sites, sampled at the genome-wide level; and
2. The k putative replicates’ allelic depth (AD) data at the L sites. A site is only sampled if each putative replicate has at least one read at that site.
The method makes three assumptions:
1. The species is diploid;
2. Each polymorphic site harbors exactly two alleles, allele A and allele B, i.e. all polymorphisms are biallelic;
3. Sites are independent. BIGRED allows the user to specify a minimum distance, in base pairs, between any two sampled sites. The user may also filter sites based on linkage disequilibrium, although this is not a functionality of BIGRED.
Defining a likelihood function for G
Defining a likelihood function for S
1. Enumerate all possible source vectors of length k = 3 (Fig. 1).
2. Enumerate all labeled genotype vectors consistent with each source vector (Fig. 2). For instance, there are three genotype vectors consistent with source vector S = (1,1,1): (AA, AA, AA), (AB, AB, AB), and (BB, BB, BB). There are nine genotype vectors consistent with S = (1,1,2): (AA, AA, AA), (AA, AA, AB), (AA, AA, BB), (AB, AB, AB), (AB, AB, AA), (AB, AB, BB), (BB, BB, BB), (BB, BB, AA), and (BB, BB, AB).
3. Define a likelihood function for S as a function of genotype likelihoods, defined previously in Eq. 1:
The function P(G^{(v)}| S) is the probability that the k samples have genotype vector \( {G}^{(v)}=\left({G}_{d=1}^{(v)},{G}_{d=2}^{(v)},\dots, {G}_{d=k}^{(v)}\right) \) given that source vector S describes how the k samples are related. We define P(G^{(v)}| S) using the (user-supplied) population allele frequency of allele B at site v and assuming Hardy-Weinberg Equilibrium (HWE; Fig. 2). For samples that are encoded as identical in source vector S, we treat their genotypes as a single observation and all non-identical genotypes are modeled as independent (Fig. 2).
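To make the role of P(G^{(v)}| S) concrete, the sketch below enumerates the genotype vectors consistent with a source vector, assigns each one its HWE probability (one independent draw per distinct source), and sums genotype likelihoods over those vectors to obtain a per-site likelihood for S. This is an illustration under stated assumptions, not BIGRED's implementation: all function names are hypothetical, and the binomial read model in `genotype_likelihood` is an assumed stand-in for the genotype likelihood of Eq. 1, which is not reproduced in this text.

```python
from itertools import product

GENOTYPES = ("AA", "AB", "BB")


def hwe_prior(freq_b):
    """Genotype probabilities under Hardy-Weinberg Equilibrium."""
    freq_a = 1.0 - freq_b
    return {"AA": freq_a ** 2, "AB": 2 * freq_a * freq_b, "BB": freq_b ** 2}


def genotype_vectors(source_vector, freq_b):
    """Yield (genotype vector, P(G | S)) for every G consistent with S."""
    prior = hwe_prior(freq_b)
    sources = sorted(set(source_vector))
    for assignment in product(GENOTYPES, repeat=len(sources)):
        by_source = dict(zip(sources, assignment))
        vector = tuple(by_source[s] for s in source_vector)
        prob = 1.0
        for genotype in assignment:  # one HWE draw per distinct source
            prob *= prior[genotype]
        yield vector, prob


def genotype_likelihood(n_a, n_b, genotype, error=0.01):
    """P(reads | genotype) under a simple binomial read model (assumed form).

    The binomial coefficient is omitted; it is the same for every genotype
    and therefore cancels when source vectors are compared.
    """
    p_b = {"AA": error, "AB": 0.5, "BB": 1.0 - error}[genotype]
    return (1.0 - p_b) ** n_a * p_b ** n_b


def site_likelihood(reads, source_vector, freq_b, error=0.01):
    """P(X | S) at one site: sum over consistent G of P(X | G) * P(G | S)."""
    total = 0.0
    for vector, prob in genotype_vectors(source_vector, freq_b):
        for (n_a, n_b), genotype in zip(reads, vector):
            prob *= genotype_likelihood(n_a, n_b, genotype, error)
        total += prob
    return total


# Hypothetical reads (n_A, n_B) for three samples at one site.
reads = [(3, 0), (3, 0), (0, 3)]
print(site_likelihood(reads, (1, 1, 1), freq_b=0.3))
print(site_likelihood(reads, (1, 1, 2), freq_b=0.3))
```

Under the independence assumption for sites, per-site likelihoods would be multiplied across the L sites (or their logarithms summed) to obtain the likelihood of S given all of the data.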
Estimating P(S| X)
One may wish to compare the posterior probability of two assignments of S, and when doing so via the posterior odds-ratio, both the denominator and P(S) cancel from the two posteriors (since the denominator acts as a normalizing constant and we assume a uniform prior on S). The ratios of the posteriors are, therefore, equal to the ratios of the likelihoods.
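The cancellation described above is easy to check numerically. The sketch below, with a hypothetical function name and made-up log-likelihood values, normalizes log-likelihoods into posterior probabilities under a uniform prior and confirms that the posterior odds equal the likelihood ratio. Working in log space is a practical choice of this sketch, since likelihoods multiplied across many sites underflow quickly.

```python
import math


def posterior_from_log_likelihoods(log_likelihoods):
    """Turn log P(X | S) for each source vector into P(S | X), uniform prior."""
    peak = max(log_likelihoods.values())
    # Subtracting the maximum before exponentiating avoids underflow.
    weights = {s: math.exp(ll - peak) for s, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Hypothetical log-likelihoods for the two source vectors of k = 2.
log_lik = {(1, 1): -1200.0, (1, 2): -1205.0}
posterior = posterior_from_log_likelihoods(log_lik)

posterior_odds = posterior[(1, 1)] / posterior[(1, 2)]
likelihood_ratio = math.exp(log_lik[(1, 1)] - log_lik[(1, 2)])
print(posterior_odds, likelihood_ratio)  # the two values agree
```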
Evaluating BIGRED
We examined how changes in mean read depth, L, and MAF at the L sites affect the accuracy of BIGRED. For simulation experiments, we used a fixed sequencing error rate of 0.01 and sampled sites such that no two sites fell within 20 kb of one another. In addition to accuracy, we evaluated the sensitivity of the algorithm. We used high-depth whole-genome sequence (WGS) data from 241 Manihot esculenta individuals to simulate a series of data sets. Filtering the data (e.g., removing sites with extremely low minor allele frequency and discarding regions prone to erroneous mapping) should be done prior to applying BIGRED to remove potentially spurious variants. We refer the reader to the section “Alignment of reads and variant calling of cassava haplotype map (HapMapII)” of [5] for a description of how the data were generated and the quality filters applied.
The data
Simulation experiments to evaluate the impact of mean read depth and MAF on accuracy
1. Enumerate all possible pairs of genotypes, where order does not matter (n = 15(14) = 210).
2. Sample one genotype pair.
3. Randomly assign the status ‘source 1’ to one of the two genotypes. Assign the remaining genotype ‘source 2’ status.
4. Randomly sample L = 1000 sites (genome-level) with a specified MAF.
5. Simulating \( {X}_{d=1}^{(v)} \): Sample Y alleles (with replacement) from the pool of allele reads belonging to source 1 at that site, where Y~Poisson(λ).
6. Simulating \( {X}_{d=2}^{(v)} \): Sample Y alleles (with replacement) from the pool of allele reads belonging to source 2 at that site, where Y~Poisson(λ).
7. Simulating \( {X}_{d=3}^{(v)} \): Sample Y alleles (with replacement) from the pool of allele reads belonging to source 1 at that site, where Y~Poisson(λ).
8. Feed the algorithm the simulated AD data and the population allele frequency of allele B at the L sites.
9. Record the conditional posterior probability of S = (1,2,1).
10. Repeat steps 2 through 9, 100 times. When repeating step 2, only sample from those genotype pairs that have not been sampled previously.
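Steps 5 through 7 can be sketched for a single site as follows. This is an illustrative sketch only: the function names and the toy read pools are hypothetical, and the Poisson sampler is a textbook routine (Knuth's multiplication method) used here to keep the example free of third-party dependencies.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, running = 0, rng.random()
    while running > threshold:
        count += 1
        running *= rng.random()
    return count


def simulate_reads(allele_pool, lam, rng):
    """Sample Y ~ Poisson(lam) alleles with replacement from a non-empty pool.

    Returns the allelic depth at the site as (n_A, n_B).
    """
    depth = poisson(lam, rng)
    draws = [rng.choice(allele_pool) for _ in range(depth)]
    return draws.count("A"), draws.count("B")


rng = random.Random(42)
source_1 = ["A"] * 20               # hypothetical read pool of a homozygous AA source
source_2 = ["A"] * 10 + ["B"] * 10  # hypothetical read pool of a heterozygous AB source

# S = (1,2,1): samples d = 1 and d = 3 draw from source 1, d = 2 from source 2.
site_reads = [simulate_reads(pool, 2, rng) for pool in (source_1, source_2, source_1)]
print(site_reads)
```

Note that a sampled depth of zero is possible here, whereas the method only uses sites at which every putative replicate has at least one read; a full simulation would need to handle such sites.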
Note that evaluating scenario S = (1,2,1) is equivalent to evaluating scenarios S = (1,1,2) and S = (1,2,2). We performed a full factorial experiment for the source vectors associated with k = 2, k = 3, and k = 4, where λ = {1,2,3,6,15} and where we sampled sites with a given MAF falling in one of five possible intervals (0.0,0.1], (0.1,0.2], (0.2,0.3], (0.3,0.4], and (0.4,0.5]. Note that in these simulation experiments, all putative replicates of a given individual had identical mean read depths. We later tested the scenario where mean read depths varied among the samples.
Simulation experiments to evaluate the impact of L on accuracy
To assess the impact of L on accuracy, we repeated simulation experiments for S = (1,2,1) and S = (1,2,3), sampling sites with MAFs falling in (0.2,0.3] and testing seven values of L: 50, 100, 250, 500, 1000, 2000, and 5000.
Simulation experiments to evaluate BIGRED’s sensitivity
We next evaluated the algorithm’s sensitivity by simulating the scenario where S = (1,1) and corrupting (i.e., contaminating) p percent of sites in sample d = 2 with a second, randomly sampled genotype source. We tested five values of p (10, 20, 30, 40, 50%) at five mean depths (1x, 2x, 3x, 6x, and 15x). We repeated this procedure 100 times for each depth and p combination.
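The corruption step can be sketched as below. The function name is hypothetical, and the sketch works on per-site genotype labels for brevity, which is an assumption of this example; in the experiments described above, the simulated reads at the corrupted sites would instead be drawn from the second source.

```python
import random


def contaminate(genotypes, contaminant, percent, rng):
    """Replace `percent` percent of sites with the contaminant's genotype.

    `genotypes` and `contaminant` are equal-length lists of per-site genotypes.
    """
    n_sites = len(genotypes)
    n_corrupt = round(n_sites * percent / 100)
    corrupted = set(rng.sample(range(n_sites), n_corrupt))
    return [contaminant[i] if i in corrupted else genotypes[i]
            for i in range(n_sites)]


rng = random.Random(1)
clean = ["AA"] * 1000   # hypothetical genotypes of the true source
other = ["BB"] * 1000   # hypothetical genotypes of the contaminating source
mixed = contaminate(clean, other, percent=30, rng=rng)
print(mixed.count("BB"))  # 300
```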
Simulation experiments to evaluate the scenario where mean read depths vary among the k putative replicates
We simulated data for three source vectors S = (1,1), S = (1,2), and S = (1,2,1). For S = (1,1) and S = (1,2), we varied the mean read depth of sample d = 2 while keeping the mean depth of sample d = 1 constant at 1x. We tested five different λ values for sample d = 2: 1, 2, 4, 6, and 12. For S = (1,2,1), we varied the mean read depth of sample d = 3 while keeping the mean depth of samples d = 1 and d = 2 constant at 1x. We again tested five λ values, this time for sample d = 3: 1, 2, 4, 6, and 12. We held L constant at 1000 across all experiments and tested the same five MAF intervals as before.
Comparing results to hierarchical clustering
We compared results from BIGRED to results obtained from hierarchical cluster analysis. Results from [10] show that hierarchical clustering is an effective tool for matching accessions from farmers’ fields to corresponding varieties in an existing database of known varieties, a problem very similar to the one being addressed in this paper. We performed hierarchical clustering on the k putative replicates of each genotype. To do this, we first calculated the realized additive relationship matrix for the 1215 sequenced samples from IITA using sites harboring biallelic SNPs. Sites were filtered using criteria based on MAF and percent missing. Sites with a MAF falling within the interval (0.1,0.5] and with < 50% missing data across the 1215 samples were kept, leaving us with 46,862 sites (out of 100,267) to analyze. We calculated the realized additive relationship matrix using the A.mat() function from the R package rrBLUP [11]. We used a matrix of genotype dosages as input and imputed missing dosage values using the “mean” option. We then calculated a distance matrix between the rows of the additive relationship matrix using Euclidean distance as the distance measure. We performed complete-linkage hierarchical clustering using the hclust() function and the distance matrix as input [12]. For each genotype, the hclust() function returns a tree structure with k leaves, each leaf representing a putative replicate. We determined the underlying relationship among each genotype’s putative replicates by cutting each tree at a height of 0.5. We refer to this relationship as the “source vector” to keep terminology consistent with BIGRED’s. We compared results from the complete-linkage cluster analysis to results from BIGRED. For BIGRED, we set a posterior probability threshold of 0.99, i.e., BIGRED would only return an inferred source vector if that source vector had a posterior probability of at least 0.99.
This minimum posterior probability threshold was met in all cases, i.e., we were able to infer a source vector in all cases. We repeated this procedure for NaCRRI (299 sequenced samples and 48,712 sites) and NRCRI (415 sequenced samples and 48,320 sites).
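The step of cutting a complete-linkage tree at a fixed height and reading off a source vector can be illustrated with a small pure-Python sketch. It is not the R workflow used in the study (which relied on hclust()); the function names and the toy distance matrix are hypothetical. Clusters are merged greedily while the smallest complete-linkage distance does not exceed the cut height, and the resulting clusters are relabeled in order of first appearance so that the first element is always 1.

```python
def cut_complete_linkage(dist, height):
    """Cluster samples by complete linkage, stopping at the given height.

    `dist` is a symmetric k x k matrix (list of lists) of pairwise distances.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, ia, ib = min(
            (linkage(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        if best > height:
            break
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters


def to_source_vector(clusters):
    """Relabel clusters in order of first appearance, starting from 1."""
    cluster_of = {i: idx for idx, members in enumerate(clusters) for i in members}
    labels, vector = {}, []
    for sample in sorted(cluster_of):
        idx = cluster_of[sample]
        labels.setdefault(idx, len(labels) + 1)
        vector.append(labels[idx])
    return tuple(vector)


# Hypothetical distances among three putative replicates.
dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
print(to_source_vector(cut_complete_linkage(dist, height=0.5)))  # (1, 1, 2)
```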
For each breeding institution, we categorized the institution’s genotypes into groups based on the number of putative replicates (k) each genotype had. We then calculated a mean non-replicate rate μ_{k} separately for each k. To calculate this, we computed a non-replicate rate for each individual that has k putative replicates (when k = 2, this rate is 1 - P(S = (1,1)|X)), and then averaged these values across all individuals of a given k.
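As a small illustration of this calculation, the sketch below averages per-individual non-replicate rates, taking each rate as one minus the posterior probability of the no-error source vector. The function name and the posterior values are hypothetical.

```python
def mean_non_replicate_rate(no_error_posteriors):
    """Mean of 1 - P(no error | X) across individuals with the same k."""
    rates = [1.0 - p for p in no_error_posteriors]
    return sum(rates) / len(rates)


# Hypothetical values of P(S = (1,1) | X) for four individuals with k = 2.
posteriors = [1.0, 0.0, 1.0, 1.0]
print(mean_non_replicate_rate(posteriors))  # 0.25
```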
Comparing the consistency of BIGRED and hierarchical clustering
To compare the consistency of BIGRED and hierarchical clustering, we performed a set of experiments using the GBS data from the 475 IITA individuals with 1 < k < 7 putative replicates. The basic premise of these experiments is that an analysis based on a larger set of sites is likely to be correct. The first step in these experiments is to perform error detection on an individual’s putative replicates using the data at a large number of sites and to set the inferred source vector as the “truth”. The second step is to perform error detection once more on the individual’s replicates, this time using the data at a smaller number of sites disjoint from the initial set. To obtain a measure of consistency, we compare the results from the first (larger) analysis with results from the second (smaller) analysis.
To evaluate the consistency of hierarchical clustering, we first filtered the data, retaining samples with a genome-wide mean read depth of ≥0.5 and sites with MAFs within the interval (0.3,0.5] and with < 50% missing data across the filtered samples. This left 1215 samples and 16,926 sites for analysis. As before, we called genotype dosages using the observed allelic read depth data and imputed missing values at a given site with the site mean. We then performed hierarchical clustering on each of the 475 individuals, using data from 2000 randomly sampled sites. We set the output of these analyses as the “truth”. We then performed hierarchical clustering on each of the individuals a second time, sampling L sites disjoint from the initial 2000, and compared the inferred source vector with the “true” source vector. We tested five values of L: 50, 100, 250, 500, and 1000. We repeated the experiment 10 times for each value of L and calculated a mean concordance rate between the “true” source vector and the source vector inferred from the L sites across the 10 runs and 475 cases for each L.
To evaluate the consistency of BIGRED, we first filtered the data, keeping samples with a genome-wide mean read depth of ≥0.5 and sites with MAFs within the interval (0.3,0.5]. As with hierarchical clustering, we defined the truth using 2000 randomly sampled sites. We used a fixed sequencing error rate of 0.01 and sampled sites such that no two sites fell within 20 kb from one another. We followed the same procedure as the one used to evaluate the consistency of hierarchical clustering, in particular, testing with the same five values of L.
Applying a pairwise-comparison approach to real data
Methods that employ a pairwise-comparison approach for error detection rather than joint analysis of the samples might produce ambiguous results when more than two putative replicates exist. To demonstrate, we applied a pairwise-comparison method to IITA’s data; specifically, we calculated the Pearson correlation between all pairs of putative replicates. We refer to this method as the “correlation method”. Before calculating the Pearson correlation between replicate pairs, we filtered the data, retaining samples with a genome-wide mean read depth of ≥0.5 and sites with MAFs within the interval (0.3,0.5] and with < 50% missing data across the filtered samples. This left 1215 samples and 16,926 sites for analysis. We called genotype dosages using the observed allelic read depth data and imputed missing values using glmnet [9]. We then calculated the Pearson correlation between all pairs of putative replicates using the cor() function [12]. For simplicity, we limited our analysis to the 154 cases where k = 3. Correlations ranged from 0.02 to 0.93, so we selected 0.85 as the replicate-call threshold (i.e., two putative replicates with a correlation ≥0.85 are considered true replicates). We also applied a replicate-call threshold of 0.80 to examine how results changed.
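One way to make "ambiguous" operational is to flag any trio of samples in which exactly two of the three pairs pass the replicate-call threshold, since the pairwise calls are then not transitive. The sketch below encodes that reading; the function name, the data layout, and the correlation values are hypothetical, and this criterion is an assumption of the example rather than a definition taken from the analysis itself.

```python
from itertools import combinations


def is_ambiguous(correlations, threshold):
    """Return True if pairwise replicate calls are not transitive.

    `correlations` maps frozenset({x, y}) to the correlation between
    samples x and y; every pair of samples must be present.
    """
    samples = sorted({s for pair in correlations for s in pair})

    def called(x, y):
        return correlations[frozenset((x, y))] >= threshold

    for trio in combinations(samples, 3):
        n_called = sum(called(x, y) for x, y in combinations(trio, 2))
        if n_called == 2:  # two pairs match but the third does not
            return True
    return False


# Hypothetical correlations among three putative replicates A, B, and C.
correlations = {
    frozenset(("A", "B")): 0.90,
    frozenset(("A", "C")): 0.88,
    frozenset(("B", "C")): 0.70,
}
print(is_ambiguous(correlations, threshold=0.85))  # True
```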
Run time
We measured computation time as the number of central processing unit (CPU) seconds required to run BIGRED. All jobs were submitted to the Computational Biology Service Unit at Cornell University, which uses a 112 core Linux (CentOS 7.4) RB HPC/SM Xeon E7 4800 2 U with 512GB RAM.
Results
Evaluating the accuracy and run-time of BIGRED
Evaluating the sensitivity of the algorithm
Evaluating BIGRED’s accuracy when mean read depths vary among the k putative replicates
Estimating NEXTGEN non-replicate rates
Table 1. Mean non-replicate rate μ_{k} of each breeding institution
Statistic | Institution | k = 2 | k = 3 | k = 4 | k = 5 | k = 6 |
---|---|---|---|---|---|---|
n_{k} | IITA | 272 | 154 | 37 | 11 | 1 |
μ_{k} | IITA | 0.21 | 0.16 | 0.14 | 0.27 | 1 |
n_{k} | NaCRRI | 58 | 61 | 0 | 0 | 0 |
μ_{k} | NaCRRI | 0.05 | 0.21 | – | – | – |
n_{k} | NRCRI | 128 | 31 | 5 | 8 | 1 |
μ_{k} | NRCRI | 0.37 | 0.32 | 0.40 | 0.25 | 1 |
n_{k} | IITA & NRCRI | 101 | 31 | 5 | 8 | 1 |
μ_{k} | IITA & NRCRI | 0.33 | 0.32 | 0.40 | 0.25 | 1 |
For each institution, we categorized genotypes into groups based on the number of putative replicates each genotype had. Rows labeled n_{k} show the number of genotypes in each group for each breeding institution. We then calculated the mean non-replicate rate among genotypes of a given k, μ_{k}, by calculating the mean probability of no errors and subtracting this value from one.
Method comparison
Table 2. Consistency of BIGRED and hierarchical clustering using the 475 IITA individuals with 1 < k < 7 putative replicates
Method | L = 50 | L = 100 | L = 250 | L = 500 | L = 1000 |
---|---|---|---|---|---|
BIGRED | 0.9832 | 0.9895 | 0.9958 | 0.9973 | 0.9981 |
Hierarchical clustering | 0.8322 | 0.9088 | 0.9488 | 0.9640 | 0.9771 |
To evaluate the consistency of the two methods, we performed error detection on an individual’s putative replicates using the data at 2000 sites and set the inferred source vector as the “truth”. We then performed error detection a second time using a smaller number of sites (L) disjoint from the initial set. We compared the “true” source vector with the source vector inferred from L sites. For each IITA individual, we tested five values of L and repeated the experiment 10 times for each value of L. We then calculated the mean concordance rate between the “true” source vector and the source vector inferred from L sites across the 475 cases and across 10 runs.
One motivation for BIGRED’s joint analysis framework is that pairwise-comparison methods might produce ambiguous results for cases of more than two putative replicates. We introduced a hypothetical example of this in the Background section and found real examples of these inconsistencies when applying a pairwise-comparison method to IITA’s data. More specifically, when examining cases of k = 3 and using a replicate-call threshold of 0.85, we found 80 cases (out of 154) where the pairwise method awarded replicate status to at least one pair of an individual’s samples. Of these 80 cases, we found 10 cases where the method produced ambiguous results (Additional file 9). When we decreased the call threshold to 0.80, we found 146 cases where the method inferred at least one true replicate pair, but six of these cases had ambiguous results (Additional file 10).
Discussion
Researchers may choose, for a number of reasons, to sequence a given individual more than once. Regardless of intent, it is important to identify potentially mislabeled or contaminated samples before using the data (e.g. merging the data from replicate sequence runs or using the data to optimize bioinformatics quality filters). Unfortunately, existing methods to detect such errors are ad hoc and ill-suited for use in shallow-depth HTS data since they require some combination of genotype calling, imputation, and haplotype phasing. We have introduced a new probabilistic framework for error detection that addresses key limitations of existing methods. Using Bayes' theorem, we calculate the posterior probability distribution over the set of relations describing the putative replicates (i.e. the set of source vectors), allowing us to infer which of the samples originated from an identical genotypic source.
We examined the impact of mean read depth, L, and MAF at the L sites on the accuracy of the proposed method through a series of simulation experiments. We found that the algorithm is most accurate when analyzing sites whose MAFs fall in the range (0.3,0.5], consistently across all mean read depths when L = 1000 (Fig. 5). Sites with MAFs falling in the interval (0.0,0.1] relay little information to the algorithm. When analyzing these sites, BIGRED assigns a median posterior probability of one to S = (1,1,1), regardless of the true source vector. Thus BIGRED appears to be biased towards inferring no error among putative replicates when analyzing sites with low MAF. One reason for this bias is our definition of P(G^{(v)}|S) (Fig. 2). Given a site that has a reference allele frequency of 0.1, when k = 3, the probability of G^{(v)} = (AA,AA,AA) given S = (1,1,1), i.e. no erroneous samples among the putative replicates, is 0.1^{2}, whereas the probability of G^{(v)} = (AA,AA,AA) given any other source vector is ≤0.1^{4}. This bias is compounded by the fact that we estimated allele frequencies from a set of 206 individuals but ran simulation experiments using a subset of 15. Some loci that had low but non-zero MAF among the 206 individuals appeared monomorphic among the 15 individuals, making the 15 individuals look more similar than they actually are. We found that 47.14 and 5.29% of sites with MAFs in the (0.0,0.1] and (0.1,0.2] interval, respectively, became monomorphic among the 15 individuals.
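The probabilities quoted in this paragraph follow directly from HWE with one independent genotype draw per distinct source. As a worked example, assume allele A is the reference allele and write its frequency as p_A = 0.1 (the symbol p_A is introduced here for illustration):

```latex
\begin{aligned}
P\left(G^{(v)}=(AA,AA,AA)\mid S=(1,1,1)\right) &= p_A^{2} = 0.1^{2} = 10^{-2},\\
P\left(G^{(v)}=(AA,AA,AA)\mid S=(1,1,2)\right) &= p_A^{2}\cdot p_A^{2} = 0.1^{4} = 10^{-4},\\
P\left(G^{(v)}=(AA,AA,AA)\mid S=(1,2,3)\right) &= p_A^{2}\cdot p_A^{2}\cdot p_A^{2} = 0.1^{6} = 10^{-6}.
\end{aligned}
```

Source vectors with two distinct sources, such as S = (1,2,1) and S = (1,2,2), give the same value as S = (1,1,2), so every source vector other than S = (1,1,1) assigns this genotype vector a probability at least 100 times smaller.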
To evaluate the impact of L on the algorithm’s accuracy, we repeated simulation experiments for S = (1,2,1) and S = (1,2,3) using different values of L and looking only at sites with MAFs falling in (0.2,0.3]. Surprisingly, we observed little to no change in median accuracy at a given depth when increasing the number of sampled sites. The only exception was S = (1,2,1) at 2x mean depth, where we observed a drastic increase in accuracy when increasing L from 100 to 250 (Fig. 6). For S = (1,2,3) at 2x and 3x, we observed a median accuracy of zero even when sampling 5000 sites. We observed an increase in median accuracy only after increasing the mean read depth of samples to 4x. These results indicate that the mean read depth of samples contributes more to accuracy than the number of sampled sites. In these simulation experiments, all k putative replicates of a given genotype were assigned identical mean read depths. These results, however, were robust to samples with varying mean read depths (Fig. 8).
We also assessed the sensitivity of the algorithm as a way to gauge how the proportion of exogenous DNA affects the algorithm and how allelic sampling bias impacts results. The GBS protocol uses methylation-sensitive restriction enzymes (REs) to avoid sampling highly repetitive regions of the genome. One potential complication when using methylation-sensitive REs is allelic sampling bias of a marker or unequal sampling and sequencing of homologous chromosomes, resulting from differential methylation in a region. ApeKI, the RE employed by NEXTGEN, for instance, will not cut if the 3′ base of the recognition sequence on both strands is 5-methylcytosine. To test the impact of imperfect marker “heritability”, we simulated the scenario where S = (1,1) and corrupted p percent of sites in sample d = 2 with a second genotype source. We tested the cases where p = {10,20,30,40,50%} for five different sample mean depths (λ = {1,2,3,6,15}) and found that the algorithm was robust to increases in p for lower values of λ (Fig. 7). Not surprisingly, the method assigned higher probability to S = (1,2) as p and mean depth increased. As mean depth increases, the algorithm grows increasingly confident that differences at sites reflect true biological differences rather than sampling variation or error.
When applying BIGRED and hierarchical clustering on real data, we found a relatively high concordance rate between the two methods (Fig. 9). Although this comparison does not directly tell the reader which of the two methods is more accurate, the comparison and the analyses in this paper demonstrate the benefits of using BIGRED over hierarchical clustering. Firstly, we found that BIGRED is a more consistent estimator relative to hierarchical clustering (Table 2). Secondly, BIGRED employs a probabilistic framework to tackle the problem of error detection rather than a heuristic one like hierarchical clustering, making BIGRED a more statistically rigorous and neatly packaged method. Hierarchical clustering requires the user to make many (arguably arbitrary) decisions throughout the protocol, whereas BIGRED requires the user to make one decision at the very end, i.e. the probability at which to “call” a source vector. Our results also highlight one of the major flaws of methods like hierarchical clustering: results can change depending on what samples were included in the analysis, specifically during imputation. There are 146 genotypes that are used in both IITA’s and NRCRI’s breeding programs, and these 146 genotypes appear in both institutions’ data (Fig. 4). We performed hierarchical clustering on these individuals twice: once in combination with the 329 genotypes unique to IITA and once in combination with the 27 genotypes unique to NRCRI. Ideally, the duplicate runs of an individual would produce identical results, regardless of what other samples were included in each analysis. Of the 146 cases, however, we found three cases where the hierarchical clustering-based duplicate analyses produced conflicting results: one case where the two analyses reported different errors and two cases where the IITA analysis reported no error but the NRCRI analysis reported an error.
These conflicts likely resulted from the imputation component of the cluster analysis procedure since sample composition is known to affect imputation. These issues highlight the benefits of our approach: when we ran BIGRED on these 146 individuals twice, we found that all duplicate runs produced identical results.
In our simulation experiments, we estimated allele frequencies from WGS data. Users of BIGRED will likely not have this option and will need to estimate allele frequencies using low- to moderate-depth sequence data. Although such frequency estimates will in general contain noise, we showed that BIGRED is robust to imperfect estimates of allele frequency. We estimated allele frequencies from a set of 206 individuals but ran simulation experiments using a subset of 15 individuals and were able to recover the true underlying source vector when analyzing sites with MAFs falling in the (0.3,0.5] interval (Fig. 5). We also suggest that a user perform preliminary analyses (e.g., with PCA) to detect the presence of population structure, and when structure is evident, we recommend analyzing subpopulations separately, estimating allele frequencies from samples of a given subpopulation then running BIGRED on the samples from that subpopulation.
The number of possible source vectors increases exponentially as k increases (Additional file 5). For this reason, we do not recommend using BIGRED on cases where k > 7. We do not anticipate many scenarios where a researcher would have sequenced a given individual more than seven times; if this scenario does occur, one could either randomly select seven putative replicates to analyze or divide the replicates into sets of no more than seven samples. If using the latter scheme, one would run BIGRED on each set, merge the true replicates within each set (discarding the erroneous samples), then combine the sets of merged samples before running BIGRED once more. By using a Poisson distribution to simulate AD data, we make the assumption that reads are uniformly distributed across the genome. Although real read data will often be more dispersed than the data simulated here, if at least L of the sites in those data have a read depth of λ or above, BIGRED will perform at least as well as in our analyses with these same parameters.
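To give a feel for this growth and for the divide-into-sets scheme, the sketch below counts source vectors with the Bell triangle (the number of ways to partition k labeled samples into groups) and splits a list of replicates into sets of at most seven. Both function names and the sample names are hypothetical, and the splitting shown is only the first step of the scheme described above.

```python
def count_source_vectors(k):
    """Number of source vectors of length k (k >= 1), via the Bell triangle."""
    row = [1]
    for _ in range(k - 1):
        new_row = [row[-1]]          # each row starts with the previous row's last entry
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


def split_replicates(replicates, max_size=7):
    """Divide putative replicates into consecutive sets of at most max_size."""
    return [replicates[i:i + max_size]
            for i in range(0, len(replicates), max_size)]


for k in range(2, 9):
    print(k, count_source_vectors(k))

names = ["rep%d" % i for i in range(1, 11)]   # ten hypothetical replicates
print([len(group) for group in split_replicates(names)])  # [7, 3]
```

With k = 7 there are 877 source vectors to evaluate, and k = 8 already requires 4140.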
A motivation for BIGRED’s joint analysis framework is that pairwise-comparison methods can produce ambiguous results when more than two putative replicates exist, and we did, in fact, encounter such cases when applying the correlation method to real data. Of the cases where the method reported the presence of replicates at replicate-call thresholds of 0.85 and 0.80, 12.50% and 4.11%, respectively, contained pairwise inconsistencies. Decreasing the call threshold lowers the number of ambiguous cases returned but also increases the number of false positives returned. Although they may occur at low frequency, pairwise inconsistencies remain possible and are a problem for any method that employs a pairwise-comparison approach.
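The kind of pairwise inconsistency discussed above can be illustrated with a small sketch: among three putative replicates, two pairs are called replicates but the third pair is not, so the calls do not describe a coherent grouping. The correlation values, the "greater than or equal to" call rule, and the function names are hypothetical and serve only to illustrate the idea.

```python
from itertools import combinations


def called_pairs(correlations, threshold):
    """Return the pairs whose correlation meets the replicate-call threshold."""
    return {pair for pair, r in correlations.items() if r >= threshold}


def find_inconsistent_triples(samples, correlations, threshold):
    """Find triples in which exactly two of the three pairs are called replicates.

    Such a triple is ambiguous: A matches B and B matches C, yet A does not
    match C, so no single grouping of the samples explains all three calls.
    """
    calls = called_pairs(correlations, threshold)

    def is_called(a, b):
        return frozenset((a, b)) in calls

    inconsistent = []
    for a, b, c in combinations(samples, 3):
        n_called = is_called(a, b) + is_called(b, c) + is_called(a, c)
        if n_called == 2:
            inconsistent.append((a, b, c))
    return inconsistent


# Hypothetical pairwise correlations among three putative replicates.
samples = ["rep1", "rep2", "rep3"]
correlations = {
    frozenset(("rep1", "rep2")): 0.91,
    frozenset(("rep2", "rep3")): 0.87,
    frozenset(("rep1", "rep3")): 0.82,
}

at_085 = find_inconsistent_triples(samples, correlations, threshold=0.85)
at_080 = find_inconsistent_triples(samples, correlations, threshold=0.80)
print(at_085)
print(at_080)
```

In this made-up example the triple is inconsistent at a threshold of 0.85 but not at 0.80, where all three pairs are called. This mirrors the trade-off described above: a lower threshold removes the ambiguity only by accepting more pairs as replicates.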
Conclusions
In this study, we introduced a statistical framework for detecting mislabeled and contaminated samples among putative replicates. Our method addresses key limitations of existing approaches and produced highly accurate results in simulation experiments even when applied to samples with low read depth. Our method is implemented as an R package called BIGRED, which is freely available for download: https://github.com/ac2278/BIGRED.
Notes
Acknowledgements
We would like to thank all those involved in the Next Generation Cassava Breeding (NEXTGEN) project for the collection of sequence data used in this study and for data management.
Funding
This work was supported by the Bill & Melinda Gates Foundation and the Department for International Development of the United Kingdom. The funders had no role in study design, data analysis, decision to publish, or preparation of the manuscript.
Availability of data and materials
The AD and WGS data used and analyzed during the current study are available from CassavaBase, https://www.cassavabase.org/. The method is implemented as an R package called BIGRED, which is freely available for download: https://github.com/ac2278/BIGRED.
Authors’ contributions
AWC developed the algorithm with input from ALW. AWC and JLJ conceived and designed the experiments. AWC performed the experiments, analyzed the data, and wrote the manuscript. All authors read and approved the final version of the manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.