Sensitivity to sequencing depth in single-cell cancer genomics
Querying cancer genomes at single-cell resolution is expected to provide a powerful framework to understand in detail the dynamics of cancer evolution. However, given the high costs currently associated with single-cell sequencing, together with the inevitable technical noise arising from single-cell genome amplification, cost-effective strategies that maximize the quality of single-cell data are critically needed. Taking advantage of previously published single-cell whole-genome and whole-exome cancer datasets, we studied the impact of sequencing depth and sampling effort on single-cell variant detection.
Five single-cell whole-genome and whole-exome cancer datasets were independently downscaled to 25, 10, 5, and 1× sequencing depth. For each depth level, ten technical replicates were generated, resulting in a total of 6280 single-cell BAM files. The sensitivity of variant detection, including structural and driver mutations, genotyping, clonal inference, and phylogenetic reconstruction to sequencing depth was evaluated using recent tools specifically designed for single-cell data.
Altogether, our results suggest that for relatively large sample sizes (25 or more cells) sequencing single tumor cells at depths > 5× does not drastically improve somatic variant discovery, characterization of clonal genotypes, or estimation of single-cell phylogenies.
We suggest that sequencing multiple individual tumor cells at a modest depth represents an effective alternative to explore the mutational landscape and clonal evolutionary patterns of cancer genomes.
Keywords: Single-cell sequencing; Intratumor genetic heterogeneity; Variant calling; Clonal inference; Tumor phylogenies
COSMIC: Catalogue of somatic mutations in cancer
GATK: Genome analysis toolkit
ITH: Intratumor genomic heterogeneity
MALBAC: Multiple annealing and looping-based amplification cycles
MDA: Multiple displacement amplification
Recent advances in next-generation sequencing (NGS) technologies revealed that the large majority of cancer genomes are heterogeneous despite their monoclonal origin, with the continuous expansion of the tumor mass contributing to the accumulation of somatic mutations within malignant cells, hence promoting the proliferation of distinct genetic lineages (i.e., clones) through time . While quantifying this intratumor heterogeneity (ITH) remains a difficult task, as standard methods in cancer genomics generally rely on population-level analysis from bulk experiments, single-cell sequencing (SC-Seq) approaches are now widely viewed as a promising alternative to explore tumor evolution . Indeed, a collection of recent studies have successfully applied SC-Seq to determine the mutational load in individual tumors , estimate the frequency of subclones , infer evolutionary relationships , or explore the role of ITH in metastatic dissemination .
Nevertheless, several technical challenges surrounding current SC-Seq methodologies greatly limit our ability to obtain reliable genomic information from single cells. For instance, the multiple rounds of whole genome amplification (WGA) usually required prior to SC-Seq are known to introduce a high number of sequence artifacts that can be confounded with genuine biological variation (see  for a detailed review). Other technical errors, such as insufficient physical coverage, uneven genome amplification, and allelic dropout, may also generate substantial artificial variability in cancer genomes, compromising the ability to detect real somatic heterogeneity from SC-Seq data . As a consequence, alternative strategies are needed in order to eliminate the noise generated during WGA while effectively allowing the quantification of ITH from single cells.
Zhang et al.  started addressing some of these issues and demonstrated the efficiency of a census-based strategy for accurate variant detection in single cells. By using multiple cells and trusting only variants detected in at least two single-cell libraries, they detected up to 80% of germline SNPs in the human chromosome 5 with 59 cells sequenced at 0.3× or 22 cells at 1×. Their results suggest that for detecting clonal and subclonal variants in single cells, and given a fixed sequencing effort, it is best to sequence multiple cells (in their case a minimum of 20) at a modest depth (~ 1×).
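The notion of a fixed sequencing effort can be made concrete with a small calculation. The sketch below assumes that effort is simply the number of cells multiplied by the depth per cell; the function names are hypothetical and the 125× budget is an illustrative value, not one taken from any study.

```python
def sequencing_effort(n_cells, depth_per_cell):
    """Total effort, expressed as the summed depth (x) over all cells."""
    return n_cells * depth_per_cell


def cells_affordable(total_effort, depth_per_cell):
    """Number of whole cells that fit into a fixed total effort."""
    if depth_per_cell <= 0:
        raise ValueError("depth_per_cell must be positive")
    return int(total_effort // depth_per_cell)


# The two designs reported by Zhang et al. represent a similar total effort.
effort_many_shallow = sequencing_effort(59, 0.3)   # about 17.7x in total
effort_fewer_deeper = sequencing_effort(22, 1.0)   # 22x in total

# Illustrative budget of 125x: five cells at 25x or 25 cells at 5x.
cells_at_25x = cells_affordable(125, 25)
cells_at_5x = cells_affordable(125, 5)
```

Under this simple model, lowering the depth per cell from 25× to 5× frees enough effort to sequence five times as many cells.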
Here, we further explore the sensitivity of SC-Seq to sequencing depth using five publicly available single-cell whole-genome (scWGS) and whole-exome (scWXS) cancer datasets. We expand not only on the scale of the datasets, but also on the scope of the inferences, including copy-number variant detection, clonal inference, and phylogenetic estimation. Altogether, our results suggest that although sequencing depth does refine the characterization of somatic variants from tumor single cells, sample size plays a more decisive role in a reliable assessment of the general patterns of somatic variation in cancer genomes. For relatively large sample sizes (e.g., ≧ 25 samples), sequencing single cells at modest depths (i.e., 5×) enables a similar description of somatic variation, clonal composition, and evolutionary history compared to sequencing depths one order of magnitude higher.
Five publicly available sequencing datasets from four single-cell studies were retrieved from the Sequence Read Archive (SRA) in FASTQ format, including four single-cell genomes from a breast cancer patient  (we will call this dataset “W4” to indicate the authors and the number of cells), eight single-cell exomes from circulating tumor cells from one lung adenocarcinoma patient  (“N8” dataset), 25 single-cell exomes derived from a kidney tumor patient  (“X25” dataset), 55 single-cell exomes from a breast cancer patient  (“W55” dataset), and 65 single-cell exomes from a single JAK-2 negative neoplasm myeloproliferative patient  (“H65” dataset). Normal and tumor bulk WGS/WXS data from the same patients were also retrieved. Normal single cells were only available for the three largest datasets. A list of the individual samples and corresponding accession codes is available in Additional file 1: Table S1.
All the analyses enumerated below are described in detail in the accompanying Additional file 1: Note, including command lines. Both single-cell and bulk reads were aligned to human reference GRCh37 using the MEM algorithm in the BWA software . Following a standardized best-practices pipeline , mapped reads from all datasets were independently processed by filtering reads displaying low mapping-quality, performing local realignment around indels, and removing PCR duplicates. Raw single-nucleotide variant (SNV) calls for the bulk datasets were obtained using the paired-sample variant-calling approach implemented in the VarDict software . For the N8 dataset, since samples from both primary tumor and metastasis were available, VarDict was run twice, independently for both samples, and the resulting SNVs subsequently merged using the CombineVariants tool from the Genome Analysis Toolkit (GATK) . Low-quality SNV calls were removed using the SelectVariants tool from GATK. The remaining SNVs were further subdivided into two distinct categories: “germline” SNVs if present in both tumor and normal bulk samples, and “somatic” SNVs if found solely in the tumor bulk samples. Small indels and other complex structural rearrangements were ignored in order to generate a final list of “gold-standard” bulk SNVs. All analyses presented here were based on this set of variants.
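The germline/somatic split described above amounts to set operations on the filtered bulk calls. The following is a minimal sketch of that logic only; the `SNV` record, the function name, and the toy calls are hypothetical, and it does not reproduce VarDict's paired-sample calling or the GATK filtering steps.

```python
from typing import NamedTuple


class SNV(NamedTuple):
    """Hypothetical minimal SNV record: position plus reference/alternate allele."""
    chrom: str
    pos: int
    ref: str
    alt: str


def classify_bulk_snvs(tumor_calls, normal_calls):
    """Return (germline, somatic) sets from tumor and normal bulk SNV calls.

    germline: present in both tumor and normal bulk samples
    somatic:  present only in the tumor bulk sample
    """
    tumor, normal = set(tumor_calls), set(normal_calls)
    return tumor & normal, tumor - normal


# Toy data, invented for illustration.
tumor_calls = [
    SNV("1", 100, "A", "G"),
    SNV("2", 200, "C", "T"),
    SNV("5", 300, "G", "A"),
]
normal_calls = [
    SNV("1", 100, "A", "G"),
    SNV("7", 400, "T", "C"),
]

germline, somatic = classify_bulk_snvs(tumor_calls, normal_calls)
```

Variants seen only in the normal sample fall into neither category, which is consistent with the two-way classification given in the text.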
The single-cell BAM files were independently downscaled to 25, 10, 5, and 1× sequencing depth using Picard . For each depth level, ten technical replicates were generated for statistical validation, resulting in a total of 6280 BAM files. Single-cell SNV calls were obtained from the original and down-sampled single-cell BAM files using Monovar , a variant caller specifically designed for single-cell data, under default settings. Single-cell variant-calling performance was evaluated by estimating the proportion of “gold-standard” germline and somatic bulk SNVs identified in the down-sampled single-cell datasets (germline and somatic recall, respectively). To further characterize the effect of sequencing depth on single-cell variant calling, we determined the fraction of somatic SNVs found in the down-sampled single-cell replicates that were also identified in the original single-cell datasets (“somatic precision”). In addition, we repeated the recall analysis focusing only on the somatic SNVs already described in the Catalogue Of Somatic Mutations In Cancer (COSMIC) database  and on the non-synonymous SNVs previously detected (Additional file 1: Table S2).
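The bookkeeping behind this design can be sketched as follows. The cell counts and depth levels come from the text; the helper names are hypothetical, and the keep-probability calculation assumes that downsampling retains each read with a fixed probability (as Picard's DownsampleSam does through its PROBABILITY argument). The exact commands are given in Additional file 1 and are not reproduced here.

```python
# Number of single cells per dataset, as listed in the text.
CELLS = {"W4": 4, "N8": 8, "X25": 25, "W55": 55, "H65": 65}
TARGET_DEPTHS = [25, 10, 5, 1]  # target depths (x)
REPLICATES = 10                 # technical replicates per depth level


def total_downsampled_bams(cells=CELLS, depths=TARGET_DEPTHS, replicates=REPLICATES):
    """One BAM file per cell, per depth level, per replicate."""
    return sum(cells.values()) * len(depths) * replicates


def keep_probability(original_depth, target_depth):
    """Fraction of reads to retain so that the expected depth equals the target."""
    if original_depth <= 0:
        raise ValueError("original_depth must be positive")
    return min(1.0, target_depth / original_depth)


def recall(gold_standard, called):
    """Proportion of gold-standard bulk SNVs recovered in a single-cell call set."""
    gold = set(gold_standard)
    return len(gold & set(called)) / len(gold) if gold else float("nan")


def precision(called, reference):
    """Proportion of down-sampled calls also present in the original call set."""
    calls = set(called)
    return len(calls & set(reference)) / len(calls) if calls else float("nan")


n_bams = total_downsampled_bams()      # 157 cells x 4 depths x 10 replicates
p_keep = keep_probability(50.0, 5.0)   # hypothetical 50x cell reduced to 5x
```

With 157 cells in total, four depth levels, and ten replicates, the count matches the 6280 BAM files reported above.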
Single-cell copy-number variants (CNVs) were identified with Ginkgo  using variable-length bins of around 500 kb. After binning, data for each cell was normalized and segmented using default parameters. Sensitivity was evaluated by assessing the recall of the CNVs and segment breakpoints at the different sequencing depths.
Clonal genotypes were estimated from the somatic SNVs using the Single-Cell Genotyper (SCG) (Additional file 1: Note), and their recall across sequencing depth was measured with the adjusted Rand Index, a version of the Rand Index corrected for chance. The Rand Index is a popular statistical measure of the similarity between two data clusterings (corresponding here to groups of mutations, or clones). In addition, clonal trees were inferred from the somatic SNVs with OncoNEM. Using a similar approach to Ross and Markowetz, the pairwise cell shortest-path distance was used to measure the consistency in tree reconstruction across the different sequencing depths. Furthermore, maximum-likelihood single-cell phylogenies were estimated from the SNVs using SiFit. In this case, phylogenetic recall across sequencing depth was measured using the standard Robinson-Foulds tree distance. We also calculated the homoplasy index (HI), a measure of the amount of homoplasy on a tree, using the phangorn R package. The HI is one minus the ratio between the minimum number of changes required and the actual number observed.
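Two of the measures named above have simple closed forms. The sketch below is a plain-Python illustration of the adjusted Rand Index and of the homoplasy index as defined in the text; it is not the code used in the study (the HI was computed with phangorn in R), and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of items placed together in both clusterings (contingency table cells).
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both clusterings are trivial
    return (together - expected) / (maximum - expected)


def homoplasy_index(min_changes, observed_changes):
    """HI = 1 - (minimum number of changes required / number observed on the tree)."""
    if observed_changes <= 0 or min_changes > observed_changes:
        raise ValueError("expected 0 < min_changes <= observed_changes")
    return 1.0 - min_changes / observed_changes


# Toy clusterings: identical up to a relabeling, then statistically unrelated.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_diff = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
hi_example = homoplasy_index(80, 100)
```

An adjusted Rand Index of 1 indicates identical clusterings regardless of label names, values near 0 indicate chance-level agreement, and an HI of 0 indicates a tree with no homoplasy.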
Statistical significance for the differences in recall or HI in the experiments described above was assessed using Tukey’s HSD test with a family-wise error rate of 0.05 in R. See the Additional file 1: Note for a detailed description.
Interestingly, a considerable number of somatic variants were detected exclusively in the single cells (i.e., absent in the bulk), particularly at higher sequencing depths (Additional file 1: Figure S1A). However, the overall variant quality scores for these calls were much lower than for those shared with the bulk dataset (Additional file 1: Figure S1B), suggesting that most of them might be untrustworthy.
COSMIC and non-synonymous SNV detection
In this study we aimed to characterize the impact of sequencing depth in single-cell cancer genomics studies. Admittedly, the five datasets used here each have specific characteristics, such as number of mutations, number of clones, tissue of origin, genomic target, sequencing depth, and amplification bias. Consequently, although some general patterns seem reasonably clear, care must be taken in generalizing our findings, as particular trends may vary for other cancer datasets.
With this caveat in mind, our downsampling experiments suggest that, overall, larger sequencing depths for small numbers of cells (eight or fewer) might lead to relevant improvements. In contrast, for relatively large datasets (25 or more cells), our results indicate that sequencing single cells at moderate depths (i.e., 5×) should represent a reasonable approach to characterize the genomic diversity and evolution of tumors, including the identification of putative driver alterations. This is in line with the results of Zhang et al., who showed that for variant detection it is better to have multiple cells sequenced at low depth, given a fixed sequencing effort.
Unsurprisingly, all recalls (SNVs, CNVs, clones, phylogenies) decreased to some extent at smaller sequencing depths. In many cases the drop was statistically significant despite being of small magnitude. Notably, for the larger datasets (and by large here we mean only dozens of cells), the impact of sequencing depth was much smaller, with the exception of the H65 dataset. This particular dataset, despite being the largest, displays very heterogeneous genome coverage across the single cells sampled, which may have misled some of the analyses. Indeed, genome coverage bias has been shown to contribute to a lower sensitivity to detect variants, hence potentially explaining some of the somewhat discordant results of the H65 dataset.
In any case, bulk germline SNVs were relatively easy to identify for the three largest datasets even at low sequencing depth. This was indeed expected since germline variants should be present in the vast majority, if not all, of tumor cells. Nevertheless, when the number of single cells was small, the effect of sequencing depth on germline SNV recall was much more pronounced and reached a limit of ~ 75% at the highest sequencing depth (i.e., 47×) reinforcing the idea that, due to the inherent bias in single-cell genome amplification, broader sampling effort should be favored over increased sequencing depth in variant detection analysis .
While somatic SNVs were much more difficult to detect, it should be highlighted that the number of somatic mutations detected at 5× was usually of the same order of magnitude as the number detected at higher sequencing depths, except for the smaller datasets. Still, even for the smallest dataset analyzed (W4), the number of somatic SNVs detected at 5× (7406) seems sufficient to conduct many subsequent analyses, such as clonal inference or phylogeny reconstruction.
In relation to this, it is important to highlight that, aside from sample size and sequencing depth, somatic variant detection can additionally be affected by the choice of thresholds during variant calling. Indeed, conservative thresholds may prevent the discovery of true mutations due to excessive filtering, whereas relaxed thresholds may cause an increase of false-positive calls. Determining the best parameters for filtering variants is, therefore, difficult. Most studies analyzing SC-Seq data have relied on “hard” filtering thresholds for a minimum depth of coverage (e.g., > 10 reads; e.g., ). Here, a similar filtering strategy would prove too stringent for most down-sampled datasets. To allow proper comparisons among the different depth levels we decided not to use a minimum depth threshold. Instead, we required each variant to be detected in at least two single cells. Such a consensus strategy has already been shown to be quite efficient [9, 18].
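The filtering rule described above (keep a variant only if it is detected in at least two single cells) can be sketched as a simple support count. This is an illustration under stated assumptions rather than the authors' pipeline: the input is assumed to be a mapping from a cell identifier to the variants called in that cell, and all names and toy data are hypothetical.

```python
from collections import Counter


def consensus_variants(calls_per_cell, min_cells=2):
    """Keep variants that are called in at least `min_cells` distinct cells.

    calls_per_cell: dict mapping a cell identifier to an iterable of variant keys.
    """
    support = Counter()
    for variants in calls_per_cell.values():
        support.update(set(variants))  # each cell counts once per variant
    return {variant for variant, n in support.items() if n >= min_cells}


# Toy call sets, invented for illustration.
toy_calls = {
    "cell_1": ["1:100:A>G", "2:200:C>T"],
    "cell_2": ["2:200:C>T", "5:300:G>A"],
    "cell_3": ["2:200:C>T", "5:300:G>A", "5:300:G>A"],
}
kept = consensus_variants(toy_calls)
```

Unlike a minimum read-depth threshold, this criterion does not become more stringent as depth decreases, which is why it allows comparisons across depth levels.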
Remarkably, the somatic single-cell SNV precision was, in general, very robust to sequencing depth, suggesting that lower depths do not result in new calls that would not have been made at higher depths. Intuitively, this observation makes sense since at lower sequencing depths the variants detected tend to be the clonal ones (i.e., variants shared by the majority of the single cells sampled), whereas the detection of low-frequency mutations requires higher read depths (data not shown).
One might be worried, however, about missing putative driver mutations, but our results suggest that, as long as the number of single cells is reasonably large (here 25 or more), most COSMIC somatic variants can be detected at modest sequencing depths (here 5× or more). Similar results were also observed for the somatic non-synonymous variants, suggesting that, in principle, many relevant variants in single-cell genomes are likely to be detected at modest sequencing depths.
Obviously, assigning particular genotypes to the individual cells is a much more involved task than just detecting variants. Importantly, for SNV genotyping, reducing sequencing depth generally resulted in an increased amount of missing data in the single-cell genotype matrix, rather than different genotype calls.
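The distinction drawn above, between entries that become missing and entries that change genotype, can be quantified by comparing two genotype matrices entry by entry. The sketch below assumes a cells-by-sites matrix stored as nested lists, with `None` marking missing data; the representation, the function name, and the toy matrices are assumptions made for illustration.

```python
MISSING = None  # assumed marker for a missing genotype


def compare_genotypes(full_depth, down_sampled):
    """Count concordant, discordant, and newly missing genotype entries.

    Both inputs are cells x sites matrices (lists of equal-length rows).
    Entries already missing at full depth are skipped.
    """
    if len(full_depth) != len(down_sampled):
        raise ValueError("matrices must have the same number of cells")
    counts = {"concordant": 0, "discordant": 0, "newly_missing": 0}
    for row_full, row_down in zip(full_depth, down_sampled):
        if len(row_full) != len(row_down):
            raise ValueError("matrices must have the same number of sites")
        for g_full, g_down in zip(row_full, row_down):
            if g_full is MISSING:
                continue
            if g_down is MISSING:
                counts["newly_missing"] += 1
            elif g_full == g_down:
                counts["concordant"] += 1
            else:
                counts["discordant"] += 1
    return counts


# Toy matrices: 0 = reference, 1 = variant, None = missing.
full = [[0, 1, 1], [1, None, 0]]
down = [[0, None, 1], [None, None, 1]]
summary = compare_genotypes(full, down)
```

A pattern in which "newly_missing" grows while "discordant" stays low corresponds to the behavior reported here for reduced sequencing depth.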
Moreover, and in agreement with previous studies [20, 29], CNV characterization from single cells was also very robust to sequencing depth, with all down-sampled datasets showing remarkable preservation of CNV breakpoints. Furthermore, CNV genotype assignment was insensitive to the variation in the sequencing depths explored. In general, the copy-number analysis of single-cell libraries can be confounded by amplification bias. However, previous studies suggest that amplification biases are randomly distributed and sufficiently separated throughout the genome  as to not affect CNV calling at the level of resolution chosen here (500-kb bins). Popular single-cell amplification methods like multiple displacement amplification (MDA) and multiple annealing and looping-based amplification cycles (MALBAC) usually generate amplicons of around 10–100 kb and 1–5 kb, respectively; therefore, we do not expect many false positive CNV calls . Yet, we acknowledge that our choice of bin size may have prevented the identification of small CNVs .
It is relatively well established that an accurate identification of clonal genotypes can be very important to understand tumor dynamics and genomic architecture [32, 33, 34]. For the datasets analyzed here, our results suggest that SC-Seq depth does not affect the identification of tumor clones when the genomic variability between malignant cells is small (i.e., displaying limited clonal population genetic diversification). However, the same was not true for tumors comprising a larger number of subclones, where the different clonal genotypes were only distinguishable at higher sequencing depths. While these results are not necessarily surprising, as clonal identification remains a complex problem even for bulk sequencing data [35, 36], they seem to suggest that higher sequencing coverage is ultimately required to resolve fine-scale clonal structure in more heterogeneous tumors.
Finally, in our evolutionary analyses, we observed a moderate impact of sequencing depth with respect to the estimated phylogenetic relationships of the inferred clones and single cells. Perhaps due to the uncertainty stemming from significant amounts of missing data, datasets down-sampled to 1× resulted in phylogenetic trees with healthy cells intermingled with tumor cells, which can be safely considered as artifacts. While the amount of homoplasy was lower at 1×, this was likely an effect of the smaller amount of variant calls per cell at such a low depth. Otherwise, tree topologies at 5× seemed quite similar to those inferred at higher depths, suggesting that relatively few clonal variants might be enough to resolve the topology of the single-cell trees. Note that the topology does not include branch lengths, whose accurate estimation might require higher sequencing depths.
Single-cell DNA sequencing is expected to be key to obtain accurate inferences of the clonal architecture of tumor samples, which shall ultimately prove crucial to compare models of cancer evolution, trace cell lineages, measure mutation rates, and decipher cell clones responsible for metastatic dissemination and drug resistance [2, 37, 38]. While recent experimental and analytical advances have improved the quality of single-cell DNA sequencing data [9, 18, 20, 21, 25, 39, 40, 41], the costs associated with sequencing multiple single-cell genomes or exomes at high depths are still largely prohibitive. Our results support the idea that sequencing multiple individual tumor cells at a modest depth, such as 5×, may help circumvent this limitation, at least for the type of analyses implemented here. Finally, the results obtained here might extend, to some extent, to non-tumor single-cell genomes.
We would like to thank Sereina Rutschmann, Harald Detering, Laura Tomás, and Sara Rocha for their comments on earlier versions of the manuscript. We also thank the anonymous reviewers for their useful suggestions.
This work was supported by the European Research Council (ERC-617457-PHYLOCANCER awarded to D.P.) and by the Ministry of Economy and Competitiveness, MINECO (BFU2015-63774-P awarded to D.P.). D.P. receives further support from the Galician government.
Availability of data and materials
A list of the public single-cell datasets analyzed is in Additional file 1: Table S1. The downsampled replicates are available from the authors on request.
DP conceived the project. DP designed and JMA performed the analyses. JMA and DP wrote the manuscript. Both authors read and approved the final manuscript.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
- 13. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013;1303.3997v1.
- 14. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43:11.10.1–33.
- 17. Picard software. http://broadinstitute.github.io/picard. Accessed 12 Apr 2018.
- 31. Sherman MA, Barton AR, Lodato MA, Vitzthum C, Coulter ME, Walsh CA, et al. PaSD-qc: quality control for single cell whole-genome sequencing data using power spectral density estimation. Nucleic Acids Res. 2017; https://doi.org/10.1093/nar/gkx1195.
- 33. Kuipers J, Jahn K, Beerenwinkel N. Advances in understanding tumour evolution through single-cell sequencing. Biochim Biophys Acta. 2017;1867:127–38.
- 35. Turajlic S, McGranahan N, Swanton C. Inferring mutational timing and reconstructing tumour evolutionary histories. Biochim Biophys Acta. 2015;1855:264–75.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.