ddSeeker: a tool for processing Bio-Rad ddSEQ single cell RNA-seq data
New single-cell isolation technologies are facilitating studies on the transcriptomics of individual cells. Bio-Rad ddSEQ is a droplet-based microfluidic system that, when coupled with downstream Illumina library preparation and sequencing, enables the monitoring of thousands of genes per cell. Sequenced reads show unique features that do not permit the use of freely available tools to perform single cell demultiplexing.
We present ddSeeker, a tool to perform initial processing and quality metrics of reads generated through Bio-Rad ddSEQ/Illumina experiments. Its application to the Illumina test dataset demonstrates that ddSeeker performs better than Illumina BaseSpace software, enabling a higher recovery of valid reads. We also show its utility in the analysis of an in-house dataset including two read sets characterized by low and high sequencing quality. ddSeeker and its source code are available at https://github.com/cgplab/ddSeeker.
ddSeeker is a freely available tool to perform initial processing and quality metrics of reads generated through Bio-Rad ddSEQ/Illumina single cell transcriptomic experiments.
KeywordsSingle-cell transcriptomics scRNA-seq Bioinformatics
Barcode block 1
Barcode block 2
Barcode block 3
Unique molecular identifier
Single cell RNA sequencing
Recent advances in single-cell transcriptome profiling (single cell RNA-seq, scRNA-seq), are improving our understanding of different biological processes, with impact in many areas of research, including the immune system, brain and mammal development and cancer [1, 2]. scRNA-seq techniques are contributing to refine our knowledge of cell types and states [3, 4] and have been successfully used to characterize intratumoral heterogeneity in different tumor types, including glioblastoma , melanoma  and breast cancer . Clinical application of scRNA-seq has been also investigated by various groups. Most of these studies have focused on dissecting the interplay between tumor cell biology and cancer treatment, with the ultimate goal of identifying new treatment hypotheses [6, 8, 9]. A recent example includes the application of scRNA-seq to triple negative breast cancer patients in order to understand clonal evolution in response to chemotherapy .
A variety of computational tools have been designed to address the specific challenges of scRNA-seq data. These involve new normalization methods for dealing with the small number of reads per gene and cell [11, 12], imputation strategies to model the sparsity of the data [13, 14], statistical methods to perform differential expression analysis [15, 16] and clustering techniques to capture cell population heterogeneity  and cell population dynamics .
Several platforms enabling single-cell transcriptome profiling are based on droplet-microfluidic technology, wherein cells are encapsulated into nanoliter droplets then lysed and mRNA is barcoded . Available commercial systems include 10x Genomics Chromium  and Bio-Rad ddSEQ Single Cell Isolator . In particular, ddSEQ is a new platform capable of isolating and barcoding about 300 cells per well. Libraries are then prepared by a specific Illumina kit (SureCell WTA 3’ ) and sequenced. Though in its infancy, the Bio-Rad ddSEQ platform has been successfully used to characterize the molecular heterogeneity of cystic precursor lesions (IPMNs) and its role towards progressive dysplastic changes . ddSEQ data have been also exploited for validating new computational methods [16, 24].
Here, we present ddSeeker, a new tool for the initial processing of data generated through Bio-Rad ddSEQ experiments that shows enhanced performance compared to the available commercial software. Our tool can be integrated in different scRNA-seq pipelines, allowing the users to take advantage of popular pipelines for the processing of scRNA-seq data. Additionally, ddSeeker provides a set of metrics which can assist the user to evaluate the quality of their own data.
the exact positions of the two linkers (L1 and L2) within the sequence of R1 are retrieved;
the distance between the starting positions of the two linkers is verified to be exactly 21 nt;
L1 is verified to start at least 7 nucleotides from the start of the sequence;
the sequences of the two trinucleotides flanking the UMI are verified to be ACG and GAC;
the sequences of the three barcode blocks (BC1, BC2, BC3) and the UMI are extracted based on their position relative to the linkers;
the sequences of the barcode blocks are compared with a predefined list of known barcode blocks, and retrieved only if a match is found.
In order to optimize the accuracy of our pipeline, we also considered insertions and deletions in the analysis of the sequences of linkers and barcodes (step 1 and 6) and at most one mismatch. Indeed, based on the analysis reported in Additional file 1: Figure S1 and Table S1, we observed that events with more than one mismatch, insertion or deletion in linkers represent a small fraction of total events (1.4% for L1 and 1.2% for L2). In step 4 only one mismatch and no indels are permitted due to the short length of sequences to be tested. For each valid read, the cell barcode is defined as the union of the three barcodes (BC1 + BC2 + BC3). Cell identifier and UMI are then associated to the corresponding R2. ddSeeker takes R1 and R2 fastq files as input and outputs an unmapped BAM file containing R2 and corresponding cell barcode and UMI molecular tags. ddSeeker uses as default a Drop-seq tools -like tag scheme (XC for cell barcode and XM for UMI), but different tag schemes can be chosen by the user. For non-valid reads, ddSeeker reports the error identifier in a custom defined tag XE (see Additional file 1 for the description of possible errors). Our tool is written in Python3 using Biopython  and pysam modules . ddSeeker can be integrated with existing scRNA-seq pipelines, including Drop-seq tools, dropEst and scPipe. A detailed description of the algorithm is reported in Additional file 1. Analyses were made using R custom scripts  and plots generated by the ‘Tidyverse’ packages .
Analysis of Illumina test dataset
Quality assessment of ddSeeker’s additionally detected reads
To evaluate the quality of barcodes exclusively identified by our tool, we first studied the cumulative fraction of reads per cell in the matched (i.e., barcodes identified by both BaseSpace and ddSeeker) and ddSeeker-only barcodes. As reported in Fig. 1b, we found an exact overlap between the two curves, both showing the knee at about 350 cells, as expected based on Illumina BaseSpace report. We also investigated whether considering insertions and/or deletions in our pipeline could introduce biases in the quantification of valid barcodes (i.e., the presence of certain barcodes showing more insertions or deletions than expected). As reported in Fig. 1c, we observed that the number of ddSeeker-only reads linearly correlates with the number of valid barcode reads identified by both algorithms (R>0.999).
We then studied how the ddSeeker pipeline impacts on downstream analysis, including read alignment and gene counting. First, we evaluated whether the additionally detected reads mapped equally well to the reference genome. To achieve that, we compared mapping quality values extracted from the Illumina bam file for matched and ddSeeker-only valid barcode reads. We found that 72% of ddSeeker-only valid reads has high mapping quality and, in general, the mapping quality distribution for matched and ddSeeker-only valid reads were markedly similar (Additional file 1: Figure S2). Secondly, we investigated the number of doublets detected using ddSeeker and BaseSpace and observed no difference (n=13 for both pipelines, see Additional file 1: Figure S3), demonstrating that additionally detected reads show high species-specificity. Lastly, we compared gene expression estimations following ddSeeker and BaseSpace pipelines. To calculate gene counts, we used the DigitalExpression tool included in Drop-seq tools. A library size normalization (i.e., gene counts per cell) was applied to quantify gene expression levels. Figure 2d reports the mean gene expression level for the 200 most expressed human genes across the 100 most read cells using ddSeeker and BaseSpace (results for mouse genes are reported in Additional file 1: Figure S4). We obtained high correlation (R>0.999) between ddSeeker and BaseSpace, demonstrating that overall gene expression estimation is not biased by additionally detected reads by ddSeeker.
Application of ddSeeker to in-house dataset
Sequencing run summary of the in-house dataset
Clusters PF (%)
# Cells (expected)
# Cells (ddSeeker)
ddSeeker is, to our knowledge, the first freely-available tool to perform initial processing and quality metrics of reads generated through Bio-Rad ddSEQ/Illumina experiments. We performed a comparative study of ddSeeker and BaseSpace in the analysis of an Illumina test dataset. We showed that ddSeeker was able to identify 5% more reads with valid barcodes and UMI than the Illumina BaseSpace tool. The enhanced ability of ddSeeker in identifying reads with valid barcodes derives from its exclusive feature that implements the analysis of insertions or deletions events in the sequences of linkers and barcodes. Additionally, we demonstrated the reliability of ddSeeker’s additionally detected reads. Our analyses show that downstream analysis is not biased in terms of mapping quality, presence of doublets and gene expression quantification. Finally, we showed the utility of ddSeeker in the analysis of an in-house dataset that includes two different read sets characterized by low and high sequencing quality. To conclude, our analyses suggest that ddSeeker is a valuable tool to perform quality control of Bio-Rad ddSEQ data, and to identify valid reads for downstream scRNA-seq data analysis.
Availability and requirements
Project name: ddSeeker
Project home page: https://github.com/cgplab/ddSeeker
Operating system: GNU/Linux
Programming language: Python 3
This work has been supported by Associazione Italiana per la Ricerca sul Cancro (MFAG n. 14371 to IM and 18880 to LM) for the development of the cell line model, and Fondazione Sandro Pitigliani per la lotta contro i tumori ONLUS and Fondazione Cassa di Risparmio di Firenze (MBE) for generation, sequencing and analysis of single cell RNA-seq experiments.
Availability of data and materials
The Illumina test dataset is available through the Illumina BaseSpace Sequence Hub . The entire in-house dataset is available from the corresponding author upon request. A subset of the dataset is available through a dedicated GitHub repository . The code used to conduct the data analyses of this manuscript is available from the corresponding author upon request.
MBE supervised development of work. MBE, DR and GB wrote the manuscript. DR designed and implemented the software. GB performed single cell experiments. MBO assisted in preparing the single cell libraries. PF, VDS and RB performed sequencing and helped to evaluate the data. ADL, IM and LM helped to evaluate the data and edit the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 3.Trapnell C. Defining cell types and states with single-cell genomics. Genome Res. 2015; 25(10):1491–8. https://doi.org/10.1101/gr.190595.115. http://genome.cshlp.org/content/25/10/1491.full.pdf+html.CrossRefGoogle Scholar
- 4.Saunders A, Macosko E, Wysoker A, Goldman M, Krienen F, Bien E, Baum M, Wang S, Goeva A, Nemesh J, Kamitaki N, Brumbaugh S, Kulp D, McCarroll SA. A single-cell atlas of cell types, states, and other transcriptional patterns from nine regions of the adult mouse brain. bioRxiv. 2018. https://doi.org/10.1101/299081. http://arxiv.org/abs/https://www.biorxiv.org/content/early/2018/04/10/299081.full.pdf.
- 5.Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, Louis DN, Rozenblatt-Rosen O, Suvà ML, Regev A, Bernstein BE. Single-cell rna-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014; 344(6190):1396–401. https://doi.org/10.1126/science.1254257. http://science.sciencemag.org/content/344/6190/1396.full.pdf.CrossRefGoogle Scholar
- 6.Tirosh I, Izar B, Prakadan SM, Wadsworth MH, Treacy D, Trombetta JJ, Rotem A, Rodman C, Lian C, Murphy G, Fallahi-Sichani M, Dutton-Regester K, Lin J-R, Cohen O, Shah P, Lu D, Genshaft AS, Hughes TK, Ziegler CGK, Kazer SW, Gaillard A, Kolb KE, Villani A-C, Johannessen CM, Andreev AY, Van Allen EM, Bertagnolli M, Sorger PK, Sullivan RJ, Flaherty KT, Frederick DT, Jané-Valbuena J, Yoon CH, Rozenblatt-Rosen O, Shalek AK, Regev A, Garraway LA. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell rna-seq. Science. 2016; 352(6282):189–96. https://doi.org/10.1126/science.aad0501. http://science.sciencemag.org/content/352/6282/189.full.pdf.CrossRefGoogle Scholar
- 8.Tirosh I, Venteicher AS, Hebert C, Escalante LE, Patel AP, Yizhak K, Fisher JM, Rodman C, Mount C, Filbin MG, Neftel C, Desai N, Nyman J, Izar B, Luo CC, Francis JM, Patel AA, Onozato ML, Riggi N, Livak KJ, Gennert D, Satija R, Nahed BV, Curry WT, Martuza RL, Mylvaganam R, Iafrate AJ, Frosch MP, Golub TR, Rivera MN, Getz G, Rozenblatt-Rosen O, Cahill DP, Monje M, Bernstein BE, Louis DN, Regev A, Suvà ML. Single-cell rna-seq supports a developmental hierarchy in human oligodendroglioma. Nature. 2016; 539:309.CrossRefGoogle Scholar
- 9.Kim K-T, Lee HW, Lee H-O, Song HJ, Jeong DE, Shin S, Kim H, Shin Y, Nam D-H, Jeong BC, Kirsch DG, Joo KM, Park W-Y. Application of single-cell rna sequencing in optimizing a combinatorial therapeutic strategy in metastatic renal cell carcinoma. Genome Biol. 2016; 17(1):80. https://doi.org/10.1186/s13059-016-0945-9.CrossRefGoogle Scholar
- 14.van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, Bierie B, Mazutis L, Wolf G, Krishnaswamy S, Pe’er D. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018; 174(3):716–72927. https://doi.org/10.1016/j.cell.2018.05.061.CrossRefGoogle Scholar
- 17.Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science. 2014; 344(6191):1492–6. https://doi.org/10.1126/science.1242072. http://science.sciencemag.org/content/344/6191/1492.full.pdf.CrossRefGoogle Scholar
- 20.10x Genomics. https://www.10xgenomics.com. Accessed Oct 2018.
- 21.Bio-Rad. https://www.bio-rad.com. Accessed Oct 2018.
- 22.Illumina. http://www.illumina.com. Accessed Oct 2018.
- 23.Bernard V, Semaan A, Huang J, San Lucas FA, Mulu FC, Stephens BM, Guerrero PA, Huang Y, Zhao J, Kamyabi N, Sen S, Scheet PA, Taniguchi CM, Kim MP, Tzeng C-W, Katz MH, Singhi AD, Maitra A, Alvarez HA. Single cell transcriptomics of pancreatic cancer precursors demonstrates epithelial and microenvironmental heterogeneity as an early event in neoplastic progression. bioRxiv. 2018. https://doi.org/10.1101/306134. https://www.biorxiv.org/content/early/2018/04/26/306134.full.pdf.
- 24.Monocle. http://cole-trapnell-lab.github.io/monocle-release/docs/. Accessed Oct 2018.
- 25.Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets,. Cell. 2015; 161(5):1202–14. https://doi.org/10.1016/j.cell.2015.05.002.CrossRefGoogle Scholar
- 26.dropSeqPipe. http://github.com/Hoohm/dropSeqPipe. Accessed Oct 2018.
- 27.Petukhov V, Guo J, Baryawno N, Severe N, Scadden DT, Samsonova MG, Kharchenko PV. dropest: pipeline for accurate estimation of molecular counts in droplet-based single-cell rna-seq experiments. Genome Biol. 2018; 19(1):78. https://doi.org/10.1186/s13059-018-1449-6. Accessed Oct 2018.CrossRefGoogle Scholar
- 30.Illumina BaseSpace. https://basespace.illumina.com. Accessed Oct 2018.
- 31.Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJL. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25(11):1422–3. https://doi.org/10.1093/bioinformatics/btp163.CrossRefGoogle Scholar
- 34.Wickham H. Tidyverse: Easily Install and Load the ’Tidyverse’. 2017. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse. Accessed Oct 2018.
- 35.FastQC. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed Oct 2018.
- 36.Subset of the In-house Dataset. https://github.com/cgplab/ddSeeker_example_dataset. Accessed Oct 2018.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.