Background

Aneuploidy, the phenomenon in which genomes acquire or lose chromosomal fragments, has been causally implicated in a wide variety of human diseases such as neuropsychiatric disorders and cancer [1,2,3]. Genetic and phenotypic plasticity resulting from aneuploidy evolution causes treatment resistance and disease recurrence [4,5,6], which fundamentally challenges current medicine. Recent studies have shown that not only diseased tissues, but also pathologically normal tissues may contain a high degree of somatic mosaicism (e.g., peripheral blood [7] and esophagus [8]). Therefore, defining which copy number alterations (CNAs) cause pathogenesis and which are part of normal variation becomes increasingly important in genome medicine, especially for cancer [9, 10].

Various efforts have been made to obtain comprehensive knowledge of CNAs responsible for cancer diagnostics, prognostics, and targeted therapeutics. Systematic CNA analysis in over 10,000 primary tumor samples in The Cancer Genome Atlas (TCGA) and 2500 samples in the International Cancer Genome Consortium (ICGC) revealed distinct CNA landscapes in different cancer types [11,12,13]. Comparison of CNAs among autologous tumors obtained at different stages and from different histologies revealed that CNAs are critical for tumor evolution across time and space. However, studies based on bulk tissue samples cannot fully depict the history of tumor evolution, which occurs at single-cell resolution [14], and thus have limited power to discover the associated genetic drivers.

Recent advances in single-cell DNA sequencing (e.g., the tagmentation-based approach [15] and the Single Cell CNV Solution by 10x Genomics) have enabled large-scale acquisition of single-cell copy number (SCCN) profiles in tens of thousands of cells at around 100-kb resolution (~ 0.1X sequencing coverage per cell) [16,17,18,19]. Other platforms such as single-cell RNA-sequencing [20, 21] and single-cell ATAC-sequencing [22] have also been utilized for SCCN profiling. A set of bioinformatic tools has been developed to call SCCN profiles, taking into consideration various confounding factors [23,24,25].

These SCCN profiles not only present a rich pool of genetic perturbations that are invisible at tissue level, but also potentiate reconstruction of cellular lineage, based on which the impact of an allele on cellular fitness can be measured. Thus, statistical approaches that integrate cellular lineage tracing with population genetic analysis [26] can enable discovery of novel disease genes and mechanisms of disease progression.

So far, studies performing retrospective lineage tracing from single-cell data have largely utilized phylogenetics approaches designed to model species evolution, which is quite different from cellular evolution in terms of duration, scale, genetics, and dynamics [27, 28]. Many existing phylogenetics approaches assume that genomic sites evolve independently and follow the so-called infinite site assumption (ISA) [29]. But in the context of aneuploidy, a genomic site can often be altered repeatedly by different CNAs, due partly to constraints on genome and chromatin structures, properties of DNA replication/repair [30], and functional selection. To apply conventional maximum parsimony approaches to SCCN data, one has to over-segment genomic regions and represent copy numbers as characters in disjoint intervals, which misrepresents the properties of DNA and distorts evolution propensity across copy number states. Other conventional methods using Euclidean, Hamming, or correlational distances also misrepresent the segmental, non-linear nature of CNA evolution [31], leading to inaccurate inference of tree topology and branch lengths.

A few new phylogenetics approaches have been developed to tackle these limitations by introducing a new distance metric called Minimal Event Distance (MED), which postulates the minimal number and the series of single-copy gains or losses that are required to evolve one genome to another. Particularly, the MEDICC [32] algorithm infers a copy number phylogenetic tree from the allelic copy number profiles of a set of samples. However, the problem is NP-hard [33]. Even the simplified solutions could be applied to only tens of genomes and are not scalable to current single-cell datasets consisting of thousands of cells. Zeira et al. [34] proposed a linear-time solution to the problem based on an integer linear programming (ILP) formulation, but no tool was released.

Having the new distance and efficient tree inference algorithms was a good step forward, but it remains unclear how to identify functional variants, given a cell phylogenetic tree. Intuitively, functional variants affecting cellular fitness should lead to altered variant allele frequencies in the descendant populations, as implicated by previous multiregional tumor phylogenetics studies [10, 35]. However, mathematical procedures [28, 31] have not been developed to quantify the impact of a genomic alteration over a phylogenetic tree, taking into account factors such as sparsity in cell population sampling, multiplicity in subset partitioning, and propensity of the alteration at a particular genomic location [36].

Results

Overview of the methods

To address these challenges, we propose a new computational framework that performs lineage tracing from SCCN data and detects significant focal (gene resolution) and broad (chromosomal-arm resolution) CNAs associated with lineage expansion (Fig. 1).

Fig. 1

Algorithm flowchart and evaluation. a Algorithm for constructing a MEDALT. b Identification of non-random fitness-associated CNAs in an individual sample. c Identifying non-random fitness-associated CNAs in a cohort of samples. d Identifying parallel/convergent evolution CNAs in an individual sample. e AUC of FAA identification based on 100 synthetic datasets without noise. MP, maximum parsimony tree. NJ, neighbor-joining tree. ML, maximum likelihood tree

The SCCN profiles are represented as an integer-valued matrix using previously published approaches [16, 18], in which each row represents a cell and each column a chromosomal region. We then deduce the minimal number and the series of single-copy gains or losses (i.e., minimal event distance) that are required to evolve the genome of one cell to the next (Additional file 1: Fig. S1a) using an efficient greedy algorithm that is similar to, and has the same asymptotic bound as, that of Zeira et al. [34] (see the “Methods” section and Additional file 1: Table S1).

We then infer a directed minimal spanning tree, named Minimal Event Distance Aneuploidy Lineage Tree (MEDALT, Fig. 1a), using an adapted version of Edmonds' algorithm that scales polynomially with respect to the number of cells (see the “Methods” section and Additional file 1: Table S2). In a MEDALT, each node represents a cell, each edge represents a kinship between two cells, arrows point towards younger cells, and the root represents a normal diploid cell.

MEDALT allows a genomic region to be repetitively altered by multiple single-copy gains or losses. It provides a parsimonious interpretation, the minimal number of single-copy gains or losses that may have led to the evolution of the entire cell population.

An important constraint is that chromosomal fragments cannot be recovered if completely lost. To reflect that property, the MEDs originating from cells containing homozygous copy number loss are set to infinity.

Since MEDALT describes copy number evolution by segments instead of sites, we expect that it will enable more accurate cellular lineage tracing than do conventional phylogenetics methods (Additional file 1: Fig. S1b; see the “Methods” section).

We further establish a statistical routine, named Lineage Speciation Analysis (LSA), to prioritize CNAs and genes that are non-randomly associated with lineage expansion and thereby have potential functional impact.

To perform LSA, we first iteratively partition cells into lineages (subsets) based on the topology of the lineage tree. For each CNA region in each candidate lineage, we calculate a cumulative fold level (CFL) as the summation of the copy number levels in constituent cells (Fig. 1b and Additional file 1: Fig. S1c). We then assess the statistical significance of the observed CFL with respect to a background distribution established from random lineages of similar sizes (the same or the closest size) obtained from a permutation process (see the “Methods” section; Fig. 1b and Additional file 1: Fig. S1d). The permutation process randomly reassigns SCCN profiles, chromosome by chromosome, to different cells 1000 times and reconstructs a lineage tree from each permuted dataset using the same lineage tracing algorithm. For each lineage from the real data, at least 1000 lineages of similar size (the same or the next closest size) from the permuted trees are selected, since multiple lineages of similar size may exist in each permuted tree. It is important to account for background variations induced by factors unrelated to cellular fitness, such as high CNA prevalence at fragile sites or repeats that are non-functional, as shown in previous studies [36, 37], and to account for bias of lineage tracing algorithms. We used three additional statistical approaches as controls, which estimate background distributions without reconstructing trees from permuted SCCN data (see the “Methods” section). LSA clearly outperformed other approaches for identifying CNAs that are non-randomly associated with lineage expansion (Additional file 1: Fig. S1e). The efficiency of the MEDALT algorithm, which is linear with respect to the number of cells and genome size (Additional file 1: Fig. S2), makes it possible to perform a large number of permutations in order to obtain a reasonably accurate background distribution.
The statistically significant CNAs and genes so identified may not be causal themselves, but are associated with (e.g., co-occur with) causal fitness-impacting alterations. Thus, LSA distills the massive genome-wide SCCN data into a compact molecular blueprint, consisting of CNAs/genes occurring non-randomly at important moments during the course of the evolution with significant impact on the fitness of the descendant cells.
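The permutation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names `permute_by_chromosome` and `empirical_p_value`, the per-column chromosome labels, and the add-one pseudo-count in the p value are all hypothetical choices made for this example. The sketch tests the upper tail (amplification-like enrichment); a deletion test would presumably use the lower tail.

```python
import random


def permute_by_chromosome(profiles, chrom_of_bin, rng):
    """Shuffle copy-number segments across cells, one chromosome at a time.

    profiles: list of per-cell integer copy-number lists (rows = cells).
    chrom_of_bin: chromosome label for each column (assumed input format).
    rng: a random.Random instance, so that runs are reproducible.
    """
    n_cells = len(profiles)
    permuted = [list(row) for row in profiles]
    for chrom in sorted(set(chrom_of_bin)):
        cols = [j for j, c in enumerate(chrom_of_bin) if c == chrom]
        donors = list(range(n_cells))
        rng.shuffle(donors)  # independent shuffle for every chromosome
        for cell, donor in enumerate(donors):
            for j in cols:
                permuted[cell][j] = profiles[donor][j]
    return permuted


def empirical_p_value(observed, background):
    """One-sided empirical p value: share of background values >= observed.

    The +1 pseudo-count (an assumption of this sketch) avoids reporting p = 0.
    """
    hits = sum(1 for x in background if x >= observed)
    return (hits + 1) / (len(background) + 1)
```

In a full analysis, each permuted matrix would be passed through the same lineage tracing step, and the CFLs of size-matched lineages from the permuted trees would form `background`.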

LSA can also be applied at cohort level to analyze single-cell data obtained from multiple patient samples. In that setting, the method creates meta-lineages combining cells from different patients and prioritizes events non-randomly occurring across background lineages established over the entire cohort (Fig. 1c, Additional file 1: Fig. S1f and see the “Methods” section). Genes that are altered nonrandomly in multiple patients will likely have higher scores than those altered in a single patient.

Additionally, LSA can be applied to prioritize CNAs associated with parallel/convergent evolution [38] (abbr. PLSA) by estimating the chance of a CNA occurring nonrandomly in two or more parallel lineages, as a consequence of positive selection (Fig. 1d, Additional file 1: Fig. S1g and see the “Methods” section). This opens a new way for gene discovery that was substantially underpowered in bulk sample studies.

In silico evaluation

To evaluate our approaches, we simulated copy number evolution in single cells using a Markov process parameterized by cell fitness parameters (Additional file 1: Fig. S3a and b; see the “Methods” section) [39]. Spiked in randomly were fitness-associated alterations (FAAs), which indicate fitness change in a cell triggering subsequent lineage expansion. Synthetic SCCN profiles were created mimicking various CNA mechanisms such as genome doubling, breakage-fusion-bridge (BFB), tandem duplication, terminal deletion, and unbalanced translocation [30]. We created 100 simulated datasets, each containing around 200 cells. Besides obtaining MEDALTs, we also obtained phylogenetic trees using conventional maximum likelihood (ML), maximum parsimony (MP), and neighbor joining (NJ) approaches (see the “Methods” section). In addition, we ran GISTIC [37] (see the “Methods” section), a method developed to prioritize CNAs in tissue samples by treating the cells as unrelated samples.

We then performed FAA detection in each dataset by performing LSA on individual trees inferred by various methods. We compared the detection performances using area under receiver operating characteristic curves (AUC; see the “Methods” section). Overall, the MEDALT approach achieved substantially better detection performance than the other methods (Fig. 1e). The benefits appeared robust over a range of cell numbers, when we repeated the benchmarking on subsets of the cells via random down-sampling, until the number of cells dropped below 60. It appeared that at least 30% of the cells were required to recapitulate the major population structure in this simulation irrespective of the algorithms (Fig. 1e).
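For readers unfamiliar with the AUC metric used in this benchmark, the sketch below computes it from its rank-based definition: the probability that a true FAA receives a higher score than a non-FAA, with ties counted as one half. The function name `roc_auc` and the choice of score (for example, the negative log of the LSA p value) are assumptions of this illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: higher means more likely to be a fitness-associated alteration.
    labels: truthy for true FAAs, falsy otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))
```

The quadratic pairwise loop is chosen for clarity; it is adequate for a few hundred candidate CNAs per dataset.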

We further dissected the contribution of each of the 3 steps in our approach, i.e., MED, MEDALT, and LSA, to the final performance of FAA detection. Compared to MED, the MEDALT and LSA steps contributed more to the final performance (Additional file 1: Fig. S3c). Therefore, although MED can be affected by noise in the SCCN data, the net effects appeared limited (Additional file 1: Fig. S3d).

Detecting fitness-associated CNAs in disease cohorts

We applied our methods to the single-cell DNA-sequencing data acquired from 20 triple-negative breast cancer patients (TNBCs, Additional file 1: Table S3) [16, 18]. SCCN profiles, which were generated using a variable binning method followed by circular binary segmentation (CBS) [40], were downloaded from the original publications (Additional file 1: Fig. S4; see the “Methods” section). We obtained both MEDALTs and phylogenetic trees for each sample and ran LSA to identify non-random alterations at both sample and cohort levels.

We then compared the accuracy of the trees in inferring cellular timing using data from 4 patients with longitudinal pre-, mid-, and post-treatment (neoadjuvant chemotherapy) samples. We found that MEDALTs ordered cells much more consistently with their biopsy timing than did the phylogenetic trees (Additional file 1: Fig. S5), with pre-treatment cells appearing near the root and post-treatment cells near the leaves.

Consistent with previous studies [16, 18], most of the TNBC samples appeared to have developed through branched evolution via multiple parallel lineages. Interestingly, the MEDALTs indicated that these parallel lineages may have distinct mutation rates (Fig. 2a and b, Additional file 1: Fig. S6), which may be attributable to variable degrees of DNA damage repair (DDR) loss (Fig. 2b; see the “Methods” section) [41]. Indeed, when we performed gene set enrichment analysis on genes identified by LSA, we found that the lineages of higher CNA rates have more DDR genes affected by the CNAs than the lineages of lower CNA rates (Additional file 1: Fig. S7).

Fig. 2

Application to scDNA-seq data from TNBCs. a The MEDALT inferred from patient t1. The widths of the edges are drawn proportional to the MEDs. Colors (yellow, blue and green) highlight the 3 main branches. b The relationship between CNA numbers (Y-axis) and the depth on the tree (or distance to the root, X-axis). The barplot shows the fraction of DDR genes among the genes with copy number losses in the 3 lineages. c ROC curve for identifying functionally important, broad CNAs in the literature (Table S4). d The gene knockout effect scores of the gene sets (cohort LSA test p value < 0.001) identified based on the MEDALTs, MP trees, and GISTIC in 29 breast cancer cell lines. Included as controls are 100 sets of 197 randomly selected genes, 234 oncogenes, and 269 tumor suppressors (TS) in oncoKB and intOGen. The overall survival (OS) (e) and the progression-free survival (PFS) (f) of RBBP8 loss in TCGA breast cancer patients. g OS of RBBP8 loss in the METABRIC

We identified fitness-associated CNAs at chromosomal and gene resolution using cohort-level LSA (p value < 0.001; see the “Methods” section). For benchmarking, we also performed the same LSA on the MP trees. We also ran GISTIC [37] on the pseudo-bulk copy number profiles generated by averaging the SCCN profiles across the cells in each sample (see the “Methods” section).

Overall, the MEDALT plus LSA approach identified 30 broad CNAs, 80% of which have been functionally associated with breast cancer development and treatment outcome in the literature (Additional file 1: Table S4). The accuracy was at least 13% higher than the results derived by the other methods (Fig. 2c; see the “Methods” section). We independently performed the LSA at gene resolution, focusing on 448 genes from 11 oncogenic pathways including Notch, PI3K, Hippo, RTK/RAS, MYC, cell cycle, p53, Nrf2, Wnt, TGFB, and DDR defined in TCGA Pan-can atlas research [41, 42]. Our approach identified 197 genes, including 109 amplified and 88 deleted genes (Additional file 2). In contrast, the MP plus LSA approach identified 130 genes, 82 of which were amplified and 48 deleted. GISTIC identified 60 genes, 33 of which were amplified and 27 deleted.

By examining the CRISPR knockout screen data in 29 breast cancer cell lines in the DepMap database [43], we found that the 109 amplified genes identified by the MEDALT plus LSA approach had significantly lower gene knockout effect scores than those of the 82 amplified genes detected based on the MP trees (one-sided Wilcoxon rank-sum test, p = 2.75 × 10^−9) and of the 33 genes detected by GISTIC (one-sided Wilcoxon rank-sum test, p = 6.65 × 10^−17) (Fig. 2d). The scores were also significantly lower than those of oncogenes (one-sided Wilcoxon rank-sum test, p = 1.12 × 10^−15) and tumor suppressors (one-sided Wilcoxon rank-sum test, p = 2.81 × 10^−16) reported in the oncoKB [44] and intOGen [45] databases, which are not specific to TNBC, and sets of randomly selected genes of identical size (one-sided Wilcoxon rank-sum test, p = 8.97 × 10^−21). The scoring differences among the sets of deleted genes were not significant, likely owing to challenges in calling deletions from noisy low-coverage data and in quantifying deleterious effects in lineages of limited cell numbers.

Among the 197 genes MEDALT nominated, some are not reported in the oncoKB [44], COSMIC [46], and intOGen [45] databases (Additional file 3) but supported by functional genomics data in large-scale cancer patient studies (Additional file 1: Fig. S8a). For example, loss of RBBP8 indicated worse prognosis among the breast cancer patients in TCGA and those in the METABRIC [47] (Fig. 2e to g). RBBP8 is a potentially interesting target as it interacts with BRCA1 and modulates its function in transcriptional regulation, DNA repair, and/or cell cycle checkpoint control [48]. In addition, loss of PPP4R1 indicated worse prognosis in TCGA and the METABRIC as well (Additional file 1: Fig. S8b to d).

In addition, we identified 107 genes that were likely positively selected (PLSA p value < 0.001, Additional file 4) by convergent evolution in 7 of the 20 patients (Fig. 3a), by performing PLSA on the MEDALTs derived from individual patients. Among these, 65 genes were amplified. By repeating the same PLSA on the MP trees, we identified 355 genes, 252 of which were amplified. The set of 65 genes identified from the MEDALTs had significantly lower gene knockout effect scores (thus more essential) than those of the set of 252 genes identified from the MP trees (one-sided Wilcoxon rank-sum test, p = 4.07 × 10^−9), of known oncogenes (one-sided Wilcoxon rank-sum test, p = 2.81 × 10^−16) and of sets of randomly selected genes (one-sided Wilcoxon rank-sum test, p = 9.01 × 10^−21), based on the CRISPR screens of the 29 breast cancer cell lines in the DepMap [43] (Fig. 3b). No significant scoring differences were found between the deleted genes identified from the MEDALTs and those identified from the MP trees, although both sets appeared more essential than the sets of known tumor suppressors and randomly selected genes.

Fig. 3

Convergent evolution in TNBCs. a Genes associated with convergent evolution in the 20 TNBC patients. Labeled at the top are known cancer genes. AMP, copy number amplification; DEL, copy number deletions; MIX, genes with amplifications and deletions. b The gene knockout effect scores of the gene sets identified from the MEDALTs and the MP trees by PLSA (p value < 0.001). Included as controls are 100 sets of 107 randomly selected genes, 234 oncogenes and 269 tumor suppressors (TS) in oncoKB and intOGen. c The MEDALT of patient KTN102. Orange boxes highlight lineages under potential convergent evolution. d Progression-free survival of breast cancer patients with and without FAAP24 loss in TCGA

Among the 107 genes identified by PLSA, 42% were known cancer genes, a fraction higher than what we obtained from the cohort-level single-lineage LSA (38%, Additional file 1: Fig. S8e). Loss of FAAP24 appeared in two distinct lineages in patient KTN102 and was associated with worse progression-free survival (PFS) in TCGA breast cancer data (Fig. 3c and d). Loss of BRCA1 was also found in two parallel lineages, which were depleted of cells from the post-treatment sample (Fig. 3c). That observation may be explained by the fact that BRCAness tumors often respond to neoadjuvant chemotherapy [49, 50].

Applications on single-cell RNA sequencing data

Our approaches are likely beneficial to characterizing SCCN data derived from single-cell RNA sequencing (scRNA-seq) experiments. To examine that possibility, we collected data obtained from paired primary and metastasis (or relapse) samples of a variety of cancer patients, including 6 head and neck squamous cell carcinoma (HNSCC) [20], 8 multiple myeloma (MM), 2 oral squamous cell carcinomas (OSCC) [51], and 4 ovarian cancer patients (OV) [52] (Additional file 1: Table S5).

We obtained SCCN profiles from the scRNA-seq data using the inferCNV program [53], which derives CNAs by comparing the expression intensity of genes across positions of the tumor genome with that in a set of normal cells. We calculated average copy number levels in non-overlapping genomic 30-gene windows to infer MEDALT (see the “Methods” section). We then obtained a MEDALT for each patient, including cells in both the primary and the metastasis samples. For comparison, we also performed analysis for each patient using Monocle v3.0 [54], which was designed to reconstruct the transcriptomic (and phenotypic) trajectory of a developing cell population. Since the cells in the primary samples were most likely born before the cells in the metastasis (or relapse) samples, they should be arranged closer to the root of the lineage trees. Indeed, in the MEDALTs, the cells from the primary samples were placed significantly (one-sided Wilcoxon rank-sum test, p = 0.0098) closer to the root than the cells from the metastasis (or relapse) samples (Fig. 4a). In contrast, the pseudotime estimated by Monocle did not significantly (one-sided Wilcoxon rank-sum test, p = 0.51) delineate the two types of cells (Fig. 4b). Meanwhile, cells in the MEDALT lineages had more homogeneous SCCN profiles than those in the Monocle clusters (Fig. 4c and Additional file 1: Fig. S9; see the “Methods” section). The result from this experiment indicated that our approaches are potentially more accurate in characterizing genome evolution from cancer scRNA-seq profiles than approaches that are designed for transcriptomic trajectory reconstruction. This may not be entirely surprising, as DNA copy number data have proven useful for cancer cell chronology inference [12], while RNA data are known to be subject to complex transcriptional regulation.

Fig. 4

Application to scRNA-seq data from cancer patients. a Average distance to root of the cells in the primary samples (X-axis) and those in the metastasis/relapse samples (Y-axis) estimated from the MEDALTs. b Average pseudotime of the cells in the primary samples (X-axis) and those in the paired metastasis/relapse samples (Y-axis) estimated by Monocle. c Pearson’s correlation coefficients between the SCCN profiles of the cells in the same lineages dissected from the MEDALTs (Y-axis) versus those in the same cell states clustered by Monocle (X-axis). Each dot in a, b, and c represents a cancer patient. All the p values were estimated by one-sided Wilcoxon rank-sum tests. d Average DepMap gene knockout effect scores of the gene set identified by MEDALT and those by Monocle from the 20 patients. Also included as controls are 100 sets of 75 random genes, 234 oncogenes, and 269 tumor suppressors (TS) in oncoKB and intOGen

We performed cohort-level LSA on the MEDALTs for the gene set estimated by inferCNV and identified 75 fitness-associated genes (Additional file 5, p value < 0.001), which included 45 amplified and 30 deleted genes from the 20 patients. In contrast, Monocle identified 3412 differentially expressed genes between the cell clusters.

We found that the amplified genes identified by our approach are significantly more essential than those identified by Monocle (one-sided Wilcoxon rank-sum test, p = 2.35 × 10^−186; Fig. 4d), based on the CRISPR screens of 524 cancer cell lines in the DepMap [43].

Discussions

Advances in single-cell technologies present new challenges and opportunities for making biological discoveries. Single-cell studies often involve large numbers of cells, which are powerful at characterizing cellular heterogeneity, but small numbers of biological samples, which are underpowered for discovering common disease genes. Recent genome-wide association analysis has shown that it is possible to enable new discovery by performing association analysis at cell-type resolution [55]. For cancer and genetic diseases driven by somatic mutations, being able to obtain genetic footprints at various times and under various conditions can enable discovery of genes responsible for disease progression and resistance to therapy.

However, it remains unclear what analytical strategies should be deployed to achieve the benefits. This becomes even more challenging when CNAs are considered, as CNAs affect large regions of the genome and are difficult to trace using phylogenetics methods.

In our study, we demonstrated that it is possible to achieve the benefit by reconstructing copy number evolution history as a lineage tree, i.e., MEDALT, and performing permutation-based statistical analysis, i.e., LSA, to identify fitness-associated CNAs and genes.

We have learned several important lessons in our study.

First, it is important to perform accurate lineage tracing. Although the single-copy gain and loss model that we implemented in deriving MEDALTs is limited in complexity, it already performed substantially better than conventional phylogenetics algorithms such as MP, which assumes infinite sites, and NJ, which employs naïve distance metrics, as shown in our simulation and in real data analysis. It is conceivable that further development of methodology that incorporates more complex genome evolution mechanisms such as chromothripsis [56] can lead to better results.

An important goal was to represent convergent evolution, which is likely prevalent when viewed through the lens of CNAs [10, 57]. Conventional phylogenetics algorithms strictly prohibit the expression of convergent evolution by disallowing an alteration to occur multiple times in a course of evolution [28]. Several new algorithms relaxed this limitation but were designed for analyzing point mutation data [58]. As shown in our analysis of the TNBC patients, genes identified based on convergent evolution analysis (i.e., PLSA) had an even higher fraction of known cancer genes than those identified based on cohort-level single-lineage LSA. Our result suggests that examining convergent evolution is likely a key component towards fully unleashing the power of single-cell studies.

Unlike canonical phylogenetic trees, MEDALTs are minimal spanning trees that do not contain unobserved internal ancestral nodes. Representing evolution using minimal spanning trees instead of phylogenetic trees was our deliberate choice, as it allowed us to develop polynomial-runtime solutions that are scalable to real datasets containing thousands of cells. It also allowed us to conveniently implement the biologically meaningful MED and enforce directionality constraints. Phylogenetics algorithms are likely effective when the numbers of cells are small and the alterations are simple to trace. Neither of these conditions applies to available SCCN datasets, which have CNAs evolving non-linearly in hundreds of cells. Moreover, we have shown in our simulation that for the purpose of detecting fitness-associated alterations, our method outperformed phylogenetics approaches over a wide range of sample sizes.

A particular challenge in developing and evaluating computational lineage tracing methods is the lack of exact ground truth. Although various experimental technologies have been developed [59, 60], we are not aware of any that can be applied to trace copy number evolution in patient samples. To circumvent this, we utilized in silico simulation that mimics several prevalent CNA mechanisms to evaluate the accuracies of the reconstructed lineages and fitness-associated alterations. We also utilized longitudinal datasets, for which we knew the biological stages of the cells, to evaluate the chronological accuracy of the inference results. Although these strategies are unlikely to be sufficient to validate all the edges and lengths in the trees, they are objective and sufficient to discriminate among the various approaches.

Second, it is important to control biases in statistical inference. It is challenging to detect fitness-associated genes, as CNAs often affect a large number of genes and the sample sizes are often small. Passenger CNAs that occur naturally in non-functional regions such as those near fragile sites or repeats could easily cloud the discovery. In addition, lineage tracing algorithms are unlikely to be perfect and could introduce distinct biases. To address these challenges, we employed LSA, which randomly permutes SCCN profiles into different cells to reduce the biases introduced by background genomic variations and technical noise. We also reconstructed trees from permuted datasets to alleviate biases introduced by the lineage tracing algorithms. The evolutionarily meaningful MED metric and constraints help our analyses to focus on biologically relevant hypotheses, given limited computational resources. These procedures appeared important for achieving the reported accuracy. Further exploration of different ways to permute the data and to estimate the background distribution will likely lead to better results.

We assessed the functional impact of the identified genes using cell-line CRISPR essentiality screen data. We confirmed that the set of fitness-associated, amplified genes discovered by our methods are significantly more essential than other control gene sets in cancer cell lines. We also nominated novel genes that appear to have prognostic values in TCGA and the METABRIC datasets. These assessment strategies likely have false positives and negatives. Further comprehensive, well-controlled and targeted experiments will likely be required to fully assess the functional impact and clinical values of these genes.

Lastly, it was exciting to observe benefits of our methods on both the scDNA-seq and the scRNA-seq data. Although RNA-derived copy number profiles may not be as accurate as those derived from DNAs, previous studies [61] suggested that they can reasonably distinguish tumor clones. Our study further revealed the value of scRNA-seq data in lineage tracing and supported the notion that genomic profiles, even approximations, are more accurate than transcriptomic profiles in determining biological timing of cells. Our results opened doors towards utilizing scRNA-seq as a platform to understand genetics underlying developmental processes and perform gene discovery.

Conclusions

In this study, we describe two innovative algorithms: MEDALT, which traces tumor lineage evolution based on MED, and LSA, which discovers genetic drivers associated with lineage expansion. We examined the algorithms using synthetic datasets, longitudinal scDNA-seq data obtained from TNBC patients, and scRNA-seq data of HNSCC, MM, OSCC, and OV patients. Compared to conventional algorithms, our approach effectively improves both lineage tracing and the discovery of fitness-associated alterations.

Methods

Inferring minimal event distance

We use a modified parsimony scoring method to score the distance between two copy number profiles, which can be considered as non-negative integer arrays. We assume a copy number alteration (CNA) (event) can affect adjacent genomic regions (one single entry or k adjacent entries in array) by increasing or decreasing their values by 1. We define the minimal event distance (MED) between two arrays a and b to be the minimal number of CNAs needed to transition from a to b (Additional file 1: Fig. S1a).

We propose a greedy algorithm (Additional file 1: Table S1) that is guaranteed to find an optimal solution in O(m) time (Additional file 6), where m is the size of the array [34]. We add the restriction that the MED is infinite if the copy number at any site goes from 0 to any other number. In addition, an amplification cannot span a site with 0 copy number. This differs from Zeira et al., who utilized a zero-skipping rule [34].
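To make the distance concrete, the following Python sketch counts unit events between two profiles. It is a simplified illustration under stated assumptions, not the algorithm of Additional file 1: Table S1: it enforces the rule that a 0 copy number cannot become non-zero, but it does not model the restriction that amplifications cannot span zero-copy sites. The function name `minimal_event_distance` is hypothetical.

```python
import math

def minimal_event_distance(a, b):
    """Count unit copy-number events needed to turn profile a into profile b.

    Simplifying assumption (not from the paper): apart from the
    0 -> non-zero rule, an event may cover any contiguous run of sites.
    """
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    # Lost DNA cannot be regained, so such a transition is impossible.
    if any(x == 0 and y != 0 for x, y in zip(a, b)):
        return math.inf
    events = 0
    previous = 0  # difference to the left of the first site (zero padding)
    for x, y in zip(a, b):
        diff = y - x
        # Each upward step in the difference starts a gain or ends a loss,
        # so counting upward steps counts every event exactly once.
        if diff > previous:
            events += diff - previous
        previous = diff
    # Upward step back to the zero padding on the right (closes trailing losses).
    if previous < 0:
        events += -previous
    return events
```

For example, turning `[2, 2, 2, 2]` into `[3, 3, 1, 1]` takes one gain over the first two sites and one loss over the last two. The single pass over the array is consistent with the O(m) runtime stated above.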

Constructing Minimal Event Distance Aneuploidy Lineage Tree (MEDALT)

The optimal aneuploidy lineage tree is a rooted directed minimal spanning tree (RDMST), i.e., the rooted tree with the least total number of CNAs. We use an implementation of Edmonds' algorithm to infer the RDMST (Additional file 1: Table S2). Our algorithm runs in O(VE) time, where V is the number of nodes and E is the number of edges. This is approximately O(n^3), where n is the size of the node set.
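The contraction scheme behind Edmonds' algorithm can be sketched as follows. This is an illustrative Python version, not the implementation in Additional file 1: Table S2. It returns only the total weight of the RDMST; recovering the tree edges needs extra bookkeeping that is omitted here. The function name and the `(parent, child, weight)` edge format are assumptions, and edges with infinite MED are assumed to have been left out of the input.

```python
import math

def rdmst_cost(n_nodes, root, edges):
    """Total weight of the minimum spanning arborescence rooted at `root`.

    edges: iterable of (parent, child, weight) with nodes numbered 0..n_nodes-1.
    Returns math.inf if some node cannot be reached from the root.
    """
    edges = list(edges)
    total = 0
    while True:
        # Step 1: pick the cheapest incoming edge of every non-root node.
        best = [math.inf] * n_nodes
        pred = [-1] * n_nodes
        for u, v, w in edges:
            if u != v and v != root and w < best[v]:
                best[v], pred[v] = w, u
        best[root] = 0
        if any(math.isinf(w) for w in best):
            return math.inf
        # Step 2: add the chosen weights and look for cycles among chosen edges.
        comp = [-1] * n_nodes
        seen = [-1] * n_nodes
        n_comp = 0
        for start in range(n_nodes):
            total += best[start]
            x = start
            while x != root and seen[x] != start and comp[x] == -1:
                seen[x] = start
                x = pred[x]
            if x != root and comp[x] == -1:
                # x lies on a new cycle: give all its nodes one component id.
                y = pred[x]
                while y != x:
                    comp[y] = n_comp
                    y = pred[y]
                comp[x] = n_comp
                n_comp += 1
        if n_comp == 0:
            return total  # the chosen edges already form a tree
        for v in range(n_nodes):
            if comp[v] == -1:
                comp[v] = n_comp
                n_comp += 1
        # Step 3: contract each cycle and re-weight edges entering it.
        edges = [(comp[u], comp[v], w - best[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n_nodes = comp[root], n_comp
```

Each round scans every edge once and there are at most as many rounds as nodes, which matches the O(VE) bound quoted above.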

Lineage speciation analysis

We propose a statistical routine named lineage speciation analysis (LSA), which performs permutation tests on the topology of MEDALT or phylogenetic trees to identify CNAs that are non-randomly associated with cellular lineage expansion in a developmental process. In LSA, we start from the root node and iteratively remove edges to obtain all possible lineages (subsets of cells). For the i-th lineage, we calculate a cumulative fold level (CFL) for the j-th CNA event that sums together the copy number alteration level in constituent cells (Additional file 1: Fig. S1c).

$$ {CFL}_{ij}=\sum \limits_{k=1}^K{CN}_{ijk} $$
(1)

where CNijk is the copy number level of the j-th CNA event in the k-th cell of the i-th lineage and K is the number of cells in the lineage.

We treat amplifications and deletions separately so that a region can be amplified in some samples but deleted in others. This is necessary because some oncogenes and tumor suppressors are located in close proximity and can be binned into the same regions.
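As an illustration of Eq. (1) with amplifications and deletions kept apart, the sketch below sums per-bin alteration levels over the cells of one lineage. It assumes that the "copy number alteration level" is the deviation from a diploid baseline of 2. That baseline, the function name, and the list-of-lists input format are assumptions made for this example rather than details given in the text.

```python
def cumulative_fold_levels(lineage_profiles, baseline=2):
    """Return (amplification CFLs, deletion CFLs), one value per genomic bin.

    lineage_profiles: list of integer copy-number profiles, one per cell.
    baseline: assumed neutral copy number (hypothetical default of 2).
    """
    n_bins = len(lineage_profiles[0])
    amp = [0] * n_bins
    deletion = [0] * n_bins
    for profile in lineage_profiles:
        for j, cn in enumerate(profile):
            if cn > baseline:
                amp[j] += cn - baseline       # gain level above baseline
            elif cn < baseline:
                deletion[j] += baseline - cn  # loss level below baseline
    return amp, deletion
```

Keeping two vectors means that a bin gained in some cells and lost in others contributes to both scores instead of cancelling out.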

We estimate the statistical significance of an observed CFL by comparing its value to a background distribution obtained through permutation (Additional file 1: Fig. S1d). In the default mode, SCCN data are randomly shuffled by chromosomes into different cells. They are not further shuffled by sites within each chromosome, because chromosomal context plays an important role in determining where and how a CNA occurs.
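The by-chromosome shuffle can be sketched as below. This is a minimal illustration that assumes the SCCN data are held as a nested dictionary `{cell: {chromosome: [copy numbers]}}`. The data layout and the function name are hypothetical.

```python
import random

def permute_by_chromosome(sccn, seed=None):
    """Reassign each chromosome's profile to a randomly chosen cell.

    sccn: {cell: {chrom: [copy numbers]}}; every cell has the same chromosomes.
    The sites within a chromosome stay together, so chromosomal context is kept.
    """
    rng = random.Random(seed)
    cells = list(sccn)
    chroms = list(sccn[cells[0]])
    permuted = {cell: {} for cell in cells}
    for chrom in chroms:
        donors = cells[:]
        rng.shuffle(donors)  # an independent shuffle per chromosome
        for cell, donor in zip(cells, donors):
            permuted[cell][chrom] = list(sccn[donor][chrom])
    return permuted
```

Because each chromosome is shuffled independently, a permuted cell is a mosaic of chromosomes drawn from different real cells, while every within-chromosome profile is left intact.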

In order to obtain an empirical background distribution, we permute the SCCN data 1000 times and construct a lineage tree for each permuted SCCN dataset (Additional file 1: Fig. S1d). Similar to the process for the real tree, we dissect each permuted tree into a collection of lineages. For each lineage from the real tree, we select the lineages in the permuted trees of identical (or very similar) size. If a permuted tree contains no lineage with the same number of cells, we select the lineage with the next closest size. Thus, for each real lineage, at least 1000 lineages from permuted trees are selected. We compute the CFLs of each CNA event in these selected lineages using Eq. (1) and construct the corresponding background distribution to calculate an empirical p-value (tail probability) of the observed value:

$$ p=\frac{\sum \limits_{r=1}^RI\left({S}_r\ge {S}_o\right)+1}{R+1} $$
(2)

where R is the number of background lineages from the permutation data, I(·) is the indicator function, and Sr and So are, respectively, the CFLs of the CNA event in the permutation and the real data.
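Eq. (2) translates directly into a few lines of Python. The sketch below uses a hypothetical function name and takes the background CFLs as a plain list. The added 1 in the numerator and the denominator keeps the estimate strictly above zero even when no background value reaches the observed one.

```python
def empirical_p_value(observed, background):
    """Tail probability of `observed` against background values, as in Eq. (2)."""
    exceed = sum(1 for s in background if s >= observed)
    return (exceed + 1) / (len(background) + 1)
```

With 999 background values that all fall below the observed CFL, the smallest attainable p-value is 1/1000.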

To evaluate the performance of LSA for controlling biases in statistical inference, we estimated the significance using three additional ways:

  (1) Rather than reconstructing a tree from each permuted SCCN matrix, estimate CFLs of cells from the real lineage using the by-chromosome-permuted SCCN matrix from the real tree.

  (2) Same as (1) except using the SCCN matrix permuted by chromosomal bins within each cell (similar to GISTIC) instead of by chromosomes across different cells.

  (3) One-sided Wilcoxon signed-rank test to estimate whether the level of a CNA is significantly higher/lower in cells from a lineage than in those from other lineages in the same tree.

For (1) and (2), as in LSA, we construct a background distribution of CFLs and estimate an empirical p-value using Eq. (2).

Cohort-level LSA

In a cohort containing multiple individuals, we can estimate whether a recurrent CNA identified at individual level occurs non-randomly at the cohort (population) level. To do so, we construct meta-lineages by merging lineages dissected from different individuals and calculate a CFL for each meta-lineage through Eq. (1). We then estimate a statistical significance for each observed CFL through Eq. (2), based on a background distribution obtained from corresponding meta-lineages derived from individually permuted trees in the entire cohort (Additional file 1: Fig. S1f).

Identifying parallel evolution event

Lineage speciation analysis (LSA) can also be used to identify the potential presence of parallel (also known as convergent) evolution, an analysis we refer to as PLSA, i.e., finding CNAs that occur independently in multiple parallel lineages during the evolution of a cell population (Additional file 1: Fig. S1g). We can assess the statistical significance of such events using the same permutation framework. Instead of examining each lineage independently, we deploy an algorithm that exhaustively searches for parallel lineages that are formed by disjoint sets of cells with identical CNAs or genes.

We then estimate the probability of observing such multi-lineage CNAs over random chance through permutation (as described above, Additional file 1: Fig. S1g):

$$ p=\frac{\sum \limits_{\mathrm{r}=1}^RI\left({L}_r\ge {L}_o\right)+1}{R+1} $$
(3)

where Lr and Lo are, respectively, the numbers of lineages containing the CNA of interest in the r-th permuted tree and in the real tree, with Lo ≥ 2, and R is the number of permutations. In this analysis, only CNAs that tested positive in the LSA are further considered for the PLSA.

Simulating single-cell copy number evolution

Simulating cell birth-and-death process

In order to evaluate the accuracy of copy number lineage reconstruction, we implement a Markov process to simulate cell growth under the influence of CNAs [39, 62]. The simulation starts from an ancestor cancer cell, which divides and dies at rates b and d, respectively. All descendant cells have the same division and death rates as their ancestors, unless they are mutated.

The cell growth dynamics follow the differential equation:

$$ \frac{dn(t)}{dt}=b\bullet n(t) $$
(4)

where n(t) is the number of cells at time t. We assume that there are one root and 2 children after the first division: n(0) = 1, n(1) = 2. Solving Eq. (4) gives n(t) = n(0)·exp(bt), so these conditions lead to b = ln 2 ≈ 0.69 as the initial value.

The time interval ∆t between any two jumps in a continuous-time Markov process is exponentially distributed with mean E(∆t) = 1/(b + d) [63]. Here, we assume E(∆t) = 1, so the death rate is d = 1 − b. When a jump occurs, it results in a birth with probability b/(b + d) or a death with probability d/(b + d). This cell birth-and-death process can be depicted as a rooted directed tree in which nodes are cells.

We simulated 100 independent runs, each of which has a population size of 200 cells.
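A birth-and-death process of this kind can be sketched as below. This is a minimal illustration and not the simulation code used in the study. It assumes that each jump affects one uniformly chosen living cell and that a dividing cell is replaced by two daughters. Both choices, like the function name and return format, are assumptions made for the example.

```python
import math
import random

def simulate_birth_death(target_size=200, b=math.log(2), seed=None):
    """Grow a cell tree until `target_size` cells are alive or the clone dies out.

    Returns (parent, alive, elapsed): `parent` maps each cell id to its mother
    (None for the root), `alive` lists the living cells, and `elapsed` is the
    simulated time.
    """
    rng = random.Random(seed)
    d = 1.0 - b                              # death rate, so that b + d = 1
    parent = {0: None}
    alive = [0]
    elapsed = 0.0
    while alive and len(alive) < target_size:
        elapsed += rng.expovariate(b + d)    # waiting time with mean 1 / (b + d)
        cell = alive.pop(rng.randrange(len(alive)))
        if rng.random() < b / (b + d):
            # Birth: the chosen cell is replaced by two daughters.
            for _ in range(2):
                daughter = len(parent)
                parent[daughter] = cell
                alive.append(daughter)
        # Otherwise the chosen cell dies and simply stays out of `alive`.
    return parent, alive, elapsed
```

A run can end with an empty `alive` list when the clone goes extinct early. One option, assumed here rather than taken from the text, is to discard such runs and repeat with a new seed until the target population size is reached.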

Simulating the occurrence of CNA events

CNAs accumulate among tumor cells at an appreciable rate [64]. The CNAs in a cell at time ti include not only the alterations it inherits from its parent, but also those newly acquired from ti − 1 to ti (Additional file 1: Fig. S3a). We assume that the CNA rate per site/region takes one of several levels μ ∈ {0.02, 0.05, 0.1, 0.15, 0.2} [32] and determine the number of CNAs (K) accumulating in ∆t based on a Poisson distribution (Additional file 1: Fig. S3a):

$$ K\sim Poisson\left(\lambda =\Delta t\ast \mu \ast G\right) $$
(5)

where G is the total number of sites/regions in the genome. In our simulation, we set G = 100.

Simulating genomic structural rearrangements

We assume that CNAs can be generated by various types of genomic structural rearrangements (GSR), such as terminal deletion (TER), interstitial deletion (DEL), unbalanced translocation (UT), tandem duplication (TD), inverted duplication (ID), and breakage fusion bridge (BFB) [30]. In addition, different GSRs could occur at different rates in cancer [65, 66]. Thus, we determine the numbers of various GSRs based on a multinomial distribution [32].

$$ X=\left\{{x}_1,{x}_2,\dots, {x}_K\right\}\sim Multi\left({p}_{TER},{p}_{DEL},{p}_{UT},{p}_{TD},{p}_{ID},{p}_{BFB}\right) $$
(6)

where we empirically set pTER = pDEL = 0.1, pUT = 0.15, pTD = 0.5, pID = 0.05, pBFB = 0.1. We also required that \( K=\sum \limits_{k=1}^K{x}_k \) during the period of ∆t (Additional file 1: Fig. S3a).

Simulating the location of a CNA

CNAs affect contiguous sites/regions in a chromosome. They often exhibit two modes: (1) focal, affecting a relatively small (< 1 Mb) region [67], and (2) broad, encompassing large chromosomal regions (e.g., chromosomal arms) [68]. Broad CNAs often result from chromosomal mis-segregation during mitosis [64], which is a hallmark of cancer. Both focal and broad CNAs are important in oncogenesis. While broad CNAs often manifest through dosage effects [13], focal CNAs often target driver genes directly and result in protein structural changes [69].

We determined the size r of a CNA in X by sampling a zero-truncated Geometric distribution:

$$ g\left(r,p\right)=p\bullet {\left(1-p\right)}^{r-1} $$
(7)

where r is the number of genomic sites/regions that a CNA occupies and p the probability that a region is affected by the CNA (Additional file 1: Fig. S3a). We set p = 0.5 in our simulation.

We encode the simulated CNAs as sequences of non-negative integers in corresponding cells (Additional file 1: Fig. S3b). Our model allows single-copy gains and losses. A copy number gain increases the corresponding values by 1 and a copy number loss decreases the values by 1 (Additional file 1: Fig. S3b).
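The three sampling steps of Eqs. (5) to (7) can be combined in one short sketch. It is illustrative only: it draws each event's type with the stated probabilities, and it assumes that start positions are uniform and that sizes are capped at the genome length, neither of which is specified in the text. The function names are hypothetical, and the mapping from GSR type to gain or loss is deliberately left out.

```python
import math
import random

# Empirical GSR probabilities quoted with Eq. (6).
GSR_PROBS = {"TER": 0.1, "DEL": 0.1, "UT": 0.15, "TD": 0.5, "ID": 0.05, "BFB": 0.1}

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the modest rates used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_size(p, rng):
    """Zero-truncated geometric draw: P(r) = p * (1 - p) ** (r - 1), r >= 1."""
    u = 1.0 - rng.random()                   # in (0, 1], avoids log(0)
    return 1 + int(math.log(u) / math.log(1.0 - p))

def sample_cna_events(dt, mu, n_sites=100, p=0.5, seed=None):
    """Return a list of (gsr_type, start, size) events acquired during dt."""
    rng = random.Random(seed)
    k = sample_poisson(dt * mu * n_sites, rng)                  # Eq. (5)
    kinds = rng.choices(list(GSR_PROBS),
                        weights=list(GSR_PROBS.values()), k=k)  # Eq. (6)
    events = []
    for kind in kinds:
        size = min(sample_size(p, rng), n_sites)                # Eq. (7), capped
        start = rng.randrange(n_sites - size + 1)               # assumed uniform
        events.append((kind, start, size))
    return events
```

With μ = 0.1, G = 100 and ∆t = 1, the expected number of events per interval is λ = 10, and with p = 0.5 the expected event size is 1/p = 2 sites.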

Simulating fitness-associated alterations

Some CNAs may themselves alter the fitness of a cell, or occur simultaneously with the driver mutations. We call them fitness-associated alterations (FAAs). We simulate the occurrence and the impact of FAAs in the evolution. At each generation, we determine whether an FAA occurs through a Bernoulli distribution (p = 0.5). If an FAA occurs, we randomly select τ cells to carry the FAA, where τ follows a binomial distribution \( B\left(\zeta, p=\frac{1}{\zeta}\right) \) and ζ is the number of cells in the generation. The selected cells increase their birth rates by s, which follows a uniform distribution U(0, 1).

In order to estimate the effects of noise, we added noise at different levels to the simulated copy number profiles based on a Poisson(λ) model, where λ represents the mean number of randomly selected bins with increased or decreased CN values (by 1) in each cell. We set λ = 0, 2, 4, 6, 8, 10, with 0 being no noise and 10 corresponding to 10% of the genome.

Constructing phylogenetic trees

We construct phylogenetic trees using the R package phangorn [70], which implements widely used versions of the maximum parsimony, neighbor-joining, and maximum likelihood approaches. To apply the maximum parsimony approach, the SCCN data are re-segmented by the collection of breakpoints detected in each cell, so that each column in the data matrix corresponds to a genomic interval that is uninterrupted by any GSR in any cell. The GSR breakpoints in individual cells are determined by the R package copynumber under default parameters. To apply the neighbor-joining approach, Hamming distances are calculated for each pair of SCCN profiles. To apply the maximum likelihood approach, random trees are chosen as the initial solutions.

Estimating the accuracy of lineage partitioning

The cell birth-and-death process we simulate can be expressed as a rooted directed minimal spanning tree (RDMST). To compare RDMSTs with phylogenetic trees, we convert RDMSTs into dendrograms, which are fully comparable with the phylogenetic trees in that observed cells are represented as leaves in both types of representations [71]. From each dendrogram or phylogenetic tree, we calculate a metric, termed lineage partitioning accuracy (LPA), which measures how accurately cells are partitioned into lineages (subsets). Given a dendrogram, we perform lineage partitioning as follows:

We iteratively remove each branch in the dendrogram to obtain all the bi-partitions, i.e., the two disjoint subsets resulting from removing a branch. Each subset corresponds to a cellular lineage. Each lineage can be described as a binary sequence l = {c1, c2, ⋯, cN}, where ci = 1 if the i-th cell is in lineage l and ci = 0 otherwise.

In the simulation experiments, the lineages partitioned from the simulated cell growth trees are considered as the ground truth. The LPA of a given MEDALT or phylogenetic tree is calculated as the fraction of lineages that exist in the ground truth over the total number of predicted lineages.
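The partitioning and the LPA ratio can be sketched as follows. The example assumes that a dendrogram is stored as a `{node: parent}` dictionary with `None` marking the root and that cells are the leaves. This representation and the function names are assumptions made for illustration.

```python
def lineages(parent):
    """Set of leaf subsets obtained by removing each branch of the tree.

    parent: {node: parent node, or None for the root}; cells are the leaves.
    """
    children = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)
    leaves = frozenset(node for node in parent if node not in children)

    def leaves_below(node):
        if node not in children:
            return frozenset([node])
        return frozenset().union(*(leaves_below(c) for c in children[node]))

    result = set()
    for node, par in parent.items():
        if par is None:
            continue
        below = leaves_below(node)
        # Removing the branch above `node` splits the leaves into two subsets.
        for part in (below, leaves - below):
            if part:
                result.add(part)
    return result

def lineage_partitioning_accuracy(predicted_parent, truth_parent):
    """Fraction of predicted lineages that also occur in the ground truth."""
    predicted = lineages(predicted_parent)
    truth = lineages(truth_parent)
    return len(predicted & truth) / len(predicted)
```

For four cells, a predicted tree that pairs A with C and B with D against a truth that pairs A with B and C with D still recovers the eight single-cell and three-cell lineages, but not the two-cell ones, giving an LPA of 8/10.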

Accuracy of FAA detection in simulation

We randomly spike in FAAs in the simulation experiments, which are used as the ground-truth to assess the accuracy of the MEDALT and the phylogenetic trees. For each CNA, we calculate its p value through LSA and identify the minimal p value over all the lineages containing the CNA. We use − log(minimal p) as the prediction score. We then characterize the accuracy of each approach on FAA detection using AUC values, which are calculated by tallying the positive and the negative hits at various prediction score cutoffs from 0 to the maximal values.
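A compact way to obtain the AUC from prediction scores is the pairwise-comparison form, which equals the area under the ROC curve traced by sweeping the score cutoff. The sketch below uses that form rather than explicit cutoff tallying. It also assumes a natural logarithm for the score, which does not affect the AUC because the transform is monotone. Function names are hypothetical.

```python
import math

def prediction_score(p_values):
    """Score of one CNA: -log of its minimal p-value over all lineages."""
    return -math.log(min(p_values))

def auc(scores, is_faa):
    """Probability that a true FAA outscores a non-FAA (ties count as 1/2)."""
    positives = [s for s, y in zip(scores, is_faa) if y]
    negatives = [s for s, y in zip(scores, is_faa) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

An AUC of 1.0 means every spiked-in FAA scores above every other CNA, while 0.5 corresponds to random ranking.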

Identifying significant CNAs using GISTIC

We apply the GISTIC algorithm on the simulated and the real SCCN datasets to identify significant CNAs [37]. The following steps are taken:

  i) Calculate the occurrence frequency (f) and the amplitude (∆) of each alteration

  ii) Define a G-score as a function of f and ∆: G = f × log2(∆ + 2)

  iii) Assess the statistical significance of each alteration by comparing the observed G-score to a background distribution of G-scores obtained from permuted (by regions) copy number profiles

On the simulated datasets, we regard each cell as an individual sample and apply GISTIC at the cell level.

On the TNBC dataset, we average the SCCN profiles across the cells in each patient sample to create a pseudo-bulk copy number profile for each sample. We then run GISTIC on these pseudo-bulk profiles to identify significant CNAs, similar to how GISTIC is applied in the TCGA studies.

Integer copy number profiles from single-cell DNA sequencing data

The SCCN profile is an integer-valued matrix. The SCCN profiles from single-cell DNA-sequencing data of triple-negative breast cancer are downloaded from the original papers [16, 18], in which they were estimated using a variable binning method, as detailed in previous studies [18, 72]. Briefly, sequencing reads are counted in 11,927 genomic bins with variable start and stop coordinates, which are optimized to receive even read counts across the bins. The median genomic length spanned by the bins is 220 kbp. Cells with < 50 median reads per bin are excluded. Loess normalization is used to correct for GC bias [40]. Copy number profiles are segmented using circular binary segmentation (CBS) [73] followed by MergeLevels [74] to join adjacent segments with non-significant differences in segment ratios (parameters alpha = 0.0001 and undo.prune = 0.05). Integer copy numbers are calculated by scaling segment ratios with the average DNA ploidy determined by flow sorting indexes and rounding to the closest integers [18].

Dissecting MEDALT into disjoint lineages

To characterize CNA rate variation and the genetic organization of a cell population, we dissect it into disjoint lineages (cell subsets) based on the corresponding MEDALT. For each internal node v in the MEDALT, the subtree rooted at v is denoted as Tv, which consists of all the descendants of v. The number of nodes in Tv is denoted as Sv, the size of the subtree. To ignore small lineages that cannot be confidently characterized, we set a minimal subtree size cutoff s (s = 5 in our analysis of the scDNA-seq and the scRNA-seq data) and define an internal node set IV = {v| Sv > s, v ∈ V}, where V represents the node set of the MEDALT. We arrange the nodes in IV in increasing order of subtree size:

$$ IV=\left\{{v}_1,{v}_2,\dots, {v}_k|{S}_{v_1}<{S}_{v_2}<\dots <{S}_{v_k}\right\} $$
(8)

To obtain disjoint lineages, we remove the internal nodes that lead to redundant lineage assignments. For each vi ∈ IV, 1 ≤ i ≤ k, its parent node vj (j > i) should exist in IV. If a parent node vj has more than one child in IV, we remove the parent node vj from IV; otherwise, we remove the child node vi. We iterate through all the nodes in IV until no node can be removed. We then split the MEDALT into subtrees rooted at the nodes remaining in IV. All the nodes that are not yet included are assigned to a control lineage.

Estimation of CNA rate and fraction of DDR loss

We estimate CNA rate in a lineage as the average number of CNAs, i.e., average MED, between the cells in the lineage. DNA damage repair (DDR) genes play key roles in maintaining genome stability. In our analysis, we download the list of DDR genes from Knijnenburg et al.’s study [41], based on which we estimate the proportion of DDR genes with copy number loss in each lineage.

Characterizing chromosomal level CNAs identified in TNBCs

We perform cohort-level LSA per genomic bin at a 220-kb resolution. We define a chromosomal arm as significant if more than half of the bins in the arm are significantly associated with lineage expansion. The average p value of these significant bins is taken as the significance level of the chromosomal arm. In order to benchmark the accuracy of chromosomal (arm) level CNA detection in the TNBC data, we search the biomedical literature exhaustively and create a list of chromosome-arm-level CNAs that have reported relevance to TNBC biology or clinical utility (Additional file 1: Table S4). We treat this list as the ground truth.

For each chromosomal level CNA in a lineage tree, we used the − log(p) estimated via the cohort LSA as its prediction score. We then estimate AUC values, respectively for the MEDALT, the MP, and GISTIC approaches.

Inferring copy number profile from single-cell RNA sequencing data

We use the R package inferCNV (https://github.com/broadinstitute/infercnv) to identify somatic large-scale chromosomal CNAs from single-cell RNA sequencing (scRNA-seq) data [53]. InferCNV detects CNAs by exploring the expression intensity of genes across positions of the tumor genome in comparison to a set of reference “normal” cells. The gene-level CNAs relative to the reference cells are estimated under the default parameters of inferCNV. From the inferred gene-level relative copy number profile, we calculate average relative CNA values in non-overlapping genomic bins, each consisting of 30 genes. Within each bin for each cell, we calculate an integer copy number by multiplying the relative CNA value by 2 (diploid) and then rounding the result to the closest integer.
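The binning and rounding step can be sketched in Python as below. It assumes that the gene-level relative values of one cell are available as a list ordered by genomic position, with 1.0 meaning copy-neutral, and it drops a trailing bin of fewer than 30 genes. The input format, the handling of the last bin, and the function name are assumptions for this illustration and are not taken from inferCNV or from the study's code.

```python
import math

def integer_copy_numbers(relative_values, genes_per_bin=30, ploidy=2):
    """Average relative CNA values per bin, scale by ploidy, round to integers.

    relative_values: gene-level relative copy numbers of one cell in genomic
    order (assumed to be centred on 1.0 for copy-neutral genes).
    """
    bins = []
    last_start = len(relative_values) - genes_per_bin
    for start in range(0, last_start + 1, genes_per_bin):
        chunk = relative_values[start:start + genes_per_bin]
        mean = sum(chunk) / genes_per_bin
        # floor(x + 0.5) rounds halves up instead of to the nearest even integer
        bins.append(int(math.floor(mean * ploidy + 0.5)))
    return bins
```

A bin with a mean relative value of 1.5 therefore maps to 3 copies, and one with a mean of 0.5 maps to a single copy.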

Estimating genetic homogeneity

We compute a metric, called genetic homogeneity level (GHL), to compare the accuracy of MEDALTs with that of Monocle trajectories in tracing genetic evolution from scRNA-seq data. For each cell lineage (subset) partitioned from a MEDALT (see the “Dissecting MEDALT into disjoint lineages” subsection of the “Methods” section), we calculate pair-wise Pearson’s correlation coefficients between all the cells in the lineage, using gene-level copy number profiles inferred by inferCNV. We treat the mean correlation coefficient as the GHL of the lineage. We then average the GHLs across the lineages to obtain an overall GHL of the MEDALT.

Similarly, we calculate a GHL for a Monocle trajectory by averaging cluster-level GHLs estimated from cell clusters defined by the trajectory.
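The GHL calculation can be sketched as follows for either kind of cell grouping. The example assumes that each lineage or cluster is given as a list of gene-level copy number profiles with at least two cells and non-constant values, since Pearson's correlation is undefined otherwise. The function names and the input layout are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def genetic_homogeneity_level(groups):
    """Mean over groups of the mean pair-wise correlation within each group.

    groups: list of lineages (or clusters), each a list of cell profiles.
    """
    per_group = []
    for profiles in groups:
        pairs = list(combinations(profiles, 2))
        per_group.append(sum(pearson(x, y) for x, y in pairs) / len(pairs))
    return sum(per_group) / len(per_group)
```

A higher GHL indicates that the cells grouped together by a tree or trajectory share more similar copy number profiles.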