TNSim: A Tumor Sequencing Data Simulator for Incorporating Clonality Information
In recent years, next-generation sequencing has made it possible to obtain high-resolution landscapes of genetic changes at the single-nucleotide level. A growing number of methods have been proposed for the efficient and effective analysis of cancer sequencing data. A data simulator is a crucial tool for this development: it not only tests and evaluates proposed approaches but also provides feedback for further improvement. Several simulators have been released to generate next-generation sequencing data. However, to the best of our knowledge, none of them considers clonality information. Clonal heterogeneity has been reported to exist widely in tumor samples. Somatic mutational events typically show a wide spectrum of variant allelic frequencies, and some of them are detectable only in one or several clonal lineages. In this article, we introduce a Tumor-Normal sequencing Simulator, TNSim, which generates next-generation sequencing data that incorporate clonality information. The simulator mimics a tumor sample and its paired normal sample, in which germline variants and somatic mutations can be specified separately. Tumor purity is adjustable. The clonal architecture is preassigned as one or more clonal lineages, where each lineage consists of a set of somatic mutations with similar variant allelic frequencies. A group of experiments was conducted to evaluate its performance. The statistical features of the simulated sequencing reads are comparable to those of real tumor sequencing data from samples that consist of multiple sub-clones. The source code is available at http://github.com/lnmxgy/TNSim for academic use only.
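To make the relationship between tumor purity, clonal lineages, and variant allelic frequency concrete, here is a minimal Python sketch. It is not TNSim's implementation. It assumes heterozygous somatic mutations in copy-neutral diploid regions, so the expected frequency is purity × cell fraction / 2, and it models read sampling as a binomial draw. All function names, lineage labels, and parameter values are hypothetical.

```python
import random


def expected_vaf(purity, cell_fraction):
    """Expected variant allelic frequency of a heterozygous somatic mutation.

    Assumption (not from the paper): diploid, copy-neutral region, so one of
    the two alleles carries the mutation in every cell of the lineage.
    """
    if not (0.0 <= purity <= 1.0 and 0.0 <= cell_fraction <= 1.0):
        raise ValueError("purity and cell_fraction must lie in [0, 1]")
    return purity * cell_fraction / 2.0


def simulate_alt_reads(vaf, depth, rng):
    """Draw the number of variant-supporting reads from Binomial(depth, vaf)."""
    return sum(1 for _ in range(depth) if rng.random() < vaf)


def simulate_lineages(purity, lineages, depth, seed=0):
    """Simulate observed frequencies for each clonal lineage.

    `lineages` maps a hypothetical lineage label to a pair
    (fraction of tumor cells carrying the lineage, number of mutations).
    Returns a dict mapping each label to a list of observed frequencies.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = {}
    for label, (cell_fraction, n_mutations) in lineages.items():
        vaf = expected_vaf(purity, cell_fraction)
        observed[label] = [
            simulate_alt_reads(vaf, depth, rng) / depth
            for _ in range(n_mutations)
        ]
    return observed


if __name__ == "__main__":
    # Hypothetical sample: 80% purity, one clonal and one sub-clonal lineage.
    result = simulate_lineages(
        purity=0.8,
        lineages={"clonal": (1.0, 50), "subclone": (0.5, 50)},
        depth=200,
    )
    for label, freqs in result.items():
        print(label, round(sum(freqs) / len(freqs), 3))
```

Under these assumptions, mutations in the same lineage cluster around a shared expected frequency, which is the property the abstract describes as "similar variant allelic frequencies" within a lineage.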
Keywords: Cancer genomics; Cancer sequencing data; Data simulator; Clonal structure
This work is supported by the National Science Foundation of China (Grant No: 31701150) and the Fundamental Research Funds for the Central Universities (CXTD2017003).