Virtual methylome dissection facilitated by single-cell analyses
Numerous cell types can be identified within plant tissues and animal organs, and the epigenetic modifications underlying such enormous cellular heterogeneity are just beginning to be understood. It remains a challenge to infer cellular composition using DNA methylomes generated for mixed cell populations. Here, we propose a semi-reference-free procedure to perform virtual methylome dissection using the nonnegative matrix factorization (NMF) algorithm.
In the pipeline that we implemented to predict cell-subtype percentages, putative cell-type-specific methylated (pCSM) loci were first determined according to their DNA methylation patterns in bulk methylomes and clustered into groups based on their correlations in methylation profiles. A representative set of pCSM loci was then chosen to decompose target methylomes into multiple latent DNA methylation components (LMCs). To test the performance of this pipeline, we made use of single-cell brain methylomes to create synthetic methylomes of known cell composition. Compared with highly variable CpG sites, pCSM loci achieved a higher prediction accuracy in the virtual methylome dissection of synthetic methylomes. In addition, pCSM loci were shown to be good predictors of the cell type of the sorted brain cells. The software package developed in this study is available in the GitHub repository (https://github.com/Gavin-Yinld).
We anticipate that the pipeline implemented in this study will be an innovative and valuable tool for the decoding of cellular heterogeneity.
Keywords: DNA methylation; Cellular heterogeneity; Nonnegative matrix factorization; Single-cell methylome
DNA methylation plays a key role in tissue development and cell specification. As the gold standard for methylation detection, bisulfite sequencing has been widely used to generate genome-wide methylation data, and computational efforts have been made to meet the statistical challenges in mapping bisulfite-converted reads and determining differentially methylated sites [1, 2, 3, 4]. Methylation data analysis has been extended from simple comparisons of methylation levels to more sophisticated interpretations of the methylation patterns embedded in sequencing reads, that is, the combinatory methylation statuses of multiple neighboring CpG sites.
Through multiple bisulfite sequencing reads mapped to a given genome locus, methylation entropy can be calculated as a measurement of the randomness, specifically the variations, of DNA methylation patterns in a cell population. It was soon realized that such variations in methylation patterns could have resulted from methylation differences: (1) among different types of cells in a mixed cell population, (2) between the maternal and paternal alleles within a cell, or (3) between the CpG sites on the top and bottom DNA strands within a DNA molecule [7, 8, 9]. The genome-wide hairpin bisulfite sequencing technique was developed to determine strand-specific DNA methylation, i.e., methylation patterns resulting from (3). The methylation difference between the two DNA strands is high in embryonic stem cells (ESCs) but low in differentiated cells. For instance, in the human brain, the chance of four neighboring CpG sites having an asymmetric DNA methylation pattern in a double-stranded DNA molecule is less than 0.02%. Allelic DNA methylation, i.e., methylation patterns resulting from (2), was found to be limited to a small set of CpG sites. In the mouse genome, approximately two thousand CpG sites were found to be associated with allele-specific DNA methylation. Thus, cellular heterogeneity could be a primary source of the variations in DNA methylation patterns. This often leads to bipolar methylation patterns, meaning that genome loci are covered simultaneously by completely methylated reads and completely unmethylated reads in bulk methylomes. Such bipolar methylated loci can be detected using nonparametric Bayesian clustering followed by hypothesis testing and were found to be highly consistent with the differentially methylated regions identified among purified cell subsets. For this reason, these loci are called putative cell-type-specific methylated (pCSM) loci. They were further demonstrated to exhibit methylation variation across single-cell methylomes.
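To make the notion of methylation entropy concrete, the following minimal Python sketch computes the Shannon entropy of the read-level patterns at one locus. It is an illustration only: the function name, the encoding of patterns as strings of "1" (methylated) and "0" (unmethylated), and the normalization by the number of CpG sites are assumptions made here, not details taken from this article.

```python
import math
from collections import Counter

def methylation_entropy(patterns):
    """Shannon entropy (in bits) of read-level methylation patterns,
    divided by the number of CpG sites (normalization assumed here)."""
    if not patterns:
        raise ValueError("at least one read pattern is required")
    n_sites = len(patterns[0])
    if any(len(p) != n_sites for p in patterns):
        raise ValueError("all patterns must cover the same CpG sites")
    total = len(patterns)
    counts = Counter(patterns)
    # Sum p * log2(1/p) over the distinct patterns observed at the locus.
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return entropy / n_sites

# Hypothetical loci: one covered by identical reads, one bipolar.
uniform = ["1111"] * 8
bipolar = ["1111"] * 4 + ["0000"] * 4
print(methylation_entropy(uniform), methylation_entropy(bipolar))
```

In this toy example the uniform locus carries no pattern variation, whereas the bipolar locus has a nonzero entropy that arises from only two distinct patterns, which is the signature exploited for pCSM detection.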
An appropriate interpretation of methylome data derived from bulk tissues requires consideration of methylation variations contributed by diverse cellular compositions. With the existing reference methylomes for different types of cells, it is possible to estimate cell ratios in a heterogeneous population with known information about the cell types. For instance, cell mixture distributions within peripheral blood can be assessed using constrained projection, which adopts least-squares multivariate regression to estimate regression coefficients as the ratios for cell types. More recent studies suggest that non-constrained reference-based methods are robust across a range of different tissue types and that Bayesian semi-supervised methods may construct cell-type components in such a way that each component corresponds to a single cell type. For reference-based algorithms, prior knowledge of cell composition and cell-specific methylation markers is critical. To overcome this dependence on prior knowledge, principal component analysis (PCA) was adopted by ReFACTor for the correction of cell-type heterogeneity, and nonnegative matrix factorization (NMF) was adopted by MeDeCom to recover cell-type-specific latent methylation components. However, the performance of such reference-free cell-type deconvolution tools relies heavily on model assumptions. Recently, the development of single-cell DNA methylation sequencing techniques has generated a growing number of methylomes at unprecedented resolution, providing new opportunities to explore cellular diversity within cell populations [21, 22, 23, 24, 25, 26, 27]; yet, no attempt has been made to use single-cell methylomes for cell-type deconvolution analysis.
In this study, we propose a semi-reference-free, NMF-based pipeline to dissect cell-type compositions for methylomes generated from bulk tissues. This pipeline takes advantage of pCSM segments that exhibit bipolar methylation patterns in methylomes generated from bulk tissues or among single-cell methylomes. To overcome the shallow depth of whole-genome bisulfite sequencing, weighted gene co-expression network analysis (WGCNA) was modified to cluster pCSM loci. PCA was performed to select eigen-pCSM loci, which are representative loci for clusters of pCSM loci. To evaluate the performance of eigen-pCSM loci selected in cell-type deconvolution, over 3000 brain single-cell methylomes were mixed in random proportions in simulation studies to create synthetic methylomes. The pipeline implemented in this study provides an accurate estimation of cell-type composition on both synthetic methylomes and bulk methylomes from five neuronal cell populations.
Virtual methylome dissection based on eigen-pCSM loci
The mammalian brain consists of many functionally distinct cell subsets that can contribute to diverse DNA methylation patterns on loci with cell subset-specific methylation. In particular, diverse subpopulations of neurons and glial cells can often be found even within a given brain region. To demonstrate the effectiveness of our procedure, we performed two distinct analyses using synthetic methylomes derived from brain single cells and methylomes from sorted brain cells.
pCSM loci predicted with brain single-cell methylomes
Our first case study took advantage of recent brain single-cell methylomes generated for 3377 neurons derived from mouse frontal cortex tissue (Additional file 1: Table S1). Following our previous procedure for single-cell methylome analysis, we determined the pCSM loci from each single-cell methylome. Briefly, for each methylome, we scanned the sequence reads one by one to identify genomic segments with methylation data for four neighboring CpG sites. To facilitate pCSM identification from the 4,326,935 4-CpG segments identified, we first selected 1,070,952 pCSM candidates that were completely methylated in at least one neuron and completely unmethylated in another. We next applied the beta mixture model to the methylation patterns in single neurons for these candidate segments. In total, 921,565 segments were determined to be pCSM segments with bipolar distributed methylation profiles, while the rest (149,387 segments) had heterogeneous methylation patterns among neurons.
To further explore the functional characteristics of pCSM segments, we merged the overlapping pCSM segments into 347,889 loci (Additional file 2: Table S2) and integrated them with brain histone modification maps. We observed that these pCSM loci were enriched at H3K27ac, H3K4me1, and H3K4me3 peaks and CpG islands with 1.63-, 1.93-, 1.28-, and 1.52-fold increases, respectively (Fig. 2e). In addition, pCSM loci were depleted from repeat regions including SINE, LINE, and LTR. This result suggested that pCSM loci might play important regulatory roles in the brain. For the pCSM loci that overlapped with histone marks for enhancers or promoters, we identified their adjacent genes for functional enrichment analysis using the GREAT analysis tools. As shown in Additional file 3: Figure S1, genes associated with these pCSM loci are significantly enriched in the functional categories for brain development, such as “regulation of synaptic plasticity” and “metencephalon development.” Altogether, these results indicate that pCSM loci showing bipolar methylation among neurons may play important roles in the epigenetic regulation of brain development.
Synthetic methylome: eigen-pCSM loci determination and virtual methylome dissection by NMF
Brain methylome: virtual methylome dissection for neuronal cells
To examine whether the proposed virtual methylome dissection approach can be applied to methylomes generated from tissue samples, we re-analyzed five brain methylomes derived from sorted nuclei, including excitatory (EXC) neurons, parvalbumin (PV)-expressing fast-spiking interneurons, vasoactive intestinal peptide (VIP)-expressing interneurons, and mixed neurons from the cortices of 7-week-old (7wk NeuN+) and 12-month-old (12mo NeuN+) mice. These five methylomes were analyzed separately and together as a mixed pool (Additional file 3: Figure S3A). Between 19,091 and 212,218 pCSM segments were identified in each of the six methylomes. Among the 212,218 pCSM segments identified in the mixed pool, 118,409 segments showed differential DNA methylation states across the five neuronal samples; the other 93,809 segments were also identified as pCSM segments within the five individual methylomes (Additional file 3: Figure S3B). Since a large number of pCSM segments that capture differences among sorted cells can be identified from pooled samples (Additional file 3: Figure S3B), pooling methylomes from sorted cells is a better strategy for pCSM loci identification, particularly when the methylomes have a low read depth.
Next, we asked whether the pCSM segments identified from the pooled methylome could reflect the cell-type-specific methylation patterns derived from single-cell methylomes. Interestingly, we found that the pCSM segments identified from the pooled methylome overlapped significantly with those identified using single-cell methylomes (Additional file 3: Figure S3C). This indicates that the cell-type-specific methylated loci determined with single-cell methylomes could also be detected using a bulk methylome. In addition, pCSM loci identified from the pooled methylome (Additional file 4: Table S3) were enriched at enhancer histone marks and CpG islands, but were depleted from promoters, 5′UTRs, and repeat elements (Additional file 3: Figure S3D).
In this study, we implemented an analysis pipeline to predict the composition of cell subtypes in bulk methylomes. To our knowledge, this is the first endeavor to systematically analyze the variation in DNA methylation patterns to infer pCSM loci as inputs for the NMF model. Application to synthetic methylomes simulated from single-cell methylomes and to methylomes derived from sorted cells demonstrated that our approach is efficient and has high prediction accuracy. Our procedure is semi-reference-free: the clustering of pCSM loci to identify representative eigen-pCSM loci depends on the methylomes collected. With rapidly accumulating methylome data, such a method will gain power and can be widely used to explore cell heterogeneity during tissue development and disease progression.
Materials and methods
Analyses of single-nucleus methylcytosine sequencing (snmC-seq) datasets
Single-nucleus methylcytosine sequencing datasets of 3377 neurons from 8-week-old mouse cortex (GSE97179) were downloaded from the Gene Expression Omnibus (GEO). These datasets were analyzed following the processing steps described in a previous study: (1) sequencing adaptors were first removed using Cutadapt v2.1, (2) trimmed reads were mapped to the mouse genome (GRCm38/mm10) in single-end mode using Bismark v0.16.3, with the pbat option activated for mapping R1 reads, (3) duplicated reads were filtered using picard-tools v2.0.1, (4) non-clonal reads were further filtered by minimal mapping quality (MAPQ ≥ 30) using samtools view with the option -q 30, and (5) methylation calling was performed by Bismark v0.16.3.
Identification of pCSM loci from snmC-seq datasets
pCSM loci were determined from single-cell methylomes with a procedure similar to that described in a previous study. Briefly, for each snmC-seq dataset, all segments with four neighboring CpG sites in any sequence read were extracted from autosomes, and the corresponding methylation patterns were recorded. The 4-CpG segments that overlapped with known imprinted regions were excluded from subsequent steps. To ensure statistical power for the identification of pCSM loci, only segments covered by at least ten single-cell methylomes were retained for further analysis. The remaining 4-CpG segments covered by at least one completely methylated cell and one completely unmethylated cell were identified as pCSM loci candidates. From these candidates, a beta mixture model was used to infer pCSM loci, by which the cells covering the same segment could be grouped into hypomethylated and hypermethylated cell subsets. The segments with methylation differences of over 30% between the hypomethylated and hypermethylated cell subsets and adjusted p values less than 0.05 were then identified as pCSM loci.
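The candidate-selection step described above (before the beta mixture model is applied) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that each cell contributes a single pattern per segment, encoded as a string of "1" and "0", and the function and variable names are hypothetical.

```python
MIN_CELLS = 10  # coverage threshold stated in the text

def is_pcsm_candidate(cell_patterns, min_cells=MIN_CELLS, n_sites=4):
    """cell_patterns maps a cell ID to the pattern observed for one segment,
    e.g. "1111" (completely methylated) or "0000" (completely unmethylated)."""
    if len(cell_patterns) < min_cells:
        return False  # too few single-cell methylomes cover this segment
    observed = set(cell_patterns.values())
    fully_methylated = "1" * n_sites
    fully_unmethylated = "0" * n_sites
    return fully_methylated in observed and fully_unmethylated in observed

# Hypothetical segments used only to exercise the filter.
segment_a = {f"cell{i}": ("1111" if i < 6 else "0000") for i in range(12)}
segment_b = {f"cell{i}": "1111" for i in range(12)}
segment_c = {f"cell{i}": ("1111" if i < 3 else "0000") for i in range(5)}

for name, seg in [("a", segment_a), ("b", segment_b), ("c", segment_c)]:
    print(name, is_pcsm_candidate(seg))
```

Only segment_a passes: segment_b lacks a completely unmethylated cell, and segment_c is covered by fewer than ten cells.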
Analyses of whole-genome bisulfite sequencing datasets
Sequencing adaptors and bases with low sequencing quality were first trimmed off using Trim Galore v0.4.4. The retained reads were then mapped to the mouse reference genome (GRCm38/mm10) using Bismark v0.16.3. Duplicated reads were removed using deduplicate_bismark. Lastly, methylation calling was performed by Bismark v0.16.3.
Identification of pCSM loci from WGBS datasets
pCSM loci were identified from WGBS datasets following a strategy described previously, with slight modifications. Genomic segments with four neighboring CpGs were determined within each sequence read. Such 4-CpG segments covered by at least ten reads were retained for further identification of bipolar methylated segments. A nonparametric Bayesian clustering algorithm was applied to detect bipolar methylated segments that were covered concurrently by at least one completely methylated and one completely unmethylated read. Bipolar segments on chromosomes X and Y and in known imprinted regions were excluded from further analysis.
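As a rough illustration of the read-level bookkeeping described above, the sketch below slides a window of four consecutive CpG calls along each read and keeps segments with at least ten reads that include both a completely methylated and a completely unmethylated read. It does not implement the nonparametric Bayesian clustering; the input representation (a sorted list of position and methylation-call pairs per read, with no missing CpG calls) and all names are assumptions made for this example.

```python
from collections import defaultdict

def four_cpg_segments(read_calls, size=4):
    """Yield (positions, pattern) for every run of `size` consecutive CpG
    calls in one read; read_calls is a position-sorted list of
    (cpg_position, is_methylated) pairs."""
    for i in range(len(read_calls) - size + 1):
        window = read_calls[i:i + size]
        positions = tuple(pos for pos, _ in window)
        pattern = "".join("1" if meth else "0" for _, meth in window)
        yield positions, pattern

def bipolar_candidates(reads, min_reads=10, size=4):
    """Return segments covered by >= min_reads reads that contain at least
    one fully methylated and one fully unmethylated read."""
    patterns_by_segment = defaultdict(list)
    for read in reads:
        for positions, pattern in four_cpg_segments(read, size):
            patterns_by_segment[positions].append(pattern)
    full, empty = "1" * size, "0" * size
    return {
        seg: pats
        for seg, pats in patterns_by_segment.items()
        if len(pats) >= min_reads and full in pats and empty in pats
    }

# Hypothetical reads: six methylated reads spanning five CpGs and six
# unmethylated reads spanning the first four of them.
methylated_read = [(100, True), (110, True), (125, True), (140, True), (150, True)]
unmethylated_read = [(100, False), (110, False), (125, False), (140, False)]
reads = [methylated_read] * 6 + [unmethylated_read] * 6
candidates = bipolar_candidates(reads)
print(sorted(candidates))
```

In this toy input, only the segment spanning the first four CpGs reaches the ten-read threshold, so it is the only candidate returned.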
Genome annotation and gene ontology analysis
Genomic features were downloaded from the UCSC Genome database, including annotation for gene structure, CpG islands (CGI), and repeat elements in mm10. Promoters were defined as 2 kb regions upstream of transcription start sites (TSS). CGI shores were defined as 2 kb outside of the CGI, and CGI shelves were defined as 2 kb outside of the CGI shores. The broad peaks of histone modifications H3K4me1, H3K4me3, and H3K27ac for 8-week mouse cortex were obtained from the ENCODE Project (with accession GSM769022, GSM769026, and GSM1000100, respectively) and lifted from mm9 to mm10 using UCSC LiftOver tools. GO enrichment analysis for pCSM loci enriched in histone peaks was performed by the GREAT tool v3.0.0 using default settings.
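The interval definitions above can be expressed in a few lines. The sketch below assumes 0-based, half-open coordinates and clips intervals at the chromosome start; these conventions and the function names are choices made for illustration and are not specified in the text.

```python
def promoter(tss, strand, width=2000):
    """Half-open interval covering `width` bp upstream of a TSS.
    Upstream means lower coordinates on "+" and higher on "-"."""
    if strand == "+":
        return max(0, tss - width), tss
    if strand == "-":
        return tss + 1, tss + 1 + width
    raise ValueError("strand must be '+' or '-'")

def cgi_flanks(start, end, width=2000):
    """Shores flank the CpG island [start, end) by `width` bp on each side;
    shelves flank the shores by another `width` bp."""
    shores = [(max(0, start - width), start), (end, end + width)]
    shelves = [
        (max(0, start - 2 * width), max(0, start - width)),
        (end + width, end + 2 * width),
    ]
    return shores, shelves

print(promoter(10000, "+"), promoter(10000, "-"))
print(cgi_flanks(5000, 6000))
```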
Co-methylation, eigen-pCSM loci extraction, and NMF analyses for virtual methylome dissection
A two-step clustering approach was adopted for co-methylation analysis. First, k-means clustering analysis was performed to divide pCSM loci into hypo/mid/hypermethylation groups. For each k-means cluster, the R package WGCNA v1.61 was used to identify co-methylation modules of highly correlated pCSM loci. Briefly, for a given DNA methylation profile, a topological overlap measure (TOM) was used to cluster pCSM loci into network modules. The soft-thresholding power was determined according to the scale-free topology criterion. Network construction and module determination were performed using the “blockwiseModules” function in WGCNA, and the network type was set to “signed” during network construction to filter out negatively correlated pCSM loci within one module. PCA was performed to select a subset of pCSM loci with the maximal loadings in PC1 as eigen-pCSM loci for the corresponding module.
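The final selection step, choosing the loci with the largest PC1 loadings within one module, can be illustrated with a dependency-free sketch. It is not the authors' R code: it treats loci as variables, uses unscaled (covariance-based) PCA computed by power iteration, and assumes the loci in a module are positively correlated, as expected for a "signed" network. The toy data and all names are hypothetical.

```python
import math

def pc1_loadings(module, n_iter=200):
    """module: one list of methylation levels (across samples) per locus.
    Returns the PC1 loading of each locus via power iteration on the
    locus-by-locus covariance matrix (up to a constant factor)."""
    centered = []
    for row in module:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    n = len(centered)
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j]))
            for j in range(n)] for i in range(n)]
    vec = [1.0 / math.sqrt(n)] * n  # valid start for positively correlated loci
    for _ in range(n_iter):
        new = [sum(cov[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            break  # no variance in this module
        vec = [x / norm for x in new]
    return vec

def select_eigen_loci(module, names, top_n):
    """Return the names of the top_n loci ranked by absolute PC1 loading."""
    loadings = pc1_loadings(module)
    order = sorted(range(len(names)), key=lambda i: abs(loadings[i]), reverse=True)
    return [names[i] for i in order[:top_n]]

# Hypothetical module: four co-methylated loci measured in five samples.
names = ["locusA", "locusB", "locusC", "locusD"]
module = [
    [0.10, 0.90, 0.20, 0.80, 0.50],
    [0.20, 0.80, 0.30, 0.70, 0.50],
    [0.45, 0.55, 0.47, 0.53, 0.50],
    [0.40, 0.60, 0.40, 0.60, 0.50],
]
print(select_eigen_loci(module, names, top_n=2))
```

Because the PCA here is unscaled, loci with larger methylation variance across samples receive larger loadings; whether to standardize loci first is a design choice that this sketch leaves open.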
The R package MeDeCom v0.2 was used to dissect the methylomes using NMF analysis. A matrix with eigen-pCSM loci in rows and samples in columns can be decomposed into the product of two matrices: one representing the profiles of the predicted cell types, with eigen-pCSM loci in rows and cell types in columns, and the other containing the proportions of the predicted cell types in each sample, with cell types in rows and samples in columns. Two parameters need to be set manually in NMF analysis: the number of cell types k and the regularization parameter λ, which shifts the estimated matrix of methylation patterns toward biologically plausible binary values close to zero (unmethylated) or one (methylated). k is dictated by prior knowledge of the input methylomes. If no prior knowledge of cell composition is available for the input methylomes, both k and λ may be selected via cross-validation as suggested in the MeDeCom package.
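The shape of this factorization can be illustrated with a plain NMF sketch. Note that this is not MeDeCom: it uses standard multiplicative updates without the λ regularizer or the bound constraints, and it converts the second factor to proportions only by a post hoc column normalization. The toy matrices, the function names, and the parameter defaults are assumptions made for the example.

```python
import random

def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def sq_error(d, t, a):
    """Squared Frobenius distance between d and the product t x a."""
    approx = matmul(t, a)
    return sum((u - v) ** 2 for r1, r2 in zip(d, approx) for u, v in zip(r1, r2))

def nmf(d, k, n_iter=1000, seed=0, eps=1e-9):
    """Factor d (loci x samples) into t (loci x k) and a (k x samples)
    with nonnegative entries, using multiplicative updates."""
    rng = random.Random(seed)
    t = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in d]
    a = [[rng.uniform(0.1, 1.0) for _ in d[0]] for _ in range(k)]
    for _ in range(n_iter):
        tt = transpose(t)
        num, den = matmul(tt, d), matmul(matmul(tt, t), a)
        a = [[a[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(len(a[0]))] for i in range(k)]
        at = transpose(a)
        num, den = matmul(d, at), matmul(t, matmul(a, at))
        t = [[t[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(k)] for i in range(len(t))]
    return t, a

def to_proportions(a):
    """Rescale each sample (column) of a so that its entries sum to one."""
    col_sums = [sum(col) for col in zip(*a)]
    return [[v / s for v, s in zip(row, col_sums)] for row in a]

# Hypothetical ground truth: 6 loci, 2 cell types, 5 samples.
t_true = [[1, 0], [1, 0], [0.9, 0.1], [0, 1], [0, 1], [0.1, 0.9]]
a_true = [[0.2, 0.5, 0.7, 0.9, 0.4], [0.8, 0.5, 0.3, 0.1, 0.6]]
d = matmul(t_true, a_true)

t_hat, a_hat = nmf(d, k=2)
proportions = to_proportions(a_hat)
print(sq_error(d, t_hat, a_hat))
```

The recovered components are identifiable only up to permutation and scaling, which is one reason a dedicated tool with explicit constraints and regularization is used in practice.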
Cell mixture methylome synthesis and virtual methylome dissection simulation
The authors thank Dr. Janet Webster for English language editing, Drs. Joseph R. Ecker, Ryan Lister, and Eran A. Mukamel for sharing brain methylome data, and the laboratories contributing to the ENCODE project.
HX and XL conceived and designed the study; LY and YL implemented procedures and conducted data analysis; XX, SW, and XL participated in data preparation, result organization, and discussion; LY, XW, and HX wrote the manuscript. All authors discussed the results and commented on the manuscript. All authors read and approved the final manuscript.
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13000000 for X.L.), Fralin Life Sciences Institute at Virginia Tech faculty development fund (for H.X.) and VT’s Open Access Subvention Fund, the Key Research Program of the Chinese Academy of Sciences (KFZD-SW-220-1 for X.L.), and the CAS Light of West China Program (for X.L.).
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
16. Rahmani E, Schweiger R, Shenhav L, Wingert T, Hofer I, Gabel E, et al. BayesCCE: a Bayesian framework for estimating cell-type composition from DNA methylation without the need for methylation reference. Genome Biol. 2018;19:141. https://doi.org/10.1186/s13059-018-1513-2.
34. Lake BB, Codeluppi S, Yung YC, Gao D, Chun J, Kharchenko PV, et al. A comparative strategy for single-nucleus and single-cell transcriptomes confirms accuracy in predicted cell-type expression from nuclear RNA. Sci Rep. 2017;7:6031. https://doi.org/10.1038/s41598-017-04426-w.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.