Correlation of an epigenetic mitotic clock with cancer risk
- 5k Downloads
Variation in cancer risk among somatic tissues has been attributed to variations in the underlying rate of stem cell division. For a given tissue type, variable cancer risk between individuals is thought to be influenced by extrinsic factors which modulate this rate of stem cell division. To date, no molecular mitotic clock has been developed to approximate the number of stem cell divisions in a tissue of an individual and which is correlated with cancer risk.
Here, we integrate mathematical modeling with prior biological knowledge to construct a DNA methylation-based age-correlative model which approximates a mitotic clock in both normal and cancer tissue. By focusing on promoter CpG sites that localize to Polycomb group target genes that are unmethylated in 11 different fetal tissue types, we show that increases in DNA methylation at these sites defines a tick rate which correlates with the estimated rate of stem cell division in normal tissues. Using matched DNA methylation and RNA-seq data, we further show that it correlates with an expression-based mitotic index in cancer tissue. We demonstrate that this mitotic-like clock is universally accelerated in cancer, including pre-cancerous lesions, and that it is also accelerated in normal epithelial cells exposed to a major carcinogen.
Unlike other epigenetic and mutational clocks or the telomere clock, the epigenetic clock proposed here provides a concrete example of a mitotic-like clock which is universally accelerated in cancer and precancerous lesions.
KeywordsDNA methylation Epigenetic clock Cancer Mitotic Stem cells Ageing
area under the curve
ductal carcinoma in situ
European Genome-phenome Archive
Gene Expression Omnibus
human embryonic stem cell
hematopoietic progenitor cell
invasive breast cancer
invasive lung cancer
lung carcinoma in situ
mesenchymal stem cell
Polycomb group target
The Cancer Genome Atlas
total number of stem cell divisions per stem cell.
Estimating the relative rate of stem cell divisions of a given tissue type between individuals may allow their stratification according to their prospective risk of cancer [1, 2]. It is therefore of interest to construct molecular mitotic-like clocks, which may provide an approximate estimate of the relative stem cell division rate of a tissue in an individual [3, 4, 5]. While telomere shortening represents a mitotic clock  and has been associated with increased cancer risk , these associations have, however, been largely inconsistent and only obtained in surrogate tissues such as blood . A recently identified mutational clock-like signature  may also approximate a mitotic clock but has not yet been applied to cancer risk prediction.
Errors in the maintenance of DNA methylation (DNAm) arising during cell division may accumulate in the stem cell population of a tissue in line with the stem cell division rate and chronological age and have been proposed as molecular marks for a mitotic clock [3, 4, 9]. In addition, an increased rate of mitosis in the stem cell pool, possibly associated with cancer risk factors such as inflammation or viral infection, has been suggested to fuel epigenetic cellular heterogeneity and to lead to an increased epigenetic clonal mosaicism which may predispose the tissue to future neoplastic transformation [10, 11, 12, 13, 14, 15]. Indeed, clonal genetic and copy number variation mosaicism has already been associated with the future risk of hematological cancers [16, 17, 18, 19], and DNAm variability in normal cervical cells has been shown to predict the prospective risk of cervical cancer . Given that many cancer risk factors have been associated with DNAm changes in normal cells [12, 15, 20, 21, 22], and preferentially at the same sites that undergo DNAm changes with age in healthy tissue [23, 24], we posited that a DNAm based mitotic-like clock could serve as a tool to predict cancer risk.
Here we report substantial progress towards the construction of such an epigenetic mitotic-like clock. Using a novel approach, based on an underlying mathematical model, we build a DNAm-based age-correlative model called “epiTOC” (Epigenetic Timer Of Cancer). A key feature underlying the construction of epiTOC is the focus on Polycomb group target (PCGT) promoter CpGs which are unmethylated in many different fetal tissue types, thus allowing us to define a proper ground state from which to then assess deviations in DNAm in aged tissue. By correlating the tick rate predictions of this model to the rate of stem cell divisions in normal tissue, as well as to an mRNA expression-based mitotic index in cancer tissue, we demonstrate that our model approximates a mitotic-like clock. Importantly, unlike Horvath’s epigenetic clock , the tick rate of epiTOC is universally accelerated in cancer, in preinvasive lesions, in normal epithelial cells at risk of neoplastic transformation, and in normal epithelial cells exposed to smoke carcinogens.
Construction of the epiTOC model
Using one of the largest Illumina 450 k DNAm datasets encompassing over 650 whole blood samples from healthy individuals spanning an age range of over 80 years  and correcting for changes in blood cell subtype composition (“Methods”; Additional file 1), we identified a subset of 385 PCGT promoter CpGs satisfying all required properties, including being unmethylated across 11 different fetal tissue types and exhibiting age-associated hypermethylation (false discovery rate <0.05) (“Methods”; Additional file 2). For each sample, epiTOC yields a score, denoted “pcgtAge”, as the average DNAm over these 385 CpG sites, representing the age-cumulative increase in DNAm at these sites due to putative cell-replication errors (Figure S1a in Additional file 3). Given that hypomethylation is also commonly observed in aging and cancer [29, 30], a separate age-correlative model based on promoter CpGs that are partially methylated in fetal tissue and which become hypomethylated with age was also derived (“Methods”; Figure S1b in Additional file 3). We validated both age-correlative models in an independent Illumina 450 k data set, encompassing over 300 whole blood samples from healthy individuals  (Figure S2 in Additional file 3). However, only epiTOC correlated with age in other normal tissue types (Figure S3 in Additional file 3), with the age-associated hypomethylation model showing inconsistent patterns (Figure S4 in Additional file 3).
epiTOC correlates with age in purified cell and stem cell populations
We also performed a gene set enrichment analysis on the 385 PCGT CpGs that make up epiTOC to see if there is any evidence for these CpGs mapping to immune/blood cell subtype markers. To identify relevant blood cell or immune cell type terms, we first conducted the gene set enrichment analysis on top ranked CpGs in the Hannum et al.  data without correction for cellular heterogeneity, which, as expected, revealed strong enrichment of promoter CpGs mapping to genes underexpressed in lymphocytes and genes overexpressed in myeloid cells (Fig. 2c), consistent with the known increased myeloid–lymphoid ratio with age . In contrast, these same biological terms were conspicuously absent and underenriched among the 385 PCGT epiTOC CpGs (Fig. 2c; Additional file 4).
Finally, we also assessed epiTOC in stem cell populations in order to support our underlying assumption that DNAm alterations at the epiTOC PCGT loci can accrue with age in a stem cell pool. We obtained Illumina Infinium 27 k DNA methylation data for a total of eight bone marrow-derived mesenchymal stem cell (MSC) populations of low passage number, representing a wide donor age range (20–80 years) , as well as for 12 CD34+ hematopoietic progenitor cell (HPC) populations derived from cord blood and adult peripheral blood . In both studies, and despite the small sample sizes, the pcgtAge score correlated positively with donor age (linear regression P = 0.015 for MSCs and Wilcoxon rank sum test P = 0.037 for HPCs; Fig. 2d).
epiTOC approximates a mitotic clock
To further demonstrate that pcgtAge provides a correlative measure of the cell division rate in a tissue, we posited that it would correlate with an mRNA expression-based mitotic index, which we constructed from the expression levels of genes that have been highly validated as being cell proliferation markers (“Methods”). This expression-based mitotic index was increased in all 15 TCGA cancer types compared to their corresponding normal tissue (Figures S6 and S7 in Additional file 3) and also correlated strongly with the TNSC estimates across normal tissues (P < 1e-21; Figure S8 in Additional file 3). Focusing on only the cancer samples from each of these cancer types, we obtained Pearson correlations between their pcgtAge and their expression-based mitotic index. Correlations were generally positive and much higher than those obtained using Horvath’s age-acceleration measure (Fig. 3f; paired Wilcoxon rank sum test P < 0.0001). For pcgtAge we observed a significant (P < 0.05) positive correlation in 7/15 cancer types, whilst for Horvath’s clock only one cancer type (KIRP) exhibited such an association (Fig. 3g). All these data support the view that epiTOC represents an approximate mitotic-like clock. In contrast, the model based on age-associated hypomethylation did not correlate well with the expression-based mitotic index in cancer tissue, although we did observe an excellent correlation with cellular turnover rates in normal tissue (Figure S9 in Additional file 3).
epiTOC predicts universal age acceleration in cancer and is further increased in cancer cell lines
Next, we asked if the pcgtAge score is also increased in cancer cell lines compared to cell lines of normal karyotype. Using Illumina 450 k DNAm data from 24 cancer and 29 normal cell lines, all profiled as part of ENCODE , we observed a significantly higher pcgtAge score in the cancer cell lines (Wilcoxon rank sum test P = 6e-14; Figure S13a in Additional file 3). Cancer cell lines generally also exhibited higher pcgtAge scores than those observed in cancer tissue. To confirm this, we compared the pcgtAge scores for 11 cancer types against the pcgtAge score of a corresponding representative cancer cell line (“Methods”). In all 11 cancer types, the cancer cell line exhibited a higher score than the average over corresponding cancer tissue samples (Wilcoxon paired rank sum test P = 0.005; Figure S13b in Additional file 3).
Increased epiTOC tick rate in pre-invasive cancer lesions
Most of these results were replicated in an Illumina 450 k dataset encompassing 14 normal breast tissues (from reduction surgery), 28 age-matched ductal carcinomas in situ (DCIS), and a further 13 age-matched DCIS samples from women who later developed an invasive breast cancer (IBC)  (Fig. 5c). Specifically, we observed an increased pcgtAge score in DCIS compared to normal breast (AUC = 1, P < 0.05; Fig. 5c). In contrast, neither Horvath’s measure of age acceleration nor the age-associated hypomethylation model could discriminate normal breast from DCIS samples (Fig. 5d; Figure S16 in Additional file 3). In another independent Illumina 450 k set which profiled 50 normal breast samples, 42 age-matched normal samples adjacent to a breast cancer and an additional 305 breast cancers , pcgtAge also showed a gradual age-independent increase between normal breast tissue and normal-adjacent breast tissue and breast cancer and was able to discriminate normal breast from normal-adjacent tissue (AUC = 0.66, 95 % CI 0.55–0.77; Fig. 5e). Again, this was not the case for Horvath’s age-acceleration measure nor for the analogous age-associated hypomethylation signature (Fig. 5f; Figure S17 in Additional file 3).
To translate the discrimination accuracies above into an estimate of the difference in incurred stem cell divisions, we integrated the estimate of the intrinsic rate of stem cell division in lung tissue [2, 38] with the epiTOC model, using the normal lung samples from TCGA to estimate the intercept and slope in our regression model (“Methods”). We obtained an estimate of 8.45 stem cell divisions per stem cell in the normal tissue compared to 13.33 divisions per stem cell in LCIS which did not progress to ILC and 22.02 divisions in the LCIS samples which progressed to ILC, representing an approximate threefold increase compared to normal samples. In the case of breast tissue, the underlying rate of stem cell division is unknown, but we could nevertheless estimate that DCIS samples had a 3.5-fold higher number of total stem cell divisions per stem cell compared to age-matched normal breast, a ratio similar to that of LCIS to normal lung tissue.
Increased epiTOC tickrate in normal buccal tissue of smokers
For a subset of 152 women for which there was matched blood–buccal tissue, we found that the pcgtAge score in the two tissues did not correlate with each other (Fig. 6c), consistent with the fact that all women were of the same age and that the effect of inflammation (and hence cancer risk) is tissue-specific. In line with this, the association of the pcgtAge score with SPY was stronger in buccal tissue than in blood (Fig. 6d). In contrast, for Horvath’s age-acceleration measure, which in this cohort of identically aged women equals the predicted DNAm age, we observed a weak yet significant correlation across the two different tissues (Fig. 6e). Correspondingly, no association between Horvath’s age-acceleration measure and SPY was observed, irrespective of tissue type (Fig. 6f).
Unlike telomere shortening [6, 7, 8], or the recently identified mutational clock-like signature , epiTOC provides a concrete example of a molecular mitotic-like clock which predicts universal acceleration in cancer. Although we acknowledge that current estimates of stem cell division rates in normal tissues are only very rough approximations, the observed correlation between epiTOC and the number of cell divisions per stem cell in over 288 normal samples was obtained by broadly categorizing tissues into groups of low, medium, and high cellular turnover, a classification which is likely to be robust (i.e., it is well accepted that colon has one of the highest cellular turnover rates of any tissue). Importantly, the pcgtAge score from epiTOC further correlated with an expression-based mitotic index in as many as seven different cancer types. In those cancer types where we did not see an association with the expression-based mitotic index, this could be due to small differences in proliferation rates between the actual tumors. Supporting this, the pcgtAge score was universally increased in cancer, as assessed in 15 tumor types encompassing over 5000 samples, in line with an increased mitotic rate being a universal cancer hallmark. The pcgtAge score was also accelerated in preinvasive cancer lesions, including LCIS and breast DCIS samples, allowing highly accurate discrimination of such lesions (AUC ~ 0.9–1). In addition, epiTOC was able to predict risk of an ILC, providing highly accurate identification (AUC ~ 0.8) of LCIS which progressed to an ILC (Figure S14a in Additional file 3). The estimated epiTOC tick rate in normal breast tissue adjacent to breast cancer was also accelerated, albeit at a much lower level, consistent with most of the cells in these samples being of normal cytology. Of note, we recently demonstrated the existence of DNAm field defects in the same normal-adjacent breast tissue samples . That the pcgtAge score could discriminate normal tissue containing such field defects from the normal tissue of age-matched cancer-free women (AUC of 0.66, 95 % CI 0.55–0.77) suggests that epiTOC may serve to assess the risk of neoplastic transformation of normal tissue. Large prospective studies will, however, be required to demonstrate utility of epiTOC in a clinical setting.
Of note, the pcgtAge score also correlated with SPY in normal buccal tissue, highlighting the potential of epiTOC to capture putative increased cellular proliferation of epithelial cells that have been exposed to a major carcinogen. Although we did not measure inflammatory or cell proliferation markers in our buccal samples, previous studies have noted increased inflammation and proliferative activity in the oral epithelium of smokers [44, 45]. Although it is also plausible that the DNAm changes seen in the buccal epithelium of smokers reflects active changes associated with smoking-induced DNA damage, there is another observation supporting the view that inflammation and increased proliferation underlie most of the observed changes: smoking-associated DNAm changes seen in the buccal epithelium are generally very similar to those seen in healthy aging [20, 46]—for instance, as seen in blood cells from pediatric populations [47, 48]—or similar to those DNAm changes seen in inflammatory diseases [12, 49]. Indeed, a recent review highlighted enrichment of PCGT CpGs as a common DNAm signature which is seen in healthy aging of normal cells (which are presumably free of DNA damage), as well as in normal cells of the same chronological age but which have been exposed to a variety of different cancer risk factors . Hence, the more likely explanation for the observed increased pcgtAge score in the buccal epithelial cells of smokers compared to non-smokers of the same age is a marginally increased proliferative activity, possibly associated with an inflammatory response to smoke toxins.
Assessment of epiTOC in cancer cell lines was less conclusive. Although cancer cell lines always exhibited a higher pcgtAge score than the average value of the corresponding cancer tissue, consistent with our epiTOC model and with reports that cancer cell lines exhibit higher levels of promoter methylation than cancer tissue , comparisons between cell lines and cancer tissue is difficult for a number of reasons. First, studies have shown that the artificial microenvironment and selection pressures of cell culture conditions can modify the genomic, DNAm, and expression patterns of the parental cells [35, 51, 52, 53]. Therefore, it is unclear whether the mechanism underlying DNAm changes upon cell division in tumors is also operative in cell lines. Second, a cell line derives from a parent cell which may grossly underrepresent the level of epigenetic heterogeneity in the stem cell pool of the primary cancer tissue, rendering comparisons between cell lines and independent cancer tissue samples problematic. Third, many cell lines would be needed to render meaningful comparisons with cancer tissue . Notwithstanding these limitations, our data are broadly supportive of epiTOC, assuming that, once cells are cultured in vitro, the pcgtAge score ceases to reflect the mitotic age of the cells while still reflecting the mitotic age of the parent tissue.
The data presented here further confirm that Horvath’s clock is not a mitotic clock and that it does not exhibit a consistent universal acceleration in cancer or precancerous lesions. Indeed, by its very nature, Horvath’s clock was designed to predict chronological age independently of tissue type and must, therefore, reflect a biological process unrelated to cell division (since cell division rate is highly variable between tissues). While it is not yet clear what this biological process may be, the data presented here indicate that it is not a process that is uniformly altered in cancer, nor a process which appears accelerated in response to a major cancer risk factor such as smoking. In comparing epiTOC to Horvath’s clock, it is also important to point out that epiTOC does not attempt to predict chronological age of samples in absolute terms. This is not a limitation but, in fact, a key advantage of our approach, since any optimization procedure aimed at predicting chronological age as accurately as possible would not allow us to reliably identify the CpGs which reflect an underlying mitotic process.
Importantly, our work has further exposed a subtle difference between promoter CpGs that undergo age-associated hypermethylation from those that undergo hypomethylation, with the analogous model based on age-hypomethylated sites not correlating with the mitotic index in cancer tissue and correspondingly not exhibiting a consistent acceleration in cancer. This is consistent with a recent report by Lin and Wagner . In fact, whereas the model based on hypomethylation also correlated with the number of stem cell divisions per stem cell in normal tissues, as well as with exposure to smoking in normal buccal tissue, it is intriguing that no consistent associations were found in cancer or preinvasive cancer lesions. This suggests that DNA hypermethylation errors associated with cell replication might be similar in cancer and normal stem cells but not so for the corresponding hypomethylation aberrations. It will be important for future studies to try to understand this deep and subtle asymmetry between age-associated hyper- and hypomethylation in relation to the changes seen in cancer.
Interestingly, the mRNA expression-based mitotic index also correlated with the estimated number of stem cell divisions, with the strength of correlation comparable to that obtained by epiTOC. Both mitotic indices were also comparable predictors of normal/cancer status (Figure S12 in Additional file 3). However, only the DNAm-based mitotic index correlates with chronological age. Indeed, it has been shown that mRNA expression signatures correlating with age are not strongly consistent or reproducible across independent studies and that the only consistent genes map to immune system and metabolic pathways and not to cell proliferation . Thus, epiTOC has a higher sensitivity to detect differences in cell division numbers between differently aged samples of the same tissue type. Moreover, epiTOC could discriminate normal breast tissue adjacent to breast cancer from the corresponding normal tissue of age-matched healthy women (Fig. 5e). Although matched mRNA data were not available to assess this for an expression-based mitotic index, it is unclear whether expression based assays can offer the stability and reliability of a DNA-based assay to detect such subtle differences .
Although epiTOC correlates with stem cell division rate estimates in normal tissues and correctly predicts an increased tick rate in precancerous and cancer lesions, it is important to point out several limitations underlying our model. First, epiTOC assumes that the underlying DNAm errors are occurring in the underlying stem cell pool of a tissue and ignores active DNAm changes which may occur post-mitotically in response to exposure to various risk factors. As such, the epiTOC model may only capture one part of a more general cancer risk-predictive DNAm signature. However, by computing the pcgtAge score in populations of mesenchymal stem cells and hematopoietic progenitor cells, we have seen that epiTOC does predict an increased mitotic age in older stem/progenitor cell populations, supporting the view that it may measure mitotic age of the stem cell pool in epithelial tissues. A second limitation of our model is that it largely ignores absolute stem cell numbers in the tissue of interest, which may vary substantially between individuals and even within an individual as a function of current exposure to environmental factors. The absolute number of stem cells in a tissue is expected to be an important determinant of epigenetic stem cell heterogeneity and clonal mosaicism and, therefore, of cancer risk, as proposed by a number of studies [10, 13, 58]. This is therefore an important caveat, since our model does not distinguish between a tissue with a lower number of stem cells carrying lots of DNAm changes from another tissue with a large number of stem cells, each carrying only a few yet unique DNAm alterations. Nevertheless, assuming similar stem cell numbers, our pcgtAge score would be expected to increase in line with the level of epigenetic heterogeneity in the stem cell pool and thus be informative of cancer risk. It is also worth stressing that the observed correlation of the pcgtAge score with the tissue’s cellular turnover rate does not depend on any estimates of the actual number of stem cells in the tissue of interest. Indeed, the pcgtAge score is an intensive variable, defined as an average over cells and a specific set of loci, and, as such, correlation can only be assessed meaningfully relative to another intensive variable. This other intensive variable is the cumulative number of cell divisions per stem cell in the tissue of interest and not the extensive variable defined as the total number of stem cell divisions in the tissue. For this reason, our model does not make, and cannot make, any predictions of cancer risk between different tissues.
In spite of these limitations, epiTOC may allow, in principle, prediction of stem cell division numbers if absolute stem cell numbers remain constant, if the intrinsic division rate in a tissue is known and if a large DNAm data set of that tissue from healthy individuals is available. In the case of lung tissue, for which estimates of the intrinsic rate of stem cell division are available [2, 38], and using normal-adjacent samples from the TCGA to train the model, we found that LCIS samples exhibited a total of about 13 to 22 divisions per stem cell. This is significantly larger than the approximate eight to nine divisions per stem cell for the normal lung samples, which in turn is only marginally higher than the life-time number of divisions per stem cell in lung tissue (approximately six divisions per stem cell) as estimated by Tomasetti and Vogelstein . Given that it has been estimated that lung tissue contains about one billion stem cells, and assuming that a 2 cm sized LCIS contains 0.5 % of these stem cells, this translates into approximately (22/8) × 0.005 × 109 ~ 14 million additional cell divisions in LCIS compared to a normal sample. We acknowledge that these estimates are only very rough approximations, yet they are also not necessary for the purpose of cancer risk prediction where only the tick rate relative to that of a normal reference tissue is required.
Finally, it is of interest to discuss the relation of epiTOC to a recently proposed “Big Bang” model of early cancer evolution as proposed by Sottoriva et al. . This Big Bang model describes the clonal expansion of a tumor and can accurately explain the observed patterns of spatial genetic heterogeneity within the tumor. In contrast, epiTOC provides an estimate of the relative number of stem cell divisions per stem cell in a tissue of interest, which may approximate the level of epigenetic stem cell heterogeneity within the normal tissue before the cancer clone emerges. Once the cancer clone has emerged, the increased cellular proliferation of this clone would be expected to generate a “large wave” of further DNAm changes that may affect most of the genome. Thus, when applying epiTOC to cancer tissue, the pcgtAge score is expected to correlate strongly with the level of proliferation of the tumor, and we have indeed demonstrated this in as many as 15 TCGA cancer types. Moreover, we recently demonstrated that the pattern of aberrant DNAm of one tumor sample can explain, on average, about 60 % (R2 ~ 0.6) of the aberrant DNAm variation of another tumor, even if from a different cancer type , suggesting that a common process underlies most of the DNAm variation in the cancer genome. This common universal process is likely to be an increased cell division rate. The epigenetic epiTOC and genetic Big Bang model are therefore mainly aimed at describing different phases of the carcinogenic process, with epiTOC providing an approximate measure of cell division numbers per stem cell before the first cancer clone emerges and with the Big Bang model providing a description of subsequent cancer evolution. Both models together are highly consistent with an overarching “phase transition” model of oncogenesis, in which epigenetic clonal mosaicism is maximal before cancer emerges .
The epigenetic mitotic clock-like signature presented here exhibits a consistent universal pattern of acceleration in cancer, in precancerous epithelial lesions, and in normal epithelial cells exposed to a major carcinogen. We propose that DNAm-based models such as epiTOC may constitute informative biomarkers of cancer risk if evaluated in the cell of origin or in a relevant surrogate tissue.
Construction of an epigenetic mitotic clock (epiTOC): a mathematical model
where A(s) denotes the chronological age of the individual s, IR(t) denotes the intrinsic rate of stem cell divisions per stem cell in tissue type t, E(s) is a complex non-linear (generally unknown) positively valued function representing the exposure of individual s to a cancer risk factor, and where ER(t) denotes an extrinsic rate of stem cell division associated with exposure to the cancer risk factor and which we assume may depend on tissue type t. Note that we assume that the intrinsic rate IR only depends on tissue type and that it is, therefore, independent of chronological age and individual s, so we are ignoring genetic factors which may influence the rate of stem cell division.
where we have absorbed the term IR into a new slope coefficient. Thus, to identify CpGs whose DNAm levels correlate with TNSC, it is justified to correlate DNAm to chronological age, as long as we use one tissue type and focus on healthy individuals. However, since many biological processes may be associated with distinct age-associated DNAm changes, it is necessary to make a selection of CpGs which are more likely to capture the DNAm aberrations caused by cell division.
Construction of an epigenetic mitotic-like clock: CpG selection
Motivated by the above mathematical model, we posited that we could identify relevant CpGs as follows: (i) identify CpG sites undergoing age-associated DNAm changes in a large DNAm dataset of healthy individuals, encompassing one tissue type only and correcting for potential changes in cell type composition; (ii) identify a subset of these that map to PCGT promoters, i.e., marked by the PRC2 complex, and which are constitutively unmethylated in a ground state of age zero (e.g., fetal tissue).
To justify (i), we reasoned that using multiple tissues, which would naturally differ in their mitotic tick rates, would only hamper or confound derivation of a mitotic clock (see above mathematical model). Correction for underlying changes in cell type composition is, however, an important potential confounder if we are to use only one tissue type. For these reasons, we used the dataset of 656 whole blood samples from Hannum et al. , representing one of the largest cohorts of healthy individuals which have been profiled with Illumina 450 k DNAm beadarrays, and a tissue type (blood) for which accurate correction for changes in blood cell type composition is possible . We justify (ii) on grounds that a recent study has shown that DNAm changes occurring during hematopoietic ontogeny involve preferentially DNAm increases at PCGT promoters, i.e., sites marked by the PRC2 complex . Thus, we reasoned that focusing on a subset of such promoter CpGs which are also constitutively unmethylated in a large set of fetal tissues  would provide us with the right markers to measure the rate of cell division.
In detail, using Hannum’s whole blood samples, we ran linear regressions of chronological age versus DNA methylation beta profiles adjusted for plate, sex, and estimates of blood cell subtypes. Estimates of blood cell subtypes were obtained using quadratic programming  and a novel blood cell subtype DNAm reference matrix (Additional file 1) constructed by integrating the Illumina 450 k DNAm data from Reinius et al.  with blood cell subtype-specific DNase hypersensitive site data from the NIH Epigenomics Roadmap (Teschendorff A et al: A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies, submitted). Age-associated CpGs were selected at a false discovery rate threshold of <0.05. Subsequently, these age-CpGs were filtered for those mapping unambiguously to within 200 bp of a transcription start site (TSS200 probes). We note that this restriction to TSS200 probes was done in order to minimize differences in the ground state (i.e., at age zero) methylation levels between probes, which facilitates the later construction of the pcgtAge score as an average of the probes. With this restriction to TSS200 probes as well as the inherent restricted coverage of the 450 k beadarrays, we nevertheless still covered 72 % of all PCGT promoters as defined in Lee et al. . The age-associated TSS200 CpGs were then divided into age-hypermethylated CpGs and age-hypomethylated ones. Age-hypermethylated CpGs were filtered further, selecting only those with absent or low (beta <0.2) methylation across 52 fetal tissue samples encompassing 11 tissue types (cord blood (GSE72867), stomach, heart, tongue, kidney, liver, brain, thymus, spleen, lung, adrenal gland ). These unmethylated promoter CpGs were divided further into those marked by PRC2 in human embryonic stem cells (hESCs) and those that are not, according to the annotation provided . This resulted in an age-hypermethylated set of 385 CpGs, which we denote “pcgtAge”. In the case of the age-hypomethylated promoter CpGs, we selected those with a methylation level of at least 0.3 across all 52 fetal tissue samples in order to guarantee that the observed hypomethylation at these sites is genuine and of potential biological significance. This resulted in a second set of 656 CpGs, denoted “hypoAge”.
and so, if pcgtAge(s 1 ,x) > pcgtAge(s 2 ,x), then the cancer risk (CR) of sample s 1 is also greater than that of s 2 : CR(s 1 ,x) > CR(s 2, x). Hence, a higher pcgtAge score should be indicative of a higher mitotic tick rate and indicate a higher cancer risk, which we can formally test if data for pre-cancerous lesions are available. Similarly, a lower hypoAge score would correspond to a higher cancer risk.
From Hannum et al.  whole blood data set, we estimated for the pcgtAge model that α’ = 0.052 and γ’ = 0.000345. From the 81 normal breast tissue samples from TCGA, we estimated α’ = 0.053 and γ’ = 0.000165, and from the combined 73 normal lung samples from the TCGA we estimated α’ = 0.021 and γ’ = 0.000588.
does not depend on IR(t).
Validation of the age-correlative models in blood and other normal tissue types
EpiTOC and the analogous model based on age-hypomethylated sites were tested in the large independent whole blood Illumina 450 k dataset of Liu et al.  using only healthy controls (over 300 samples). We also tested these two models in other normal tissue types. For this purpose, we used normal-adjacent tissue from TCGA, focusing on tissues for which there were enough normal samples and for which there had been corresponding fetal tissue used in the derivation and selection of the CpGs making up these models. This included 38 normal colon (normal-adjacent tissue to colon adenocarcinoma (COAD)), 160 normal kidney (adjacent to kidney renal cell carcinoma (KIRC)), 47 normal liver (adjacent to liver hepatocellular carcinoma (LIHC)), and 73 normal lung samples (adjacent to lung squamous cell carcinoma (LSCC)/ lung adenocarcinoma (LUAD)).
Validation of the epigenetic mitotic clock in normal tissue
In order to demonstrate that our age-correlative models define approximate mitotic clocks, i.e., that the relation TNSC(s,t) ~ pcgtAge(s,t) holds, we estimated the age-correlative scores in the normal tissue samples of TCGA , for which estimates of the intrinsic stem cell division rates (IR) per stem cell and per year are available from . Specifically, this included colon (IR = 73 divisions per stem cell per year), rectum (IR = 73), esophagus (IR = 17.4), head and neck (IR = 21.5), liver (IR = 0.9125), lung (IR = 0.07), pancreas (IR = 1) and thyroid (IR = 0.087), encompassing a total of 288 normal tissue samples. Cumulative total number of cell divisions per stem cell (i.e., TNSC) was estimated for each of these 288 normal “healthy” samples as the product of chronological age (tissue-independent) and the corresponding rate IR (tissue-dependent), i.e., as TNSC(s,t) = A(s)IR(t). The sample-specific scores pcgtAge, nonpcgtAge, and hypoAge were then correlated to these sample-specific cumulative cellular turnover rates using a linear regression framework adjusted for chronological age (in order to avoid the expected trivial correlation by age).
Validation of the epigenetic mitotic clock in cancer tissue: construction of an mRNA expression based-mitotic index
Because cell division rates are altered in cancer, we validated the mitotic nature of the age-correlative scores pcgtAge and hypoAge in cancer samples by comparison of these scores to an mRNA expression-based mitotic index. This mitotic index was constructed by first taking the interConstruction of the epiTOC model of genes in the cell proliferation cluster of Ben-Porath et al.  and those of the proliferation signature of Rhodes et al. . This resulted in nine genes (CDKN3, ILF2, KDELR2, RFC4, TOP2A, MCM3, KPNA2, CKS2, and CDC2). The mitotic index was then defined as the average mRNA expression over these nine genes. This mRNA expression-based mitotic index was validated in 15 cancer types of TCGA by demonstrating that it is significantly increased in each cancer type compared to its corresponding normal tissue type. We verified that it was a more reliable mitotic index than PCNA expression (not shown).
Cancer and pre-cancerous Illumina 450 k datasets
We downloaded and processed level 3 Illumina 450 k and RNA-SeqV2 data from TCGA , as described by us previously . In total, we considered 15 cancer types: BLCA (bladder adenocarcinoma, nN = 19, nC = 204), BRCA (breast adenocarcinoma, nN = 81, nC = 652), COAD (colon adenocarcinoma, nN = 38, nC = 272), ESCA (esophageal carcinoma, nN = 15, nC = 126), HNSC (head and neck squamous cell carcinoma, nN = 45, nC = 402), KIRC (kidney renal cell carcinoma, nN = 160, nC = 299), KIRP (kidney renal papillary carcinoma, nN = 45, nC = 196), LIHC (liver hepatocellular carcinoma, nN = 47, nC = 176), LSCC (lung squamous cell carcinoma, nN = 41, nC = 275), LUAD (lung adenomacarcinoma, nN = 32, nC = 399), PAAD (pancreatic adenoma carcinoma, nN = 10, nC = 146), PRAD (prostate ademoma carcinoma, nN = 48, nC = 278), READ (rectal adenoma carcinoma, nN = 7, nC = 95), THCA (thyroid carcinoma, nN = 53, nC = 489), and UCEC (uterine cervix endometrial carcinoma, nN = 34, nC = 374).
We used Illumina 450 k DNAm data from the three previous publications and which had profiled precursor cancer lesions or normal-adjacent tissue in addition to normal samples. Briefly, these datasets were: (i) a dataset of normal lung and lung carcinoma in situ (LCIS) samples, with a subset of these progressing to an invasive lung cancer, previously described in ; (ii) a dataset of normal breast tissue and ductal carcinoma in situ (DCIS) samples, with a subset of these progressing to an invasive breast cancer, previously described in ; (iii) a dataset of 50 normal breast tissue samples, 42 matched normal-adjacent breast tumor pairs, and an additional 263 breast cancers, previously described in .
Age-correlative scores pcgtAge and hypoAge were estimated as average DNAm levels over the corresponding CpG sites in all of these samples.
Buccal tissue Illumina 450 k set
Illumina 450 k DNAm profiles were generated for buccal samples from 790 women, all aged 53 at the time of sampling, as described by us previously . For a subset of 152 women, there were matched buccal–blood samples. We used the normalized data as used in our previous publication.
Non-TCGA cancer tissue and ENCODE cell line DNAm datasets
Illumina 450 k DNAm data for 32 glioblastoma multiformes (GBM) were downloaded from the Gene Expression Omnibus (GEO; accession number GSE30338)  and normalized with BMIQ. Illumina 450 k data for 215 ovarian cancers was processed and normalized with BMIQ as described by us previously (GEO: GSE74845) . Illumina 27 k DNAm data for a total of 49 cervical cancer epithelial samples were processed and normalized as described by us previously (GEO: GSE30759) . Cell line Illumina 450 k DNAm data for 62 cell lines was obtained from ENCODE via GEO (GSE40699). These data were subsequently normalized with BMIQ.
Purified T-cell, B-cell and monocyte Illumina 450 k sets
Illumina 450 k profiling was performed on 100 purified CD19+ B-cell samples, 98 CD4+ T-cell samples, and 104 CD14+/CD16− monocytes from a total of 52 monozygotic twins discordant for type 1 diabetes, as described by us previously (Paul D et al: Increased DNA methylation variability in type 1 diabetes across three immune effector cell types, submitted) Here we only used the samples from the healthy controls, amounting to 50 B-cell, 49 T-cell, and 52 monocyte samples. Across all cell types, the mean cell purity was 90 %. The Illumina 450 k data were processed as described (Paul D et al: Increased DNA methylation variability in type 1 diabetes across three immune effector cell types, submitted) and are available from the European Genome-phenome Archive (EGA; https://www.ebi.ac.uk/ega/) under accession number EGAS00001001598.
In addition, we used Illumina 450 k data of an independent set of purified CD4+ T-cell (n = 214) and monocyte (n = 1202) samples, as generated by the MESA study . These data were downloaded from the GEO (GSE56046 and GSE56581). Intra-array normalization was performed with BMIQ.
Stem cell Illumina DNAm sets
We downloaded normalized Illumina 27 k data for two sets of stem cell-like cell populations. One set had profiled eight mesenchymal stem cell (MSC) samples (all of the same low passage number of 2) collected from the bone marrow of eight donors of widely different ages . Data were obtained from the GEO (GSE17448). Another set consisted of 12 CD34+ hematopoetic progenitor cell (HPC) samples collected from either cord blood (n = 7) or mobilized peripheral blood from adults (n = 5, age range 28 to 41 years) . Data were obtained from EBI’s ArrayExpress repository (E-MTAB-487).
The authors wish to thank the Chinese Academy of Sciences, Shanghai Institute for Biological Sciences and the Max-Planck Society for financial support. This work was also supported by NSFC (National Science Foundation of China) grants, grant numbers 31571359 and 31401120, and by a Royal Society Newton Advanced Fellowship (NAF project number 522438, NAF award number 164914). The results shown here are in part based upon data generated by TCGA Research Network (http://cancergenome.nih.gov/), to which we are very grateful. This project has also received funding from the European Union’s FP7 programme under grant agreement number 316758 (Project EpiTrain), the European Union’s Horizon 2020 research and innovation programme under grant agreement number 634570, and The Eve Appeal and the Medical Research Council (MC_UU_12019/1).
Availability of data and materials
Data: The main Illumina DNA methylation datasets used here are freely available from public repositories, including GEO (http://www.ncbi.nlm.nih.gov/geo/), ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) and EGA (https://www.ebi.ac.uk/ega/home) (see also “Methods” for relevant references). Details: whole blood discovery (450 k, n = 656, GEO: GSE40279); whole blood validation (450 k, n = 335, GEO: GSE42861); glioblastoma multiforme (450 k, n = 32, GEO: GSE30338); ovarian cancer (450 k, n = 215, GEO:GSE74845); cervical cancer (27 k, n = 49, GEO: GSE30759). Cell lines (450 k, n = 62, GEO:GSE40699); purified blood cells (450 k, 50 purified CD19+ B-cell samples, 49 CD4+ T-cell samples, and 52 CD14+/CD16− monocytes from a total of 52 healthy monozygotic twins, EGA: EGAS00001001598); MESA (450 k, 214 purified CD4+ T-cell and 1202 monocyte samples, GEO: GSE56046 and GSE56581). Mesenchymal stem cells (27 k, n = 8, GEO:GSE17448). Hematopoietic progenitor cells (27 k, n = 12, ArrayExpress: E-MTAB-487).
Statistical analyses were done by AET. YZ contributed analyses on TCGA data. The study was conceived and designed by AET. The manuscript was written by AET. SB and MW contributed useful feedback. AW, DK, SCZ, VR, and DL contributed data. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
This study only analyses existing publicly available data.
- 31.Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, Reinius L, Acevedo N, Taub M, Ronninger M, et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol. 2013;31:142–7.CrossRefPubMedPubMedCentralGoogle Scholar
- 47.Martino D, Loke YJ, Gordon L, Ollikainen M, Cruickshank MN, Saffery R, Craig JM. Longitudinal, genome-scale analysis of DNA methylation in twins from birth to 18 months of age reveals rapid epigenetic change in early life and pair-specific effects of discordance. Genome Biol. 2013;14:R42.CrossRefPubMedPubMedCentralGoogle Scholar
- 68.Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci U S A. 2004;101:9309–14.CrossRefPubMedPubMedCentralGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.