Background

Smoking is well-known for its adverse health effects [1]; however, between 10 and 35 % of people still smoke daily worldwide [2]. Despite established evidence of the causal relationships between smoking and elevated risk of diseases including cancers [3] and pulmonary [4] and cardiovascular diseases [5], the underlying mechanisms are not completely understood. One proposed mechanism is through DNA methylation.

DNA methylation, a type of epigenetic modification, plays a key role in regulating gene expression [6]. Unlike DNA sequence, methylation has cell-type and tissue-specific characteristics. DNA methylation can be impacted by age [7], gender [8], and exposures such as obesity [9] and smoking [10].

At least 16 epigenome-wide association studies (EWASs) of the association between smoking and blood DNA methylation in adults have been published [11–26]. Only one study was conducted in an East Asian population [26]; most have been conducted in populations of European ancestry with others in African American, Arab, and South Asian populations. There is no study in Koreans. There are few data where reported smoking has been biochemically validated [11, 21, 25] or where methylation has been evaluated in relation to quantitative biomarkers of smoking [21, 27], pack-years, or duration of smoking cessation [22–25]. Only one EWAS correlated differential methylation in blood with gene expression in lung tissue, and only one locus was examined in 10 individuals [19].

The published EWASs of smoking have identified individual differentially methylated probes (DMPs) rather than differentially methylated regions (DMRs). Identification of DMRs associated with an exposure can provide stronger evidence for causality than single DMPs [28]. In addition, DMR analysis is statistically more powerful for detection of association with disease traits or exposures [29].

To identify both DMPs and DMRs in relation to smoking, we conducted an EWAS in 100 adults from a Korean chronic obstructive pulmonary disease (COPD) cohort using the Infinium HumanMethylation450 BeadChip (450k). For the DMPs of genome-wide significance, we investigated their relationship with smoking intensity (urine cotinine) and cumulative smoking (pack-years) in current smokers and duration of smoking cessation in former smokers. As a replication look-up, we also evaluated association between methylation and smoking at previously published probes in our data. For the loci to which significant DMPs or DMRs mapped, we examined differential transcriptome profiles in relation to pack-years in lung tissue from a separate population—188 smokers from the Asan Biobank [30].

Methods

Study participants and exposure to cigarette smoking: the Korean COPD cohort

We aimed to compare methylation in current and former smokers separately to never smokers. For this purpose, we measured DNA methylation in 100 of 190 participants in a Korean COPD cohort [31]. Of the 100 participants, 60 had COPD and 40 were without COPD. The breakdown by smoking was 39 never, 30 former, and 31 current smokers. Subjects were recruited from a rural area in Korea. Participants were selected for methylation profiling if clinical information, computed tomography (CT) data, a survey questionnaire, and blood and urine samples were all available. Additional approximate frequency matching on age and smoking status was applied. Details of the COPD cohort have been published [31]. All study participants completed a questionnaire and provided both blood and urine samples. Urine samples were collected at the participants’ baseline visits. Fresh morning urine samples were obtained at approximately the same time as blood sampling and were stored frozen at −70 °C. Height (cm) and weight (kg) were measured twice for each participant using a body composition analyzer IOI 353 (Aarna Systems, Udaipur, India); the average of the two measurements was used for further analyses. Body mass index (BMI, kg/m²) was calculated by dividing the weight (kg) by the square of the height (m²).

Self-reported smoking status—current, former, and never smoking—was obtained from the questionnaire, and the current status of non-smoking versus smoking was confirmed by urine cotinine levels (nmol/L) measured by immunoassay (Immulite 2000 Xpi; Siemens, NY, USA). One self-reported never smoker was re-assigned to current smoker based on a urine cotinine level of 16,909 nmol/L, higher than our cut-point for current smoking status of 283 nmol/L [32]. Smokers provided the duration (years) and amount (cigarette packs) of cigarette smoking. Pack-years were calculated by multiplying the number of smoked cigarette packs per day by the number of years smoked. Duration of smoking cessation (years) was reported by former smokers.
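The exposure definitions above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and we assume that a urine cotinine level strictly above the 283 nmol/L cut-point reclassifies a participant as a current smoker.

```python
COTININE_CUTPOINT_NMOL_L = 283.0  # cut-point for current smoking given in the text


def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Cumulative exposure: packs smoked per day times years smoked."""
    return packs_per_day * years_smoked


def validated_status(self_report: str, cotinine_nmol_l: float) -> str:
    """Return 'current' when cotinine exceeds the cut-point, else the self-report.

    Assumption: the comparison is strict (>); the text only says "higher than".
    """
    if cotinine_nmol_l > COTININE_CUTPOINT_NMOL_L:
        return "current"
    return self_report
```

With these definitions, the self-reported never smoker with a urine cotinine level of 16,909 nmol/L is returned as `"current"`, matching the re-assignment described above.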

Genomic DNA preparation and DNA methylation profiling

We used blood DNA samples from participants’ baseline visits for methylation profiling. The DNA quality was checked with a spectrophotometer (NanoDrop® ND-1000 UV-vis), and genomic DNA was diluted to 50 ng/μl using Quant-iT PicoGreen (Invitrogen, Carlsbad, CA, USA). Bisulfite-conversion using EZ DNA methylation kit (Zymo Research, Irvine, CA, USA) was carried out according to the manufacturer’s protocols.

The Infinium HumanMethylation450 BeadChip (Illumina, Inc., San Diego, CA, USA) was used for our genome-wide methylation profiling. The methylation value (β)—a ratio between methylated probe intensity and total probe intensity—is interpreted as the proportion of methylation and ranges between 0 (unmethylated) and 1 (methylated). The signal extraction and normalization using Beta MIxture Quantile dilation (BMIQ) [33] were conducted in ChAMP [34]. The ComBat [35] method was applied to adjust for batch effects. Cell-type composition was estimated by Houseman’s algorithm [36] in minfi [37]. Cytosine-phosphate-guanine (CpG) probe filtering criteria [38] were applied to eliminate sources of possible false positive results, excluding probes that had a detection p value above 0.01 in any sample; had a bead-count less than 3 in 5 % or more of samples; were non-CpG probes; or were non-specific probes [39]. To minimize the effects of extreme outliers at each probe on association results, methylation values outside three times the interquartile range (IQR) from the first and third quartiles were removed from the analyses. Of all beta values across all participants, 75,549 (0.19 %) were removed. Probes mapping to the X or Y chromosomes were removed [40]. Therefore, a total of 402,508 CpG probes were used in our EWAS.
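The β definition and the outlier rule above can be illustrated as follows. This is a sketch under stated assumptions, not the pipeline actually used: the total intensity is taken as methylated plus unmethylated with no stabilizing offset, the quartile convention (`method="inclusive"`) is our choice, and the function names are hypothetical.

```python
import statistics


def beta_value(methylated: float, unmethylated: float) -> float:
    """Proportion methylated: methylated intensity over total intensity.

    Assumption: total = methylated + unmethylated, with no offset term.
    """
    return methylated / (methylated + unmethylated)


def mask_outliers(values, k=3.0):
    """Replace values more than k * IQR below Q1 or above Q3 with None.

    Applied per probe across samples; k = 3 mirrors the rule in the text.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v if low <= v <= high else None for v in values]
```

Masking individual values rather than dropping whole probes or samples keeps the rest of the data at that probe available for the regression.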

Statistical approach

We used methylation β values because they are more easily interpretable as methylation changes than M values [41], which are the log2 ratio of methylated to unmethylated probe intensity. To identify smoking-associated DMPs, we tested methylation levels (response) for association with smoking exposure status (predictor) using robust linear regression. We adjusted for COPD status because of the way subjects were selected and for age, sex, BMI, and estimated cell-type composition. Never smokers served as the reference group. The regression analysis and empirical Bayes approach were performed using Linear Models for Microarray data (limma) [42]. For genome-wide significance, we set the threshold of false discovery rate (FDR) [43] adjusted p < 0.05. All results in this study are methylation differences in current smokers compared to never smokers unless otherwise noted.
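Two quantities in this paragraph are easy to make concrete: the relationship between β and M values, and the FDR adjustment. The sketch below assumes that β is computed without an offset (so that M = log2(β / (1 − β))) and that the FDR adjustment is the Benjamini-Hochberg step-up procedure; both are our assumptions, and the function names are hypothetical.

```python
import math


def beta_to_m(beta: float) -> float:
    """M value implied by a beta value, assuming beta = meth / (meth + unmeth)."""
    return math.log2(beta / (1.0 - beta))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A probe would then be called genome-wide significant when its adjusted value is below 0.05.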

In addition to association analyses at individual probes, we applied two different methods, DMRcate [44] and comb-p [45], to detect regional methylation alterations. These methods can identify significant DMRs even when there is a lack of genome-wide significance at the individual probe level. A DMR does not need to contain a DMP of genome-wide significance. DMRs were calculated based not on raw methylation data but on the association results.

The DMR methods work in slightly different ways. DMRcate identifies DMRs using tunable kernel smoothing of association signals across the human genome. We used the “dmrcate” function in the DMRcate R package with an input file containing regression coefficients, standard deviations, and unadjusted p values for each probe from our EWAS of current smoking. In detail, DMRcate re-calculates p values at individual CpGs after Gaussian smoothing within a predefined bandwidth (a distance along the genome) using the Satterthwaite method [46], corrects the p values for multiple testing, and combines information from nearby significant CpGs within the bandwidth. In contrast, comb-p identifies regional enrichments of low p values from unevenly spaced p values. It utilizes only the unadjusted p value and chromosomal location of each probe. It performs the Stouffer-Liptak-Kechris (slk) correction to adjust for adjacent p values after calculating auto-correlation, identifies regions of enrichment, generates Stouffer-Liptak region-corrected p values for each region, and performs Sidak [47] multiple-testing correction.
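The two generic building blocks named for comb-p, Stouffer-Liptak combination and Sidak correction, can be sketched as below. This is not the comb-p implementation: it treats the p values as independent and omits the auto-correlation weighting that the slk step adds, and the number of tests passed to the Sidak step is left as an input. Function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def stouffer_liptak(pvalues, weights=None):
    """Combine p values through a weighted sum of z-scores.

    Assumes independence; each p value must lie strictly between 0 and 1.
    """
    if weights is None:
        weights = [1.0] * len(pvalues)
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in pvalues]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sum(w * w for w in weights) ** 0.5
    return 1.0 - _STD_NORMAL.cdf(numerator / denominator)


def sidak(p: float, n_tests: float) -> float:
    """Sidak-corrected p value for n_tests independent tests."""
    return 1.0 - (1.0 - p) ** n_tests
```

The sketch shows why a region of tightly spaced, individually modest p values can reach a small combined p value: the z-scores add up while the denominator grows only with the square root of the number of probes.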

We defined significant DMRs as those (1) containing at least two probes, (2) combining information from probes residing within 1000 base pairs (bp), and (3) having multiple-testing corrected p < 0.01 (FDR for DMRcate and Sidak p for comb-p). The first two values (the minimum number of CpGs in a region and the distance within which probes are combined) were the defaults in DMRcate [48], so we used the same values for comb-p to compare results from the two approaches. One DMR study using comb-p set the minimum number of probes to 2 and reported DMRs at Sidak p < 0.05 [49]. We used a stricter cutoff for multiple-testing correction (adjusted p < 0.01) for statistical significance because these methods have been updated and there is no consensus on the threshold. Relevant parameters for DMR calling can be found in Additional file 1: Table S2. We considered that the same region was identified as differentially methylated by the two methods if the start (bp) or end (bp) site was the same or if a region identified by one of the two methods resided inside a region identified by the other.
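The matching rule for deciding that the two methods found the same region can be written as a small predicate. This is an illustrative sketch; the `Region` type and function name are hypothetical, and we assume that regions on different chromosomes never match.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # bp
    end: int    # bp


def same_region(a: Region, b: Region) -> bool:
    """True if the regions share a start or end, or one lies inside the other."""
    if a.chrom != b.chrom:
        return False
    if a.start == b.start or a.end == b.end:
        return True
    a_inside_b = b.start <= a.start and a.end <= b.end
    b_inside_a = a.start <= b.start and b.end <= a.end
    return a_inside_b or b_inside_a
```

Under this rule, the AHRR regions reported in the Results (chr5:373378–374425 from DMRcate and chr5:373378–373887 from comb-p) count as the same region because they share a start site, whereas two regions that only partially overlap do not.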

We evaluated whether the genome-wide significant (FDR < 0.05) differential methylation patterns seen in current smokers relative to never smokers were also seen in former compared to never smokers. Therefore, in the former smokers, we adjusted for 108 tests to determine look-up level replication (FDR < 0.05). In addition, we examined the dose-response relationships between methylation levels and quantitative indexes of smoking exposure: urine cotinine levels (nmol/L), pack-years in current smokers, and time since smoking cessation (years) in former smokers by using the Spearman correlation. For the dose-response analyses, we used nominal statistical significance (unadjusted p < 0.05) to report our findings.
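For readers unfamiliar with the Spearman correlation used in the dose-response analyses, the following pure-Python sketch computes it as the Pearson correlation of average ranks. It is an illustration only (the analyses themselves were run in R), and the function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation; undefined (division by zero) if x or y is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks enter the calculation, the result is unaffected by the skewed scale of measures such as urine cotinine.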

We also examined the association with current smoking for the 192 CpGs reported more than once in the 16 published studies based on either Illumina Infinium HumanMethylation27 BeadChip or 450k array. Of these 192, 178 CpGs were checked for association after probe filtering in our data. The cutoff for statistical significance was set to FDR adjusted p < 0.05 after correcting for 178 tests.

All statistical analyses were performed in R (version 3.0.2) [50] except for comb-p [45]. The gene annotation for each probe was based on the manufacturer’s annotation file [51].

We used coMET [52] to visualize regional methylation patterns in the top four DMRs (adjusted p < 1.0E−10 at both analyses). In addition to gene names and regulatory elements of the region from Ensembl, Digital DNaseI Hypersensitivity Clusters from ENCODE (DNase Cluster) and chromatin state segmentation by HMM from ENCODE/Broad (Broad ChromHMM) were added (Additional file 2: Figure S2).

Enrichment and functional network analysis

We performed an enrichment analysis to examine whether the significant DMPs (FDR < 0.05) were over- or under-represented, compared to all probes from the 450k array, in several biological features from the Illumina annotation file. The hypergeometric test (two-sided doubling mid-p) was used for the evaluation of enrichments or depletions.
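A two-sided doubling mid-p value for the hypergeometric test can be sketched as follows. We assume the usual definition: the mid-p tail counts half of the probability of the observed count, and the two-sided value is twice the smaller tail, capped at 1. The function names and argument names are hypothetical, and this is not necessarily how the published analysis was coded.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(X = k) when drawing `draws` items without replacement."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def two_sided_mid_p(k_obs, population, successes, draws):
    """Doubling mid-p: 2 * min(lower mid-p tail, upper mid-p tail), capped at 1."""
    k_min = max(0, draws - (population - successes))
    k_max = min(draws, successes)
    pmf = {k: hypergeom_pmf(k, population, successes, draws)
           for k in range(k_min, k_max + 1)}
    p_obs = pmf[k_obs]
    lower = sum(p for k, p in pmf.items() if k < k_obs) + 0.5 * p_obs
    upper = sum(p for k, p in pmf.items() if k > k_obs) + 0.5 * p_obs
    return min(1.0, 2.0 * min(lower, upper))
```

In the enrichment setting, `population` would be the number of probes tested, `successes` the number of those probes carrying the annotation (for example, CpG island shore), `draws` the number of significant DMPs, and `k_obs` the number of significant DMPs with the annotation.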

For biological insights into differential methylation changes in relation to current smoking, we implemented a functional network analysis. Genes annotated from selected DMPs (FDR < 0.10) were included in the analysis. We used a core analysis of Ingenuity Pathway Analysis (Ingenuity Systems, Inc., Redwood City, CA, USA).

Transcriptome analysis: Asan Biobank

Transcriptome profiles from the lung tissues of 188 male smokers from the Asan Biobank were used in this analysis. Details of transcriptome profiling using RNA-seq (HiSeq 2000 system, Illumina, Inc., San Diego, CA, USA) have been published [30]. The data are available from the NCBI Gene Expression Omnibus (GEO) under accession number GSE57148. To exclude the potential impact of extreme values, we filtered out gene expression values more than three times the IQR outside the first and third quartiles of each gene transcript. Of all gene expression values across all participants, 35,607 (1.1 %) were removed. We calculated pack-years from the duration (years) and amount (cigarette packs) of cigarette smoking.

To identify differentially expressed genes in relation to smoking intensity (pack-years), we applied a robust linear regression model and empirical Bayes approach by using limma [42]. For robust linear regression, gene expression levels were the response and pack-years the predictor. We present nominally significant results to help interpret the relationship between methylation in blood and gene expression in lung tissue.

Results

The descriptive characteristics of the study populations are shown in Table 1. The study participants were aged 53 to 84 years. There were 39 never, 30 former, and 31 current smokers. Among the never smokers, 6 were male and 33 were female. The former smokers were all male. There was one female current smoker. Individuals diagnosed with COPD were represented in each smoking group as follows: 19 in the never, 20 in the former, and 21 in the current smoking group. The average BMI was 23.2 kg/m² for never smokers, 23.5 kg/m² for former smokers, and 22 kg/m² for current smokers. The duration of smoking cessation in former smokers ranged from 7 to 40 years. There were no significant differences in age, BMI, or proportion of COPD cases across smoking groups in our EWAS data.

Table 1 Descriptive characteristics of the study population

We identified 108 significant DMPs in current smokers compared to never smokers (FDR < 0.05) (Table 2, Additional file 3: Table S1, and Additional file 4: Table S3). Of these, nine were significant after Bonferroni correction (unadjusted p < 1.2E−07 correcting for 402,508 tests). Of the FDR-significant DMPs, 93 were novel and 15 were previously reported in EWASs of smoking. Decreased methylation in current smokers was observed at 85 % of the significant DMPs. The methylation differences between current and never smokers at significant CpGs ranged from −20.3 to 15.6 %. The most statistically significant probe was a CpG well-known for its association with smoking: cg05575921 (FDR = 2.6E−07) in aryl-hydrocarbon receptor repressor (AHRR). Of the remaining four probes in the top five, three were novel: cg10664184 (FDR = 1.80E−05) in DDA1, cg20723792 (FDR = 6.40E−05) in FAM53B, and cg24780263 (FDR = 0.001) in ALDOA. The fourth, cg05951221 (FDR = 8.50E−04), located 12,850 base pairs (bp) from ALPPL2, was previously reported. At five loci, more than one DMP at genome-wide significance was identified: AHRR (3 probes), 2q37.1 near ALPPL2 (2 probes), MYO1G (2 probes), NKX2-3 (2 probes), and FAM82A2 (2 probes). The genomic inflation factor (lambda) was 1.25. A Manhattan plot and a QQ plot are provided (Additional file 5: Figure S1).
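The Bonferroni threshold quoted above follows directly from the number of probes tested, and the genomic inflation factor can be computed from the EWAS p values. The sketch below assumes the conventional definition of lambda (the median 1-df chi-square statistic implied by the two-sided p values, divided by the median of the 1-df chi-square distribution); the text does not state how lambda was computed, and the function names are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a 1-df chi-square distribution, about 0.455.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold controlling family-wise error at alpha."""
    return alpha / n_tests


def genomic_inflation(pvalues):
    """Lambda: median observed chi-square over its expected median under the null.

    Each p value must lie strictly between 0 and 1.
    """
    chi2 = [_STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN
```

For 402,508 tests at alpha = 0.05, `bonferroni_threshold` gives about 1.24E−07, consistent with the cutoff reported above; p values that are uniform under the null give a lambda close to 1.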

Table 2 Top 30 CpGs differentially methylated in blood DNA in relation to current smoking compared to never smoking (FDR < 0.05, ordered by chromosomal location)

For our 108 significant DMPs, we found enrichment of probes mapping to CpG island shores (35 versus 23 % overall from the array, p = 0.002) and enhancers (29 versus 21 % overall from the array, p = 0.04). No significant over- or under-representation of probes in promoter-associated regions (19 versus 19 % overall, p > 0.05) or DNase hypersensitivity sites (18 versus 12 % overall, p > 0.05) was detected.

From the two different DMR analyses, we discovered 249 significant (FDR < 0.01) DMRs from DMRcate, 102 significant (Sidak p < 0.01) DMRs from comb-p, and 87 significant based on both approaches (Table 3). Of the 87 regions significant using both methods, 66 were novel, meaning never reported in previous EWASs of smoking in adults, including 7 that contained one of our genome-wide significant individual DMPs. Among those 87 DMRs, the most significant one (chromosome:start position-end position) from DMRcate was chr5:373378–374425 (FDR = 4.6E−17) in AHRR. This region contains five probes (cg05575921, cg22103736, cg08714121, cg04141806, and cg22356527), including our top-ranked DMP. AHRR differential methylation was also observed from comb-p with two probes (cg05575921 and cg22103736) in a slightly shorter region (chr5:373378–373887; Sidak p = 4.8E−05) than that from DMRcate. The most significant DMR overall from comb-p was chr6:149805995–149806732 (Sidak p = 1.9E−14) in ZC3H12D, and the exact same region, meaning the same start, end, and number of probes, was also observed from DMRcate (FDR = 2.3E−15) (Table 3). This region did not contain a genome-wide significant DMP. Among novel DMRs, the top two regions from both analyses were chr4:81117647–81119473 (FDR = 6.7E−13 from DMRcate; Sidak p = 2.9E−13 from comb-p) at PRDM8 including 11 probes and chr4:103940711–103941300 (FDR = 6.8E−14 from DMRcate; Sidak p = 2.7E−10 from comb-p) at SLC9B1 including 11 probes. Details of the top five DMRs from each method are in Additional file 6: Table S4. Those regions contain either one or two highly significant CpGs or tightly spaced CpGs of nominal statistical significance. The average (standard deviation, SD) distance between neighboring CpGs in those regions was 147 (153) bp for DMRcate and 158 (169) bp for comb-p.

Table 3 Differentially methylated regions in blood DNA in relation to current smoking compared to never smoking (multiple-testing corrected p < 0.01 at DMRcate and comb-p, ordered by chromosomal location)

Among the 108 significant DMPs from the comparison of current to never smokers, 104 were also significant in the former to never smoker comparison (FDR < 0.05, look-up level replication) and had effects in the same direction (Additional file 7: Table S5). The attenuation in effect size in former compared with current smokers ranged from −12.3 to 4.3 %. The top-ranked DMP in former smokers compared to never smokers was cg20723792 (FDR = 1.3E−2) in FAM53B, a locus at which no association between DNA methylation and smoking exposure has been previously reported.

We examined dose-response relationships between methylation levels and quantitative measures of smoking exposure (urine cotinine levels and pack-years in current smokers and duration of smoking cessation in former smokers) for the 108 significant DMPs identified in our EWAS of current smoking (Table 4). There was no significant finding after FDR multiple-testing correction. Urine cotinine levels were positively correlated at nominal levels of significance (uncorrected p < 0.05) with methylation levels at a probe in MTNR1A and negatively correlated with methylation levels at five probes from five different loci: GNG12; GPR15; AHRR; FAM82A2; and F2RL3. Pack-years in current smokers showed positive correlation at five loci and negative correlation with methylation levels at one locus. Duration of smoking cessation in former smokers was positively correlated at nominal significance (p < 0.05) with methylation levels at seven loci and negatively correlated with methylation at one locus.

Table 4 CpGs differentially methylated in relation to smoking status also related to quantitative measures of smoking (p correlation <0.05, ordered by chromosomal location)

Our analysis of differential gene expression in lung tissue was conducted in 188 male smokers from a separate study, the Asan Biobank. The average age was 64.2 (SD = 8.7) years and average pack-years was 42.0 (SD = 20.6) (Table 1). Of the 174 genes to which the 108 DMPs or 87 DMRs that were significantly differentially methylated were annotated, we had gene transcript profiles for 143. Of these, 20 genes, annotated from 17 DMPs or eight DMRs, showed nominally significant differential gene expression profiles (p < 0.05) in relation to pack-years (Table 5). Fourteen of the 20 genes were novel loci for effects of smoking on methylation and six—GPR15, AHRR, ELMO1, SNED1, LPP, and GNA12—were previously reported in EWASs of smoking. No significant results were observed after FDR multiple-testing correction.

Table 5 Differential methylation in relation to current smoking for genes with transcripts differently expressed (p < 0.05) in relation to smoking pack-years (ordered by chromosomal location)

In current smokers compared to never smokers, there were lower methylation levels at 17 DMPs (Table 5). Of those, four CpGs were located in enhancer regions and their corresponding lung tissue gene expression values were positively associated with pack-years in smokers, regardless of whether or not they were located in a CpG island. Four of the 17 were at DNase I hypersensitivity sites (DHS). Three of these were outside of CpG islands and showed a positive association with pack-years in smokers. The remaining site, located on a shelf region of a CpG island, was negatively associated. At four promoter-associated CpGs, we did not find any relationships between methylation levels and gene expression values.

Our functional network mapping involving 221 genes annotated from probes in our EWAS (FDR < 0.10) identified four overrepresented pathways (Additional file 8: Table S6). The top three networks were “gene expression, cellular movement, and embryonic movement,” “cancer, cellular development, organismal injury, and abnormalities,” and “hematological, metabolic, and cardiovascular disease.”

From a replication look-up of 178 CpGs, selected based on significant findings in at least two published EWASs of smoking, we confirmed differential methylation at 70 CpGs (Table 6). All 70 showed the same direction of association as in previous reports. Among the 178 probes from previous EWASs, 83 (47 %) showed nominal (p < 0.05) association in our analysis of current smokers, which is much higher than expected by chance (Kolmogorov p < 2.2E−16). There were also significant differential methylation changes in former smokers at 24 CpGs in 17 loci (Table 6).

Table 6 Look-up in the Korean COPD cohort of CpGs reported in at least two epigenome-wide association studies (70 CpGs at FDR < 0.05, ordered by chromosomal location)

Discussion

This is the second EWAS for smoking exposure in an East Asian population and the first which links differential methylation changes in blood to large-scale differential transcriptome profiles in lung tissue at multiple loci. We discovered novel smoking-associated DMRs as well as DMPs and confirmed previous findings mostly from non-Asian populations. We identified nominally significant correlations in DNA methylation in relation to quantitative measures of smoking: urine cotinine levels, pack-years, and duration of smoking cessation. Differentially expressed genes in relation to smoking intensity in lung tissue support the potential utility of our findings as blood DNA methylation biomarkers for smoking exposure.

We discovered 108 significant DMPs and 87 significant DMRs in relation to current smoking. Fourteen loci were significant from both approaches, nine of which were novel: CALML4, CCND1, FOXK2, LINC01019, NKX2-3, NT5C1A, PRDM8, SPAG17, and SYNGR1. It has been reported that genetic variants in CCND1 and smoking exposure are associated with gastric carcinogenesis [53], nasopharyngeal carcinoma [54], and lung cancer [55] and are useful for lung cancer prediction [56]. PRDM8 encodes a protein that belongs to a conserved family of histone methyltransferases that negatively regulate transcription.

Of the 87 significant DMRs, 32 consisted entirely of CpGs of nominal (p < 0.05) statistical significance. On average, 78 % of CpGs in each identified DMR were nominally significant. Although a DMR does not need to include a genome-wide significant DMP, 14 DMRs contained FDR-significant DMPs. In our analysis of differentially methylated regions, the most highly significant DMRs consist of either one or two highly significant DMPs or closely spaced neighboring CpGs of only nominal statistical significance (Additional file 6: Table S4). Although it has been reported that the two methods that we used to identify DMRs can correct for irregular spacing of probes across the genome [44, 45], we cannot conclude whether these regions reflect true differential methylation or false discoveries driven by array design.

Our EWAS identified 104 DMPs from the analysis of current smokers that were also seen in former smokers compared to never smokers, 93 of which were novel. The methylation differences in current and former smokers compared to never smokers were only slightly attenuated. The persistence of blood DNA methylation changes in former smokers, even after 7 to 40 years of smoking cessation, is notable. Our analysis of duration of smoking cessation in former smokers showed positive correlations at seven loci (IFI16, CLASP1, KTELC1, SPEF2, ACOT13, BSPRY, and FAM82A2), which have not been previously reported in EWASs. We also found a negative correlation at cg25799109 in ARHGEF3, a known smoking-associated CpG [12].

Although there are biomarkers of current smoking, including nicotine and its metabolite cotinine levels in urine, blood, or saliva, biomarkers reflecting past smoking have been lacking. Interestingly, we found that most of the signals for current smoking remained for past smoking. Recent studies suggest that methylation signals are promising biomarkers for both current and lifetime smoking [57] that are related to mortality [58]. Significant methylation alterations in former smokers compared to never smokers from our study can contribute to development of biomarkers for past smoking.

For urinary cotinine, we confirmed previous findings of differential methylation at GNG12, GPR15, F2RL3 [27], and AHRR [21, 27] at nominal statistical significance (p < 0.05), and the negative directions of association were also consistent. We also identified novel positive and negative correlations with methylation levels at MTNR1A and FAM82A2, respectively. Gene-environment interactions of variants in MTNR1A and smoking have been reported in relation to oral cancer [59]. In studies without cotinine measured, differential methylation at loci correlated with cotinine could serve as an objective biomarker to confirm the self-reported current level of smoking. For pack-years, we found correlations with DNA methylation at NT5C1A, ZBTB9, HPX, CCND1, and RNF160, which have not been reported in previous EWASs. Although cg19134728 in JAKMIP3 was previously shown to be differentially methylated in smokers compared to non-smokers [15], its relationship with pack-years in current smokers has not been studied.

To gain some biological insight into the differential methylation from our EWAS, we linked our genome-wide significant results to large-scale transcriptome profiles in lung tissues. We discovered differential gene expressions in relation to pack-years at 20 genes which were mapped from 17 DMPs and 8 DMRs. Our findings include six genes—GPR15, AHRR, LPP, GNA12, CYB561, and SNED1—known for their association with smoking in previous EWASs, but none of these has been identified in transcriptome analyses of pack-years in lung tissue. Only one previous EWAS included smoking-associated differential gene expression at AHRR; that study included lung tissue samples from five smokers and five non-smokers [19].

Our finding of enrichment of significant DMPs in CpG island shores (regions within 2000 bp of a CpG island) is consistent with previous findings of variable DNA methylation in these regions [60], suggesting that methylation in shore regions is more susceptible to environmental factors including smoking.

Our replication look-up confirmed 70 DMPs in the same direction of methylation changes from previous EWASs at strict look-up level significance. Of these, 51 were replicated in one EWAS [26] from a Chinese population. Nineteen were never replicated in an East Asian population. We could not replicate the novel findings identified from the EWAS in Chinese [26].

We had only one female current smoker and six male never smokers. Because of this imbalance, our adjustment for gender may not eliminate potential bias in the smoking results. We identified one EWAS of gender using Illumina’s 450k array [13] in blood DNA (n = 123). In their supplementary table, the authors presented 274 gender-associated CpGs of genome-wide significance (p < 1.07E−07) located in autosomes. None of our 108 smoking DMPs (FDR < 0.05) were among those, suggesting that our top findings do not reflect the gender imbalance.

In our EWAS, we used COPD status as a covariate. The disease status could be a confounding factor. For 108 FDR-significant DMPs related to current smoking, we checked the association between COPD status and DNA methylation under two statistical models. Model 1 included covariates of age, sex, height, and estimated cell-type compositions; model 2 contained additional covariates of smoking status and pack-years. None of our DMPs were statistically significantly associated with COPD under either model (FDR ≤0.05 after correcting for 108 tests). Sixteen CpGs were nominally related to COPD at uncorrected p < 0.05 (Additional file 9: Table S9).

There are limitations and strengths in this study. First, these data were cross-sectional which limits causal inference regarding resolution of effects with cessation of smoking. Second, we do not have a replication dataset from an independent Korean, or similar, population. Therefore, there is a chance of false positives among our novel findings. Third, the study population was drawn from a COPD cohort. Although we adjusted for the disease status in the regression models, the possibility of some type of selection bias could be raised. Fourth, we used blood DNA methylation to examine effects of smoking. The use of blood DNA methylation changes can be limited due to cell- and tissue-specific characteristics of methylation. However, our findings of differential methylation were adjusted for estimated cell-type proportions. We also confirmed differential transcriptome patterns in relation to pack-years in lung tissue at multiple loci.

Our study also has strengths. This is one of the few studies in Asian populations and the first in Koreans. We verified self-reported non-smoking status with urine cotinine values. Underreporting of smoking status in surveys occurs [61], and the resulting nondifferential misclassification could distort association results. We also implemented two DMR approaches to identify significant DMRs in our EWAS. The methodologies for the discovery of DMRs have been developed and revised over several years, and it has been reported that the performance of DMRcate and comb-p was superior to that of other methods [44]. We were also able to examine whether genes with differential methylation in relation to smoking also showed differential transcription in relation to smoking in lung tissue, an important target for smoking-related pathology.

Conclusions

In our study in Koreans, we discovered novel smoking-associated DNA methylation changes in blood and also confirmed many previous findings mostly identified in Caucasians. Observed correlations between methylation levels and quantitative measures of smoking exposures support the utility of blood DNA methylation biomarkers for smoking intensity and history. Our evaluation of differential gene expression profiles of corresponding genes in lung tissues supports the potential functional importance of our methylation findings.