SWAN: Subset-quantile Within Array Normalization for Illumina Infinium HumanMethylation450 BeadChips
DNA methylation is the most widely studied epigenetic mark and is known to be essential to normal development and frequently disrupted in disease. The Illumina HumanMethylation450 BeadChip assays the methylation status of CpGs at 485,577 sites across the genome. Here we present Subset-quantile Within Array Normalization (SWAN), a new method that substantially improves the results from this platform by reducing technical variation within and between arrays. SWAN is available in the minfi Bioconductor package.
Keywords: Probe Type, Renal Clear Cell Carcinoma, Array Normalization, Normal Human Kidney, Probe Body
Abbreviations
DMP: differentially methylated probe
ROC: receiver operating characteristic
RRBS: reduced representation bisulfite sequencing
SWAN: Subset-quantile Within Array Normalization
TCGA: The Cancer Genome Atlas.
DNA methylation, which is the addition of a methyl group to the cytosine of a CpG dinucleotide, is one of the most widely studied epigenetic modifications in human development and disease. Changes in DNA methylation are vital for normal development and differentiation, whilst aberrant methylation is involved in diseases such as diabetes, schizophrenia, multiple sclerosis and cancer [2, 3, 4]. As interest in epigenetics, and particularly DNA methylation, has increased, analysis methods have had to evolve in scale and resolution. Currently, several microarray and next-generation sequencing technologies are available that allow the interrogation of DNA methylation on a genome-wide scale [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. Each of these approaches has inherent strengths and weaknesses, which have been compared and discussed in several recent reviews [15, 16, 17, 18]. As sequencing-based DNA methylation assays become more affordable, it is anticipated that they will be more widely used in this arena; at present, however, they are still too costly for most studies, particularly those that involve large numbers of samples. Consequently, methylation arrays are a popular alternative for high-throughput DNA methylation analyses.
The Infinium I design, which was previously employed on the 27k arrays, uses fluorescence from two different probes, unmethylated (converted) and methylated (unconverted), to assess the level of methylation of a target CpG. If a target CpG was methylated in the sample, the DNA fragment will remain unconverted after bisulfite treatment and will therefore bind to the complementary 'methylated' probe, which terminates at the 3' end with a cytosine. If the target CpG was unmethylated, however, binding will occur to the complementary 'unmethylated' probe, which terminates at the 3' end with a thymine. Binding at either probe is followed by single base extension that results in the addition of a fluorescently labeled nucleotide (Figure 1a). It is assumed that the methylation status of CpGs underlying the 50 bp probe body is correlated to that of the target CpG such that CpGs in the probe body of an unmethylated (converted) probe are also converted, while CpGs in the body of a methylated (unconverted) probe will remain unconverted. By contrast, the Infinium II design uses only a single probe per target CpG, which incorporates a 'degenerate' R-base at any underlying CpG sites in the probe body. The 3' end of each Infinium II probe is complementary to the base directly upstream of the 'C' of the target CpG. Methylation state is detected by single base extension at the position of the 'C' of the target CpG, which always results in the addition of a labeled 'G' or 'A' nucleotide, complementary to either the 'methylated' C or 'unmethylated' T, respectively (Figure 1b).
The Infinium II design is the preferred probe design for the 450k chip. Bibikova et al. demonstrated that the Infinium II probes could have up to three CpG sites underlying their 50 bp probe body without affecting data quality. However, hybridization kinetics and specificity were often compromised in regions of higher CpG density and therefore Infinium I probes are still used to expand the number of CpG sites that can be assayed. Consequently, many of the Infinium I probes contain three or more underlying CpGs, whilst most Infinium II probes have fewer than three underlying CpGs (Figure 1c).
Technical differences between the Infinium I and Infinium II probe types have been observed. Bibikova et al. noted a difference in the β value distributions they produced, where β is defined as the proportion of the total signal coming from the methylated channel. Specifically, they noticed a compression in the β value distribution of Infinium II probes compared to Infinium I. Similarly, Dedeurwaerder et al. reported that the β values obtained from the Infinium II probes displayed a narrower range than those obtained from Infinium I probes and suggested that Infinium II probes are less sensitive for the detection of extreme methylation values due to the two-color detection method used. They suggested a simple scaling of β values for Infinium II probes and reported improved results in terms of validation against bisulfite pyrosequencing data, but also noted potential difficulties in applying this procedure to cancer samples.
Here we present a novel method to normalize between Infinium probe types on the 450k platform. This method derives from normalization methods that have been hugely successful for microarray expression platforms [22, 23, 24]. Specifically, we introduce a Subset-quantile Within Array Normalization (SWAN) method that allows the Infinium I and II probes within a single array to be normalized together. We show that this method substantially reduces the differences in β value distribution observed between Infinium I and II probes. We also demonstrate that this method improves correlation between technical replicates, whilst increasing the number of significantly differentially methylated probes that are detected. SWAN is written in the R programming language and is available in the minfi package from Bioconductor.
Subset-quantile Within Array Normalization
Because the two probe types interrogate different subsets of the genome, established methods for normalization, such as quantile normalization, cannot be applied naively between probe types. Standard quantile normalization makes the distribution of probe intensities for each array in a set of arrays identical. More recently, a subset quantile normalization approach was introduced that uses large sets of control probes on the arrays for normalization and assumes that only the distributions of these control probes remain constant. However, there are no large sets of controls that have probes corresponding to both the Infinium I and Infinium II designs on the 450k platform.
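For readers unfamiliar with standard quantile normalization, the following minimal Python sketch illustrates the idea described above: each array is sorted, the intensities are averaged across arrays at each rank, and every probe receives the mean for its rank. The function name `quantile_normalize` and the simple tie handling (ties are broken by position rather than averaged) are assumptions made for illustration; this is not the implementation used by any particular package.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    `arrays` is a list of equal-length lists, one list per array.
    Assumption: ties are broken by original position, not averaged.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")

    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]

    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


result = quantile_normalize([[5, 2, 3], [4, 1, 6]])
print(result)
```

After normalization the sorted intensities of every array are identical, which is exactly the property that cannot be assumed between Infinium I and II probes as a whole.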
The SWAN method has two parts. First, an average quantile distribution is determined using a subset of probes defined to be biologically similar based on CpG content. This is achieved by randomly selecting N Infinium I and II probes that have one, two and three underlying CpGs, where N is the minimum number of probes in the six sets of Infinium I and II probes with one, two and three probe body CpGs. If no probes have been filtered out - for example, sex chromosome probes, and so on - N = 11,303. This results in a pool of 3N Infinium I and 3N Infinium II probes. Due to the vast differences in their distributions (Figure 2), the subsequent processing is performed independently on both the methylated and unmethylated channels. The subset for each probe type, from each channel (methylated or unmethylated), is sorted by increasing intensity. The value of each of the 3N pairs of observations is then assigned to be the mean intensity of the two probe types for that row or 'quantile'. This is the standard quantile procedure. The second step is to then adjust the intensities of the remaining probes, of which there are many more Infinium II than I, by interpolation onto the distribution of the subset probes. This is done for each probe type separately using linear interpolation between the subset probes to define the new intensities. Consequently, while the distribution of the subset is identical, the intensity distribution of Infinium I probes is still vastly different from the distribution of Infinium II probes (Figure S2 in Additional file 1).
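The two steps above can be sketched in Python for a single channel of a single array. This is a simplified illustration under stated assumptions, not the minfi implementation: the names `swan_channel` and `_interpolate` are hypothetical, interpolation is done directly on the intensity scale, and probes whose intensity lies outside the range of the subset are shifted by their offset from the nearest subset probe (one possible choice that the text does not specify).

```python
import bisect
import random


def _interpolate(values, xs, ys):
    """Map values onto ys by linear interpolation over the sorted xs."""
    out = []
    for v in values:
        if v <= xs[0]:        # below the subset range: keep the offset (assumption)
            out.append(ys[0] + (v - xs[0]))
        elif v >= xs[-1]:     # above the subset range: keep the offset (assumption)
            out.append(ys[-1] + (v - xs[-1]))
        else:
            j = bisect.bisect_right(xs, v)  # xs[j - 1] <= v < xs[j]
            x0, x1, y0, y1 = xs[j - 1], xs[j], ys[j - 1], ys[j]
            out.append(y0 + (y1 - y0) * (v - x0) / (x1 - x0))
    return out


def swan_channel(type1, type2, n_cpgs1, n_cpgs2, seed=0):
    """Adjust Infinium I and II intensities for one channel of one array.

    type1, type2     : intensities of the Infinium I and II probes
    n_cpgs1, n_cpgs2 : number of probe-body CpGs for each probe
    """
    rng = random.Random(seed)
    data = {"I": (type1, n_cpgs1), "II": (type2, n_cpgs2)}
    groups = {
        (label, k): [i for i, c in enumerate(counts) if c == k]
        for label, (_, counts) in data.items()
        for k in (1, 2, 3)
    }
    n = min(len(g) for g in groups.values())  # N in the text
    if n == 0:
        raise ValueError("each probe type needs probes with 1, 2 and 3 CpGs")

    # Step 1: draw N probes per CpG count and probe type (3N per type),
    # sort each subset and average the two probe types at each rank.
    subset = {}
    for label, (values, _) in data.items():
        picked = [i for k in (1, 2, 3) for i in rng.sample(groups[(label, k)], n)]
        subset[label] = sorted(values[i] for i in picked)
    target = [(a + b) / 2 for a, b in zip(subset["I"], subset["II"])]

    # Step 2: interpolate all probes of each type onto the subset distribution.
    return (_interpolate(type1, subset["I"], target),
            _interpolate(type2, subset["II"], target))


new1, new2 = swan_channel(
    [100, 200, 300, 400, 500, 600], [10, 20, 30, 40, 50, 60, 70],
    [1, 1, 2, 2, 3, 3], [1, 1, 2, 2, 3, 3, 0],
)
print(new1, new2)
```

In this toy input every probe with one to three CpGs belongs to the subset, so those probes receive identical values in both probe types, while the single remaining Infinium II probe is adjusted relative to the subset. The procedure would be run once for the methylated channel and once for the unmethylated channel.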
SWAN makes Infinium I and II β value distributions more similar
We applied the SWAN method to the fully methylated (FM), fully unmethylated (FU) and hemi-methylated (HM) sample analyzed by Bibikova et al. The raw data were imported from IDAT files using the minfi package. SWAN was applied to the raw intensity data and β values were calculated using the methylated and unmethylated intensity values for both the raw and SWAN normalized data. Figure 4a shows the raw and SWAN normalized β value distributions for Infinium I and II probes for all three methylation standards. It can be seen that after SWAN the average β value distributions from the two probe types are more consistent, particularly for the FM and FU samples. Furthermore, the absolute difference in the medians of the Infinium I and II β value distributions is reduced after using SWAN for all three standards (Figure 4b; difference in medians: FM-Raw: 0.032, FM-SWAN: 0.012; HM-Raw: 0.057, HM-SWAN: 0.041; FU-Raw: 0.084, FU-SWAN: 0.017). Figure 4c shows an example of how the differences in the two probe types can result in an aberrant overall β value distribution for a normal human DNA sample. Using SWAN, however, corrects the overall distribution (Figure 4c). The SWAN procedure reduces the absolute difference between the peak positions of Infinium I and Infinium II probes at the unmethylated (ΔPU) and methylated (ΔPM) extremes of the distribution (see Materials and methods). For the data shown in Figure 4c, ΔPU is reduced from 0.067 to 0.046, whilst ΔPM remains relatively unchanged from 0.035 to 0.038, resulting in an improved overall β value distribution. Although the changes to the overall β value distribution appear dramatic for some samples, not all samples have large differences in probe type distributions. Usually the changes to the β values of individual CpGs after SWAN are less than ±0.1 (Figure S3 in Additional file 1).
Next we compared the results of methylation analysis for an MCF7 sample that was assayed on the 450k array and the older 27k array. The 27k array only includes probes of the Infinium I design. Of the CpGs interrogated on the 27k array, 25,978 are included on the 450k array but many (91%) of them are now assayed using the Infinium II design, while the remaining sites are still assayed using the Infinium I design. We found that using SWAN made the 450k Infinium I and II β value distributions more similar to those of 27k by reducing the absolute difference in the locations of the peaks at the extremes of the distribution (Figure 6). ΔPU is reduced for 27k compared to 450k Infinium I from 0.021 to 0.018 and Infinium II from 0.0174 to 0.0167, whilst ΔPM is also reduced for 27k compared to 450k Infinium I from 0.085 to 0.067 and Infinium II from 0.04 to 0.013.
SWAN reduces technical variability
SWAN leads to better detection of differential methylation
We have shown that SWAN reduces technical variation between arrays. In order to show that the biological differences of interest have been maintained, we performed a differential methylation analysis between two groups, as reducing technical variation whilst maintaining biological differences should increase the power to detect truly differentially methylated CpGs. We performed a differential methylation analysis comparing three normal human kidney samples to three normal human rectum samples. To evaluate the impact of SWAN on differential methylation analysis, unrelated kidney and rectal mucosa samples analyzed by reduced-representation bisulfite sequencing (RRBS) were used to define a set of 'truly' differentially methylated loci (see Materials and methods).
We performed a second differential methylation analysis comparing three male arrays with two female arrays where each array is a pool of two individuals. The data were processed using three different methods: no normalization (raw), Illumina's control normalization as implemented in minfi and SWAN. Probes on the Y chromosome and probes with a detection P-value >0.01 in one or more samples were excluded, leaving a total of 482,704 probes for differential methylation analysis. Differential methylation analysis was performed using the testing available in minfi on each of the three different versions of the data. Using SWAN consistently resulted in a higher number of significantly differentially methylated probes (DMPs) across a range of q-value thresholds (Figure S6a in Additional file 1). Furthermore, using SWAN facilitated the detection of more unique DMPs (170) than using the other methods (118) (Figures S6b and S7 in Additional file 1). We also found that, as expected, a larger proportion (77.6%) of the unique DMPs detected when using SWAN was from the X chromosome, when compared with the combined set of unique DMPs detected using the other methods (63.6%).
The HumanMethylation450 BeadChip includes a combination of two different probe designs for assaying the methylation status of 485,577 CpG sites across the human genome. This unique design clearly produces technical differences between probe types within a single array. Here we present a new within array normalization method that substantially reduces the technical variability between the probe types whilst maintaining the important biological differences.
The SWAN method makes the assumption that the number of CpGs within the 50 bp probe sequence reflects the underlying biology of the region being interrogated. Hence, the overall distribution of intensities of probes with the same number of CpGs in the probe body should be the same. The method then uses a subset quantile normalization approach to adjust the intensities of the probes on the arrays.
SWAN clearly improves the results obtained from the 450k array. We show that technical variability is reduced, whilst increasing the ability to detect differential methylation between samples. We also report better correlation between the 450k arrays and the 27k arrays, which will be important for studies that aim to combine data from both platforms.
Although further investigations into other aspects of the analysis of these arrays, such as color normalization, between array normalization and statistical testing procedures for differential methylation, may prove beneficial, we feel that SWAN is an essential step in the analysis of the Illumina Infinium HumanMethylation450 BeadChip. SWAN is available in the Bioconductor package minfi.
Materials and methods
The HumanMethylation450 data for the unmethylated, methylated and hemi-methylated reference standards, as well as the NA17105 and NA17018 DNA samples, and the MCF7 and A431 cancer cell lines were obtained from the Illumina website in the raw IDAT format. The HumanMethylation27 MCF7 data were kindly provided by Dr Marina Bibikova (Illumina).
The normal human kidney and rectum methylation data were sourced from The Cancer Genome Atlas (TCGA) Data Portal. Specifically, the normal kidney samples (TCGA-B0-5121-11, TCGA-BP-4177-11 and TCGA-B0-5092-11) were part of the kidney renal clear cell carcinoma cohort, whilst the normal rectum samples (TCGA-AG-3731-11, TCGA-AG-3725-11 and TCGA-B0-5121-11) were from the rectal adenocarcinoma cohort. All the data were in the raw IDAT format.
The RRBS data were obtained from the Epigenomics Roadmap at NCBI. The normal human kidney (NA000003582.1) and normal human primary rectal mucosal tissue (NA000003579.1) samples were both obtained in WIG format, which is a series of base pair positions with corresponding β values for each chromosome.
The data for the male versus female differential methylation comparison comprise a subset of data generated for an unrelated study by Martino et al. Briefly, the five HumanMethylation450 arrays used in this study were hybridized with bisulfite converted DNA pooled from three samples from two male individuals and two samples from two female individuals extracted from mononuclear cells collected at birth. These data were also in the raw IDAT format.
If the normalized intensity of any probe is less than or equal to zero, its intensity is set to the median intensity of the negative control probes.
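The rule above can be expressed as a short Python sketch. The function name `replace_nonpositive` and the example values are hypothetical and serve only to illustrate the rule as stated.

```python
import statistics


def replace_nonpositive(intensities, negative_controls):
    """Set intensities <= 0 to the median of the negative control probes."""
    background = statistics.median(negative_controls)
    return [background if x <= 0 else x for x in intensities]


print(replace_nonpositive([120.0, -3.0, 0.0, 45.0], [80.0, 100.0, 90.0]))
```

Keeping all intensities strictly positive ensures that β values and log-ratios computed afterwards remain defined.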
In order to assess the performance of the method, we calculated the difference in the peak positions of the Infinium I and Infinium II probes. We define PU to be the position of the maximum of the β distribution with β < 0.5 (unmethylated peak) and PM to be the position of the maximum with β > 0.5 (methylated peak). We define the absolute difference in peak positions between Infinium I and Infinium II probes as ΔP = |PI - PII| for both the methylated and unmethylated peaks.
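As an illustration of the peak-difference measure, the sketch below estimates each peak position as the centre of the fullest histogram bin on the relevant half of the β scale and then takes the absolute difference between probe types. The use of a histogram with 50 bins per half, rather than a smoothed density estimate, is an assumption made to keep the example self-contained; the names `peak_position` and `delta_peaks` are hypothetical.

```python
def peak_position(betas, lower, upper, bins=50):
    """Centre of the fullest histogram bin for lower <= beta <= upper."""
    width = (upper - lower) / bins
    counts = [0] * bins
    for b in betas:
        if lower <= b <= upper:
            counts[min(int((b - lower) / width), bins - 1)] += 1
    if not any(counts):
        raise ValueError("no beta values in the requested range")
    best = counts.index(max(counts))
    return lower + (best + 0.5) * width


def delta_peaks(betas_type1, betas_type2, bins=50):
    """Return (delta PU, delta PM): absolute peak differences between probe types."""
    d_pu = abs(peak_position(betas_type1, 0.0, 0.5, bins)
               - peak_position(betas_type2, 0.0, 0.5, bins))
    d_pm = abs(peak_position(betas_type1, 0.5, 1.0, bins)
               - peak_position(betas_type2, 0.5, 1.0, bins))
    return d_pu, d_pm


type1 = [0.052, 0.054, 0.31, 0.912, 0.914, 0.71]
type2 = [0.122, 0.124, 0.33, 0.882, 0.884, 0.73]
print(delta_peaks(type1, type2))
```

Smaller values of ΔPU and ΔPM after normalization indicate that the unmethylated and methylated peaks of the two probe types have moved closer together.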
Selecting RRBS validation data
To identify CpG loci that were interrogated in both the RRBS and HumanMethylation450 data, we first identified a set of CpGs that were assayed in both the kidney and rectum RRBS samples. The resulting list was then intersected with the probe locations of the HumanMethylation450 data. This produced a subset of 18,678 CpG loci.
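The selection described above amounts to two set intersections. A minimal Python sketch follows, assuming each locus is represented as a (chromosome, position) tuple; the function name `shared_loci` and the coordinates are hypothetical.

```python
def shared_loci(kidney_rrbs, rectum_rrbs, array_probes):
    """Loci assayed in both RRBS samples and present on the array."""
    in_both_rrbs = set(kidney_rrbs) & set(rectum_rrbs)
    return sorted(in_both_rrbs & set(array_probes))


kidney = [("chr1", 1000), ("chr1", 2000), ("chr2", 500)]
rectum = [("chr1", 2000), ("chr2", 500), ("chr3", 42)]
array = [("chr2", 500), ("chr1", 2000), ("chr1", 1000)]
print(shared_loci(kidney, rectum, array))
```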
Differential methylation analysis: tissue comparison
IDAT files were loaded into the R (2.14) environment using the Bioconductor (2.9) minfi package (1.0.0). The detection P-values for all probes were then calculated for the data using functionality provided in minfi. Probes on the X and Y chromosomes were removed at this stage. Two versions of the data were used in subsequent analyses: the raw data and SWAN data. Probes with a detection P-value >0.01 in one or more samples were then excluded. The differential methylation analysis was performed for both datasets on the subset of 18,678 probes that overlapped with the RRBS data using the 'dmpFinder' minfi function. The 'dmpFinder' function uses an F-test to identify positions that are differentially methylated between two groups. The tests are performed on M-values (log2(Methylated/Unmethylated)) as recommended in Du et al. Variance shrinkage was used due to the small sample size. In 'dmpFinder', the sample variances are squeezed by computing empirical Bayes posterior means using the limma package. Example R code for performing a differential methylation analysis using minfi can be found in Additional file 2.
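To make the relationship between the two methylation measures concrete, the following Python sketch computes the β value (the proportion of the total signal coming from the methylated channel) and the M-value (log2(Methylated/Unmethylated)) from a pair of intensities. The function names are hypothetical, and no offset is added to the intensities, which is an assumption; it requires both intensities to be strictly positive.

```python
import math


def beta_value(meth, unmeth):
    """Proportion of the total signal coming from the methylated channel."""
    return meth / (meth + unmeth)


def m_value(meth, unmeth):
    """log2 ratio of methylated to unmethylated intensity."""
    return math.log2(meth / unmeth)


def beta_to_m(beta):
    """Convert a beta value (0 < beta < 1) to the equivalent M-value."""
    return math.log2(beta / (1.0 - beta))


meth, unmeth = 4000.0, 1000.0
print(beta_value(meth, unmeth), m_value(meth, unmeth))
```

Because β/(1 - β) equals Methylated/Unmethylated when no offset is used, `beta_to_m(beta_value(m, u))` and `m_value(m, u)` agree up to floating-point error.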
True positives were defined to be CpGs that had an absolute difference in β value >0.25 between the kidney and rectum RRBS samples. Additionally, for the ROC analysis, which was performed using the ROCR package, true negatives were defined as those CpGs found to have an absolute difference in β value <0.05 between the RRBS samples.
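The labelling rule above, together with a simple area-under-the-curve calculation, can be sketched in Python. This is not the ROCR implementation: the area is computed with the rank-sum (Mann-Whitney) identity, and the names `label_loci` and `auc`, the dictionary-based inputs and the choice of score (any statistic where larger values indicate stronger evidence of differential methylation) are assumptions made for illustration.

```python
def label_loci(beta_kidney, beta_rectum, pos_cut=0.25, neg_cut=0.05):
    """Label shared loci as true positives (True) or true negatives (False).

    Loci with an absolute beta difference between the cut-offs are left unlabeled.
    """
    labels = {}
    for locus in beta_kidney.keys() & beta_rectum.keys():
        diff = abs(beta_kidney[locus] - beta_rectum[locus])
        if diff > pos_cut:
            labels[locus] = True
        elif diff < neg_cut:
            labels[locus] = False
    return labels


def auc(scores, labels):
    """Area under the ROC curve; tied scores count as half a win."""
    pos = [scores[k] for k, is_pos in labels.items() if is_pos]
    neg = [scores[k] for k, is_pos in labels.items() if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative locus")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


kidney = {"a": 0.9, "b": 0.5, "c": 0.2, "d": 0.6}
rectum = {"a": 0.1, "b": 0.52, "c": 0.21, "d": 0.45}
labels = label_loci(kidney, rectum)
print(labels, auc({"a": 5.0, "b": 1.0, "c": 0.5, "d": 2.0}, labels))
```

Loci such as "d" in this toy example, whose RRBS difference falls between the two cut-offs, are excluded from the ROC analysis because their true status is ambiguous.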
We thank Kasper Hansen, Martin Aryee and Rafael Irizarry for making their minfi code available and including our methods. We thank Mark Robinson for helpful discussion and critical reading of the manuscript. We also acknowledge Terry Speed, Nadia Davidson, the Cancer and Disease Epigenetics (Saffery) Lab and Early Life Epigenetics (Craig) Lab at the Murdoch Childrens Research Institute (MCRI) for helpful discussion. We thank David Martino and Marina Bibikova for providing access to their published data. We would also like to acknowledge the TCGA Research Network for making their vast resource of genomic data available. This work was supported by the Victorian Government's Operational Infrastructure Support Program to MCRI.
- 1. Rakyan VK, Down TA, Thorne NP, Flicek P, Kulesha E, Gräf S, Tomazou EM, Bäckdahl L, Johnson N, Herberth M, Howe KL, Jackson DK, Miretti MM, Fiegler H, Marioni JC, Birney E, Hubbard TJP, Carter NP, Tavaré S, Beck S: An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs). Genome Res. 2008, 18: 1518-1529. 10.1101/gr.077479.108.
- 4. Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M, Ji H, Potash JB, Sabunciyan S, Feinberg AP: The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat Genet. 2009, 41: 178-186. 10.1038/ng.298.
- 5. Schumacher A, Kapranov P, Kaminsky Z, Flanagan J, Assadzadeh A, Yau P, Virtanen C, Winegarden N, Cheng J, Gingeras T, Petronis A: Microarray-based DNA methylation profiling: technology and applications. Nucleic Acids Res. 2006, 34: 528-542. 10.1093/nar/gkj461.
- 7. Ordway JM, Bedell JA, Citek RW, Nunberg A, Garrido A, Kendall R, Stevens JR, Cao D, Doerge RW, Korshunova Y, Holemon H, McPherson JD, Lakey N, Leon J, Martienssen RA, Jeddeloh JA: Comprehensive DNA methylation profiling in a human cancer genome identifies novel epigenetic targets. Carcinogenesis. 2006, 27: 2409-2423. 10.1093/carcin/bgl161.
- 9. Rauch T, Li H, Wu X, Pfeifer GP: MIRA-assisted microarray analysis, a new technology for the determination of DNA methylation patterns, identifies frequent methylation of homeodomain-containing genes in lung cancer cells. Cancer Res. 2006, 66: 7939-7947. 10.1158/0008-5472.CAN-06-1888.
- 13. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, Edsall L, Antosiewicz-Bourget J, Stewart R, Ruotti V, Millar AH, Thomson JA, Ren B, Ecker JR: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462: 315-322. 10.1038/nature08514.
- 25. minfi: Analyze Illumina's 450k methylation arrays. [http://www.bioconductor.org/packages/release/bioc/html/minfi.html]
- 32. The Cancer Genome Atlas Data Portal. [https://tcga-data.nci.nih.gov/tcga/]
- 33. NCBI Epigenomics. [http://www.ncbi.nlm.nih.gov/epigenomics/]
- 34. Martino D, Maksimovic J, Joo JH, Prescott SL, Saffery R: Genome-scale profiling reveals a subset of genes regulated by DNA methylation that program somatic T-cell phenotypes in humans. Genes Immun. 2012, doi: 10.1038/gene.2012.7.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.