Estimates of array and pool-construction variance for planning efficient DNA-pooling genome wide association studies
Until recently, genome-wide association studies (GWAS) have been restricted to research groups with the budget necessary to genotype hundreds, if not thousands, of samples. Replacing individual genotyping with genotyping of DNA pools in Phase I of a GWAS has proven successful, and dramatically altered the financial feasibility of this approach. When conducting a pool-based GWAS, how well SNP allele frequency is estimated from a DNA pool will influence a study's power to detect associations. Here we address how to control the variance in allele frequency estimation when DNAs are pooled, and how to plan and conduct the most efficient well-powered pool-based GWAS.
By examining the variation in allele frequency estimation on SNP arrays between and within DNA pools we determine how array variance [var(earray)] and pool-construction variance [var(econstruction)] contribute to the total variance of allele frequency estimation. This information is useful in deciding whether replicate arrays or replicate pools are most useful in reducing variance. Our analysis is based on 27 DNA pools ranging in size from 74 to 446 individual samples, genotyped on a collective total of 128 Illumina beadarrays: 24 1M-Single, 32 1M-Duo, and 72 660-Quad.
For all three Illumina SNP array types our estimates of var(earray) were similar, between 3 × 10^-4 and 4 × 10^-4 for normalized data. Var(econstruction) accounted for between 20% and 40% of pooling variance across 27 pools in normalized data.
We conclude that relative to var(earray), var(econstruction) is of less importance in reducing the variance in allele frequency estimation from DNA pools; however, our data suggest that on average it may be more important than previously thought. We have prepared a simple online tool, PoolingPlanner (available at http://www.kchew.ca/PoolingPlanner/), which calculates the effective sample size (ESS) of a DNA pool given a range of replicate array values. ESS can be used in a power calculator to perform pool-adjusted calculations. This allows one to quickly calculate the loss of power associated with a pooling experiment and so make an informed decision on whether a pool-based GWAS is worth pursuing.
Keywords: Allele Frequency; Individual Genotyping; Effective Sample Size; Array Variance; Allele Frequency Difference
Genome-wide association studies (GWAS) have been used to examine over 200 diseases and traits, and have identified over 4000 single nucleotide polymorphisms (SNPs) associated with these traits, as listed in the Catalog of Published Genome-Wide Association Studies. In many cases, GWAS have revealed previously unsuspected molecular mechanisms of disease, highlighting the value of this hypothesis-free approach [reviewed in [2, 3]]. Unfortunately, GWAS are very costly due to the price of genotyping thousands of individual DNA samples on high-density SNP arrays. Consequently, GWAS have only been feasible for research groups with the necessary budget, studying well-funded diseases or traits. A simple strategy to drastically reduce cost is to replace individual genotyping in Phase I of a GWAS with genotyping of DNA pools. DNA pools yield estimated allele frequencies rather than observed genotypes; hence, this step has been called allelotyping. Several studies have provided proof of principle for the pooling strategy, using it to re-discover known disease-variant associations of moderate to large effect size for a fraction of the cost of conventional GWAS [5, 4]. To date, more than twenty pool-based GWAS have been published, many reporting genome-wide significant associations for diseases and traits such as follicular lymphoma, otosclerosis, multiple sclerosis, Alzheimer's disease, melanoma, psoriasis, and skin colour [6, 7, 8, 9, 10, 11, 12]. Depending on the number of samples being pooled, the cost reduction in Phase I can easily reach 100-fold. For example, if a SNP array costs $250 and there are 2000 cases and 2000 controls to genotype, one million dollars is required for Phase I individual genotyping alone. By contrast, a pool-based experiment using 12 replicate arrays on each of two pools (case and control) would cost $6000, or 0.6% of that amount. Simply put, a pooling GWAS is feasible for most grant budgets, while an individual genotyping GWAS is not.
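The cost comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are hypothetical, and the inputs are simply the figures quoted in the text ($250 per array, 2000 cases and 2000 controls, 12 replicate arrays on each of two pools).

```python
def individual_genotyping_cost(n_samples, cost_per_array):
    """One array per individually genotyped sample."""
    return n_samples * cost_per_array


def pooled_genotyping_cost(n_pools, arrays_per_pool, cost_per_array):
    """One array per replicate, for each pool."""
    return n_pools * arrays_per_pool * cost_per_array


# Figures taken from the example in the text.
individual = individual_genotyping_cost(n_samples=2000 + 2000, cost_per_array=250)
pooled = pooled_genotyping_cost(n_pools=2, arrays_per_pool=12, cost_per_array=250)

print(individual)                 # 1000000
print(pooled)                     # 6000
print(100 * pooled / individual)  # 0.6 (percent of the individual-genotyping cost)
```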
The criticism of pool-based GWAS is that they have reduced power relative to conventional GWAS because of errors introduced by estimating allele frequency from DNA pools rather than individual genotyping data. While it is true that pool-based GWAS forfeit some power, these losses can be estimated, are often less than expected, and may not change the associations discovered. Although array costs will continue to drop and conventional GWAS will become more feasible, the potential savings associated with the pooling approach will scale in proportion, leaving more funds for subsequent replication, fine-mapping, and sequencing of associated genomic regions. For diseases or traits with unknown biology or genetic involvement, a pooling GWAS represents an economical way to test for associations with moderate odds ratios. In addition, work using DNA extracted from pooled whole blood suggests that large time savings (50-100 fold) may also be possible, presenting the possibility of an incredibly fast (<1 month) and economical experiment. For a comprehensive introduction and review of DNA pooling readers are directed to Sham et al. 2002 and Pearson et al. 2007 [13, 4], and for a set of best practices for any GWAS to Pearson & Manolio, 2008.
In the process of estimating allele frequencies from DNA pools we introduce errors, and these must be taken into consideration to plan an adequately powered experiment or to appropriately calculate association statistics [15, 16]. The most important consideration here is the pooling variance: the variance in the errors arising from estimating allele frequency from a DNA pool. Pooling variance is the sum of many sources of variation, including, in particular, array variance and pool-construction variance. Array variance can be attributed to those errors arising from estimating allele frequency from a DNA pool on a SNP array [17, 18]. Pool-construction variance can be attributed to those errors arising from the physical creation of a DNA pool. As pooling variance increases, the ability of a pool-based GWAS to detect odds ratios similar to those detectable by conventional GWAS decreases. In this report we assume pooling variance is the sum of array variance and pool-construction variance and attempt to determine which makes the greater contribution to the pooling variance. This is relevant to determining how best to design a pool-based GWAS and how to allocate resources: for example, replicate arrays can be used to reduce array variance and/or pools can be constructed in replicate to control pool-construction variance.
Here we partition and estimate variance components using the approach described by MacGregor, which examines variation in allele frequency measurements between and within DNA pools. Briefly, within-pool variation is that observed between two arrays used to allelotype the same DNA pool (i.e. replicate arrays), and is an estimate of array variance. Between-pool variation is that observed between two arrays used to allelotype two different DNA pools, and is an estimate of pooling variance. Estimates of array variance and pooling variance are used to calculate pool-construction variance by subtraction. Using this approach in an analysis of two DNA pools allelotyped on twelve Affymetrix Genechip HindIII arrays (6 arrays per pool), MacGregor found that approximately 87.5% of pooling variation could be attributed to the arrays, leaving 12.5% to pool construction. It was noted, however, that more data sets would be necessary to determine the variability in these estimates. Here we inspect 27 DNA pools allelotyped on a total of 128 Illumina arrays, including the Human1M Single (1M-Single), Human1M Duo (1M-Duo), and HumanHap660 Quad (660-Quad) arrays, allowing us to better address the question of what values array variance and pool-construction variance are likely to take. In addition, we perform our analysis on normalized array data and raw array data to examine how normalization affects pooling variance estimates.
In the first part of this study we establish values for array variance and pool-construction variance. In the second part, we use these estimates to calculate the effective sample size (ESS) of a DNA pool (where ESS is the equivalent number of samples that would need to be individually genotyped to give a similar result). We also present a simple online tool, PoolingPlanner, which uses our empirical variance estimates as default values to calculate the ESS of a DNA pool given a range of replicate array values (available at http://www.kchew.ca/PoolingPlanner/). PoolingPlanner also accepts user-supplied values for variance estimates. ESS can then be used in one of the available power calculators, such as CaTS or Quanto, to perform pool-adjusted power calculations. PoolingPlanner is intended to help researchers quickly calculate the loss of power associated with a particular pooling experiment, which is a first step in making an informed decision on whether a pool-based GWAS is worth pursuing.
Our analysis is based on 27 DNA pools ranging in size from 74 to 446 individual samples. These were allelotyped on a collective total of 128 Illumina beadarrays: 24 1M-Single, 32 1M-Duo, and 72 660-Quad. Our dataset comprises four batches of genotyping (details given in Additional File 1, Table S1), which correspond to four ongoing pool-based GWAS that have not yet been published. Each of these studies was approved by the joint Clinical Research Ethics Board of the British Columbia Cancer Agency and the University of British Columbia. All subjects gave written informed consent.
Genomic DNA was extracted from peripheral venous blood collected between 2001 and 2008 by different laboratories using different methods. DNA samples were diluted to 50-100 ng/µL and then quantified in duplicate by fluorometry using PicoGreen™ (Molecular Probes, Eugene, OR, US). Pools were constructed by combining 200 ng of each sample DNA by manual pipetting. Pools were assayed (allelotyped) at the Centre for Applied Genomics at Sick Children's Hospital in Toronto.
where G_i and R_i are the green and red fluorescence intensities for the ith bead assaying a given SNP. The two colours correspond to the two alleles of the SNP, and n is the number of beads assaying a given SNP, typically 16-18. Illumina beadarrays are manufactured such that there are multiple strips on each array, and our preliminary analysis revealed that unique groups of SNPs are consistently on only a subset of strips. From our previous experience, and that of others, it was known that the average relative intensity of the red and green channels could differ dramatically between strips and between arrays. To prevent these manufacturing and/or assaying properties from biasing allele frequency estimation, a simple normalization was performed. Each array was normalized on a strip-by-strip basis by adjusting the red channel intensity to give a mean strip-wide allele frequency estimate of 0.5. To examine the effect of this normalization on the variance terms estimated, the analyses presented in this paper are performed on both normalized and raw Illumina array data.
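The strip-by-strip normalization can be illustrated with a minimal sketch. It assumes, purely for illustration, that a SNP's allele frequency is estimated as the mean over beads of R/(R + G), and that the red channel is adjusted by a single multiplicative factor per strip, found here by bisection. The choice of numerator channel, the multiplicative form of the adjustment, the search method, and all function and variable names are assumptions of this sketch, not a description of the authors' implementation. Intensities are assumed to be positive.

```python
def snp_allele_frequency(beads, red_scale=1.0):
    """Mean over beads of c*R / (c*R + G); beads is a list of (green, red) pairs."""
    ratios = [(red_scale * r) / (red_scale * r + g) for g, r in beads]
    return sum(ratios) / len(ratios)


def strip_mean_frequency(strip, red_scale):
    """Average the SNP-level estimates over every SNP on one strip."""
    freqs = [snp_allele_frequency(beads, red_scale) for beads in strip.values()]
    return sum(freqs) / len(freqs)


def red_scale_for_strip(strip, target=0.5, iterations=60):
    """Bisection on a log scale for the red-channel factor that brings
    the strip-wide mean allele frequency to `target`."""
    low, high = 1e-3, 1e3  # assumed search bounds for the scaling factor
    for _ in range(iterations):
        mid = (low * high) ** 0.5  # geometric midpoint
        if strip_mean_frequency(strip, mid) < target:
            low = mid   # red channel still too dim
        else:
            high = mid  # red channel too bright
    return (low * high) ** 0.5


# Hypothetical strip in which the red channel is systematically brighter.
example_strip = {
    "snp_a": [(100.0, 300.0), (120.0, 310.0)],
    "snp_b": [(200.0, 400.0), (210.0, 390.0)],
}
scale = red_scale_for_strip(example_strip)
print(scale, strip_mean_frequency(example_strip, scale))
```

Because c*R/(c*R + G) increases monotonically with c when both intensities are positive, the bisection converges to a single factor per strip.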
Var(earray) is assumed constant for all SNPs. If more than two replicate arrays are used to allelotype a given DNA pool, multiple array comparisons are possible, and the best estimate of var(earray) is the average of all possible pairings.
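A minimal sketch of this pairwise averaging follows. It assumes that the errors on two replicate arrays are independent with a common variance, so that the variance across SNPs of the difference in allele frequency estimates equals twice var(earray). That relationship and the function name are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, variance


def array_variance(replicates):
    """Estimate var(e_array) from replicate arrays of one DNA pool.

    replicates: list of equal-length lists of allele frequency estimates,
    one list per array, with SNPs in the same order on every array.
    """
    estimates = []
    for a, b in combinations(replicates, 2):
        diffs = [x - y for x, y in zip(a, b)]
        # Assumption: var(difference) = 2 * var(e_array) for independent errors.
        estimates.append(variance(diffs) / 2.0)
    # Average over all possible pairings of replicate arrays.
    return mean(estimates)


# Hypothetical allele frequency estimates for four SNPs on three replicate arrays.
rep1 = [0.10, 0.50, 0.90, 0.30]
rep2 = [0.12, 0.48, 0.93, 0.29]
rep3 = [0.11, 0.51, 0.88, 0.31]
print(array_variance([rep1, rep2, rep3]))
```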
where "pooling-1" is used to indicate that this estimate of pooling variance is based on the comparison of arrays that allelotype two replicate DNA pools. As before, if more than two replicate arrays are used to allelotype a given DNA pool, multiple array comparisons are possible, and the best estimate of var(epooling-1) is the average of all possible pairings.
The random binomial sampling variance term accounts for the additional component of variation arising from the comparison of non-identical pools. It is assumed that the two DNA pools are constructed from samples drawn from the same population, and although in fact it is often a case and control being compared (where we specifically look for differences in allele frequency), for most SNPs on an array this is a valid assumption.
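The between-pool comparison and the subtraction step can be sketched as follows. The sketch assumes that the variance of the difference between two arrays on different pools equals twice the pooling variance plus the binomial sampling variance of each pool, with the sampling terms set to zero when the two pools are replicates built from the same individuals. This decomposition and the function names are assumptions for illustration. The sampling variance formula p(1-p)/2N and the truncation of negative pool-construction estimates at zero follow the text.

```python
from statistics import variance


def binomial_sampling_variance(p, n_individuals):
    """Vs = p(1 - p) / 2N for a pool of N diploid individuals."""
    return p * (1.0 - p) / (2.0 * n_individuals)


def pooling_variance(freqs_pool_a, freqs_pool_b,
                     sampling_variance_a=0.0, sampling_variance_b=0.0):
    """Estimate var(e_pooling) from two arrays run on two different pools.

    Leave both sampling variances at zero for replicate pools; supply them
    when the pools are built from different individuals.
    """
    diffs = [x - y for x, y in zip(freqs_pool_a, freqs_pool_b)]
    # Assumption: var(difference) = 2 * var(e_pooling) + Vs_a + Vs_b.
    return (variance(diffs) - sampling_variance_a - sampling_variance_b) / 2.0


def construction_variance(var_pooling, var_array):
    """Pool-construction variance by subtraction; negative values are set to zero."""
    return max(0.0, var_pooling - var_array)


# Hypothetical inputs: a pooling variance of 4e-4 and an array variance of 3e-4.
print(construction_variance(4e-4, 3e-4))
print(binomial_sampling_variance(p=0.29, n_individuals=200))
```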
A number of assumptions are made in this analysis. We assume that the array variance is comparable across the DNA pools in an experiment, and that the average array variance is the best estimate. For arrays with larger than average array variance, perhaps caused by greater variation in PCR amplification steps and/or measurement of allele frequency (detection of red and green fluorescence), array variance will be underestimated; for arrays with smaller than average array variance, it will be overestimated. It is known that SNPs with smaller minor allele frequencies are estimated with a greater margin of error, i.e. var(earray) is not constant for all SNPs. For SNPs with a small minor allele frequency, average array variance will underestimate the array variance. We also assume that the pooling variance is constant across all SNPs, and that unequal amplification and/or hybridization of alleles (A or B) will have a negligible effect on results. Because our analysis is based upon contrasting array data from two DNA pools, the effects of unequal hybridization should largely cancel out [15, 18].
In one equation, the RSS of a DNA pool is expressed as the ratio of the effective sample size (N*) to the actual sample size (N). In the other, it is expressed as the fraction of the total variance, (Vs + var(epooling)), explained by the binomial sampling variance, Vs. Vs is calculated as p(1-p)/2N, where p is the average minor allele frequency on the array, and N is the number of individuals contributing to the DNA pool. If DNA pools have been constructed in replicate we let var(epooling) = var(epooling-1); otherwise we let var(epooling) = var(epooling-2). The two equations for RSS can then be equated and solved for N*. It is worth noting that because our calculation of RSS relies on our empirical estimates of var(epooling) (Equation 2), estimates which are based on contrasting allele frequencies in two DNA pools, the effects of unequal hybridization, which would typically thwart a direct comparison of a pooling-based and conventional genotyping experiment, cancel out [15, 18].
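Written out, the two expressions for RSS described above and their solution for N* take the following form. This is a reconstruction from the verbal definitions in the text (RSS as the ratio of N* to N, and as the fraction of total variance explained by Vs), so the notation should be read as an assumption.

```latex
\begin{align}
\mathrm{RSS} &= \frac{N^{*}}{N}, \\
\mathrm{RSS} &= \frac{V_s}{V_s + \mathrm{var}(e_{pooling})},
  \qquad V_s = \frac{p(1-p)}{2N}, \\
N^{*} &= N \cdot \frac{V_s}{V_s + \mathrm{var}(e_{pooling})}
       = \frac{p(1-p)}{2\left(V_s + \mathrm{var}(e_{pooling})\right)} .
\end{align}
```

When var(e_pooling) is zero, N* equals N; as var(e_pooling) grows relative to V_s, N* shrinks.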
Replicate arrays can be used to reduce var(epooling) by a factor of 1/k, where k is the number of replicate arrays. In making var(epooling) smaller, the RSS and N* become larger. Effective sample size can then be used with one of the available power calculators, for example CaTS or Quanto, to perform pool-adjusted power calculations. PoolingPlanner is intended to help first-time users plan a DNA pooling experiment, and our empirical estimates of array variance and pool-construction variance are supplied as the default settings for the program for this reason. Users with their own variance estimates can provide these to the program as well. PoolingPlanner is available at http://www.kchew.ca/PoolingPlanner/.
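The effect of replicate arrays on effective sample size can be explored with a short sketch. This is not the PoolingPlanner implementation; the function name and the input values are illustrative. The sketch applies the 1/k reduction to the array component only and leaves the pool-construction component unchanged, which is an assumption; passing var_construction=0 applies the reduction to the whole pooling variance, as the sentence above is worded.

```python
def effective_sample_size(n, p, var_array, var_construction, k=1):
    """Effective sample size N* of a pool of n individuals.

    n: number of individuals in the pool
    p: average minor allele frequency
    var_array, var_construction: variance components of the pooling variance
    k: number of replicate arrays per pool
    """
    vs = p * (1.0 - p) / (2.0 * n)                # binomial sampling variance
    var_pooling = var_construction + var_array / k  # assumption: only array error averages down
    rss = vs / (vs + var_pooling)                 # fraction of variance due to sampling
    return n * rss


# Illustrative values: a pool of 200 individuals, p = 0.29, array variance near
# the normalized estimates reported here, and a hypothetical construction variance.
for k in (1, 2, 4, 8, 12):
    ess = effective_sample_size(n=200, p=0.29, var_array=3e-4,
                                var_construction=1e-4, k=k)
    print(k, round(ess, 1))
```

The resulting N* can then be entered as the sample size in a power calculator.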
In our analyses we encountered beads with negative intensity values in the red, green, or both channels. The number of negative beads varied by strip and typically affected 1-10% of beads, a pattern consistently seen across all arrays. This can occur due to local background intensity removal at the point of image processing. These beads were removed from our variance calculations. Furthermore, beads with zero in both the red and green channels were considered failed beads and also dropped from our analysis. There were typically fewer than 100 of these per strip. Finally, SNPs having fewer than four bead observations were excluded, on the rationale that their allele frequencies would be poorly estimated.
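The three filtering rules above can be sketched as follows. The data layout (a dictionary mapping SNP identifiers to lists of (green, red) intensity pairs) and the function names are assumptions for illustration.

```python
def filter_beads(beads):
    """Drop beads with a negative intensity in either channel, and failed
    beads with zero intensity in both channels.

    beads: list of (green, red) intensity pairs for one SNP.
    """
    kept = []
    for green, red in beads:
        if green < 0 or red < 0:
            continue  # negative intensity in the red, green, or both channels
        if green == 0 and red == 0:
            continue  # failed bead
        kept.append((green, red))
    return kept


def filter_snps(snp_beads, min_beads=4):
    """Keep only SNPs with at least `min_beads` usable bead observations."""
    filtered = {}
    for snp, beads in snp_beads.items():
        kept = filter_beads(beads)
        if len(kept) >= min_beads:
            filtered[snp] = kept
    return filtered


# Hypothetical bead-level data for two SNPs.
raw = {
    "snp_a": [(10, 20), (-1, 5), (0, 0), (3, 4), (5, 6), (7, 8)],
    "snp_b": [(1, 2), (3, 4), (-2, -3)],
}
print(filter_snps(raw))
```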
Array Variance or var(earray): Type A comparisons
Estimates of array variance, var(earray), for three Illumina array types for normalized and raw data. Each row gives one estimate (range) per array type.
Normalized data, var(earray) (range): 3.8 × 10^-4 (2.2 × 10^-4 to 6.6 × 10^-4) | 3.2 × 10^-4 (1.6 × 10^-4 to 6.3 × 10^-4) | 3.3 × 10^-4 (2.5 × 10^-4 to 4.9 × 10^-4)
Raw data, var(earray) (range): 2.9 × 10^-3 (3.0 × 10^-4 to 9.2 × 10^-3) | 9.0 × 10^-4 (1.7 × 10^-4 to 4.3 × 10^-3) | 2.7 × 10^-3 (2.0 × 10^-3 to 3.0 × 10^-3)
Number of pools:
Number of comparisons, var(earray) (1):
Number of arrays: 72 (6 or 12/pool)
Pooling Variance or var(epooling): Type B and C comparisons
Using raw data, estimates of var(epooling-1) were approximately 8-fold higher than for the normalized data. Estimates of var(econstruction-1) tended to be higher as well, averaging ~20% of the pooling variance. Var(epooling-2) estimates followed the same pattern: larger estimates of pooling variance and pool-construction variance (data not shown).
Impact of replicate arrays on effective sample size (N*) and minimum detectable odds ratio (MDOR) in pooling-GWAS. Columns: arrays per pool; MDOR at 80% (p = 0.29); MDOR at 80% (p = 0.10).
In the first part of this study we set out to establish a range of experimentally observed values for array variance on Illumina's SNP-genotyping beadarrays. At the same time, we wanted to establish a range of values for pool construction variance. In the second part, we used these estimates to calculate the effective sample size of a DNA pool given a range of replicate array values, and provide an online tool to allow readers to do the same.
At the time of our analysis we were aware of only one report that estimated array variance (var(earray) = 1.1 × 10^-4) for an Illumina HumanHap300 beadarray. Illumina has since released higher density arrays (>1 million SNPs per array), and we wanted to determine if increased SNP density negatively impacted array variance. Overall, we found this was not the case. All of the Illumina array types examined here (660-Quad, 1M-Single, 1M-Duo) had very similar var(earray) estimates, centering around 3 × 10^-4 for our normalized data, which is largely in keeping with the HumanHap300 result. We expect this result would extend to the HumanOmni1-Quad array, although it was not analyzed here. We found that the normalization procedure we used reduced the array variance 2- to 8-fold, and a newly reported normalization algorithm suggests that array variance can be reduced even further. Reduced array variance should mean more precise estimates of allele frequency, which should further minimize the loss of power associated with using the DNA pooling strategy.
The Illumina arrays analyzed here yielded var(earray) estimates ~10-fold smaller than those of the Affymetrix HindIII 50K arrays (var(earray) = 1.26 × 10^-3) analyzed by MacGregor. A similar result was noted when Affymetrix arrays were compared to Illumina HumanHap300 arrays. In part, this may be explained by differences in the manufacturing of the arrays. MacGregor et al. report that pooling errors appear to be highly related to the number of probes used to estimate SNP allele frequency. While 10 probe pairs are assigned to each SNP on the Affymetrix HindIII 50K arrays, on average 16-18 beads are used on the Illumina arrays. Further, on Illumina arrays beads are randomly dispersed on a slide, while on Affymetrix arrays probes are fixed in a given location, making the latter more susceptible to location-specific technical errors. As the array variance gets smaller (i.e. when using Illumina arrays), we expect the pool-construction variance to account for a greater proportion of the pooling variance.
Our estimates of var(econstruction) spanned 27 DNA pools, ranging in size from 74 to 446 individual samples, allowing us to sample a range of possible pool-construction variances. First, in contrast to a previous report, we did not observe a relationship between pool size and pool-construction variance. We did, however, observe batch effects. For the 1M-Duo arrays, which were processed in two batches on different dates, we observed very different estimates of pooling variance and pool-construction variance (see Figure 6). Most of our estimates of pool-construction variance were based on values from Type C comparisons, and for these var(econstruction) usually fell between 20% and 40% of the pooling variance. When calculations were based on the comparison of replicate DNA pools (Type B comparisons, 1M-Single arrays only) our estimates were smaller, on average 7.5% of the pooling variance. There are several possible reasons for this. The adjustment for binomial sampling variance may not fully account for the variance arising from sampling, leaving variance that is then attributed to pool construction in the Type C comparisons. As well, some estimates of pool-construction variance were negative, and these were set to zero, which would lead to overestimation of pool-construction variance. We conclude that relative to var(earray), var(econstruction) is of less importance; however, our results suggest pool construction may account for more of the pooling variance than previously estimated. MacGregor attributed 12.5% of the pooling variance to pool construction when using Affymetrix HindIII 50K arrays. On average we attribute 30% of pooling variance to pool construction when using Illumina arrays. This difference is what might be expected given the smaller var(earray) for Illumina arrays.
Further reductions in array variance, for example, through improved normalization of array data, have the potential to further shift the proportion of an experiment's pooling variance that is attributed to pool-construction errors.
With respect to the design of pool-based experiments when using Illumina arrays, our partitioning of the pooling variance still suggests that constructing fewer (large) pools while using more replicate arrays (i.e. targeting array variance) is the most effective way to reduce pooling variance and conduct the most efficient pool-based GWAS. Further, for an equivalent pool-based experiment using Affymetrix arrays in place of Illumina arrays, more array replicates will be needed (~10-fold more). As the proportion of array variance to pool-construction variance approaches 50:50, strategies to reduce pool-construction variance become more important.
For one of our experiments, 1M-Duo Batch 2, we observed unusually high estimates of pool-construction variance and low estimates of array variance (see Figure 6). In this experiment, pool replicates were allelotyped on the same physical array (which holds two samples). Subsequently, we noticed that the array variance for replicates on the same chip was much smaller than the variance for replicates on different chips. Overall, this led to the array variance being underestimated relative to the pooling variance, leaving more variance to be accounted for by pool construction. In addition, the between-chip variance for these arrays was much higher than observed in the 1M-Duo Batch 1 dataset, which led to large estimates of pooling and pool-construction variance overall. Ultimately, this was traced back to unusually high red channel intensity on some arrays, despite normalization, which biased allele frequency estimates array-wide. Clearly this will influence any downstream association analysis, so in this case, our analysis of variance served to flag a serious problem in the array data. It also highlighted the need to randomize DNA pool replicates among arrays that carry more than one sample, and to randomize by location on the array, particularly in the case of the 660-Quad and HumanOmni1-Quad arrays, which carry four samples.
The differences between 1M-Duo Batch 1 and 2 data were significant for normalized data, but not raw data. On one hand, it may be that greater noise associated with the raw data prevented differences in array variance and pool-construction variance from being significant. On the other, it is possible that the normalization procedure itself exacerbated technical artifacts only present on some arrays, leading to the observed differences in normalized data. This can occur if technical artifacts violate the assumptions of the normalization.
We have provided empirical estimates of var(earray) and var(econstruction) for a range of DNA pool sizes. We have also presented PoolingPlanner, a simple program to help translate these variances into their effect on sample size, information that can then be used in a power calculator to conduct pool-adjusted calculations. PoolingPlanner may be helpful in quickly assessing theoretical best and worst-case scenarios for a DNA pooling GWAS. With this information the user can then make a more informed decision about how to carry out their pooling experiment to optimally balance cost with loss of power.
Acknowledgements and Funding
We thank Dr. John Spinelli, Senior Biostatistician, for very useful discussion and critical advice during the preparation of this manuscript.
This work was supported in part by OvCaRe, through the BC Cancer Foundation [NSA10112 to A.B-W.]; and Canadian Institutes for Health Research [BMA-63184, IG1-93476 to A.B-W]. A.B-W. is a Senior Scholar of the Michael Smith Foundation for Health Research [CI-SSH-00947(06-1)]. M.E. was supported by studentships from Natural Sciences and Engineering Research Council of Canada and the University of British Columbia [17G44444].
References
4. Pearson JV, Huentelman MJ, Halperin RF, Tembe WD, Melquist S, Homer N, Brun M, Szelinger S, Coon KD, Zismann VL, Webster JA, Beach T, Sando SB, Aasly JO, Heun R, Jessen F, Kolsch H, Tsolaki M, Daniilidou M, Reiman EM, Papassotiropoulos A, Hutton ML, Stephan DA, Craig DW: Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am J Hum Genet. 2007, 80 (1): 126-139.
6. Skibola CF, Bracci PM, Halperin E, Conde L, Craig DW, Agana L, Iyadurai K, Becker N, Brooks-Wilson A, Curry JD, Spinelli JJ, Holly EA, Riby J, Zhang L, Nieters A, Smith MT, Brown KM: Genetic variants at 6p21.33 are associated with susceptibility to follicular lymphoma. Nat Genet. 2009, 41 (8): 873-875.
7. Schrauwen I, Ealy M, Huentelman MJ, Thys M, Homer N, Vanderstraeten K, Fransen E, Corneveaux JJ, Craig DW, Claustres M, Cremers CW, Dhooge I, Van de Heyning P, Vincent R, Offeciers E, Smith RJ, Van Camp G: A genome-wide analysis identifies genetic variants in the RELN gene associated with otosclerosis. Am J Hum Genet. 2009, 84 (3): 328-338.
8. Comabella M, Craig DW, Camina-Tato M, Morcillo C, Lopez C, Navarro A, Rio J, BiomarkerMS Study Group, Montalban X, Martin R: Identification of a novel risk locus for multiple sclerosis at 13q31.3 by a pooled genome-wide scan of 500,000 single nucleotide polymorphisms. PLoS One. 2008, 3 (10): e3490.
9. Abraham R, Moskvina V, Sims R, Hollingworth P, Morgan A, Georgieva L, Dowzell K, Cichon S, Hillmer AM, O'Donovan MC, Williams J, Owen MJ, Kirov G: A genome-wide association study for late-onset Alzheimer's disease using DNA pooling. BMC Med Genomics. 2008, 1: 44.
10. Brown KM, Macgregor S, Montgomery GW, Craig DW, Zhao ZZ, Iyadurai K, Henders AK, Homer N, Campbell MJ, Stark M, Thomas S, Schmid H, Holland EA, Gillanders EM, Duffy DL, Maskiell JA, Jetann J, Ferguson M, Stephan DA, Cust AE, Whiteman D, Green A, Olsson H, Puig S, Ghiorzo P, Hansson J, Demenais F, Goldstein AM, Gruis NA, Elder DE, Bishop JN, Kefford RF, Giles GG, Armstrong BK, Aitken JF, Hopper JL, Martin NG, Trent JM, Mann GJ, Hayward NK: Common sequence variants on 20q11.22 confer melanoma susceptibility. Nat Genet. 2008, 40 (7): 838-840.
11. Capon F, Bijlmakers MJ, Wolf N, Quaranta M, Huffmeier U, Allen M, Timms K, Abkevich V, Gutin A, Smith R, Warren RB, Young HS, Worthington J, Burden AD, Griffiths CE, Hayday A, Nestle FO, Reis A, Lanchbury J, Barker JN, Trembath RC: Identification of ZNF313/RNF114 as a novel psoriasis susceptibility gene. Hum Mol Genet. 2008, 17 (13): 1938-1945.
21. Gene × Environment, Gene × Gene Interaction Home page. [http://hydra.usc.edu/gxe/]
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1755-8794/4/81/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.