Interpretation of custom designed Illumina genotype cluster plots for targeted association studies and next-generation sequence validation
High-throughput custom designed genotyping arrays are a valuable resource for biologically focused research studies and, increasingly, for validation of variation predicted by next-generation sequencing (NGS) technologies. We investigate the Illumina GoldenGate chemistry using custom designed VeraCode and Sentrix Array Matrix (SAM) assays for each of these applications, respectively. We highlight approaches for interpreting Illumina generated genotype cluster plots to maximise data inclusion and reduce genotyping errors.
We illustrate the dramatic effect of outliers on genotype calling and data interpretation, and suggest simple means to avoid genotyping errors. Furthermore, we present this platform as a successful method for two-cluster rare or non-autosomal variant calling. The ability of high-throughput technologies to accurately call rare variants will become an essential feature for future association studies. Finally, we highlight an additional advantage of the Illumina GoldenGate chemistry: unusually segregated cluster plots can identify potential NGS generated sequencing error resulting from minimal coverage.
We demonstrate the importance of visually inspecting genotype cluster plots generated by the Illumina software and issue warnings regarding commonly accepted quality control parameters. In addition to suggesting applications to minimise data exclusion, we propose that the Illumina cluster plots may be helpful in identifying potential input sequence errors, which is particularly important for studies to validate NGS generated variation.
Keywords: Minimal Coverage; Single Nucleotide Polymorphism Array; Tasmanian Devil; Allele Specific Oligo; Single Nucleotide Polymorphism Assay
Commercially available genome-wide single nucleotide polymorphism (SNP) arrays and high-throughput "custom designed" genotyping of targeted variants are desirable for biologically focused research. The generation of next-generation sequencing (NGS) de novo and resequencing data is contributing to the increased desire for custom, high-throughput arrays for variant validation and determination of allele frequencies [1, 2]. For custom designed genotyping, assay reliability and productivity are arguably more crucial than for genome-wide association studies (GWAS). Variants are chosen to answer a specific question, and therefore variant selection is more thoughtful and less redundant.
The Illumina platform (Illumina Inc., San Diego, CA, USA) has proven reliable and efficient for a number of high-throughput genotyping applications using DNA extracted from several sources [3, 4, 5, 6, 7, 8, 9]. VeraCode and BeadArray technologies are used with the GoldenGate assay (Illumina) developed for simultaneous determination of between 96 and 384 (VeraCode) and 96 and 1,536 (BeadArray) variants. GoldenGate chemistry uses allele specific oligo (ASO) hybridisation coupled with fluorescently labelled universal amplification primers for genotype differentiation. In addition to genotype information, automated data analysis provides measures of SNP and sample quality control (QC). Previous studies have adopted guidelines suggested by Illumina for determining assay and sample reliability based on a quality score (GenCall score) calculated from the degree of separation between homozygote and heterozygote clusters [7, 10, 11]. A value between 0 and 1 is assigned to each call, with scores of ≥0.3 and ≥0.25 generally used as the cut-offs for the overall SNP and each individual genotype, respectively. In addition to this automated QC, investigators have applied statistical methods such as Hardy-Weinberg equilibrium (HWE) and tests for Mendelian consistency to filter reliable assays [8, 10, 12, 13, 14].
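The two cut-offs quoted above can be applied as a simple filter. The following is a minimal sketch only: the tuple layout, the `filter_calls` name and the idea of passing per-SNP scores as a dictionary are assumptions made for illustration, not part of the Illumina software or of the analysis in this study. Only the threshold values (0.3 and 0.25) come from the text.

```python
# Cut-offs quoted in the text; everything else here is a hypothetical layout.
SNP_CUTOFF = 0.3        # minimum overall score for a SNP assay
GENOTYPE_CUTOFF = 0.25  # minimum GenCall score for an individual genotype


def filter_calls(calls, snp_scores):
    """Split genotype calls into kept and dropped lists.

    calls      -- iterable of (sample, snp, genotype, gencall_score) tuples
    snp_scores -- dict mapping SNP name to its overall score (assumed input)
    """
    kept, dropped = [], []
    for call in calls:
        sample, snp, genotype, score = call
        # A call survives only if both the assay and the call pass.
        if snp_scores[snp] >= SNP_CUTOFF and score >= GENOTYPE_CUTOFF:
            kept.append(call)
        else:
            dropped.append(call)
    return kept, dropped
```

A filter of this kind is purely score-based, so it cannot distinguish a well-separated but wrongly clustered assay from a correct one.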
In this study, we assess genotype information generated on custom-designed Illumina 96-SNP VeraCode and BeadArray Sentrix Array Matrix (SAM) assays, for optimal generation of accurate genotyping and NGS generated variant validation. Application of strict analysis parameters may lead to valuable genetic information being lost. Alternatively, limited QC may result in the addition of incorrect genotype calls confounding data analysis. The SNP assays described in this study have the potential of being disregarded or falsely included during analysis based on the current QC selection criteria. We discuss the importance of visually inspecting each cluster plot, particularly for custom designed genotype arrays, and suggest strategies for interpreting data in plots that would normally be discarded during the QC process.
Sample and Genotyping Data Sets
Two sample data sets were used to assess and interpret Illumina-generated genotype cluster plots. The first was part of a gene/pathway targeted human prostate cancer association study. In brief, DNA was extracted from dried blood spots (Guthrie card) and genotyped using a custom 96-plex VeraCode SNP array for 768 samples (unpublished data). In the second study, DNA was extracted from ear clippings of Tasmanian devils (Sarcophilus harrisii). DNA was genotyped as part of a validation of sequence variation determined from minimal coverage (0.3× versus 0.5×) of NGS de novo sequencing data of two animals using the Roche/454 Titanium chemistry (unpublished data). Genotype analysis was performed using a custom 96-plex SAM array for 96 samples. For both studies, the GoldenGate genotyping procedure was performed as outlined by Illumina.
Data analysis and genotype confirmation
Genotype calls were generated automatically using the GenCall software version 3.1.3. Due to the potential for intra-plate inconsistencies (e.g. variation in fluorescent intensities), the eight VeraCode runs were assessed individually. The genotype cluster plots generated by each VeraCode and SAM assay were visually inspected for quality of calls. Plots that appeared to be "unusually" clustered (i.e. not following the typical "spread" predicted with regard to software generated HWE or distance between clusters (θ)) were further investigated by selecting samples for direct Sanger sequencing to confirm genotypes. Samples were sequenced using Big Dye Terminator v3.1 (AB, Foster City, CA, USA) chemistry according to the manufacturer's guidelines and run on an AB 3730 genetic analyzer (AB).
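To make the θ axis of a cluster plot concrete, the sketch below converts a pair of allele intensities into polar coordinates. It assumes the commonly described convention for Illumina cluster plots, in which normalised θ = (2/π)·arctan(Y/X) and R = X + Y for normalised A-allele (X) and B-allele (Y) intensities. That convention, and the function name `to_polar`, are assumptions here rather than details taken from this study.

```python
import math


def to_polar(x, y):
    """Return (theta, r) for normalised allele intensities x (A) and y (B).

    theta runs from 0 (pure A signal) to 1 (pure B signal), so the three
    genotype clusters are expected near 0, 0.5 and 1 under this convention.
    r is the summed intensity, a proxy for overall signal strength.
    """
    if x < 0 or y < 0:
        raise ValueError("intensities must be non-negative")
    theta = (2.0 / math.pi) * math.atan2(y, x)
    r = x + y
    return theta, r
```

Under this convention, an AA sample sits near θ = 0, a heterozygote near 0.5 and a BB sample near 1, so the distance between clusters is read along the θ axis.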
Effect of outliers on cluster formation (VeraCode)
Two-cluster autosomal and non-autosomal SNP calling (VeraCode)
Undefined cluster distribution in a verified sequence (VeraCode)
Undefined cluster distribution in a de novo NGS generated SNP array (SAM)
Large-scale custom-designed genotyping studies have been made feasible by the development of high-throughput technologies and improvements in automated genotype calling software. In this study we demonstrate how simple approaches to interpreting automatically generated Illumina cluster plots can be used to avoid spurious genetic associations and to optimise data inclusion and interpretation, particularly for custom designed target arrays.
Variation in the quality and quantity of DNA input is largely unavoidable in large-scale studies. To limit the effect that outlier DNA samples may have on genotype calls, it is important to visually inspect the genotype plots and exclude outliers, as demonstrated in this study (Figure 1). In addition to illustrating the importance of visual analysis, these examples (rs8081356 and rs10096900) emphasise the need to adopt QC criteria that extend beyond the GenCall score calculated by the Illumina analysis software. For both variants, the GenCall score decreased when the correct genotypes were applied (Figure 1B and 1D), and it is therefore not an optimal measure of assay reliability in such scenarios.
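Outliers in this study were identified by visual inspection. As a possible aid to that inspection, and not a description of the method used here, the sketch below flags samples whose summed intensity lies far from the plate median, using a modified z-score based on the median absolute deviation. The function name, the input layout and the 3.5 threshold are all assumptions chosen for illustration.

```python
from statistics import median


def flag_intensity_outliers(r_values, threshold=3.5):
    """Return the set of sample names with unusual summed intensity.

    r_values  -- dict mapping sample name to summed intensity (assumed input)
    threshold -- modified z-score above which a sample is flagged (assumed)
    """
    values = list(r_values.values())
    med = median(values)
    # Median absolute deviation: robust to the very outliers being sought.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return set()  # no spread, so nothing can be ranked as an outlier
    return {
        sample
        for sample, v in r_values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    }
```

Flagged samples would still need to be checked on the cluster plot itself before exclusion, since a low-intensity sample is not necessarily miscalled.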
Investigating the disease-causing potential of rare variants (which are often missed in larger scale GWAS) can prove problematic for genotyping arrays. In this study we found the Illumina VeraCode and SAM technologies to be highly reliable for rare variant and single allele (male sex chromosome) two-cluster calling. However, we also highlight potential QC problems that may arise for two-cluster genotyping. We observed that HWE was successful as a measure of assay reliability when interpreting autosomal rare variant miscalling, but was not suitable for single allele X-linked variants. For this application, visual inspection of cluster plots in combination with prior knowledge of the variant (e.g. location and population-specific allele frequency) would provide an adequate means of determining assay success.
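The contrast between the two cases can be sketched in code. For an autosomal variant, a standard one-degree-of-freedom chi-square test compares observed genotype counts with Hardy-Weinberg expectations. For an X-linked variant typed in males, who carry a single allele, that test does not apply, and a simpler sanity check is that no heterozygous calls appear. Both function names and input layouts below are hypothetical, and this is not the calculation performed by the Illumina software.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 d.f.) for deviation from HWE.

    Arguments are observed counts of the AA, AB and BB genotypes.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic assays).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def male_x_heterozygotes(calls):
    """Return male samples with a heterozygous call at an X-linked variant.

    calls -- dict mapping sample name to a two-letter genotype such as "AB".
    """
    return [sample for sample, gt in calls.items() if gt[0] != gt[1]]
```

For example, counts of 90 AA, 0 AB and 10 BB give a statistic of about 100, far above the 3.84 critical value at the 5% level, which is the pattern expected if a two-cluster plot has had its heterozygote cluster miscalled as a homozygote.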
A growing number of laboratories are using NGS technology to sequence as yet unclassified genomes, as well as to resequence human and human disease-associated genomes. From the predicted sequence variants generated by these projects, custom-designed arrays are being employed for validation and frequency determination. Although the optimal coverage required to distinguish true sequence variation is highly debatable, and dependent on the type of NGS platform used, high-throughput genotyping platforms allow for rapid, cost-effective validation at even minimal coverage. The SAM array discussed in this study was developed from long-read NGS information generated from minimal coverage of two samples of an as yet unsequenced species (unpublished data). In both the VeraCode assay, custom designed to genotype 96 confirmed human variants, and the SAM array, designed to verify newly identified variants from de novo generated NGS data, we observed several examples of individual clusters being partially segregated within the one genotype group (Figures 3 and 4). Although this cannot be attributed to miscalling on the VeraCode assays (Figure 3), the unusually grouped variants described on the SAM array (Figure 4) were due to incorrect oligo design as a result of misinterpretation of de novo generated NGS data. For variants TD102 and TD108, NGS results suggested the presence of an additional nucleotide immediately adjacent to the variant site, which was the same as the alternative allele of the putative variant. Direct Sanger sequencing confirmed that this additional nucleotide was not present and hence we observed varying binding capabilities of only one ASO, resulting in the fluorescent intensity of samples containing a mutant allele being greater than those without.
In addition to confirming the presence of a variant at these locations, the unusually clustered assays on the SAM array prompted closer inspection (visual and Sanger sequencing), which led to the identification of NGS-generated sequencing error. We therefore suggest that visual inspection and closer investigation of unusually clustered scatter plots may provide information that exceeds the initial goal of SNP validation and is a complementary tool for NGS data validation.
In this study we demonstrate approaches that improve the efficiency of analysing data generated with the Illumina GoldenGate chemistry, through logical interpretation of both rare and common genotyping data. We also present this platform as a successful tool for NGS variant validation, which is applicable even to limited sequence coverage.
Funding sources include Cancer Institute New South Wales (CINSW, VMH is a Fellow and EAT is a PhD Scholar), Australian Rotary Health (EAT is a PhD Scholar) and Allco Foundation donation, Sydney, Australia (to VMH). Additional infrastructure was provided by the Children's Cancer Institute Australia for Medical Research (CCIA) and University of New South Wales, Sydney, Australia. We acknowledge Prof. Graham Giles and Dr. Gianluca Severi (The Cancer Council of Victoria, Melbourne, Australia) for co-ordination of sample collections for the prostate cancer study, and Dr. Menna Jones (University of Tasmania, Hobart, Australia) for co-ordination of Tasmanian devil sample collection. We thank Ms Diana Lau (CCIA) for technical assistance.
- 1. Pavy N, Pelgas B, Beauseigle S, Blais S, Gagnon F, Gosselin I, Lamothe M, Isabel N, Bousquet J: Enhancing genetic mapping of complex genomes through the design of highly-multiplexed SNP arrays: application to the large and unsequenced genomes of white spruce and black spruce. BMC Genomics. 2008, 9: 21. 10.1186/1471-2164-9-21.
- 2. Van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS: SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods. 2008, 5 (3): 247-252. 10.1038/nmeth.1185.
- 3. Agalliu I, Schweitzer PA, Leanza SM, Burk RD, Rohan TE: Illumina DNA test panel-based genotyping of whole genome amplified-DNA extracted from hair samples: performance and agreement with genotyping results from genomic DNA from buccal cells. Clin Chem Lab Med. 2009, 47 (5): 516-522. 10.1515/CCLM.2009.106.
- 4. Paynter RA, Skibola DR, Skibola CF, Buffler PA, Wiemels JL, Smith MT: Accuracy of multiplexed Illumina platform-based single-nucleotide polymorphism genotyping compared between genomic and whole genome amplified DNA collected from multiple sources. Cancer Epidemiol Biomarkers Prev. 2006, 15 (12): 2533-2536. 10.1158/1055-9965.EPI-06-0219.
- 8. Cunningham JM, Sellers TA, Schildkraut JM, Fredericksen ZS, Vierkant RA, Kelemen LE, Gadre M, Phelan CM, Huang Y, Meyer JG: Performance of amplified DNA in an Illumina GoldenGate BeadArray assay. Cancer Epidemiol Biomarkers Prev. 2008, 17 (7): 1781-1789. 10.1158/1055-9965.EPI-07-2849.
- 15. Illumina, Inc. [http://www.illumina.com/]
- 16. National Center for Biotechnology Information (NCBI): dbSNP. [http://www.ncbi.nlm.nih.gov/SNP]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.