Background

Bipolar disorder (BD), also known as manic-depressive illness, is a chronic and devastating psychiatric condition, affecting 0.5-1.6% of the general population across their lifetime [1]. The frequency of hospitalization, psychological impairment, family devastation and suicidal behaviour make BD a major public health concern [2, 3], with an estimated total annual societal cost of at least 45 billion dollars in North America [4]. It is characterized by the recurrence of manic and hypomanic episodes. Although the majority of BD sufferers experience a significant reduction in symptoms between episodes of illness, approximately 60% develop chronic interpersonal and occupational impairment, with the result of untreated illness usually generating major disability [1]. Comorbidity with other psychiatric illness such as alcohol or substance abuse may also exacerbate the long-term course of BD [1].

Family, twin and adoption studies provide strong evidence of the genetic predisposition to BD [5, 6], with heritability estimates typically in the region of 80%. The recurrence risk in siblings of a BD proband is ~8% (corresponding to a sibling relative risk compared with the general population of λS ~ 8), and for monozygotic (identical) co-twins the risk is ~ 60%.

Identification of susceptibility genes for BD is the first step on a path toward improved understanding of the pathogenesis of mood disorders, with much to offer including (a) more effective and better targeted treatments, (b) earlier recognition of individuals at risk, and (c) improved understanding of environmental factors [7–9]. Linkage study findings support the view that no variation within a single gene can explain the majority of cases of BD, and demonstrate features that are typical in studies of complex genetic disorders, such as: (1) no finding replicates in all data sets, (2) modest levels of statistical significance and estimated effect sizes, and (3) chromosomal regions of linkage are typically broad (often > 20–30 cM).

Recent advances in the density and speed of high-throughput single nucleotide polymorphism (SNP) genotyping, along with a reduction in costs, have provided researchers with an excellent opportunity to dissect the genetics of BD, under the hypothesis that common variants contribute to the disease. There has since been a wave of large genome-wide association studies (GWAS) of BD that have used high-density SNP microarrays to look for common shared genotypes and haplotypes. Although such studies seem to suffer from many of the same problems as the family-based whole genome scans performed using microsatellite markers, i.e. failure to replicate across different sample sets and a realization that much larger sample sizes are necessary, the use of standard SNPs and genotyping methodologies has allowed the pooling of data from large patient cohorts, and this has led to some exciting findings of possible susceptibility loci and genes such as DGKH, CACNA1C and ANK3 [10] (reviewed in Lee et al., 2012).

In the current study, we utilized an alternative and, we believe, more efficient strategy, that used genotype data from more than 950 BD cases and 950 psychiatrically screened controls collected from two sites with identical ascertainment criteria and assessment methods - one in Canada, the other in the UK. As a source of independent validation, we have also analysed genome-wide genotype data from a Canadian cohort of small families.

Methods

Subject recruitment

Of the total 1922 samples, 871 were from Toronto, comprising 431 cases (160 males and 271 females) and 440 controls (176 males and 264 females), and 1051 were from the UK, comprising 538 cases (180 males and 358 females) and 513 controls (192 males and 321 females). A breakdown of mean and median age at interview, age of onset (AOO), diagnostic subtype (BD I versus BD II), presence of psychotic symptoms, suicide attempt and family history of psychiatric disorders has been provided previously for both the Toronto and UK cohorts [11]. The family cohort comprised 229 Toronto parent-offspring trio families, including 215 families with a BD proband and both parents, and 14 families with a BD proband and just one parent. Demographic information and ascertainment criteria for the family cohort have been reported previously [12].

At the CAMH site in Toronto, BD individuals and unrelated healthy controls matched for age, gender and ethnicity were recruited. Inclusion criteria for patients were: a) diagnosed with DSM-IV/ICD 10 BD I or II; b) 18 years old or over; c) Caucasian, of Northern and Western European origin, and three out of four grandparents also N.W. European Caucasian. Exclusion criteria were: a) use of intravenous drugs; b) evidence of mental retardation; c) related to an individual already in the study; d) manias that only ever occurred in relation to or as a result of alcohol or substance abuse or dependence and/or medical illness; e) any manias as a result of a non-psychotropic substance. In this study, the SCAN interview (Schedule for Clinical Assessments in Neuropsychiatry) was used. SCAN was developed in the framework of the World Health Organisation (WHO) and the National Institutes of Health (NIH) Joint Project on Diagnosis and Classification of Mental Disorders, Alcohol and Drug Related Problems [13]. Details on SCAN are available at http://apps.who.int/iris/handle/10665/40356.

Using both SCAN and case note review, each case was assigned DSM-IV and ICD 10 diagnoses by two independent team members with extensive diagnostic experience according to lifetime consensus best-estimate diagnosis [14]. Lifetime occurrence of psychiatric symptoms was also recorded using the OPCRIT checklist modified for use with mood disorders [15].

Similar methods and criteria were also used to collect a sample of 538 BD cases and 513 controls in London at the Institute of Psychiatry [16] (as described in Gaysina et al., 2010).

Our third sample is an independent BD cohort of 229 parent/affected offspring trio families also collected in the Toronto area. Methods included recruitment from hospital clinics and through advertising, SCID-I interviews, and best estimate consensus diagnosis [17, 18].

Both studies were approved by local Research Ethics Committees (the CAMH Research Ethics Board (REB) in Toronto, and the College Research Ethics Committee (CREC) at King’s College London), and informed written consent was obtained from all participants.

Genotyping

Genome-wide genotyping was performed for the Toronto and London case/control cohorts using the Illumina Sentrix Human Hap550 BeadChip (Illumina Inc., San Diego, CA, USA). Data were extracted by the Illumina® Beadstudio software from data files created by the Illumina BeadArray reader. Most samples were genotyped by Illumina Inc. (San Diego, CA, USA); however, 280 samples (97 cases and 183 controls) from the Toronto cohort were genotyped at the Genome Quebec facility. For the Toronto parent-offspring trio family cohort, Affymetrix 5.0 arrays (Affymetrix, Santa Clara, CA, USA) were used, and genotyping was performed by the London Regional Genomics Centre (London, Ontario).

Sample and SNP quality control

After genotyping, the discovery cohort samples were subject to a battery of quality control (QC) tests. Reported and genetic gender were compared using X-chromosome linked SNPs. Relatedness between samples, sample contamination, mis-identification and duplication were tested using genome-wide identity-by-descent (IBD) estimation; inconsistent samples were dropped from the analysis. Separate QC was applied to the validation cohort of 229 Toronto parent-offspring trio families.

SNPs were subject to QC before analysis. Samples and markers with call rates below 95% were excluded from analysis. We removed SNPs with minor allele frequencies below 1%. To minimize genotyping errors, we excluded SNPs with p-value < 1E-5 for deviation from Hardy-Weinberg equilibrium (HWE) in control samples. PLINK software was used for the quality control steps described above [19].
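The SNP-level filters above can be illustrated with a minimal Python sketch. This is not the PLINK implementation: PLINK's HWE filter uses an exact test, whereas the sketch substitutes the simpler 1-df chi-square approximation, and all function names and genotype counts here are hypothetical.

```python
import math

def minor_allele_freq(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)

def hwe_chi2_p(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from HWE.

    Assumption: an approximation for illustration; PLINK uses an exact test.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(all_counts, n_samples, control_counts,
              min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-5):
    """Apply the three SNP filters described in the text.

    all_counts / control_counts: (AA, AB, BB) counts of called genotypes.
    n_samples: total samples attempted, including missing calls.
    """
    call_rate = sum(all_counts) / n_samples
    return (call_rate >= min_call_rate
            and minor_allele_freq(*all_counts) >= min_maf
            and hwe_chi2_p(*control_counts) >= min_hwe_p)
```

A SNP with no heterozygotes among 100 controls, for example, fails the HWE filter even though its call rate and allele frequency are acceptable.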

Population stratification

Principal component analysis was conducted with EIGENSTRAT [20] on the discovery cohort with SNPs selected after QC filtering. To ensure the most homogeneous groups for association analysis, we excluded subjects identified as outliers by EIGENSTRAT [20]. Principal components (PCs) were selected based on analysis of the scree plot. For the genetic association analysis, the selected principal components were included as covariates in the logistic regression model to correct for population stratification. We did not apply principal component analysis to the validation cohort since family-based association tests are robust against population substructure [21].

Genotype imputation

Since the discovery and validation datasets were genotyped on different GWAS SNP platforms, and the validation dataset has a smaller number of SNPs, genotypes of the SNPs that are not in the validation cohort were imputed. Beagle V3.3.1 [22] was used for the trio family imputation using a trio reference panel (HapMap3 Phasing Data: ftp://ftp.ncbi.nlm.nih.gov/hapmap//phasing/2009-02_phaseIII/HapMap3_r2/CEU/TRIOS/ [23]), as this has better accuracy than imputation using a phased reference panel [22]. Individual genotypes with probability less than 0.90 were not included. Hidden Markov models (HMMs) were used for the imputation [22].

Association analysis

The association analyses were first applied to the discovery cohort, with only autosomal markers tested for association. Logistic regression under an additive genetic model was used for the primary analyses; dominant and recessive genetic models were also explored as sensitivity analyses. Odds ratios (OR) and 95% confidence intervals (CI) were estimated for the cases compared to the control group. The association, adjusting for principal components from the EIGENSTRAT analysis, was tested using multivariate logistic regression (SAS v9.2, Cary, NC). Association analysis on the validation cohort of trio families was performed using the transmission disequilibrium test (TDT) [24]. Power calculations for association analyses were performed using QUANTO [25, 26].
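For readers unfamiliar with the TDT, the following Python sketch shows the standard McNemar-type statistic computed from transmissions by heterozygous parents. It illustrates the general test only, not the software used in this study; the function name and the counts in the usage note are hypothetical.

```python
import math

def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one biallelic SNP.

    transmitted:   times heterozygous parents passed the test allele
                   to the affected offspring (b)
    untransmitted: times they passed the other allele instead (c)
    Returns (chi-square with 1 df, p-value, transmission odds ratio).
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = b / c if c else float("inf")
    return chi2, p_value, odds_ratio
```

With 60 transmissions and 40 non-transmissions, `tdt(60, 40)` gives a chi-square of 4.0 and a transmission odds ratio of 1.5. Because only within-family transmissions are counted, the statistic is not inflated by population substructure.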

Exploratory analysis on combined discovery and validation datasets

Exploratory analysis was applied on the combined dataset of unrelated case control individuals and the trio family data. A hybrid method was applied to combine the distinct estimates from separate case control samples and parent-offspring trio families [27]. The estimates obtained from separate analyses are combined into an overall risk estimate and provided with the corresponding p-value. As an exploratory analysis, the combined analysis was applied on the SNPs that were nominally significant (p < 0.01) in both the discovery and validation analysis.

Overlap of data with other published GWAS studies

Analysis of microarray genotype data for a subset of the case/control cohort (483 cases, 462 controls) from the Institute of Psychiatry, London has been included in several published GWAS meta-analysis reports, including the Wellcome Trust Case Control Consortium (WTCCC) 2007, Scott et al., 2009, and Sklar et al., 2011 [28–30]. Data from a subset of the Canadian cohort (334 cases and 257 controls) were also included in the meta-analyses published by Scott et al., 2009 and Sklar et al., 2011, and in locus-specific replication analyses by McMahon et al., 2010 (3p21.1) [31] and Francks et al., 2010 (19q13) [32]. Thus, the current study includes 55 cases and 49 controls from the London cohort and 97 cases and 183 controls from the Toronto cohort that do not overlap with these studies. GWAS for the small families does not overlap with any published genome-wide study, and pathway analysis on these cohorts has not been published previously.

Pathway analysis

Our pathway analysis on the discovery cohort data followed that described by Beyene et al. [33]. For each SNP passing QC, we performed univariate SNP association analysis using logistic regression in PLINK. We selected SNPs that showed nominal association (p < 0.01) in the discovery data association analysis; this comprised 5111 SNPs. We obtained the nearest gene for each of the selected SNPs from the Illumina SNP annotation file (HumanHap550Yv3_Gene_Annotation, available from icom.illumina.com) based on physical distance. The SNPs were mapped to 2155 genes. For each of the mapped genes, we obtained an aggregate summary measure based on individual values for SNPs assigned to this gene. Here we used the maximum absolute value of the summary measure over all SNPs mapped to the gene [33].
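The gene-level aggregation step can be sketched as follows. This is a minimal illustration of taking the maximum absolute SNP-level summary measure per gene; the SNP identifiers, gene names and statistic values are hypothetical placeholders, and the real mapping came from the Illumina annotation file.

```python
def gene_scores(snp_stats, snp_to_gene):
    """Collapse SNP-level statistics to one score per gene.

    snp_stats:   dict of SNP id -> signed summary measure (e.g. a test statistic)
    snp_to_gene: dict of SNP id -> nearest gene (assumed precomputed)
    Returns dict of gene -> maximum absolute statistic over its SNPs.
    """
    scores = {}
    for snp, stat in snp_stats.items():
        gene = snp_to_gene.get(snp)
        if gene is None:
            continue  # SNP without a gene assignment is skipped
        scores[gene] = max(scores.get(gene, 0.0), abs(stat))
    return scores

# Hypothetical example values
example_stats = {"snp1": -3.1, "snp2": 2.7, "snp3": 2.9}
example_map = {"snp1": "GENE_A", "snp2": "GENE_A", "snp3": "GENE_B"}
example_scores = gene_scores(example_stats, example_map)
```

Taking the maximum means that a gene is represented by its strongest single SNP signal, so genes covered by many SNPs have more chances to score highly.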

The aggregated summary measure was used to evaluate the significance of predefined pathways using Ingenuity Pathway Analysis software (IPA, version 11904312). Briefly, for a given pathway, statistical significance of the pathway enrichment is calculated using a Fisher's exact test based on the number of genes annotated, number of genes represented in the input dataset, and the total number of genes being assessed in the experiment. A pathway was deemed significant if the adjusted p-value of enrichment was ≤ 0.05 (adjusted for multiple comparisons using a Benjamini-Hochberg correction [34]).
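The enrichment test and multiple-testing correction described above can be sketched in plain Python. IPA is proprietary, so this is only an illustration of the underlying statistics under the assumption that a right-tailed Fisher's exact test is used; the function names are hypothetical.

```python
from math import comb

def enrichment_p(n_universe, n_pathway, n_input, n_overlap):
    """Right-tailed Fisher's exact test (hypergeometric upper tail).

    n_universe: total genes assessed
    n_pathway:  genes annotated to the pathway
    n_input:    genes in the input list
    n_overlap:  input genes that fall in the pathway
    """
    upper = min(n_pathway, n_input)
    total = comb(n_universe, n_input)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_input - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A pathway would then be called significant when its adjusted value from `benjamini_hochberg` is at most 0.05.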

Results and discussion

To test the common variant hypothesis more comprehensively, we performed an unbiased genome-wide association study of common variation using the discovery dataset of 1922 case–control samples. Findings were validated using the independent family-based cohort. Quality-control (QC) procedures were applied to the 510,740 single nucleotide polymorphisms (SNPs) in the discovery dataset and 440,794 SNPs in the validation dataset.

Population stratification

After applying QC filters, 502,877 common autosomal SNPs remained in the discovery dataset and 346,565 common autosomal SNPs remained in the validation dataset. To account for possible population stratification, principal component analysis was undertaken with EIGENSTRAT [20]. Five subjects were identified as population outliers and excluded from the analysis. Three principal components were selected based on the scree plot. Additional analyses for population stratification were undertaken with each of the genetic markers, adjusting for the three principal components. The final datasets included 912 cases and 903 controls in the discovery dataset and 224 families (636 individuals) in the validation dataset. The average genotyping rate in the remaining individuals was 99.7%. The logistic regression model was used for association analyses in the discovery cohort. In the discovery dataset, none of the p-values met the stringent and perhaps overly conservative Bonferroni correction for genome-wide significance (Figure 1A). The distribution of p-values examined in the discovery dataset demonstrated a close match to that expected for a null distribution except at the extreme tail of low p-values (Figure 2).
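As a worked illustration of the Bonferroni criterion, the threshold implied by the number of SNPs tested in the discovery dataset can be computed directly. This sketch assumes a family-wise alpha of 0.05, which is conventional but not stated explicitly in the text; the function name is hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

# 502,877 autosomal SNPs passed QC in the discovery dataset
discovery_threshold = bonferroni_threshold(502_877)

# Smallest discovery p-value reported in this study (rs11787406)
top_discovery_p = 2.35e-6
is_genome_wide_significant = top_discovery_p < discovery_threshold
```

Under this assumption the threshold is roughly 1E-7, so the strongest discovery signal falls short of it by more than an order of magnitude.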

Figure 1

A Manhattan plot is shown for A. the combined IoP/CAMH case/control cohort, and B. the CAMH small family sample, with –log10(P-value) plotted by genomic location for chromosomes 1–22. SNPs from each chromosome are represented by different colors and ordered by physical positions.

Figure 2

Quantile-Quantile (Q-Q) plot of p-values for the case control dataset. Note: The Q-Q plot measures deviation from the expected P-values. The diagonal (red) line represents the expected (null) distribution. The slight deviation of the observed values from expected values at the tail of the distribution is consistent with modest genetic effects.

Discovery dataset analysis

We computed the power of the 1815 samples in the discovery dataset. Given a prevalence of BD of 0.01 and a SNP in LD (D' = 1) with a risk allele of frequency 0.3, we have 76% power to detect significant association at p = 5.0E-7 under an additive model with a strong effect size of OR 1.5. To detect an association with the same assumptions at the p = 5.0E-8 significance level, the statistical power is reduced to 0.61. With a moderate effect size of OR 1.3, the power to detect genome-wide significant association (p < 5.0E-8) is very low. Despite no genome-wide significant association (p < 5.0E-8), 68 SNPs in our discovery dataset showed suggestive association with BD risk (p < 0.0001), many of which replicate other GWAS findings for BD (Table 1 shows a subset of these SNPs with previous GWAS evidence for BD; the full set is given in Additional file 1). The most significant SNP was rs11787406, which is located just downstream of the gene PRSS5 on 8p23.1 (p = 2.35E-6). Also, among the top 68 SNPs we see 6 SNPs within the gene SYNE1 on 6q25, with lowest p = 3.02E-6 (plus a further 8 SNPs among the top 1000 SNPs; Additional file 1: Table S1; Figure 3). SNPs in this gene showed moderate association in the WTCCC study of ~2000 bipolar cases and ~3000 controls [28], with a genotypic p-value of 1.92E-05 for SNP rs2763025. Similarly, in a GWA meta-analysis [35], 14 SNPs within SYNE1 were identified with p-value < 9.0E-6. As both the WTCCC study and the Liu et al. meta-analysis included the case/control cohort from London, this cannot be presented as a completely independent observation; however, the SYNE1 SNP rs17082664 also showed suggestive association in a combined analysis of WTCCC plus STEP-UCL and ED-DUB-STEP2 datasets (p = 3.6E-6), with much of the signal coming from the STEP-UCL study, which on its own gives p = 3E-4 [36]. A single SNP in ZNF659 on 3p24.3 was among the top 68 (with a further 4 SNPs among the top 1000; Additional file 1: Table S1). Sklar et al. [37] (no overlap with current datasets) also reported nominal association for this gene, with p = 3.25E-4 at rs259521. We also report 3 SNPs among the top 68 situated within the ZNF274 gene (rs4444432: p = 4.85E-6) on 19q13. However, this is some way distal to the nominal association for schizophrenia and psychosis reported by Francks et al. [32]. We also see suggestive association at rs4689410, within the PPP2R2C gene on 4p16.1 (p = 5.96E-6), which was previously reported to be associated with bipolar disorder [38, 39].

Table 1 SNPs showing suggestive association (p < 0.0001) to BD in our combined (CAMH and IoP) GWAS
Figure 3

Plots for association for combined IoP/CAMH cohorts across the SYNE1 and CSMD1 loci. Probability of significance of association for SNPs passing quality control is shown as –log10 of the P-value on the left hand Y-axis. Recombination rate as estimated from HapMap (http://hapmap.ncbi.nlm.nih.gov/) is plotted in light blue. Chromosomal position is plotted according to NCBI build 36/Hg18. The SNP with the strongest evidence for association at each locus is shown as a blue diamond. Correlation of linkage disequilibrium between SNPs and the blue diamond SNP, r2, is colour-coded, red indicating stronger LD.

In addition, a number of genes with no previous association to BD have multiple SNPs with suggestive association (Additional file 1: Table S4), including the brain-expressed genes, ADCY2, NCALD, WDR60, SCN7A and SPAG16.

Validation dataset analysis

We computed the power of the TDT in 224 trio families in our validation sample. Given a prevalence of BD of 0.01 and a SNP in LD (D' = 1) with a risk allele of frequency 0.3, we have 82% power to detect association at the p = 0.05 significance level under an additive model with a strong effect of OR 1.5. To detect an association with a moderate effect size of OR 1.3, the statistical power is reduced to 0.58. A total of 132 SNPs in the validation dataset showed suggestive association (p < 0.0001), with the lowest p-value at rs16873052 on 6p24.1, uncorrected p = 3.19E-7 (Figure 1B). Other SNPs showing suggestive association included SNPs within known candidate genes for BD, such as PDE4B (p = 7.45E-5). PDE4B encodes a phosphodiesterase that binds directly with DISC1, and is critical for cyclic adenosine monophosphate signalling, which is linked to learning, memory, and mood [40], and shows association with SCZ [41–44], and to some degree with BD [44]. SNPs were identified with suggestive association at a number of other genes with plausible biological arguments for involvement in, and/or previous associations with, BD, such as NRG3 (p = 5.53E-6), GAD2 (p = 2.21E-5), GRIK2 (p = 4.18E-5), GABRG3 (p = 3.83E-5), and the synapse-associated protein 102 gene, DLG3 (p = 5.31E-5). In addition, a SNP in ATP2A2, the Darier disease gene (MIM#124200), also showed suggestive association (p = 2.67E-5). Co-morbidity between Darier disease and BD has been known for some time [45], and linkage for BD to this locus has been reported in numerous studies [46–49]. Table 2 shows a subset of these 132 SNPs with previous evidence for BD or other neuropsychiatric disorders (the full set is given in Additional file 1).

Table 2 Selected SNPs (based on location within gene encoding protein of known function or disease association) showing suggestive association with TDT analysis (p < 0.0001) to BD in our validation dataset (small family cohort)

Exploratory analysis on combined discovery and validation datasets

Thirteen SNPs were nominally significant in the combined case/control discovery cohort and the trio validation cohort datasets, with joint analysis p < 0.01 (Table 3). A sign test for the same direction of effect between the discovery cohort and the trio validation cohort was significant (p < 0.001). Several SNPs in this list overlap with candidate genes of interest in non-overlapping studies. These include SNP rs1154037 in CSMD1, for which the intronic SNP rs4875310 showed suggestive significance in the Sklar et al. 2008 study (p = 3.74E-5), as well as other SNPs at CSMD1 in the Baum et al., 2008 study [50] (rs779105, NIMH cohort p = 0.0341, German cohort p = 0.0047; rs7812884, NIMH cohort p = 0.0012, German cohort p = 0.0103).
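The sign test mentioned above can be illustrated with an exact two-sided binomial calculation. The number of SNPs with concordant effect direction is not given in this section, so the value of 13 out of 13 in the usage note is a hypothetical input chosen only to show that such an outcome would be consistent with p < 0.001; the function name is also hypothetical.

```python
from math import comb

def sign_test_p(n_concordant, n_total):
    """Two-sided exact sign test.

    Null hypothesis: each SNP is equally likely to show the same or the
    opposite direction of effect in the two cohorts (probability 0.5).
    """
    k = max(n_concordant, n_total - n_concordant)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)
```

For example, `sign_test_p(13, 13)` is 2 × 0.5^13, about 2.4E-4, whereas an even split such as `sign_test_p(7, 13)` gives 1.0.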

Table 3 Most significant overlap between discovery set (case/controls) and validation set (families)

Pathway analysis

Using the set of 2155 genes identified by our association analysis, pathway analyses were performed with IPA. Of the 2155 genes with nominal associations, 1956 were mapped to the IPA database, and 30 pathways were significantly enriched at a Benjamini-Hochberg corrected p-value of 0.05 (see Table 4). Specific pathways involved in bipolar disorder (such as Neuropathic Pain Signaling, CREB Signaling in Neurons, etc.) were amongst those identified as most significant (Table 4). Consequently, these results suggest that the genes identified by our association analysis have a high degree of biological relevance.

Table 4 Canonical pathways enriched using 1956 genes from discovery set with adjusted p-value <0.05

Conclusions

The GWA study presented here represents a multi-staged analysis, combining case/control genome-wide genotype data from two “sister” studies with parallel recruitment strategies and identical genotyping platforms as a discovery set, and using a family-based cohort consisting mainly of trios as a validation set. We reported our results using thresholds of suggestive significance (p < 0.0001) and nominal significance (p < 0.01). This is based on the concern that the SNPs across the genome are not independent, so a simple Bonferroni adjustment may be too conservative. Although relatively few results reached suggestive significance in both discovery and validation sets, several of the overlapping SNPs are in genes of much interest for neuropsychiatric disease. One SNP in particular (rs1154037) is located within the third intron of the CUB and Sushi multiple domains 1 gene (CSMD1), which has been implicated by at least two further (non-overlapping) BD GWA studies [37, 50]. CSMD1, which has also been associated with schizophrenia [51, 52], is a complement control-related gene, and supports the theory of diminished activity of immunity-related pathways in the brain as a disease mechanism for psychiatric disorders including BD [53]. CSMD1 protein can inhibit the deposition of complement component C3 in vitro [54], and thus impaired function may lead to impaired regulation of the classical complement cascade. Alternatively, it is also known that proteins involved in regulating complement control can also regulate synaptic function [55, 56]. Also of note, the neuropeptide Y gene, NPY, also identified by a nearby SNP in the joint discovery and validation analysis, was previously shown to be significantly down-regulated in the dorsolateral prefrontal cortex of psychosis patients [57], and in prefrontal cortex of BD subjects [58].

Analysis of associations from our discovery set shows strong support for the SYNE1 locus (Table 1; Figure 3), albeit not at genome-wide significance levels. SYNE1 has been implicated in a number of independent studies. Mutations of this gene are known as a cause for autosomal recessive spinocerebellar ataxia 8 (MIM 610743) and Emery-Dreifuss muscular dystrophy 4 (MIM 612998). SYNE1 encodes a nesprin-1 component of a complex that links the cytoskeleton and nucleoskeleton (reviewed in [59]). However, several brain specific isoforms of rat Syne1 have been shown to localize to the postsynaptic side of synapses of glutamatergic neurons, and may be part of a mechanism of endocytosis of synaptic proteins, including glutamate receptors [60].

Our comparison with data from an independent BD GWAS from University College London showed joint suggestive significant loci at CDH13, PPP2R2C and IGFBP7 (McQuillin and Gurling, personal communication). Comparison with other published GWA studies for bipolar disorder, excluding those with partial overlap of subjects, appears to corroborate several loci and candidate genes, including CNTNAP5 [50, 61], ZNF804A [62], ZNF659 [37], SORCS2 [50, 63, 64], and ZNF536 [61]. A full list is provided in Additional file 1: Table S2. CNTNAP5, encoding another neurexin-like protein, has also been linked with autism [65]. Interestingly, CDH13 also reached suggestive significance in our validation set (rs7186123; p = 7.74E-5); however, the odds ratio suggests this allele is protective, whereas for the suggestive significant SNPs at CDH13 in the discovery set the alleles appear mostly to be risk alleles (Additional file 1: Tables S1, S2 and S3).

Of the three zinc finger genes listed as loci showing suggestive significant association in our combined case/control study and in other bipolar GWA studies, little is known about the function, except for ZNF536 on chromosome 19p13.3, which is highly expressed in the developing brain, and in cerebral cortex, hippocampus and hypothalamus and is believed to be a negative regulator of neuronal differentiation [66].

Suggestive association was seen at SNP rs4689410 within the gene PPP2R2C in our study (p = 5.96E-6). This gene has been previously reported to be associated with BD [38, 39], and has also shown modest association in the UCL study, for SNP rs13122929 (p = 9.95E-4; McQuillin and Gurling, personal communication). Disruption of this gene may also be a cause of autosomal dominant intellectual disability (ID) [67]. This is one among a number of genes for which disruption may cause ID and for which common alleles may also be associated with risk for BD or SCZ (e.g. ANK3, TCF4 and NRXN1).

Interestingly, a number of well-established GWAS candidate genes are not represented among our top 1000 p-values, including CACNA1C, ANK3 and DGKH [50, 64], or ODZ4 [30]. This could reflect differences in the population in terms of heterogeneity of phenotype or ethnicity, or an issue of insufficient power to detect an effect, or effects due to differences in method of ascertainment. Conversely, a number of genes in our validation set show multiple SNPs with suggestive association that have not been reported elsewhere (Additional file 1: Table S4), including the brain-expressed genes ADCY2, NCALD, WDR60, SCN7A and SPAG16. It will be of much interest to see whether support for these genes, for which no phenotype has previously been reported (Online Mendelian Inheritance in Man) [68], increases in BD meta-studies, once the sample size exceeds the tens of thousands.

In summary, the findings here support several key genetic associations with BD, at genes such as CSMD1 and SYNE1.