Deletions of chromosomal regulatory boundaries are associated with congenital disease
Recent data from genome-wide chromosome conformation capture analysis indicate that the human genome is divided into conserved megabase-sized self-interacting regions called topological domains. These topological domains form the regulatory backbone of the genome and are separated by regulatory boundary elements or barriers. Copy-number variations can potentially alter the topological domain architecture by deleting or duplicating the barriers and thereby allowing enhancers from neighboring domains to ectopically activate genes causing misexpression and disease, a mutational mechanism that has recently been termed enhancer adoption.
We use the Human Phenotype Ontology database to relate the phenotypes of 922 deletion cases recorded in the DECIPHER database to monogenic diseases associated with genes in or adjacent to the deletions. We identify combinations of tissue-specific enhancers and genes adjacent to the deletion and associated with phenotypes in the corresponding tissue, whereby the phenotype matched that observed in the deletion. We compare this computationally with a gene-dosage pathomechanism that attempts to explain the deletion phenotype based on haploinsufficiency of genes located within the deletions. Up to 11.8% of the deletions could be best explained by enhancer adoption or a combination of enhancer adoption and gene-dosage effects.
Our results suggest that enhancer adoption caused by deletions of regulatory boundaries may contribute to a substantial minority of copy-number variation phenotypes and should thus be taken into account in their medical interpretation.
Keywords: Clubfoot; Williams syndrome; Human Phenotype Ontology; Topological domain; Supravalvular aortic stenosis
Abbreviations
DHS: DNase I hypersensitive site
HPO: Human Phenotype Ontology
NIH REMC: National Institutes of Health’s Roadmap Epigenomics Mapping Consortium
TDB: topological domain boundary
WTCCC2: Wellcome Trust Case Control Consortium 2
Genomic deletions and duplications result in the loss or gain of specific genomic segments and thus are referred to as copy-number variants (CNVs). The phenotypes of CNV disorders are often complex, commonly involving intellectual disability and multiple congenital anomalies. The phenotypic abnormalities seen in some diseases associated with CNVs are thought to be related to altered gene dosage effects of one or more genes located within the CNV. For instance, Williams syndrome (WS) is a multisystem disorder that results from heterozygous deletion of 1.5 to 1.8 Mb on chromosome 7q11.23, which contains approximately 28 genes. Some of the phenotypic abnormalities of WS have been attributed to hemizygosity of individual genes located within the deleted region. Thus, hemizygosity for the ELN gene is thought to cause the supravalvular aortic stenosis, LIMK1 hemizygosity is implicated in the impaired visuospatial constructive cognition and GTF2I hemizygosity is thought to contribute to the mental retardation in WS patients.
Alteration of gene dosage by deletion or duplication or by disruption of genes located at the boundaries of CNVs thus represents a plausible pathomechanism for many phenotypic abnormalities seen in CNV disorders. However, structural variations such as CNVs, inversions or translocations can also change the regulatory context of genes, thereby disturbing the delicate balance between enhancers, silencers and insulators by interfering with the complex chromosomal looping and interaction mechanisms of promoters and one or more cis-regulatory elements. These changes in the regulatory environment of genes can result in misexpression and subsequent deregulation of signaling.
Long-range looping interactions over tens or even hundreds of kilobases together with three-dimensional nuclear organization, involving the positioning of genes, regulatory sequences and DNA binding proteins, help determine which genes are transcribed at any given time. Hi-C is a method that probes the three-dimensional architecture of whole genomes by coupling proximity-based ligation with massively parallel, next-generation sequencing. Recently, Hi-C was used to identify megabase-sized local chromatin interaction regions termed 'topological domains'; the domains represent highly self-interacting regions bounded by narrow segments where the chromatin interactions appear to end abruptly. Topological domains were suggested to represent chromosomal units that serve to spatially accommodate enhancer-promoter interactions and control gene expression levels across cell populations. The boundary regions between the domains are associated with CCCTC-binding factor (CTCF) binding sites, cohesin binding sites and active transcription of housekeeping genes. Recent knock-down experiments suggest that CTCF and cohesin contribute differentially to chromatin organization and gene regulation, but surprisingly depletion of both was not accompanied by disruption of topological domain organization. Therefore, it remains unclear whether the observed topological domains are the cause of genomic interaction or a consequence, but the boundaries between the domains might function as regulatory barriers by inhibiting the interaction of enhancers/silencers in one domain with genes in the adjacent domain. Recent studies in Drosophila suggest that insulator proteins are frequently found at topological domain boundaries (TDBs). It was also shown that insulators can organize and support very long-range functional interactions between regulatory elements at distances of up to several megabases.
Since insulator proteins not only mediate enhancer blocking but also contribute to the organization of chromosome architecture and the integrity of regulatory elements, they have been dubbed architectural proteins. The role of these architectural proteins in TDBs in vertebrates is currently being investigated.
We recently identified the etiology of Liebenberg syndrome, an autosomal-dominant upper-limb malformation, as a homeotic limb transformation in which the arms acquire morphological characteristics of a leg. We characterized deletions in the vicinity of PITX1 in patients with Liebenberg syndrome. PITX1 is a homeobox gene that plays a role in specifying the identity or structure of the lower limb. The structural changes are likely to remove a barrier element that separates the PITX1 regulatory domain from neighboring regulators. In Liebenberg syndrome, a highly conserved non-coding enhancer element, hs1473, which is normally separated from PITX1 by a TDB, was relocated into the vicinity of PITX1. Element hs1473 was shown to have forelimb-specific activity in mouse embryos, and transgenic hs1473-Pitx1 mice showed features characteristic of Pitx1 misexpression at embryonic day 15.5, as well as phenotypic features of forelimb-to-hindlimb transformation. These observations suggested that the pathomechanism of Liebenberg syndrome can best be explained by a topological domain boundary disruption (TDBD) between an enhancer with activity in the forelimb and a gene that is phenotypically related to the clinical manifestations observed in individuals with Liebenberg syndrome. We will refer to this phenomenon as 'enhancer adoption'.
This observation motivated us to ask whether computational evidence can be obtained for additional CNVs with an analogous pathomechanism by searching for a bioinformatic signature suggestive of enhancer adoption. Here, we perform a systematic computational analysis of phenotypes of patients in the DECIPHER database. Our results suggest that a substantial proportion of CNVs are associated with phenotypes that can be partially or completely explained by disruption of genomic barrier effects associated with ectopic activation of phenotypically relevant genes.
Results and discussion
In this work, we present a computational analysis of the hypothesis that the disruption of TDB regions may contribute to or even be the major factor of the phenotypes observed in a subset of CNV disorders. We developed an analysis strategy that relates the phenotypic features of the CNV disorders to the locations of genes and TDBs within and near to the CNV as well as the phenotypic features of monogenic disorders affecting these genes.
Our approach involves comparing the phenotypic features associated with the CNVs with the phenotypic features associated with Mendelian diseases of single genes located within or adjacent to the CNVs. To do so, we perform semantic similarity analysis using the Human Phenotype Ontology (HPO) as described in detail in the Materials and methods. We define a gene as being phenotypically relevant if mutations in that gene lead to a Mendelian disease with phenotypic abnormalities that are similar to those of the CNV disorder (such as the genes ELN, LIMK1 and GTF2I in WS, as described above). We analyzed 2,300 deletions in DECIPHER for which phenotype data were available, and found that the degree of similarity between CNV phenotypes and phenotypes associated with single genes located within the CNVs was significantly higher than for random deletions (19.6±28.8 compared to 14.2±25.6; P=8.54×10^-67, Wilcoxon test). This result suggests that our computational approach of 'explaining' the phenotypic features of CNVs is applicable to the analysis of deletions in the DECIPHER database.
Distribution of topological domain boundaries in pathogenic and neutral deletions
Table 1. CNV data from DECIPHER and control CNVs taken from the WTCCC2 study. Rows: deletions with phenotype data; deletions with unique target phenotype.
We were therefore motivated to investigate how common TDBD is among pathological deletions associated with congenital anomalies. However, given that the mean size of the deletions in DECIPHER is 3.68 Mb, with over three TDBs being removed on average, the mere fact that a pathological deletion disrupts a TDB is not surprising. We therefore reasoned that it is necessary to take tissue specificity of enhancers as well as the phenotypic abnormalities associated with genes within and adjacent to deletions into account to assess the potential association of TDBD with deletion phenotypes.
A computational phenotypic signature of topological domain boundary disruption
We reasoned that if TDBD is responsible for the pathogenesis of a sizable number of CNVs, then we should be able to detect a corresponding bioinformatic signature significantly more often than would be expected by random chance. To test this hypothesis, we developed a strategy for predicting computationally which CNVs are most likely to be partially or completely related to TDBD by comparing the phenotypes of the CNVs with the phenotypes of single-gene diseases of genes located within or adjacent to the CNVs and comparing their distribution with that of predicted tissue-specific enhancers (Additional file 1: Figure S2).
DNase-sequencing (DNase-seq) experiments from the National Institutes of Health’s Roadmap Epigenomics Mapping Consortium (NIH REMC) offer a unique resource for identifying enhancers. DNase I hypersensitivity, as measured by DNase-seq, has been used previously to characterize human cell lines, revealing cell-type-specific promoters and enhancers. The human genome is thought to harbor at least 400,000 enhancers, many of which exhibit tissue or developmental-stage specificity.
Table 2. Tissue-specific enhancers and corresponding HPO terms for ten tissue types
Tissue | HPO term name
Fetal adrenal gland | Abnormality of the adrenal glands
… | Abnormality of the forebrain
… | Abnormality of the heart
… | Abnormality of the intestine
… | Abnormality of the kidney
… | Abnormality of the lung
… | Abnormality of the musculature
… | Abnormality of the stomach
… | Abnormality of the thymus
White blood cells | Abnormality of leukocytes
In our analysis, we assigned a deletion to the category TDBD if it completely overlapped a TDB and if a tissue-specific enhancer and a phenotypically relevant gene were identified adjacent to the deletion, with the enhancer and the gene located on opposite sides of the deletion. A deletion was assigned to the category GDE if it contained one or more genes that were phenotypically relevant to the CNV, that is, for which the phenogram score (see Materials and methods) was above zero, with the additional condition that no computational evidence for TDBD was present. Finally, a deletion was assigned to the TDBD only category if the phenotypic similarity score of genes adjacent to the deletion was higher than for genes within the deletion. Note that a gene or enhancer was considered to be adjacent to the deletion if it was located between the deletion breakpoint and the distal end of the affected topological domain (Figure 1).
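The assignment rules above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the pipeline used for the analysis: the class `DeletionEvidence`, its field names and the label "unclassified" are hypothetical, and each deletion is assumed to have been reduced beforehand to one best phenotype score per flank and one for the genes inside it.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet

OPPOSITE = {"left": "right", "right": "left"}


@dataclass
class DeletionEvidence:
    # True if the deletion completely overlaps at least one TDB
    overlaps_tdb: bool
    # Flanks ("left"/"right") that carry a matching tissue-specific enhancer
    enhancer_sides: FrozenSet[str]
    # Best phenotype score of adjacent genes, per flank (hypothetical layout)
    adjacent_gene_score: Dict[str, float] = field(default_factory=dict)
    # Best phenotype score of genes located inside the deletion
    inside_gene_score: float = 0.0


def cross_boundary_score(ev: DeletionEvidence) -> float:
    """Best score of a relevant gene lying on the flank opposite an enhancer."""
    scores = [ev.adjacent_gene_score.get(OPPOSITE[s], 0.0) for s in ev.enhancer_sides]
    return max(scores, default=0.0)


def classify(ev: DeletionEvidence) -> str:
    adjacent = cross_boundary_score(ev)
    if ev.overlaps_tdb and adjacent > 0:
        # Enhancer and relevant gene on opposite sides of a deleted boundary
        return "TDBD only" if adjacent > ev.inside_gene_score else "TDBD"
    if ev.inside_gene_score > 0:
        # Relevant gene inside the deletion and no TDBD signature
        return "GDE"
    return "unclassified"
```

For example, a deletion with an enhancer on the left flank, a relevant gene on the right flank scoring 2.0 and inside genes scoring 0.5 would be labeled "TDBD only" by this sketch.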
For comparison, we then performed an analysis of randomized data, whereby the deletion was assigned randomly to a different phenotypic category from Table 2. For instance, a deletion originally assigned to Abnormality of the forebrain might be assigned to Abnormality of the kidney. We then tried to identify the best 'explanation' for the random phenotype as GDE or TDBD as described above. Since the phenotypic spectrum of CNVs is complex and often multiple organs are affected, it is not surprising that some matches are found, but we reasoned that if the signal we observed for TDBD events in the real data was genuine, a lower proportion of random deletions would be placed into this category. In fact, there were significantly fewer deletions assigned to the category TDBD (P=8 × 10^-4; Figure 4B). As an additional background model, we permuted the phenotype annotations of all human genes and found similar enrichment of TDBD deletions in the real data compared to randomized background (P=0.003; Figure 4B). The larger a deletion is, the more likely it is to contain haplosensitive genes whose deletion will cause a phenotype, whereas the chance that a deletion primarily acts by the TDBD mechanism should only depend on the enhancers and genes located adjacent to the deletion, and thus should not be dependent on the size of the deletion. Therefore, we investigated the relation between the number of topological domain boundaries affected by a CNV and the frequency of TDBD effect mechanisms. These data show that small deletions that overlap only one boundary show rates of 10% TDBD and thereby higher frequencies than larger deletions that overlap two or more domain boundaries. In all subsets of deletions that overlap up to three TDBs, the frequency of TDBD events was significantly higher in the DECIPHER CNV cases than in the randomized data with permuted CNV phenotypes (one TDBD: P=0.01; two TDBDs: P=0.0014; three TDBDs: P=0.0036; Additional file 1: Figure S3).
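One way to picture the randomization with permuted CNV phenotypes is a label-shuffling test. The sketch below is illustrative only: the empirical P-value formula (with the usual +1 correction), the function name `permutation_p_value` and the callback `is_tdbd` are assumptions, since the text does not spell out how the reported P-values were computed.

```python
import random
from typing import Callable, List, Tuple


def permutation_p_value(
    categories: List[str],
    is_tdbd: Callable[[int, str], bool],
    n_perm: int = 1000,
    seed: int = 0,
) -> Tuple[int, float]:
    """Compare the observed number of TDBD calls with shuffled phenotype labels.

    categories[i] is the target phenotype category of deletion i;
    is_tdbd(i, category) is a hypothetical callback that reports whether
    deletion i shows the TDBD signature for the given category.
    """
    rng = random.Random(seed)
    observed = sum(is_tdbd(i, c) for i, c in enumerate(categories))
    at_least_as_extreme = 0
    for _ in range(n_perm):
        shuffled = categories[:]
        rng.shuffle(shuffled)  # reassign phenotype categories among deletions
        count = sum(is_tdbd(i, c) for i, c in enumerate(shuffled))
        if count >= observed:
            at_least_as_extreme += 1
    # Empirical one-sided P-value; the +1 terms avoid reporting exactly zero
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)
```

Shuffling labels keeps the overall frequency of each phenotype category fixed, so any excess of TDBD calls in the real data cannot be attributed to some categories simply being more common.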
An alternative hypothesis to our concept of TDBD is simply that enhancer adoption occurs solely because a deletion brings a tissue-specific enhancer into the vicinity of a tissue-specific gene, regardless of chromosomal domains. The question boils down to whether TDBs tend to separate tissue-specific enhancers whose effect on phenotypically relevant genes would otherwise have a damaging effect. It would be difficult to provide a conclusive computational answer to this question for any specific CNV without extensive experimental validation. However, we did address the question by analyzing the 253 DECIPHER deletions that do not overlap any TDB. To do so, we searched in windows of 400 kb for the matching enhancer and gene signature on both sides of these deletions. We chose a distance of 400 kb because it corresponds to the median observed distance of 389.9 kb between CNV breakpoints and the next closest TDB (or in some cases the end of the chromosome or a region of unorganized chromatin at the border of a domain). Only 1.63% of the 922 DECIPHER deletions fulfilled our enhancer adoption criteria without overlapping a boundary element (Figure 4, right panel). This proportion is not more than expected from randomized data with permuted CNV phenotypes (1.44%, P=0.33) or permuted gene phenotypes (2.02%, P=0.86), which therefore suggests that the disruption of chromatin architecture by TDBD is a major factor in the enhancer adoption mechanism.
As a control for the specificity of enhancers, we repeated the TDBD analysis with the ubiquitous DHS and observed lower rates of TDBD events compared to the analysis with tissue-specific enhancers (3.69% for ubiquitous vs 4.45% for tissue-specific enhancers; Additional file 1: Figure S4A). Furthermore, the phenotypic similarity of genes adjacent to the deletion to the phenotypes of the patient was significantly higher for TDBD with tissue-specific enhancers compared to the ubiquitous enhancers (P=0.013; Additional file 1: Figure S4B).
Model organism data increases the number of interpretable copy-number variants
We recently presented an ontology-based approach to measure similarities between human disease manifestations and the mutational phenotypes in model organisms to identify candidate genes located within CNVs that best explain the individual phenotypic features of the CNV. Since there are considerably more mouse and zebrafish mutants with monogenic defects than the number of currently characterized Mendelian diseases of humans, we asked whether cross-species analysis would increase the percentage of CNVs that could be classified with our algorithm. As in our analysis of purely human disease data, we compared the similarity of the 2,300 DECIPHER deletion phenotypes to the phenotypes of the single-gene disorders of the genes located within the CNVs. However, here, we used the cross-species ontology Uberpheno to exploit mouse and zebrafish annotations for these genes. The phenotypic similarity for the DECIPHER deletions was significantly higher than for randomized deletions (62.2 ± 81.8 compared to 45.6 ± 66.8; P=2.36 × 10^-58). Using the cross-species data, we again analyzed the 922 DECIPHER deletions that had been assigned to a target phenotype corresponding to a tissue-specific enhancer. Compared with the purely human data, about 10% more cases could be classified for a total of 92% of all CNVs for which our phenotypic analysis allowed assignment to one of the categories TDBD and GDE. Compared to the rate of 4.45% TDBD events predicted with human data, 5.75% of deletions were characterized as purely TDBD using the model organism data. This was significantly more than for randomized data with permuted CNV phenotypes (P=0.011) and permuted gene phenotypes (P<0.001; Additional file 1: Figure S5).
DECIPHER deletions with predicted TDBD pathomechanism
A deletion of about 3.9 Mb on chromosome 10 leads to haploinsufficiency of a number of genes with no known phenotypic relevance to the CNV phenotype of multiple renal cysts (HP:0005562). The deletion also removes a total of five TDBs that lie between a predicted kidney-specific enhancer at chr10:118,480,800 to 118,481,000 and the gene FGFR2. Many fibroblast growth factors (FGF) and all of their receptors (FGFR) are expressed in the developing kidney, and overexpression of basic fibroblast growth factor in developing rodent kidneys can induce the formation of renal cysts in vivo. In humans, activating and loss-of-function mutations in FGFRs cause syndromes that are sometimes associated with urogenital anomalies, including lacrimo-auriculo-dento-digital syndrome and Antley-Bixler syndrome, both of which can be caused by FGFR2 mutations and in some cases are associated with severe congenital renal anomalies. Therefore, we hypothesize that disruption of the TDBs in the deletion in DECIPHER case 262197 results in overexpression of FGFR2 in the developing kidney with resultant formation of renal cysts (Figure 5B).
Two additional cases (not shown in Figure 5) showed deletions in the vicinity of the DUX4 gene. Facioscapulohumeral muscular dystrophy is an autosomal dominant disease associated with reduction in the copy number of the D4Z4 repeat at chromosome 4q35. The reduction in D4Z4 copy number leads to reduced polycomb silencing and production of a chromatin-associated non-coding RNA that coordinates derepression of 4q35 genes including the transcription factor DUX4. The resulting misexpression of DUX4 in skeletal muscle may be associated with apoptosis of muscle cells. A similar D4Z4 repeat array, which contains a paralog of DUX4 at chr10:135,480,558 to 135,485,241, has been identified on chromosome 10q26, but contractions at the 10q26 locus are not pathogenic. DECIPHER case 249776 represents a deletion of chr10:130,955,710 to 135,397,841. The deletion removes two TDBs thereby bringing 107 muscle-specific enhancers into the vicinity of the chromosome 10 DUX4 paralog. Similarly, DECIPHER case 4069 represents a deletion at chr10:129,690,073 to 135,422,505, which removes three TDBs and brings 33 muscle-specific enhancers into the vicinity of the chromosome 10 DUX4 paralog. Both DECIPHER cases are associated with a number of features including muscular hypotonia, which was the feature leading to the characterization of the deletion as TDBD. Therefore, one possibility for the pathogenesis of this feature might be an inappropriate activation of the chromosome 10 DUX4 gene by adoption of the muscle-specific enhancers.
Additional file 1: Table S2 provides an overview of the 41 DECIPHER CNVs classified as purely TDBD by our algorithm.
In this work, we have provided suggestive computational evidence that a TDBD pathomechanism may be involved in a substantial minority of deletions recorded in the DECIPHER database. For the great majority of deletions and other CNVs identified to date, medical interpretation ('explanation' of the phenotypic features found in an individual with the CNV) has been based on a guilt-by-association approach, in which one compares the CNV phenotypic features with those associated with monogenic diseases of the genes located within the CNV. Thus, the explanation of the phenotypic feature supravalvular aortic stenosis in WS is thought to be haploinsufficiency of the elastin gene, because individuals with loss-of-function mutations in this gene have the identical phenotypic abnormality. Comprehensive experimental investigation of the pathomechanism of a CNV disease such as WS might involve the generation of mouse models in which the orthologous chromosomal regions have been removed but each of the genes in turn is 'rescued' by addition of a corresponding transgene construct. Since strategies such as this are currently unthinkable for investigating the pathogenesis of human CNV diseases, numerous computational approaches have been applied to investigate the pathogenesis of CNVs. In the current work, we have shown that a computational approach to analyze deletions in light of adjacent tissue-specific enhancers and genes identifies up to around 10% of deletions in DECIPHER as having a potential contribution of the TDBD pathomechanism. While our approach does not provide proof of this pathomechanism, previous guilt-by-association approaches did not do so either. Our results do suggest that TDBD should be taken into account in the interpretation of deletions, and that corresponding experimental analysis of deletions may be fruitful for future research.
A limitation of our study is the fact that the size of deletions in DECIPHER (mean 3.68 Mb) is much greater than the mean distance between adjacent TDBs. In contrast, the deletions we identified in two individuals with Liebenberg syndrome were only 134 kb and 107 kb in size. The larger deletions that are common in DECIPHER are more likely to have a complex mode of pathogenesis resulting from haploinsufficiency of one or even multiple genes located within the deletion and in some cases at least from the enhancer adoption mechanism. However, we speculate that there may be a bias to submit cases with large CNVs to databases such as DECIPHER, because previous paradigms of CNV interpretation focused on a potential phenotypic relevance of genes located within the CNV itself, not on adjacent genes. Therefore, it may be fruitful for future research to search specifically for smaller deletions that conform to the enhancer adoption pathomechanism described here.
We did not analyze duplications in our study. The duplicated copy can be located adjacent to the original (tandem) or somewhere else in the genome, and a tandem duplication can be in the original orientation or inverted. Array CGH, which was used to generate the data investigated in our study, is not able to distinguish between these possibilities, each of which would be predicted to have a different effect on gene regulation by disruption of TDBs. However, a duplication could in principle bring elements that are normally separated by one or more TDBs into the vicinity of one another and thereby cause disease.
Our results have important implications for the medical and scientific interpretation of CNVs, and suggest that the pathomechanism of a sizable minority of CNVs (up to 11.8%) may be related to the disruption of TDBs with misregulation of phenotypically relevant genes due to enhancer adoption. Currently, medical interpretation of rare CNVs often involves comparing the phenotype seen in the patient carrying the CNV with that of monogenic diseases associated with genes located within the CNV. Our results suggest that it is also important to examine the topological domain structure in the region of the CNVs for the presence of tissue-specific enhancers and phenotypically relevant genes that lie adjacent to the CNV itself. It will also be important to develop experimental strategies for investigating these cases based on chromosomal conformation capture or similar approaches. Finally, the analysis described in this paper was made possible because of data shared by many in the community within the framework of the DECIPHER database, demonstrating the value of sharing genotype and phenotype information with appropriate data access conditions. Phenotypic data will continue to be key to understanding the medical relevance of genomic variation.
Materials and methods
Clinical and molecular copy-number variant data
The DECIPHER database is an online repository of rare genomic CNVs and associated phenotypic data. For each of the 7,535 cases in DECIPHER, we considered only the single largest CNV, of which 4,055 were deletions. Of these, 2,300 were annotated with phenotypic data and were used for our analysis. We additionally compiled a set of CNVs from 5,919 individuals participating in WTCCC2 as common controls as previously described. After mapping the genomic coordinates to the hg19 reference genome using the UCSC liftover tool, we again took only the largest CNV per case into account and analyzed only deletions. Our underlying assumption with this data is that CNVs observed among adults recruited as controls for genome-wide association studies are unlikely to be causative of congenital anomalies.
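The case selection described above amounts to three filters applied in order. The following is a minimal sketch, assuming a hypothetical record layout (dictionaries with the keys `case_id`, `type`, `start`, `end` and `phenotypes`) that does not correspond to the actual DECIPHER export format.

```python
from typing import Dict, Iterable, List


def select_deletions(cnvs: Iterable[dict]) -> List[dict]:
    """Keep the largest CNV per case, then only annotated deletions."""
    largest: Dict[object, dict] = {}
    for cnv in cnvs:
        size = cnv["end"] - cnv["start"]
        best = largest.get(cnv["case_id"])
        if best is None or size > best["end"] - best["start"]:
            largest[cnv["case_id"]] = cnv  # single largest CNV of this case
    return [
        cnv
        for cnv in largest.values()
        # The largest CNV must be a deletion with at least one phenotype term
        if cnv["type"] == "deletion" and cnv["phenotypes"]
    ]
```

Note that under this reading a case whose largest CNV is a duplication is dropped entirely, even if it also carries a smaller deletion.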
Tissue-specific enhancer prediction
DNase-seq is a high-throughput experimental technology, which has been shown to be effective in identifying open chromatin regions that correspond to active gene regulatory elements. Nucleosome-depleted regions representing open chromatin are distinguished from DNA regions that are tightly wrapped in nucleosomes or in higher-order structures by the ability of DNase I to digest the sequences. DNase-seq identifies such DHSs by capturing DNase-digested fragments and sequencing them by next-generation sequencing. Transcription factor binding is highly cell-type specific, and the investigation of differential DNase I hypersensitivity provides a general approach for predicting cell-type specific binding profiles. In this work, we have developed a computational methodology to predict tissue-specific enhancers on the basis of differential DNase I hypersensitivity profiles from ten human tissues (Table 2). Accessible chromatin regions are preferentially cleaved by endonucleases, such as DNase I, and are therefore referred to as hypersensitive, and can be measured using DNase-seq by digesting chromatin with the endonuclease DNase I followed by next-generation sequencing. DNase-seq thus generates a genome-wide map of DHSs that reflects the degree to which sequence regions were accessible.
DNase-seq reads from the NIH REMC were counted in 200-bp windows covering the human genome. Windows that overlapped repetitive elements in RepeatMasker with scores higher than 1,000 were eliminated leaving 9.7 million windows. Genomic range manipulation and counting were performed using BEDTools. The logs of the counts plus a pseudocount of one were normalized for sequencing depth by multiplying each sample by the average read count over all samples divided by the sample's average read count. For each sample, we counted the number of DNase-seq reads falling into non-overlapping 200-bp windows along the human genome excluding strong repeat sequences. After accounting for different sequencing depth in the samples, we generated an average profile for each tissue as well as for all tissues combined (ubiquitous DHS) (Figure 3). Using correlation to measure distance between DNase profiles, we were able to group samples by cell type with hierarchical clustering (Figure 3B). The differences for each 200-bp window and each tissue from the average profile were calculated and weighted by the pooled within-tissue standard deviation. This derived quantity corresponds to a t-statistic and measures the specificity of a DHS for the corresponding tissue. We then ranked all the 200-bp windows for each tissue such that top-ranked sites corresponded to the largest positive t-statistics.
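The normalization and specificity score described above can be sketched in a few lines. This is an illustration on toy data rather than the R code used for the analysis, and it rests on several assumptions: the depth correction is applied to the log counts (a literal reading of the text), the combined profile is the mean over all samples, and the function name `tissue_specificity` is hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def tissue_specificity(
    counts: Sequence[Sequence[float]], tissues: Sequence[str]
) -> Dict[str, List[float]]:
    """Per-tissue t-like statistic for each window.

    counts[i][w] is the read count of sample i in window w;
    tissues[i] is the tissue label of sample i.
    """
    logs = [[math.log(c + 1.0) for c in row] for row in counts]  # pseudocount of one
    sample_means = [sum(r) / len(r) for r in logs]
    grand_mean = sum(sample_means) / len(sample_means)
    # Depth normalization: scale each sample to the average over all samples
    norm = [[v * grand_mean / m for v in r] for r, m in zip(logs, sample_means)]

    n_win = len(norm[0])
    groups = defaultdict(list)
    for row, tissue in zip(norm, tissues):
        groups[tissue].append(row)

    overall = [sum(r[w] for r in norm) / len(norm) for w in range(n_win)]
    means = {
        t: [sum(r[w] for r in rows) / len(rows) for w in range(n_win)]
        for t, rows in groups.items()
    }
    # Pooled within-tissue standard deviation per window
    dof = len(norm) - len(groups)
    s = [
        math.sqrt(
            sum((r[w] - means[t][w]) ** 2 for t, rows in groups.items() for r in rows) / dof
        )
        for w in range(n_win)
    ]
    s0 = sum(s) / n_win  # stabilizing constant: mean of s over all windows
    return {
        t: [(means[t][w] - overall[w]) / (s[w] + s0) for w in range(n_win)]
        for t in sorted(groups)
    }
```

Sorting the windows of one tissue by decreasing statistic then gives the ranking from which top tissue-specific sites are taken. The sketch requires at least two samples per tissue on average so that the pooled variance has positive degrees of freedom.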
We have shown that our quantitative measure of tissue specificity allows us to define a reproducible set of ranked DHSs. Next, we tested whether the location and the chromatin environment of the identified cell-type-specific DHSs (CTS-DHSs) support our claim that they are indeed specific for a tissue or cell type. The top CTS-DHSs are located primarily in intronic and intergenic regions. This is in stark contrast to the top ubiquitous DHSs, of which 72% overlap promoter regions (Figure 3). These findings suggest that the CTS-DHSs are mainly enhancers, which may regulate nearby genes, a conclusion that has also been drawn in earlier studies about cell lines.
We used the profiles of the normalized log counts from each DNase-seq sample to find regions of similarity and difference across the tissue types. We created an average profile of DNase accessibility for each tissue type as well as across all tissue types (ubiquitous DNase hypersensitive sites). We then predicted tissue specificity based on a calculation of the within-tissue-type variance of DNase accessibility.
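Based on the verbal description (difference from the average profile, weighted by the pooled within-tissue standard deviation plus a stabilizing constant), the statistic plausibly takes the following form. The symbols \bar{x}_{w,k} (mean normalized log count of window w in tissue k), \bar{x}_{w} (mean over all tissues) and s_w (pooled within-tissue standard deviation of window w, written s in the text) are notation assumed here for illustration:

```latex
t_{w,k} = \frac{\bar{x}_{w,k} - \bar{x}_{w}}{s_{w} + s_{0}}
```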
where s0 is the mean of s over all windows and is added to prevent division by small within-cell-type variance estimates. The ranking of these t-statistics over all windows was used to quantify the cell-type specificity. Statistical analysis was carried out using the R statistics environment, using the sparse matrix package Matrix.
To estimate the number of reproducible top-ranked DHSs, all DNase-seq samples were split into two equally stratified groups. Then, within-cell-type standard deviations and CTS-DHSs were calculated separately for each group. For the top n sites, the reproducible ratio (the proportion of top CTS-DHSs that are shared between the two groups) was calculated. Looking at reproducible ranks (correspondence at the top plots) helps to determine at what cutoff the ranks transition from consistent ones into lower ranks dominated by noise. Maxima were defined using interpolation of reproducible ratio curves (Additional file 1: Figure S1 and Additional file 1: Table S1).
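The reproducible ratio for a given cutoff n can be computed directly from the two rankings. The sketch below is a simplified illustration: the function names are hypothetical, and it evaluates the ratio at discrete cutoffs without the interpolation step used to define the maxima.

```python
from typing import List, Sequence, Tuple


def reproducible_ratio(rank_a: Sequence[str], rank_b: Sequence[str], n: int) -> float:
    """Proportion of the top n sites shared between two independent rankings."""
    return len(set(rank_a[:n]) & set(rank_b[:n])) / n


def ratio_curve(
    rank_a: Sequence[str], rank_b: Sequence[str], cutoffs: Sequence[int]
) -> List[Tuple[int, float]]:
    """Reproducible ratio at each cutoff, e.g. for plotting against n."""
    return [(n, reproducible_ratio(rank_a, rank_b, n)) for n in cutoffs]
```

With rankings `["w1", "w2", "w3", "w4"]` and `["w2", "w1", "w3", "w6"]`, the ratio is 1.0 for n = 2 and 0.75 for n = 4, since only three of the top four windows are shared.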
Topological domains and boundaries
Topological domain data from genome-wide higher-order chromatin interaction data in human embryonic stem cells were downloaded and mapped to hg19 coordinates using the UCSC liftover tool. TDBs are defined as regions with size up to 400,000 bp (400 kb) between topological domain regions.
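The boundary definition and the test for complete overlap by a deletion can be sketched as follows. This is a minimal illustration assuming domains are given as (chromosome, start, end) tuples; the function names are hypothetical, and treating directly abutting domains (gap of zero) as a boundary is an assumption.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def boundaries_from_domains(domains: List[Interval], max_size: int = 400_000) -> List[Interval]:
    """Gaps of at most max_size between consecutive domains on one chromosome."""
    ordered = sorted(domains)
    boundaries = []
    for (chrom_a, _, end_a), (chrom_b, start_b, _) in zip(ordered, ordered[1:]):
        gap = start_b - end_a
        # Larger gaps are not counted as boundaries
        if chrom_a == chrom_b and 0 <= gap <= max_size:
            boundaries.append((chrom_a, end_a, start_b))
    return boundaries


def deletion_covers(deletion: Interval, boundary: Interval) -> bool:
    """True if the deletion completely overlaps the boundary."""
    return (
        deletion[0] == boundary[0]
        and deletion[1] <= boundary[1]
        and boundary[2] <= deletion[2]
    )
```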
Analyzing phenotypic similarity: human phenotype ontology and the Uberpheno ontology
where p t is the observed frequency of patients annotated to term t among all annotated patients in DECIPHER. Note that the annotation propagation rule applies here, i.e., if a patient is annotated to a term t, then the patient is also annotated to all of the more general terms.
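A small sketch may help to make the propagation rule concrete. It assumes the standard definition of information content as the negative logarithm of the term frequency, and it uses a toy ontology with hypothetical term names rather than real HPO identifiers.

```python
import math


def ancestors(term, parents):
    """Return the term together with all more general terms reachable from it."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def information_content(patient_terms, parents):
    """IC(t) = -log(p_t), where p_t is the fraction of patients annotated to t
    after propagating every annotation to all ancestor terms."""
    n_patients = len(patient_terms)
    counts = {}
    for terms in patient_terms.values():
        propagated = set()
        for term in terms:
            propagated |= ancestors(term, parents)
        for term in propagated:
            counts[term] = counts.get(term, 0) + 1
    return {term: -math.log(count / n_patients) for term, count in counts.items()}


# Toy ontology (child -> parents) and two toy patients.
toy_parents = {"clubfoot": ["limb"], "limb": ["root"], "seizure": ["root"]}
toy_patients = {"p1": ["clubfoot"], "p2": ["seizure"]}
ic = information_content(toy_patients, toy_parents)
```

Because of propagation, the root term is annotated to every patient and has an information content of zero, whereas specific terms receive higher values.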
For some of the analyses described in this work, we assigned patients to one of ten phenotypic categories corresponding to the ten tissue-specific enhancers. This strategy was based on observations in families with PITX1 mutations and for Liebenberg syndrome. The transcription factor Pitx1 is expressed predominantly in the developing hindlimb and is only minimally expressed in the forelimb, suggesting that Pitx1 is an important regulator of hindlimb identity. Both a missense mutation in the highly conserved homeodomain of PITX1 and a 241-kb chromosome 5q31 microdeletion have been shown to result in clubfoot in humans, allowing PITX1 to be assigned to the top-level category of genes with phenotypic relevance for the skeleton. In our previous work, we showed that heterotopic activation of Pitx1 by tissue-specific skeletal (forelimb) enhancers leads to Liebenberg syndrome. We note that the phenotypic features of these diseases are distinct (clubfoot with PITX1 mutations and an upper-limb malformation in Liebenberg syndrome), but that they both affect the skeletal system. Therefore, we reasoned that if heterotopic activation of a gene by a tissue-specific enhancer is responsible for a CNV phenotype, then we should expect a phenotypic abnormality in the same organ system rather than necessarily an exact phenotypic match.
We only included cases in the further analysis if they had at least one term in S ij and if there was a unique maximum for one of the ten T i. In this fashion, 922 of the 2,300 deletion cases could be assigned to one of the ten phenotype categories in Table 2. Of the remaining cases, 1,377 could not be classified because they did not share phenotype terms with any of the target terms. One case was excluded from further analysis because maximal values were obtained for more than one target term by Equation 4.
Quantification of phenotypic similarities
The genomic coordinates of human genes in hg19 were retrieved from the UCSC known-genes table and mapped to Entrez Gene IDs. For the resulting 23,459 genes, only the longest transcript was considered. The similarity between the set of phenotype terms annot j used to annotate a patient j and the set of terms associated with genes in the genomic region within or adjacent to a deletion was calculated as described previously, with some modifications. For each gene g in a region G CNV within or adjacent to a deletion, a phenogram score S g is defined based on the information content of the term. For these calculations, the frequencies p t were calculated based on HPO project annotations for human diseases. For cross-species analyses, the frequencies p t were calculated based on annotations to term t amongst all annotated genes in humans, mice and zebrafish in the cross-species phenotype ontology Uberpheno.
We note that in our previous work, we used a scoring scheme designed to identify all genes within the CNVs that were good candidates for contributing to the phenotypic spectrum of the CNV. This was possible because of our detailed manual biocuration of the 27 CNV syndromes. For the current project, we chose a scoring system that looks for a single gene within or adjacent to the CNV with the maximal phenotypic similarity, since the annotations in DECIPHER are much less detailed.
To test whether the phenogram score in Equation 7 captures clinical similarities between deletions and the genes located within them (as with the ELN gene and WS as explained in the introduction), we placed each of the 2,300 DECIPHER deletions with at least one HPO term 100 times randomly on the genome and compared the distribution of phenogram scores of genes within the random deletions against those of the DECIPHER deletions with a Wilcoxon/Mann-Whitney test.
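The random placement step might be sketched as follows. This is an illustration only: it assumes that each deletion keeps its length and is placed uniformly over all positions at which it fits, and the chromosome sizes and function names are hypothetical. The subsequent comparison of score distributions (for example with a Wilcoxon/Mann-Whitney test) is not shown.

```python
import random


def place_randomly(length, chrom_sizes, rng):
    """Place a deletion of the given length at a uniformly chosen valid position."""
    # Weight each chromosome by the number of start positions at which the deletion fits.
    candidates = [(c, size - length + 1) for c, size in chrom_sizes.items() if size >= length]
    if not candidates:
        raise ValueError("deletion is longer than every chromosome")
    chroms, weights = zip(*candidates)
    chrom = rng.choices(chroms, weights=weights, k=1)[0]
    start = rng.randrange(chrom_sizes[chrom] - length + 1)
    return chrom, start, start + length


def random_background(deletions, chrom_sizes, n_rep=100, seed=0):
    """Generate n_rep random placements for every (chrom, start, end) deletion."""
    rng = random.Random(seed)
    return [
        place_randomly(end - start, chrom_sizes, rng)
        for _, start, end in deletions
        for _ in range(n_rep)
    ]


# Toy genome with two chromosomes and a single 300-bp deletion.
toy_sizes = {"chrA": 1000, "chrB": 500}
background = random_background([("chrA", 100, 400)], toy_sizes, n_rep=100)
```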
For a given patient assigned to the phenotype target term T, we define a deletion as a TDBD if it completely overlaps a TDB, has a T-specific enhancer in one region adjacent to the CNV and has a gene associated with T in the adjacent region located on the other side of the CNV. Adjacent regions span the genomic sequence from each end of the deletion up to the end of the current domain (Figure 1B).
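The definition above can be expressed as a short predicate. The sketch below is illustrative: it assumes that the two adjacent regions have already been computed, that all features are (chromosome, start, end) tuples, and that an enhancer or gene counts only if it lies entirely within an adjacent region. All names and coordinates are hypothetical.

```python
def covers(outer, inner):
    """True if interval inner lies entirely within interval outer on the same chromosome."""
    return outer[0] == inner[0] and outer[1] <= inner[1] and inner[2] <= outer[2]


def is_boundary_disruption(deletion, boundaries, left_region, right_region, enhancers, genes):
    """Check the three conditions from the text for one deletion and one target term T.

    enhancers: intervals of T-specific enhancers.
    genes: intervals of genes associated with T.
    """
    # 1. The deletion must completely overlap at least one boundary.
    if not any(covers(deletion, b) for b in boundaries):
        return False
    # 2. and 3. Enhancer on one side, associated gene on the other side.
    enhancer_left = any(covers(left_region, e) for e in enhancers)
    enhancer_right = any(covers(right_region, e) for e in enhancers)
    gene_left = any(covers(left_region, g) for g in genes)
    gene_right = any(covers(right_region, g) for g in genes)
    return (enhancer_left and gene_right) or (enhancer_right and gene_left)


# Toy example: enhancer to the left of the deletion, gene to the right.
toy_deletion = ("chr1", 900, 1300)
toy_tdbs = [("chr1", 1000, 1200)]
toy_left, toy_right = ("chr1", 0, 900), ("chr1", 1300, 2000)
hit = is_boundary_disruption(
    toy_deletion, toy_tdbs, toy_left, toy_right,
    enhancers=[("chr1", 100, 200)], genes=[("chr1", 1500, 1800)],
)
```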
To assess the statistical significance of TDBD events in DECIPHER, we simulated a background distribution by permuting the phenotype annotations in the following way. We assigned to each DECIPHER patient i the phenotype annotation of a randomly chosen DECIPHER patient j that is not in the same target term group as the original patient i. We repeated this procedure for all 922 deletion patients 10,000 times and computed the empirical P value as the fraction of randomizations for which a rate of TDBD events higher than or equal to that in the original annotation assignment was observed. As a further control, we permuted the phenotype annotations not of the CNV patients but of the genes. To do so, we shuffled all 23,459 human gene IDs randomly and replaced each gene in the HPO annotation files with a random other gene. This approach to permutation holds the number of disease genes and the depth of annotation constant. We computed an empirical P value as the proportion of 1,000 permutations in which a rate of TDBD events higher than or equal to that for the non-permuted gene phenotypes was observed.
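The patient-level permutation and the empirical P value can be sketched as follows. The data layout (a dictionary mapping each patient ID to a target term and an annotation) and the function names are assumptions made for illustration, and the computation of the event rate for each permutation is not shown.

```python
import random


def permute_annotations(patients, rng):
    """Give each patient the annotation of a random patient from a different target-term group.

    patients: dict mapping patient ID -> (target_term, annotation).
    Raises IndexError if a patient has no donor outside its own group.
    """
    ids = list(patients)
    permuted = {}
    for pid in ids:
        group = patients[pid][0]
        donors = [other for other in ids if patients[other][0] != group]
        permuted[pid] = patients[rng.choice(donors)][1]
    return permuted


def empirical_p_value(observed_rate, permuted_rates):
    """Fraction of permutations with a rate higher than or equal to the observed rate."""
    if not permuted_rates:
        raise ValueError("at least one permutation is required")
    return sum(1 for rate in permuted_rates if rate >= observed_rate) / len(permuted_rates)


# Toy data: one skeletal patient and two patients in a second group.
toy_cohort = {
    "a": ("skeleton", ["x"]),
    "b": ("brain", ["y"]),
    "c": ("brain", ["z"]),
}
shuffled = permute_annotations(toy_cohort, random.Random(0))
p_value = empirical_p_value(0.1, [0.05, 0.1, 0.2, 0.0])
```

Counting ties as exceedances ("higher or equal") makes the estimate conservative, which matches the wording used for both permutation schemes.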
Data and code deposition
Python scripts that implement the algorithms described in this manuscript have been deposited in GitHub. This repository also contains files with data on the tissue-specific enhancers used in this analysis. Additionally, the source code used for the analysis of tissue-specific enhancers, which performs simple statistics on sparse data sets without losing sparsity, has been deposited as SparseData in GitHub. The phenotypic data on patients with CNVs were obtained from the DECIPHER consortium. The DECIPHER website offers information on how researchers can obtain access to these data. It is also possible to visualize individual deletions in the UCSC Genome Browser. For example, the deletion chr19:30682288 to 36367331 (which is the second entry in Additional file 1: Table S2) can be visualized by selecting the human genome assembly of February 2009 (GRCh37/hg19) in the UCSC Browser, entering the search term 'chr19:30682288-36367331', setting the DECIPHER track in the section Phenotype and Literature to full, and clicking the refresh button. The individual in question has the ID DECIPHER:3776, and by letting the mouse hover over the red bar next to the number 3776 in the browser, the corresponding phenotype terms will be shown. It is not currently possible to download all DECIPHER data from the UCSC Genome Browser.
We used the latest version of the DECIPHER data from 5 April 2013 with 7,535 patients. The patient IDs are represented by increasing numbers and the last patient we analyzed has the ID 273601. The HPO and Uberpheno data are publicly available from the HPO download page. For the analysis described here, we used OBO files and annotation tables from HPO build #856 (9 December 2013) and build #132 (9 December 2013) of the cross-species ontology Uberpheno.
JI, SK and SB carried out the computational analysis of CNVs, phenotypes and TDBDs. MIL and HC performed the computational analysis of tissue-specific enhancers. NH and MEH performed the computational analysis of WTCCC2 CNVs. MH, NLW, SK, CJM, SEL and PNS designed the methodology for cross-species phenotype analysis and participated in the evaluation of the phenotype analysis results in this study. CEO contributed the medical analysis of the results. SM, MS and PNR conceived the study. JI, MS and PNR drafted the manuscript. All authors read and approved the final manuscript.
This work was supported by grants from the Bundesministerium für Bildung und Forschung (project number 0313911), by the European Community’s Seventh Framework Programme (grant agreement 602300; SYBIL), the National Institutes of Health (NIH Office of the Director Grant #5R24OD011883), and by a grant from the Max Planck Foundation to SM. MS was supported by a fellowship of the Berlin-Brandenburg School for Regenerative Therapies, Berlin, Germany.
- 1. Vulto-van Silfhout AT, Hehir-Kwa JY, van Bon BWM, Schuurs-Hoeijmakers JHM, Meader S, Hellebrekers CJM, Thoonen IJM, de Brouwer APM, Brunner HG, Webber C, Pfundt R, de Leeuw N, de Vries BBA: Clinical significance of de novo and inherited copy number variation. Hum Mutat. 2013, 34: 1679-1687. 10.1002/humu.22442.
- 4. Frangiskakis JM, Ewart AK, Morris CA, Mervis CB, Bertrand J, Robinson BF, Klein BP, Ensing GJ, Everett LA, Green ED, Pröschel C, Gutowski NJ, Noble M, Atkinson DL, Odelberg SJ, Keating MT: LIM-kinase1 hemizygosity implicated in impaired visuospatial constructive cognition. Cell. 1996, 86: 59-69. 10.1016/S0092-8674(00)80077-X.
- 5. Morris CA, Mervis CB, Hobart HH, Gregg RG, Bertrand J, Ensing GJ, Sommer A, Moore CA, Hopkin RJ, Spallone PA, Keating MT, Osborne L, Kimberley KW, Stock AD: GTF2I hemizygosity implicated in mental retardation in Williams syndrome: genotype-phenotype analysis of five families with deletions in the Williams syndrome region. Am J Med Genet A. 2003, 123A: 45-59. 10.1002/ajmg.a.20496.
- 7. Ott CE, Hein H, Lohan S, Hoogeboom J, Foulds N, Grünhagen J, Stricker S, Villavicencio-Lorini P, Klopocki E, Mundlos S: Microduplications upstream of MSX2 are associated with a phenocopy of cleidocranial dysplasia. J Med Genet. 2012, 49: 437-441. 10.1136/jmedgenet-2012-100825.
- 8. Verdin H, D’haene B, Beysen D, Novikova Y, Menten B, Sante T, Lapunzina P, Nevado J, Carvalho CMB, Lupski JR, De Baere E: Microhomology-mediated mechanisms underlie non-recurrent disease-causing microdeletions of the FOXL2 gene or its regulatory domain. PLoS Genet. 2013, 9: e1003358. 10.1371/journal.pgen.1003358.
- 11. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009, 326: 289-293. 10.1126/science.1181369.
- 14. Zuin J, Dixon JR, van der Reijden MIJA, Ye Z, Kolovos P, Brouwer RWW, van de Corput MPC, van de Werken HJG, Knoch TA, van IJcken WFJ, Grosveld FG, Ren B, Wendt KS: Cohesin and CTCF differentially affect chromatin architecture and gene expression in human cells. Proc Natl Acad Sci USA. 2014, 111: 996-1001. 10.1073/pnas.1317788111.
- 18. Li HB, Müller M, Bahechar IA, Kyrchanova O, Ohno K, Georgiev P, Pirrotta V: Insulators, not Polycomb response elements, are required for long-range interactions between Polycomb targets in Drosophila melanogaster. Mol Cell Biol. 2011, 31: 616-625. 10.1128/MCB.00849-10.
- 20. Spielmann M, Brancati F, Krawitz PM, Robinson PN, Ibrahim DM, Franke M, Hecht J, Lohan S, Dathe K, Nardone AM, Ferrari P, Landi A, Wittler L, Timmermann B, Chan D, Mennen U, Klopocki E, Mundlos S: Homeotic arm-to-leg transformation associated with genomic rearrangements at the PITX1 locus. Am J Hum Genet. 2012, 91: 629-635. 10.1016/j.ajhg.2012.08.014.
- 22. Swaminathan GJ, Bragin E, Chatzimichali EA, Corpas M, Bevan AP, Wright CF, Carter NP, Hurles ME, Firth HV: DECIPHER: web-based, community resource for clinical interpretation of rare variants in developmental disorders. Hum Mol Genet. 2012, 21: R37-R44. 10.1093/hmg/dds362.
- 23. Xi H, Shulha HP, Lin JM, Vales TR, Fu Y, Bodine DM, McKay RDG, Chenoweth JG, Tesar PJ, Furey TS, Ren B, Weng Z, Crawford GE: Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genet. 2007, 3: e136. 10.1371/journal.pgen.0030136.
- 24. Song L, Zhang Z, Grasfeder LL, Boyle AP, Giresi PG, Lee BK, Sheffield NC, Gräf S, Huss M, Keefe D, Liu Z, London D, McDaniell RM, Shibata Y, Showers KA, Simon JM, Vales T, Wang T, Winter D, Zhang Z, Clarke ND, Birney E, Iyer VR, Crawford GE, Lieb JD, Furey TS: Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 2011, 21: 1757-1767. 10.1101/gr.121541.111.
- 25. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, Ku M, Durham T, Kellis M, Bernstein BE: Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011, 473: 43-49. 10.1038/nature09906.
- 28. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom R, Qu H, Brody J, Shafer A, Neri F, Lee K, Kutyavin T, Stehling-Sun S, Johnson AK, Canfield TK, Giste E, Diegel M, Bates D, Hansen RS, Neph S, Sabo PJ, Heimfeld S, Raubitschek A, Ziegler S, Cotsapas C, Sotoodehnia N, Glass I, Sunyaev SR, et al: Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012, 337: 1190-1195. 10.1126/science.1222794.
- 29. Doelken SC, Köhler S, Mungall CJ, Gkoutos GV, Ruef BJ, Smith C, Smedley D, Bauer S, Klopocki E, Schofield PN, Westerfield M, Robinson PN, Lewis SE: Phenotypic overlap in the contribution of individual genes to CNV pathogenicity revealed by cross-species computational analysis of single-gene mutations in humans, mice and zebrafish. Disease Models Mech. 2013, 6: 358-372. 10.1242/dmm.010322.
- 31. Köhler S, Doelken SC, Ruef BJ, Bauer S, Washington N, Westerfield M, Gkoutos G, Schofield P, Smedley D, Lewis SE, Robinson PN, Mungall CJ: Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res. 2013, 2: 30.
- 32. Ariani F, Hayek G, Rondinella D, Artuso R, Mencarelli MA, Spanhol-Rosseto A, Pollazzon M, Buoni S, Spiga O, Ricciardi S, Meloni I, Longo I, Mari F, Broccoli V, Zappella M, Renieri A: FOXG1 is responsible for the congenital variant of Rett syndrome. Am J Hum Genet. 2008, 83: 89-93. 10.1016/j.ajhg.2008.05.015.
- 33. Kortüm F, Das S, Flindt M, Morris-Rosendahl DJ, Stefanova I, Goldstein A, Horn D, Klopocki E, Kluger G, Martin P, Rauch A, Roumer A, Saitta S, Walsh LE, Wieczorek D, Uyanik G, Kutsche K, Dobyns WB: The core FOXG1 syndrome phenotype consists of postnatal microcephaly, severe mental retardation, absent language, dyskinesia, and corpus callosum hypogenesis. J Med Genet. 2011, 48: 396-406. 10.1136/jmg.2010.087528.
- 34. Ellaway CJ, Ho G, Bettella E, Knapman A, Collins F, Hackett A, McKenzie F, Darmanian A, Peters GB, Fagan K, Christodoulou J: 14q12 microdeletions excluding FOXG1 give rise to a congenital variant Rett syndrome-like phenotype. Eur J Hum Genet EJHG. 2013, 21: 522-527. 10.1038/ejhg.2012.208.
- 35. Allou L, Lambert L, Amsallem D, Bieth E, Edery P, Destrée A, Rivier F, Amor D, Thompson E, Nicholl J, Harbord M, Nemos C, Saunier A, Moustaïne A, Vigouroux A, Jonveaux P, Philippe C: 14q12 and severe Rett-like phenotypes: new clinical insights and physical mapping of FOXG1-regulatory elements. Eur J Hum Genet EJHG. 2012, 20: 1216-1223. 10.1038/ejhg.2012.127.
- 43. Snider L, Geng LN, Lemmers RJLF, Kyba M, Ware CB, Nelson AM, Tawil R, Filippova GN, van der Maarel SM, Tapscott SJ, Miller DG: Facioscapulohumeral dystrophy: incomplete suppression of a retrotransposed gene. PLoS Genet. 2010, 6: e1001181. 10.1371/journal.pgen.1001181.
- 50. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F, Hillman-Jackson J, Kuhn RM, Pedersen JS, Pohl A, Raney BJ, Rosenbloom KR, Siepel A, Smith KE, Sugnet CW, Sultan-Qurraie A, Thomas DJ, Trumbower H, Weber RJ, Weirauch M, Zweig AS, Haussler D, Kent WJ: The UCSC genome browser database: update 2006. Nucleic Acids Res. 2006, 34: D590-D598. 10.1093/nar/gkj144.
- 54. Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, Farnham PJ, Hirst M, Lander ES, Mikkelsen TS, Thomson JA: The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol. 2010, 28: 1045-1048. 10.1038/nbt1010-1045.
- 57. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JGN, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martínez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Ye SQ, Yu W: Multiple-laboratory comparison of microarray platforms. Nat Methods. 2005, 2: 345-350. 10.1038/nmeth756.
- 59. San Diego Supercomputer Center. [http://chromosome.sdsc.edu/mouse/hi-c/download.html]
- 60. Resnik P: Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res. 1999, 11: 95-130.
- 61. Robinson PN, Bauer S: Introduction to Bio-Ontologies, Boca Raton: CRC Press; 2011.
- 65. Köhler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black GCM, Brown DL, Brudno M, Campbell J, Fitzpatrick DR, Eppig JT, Jackson AP, Freson K, Girdea M, Helbig I, Hurst JA, Jähn J, Jackson LG, Kelly AM, Ledbetter DH, Mansour S, Martin CL, Moss C, Mumford A, Ouwehand WH, Park SM, Riggs ER, Scott RH, Sisodiya S, et al: The human phenotype ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014, 42: D966-D974. 10.1093/nar/gkt1026.
- 66. topodombar: Source code repository for analysis of phenotypes, microdeletions, and topological chromosome domain boundaries. [https://github.com/charite/topodombar]
- 67. SparseData. [https://github.com/mikelove/SparseData]
- 68. DECIPHER (DatabasE of genomic variants and phenotype in humans using ensembl resources). [http://decipher.sanger.ac.uk/]
- 69. Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, Dreszer TR, Fujita PA, Guruvadoo L, Haeussler M, Harte RA, Heitner S, Hinrichs AS, Learned K, Lee BT, Li CH, Raney BJ, Rhead B, Rosenbloom KR, Sloan CA, Speir ML, Zweig AS, Haussler D, Kuhn RM, Kent WJ: The UCSC Genome Browser database: 2014 update. Nucleic Acids Res. 2014, 42: D764-D770. 10.1093/nar/gkt1168.
- 70. Human Phenotype Ontology Downloads. [http://human-phenotype-ontology.org/contao/index.php/downloads.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.