The features of Drosophila core promoters revealed by statistical analysis
Experimental investigation of transcription is still a very labor- and time-consuming process. Only a few transcription initiation scenarios have been studied in detail. The mechanism of interaction between basal machinery and promoter, in particular core promoter elements, is not known for the majority of identified promoters. In this study, we reveal various transcription initiation mechanisms by statistical analysis of 3393 nonredundant Drosophila promoters.
Using Drosophila-specific position-weight matrices, we identified promoters containing TATA box, Initiator, Downstream Promoter Element (DPE), and Motif Ten Element (MTE), as well as core elements discovered in Human (TFIIB Recognition Element (BRE) and Downstream Core Element (DCE)). Promoters utilizing known synergetic combinations of two core elements (TATA_Inr, Inr_MTE, Inr_DPE, and DPE_MTE) were identified. We also establish the existence of promoters with potentially novel synergetic combinations: TATA_DPE and TATA_MTE. Our analysis revealed several motifs with the features of promoter elements, including possible novel core promoter element(s). Comparison of Human and Drosophila showed consistent percentages of promoters with TATA, Inr, DPE, and synergetic combinations thereof, as well as most of the same functional and mutual positions of the core elements. No statistical evidence of MTE utilization in Human was found. Distinct nucleosome positioning in particular promoter classes was revealed.
We present lists of promoters that potentially utilize the aforementioned elements/combinations. The number of these promoters is two orders of magnitude larger than the number of promoters in which transcription initiation was experimentally studied. The sequences are ready to be experimentally tested or used for further statistical analysis. The developed approach may be utilized for other species.
Keywords: Positional Distribution, Position Weight Matrix, Human Promoter, Core Promoter Element, Downstream Promoter Element
Research over the past thirty years has revealed the diversity of transcription initiation scenarios in eukaryotes. Only some of the scenarios have been studied in detail and more are likely to be discovered. So far, six core promoter elements have been experimentally identified in eukaryotes. These elements are TATA box, Initiator (Inr), Downstream Promoter Element (DPE), TFIIB recognition element (BRE), Downstream Core Element (DCE), and Motif Ten Element (MTE) [1, 2, 3].
The basal transcriptional machinery includes Pol II and general transcription factors (TF): TFIIA, B, D, E, F, and H [4, 5, 6, 7]. TFIID plays the central role in transcription initiation [8, 9], acting in cooperation with core promoter elements and/or specific TFs [6, 7, 10]. TFIID consists of the TATA Binding Protein (TBP) and TBP-associated factors (TAFs). The universal feature of transcription is binding of TBP to DNA at a specific distance from the transcription start site (TSS) regardless of the presence or absence of the TATA box. In the absence of the TATA box (TATA-less promoters), TAFs bind to DNA and/or to other TFs in order to recruit TBP into the pre-initiation complex [9, 12, 13, 14]. From this perspective it is easy to see why the TATA box dominates as a core promoter element, being able to govern transcription initiation alone (at least in vitro). The rest of the core elements usually work in cooperation with others. Indeed, strong synergism between DPE and Inr, MTE and Inr, DCE and Inr, MTE and DPE, BRE and TATA, and Inr and TATA has been experimentally established [9, 14, 15, 16, 17, 18, 19]. It is striking that, in spite of the considerable improvement in our knowledge of transcriptional regulation due to the emergence of new experimental techniques and computational approaches, the scenarios of interaction between the basal transcription machinery and the core promoter are not known for the majority of identified promoters.
The statistics of the core elements still remain obscure even for the most studied eukaryotes like Drosophila. So far, two Drosophila promoter databases have been analyzed. Kutach and Kadonaga created a small Drosophila Core Promoter Database containing 205 sequences with an experimentally defined position of TSS "carefully extracted" from the literature. They visually identified the presence of TATA box, Inr, and DPE in those sequences and found that 42.4%, 67.3%, and 40.0% of the promoters contain TATA, Inr, and DPE, respectively, at their functional positions. The larger database (1941 promoters) was constructed by Ohler et al. In total, 28.3% and 62.8% of promoters from this database have TATA and Inr elements, respectively. These percentages were found using the motif consensus for each element with one mismatch allowed.
The experimental investigation of the core promoter elements is still very labor- and time-consuming. Even for the well-studied elements, such as TATA box, Inr, and DPE, only a few promoters have been experimentally examined. Therefore, the statistical analysis of large promoter databases is a useful complement to experimental study: it can identify new overrepresented motifs, reveal potential synergetic combinations, and classify promoters.
The hypothesis behind our research is that in the course of evolution the motifs necessary for promoter regulation have been preserved in the promoter region, thus their occurrence frequencies there are far from random. We will examine the following particular questions:
1) How many known Drosophila promoters follow known scenarios of the interaction of the basal machinery and DNA? In particular, the transcription of how many promoters is guided by the TATA box and/or by any of the known synergetic combinations?
2) What are the typical distances between the core elements and TSS and between elements in synergetic combinations?
3) May statistical analysis suggest new synergetic combinations?
4) Are BRE and DCE (elements discovered in human promoters) statistically significant in Drosophila promoters?
5) What typical motifs in the core promoter sequences remain unknown?
6) How do Drosophila and human promoters differ statistically?
For statistical analysis we used the "Orthomine Database" of Drosophila melanogaster promoters compiled by P. Cherbas and S. Middha (pers. comm. prior to publication; see Data and Methods for a description).
Four core promoter elements (TATA box, Inr, DPE, MTE) have been experimentally identified in Drosophila promoters [1, 2]. First, we considered statistical parameters of each of those elements: positional distribution, functional window, and percentage of promoters containing a particular element. We also examined the DCE and BRE elements in Drosophila promoters, although the biological function of those elements has only been observed in human promoters [3, 14, 17, 19]. Second, we analyzed the parameters of synergetic and/or cooperative combinations of each pair of elements: typical distances between the elements and percentage of promoters containing a combination. Finally, we revealed typical motifs in different subsets of Drosophila promoters by the MEME program  and examined their positional distributions in promoter area.
The parameters of core promoter elements. List of the core promoter elements (col. 1); motif consensus in a NC-IUB nomenclature  (col. 2); the length of motif (at left) and the distance between center and 5' end (at right) (col. 3); applied windows for the center of motifs (col. 4); the maximal number of allowed mismatches (n-1) in order for motif consensus still to remain functional (col. 5); cutoff value for PWM (col. 6); the absolute number (col. 7) and percentage (col. 8) of promoters with respective core element; statistical significance (SS) of the occurrence frequency of an element in the respective window (col. 9). All respective P-values are less than 0.0001, which is considered to be extremely statistically significant. The P-values were obtained using P-Value Calculator  from respective Chi (χ) values used for SS calculation  for a system with 1 degree of freedom (DF = 1).
Functional windows (col. 4), in bp relative to TSS:
TATA box: -33 to -23
Inr: -1 to +9
DPE: +27 to +36
MTE: +17 to +26
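The conversion from a test statistic to a P-value described in the caption can be illustrated with a short sketch. It assumes that the statistic is a chi-square value with one degree of freedom; the function name is hypothetical, and the P-Value Calculator used in the study is not reproduced here.

```python
import math

def chi2_p_value_df1(chi2: float) -> float:
    """Upper-tail P-value of a chi-square statistic with 1 degree of freedom.

    With DF = 1 the chi-square variable is the square of a standard normal
    variable, so P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    """
    if chi2 < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(chi2 / 2.0))

# The classical 5% critical value for DF = 1 is about 3.841.
print(chi2_p_value_df1(3.841))
# A statistic of 16 already gives a P-value below 0.0001.
print(chi2_p_value_df1(16.0) < 0.0001)
```

The closed form is specific to one degree of freedom; other values of DF would need the general chi-square survival function.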
Using this new PWM for the TATA box (built specifically for Drosophila), we are able to find the number and percentage of TATA+ promoters as well as the statistical significance (formula I from Data and Methods) of TATA over-representation in the functional window (see Table 1, first line, columns 7–9). The percentage of TATA-containing promoters (16.2%) is much lower than the previous estimates of 42.4% and 28.3%. However, this percentage is comparable with the estimate for human promoters. Note that if we apply our PWM to the Drosophila Core Promoter Database in the region from -45 to -15 bp (as in Kutach and Kadonaga), we find that 40.0% of promoters have the TATA box, which is close to their estimate (42.4%). So the difference between the percentages (42.4%, 28.3%, and 16.2%) can be explained by the differences between the databases and the applied intervals. The positional distribution of the TATA box obtained by PWM is shown in Supplemental Figure S2d (see Additional file 1). The set of promoter sequences potentially utilizing the TATA box element is presented in Supplemental Sequences S1 (see Additional file 2).
The analogous analysis with the Inr consensus allows us to build a PWM for Inr (see the pictogram in Table 2 and also Additional file 1, Supplemental Figures S3a and S3b and Table S2) and to find the respective statistical parameters (Table 1, second line).
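As an illustration of how a PWM with a cutoff can be used to call an element inside a window, here is a minimal sketch. It is not the technique used in the study: it assumes a generic log-odds matrix with a pseudocount and a uniform background, sequences restricted to A, C, G, and T, and toy sites and sequences. All function names, parameters, and the cutoff value are hypothetical.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        column = [s[i] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score_site(pwm, site):
    """Sum the per-position log-odds scores of one candidate site."""
    return sum(col[base] for col, base in zip(pwm, site))

def scan_window(pwm, seq, start, end, cutoff):
    """Return 0-based start offsets in [start, end] scoring at or above the cutoff."""
    width = len(pwm)
    hits = []
    for i in range(max(start, 0), min(end, len(seq) - width) + 1):
        if score_site(pwm, seq[i:i + width]) >= cutoff:
            hits.append(i)
    return hits

# Toy data (not from the study): four TATA-like sites and one sequence.
toy_sites = ["TATAAA", "TATATA", "TATAAA", "TATAAG"]
toy_pwm = build_pwm(toy_sites)
toy_seq = "GGCGCTATAAAGGCGC"
toy_hits = scan_window(toy_pwm, toy_seq, 0, len(toy_seq), cutoff=6.0)
print(toy_hits)
```

In practice the window bounds would come from the functional window of the element, and the cutoff would be tuned per element as in Table 1.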
The percentage of promoters with Initiator (66.5%) is comparable with the estimates of Kutach and Kadonaga (67.3%) and Ohler et al. (62.8%). Analysis of the Inr positional distribution for the considered database (see Additional file 1, Supplemental Figure S3c) shows significant over-representation of the Inr motif in the area from -1 to +9 bp. Although that differs from the canonical Inr positioning at +1 bp, we consider that window as functional for Inr. The difference may in part be a mere consequence of imprecise TSS mapping for some of the promoters, but may also have other, less trivial reasons (see below). Note that this window is asymmetric relative to TSS, with Inr often shifted upstream from TSS. The promoter sequences with Inr may be found in Supplemental Sequences S2 (see Additional file 2).
The DPE element was discovered and studied mainly in Drosophila [9, 21]. The positional distribution of DPE (see Additional file 1, Supplemental Figure S4a) exhibits over-representation in the area from +27 to +33 bp with a maximum at position +28 bp, which is the experimentally defined functional position for DPE. Note that sites resembling DPE are under-represented in almost the entire promoter area except for the functional window and the area around TSS. The latter is just an artifact, since the DPE and Inr motifs partially coincide (compare 'RGWY' in DPE and 'AKTY' in Inr). Since DPE works in cooperation with Inr at a strict distance, the functional window for DPE should be at least as wide as the functional window for Inr. That is why we consider the interval from +27 to +36 as the functional window for DPE, despite the over-representation of DPE sites in a narrower interval (+27 to +33).
The selection of the DPE motif consensus is not straightforward. The initial study, based on three Drosophila promoters and one human promoter, revealed the sequence motif G(A/T)CG as a new core promoter element. Later on, the functional significance and universality of this motif were confirmed on 19 Drosophila promoters. In vitro experiments with randomized sequences showed that a variety of sequences could function as DPE. Thus, the consensuses RGWYVT and/or RGWYV were suggested, although there is no evidence that all possible sequences matching these motifs are indeed functional in real promoters in vivo. To choose a sufficient consensus, we first applied the most trusted motif G(A/T)CG to the promoter database and extracted all promoters containing this motif in the window from +27 to +33 bp. Then we found the positional distribution of sites with consensus RGWYVT in the remaining (DPE-less) subset of promoters. The positional distribution showed over-representation of motif RGWYVT in the same window, suggesting functional significance of this consensus. Then we applied consensus RGWYV to the subset of DPE-less (in this case RGWYVT-less) promoters and found that even this loosest motif is still over-represented in the functional window. Thus, the statistics suggest that the consensus RGWYV is viable for DPE, so this consensus was used for further analyses and for building the PWM. Supplemental Table S3 (see Additional file 1) and the pictogram in Table 2 present the frequency table calculated from the DPE sites at positions from +27 to +29. The positional distribution of DPE obtained by PWM is in Supplemental Figure S4b (see Additional file 1).
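The stepwise relaxation of the consensus described above can be sketched as follows. This is a hedged illustration rather than the code used in the study: the helper names are hypothetical, window bounds are given as 0-based offsets into each sequence, and the toy sequences are invented. G(A/T)CG is written as GWCG in IUPAC notation.

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def mismatches(consensus, site):
    """Count positions where the site base is not allowed by the consensus code."""
    return sum(base not in IUPAC[code] for code, base in zip(consensus, site))

def find_sites(seq, consensus, max_mismatches=0):
    """Return 0-based start offsets of all sites matching the consensus."""
    k = len(consensus)
    return [i for i in range(len(seq) - k + 1)
            if mismatches(consensus, seq[i:i + k]) <= max_mismatches]

def split_by_motif(promoters, consensus, window):
    """Split promoters into (with, without) a match starting inside the window."""
    lo, hi = window
    with_motif, without_motif = [], []
    for seq in promoters:
        hit = any(lo <= i <= hi for i in find_sites(seq, consensus))
        (with_motif if hit else without_motif).append(seq)
    return with_motif, without_motif

# Toy promoters (invented); the whole sequence is used as the window.
toy = ["AAAGACGTTT", "CCAGTCATCC", "TTTTTTTTTT"]
strict, rest = split_by_motif(toy, "GWCG", (0, 9))      # most trusted motif
medium, rest2 = split_by_motif(rest, "RGWYVT", (0, 9))  # applied to the remainder
loose, rest3 = split_by_motif(rest2, "RGWYV", (0, 9))   # loosest consensus
print(strict, medium, loose)
```

Each step scans only the promoters left over from the previous one, which is what makes over-representation of the looser consensus informative.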
The statistical parameters of DPE calculated based on the PWM are presented at Table 1, third line. One can see that the percentage of potential DPE promoters is even larger than percentage of the TATA box promoters. The set of promoter sequences most likely utilizing DPE element is presented in Supplemental Sequences S3 (see Additional file 2).
The Motif Ten Element, "CSARCSSAACGS", was initially discovered by statistical analysis of a Drosophila promoter database. The functional significance of MTE as a new core promoter element was then established experimentally. It was shown that the first five nucleotides are important for transcriptional activity, while the seven remaining nucleotides are "sufficient to confer MTE activity to heterologous core promoters". MTE (at position +18) works in cooperation with Inr and also with DPE. Since the synergetic position for DPE is +28, the last two nucleotides of MTE overlap with DPE. Because it is not clear what the functional MTE motif consensus is, we considered three consensuses: the first 5 bp, the first 10 bp, and the full 12 bp. All of them are substantially over-represented in the functional window. For further statistical analysis we used only the 10 bp long consensus (see Table 1, fourth line). Note that, in contrast to DPE, MTE is over-represented in practically the whole promoter area (see Additional file 1, Supplemental Figures S5a and S5b). The PWM was obtained from the frequency table (Table 2 and Additional file 1, Supplemental Table S4) built from sites extracted at positions +18 to +23 using the consensus with up to two mismatches allowed. The promoter sequences with MTE at its functional position are presented in Supplemental Sequences S4 (see Additional file 2).
Although it was shown that MTE is also functional (in vitro) in human promoters, the preliminary statistical analysis of two human promoter databases (Eukaryotic Promoter Database and Database of Transcriptional Start Sites) using any of the three consensuses considered above did not show over-representation of MTE at the expected functional positions in human promoters.
BRE and DCE
We found that these two elements discovered in human promoters are statistically overrepresented in Drosophila promoters too. For the details of their statistical analysis as well as a list of potential promoters utilizing them as core promoter elements see Additional file 1.
Potential synergetic combinations
The core promoter elements usually work in cooperation with each other. Supposedly, a sizable number of promoters utilize a similar scenario, i.e. use the same combination of core promoter elements for promoter recognition by the basal machinery. If this is true, statistical analysis of the promoter database should be able to verify the known synergetic combinations as well as to reveal new combinations. It is also important to find the exact distances between the elements as well as to classify known promoters by the combinations they utilize.
The statistical parameters of combinations of core elements. Combination name (col. 1); position of the center of the first element of the combination in bp (col. 2); distance between the centers of the elements in bp, with the suggested synergetic distances marked in bold (col. 3); the percentage (%) (col. 4); the absolute number (N) (col. 5); statistical significance of over-representation of promoters having this combination at the respective positions with the distance given in col. 3 (col. 6); and respective P-values (col. 7). The P-values were calculated as for Table 1. P-values < 0.001 are commonly considered extremely statistically significant, and those < 0.01 very statistically significant.
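Position of the center of the first element (col. 2), in bp relative to TSS:
Inr_DPE, Inr_MTE: -1 to +9
MTE_DPE: +17 to +26
TATA_Inr, TATA_DPE, TATA_MTE: -33 to -23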
The combination of TATA and Inr can also work synergistically. Since the maxima of the occurrence frequency for the TATA and Inr elements are located at positions -29 and +1, respectively, the expected synergetic distance between them is 29 bp. Surprisingly, the SS of over-representation of the TATA_Inr combination at a distance of 29 bp is negative, whereas the SS values at distances from 30 to 34 bp are positive, with a strong maximum at 31 and 32 bp, suggesting synergy at those distances. The promoters with the TATA_Inr combination are listed in Supplemental Sequences S10 (see Additional file 2).
The statistical analysis of other possible combinations of core promoter elements suggests cooperation between TATA and DPE at distances 58–60 bp, and TATA and MTE at distances 47–49 bp (see Table 3). The respective subsets of promoters can be found in Supplemental Sequences S11 and S12 (see Additional file 2).
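The search for preferred distances between two elements can be sketched as a comparison of observed and expected pair counts. The sketch below is an assumption-laden illustration, not the SS formula from Data and Methods: it takes element-center positions as 0-based indices along the aligned sequences, counts each distance at most once per promoter, and estimates the expected count under independence of the two elements. Function names and toy positions are hypothetical.

```python
from collections import Counter

def distance_counts(positions_a, positions_b):
    """Count promoters per center-to-center distance (b - a).

    positions_a and positions_b hold one list of 0-based center indices
    per promoter, in the same promoter order.
    """
    counts = Counter()
    for a_list, b_list in zip(positions_a, positions_b):
        # A set ensures each distance is counted once per promoter.
        for d in {b - a for a in a_list for b in b_list}:
            counts[d] += 1
    return counts

def expected_pairs(positions_a, positions_b, distance):
    """Expected number of promoters with the pair at this distance,
    assuming the two elements occur independently of each other."""
    n = len(positions_a)
    count_a = Counter(p for plist in positions_a for p in set(plist))
    count_b = Counter(p for plist in positions_b for p in set(plist))
    return sum(count_a[p] * count_b[p + distance] for p in count_a) / n

# Toy positions (invented) for five promoters.
toy_a = [[221], [220], [222], [], [221]]
toy_b = [[252], [251], [], [250], [253]]
toy_counts = distance_counts(toy_a, toy_b)
toy_expected = expected_pairs(toy_a, toy_b, 31)
print(toy_counts[31], toy_expected)
```

An observed count well above the expected one at a specific distance is the kind of signal that the text interprets as a candidate synergetic spacing.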
The pictograms and consensuses of overrepresented motifs. The numeral in parentheses in the first column is the numeral of overrepresented motif from the article .
Motif 1 is the most over-represented motif. We scanned the entire promoter database and the Inr-less subset of promoters with the Motif 1 consensus, allowing two mismatches (see Table 4, line 1). The resulting positional distributions are presented in Supplemental Figures S7a and S7b, respectively (see Additional file 1). Motif 1 is substantially over-represented on the positive strand in the area from -50 to +30. Indeed, SS(-50<l<30) = 40.9 and SS(-50<l<30) = 36.9 for the whole promoter database and the Inr-less subset of promoters, respectively. The positional distribution of Motif 1 with one mismatch exhibits the same behavior (not shown). Note the large maximum at position -5 (from the 5'-end of the motif consensus), which corresponds to position +1 for the first 'A' in the consensus. Surprisingly, this maximum is even larger in the Inr-less set of promoters, which raises the question of whether Motif 1 is able to work as a core promoter element instead of Inr. Interestingly, the occurrence frequency of Motif 1 close to the TSS is substantially larger on the positive strand than on the negative strand (see Additional file 1, Supplemental Figure S7c), which also indirectly suggests that Motif 1 is able to interact with the basal machinery.
Motif 2 is substantially over-represented on the positive strand in the area from -70 up to +10 bp (see Additional file 1, Supplemental Figure S8a); the occurrence frequency in the area from -40 to +10 is much larger on the positive strand than on the negative strand (see Additional file 1, Supplemental Figure S8b).
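The strand comparison used here can be illustrated with a minimal sketch. It assumes exact matching of an A/C/G/T motif (the study allows mismatches) and reports negative-strand hits by their offset on the positive strand; the motif and sequence are toy examples, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def strand_hits(seq, motif):
    """Return (positive, negative) lists of 0-based offsets on the given strand.

    A negative-strand occurrence is found by searching the positive strand
    for the reverse complement of the motif.
    """
    k = len(motif)
    rc = reverse_complement(motif)
    plus = [i for i in range(len(seq) - k + 1) if seq[i:i + k] == motif]
    minus = [i for i in range(len(seq) - k + 1) if seq[i:i + k] == rc]
    return plus, minus

# Toy example (invented): one hit on each strand.
toy_plus, toy_minus = strand_hits("AAGGTCACTTGTGACCAA", "GGTCAC")
print(toy_plus, toy_minus)
```

Accumulating such offsets over all aligned promoters gives one positional distribution per strand, which can then be compared position by position.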
Motif 3 has a huge over-representation in the wide area from -130 to +20 at both strands; the occurrence frequency is up to eight-fold higher than expected by chance (formula I from Data and Methods) (see Additional file 1, Supplemental Figures S9a and S9b). Motif 4 is largely overrepresented practically in all promoter area, especially from -150 to +50 bp, at both strands (see Additional file 1, Supplemental Figures S10a and S10b). Usually, transcription factor binding sites that regulate transcription by interacting with the basal machinery exhibit such behavior.
We also examined via the program MEME the TATA-less subset of promoters in the area from -40 to -10 bp as well as DPE-less and MTE-less subset in the area from +10 to +40. In the TATA-less subset of promoters MEME found motif 5 that resembles the motif 6 from the article  (Table 4, line 5). The positional distribution of the motif 5 in the TATA-less promoters (positive strand) is presented at Supplemental Figure S11a (see Additional file 1). One can see the large over-representation in upstream area up to -120 bp. Similar to the motifs 1 and 2, the occurrence frequency of motif 5 at positive strand is visibly larger than at negative strand at the upstream area up to -90 bp (see Additional file 1, Supplemental Figure S11b). In DPE-less and MTE-less subset of promoters we found two new motifs (Table 4, lines 6 and 7). These motifs are over-represented in the entire promoter area at both strands (see Additional file 1, Supplemental Figures S12 and S13), which is not typical for the core promoter elements.
Relation to chromatin structure
The involvement of nucleosomes in promoter activity (e.g. [27, 28, 29, 30, 31, 32, 33]) and regulation [34, 35, 36, 37, 38, 39, 40, 41, 42, 43] suggests that nucleosomes occupy certain positions in the vicinity of promoters, providing a specific spatial environment for promoter recognition and for interactions with various transcription factors. In our earlier work we addressed this issue by computationally mapping the nucleosomes in the vicinity of the TSS of human genes. For this, the nucleosomal DNA AA/TT periodical pattern was used, derived from a collection of experimentally mapped nucleosomes. Two preferred positions for the nucleosome centers relative to the TSS were detected: 43 ± 3 base pairs upstream from the TSS, and 18 ± 9 downstream. These two positions may correspond to two different types of local chromatin architecture around the promoters, that is, two types of promoters. Alternatively, the preferred positions could reflect two states (dormant and active?) of the promoters of one dominant type. In this study we computationally mapped the nucleosomes around Drosophila promoters of various regulatory types, to compare the data with those for human promoters.
In the Supplemental Figure S14 (see Additional file 1) the combined (superimposed) map of the nucleosomes near the TSS is shown. It displays two maxima. The more prominent maximum corresponds to the nucleosomes centered at around -43 bp from the TSS. This is, apparently, the same preferred position as observed in human promoters. Such remarkable commonality suggests that, indeed, eukaryotic promoters are involved in a very special 3D organization, being spatially linked with the "promoter nucleosomes". The transcription start sites are located within the nucleosomes, 43 base pairs from the dyad axis of the nucleosome, and oriented outwards from the histone surface. This follows from the almost exact divisibility of the distance by the nucleosome DNA structural period: 4 × 10.4 = 41.6 base pairs.
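The divisibility argument can be checked with a few lines of arithmetic. The sketch below only restates the numbers quoted in the text (a 10.4 bp structural period and a 43 bp distance); the function name is hypothetical.

```python
HELICAL_PERIOD_BP = 10.4  # nucleosome DNA structural period quoted in the text

def helical_phase(distance_bp, period=HELICAL_PERIOD_BP):
    """Return (nearest whole number of periods, residual distance in bp)."""
    turns = round(distance_bp / period)
    return turns, distance_bp - turns * period

# 43 bp is closest to 4 periods (41.6 bp), leaving a residual of about 1.4 bp.
turns, residual = helical_phase(43)
print(turns, round(residual, 2))
```

The residual is well within the ± 3 bp uncertainty reported for the human -43 position, which is why the distance is described as almost exactly divisible.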
This major preferred position for the "promoter nucleosomes" is characteristic of all types of Drosophila promoters (TATA+, TATA-, DPE+, DPE-, MTE+, MTE-, Inr+), except for Inr- promoters (see Additional file 1, Supplemental Figure S15). This may mean that the Inr-less promoters are not involved in any specific 3D chromatin structure, being, e.g., permanently exposed for a non-specific, non-regulated initiation.
The second, minor preferred position for the nucleosomes in the vicinity of TSS is around +11 bp. It does not have a counterpart in human promoters, just as the position +18 of human promoters has no counterpart in Drosophila. Only a future detailed 3D study of the promoter structure in its chromatin environment may reveal what the preferred positions +11 and +18 correspond to. They may reflect details of remodeling that differ somewhat between human and Drosophila.
Interestingly, the TATA promoters (see Additional file 1, Supplemental Figure S15a) demonstrate a rather elaborate pattern of several preferred positions, in addition to the standard -43 peak. This may reflect, again, a TATA-specific subtype of local promoter architectures, or perhaps, a special path of remodeling of the TATA+ promoters.
TATA, MTE and DCE contain AA and TT dinucleotides, only one per motif. This can have only a small modulatory effect on the nucleosome positioning, since typical nucleosomes require 3–4 AA and/or TT dinucleotides distributed in accordance with the nucleosome sequence pattern .
Positional distributions of each of the four core promoter elements (TATA, Inr, DPE, and MTE) exhibit substantial over-representation at their functional positions (see Table 1 and Additional file 1, Supplemental Figures S2-S5), strongly suggesting that a sizable number of promoters utilize them for interaction with the basal machinery.
Surprisingly, only a small fraction of promoters (~16%) include the TATA box, compared with the known statistics for Drosophila [21, 22], although this percentage is consistent with the percentage of TATA promoters in the human genome [20, 47].
Every fifth promoter has DPE (22%) and a majority of promoters (66%) have an Inr element, which is also consistent with the percentages of the respective elements in human promoters. A considerable number of promoters (~10%) have MTE. As already mentioned, MTE is not over-represented at the expected functional positions in human promoters. This seems odd, since the rest of the known core elements are functional (or at least over-represented) in both human and Drosophila promoters; moreover, it was specifically shown that MTE is functional (in vitro) in one human promoter. This contradiction can be explained if we note that only the first 5 nucleotides of the MTE consensus are really necessary for MTE recognition by the pre-initiation complex (PIC), and this short version of MTE partially includes the sub-element S3 from the DCE (compare CSARC and AGC). This suggests that the human and Drosophila consensuses of MTE are different and also that S3 could be part of MTE.
Motif consensus for a particular element is derived from the sites experimentally found to be functional. Usually the number of experimental sites is limited, making it difficult to build a reliable PWM. It is expected that the majority of putative sites found in the functional window of aligned promoter sequences are functional which allows using these sites for building more realistic motif consensus and/or PWM. Using an earlier developed technique , we obtained PWMs for those four elements specifically for Drosophila (see pictograms at Table 2 and Additional file 1, Supplemental Tables S1-S4) using sites extracted from the promoter database.
Promoter elements BRE and DCE discovered in human promoters most likely have functional meaning in some Drosophila promoters too. Indeed, the number of promoters having combination BRE_TATA at distance 9 bp (in this case 3'-end of BRE and 5'-end of TATA box are connected just like in human promoters ) is visibly over-represented compared with the expected number. The sub-elements of DCE also show statistically significant features. Thus, the over-representation of combination Inr and sub-element one (S1) of DCE at distances +6 and +7 is large. The combination of Inr and S1 at those distances are found to be functional in several human promoters . The sub-element two (S2) shows significant over-representation at certain distances from Inr. The sub-element three (S3) is also overrepresented at expected positions from +19 to +31 from TSS.
Typically, transcription initiation is regulated by a combination of the core promoter elements. The synergism between the elements usually requires exact spacing [1, 2]. Statistical analysis of the promoter database allows identification of synergetic/cooperative distances. Thus, our analysis confirms the experimentally defined distances between Inr and DPE (27 bp), Inr and MTE (17 bp), and MTE and DPE (10 bp) (see Table 3). Surprisingly, the synergetic distances between TATA and Inr are 31 and 32 bp, not 29 bp as expected from the positions of the maxima of the TATA box (-29 bp) and Inr (+1) positional distributions in the promoter area. This finding suggests that in the presence of a functional TATA box the TSS position does not necessarily coincide with the center of the Inr element but may be shifted by 2–3 bp in the 5' direction. This could be one of the reasons why the positional distribution of Inr is asymmetric relative to TSS. The analysis also suggests cooperation between TATA and DPE at distances of 58–60 bp, as well as possible TATA and MTE cooperation at distances of 47–49 bp. The Inr_MTE combination is also over-represented at a distance of 16 bp (not only 17 bp), although experiments showed synergism only at 17 bp. Overall, the proposed technique is sensitive to the spacing between core elements and can be recommended for examination of other elements, as well as for analysis of promoter databases for other species.
Our estimates show that only 24% of promoters utilize known and proposed synergetic combinations, while 25% of promoters contain none of the four known core elements. This encourages the search for new elements. The analysis of the positional distributions of over-represented motifs revealed by the program MEME leads to several suggestions.
1. Motif 1 (Table 4, first line) could be a core promoter element, since a) the occurrence frequency of this motif obtained on 3393 aligned promoter sequences (on positive strand) has a strong maximum at TSS area (namely, at position +1 for the first 'A' from the 5'-end); b) this maximum is even larger on Inr-less set of promoters, excluding possible interference of Inr element; c) there is no such maximum at negative strand.
2. Motifs 2 and 5 are highly over-represented in the proximal promoter area, namely in the area where pre-initiation complex interacts with DNA. In addition, the occurrence frequency at the DNA positive strand in the over-represented area is essentially larger than at the negative strand. As follows from the previous analysis, the typical features of core promoter elements are a) a narrow functional window and b) distribution on the positive strand is visibly different from those on the negative strand. (Note that TFBS for the majority of specific TFs are placed on both strands). While the motif 1 has both features of the core elements (a and b), the motifs 2 and 5 have only one (b). At the same time the distributions of the motifs 2 and 5 still have a relatively narrow region of overrepresentation covering the basal machinery area. One may speculate that these motifs still could be a target for PIC, or e.g. a target for repressors preventing PIC-DNA interaction.
3. Motifs 3 and 4 are also highly over-represented in the proximal promoter area on both strands. They most likely are transcription factor binding sites for some (not general) TFs.
Statistical analysis of the Drosophila promoter database revealed the major features of Drosophila promoters. We summarize here the main results.
1. The sets of promoter sequences utilizing the TATA box, and/or Initiator, and/or DPE, and/or MTE elements for DNA-PIC interaction are presented. The positions of the elements are marked to simplify experimental verification. The position weight matrices for these four elements as well as their optimal cutoff values are obtained.
2. There is statistical evidence that BRE and DCE, the core promoter elements shown to be functional in human promoters, are most likely functional in some Drosophila promoters too.
3. The sets of promoter sequences presumably utilizing synergetic combinations of two core elements (TATA and Inr, Inr and DPE, Inr and MTE, and DPE and MTE) are presented. Sets of promoters with suggested synergetic combinations (statistically significant but not shown experimentally) are also presented: TATA and DPE, TATA and MTE, and TATA and BRE.
4. The synergetic distances between the elements are established. In addition to the experimentally known synergetic distances between Inr and DPE (27 bp), Inr and MTE (17 bp), and MTE and DPE (10 bp), we found synergetic distances between TATA and Inr (30–34 bp), Inr and MTE (16 bp), TATA and DPE (58–60 bp), and TATA and MTE (47–49 bp).
5. Over-represented motif 1 (Table 4, line 1) can be a new core promoter element.
6. Motifs 2 and 5 (Table 4, lines 2 and 5) could be elements for DNA-PIC interaction or binding sites for silencers or repressors.
7. Motifs 3 and 4 (Table 4, lines 3 and 4) are most likely transcription factor binding sites.
8. Some statistical features are similar between Drosophila and Human promoters. The percentages of promoters containing core promoter elements such as TATA, Inr, and DPE, as well as their synergetic combinations, are comparable. The functional positions of the core promoter elements, as well as the distances between elements in synergetic combinations, are the same for Drosophila and Human promoters. The exception is the distance between the TATA box and other elements (Inr and DPE), which is longer (by approximately 2 bp) in Drosophila promoters than in Human promoters.
9. The relationship of the local chromatin architecture (nucleosome positioning) with certain types of core promoter was elucidated. In particular, TATA+ and Inr- promoters show two distinct types of the chromatin organization.
A total of 3393 non-redundant Drosophila melanogaster promoter sequences from the "Orthomine Database" (P. Cherbas and S. Middha, pers. comm.) were used for statistical analyses. The database was constructed as the nonredundant union of 3 published Drosophila promoter sequence databases [21, 22, 48]. In the case of Kutach and Kadonaga's database, some experimentally determined TSSs had been rejected in favor of positions suggested by sequence analysis; in those cases the "Orthomine Database" employed the original (experimental) TSS. In those few cases where the TSS position could not be unambiguously derived from the published papers, the sequence was omitted. For each sequence the unambiguous genomic sequence was retrieved (Drosophila genome annotation v4.1); those sequences that could not be unambiguously assigned to a single genomic location were omitted. In each case the genomic sequence from -250 to +100 (TSS = +1) was recovered. The final database includes 3393 sequences (1908 from Ohler et al., 157 from Kutach and Kadonaga, 1328 from the EPD). When the entire set is compared to the current Drosophila annotation, the modal deviation between the database TSS and the annotated TSS is equal to 0.
We exploited the idea that motifs necessary for transcription regulation are over-represented in a particular area of the promoter region. The statistical analysis of the averaged positional distribution of an element's occurrence frequency (OF_i = n_i/N_s, where n_i is the number of promoters containing the considered element centered at position i in N_s aligned promoter sequences) is therefore the main method of our investigation. We use the term 'functional window' to designate the positions of the center of the site relative to the TSS (the distances between the 5'-end and the center of motifs were defined as in Table 1, column 3) where the occurrence frequency of the considered element is much larger than expected. We suppose that sites appearing in that window are likely to have a functional (biological) meaning. To formalize 'over-representation' we consider a parameter of statistical significance (SS) derived from the chi-square test:
where N_real is the total number of sites found by the position weight matrix (PWM) or motif consensus in the considered window and N_random is the total number of sites found in randomly generated control sequences with the same percentage of nucleotides as in the promoter sequences at the same positions. To find the distribution of an element's occurrence frequency we scan each promoter sequence at each position with the respective PWM or motif consensus. We examine the presence of the core promoter elements and the relations between the elements in different subsets of Drosophila promoters. To implement this strategy we divided the datasets of promoters into subsets.
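The occurrence-frequency scan and the significance score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the motif is indexed by its start position rather than its center, and the SS formula used here, (N_real - N_random)/sqrt(N_random), is an assumed signed form of a one-cell chi-square term, chosen because the text describes SS as chi-square-derived and able to take large positive or negative values.

```python
import math

def occurrence_frequency(promoters, motif, max_mismatches=0):
    """Return OF_i = n_i / N_s for every start position i of the motif.

    promoters: equal-length aligned sequences; motif: plain consensus string.
    Positions are 0-based start indices (an assumption for this sketch;
    the text indexes sites by their center relative to the TSS).
    """
    n_s = len(promoters)
    width = len(motif)
    n_positions = len(promoters[0]) - width + 1
    counts = [0] * n_positions
    for seq in promoters:
        for i in range(n_positions):
            mismatches = sum(1 for a, b in zip(seq[i:i + width], motif) if a != b)
            if mismatches <= max_mismatches:
                counts[i] += 1
    return [c / n_s for c in counts]

def scale_random_count(raw_count, n_random_seqs, n_promoters):
    """Rescale a count from the control set to the size of the promoter set."""
    return raw_count * n_promoters / n_random_seqs

def statistical_significance(n_real, n_random):
    """Assumed form of SS: signed square root of a chi-square term."""
    if n_random <= 0:
        raise ValueError("n_random must be positive")
    return (n_real - n_random) / math.sqrt(n_random)
```

With this assumed form, a window holding 36 real sites against 16 expected ones gives SS = 5, the threshold value mentioned later in the text.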
To generate the random sequences we first calculated the percentage of each nucleotide at each position, averaged over all 3393 aligned promoter sequences. Then we generated 100,000 sequences with length equal to the promoter length. The probability of finding each nucleotide at each position is proportional to the percentage calculated above. Note that we do not use the conventional model of randomly shuffled sequences as the control. The main reason for this is the strong inhomogeneity of the nucleotide positional distributions in the promoter area (see Additional file 1, Supplemental Figures S1a and S1b). As a result of such distributions, the SS values built using shuffled sequences are strongly biased. For example, consider a hypothetical motif (with no biological meaning) composed predominantly of A and T nucleotides. With shuffled random sequences, such a motif will show over-representation (large positive SS) at positions from -250 to -150 and from +50 to +100, and under-representation (large negative SS) at positions from -25 to -5. The same motif will not show significant SS values at any position if our random sequences are used. Thus the control sequence set designed here eliminates the biases related to the strong positional inhomogeneity of the promoter area.
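A minimal sketch of such a position-specific control set is given below. It assumes the promoters are equal-length strings over A, C, G and T only; the function names and the fixed seed are illustrative choices, not part of the original software.

```python
import random
from collections import Counter

NUCLEOTIDES = "ACGT"

def positional_frequencies(promoters):
    """Fraction of each nucleotide at each aligned position."""
    n = len(promoters)
    length = len(promoters[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in promoters)
        freqs.append({nt: counts.get(nt, 0) / n for nt in NUCLEOTIDES})
    return freqs

def generate_control(freqs, n_sequences, seed=0):
    """Draw every position independently from its own nucleotide distribution.

    Unlike shuffling, this keeps the positional composition of the promoter
    set (for example an AT-rich stretch at a fixed offset) in the control.
    """
    rng = random.Random(seed)
    columns = []
    for f in freqs:
        weights = [f[nt] for nt in NUCLEOTIDES]
        columns.append(rng.choices(NUCLEOTIDES, weights=weights, k=n_sequences))
    return ["".join(col[j] for col in columns) for j in range(n_sequences)]
```

Calling `generate_control(positional_frequencies(promoters), 100000)` would produce a control set of the size used in the text.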
The following procedure was applied to obtain a PWM for each core promoter element (this is a simplified and modified version of the PWM-building algorithm we developed earlier). First, the approximate position of the functional window for a particular element was defined by examining the occurrence frequency distribution. Second, we analyzed how many mismatches to the "ideal" consensus (the consensus defined by experiments) are allowed. For this we divided the database into two subsets: one with promoters containing sites that match the motif consensus exactly at any position in the functional window, and another with promoters without such sites. Then we applied the motif consensus to the latter subset, allowing one mismatch. If the number of sites in the functional window was still substantially over-represented, we repeated all previous steps allowing two mismatches. We reiterated this cycle up to n times, where n is the number of mismatches in the consensus for which the distribution of occurrence frequency (obtained on the dataset of promoters with no sites matching the motif consensus with n-1 mismatches) shows no over-representation (SS = 5 was taken as the cutoff value). We assume that sites found inside a functional window by the consensus with n-1 mismatches are most likely functional sites. Note that the functional windows of all n steps do not necessarily coincide. We used the sites from the functional window of step n-1 to construct the PWM. There are several different approaches to defining a PWM. We used the form derived from Staden and Bucher. The next step is to define the cutoff value. The PWM should be "stronger" than the consensus with n mismatches and "weaker" than the consensus with n-1 mismatches. Our goal is to find an optimal cutoff value C_op such that the PWM with C_op finds all functional (over-represented) sites.
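To make the PWM construction step concrete, the sketch below builds a matrix from a set of aligned sites and scans a sequence with it. It uses a generic log-odds weighting with a pseudocount and a uniform background, which is an assumption made for illustration; the exact weights of the Staden and Bucher form used in the study are not reproduced here, and all names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Log-odds matrix from equal-length aligned sites (generic form)."""
    background = background or {nt: 0.25 for nt in NUCLEOTIDES}
    width = len(sites[0])
    total = len(sites) + pseudocount * len(NUCLEOTIDES)
    pwm = []
    for j in range(width):
        column = [site[j] for site in sites]
        pwm.append({
            nt: math.log(((column.count(nt) + pseudocount) / total) / background[nt])
            for nt in NUCLEOTIDES
        })
    return pwm

def score(pwm, window):
    """Sum of the column weights for one window of the same width as the PWM."""
    return sum(col[nt] for col, nt in zip(pwm, window))

def sites_above_cutoff(pwm, seq, cutoff):
    """Start positions (0-based) whose score reaches the cutoff."""
    width = len(pwm)
    return [i for i in range(len(seq) - width + 1)
            if score(pwm, seq[i:i + width]) >= cutoff]
```

Raising the cutoff makes the matrix "stronger" (fewer sites pass), which is the knob tuned in the next step.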
To implement this, we apply the PWM with an arbitrary cutoff C = C1 (we can start with small values, a priori less than C_op) to the promoter database and divide it into two subsets: promoters with sites in the functional window and promoters without such sites. Then we apply the motif consensus with n mismatches to the latter subset of promoters. Thus, we find the number of promoters that do not contain sites defined by the PWM with C = C1, yet contain sites defined by the consensus with n mismatches. We compare this number with the number of sites found in randomly generated sequences with the same percentage of nucleotides as in the aforementioned subset of promoter sequences at the same positions. If the former is smaller than the latter, the cutoff value is too small (C1 < C_op), and we repeat the procedure, increasing the cutoff value each time. The value C_m is the optimal cutoff value if, in the subset of promoters without PWM-defined sites, the real number is no longer smaller than the random one.
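The cutoff search can be read as a simple loop over increasing candidate values. The sketch below assumes two caller-supplied functions (both hypothetical): `count_real(c)` returns the number of promoters that lack PWM sites at cutoff `c` but contain consensus sites with n mismatches, and `count_random(c)` returns the corresponding number expected from the matched random sequences. The stopping rule, the first cutoff at which the real count is no longer below the random one, is an interpretation of the procedure rather than a confirmed detail of the original software.

```python
def find_optimal_cutoff(candidate_cutoffs, count_real, count_random):
    """Return the smallest cutoff at which the leftover consensus sites are
    no longer depleted relative to the random control, or None if no
    candidate satisfies the condition.
    """
    for cutoff in sorted(candidate_cutoffs):
        # A too-small cutoff lets the PWM absorb non-functional sites, so the
        # promoters left over contain fewer consensus sites than random.
        if count_real(cutoff) >= count_random(cutoff):
            return cutoff
    return None
```

For example, with toy counts `{1: 2, 2: 5, 3: 9}` for the real data and `{1: 6, 2: 6, 3: 7}` for the control, the loop rejects cutoffs 1 and 2 and settles on 3.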
To define potential synergetic distances between two core promoter elements, we examine the statistical significance (SS_l) of the over-representation of promoters containing a combination:
where the two quantities compared are the real and expected numbers of pairs of the considered elements placed at their functional positions at distance l from each other. The expected number is the estimated number of pairs if the presence of one element is independent of the presence of the other. This number may be calculated by the formula:
where w1 and w2 are the positions of the 5'- and 3'-ends of the functional window of element one, and the two probabilities are those of finding element one at position i and element two at position i+l, respectively. These probabilities are the respective occurrence frequencies, calculated on all promoters from the Orthomine Database.
As shown in the Results section, some of the combinations exhibit over-representation at several distances. To calculate the over-representation of promoters containing both elements at distances from l to l+Δl, we modify the formula for the expected number:
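Under the independence assumption stated above, the expected number of pairs at distance l is the number of promoters multiplied by the sum, over positions i in the functional window of element one, of the product of the two occurrence frequencies at i and i+l. The sketch below implements that reading. The extension to a range of distances (a plain sum over l to l+Δl) and the form of SS_l (the same signed chi-square-derived score assumed for single elements) are assumptions, as are all function names and the use of 0-based list indices for positions.

```python
import math

def expected_pairs(of1, of2, w1, w2, l, n_promoters):
    """Expected number of promoters with element one at i (w1 <= i <= w2)
    and element two at i + l, if the two elements occur independently.

    of1, of2: occurrence frequencies indexed by 0-based position.
    """
    total = 0.0
    for i in range(w1, w2 + 1):
        j = i + l
        if 0 <= j < len(of2):
            total += of1[i] * of2[j]
    return n_promoters * total

def expected_pairs_range(of1, of2, w1, w2, l, delta_l, n_promoters):
    """Assumed extension: sum the expectation over distances l .. l + delta_l."""
    return sum(expected_pairs(of1, of2, w1, w2, d, n_promoters)
               for d in range(l, l + delta_l + 1))

def pair_significance(n_real, n_expected):
    """Assumed form of SS_l, mirroring the single-element score."""
    return (n_real - n_expected) / math.sqrt(n_expected)
```

A large positive `pair_significance` at a given distance would flag that spacing as a candidate synergetic distance.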
The authors are thankful to Peter Cherbas and Sumit Middha (Dept. of Biology and Center for Genomics and Bioinformatics, Indiana University, Bloomington) for useful discussions and providing the database and its description prior to publication, to Ken Petri for assistance in the software design, to Thaddeus Tarpey for statistical consultation, and to Kristin Sanderson and Judith O'Donnell for proofreading.
- 12.Zenzie-Gregory B, Khachi A, Garraway IP, Smale ST: Mechanism of initiator-mediated transcription: evidence for a functional interaction between the TATA-binding protein and DNA in the absence of a specific recognition sequence. Mol Cell Biol. 1993, 13: 3841-3849.
- 23.Orthomine: A Dataset of Drosophila Core Promoters. [http://bio.informatics.indiana.edu/capstone/may05/talk_smiddha.pdf]
- 24.MEME home page. [http://meme.sdsc.edu/meme/intro.html]
- 25.Eukaryotic Promoter Database. [http://www.epd.isb-sib.ch/]
- 26.Database of Transcriptional Start Sites. [http://dbtss.hgc.jp/index.html]
- 50.The software package, Promoter Classifier. [http://bmi.osu.edu/~ilya/promoter_classifier/]
- 51.Connor-Linton J: Chi square tutorial. [http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html]
- 56.Nomenclature for incompletely specified bases in nucleic acid sequences. [http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html]
- 57.P-Value Calculator. [http://www.graphpad.com/quickcalcs/PValue1.cfm]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.