# A motif-independent metric for DNA sequence specificity

- 3.4k Downloads
- 7 Citations

**Part of the following topical collections:**

## Abstract

### Background

Genome-wide mapping of protein-DNA interactions has been widely used to investigate biological functions of the genome. An important question is to what extent such interactions are regulated at the DNA sequence level. However, current investigation is hampered by the lack of computational methods for systematic evaluating sequence specificity.

### Results

We present a simple, unbiased quantitative measure for DNA sequence specificity called the Motif Independent Measure (MIM). By analyzing both simulated and real experimental data, we found that the MIM measure can be used to detect sequence specificity independent of presence of transcription factor (TF) binding motifs. We also found that the level of specificity associated with H3K4me1 target sequences is highly cell-type specific and highest in embryonic stem (ES) cells. We predicted H3K4me1 target sequences by using the N- score model and found that the prediction accuracy is indeed high in ES cells.The software to compute the MIM is freely available at: https://github.com/lucapinello/mim.

### Conclusions

Our method provides a unified framework for quantifying DNA sequence specificity and serves as a guide for development of sequence-based prediction models.

### Keywords

Embryonic Stem Cell Nucleosome Position JASPAR Database Transcription Factor Motif Motif Presence## Background

Of the entire 3GB human genome, only about 2% codes for proteins. The identification of biological functions of the entire genome remains a major challenge [1, 2]. One powerful venue to gain functional insights is to identify the proteins that bind to each genomic region. Recent development of chromatin immunoprecipitation followed by microarray or sequencing (ChIP- chip or ChIPseq) technologies has made it feasible to map genome-wide protein-DNA interaction profiles [3, 4, 5]. The data generated by these experiments have not only greatly facilitated the genome-wide characterization of regulatory elements such as enhancers [6, 7] but also been integrated with other data sources to build gene regulatory networks [8, 9, 10, 11].

An important question is to what extent a specific protein-DNA interaction is mediated at the level of genomic sequences. While it is well known that specific sequence motifs are crucial for transcription factors (TF) mediated *cis*-regulation, there are many other proteins, such as chromatin modifiers, whose target sequences cannot simply be characterized by a handful of distinct motifs [12]. Such sequences are often regarded as nonspecific and not studied further. However, recent studies in nucleosome positioning have provided new insights by going beyond this motif-centric view [13]. Here various sequence features have been associated with nucleosome positioning, including poly dA:dT track [14, 15], abundance of G/C content [16, 17], and certain periodic patterns [18, 19]. Such patterns cannot be captured by traditional motif analysis methods. Similar results have been obtained by analyzing histone modification [20, 21] and DNA methylation data [22, 23].

Despite the success of these recent sequence-based prediction models, it remains difficult to determine which sequences lack intrinsic specificity because a poor prediction outcome might imply than more sophisticated models. A guide is needed for developing sequence-based prediction models. To this end, here we present a simple approach to quantify sequence specificity based on the frequency distribution of *k*-mers. We will also systematically investigate the relative merit of various distance or similarity functions for capturing specific sequence information. While *k*-mers have been extensively to detect splice sites [24], to study functional genomic regions[25], to identify protein coding genes[26] and used in motif analysis (reviewed by [27]), to our knowledge, they have not been used to quantify sequence specificity.

We evaluated the performance of our approach by analyzing one simulated datasets and two real experimental datasets, corresponding to a TF (STAT1) and a histone modification (H3K4me1) respectively. Our results have provided new insights into the role of DNA sequences in modulating protein-DNA interactions regardless of motif presence.

## Results

### A simple measure of sequence specificity

While specific sequence information has been identified in the absence of distinct motifs, to our knowledge, it is always associated with enrichment of certain *k*-mers (where *k* is a small number, such as 4). Its main difference with motifs is that, when k is small, a single *k*-mer may occur many times in the genome and therefore would not be useful for any practical purpose. On the other hand, we reasoned that more specific information can be obtained by combinations of multiple *k*-mers. Therefore, it seems appropriate to quantify sequence specificity by aggregating enrichment information for all *k*-mers. For the rest of the paper, we fix *k* = 4, although the method presented below is equally applicable to any choice of k. Treating complementary sequences as identical, there are 136 non-redundant 4-mers. By counting the frequency of each 4-mer, each input sequence is then mapped to a 136 dimensional numerical vector containing the frequency of each *k*-mer. The distributions corresponding to sequences containing specific information should be distinct from those for random sequences, which are generated to match the number and length of the input sequences. We use the symmetric Kullback-Leibler (KL) divergence [28] for comparing frequency distributions and average over the entire set of input sequences. We term the resulting value as the *M* otif *I* ndependent *M* etric (MIM). To evaluate statistical significance, we estimate the null distribution by computing MIM values for sets of random sequences. The detailed procedure is described in the Methods section.

### Model Validation

#### Simulated data

As an initial evaluation, we synthetically generated 8 sequence sets each containing 2000 sequences, mimicking TF ChIPseq experiments for which the corresponding TF recognizes a single motif: **TTGACA**. The difference between these sequence sets is the motif strength, which is parameterized by a real number *ε* (see Methods). In particular, a perfect motif corresponds to *ε* = 0, whereas a random sequence corresponds to *ε* = 0.25. In a typical ChIPseq experiment, only a subset of target sequences contains the motif. To simulate this fact, we randomly selected 1000 sequences from each set and inserted the motif at a randomly selected location. As control, we also synthesized 1000 sets of 2000 random sequences each.

*ε*up to 0.1 (Figure 1a and 1b). The information content for the corresponding motif is 5.35 bit, which is still lower than 98% of the motifs in the JASPAR core database [29]. In the following we will show that our method indeed performs well for real data. We ranked each

*k*-mer according to its relative contribution to the MIM. The most informative

*k*-mers are shown in Table 1. The methodology used to select such motif is outlined in the methods section. We noticed that the top

*k*-mers are substrings of the inserted motif (highlighted in bold in Table 1), suggesting that these

*k*-mers may be used as a seed for motif detection, in a similar way as the dictionary approach [30]. In additional to the KL divergence considered here, there are a number of other metrics to compare frequency distributions. We selected a few commonly used metrics and repeated the above analysis (Methods). We found that the results are quite similar (Table 2).

Top 20 *k*-mers ranked by different distances on Cell1 of Synthetic dataset

Cell1 | ||
---|---|---|

| | |

| | |

| | |

| | |

acag | acag | acag |

caag | caag | caag |

attg | caac | attg |

acat | acac | acat |

caac | attg | caac |

acac | acat | acac |

aatg | aatg | aatg |

acaa | acaa | acaa |

ttaa | cgga | ttaa |

aaat | caaa | aaat |

cgga | gacc | cgga |

caaa | agat | caaa |

aatt | aaat | aatt |

cata | taaa | cata |

gacc | ccgc | gacc |

agat | cgcc | agat |

gtaa | aagg | gtaa |

agat | cgcc | agat |

Distances values on Synthetic dataset

Cell | KL | p-value | Bhattacharyya | p-value | Hellinger | p-value |
---|---|---|---|---|---|---|

| 4.12E-03 | <0.001 | 2.18E-03 | <0.001 | 2.67E-02 | <0.001 |

| 3.64E-03 | <0.001 | 1.93E-03 | <0.001 | 2.51E-02 | <0.001 |

| 4.06E-03 | <0.001 | 2.16E-03 | <0.001 | 2.65E-02 | <0.001 |

| 3.23E-03 | <0.001 | 1.70E-03 | <0.001 | 2.36E-02 | <0.001 |

| 2.06E-03 | <0.001 | 1.07E-03 | <0.001 | 1.89E-02 | <0.001 |

| 9.59E-04 | <0.001 | 5.03E-04 | <0.001 | 1.29E-02 | <0.001 |

| 7.80E-04 | 0.0262 | 4.09E-04 | 0.0497 | 1.16E-02 | 0.0207 |

| 6.27E-04 | 0.2367 | 3.75E-04 | 0.1670 | 1.04E-02 | 0.2467 |

#### Real ChIPseq data

To validate our method using real experimental data, we analyzed a publicly available ChIPseq dataset for STAT1 [31], a member of the signal transducer and activator of transcription (STAT) family TFs, in the HeLa S3 cell line. The dataset contains 39,000 target sequences, 35% of which contains the consensus motif **TTCCNGGAA** (JASPAR database [29]). As control, we sampled random sequences from genomic background matching the number and length of the target sequences.

*k*-mers, several are substrings of the "classic" STAT1 motif (highlighted in bold in Table 3), suggesting it may provide useful information for identifying discriminative sequence signatures without the knowledge of TF motifs. Furthermore, the results are not sensitive to the specific choice of distances as in the simulated data experiment (Table 4).

Top 20 *k*-mers ranked by different distances on motif sequences on STAT1 dataset

STAT1 Motif | ||
---|---|---|

| | |

aata | atat | aata |

ttaa | tata | ttaa |

aaat | aata | aaat |

aaaa | ttaa | aaaa |

| atta | |

atat | aaat | atat |

atac | taaa | atac |

tcaa | atac | tcaa |

aatt | aatt | aatt |

acat | ataa | acat |

taca | taca | taca |

aggg | cata | aggg |

cgga | aaaa | cgga |

atta | attg | atta |

attg | acat | attg |

taga | tcaa | taga |

caaa | agcg | caaa |

acta | gata | acta |

ccag | taga | ccag |

agca | cgga | agca |

Distances values on STAT1 dataset

Peaks | KL | p-value | Bhattacharyya | p-value | Hellinger | p-value |
---|---|---|---|---|---|---|

| 3.26E-02 | <0.001 | 2.19E-03 | <0.001 | 7.51E-02 | <0.001 |

| 3.36E-02 | <0.001 | 2.72E-03 | <0.001 | 7.61E-02 | <0.001 |

| 3.76E-02 | <0.001 | 3.65E-03 | <0.001 | 8.05E-02 | <0.001 |

### Detecting sequence specificity in absence of a dominant motif

#### STAT1

As mentioned above, while the presence of STAT1 motif can explain the sequence specificity for 35% of the target sequences, it is unclear how TF is recruited to the other 65% of the targets. In order to evaluate the role of DNA sequence specificity for these motif-absent targets, we compared the MIM values between the motif-present and motif-absent subsets of targets. Surprisingly, we found that the MIM value for motif-absent targets is almost indistinguishable from motif-present targets (see Figure 2a and 2b). This high level of specificity cannot be simply explained by promoter-related biases, because only 11% of target sequences are located in promoters.

To gain mechanistic insights, we searched for enrichment of other TF motifs in the JASPAR database [29], using the FIMO software [32]. We found two motifs that are significantly enriched (threshold p-value < 10^{-6}): SP1 and ESR1, both have previously been shown to interact with STAT1 [33, 34]. Therefore, STAT1 might be recruited to the motif-absent targets through interaction with these other TFs. We further compared the associated gene ontology terms between the motif-present and motif-absent sets to see if there are any functional differences. We found that these two sets share many similar biological functions, such as hydrolase and ATPase activities (*p* < 10^{-17}). On the other hand, while the motif-present targets are highly enriched for the voltage-gated calcium channel complex (*p* < 10^{-12}), the motif-absent targets are highly enriched for cytoplasmic components instead (*p* < 10^{-12}).

#### H3K4me1

Top 20 *k*-mers ranking by different distances on H1 cell line on H3k4me1 dataset

H1 cell line | ||
---|---|---|

| | |

tcga | tcga | tcga |

cgaa | tcca | cgaa |

attc | attc | attc |

tcca | atgg | tcca |

atcg | cgaa | atcg |

ggaa | ggaa | ggaa |

atgg | aatg | atgg |

aatg | atcg | aatg |

aacg | tata | aacg |

ctta | ttaa | ctta |

gcta | ctta | gcta |

ttaa | aacg | ttaa |

ctaa | gcta | ctaa |

agct | taaa | agct |

ggta | ggta | ggta |

taaa | ataa | cgga |

cgga | cgga | taaa |

acgg | atta | acgg |

ataa | acgg | ataa |

atta | aaaa | atta |

Distances values on H3k4me1 dataset

Cell | KL | p-value | Bhattacharyya | p-value | Hellinger | p-value |
---|---|---|---|---|---|---|

| 4.43E-01 | <0.001 | 1.28E-02 | <0.001 | 2.71E-01 | <0.001 |

| 1.97E-01 | <0.001 | 8.47E-03 | <0.001 | 1.82E-01 | <0.001 |

| 1.10E-01 | <0.001 | 4.10E-03 | <0.001 | 1.37E-01 | <0.001 |

| 0.083176 | <0.001 | 0.003584 | <0.001 | 0.119491 | <0.001 |

| 0.064867 | <0.001 | 0.006796 | <0.001 | 0.105424 | <0.001 |

| 0.026875 | <0.001 | 0.002996 | <0.001 | 0.067992 | <0.001 |

| 0.014557 | <0.001 | 0.002912 | <0.001 | 0.050102 | <0.001 |

Since the H3K4me1 marks cell-type specific enhancers, one possible explanation for the high sequence specificity in ES cells is that the targets might be associated with a few ES-specific TFs. To test this possibility, we searched for enrichment of TF motifs in the JASPAR database using FIMO. Surprisingly, we were unable to find any significantly-enriched motif, suggesting that the specificity is contributed to a different mechanism.

## Discussion

Recently it has been shown that a large number of proteins may weakly bind to DNA [37]. It remains unclear to what extent such events are mediated by specific sequence information. This question cannot be answered by using traditional motif analysis, since the target sequences do not contain distinct motifs. As an alternative approach, we define a simple measure, called MIM, to quantify sequence specificity by aggregating information from all *k*-mers. Our approach does not make any assumptions regarding motif presence, providing a more versatile tool for sequence analysis. We validated this method by analyzing both simulated and experimental data and found that it is indeed effective for detecting sequence specificity in both cases.

We also showed that the MIM measure can provide new biological insights. Specifically, we found that the motif-absent targets of a TF may also contain specific sequence information due to interaction with other TFs. We also found that the sequence specificity for H3K4me1 targets is higher in ES cells than in differentiated cell-types, suggesting a unique role of DNA sequence in the recruitment of H3K4me1 in ES cells. Interestingly, this high specificity cannot be explained by enrichment of known TF motifs, suggesting a yet uncharacterized recruitment mechanism in ES cells. The MIM algorithm is implemented in Python and can be freely accessed at : https://github.com/lucapinello/mim.

## Conclusion

The role of DNA sequence in gene regulation remains incompletely understood. Our MIM method has extended previous work by further accounting for sequence specificity due to accumulation of weak sequence features. The information can be used as a guide to systematically investigate the regulatory mechanisms for a wide variety of biological processes.

## Methods

### Synthetic data generation

**TTGACA**. In order to simulate the variation of motif sites among different target sequences, we modeled the position weight matrix (PWM) as illustrated in Table 7, where ε measures the mutation rate of the motif and can change between 0 (perfect motif) and 0.25 (totally random). We sampled ε at 8 different values: 0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.1667, and 0.25. For each choice of ε, we generated 2000 sequences of 500 bp each. The sequences were initially generated by randomly sampling from the background distribution with the probabilities of A,C,G,T equal to 0.15, 0.35, 0.35, 0.15, respectively. In addition, we randomly selected a subset of 1000 sequences and inserted the motif at a random location.

PWM for synthetic motif generation

1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|

| ε | ε | ε | 1-3ε | ε | 1-3ε |

| ε | ε | ε | ε | 1-3ε | ε |

| ε | ε | 1-3ε | ε | ε | ε |

| 1-3ε | 1-3ε | ε | ε | ε | ε |

### ChIPseq data source

Genome-wide STAT1 peak locations in HeLa S3 cell lines were obtained from the http://archive.gersteinlab.org/proj/PeakSeq/Scoring_ChIPSeq/Results/STAT1[31]. ChIPseq data for H3K4me1 in seven human cell lines were obtained from literature: CD4+ T cell [4], CD36+ and CD133+ T cells [35], H1, Huvec, K562, and Nhek [1]. The raw data were processed by cisGenome to identify peak locations [36]. The DNA sequences at the peak locations were analyzed subsequently.

### Motif analysis

Motif analysis was done by using several tools in the MEME suite (http://meme.nbcr.net/meme/) as follows. Scanning DNA sequences for matches of a known motif was done by using the FIMO [32]. Motif comparison was done by using TOMTOM.

### Functional annotation

Functional annotation was done by using the GOrilla software [38] (http://cbl-gorilla.cs.technion.ac.il/).

### Details of the MIM measure

*k*-mer treating complementary

*k*-mers as the same. There are

*m*= 136 non-redundant

*k*-mers for k = 4. MIM is essentially a metric between two distributions of

*k*-mer frequencies. Specifically, let

**P = (**

*P*

_{ ij }) be the

*k*-mer frequency distributions corresponding to a set of

*n*target sequences

**S = (**

*S*

_{ i }), where

*S*

_{ i }represents a sequence in the set

**S**. We generate a set of

*n*random sequences

**R =**

**(***R*

_{ i }) matching the sequence lengths (analogously

*R*

_{ i }represents a sequence in the set

**R**). Let

**Q**= (

*Q*

_{ ij }) be the

*k*-mer frequency distributions corresponding to

**R**. Finally let ${\mathbf{P}}_{\mathit{j}}=\frac{{\Sigma}_{i}{P}_{ij}}{{\Sigma}_{ij}{P}_{ij}}$ and ${\mathbf{Q}}_{\mathit{j}}=\frac{{\Sigma}_{i}{Q}_{ij}}{{\Sigma}_{ij}{Q}_{ij}}$ (

**P**

_{ j }in particular represents the probability of the

*j*-th

*k*-mer in

**S**, analogously,

**Q**

_{ j }represents the probability of the

*j*-th

*k*-mer in

**R**) then the difference between

**P**and

**Q**is quantified by the symmetrical Kullback-Leibler (KL) divergence [28], as follows:

**S**is defined as the expected value

*d*

_{ kl }(

**S**,

**R**), which is estimated by averaging over 1000 sets of random sequences. The MIM value, using the symmetrical KL divergence, can be interpreted as the number of the expected number of extra bits required to code samples from

**S**when using a code based on the background distribution. Note that there exist several alternatives to measure the similarity of two probability distributions [39]. To evaluate whether the results are sensitive to the specific choice of distances, we also computed MIM values based on two other well-known distances between probability distributions:

- 1)The Hellinger distance [39]${d}_{hl}\left(\mathbf{S},\mathbf{R}\right)=\sqrt{\frac{1}{2}\sum _{j=1}^{m}{\left(\sqrt{{\mathbf{P}}_{j}-}\sqrt{{\mathbf{Q}}_{j}}\right)}^{2}}$

*d*

_{ kl }are 1)

*d*

_{ hl }naturally satisfies the triangle inequality; and 2) the range of

*d*

_{ hl }is the interval [0,1].

- 2)The Bhattacharyya distance[40]${d}_{bh}\left(\mathbf{S},\mathbf{R}\right)=\sum _{j=1}^{m}\frac{{\left({\mu}_{{\mathbf{P}}_{j}}-{\mu}_{{\mathbf{Q}}_{j}}\right)}^{2}}{4\left({\sigma}_{{\mathbf{P}}_{j}}+{\sigma}_{{\mathbf{Q}}_{j}}\right)}+\frac{1}{2}log\left(\frac{{\sigma}_{{\mathbf{P}}_{j}}+{\sigma}_{{\mathbf{Q}}_{j}}}{2\sqrt{{\sigma}_{{\mathbf{P}}_{j}}+{\sigma}_{{\mathbf{Q}}_{j}}}}\right)$

which has been widely used for pattern recognition in computer science [41];

where ${\mu}_{{\mathbf{P}}_{j}}=\frac{1}{n}\sum _{i=1}^{n}{P}_{ij}$ and ${\sigma}_{{\mathbf{P}}_{j}}=\sqrt{\frac{1}{n-1}\sum _{i=1}^{n}{\left({P}_{ij}-{\mu}_{{\mathbf{P}}_{j}}\right)}^{2}}$ are the mean and standard deviation, respectively, of **P**_{ j }(${\mu}_{{\mathbf{Q}}_{j}}$ and ${\sigma}_{{\mathbf{Q}}_{j}}$ are defined similarly for **Q**_{ j }).

In order to estimate the null distribution, we generated 1000 sets of random sequences and then calculated MIM values for each random sequence set. The probability density function (pdf) was estimated by using a kernel method [42]. This pdf was used to infer not only the mean and standard deviation of the null distribution but also the statistical significance for any MIM value. Recognizing the limited resolution of the estimated pdf, we did not distinguish p-values that are smaller than 0.001.

### N-score model

The N-score model was described previously [19, 21]. In brief, the model integrates three types of sequence features, including sequence periodicities [19], word counts [16], and structural parameters [43], a total of 2920 candidate features. Model selection was done by stepwise logistic regression. The final model was used for target prediction.

### Most informative k-mers selection

Giving **P**_{ j }and **Q**_{ j }associated to **S** and **R** respectively, it is possible to calculate their Kullback-Leibler (KL) divergence for each *j*, where *j* indicates the *j*-th *k*-mer component. This results in a list of 136 distance values, whose ranking can be used as a guide to identify the most informative *k*-mers.

## Notes

### Acknowledgements

We thank Zhen Shao for help with H3K4me1 data collection and initial processing. GY's research was supported by the NIH grant HG005085 and a Career Incubator Award from the Harvard School of Public Health.

## Supplementary material

### References

- 1.Birney E, et al.: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799–816. 10.1038/nature05874CrossRefPubMedGoogle Scholar
- 2.TCGA: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061–8. 10.1038/nature07385CrossRefGoogle Scholar
- 3.Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using a probabilistic segmentation model. Proc Int Conf Intell Syst Mol Biol 2000, 8: 67–74.PubMedGoogle Scholar
- 4.Barski A, et al.: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129(4):823–37. 10.1016/j.cell.2007.05.009CrossRefPubMedGoogle Scholar
- 5.Mikkelsen TS, et al.: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 2007, 448(7153):553–60. 10.1038/nature06008PubMedCentralCrossRefPubMedGoogle Scholar
- 6.Heintzman ND, et al.: Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 2009, 459(7243):108–12. 10.1038/nature07829PubMedCentralCrossRefPubMedGoogle Scholar
- 7.Crawford GE, et al.: Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res 2006, 16(1):123–31.PubMedCentralCrossRefPubMedGoogle Scholar
- 8.Yeang CH, Ideker T, Jaakkola T: Physical network models. J Comput Biol 2004, 11(2–3):243–62. 10.1089/1066527041410382CrossRefPubMedGoogle Scholar
- 9.Harbison CT, et al.: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431(7004):99–104. 10.1038/nature02800PubMedCentralCrossRefPubMedGoogle Scholar
- 10.Zhou Q, et al.: A gene regulatory network in mouse embryonic stem cells. Proc Natl Acad Sci USA 2007, 104(42):16438–43. 10.1073/pnas.0701014104PubMedCentralCrossRefPubMedGoogle Scholar
- 11.Chang LW, et al.: Computational identification of the normal and perturbed genetic networks involved in myeloid differentiation and acute promyelocytic leukemia. Genome Biol 2008, 9(2):R38. 10.1186/gb-2008-9-2-r38PubMedCentralCrossRefPubMedGoogle Scholar
- 12.Kouzarides T: Chromatin modifications and their function. Cell 2007, 128(4):693–705. 10.1016/j.cell.2007.02.005CrossRefPubMedGoogle Scholar
- 13.Jiang C, Pugh BF: Nucleosome positioning and gene regulation: advances through genomics. Nat Rev Genet 2009, 10(3):161–72.CrossRefPubMedGoogle Scholar
- 14.Sekinger EA, Moqtaderi Z, Struhl K: Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast. Mol Cell 2005, 18(6):735–48. 10.1016/j.molcel.2005.05.003CrossRefPubMedGoogle Scholar
- 15.Yuan GC, et al.: Genome-scale identification of nucleosome positions in S. cerevisiae. Science 2005, 309(5734):626–30. 10.1126/science.1112178CrossRefPubMedGoogle Scholar
- 16.Peckham HE, et al.: Nucleosome positioning signals in genomic DNA. Genome Res 2007, 17(8):1170–7. 10.1101/gr.6101007PubMedCentralCrossRefPubMedGoogle Scholar
- 17.Tillo D, Hughes TR: G+C content dominates intrinsic nucleosome occupancy. BMC Bioinformatics 2009, 10: 442. 10.1186/1471-2105-10-442PubMedCentralCrossRefPubMedGoogle Scholar
- 18.Field Y, et al.: Distinct modes of regulation by chromatin encoded through nucleosome positioning signals. PLoS Comput Biol 2008, 4(11):e1000216. 10.1371/journal.pcbi.1000216PubMedCentralCrossRefPubMedGoogle Scholar
- 19.Yuan GC, Liu JS: Genomic sequence is highly predictive of local nucleosome depletion. PLoS Comput Biol 2008, 4(1):e13. 10.1371/journal.pcbi.0040013PubMedCentralCrossRefPubMedGoogle Scholar
- 20.Ku M, et al.: Genomewide analysis of PRC1 and PRC2 occupancy identifies two classes of bivalent domains. PLoS Genet 2008, 4(10):e1000242. 10.1371/journal.pgen.1000242PubMedCentralCrossRefPubMedGoogle Scholar
- 21.Yuan GC: Targeted recruitment of histone modifications in humans predicted by genomic sequences. J Comput Biol 2009, 16(2):341–55. 10.1089/cmb.2008.18TTCrossRefPubMedGoogle Scholar
- 22.Bock C, et al.: CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure. PLoS Genet 2006, 2(3):e26. 10.1371/journal.pgen.0020026PubMedCentralCrossRefPubMedGoogle Scholar
- 23.Das R, et al.: Computational prediction of methylation status in human genomic sequences. Proc Natl Acad Sci USA 2006, 103(28):10713–6. 10.1073/pnas.0602949103PubMedCentralCrossRefPubMedGoogle Scholar
- 24.Salzberg SL: A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput Appl Biosci 1997, 13(4):365–76.PubMedGoogle Scholar
- 25.DeCaprio D, et al.: Conrad: gene prediction using conditional random fields. Genome Res 2007, 17(9):1389–98. 10.1101/gr.6558107PubMedCentralCrossRefPubMedGoogle Scholar
- 26.Narlikar L, et al.: Genome-wide discovery of human heart enhancers. Genome Res 2010, 20(3):381–92. 10.1101/gr.098657.109PubMedCentralCrossRefPubMedGoogle Scholar
- 27.Ji H, Wong WH: Computational biology: toward deciphering gene regulatory information in mammalian genomes. Biometrics 2006, 62(3):645–63. 10.1111/j.1541-0420.2006.00625.xCrossRefPubMedGoogle Scholar
- 28.Kullback S, Leibler RA: On Information and Sufficiency. The Annals of Mathematical Statistics 1951, 22(1):79–86. 10.1214/aoms/1177729694CrossRefGoogle Scholar
- 29.Sandelin A, et al.: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004, (32 Database):D91–4.PubMedCentralCrossRefPubMedGoogle Scholar
- 30.Bussemaker HJ, Li H, Siggia ED: Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc Natl Acad Sci USA 2000, 97(18):10096–100.PubMedCentralCrossRefPubMedGoogle Scholar
- 31.Rozowsky J, et al.: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 2009, 27(1):66–75. 10.1038/nbt.1518PubMedCentralCrossRefPubMedGoogle Scholar
- 32.Grant CE, Bailey TL, Noble WS: FIMO: scanning for occurrences of a given motif. Bioinformatics 2011, 27(7):1017–8. 10.1093/bioinformatics/btr064PubMedCentralCrossRefPubMedGoogle Scholar
- 33.Look DC, et al.: Stat1 depends on transcriptional synergy with Sp1. J Biol Chem 1995, 270(51):30264–7. 10.1074/jbc.270.51.30264CrossRefPubMedGoogle Scholar
- 34.Panchanathan R, et al.: Mutually positive regulatory feedback loop between interferons and estrogen receptor-alpha in mice: implications for sex bias in autoimmunity. PLoS One 2010, 5(5):e10868. 10.1371/journal.pone.0010868PubMedCentralCrossRefPubMedGoogle Scholar
- 35.Cui K, et al.: Chromatin signatures in multipotent human hematopoietic stem cells indicate the fate of bivalent genes during differentiation. Cell Stem Cell 2009, 4(1):80–93. 10.1016/j.stem.2008.11.011PubMedCentralCrossRefPubMedGoogle Scholar
- 36.Ji H, et al.: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol 2008, 26(11):1293–300. 10.1038/nbt.1505PubMedCentralCrossRefPubMedGoogle Scholar
- 37.Hu S, et al.: Profiling the human protein-DNA interactome reveals ERK2 as a transcriptional repressor of interferon signaling. Cell 2009, 139(3):610–22. 10.1016/j.cell.2009.08.037PubMedCentralCrossRefPubMedGoogle Scholar
- 38.Eden E, et al.: GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 2009, 10: 48. 10.1186/1471-2105-10-48PubMedCentralCrossRefPubMedGoogle Scholar
- 39.Deza E, Deza MM: Dictionary of distances. Elsevier; 2006.Google Scholar
- 40.Theodoridis S, Koutroumbas K: Pattern Recognition. Fourth edition. Academic Press; 2009.Google Scholar
- 41.Kailath T: The Divergence and Bhattacharyya Distance Measures in Signal Selection. Communications, IEEE Transactions on [legacy, pre - 1988] 1967, 15(1):52–60.CrossRefGoogle Scholar
- 42.Bowman AW, Azzalini A: Applied Smoothing Techniques for Data Analysis. Oxford Univeristy Press; 1997.Google Scholar
- 43.Lee W, et al.: A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet 2007, 39(10):1235–44. 10.1038/ng2117CrossRefPubMedGoogle Scholar

## Copyright information

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.