New Methods to Infer DNA Function from Sequence Information
We present a new computational approach to infer DNA function from eukaryotic DNA sequence information. It is based on the fact that exons, regulatory regions, and non-coding non-regulatory DNA exhibit different statistical patterns. We suggest capturing and measuring these patterns with the following suite of statistical tools: (1) the ‘fluffy-tail’ test, a bootstrap procedure to recognize a statistically significant abundance of similar words in regulatory DNA; (2) an algorithm that assesses the density of low-entropy patches as a new measure of homogeneity, which can be used to distinguish coding from non-coding and regulatory regions; and (3) an adaptive window technique applied to rescaled range analysis and entropy measurements. The latter is an optimization technique that segments DNA into homogeneous parts (which are therefore likely to be coding); its outcomes are independent of the size of the sliding window and hence avoid averaging. Applying our methods to several annotated data sets from six eukaryotic species enables a clear separation of coding, regulatory, and non-coding non-regulatory DNA. We propose that established computational methods, complemented by our new statistical tests and augmented with the novel optimization technique for sliding windows, create a powerful tool for the characterization and annotation of DNA sequences. The software is available from the authors on request.
Key words: regulatory regions; coding DNA; heterogeneity; statistical methods; information entropy; long-range correlations; motif abundance
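
To make the idea behind tool (2) concrete, the following is a minimal sketch of a low-entropy patch density measure. It is not the authors' algorithm, which the abstract does not specify. The function names `shannon_entropy` and `low_entropy_density`, the fixed window size, the use of single-nucleotide composition, and the entropy threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the symbol composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_entropy_density(seq, window=8, threshold=1.0):
    """Fraction of sliding windows whose entropy falls below threshold.

    Hypothetical homogeneity measure: window and threshold are
    illustrative defaults, not values taken from the paper.
    """
    if len(seq) < window:
        raise ValueError("sequence shorter than window")
    total = len(seq) - window + 1
    low = sum(
        1
        for i in range(total)
        if shannon_entropy(seq[i:i + window]) < threshold
    )
    return low / total


if __name__ == "__main__":
    # A homopolymer run is maximally homogeneous; a balanced repeat is not.
    print(low_entropy_density("AAAAAAAAAAAA"))
    print(low_entropy_density("ACGTACGTACGT"))
```

Note that this sketch uses a fixed window, so its result depends on the window size. The adaptive window technique of tool (3) is described as removing exactly that dependence, and is not reproduced here.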