In this paper, we develop a new approach for analyzing DNA sequences in order to detect regions with similar nucleotide composition. Our algorithm, which we call composition alignment or, more whimsically, scrambled alignment, employs the mechanisms of string matching and string comparison yet avoids the overdependence of those methods on position-by-position matching. In composition alignment, we extend the matching concept to composition matching. Two strings have a composition match if their lengths are equal and they have the same nucleotide content.
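The composition match defined above can be illustrated with a minimal Python sketch. The function name `composition_match` is hypothetical and chosen here for illustration; it simply checks that two strings have equal length and identical nucleotide counts, regardless of order.

```python
from collections import Counter


def composition_match(a: str, b: str) -> bool:
    """Return True if a and b have equal length and the same nucleotide counts.

    Illustrative helper only; the name and signature are assumptions,
    not taken from the paper.
    """
    return len(a) == len(b) and Counter(a) == Counter(b)


if __name__ == "__main__":
    # Same nucleotides in a different order: a composition match.
    print(composition_match("ACGT", "TGCA"))
    # Same length but different counts: not a composition match.
    print(composition_match("AACG", "ACGT"))
```

Under this definition an exact position-by-position match is a special case of a composition match, which is why the concept extends ordinary matching rather than replacing it.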
We define the composition alignment problem and give a dynamic programming solution. We explore several composition match weighting functions and show that composition alignment with one class of these can be computed in O(nm) time, the same as for standard alignment. We discuss statistical properties of composition alignment scores and demonstrate the ability of the algorithm to detect regions of similar composition in eukaryotic promoter sequences in the absence of detectable similarity through standard alignment.
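As a rough illustration of how a dynamic programming solution of this kind might look, the sketch below computes a global alignment score in which any pair of equal-length substrings with the same composition may be aligned as a block. It is not the paper's algorithm: the recurrence, the linear gap penalty `gap`, the default weighting `weight(k) = k`, and the function name are all assumptions made for this example. It is also the naive version, running in roughly O(nm · min(n, m)) time, not the O(nm) bound stated above for one class of weighting functions.

```python
def composition_alignment_score(s, t, weight=lambda k: k, gap=-1.0):
    """Naive global composition alignment score (illustrative sketch).

    S[i][j] is the best score for aligning s[:i] with t[:j], where a step is
    either a single-character gap or a composition match between the
    length-k suffixes s[i-k:i] and t[j-k:j], scored by weight(k).
    """
    n, m = len(s), len(t)
    alphabet = sorted(set(s) | set(t))
    index = {c: a for a, c in enumerate(alphabet)}

    def prefix_counts(x):
        # counts[i][a] = occurrences of alphabet[a] in x[:i]
        counts = [[0] * len(alphabet)]
        for c in x:
            row = counts[-1][:]
            row[index[c]] += 1
            counts.append(row)
        return counts

    ps, pt = prefix_counts(s), prefix_counts(t)
    neg_inf = float("-inf")
    S = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0:
                best = max(best, S[i - 1][j] + gap)  # gap in t
            if j > 0:
                best = max(best, S[i][j - 1] + gap)  # gap in s
            for k in range(1, min(i, j) + 1):
                # Substrings have the same composition iff their
                # prefix-count differences agree for every symbol.
                same = all(
                    ps[i][a] - ps[i - k][a] == pt[j][a] - pt[j - k][a]
                    for a in range(len(alphabet))
                )
                if same:
                    best = max(best, S[i - k][j - k] + weight(k))
            S[i][j] = best
    return S[n][m]


if __name__ == "__main__":
    print(composition_alignment_score("ACGT", "TGCA"))
```

Prefix counts make each composition test independent of the substring length, so the remaining cost comes from trying every block length k at every cell. In this sketch a mismatched pair of characters can only be handled as two gaps, which is a simplification.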
Keywords: Sequence Length, Alignment Score, Alignment Parameter, Logarithmic Region, Match Length