In this paper, we develop a new approach for analyzing DNA sequences in order to detect regions with similar nucleotide composition. Our algorithm, which we call composition alignment or, more whimsically, scrambled alignment, employs the mechanisms of string matching and string comparison yet avoids the overdependence of those methods on position-by-position matching. In composition alignment, we extend the matching concept to composition matching. Two strings have a composition match if their lengths are equal and they have the same nucleotide content.
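As a minimal illustration of the composition match definition above, the following Python sketch tests whether two strings have equal length and the same count of each character. It assumes that "same nucleotide content" means identical per-nucleotide counts regardless of order; the function name `composition_match` is hypothetical and not taken from the paper.

```python
from collections import Counter


def composition_match(s: str, t: str) -> bool:
    """Return True if s and t have equal length and identical character counts.

    Assumption: "same nucleotide content" is read as equality of the
    multisets of characters in the two strings.
    """
    return len(s) == len(t) and Counter(s) == Counter(t)


if __name__ == "__main__":
    # Same letters in a different order: a composition match.
    print(composition_match("ACGT", "TGCA"))
    # Same length but different counts of C and G: not a match.
    print(composition_match("AACG", "ACGG"))
```

Under this reading, a position-by-position match is a special case of a composition match, since identical strings trivially share their character counts.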

We define the composition alignment problem and give a dynamic programming solution. We explore several composition match weighting functions and show that composition alignment with one class of these can be computed in O(nm) time, the same as for standard alignment. We discuss statistical properties of composition alignment scores and demonstrate the ability of the algorithm to detect regions of similar composition in eukaryotic promoter sequences in the absence of detectable similarity through standard alignment.
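The abstract does not spell out the recurrence, so the following is only a naive sketch of how a dynamic program over composition matches might look, not the paper's O(nm) method. It assumes a global alignment, a hypothetical weighting function `weight(l)` applied to a composition match of length `l`, and hypothetical per-character `mismatch` and `gap` penalties. It runs in O(nm · min(n, m)) time because it tries every match length at each cell.

```python
def composition_alignment_score(s, t, weight=lambda l: l, mismatch=-1, gap=-1):
    """Naive global composition alignment score (illustrative sketch only).

    Assumptions, none taken from the paper: a composition match of length l
    scores weight(l); a single-character substitution scores `mismatch`;
    each inserted or deleted character scores `gap`.
    """
    n, m = len(s), len(t)
    # S[i][j] is the best score for aligning s[:i] with t[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(S[i - 1][j] + gap,
                       S[i][j - 1] + gap,
                       S[i - 1][j - 1] + mismatch)
            # Extend the two substrings ending at i and j leftwards one
            # character at a time, tracking the count difference per character.
            diff = {}
            nonzero = 0  # number of characters whose counts currently differ
            for l in range(1, min(i, j) + 1):
                for ch, delta in ((s[i - l], 1), (t[j - l], -1)):
                    old = diff.get(ch, 0)
                    new = old + delta
                    diff[ch] = new
                    if old == 0 and new != 0:
                        nonzero += 1
                    elif old != 0 and new == 0:
                        nonzero -= 1
                if nonzero == 0:
                    # s[i-l:i] and t[j-l:j] have the same composition.
                    best = max(best, S[i - l][j - l] + weight(l))
            S[i][j] = best
    return S[n][m]


if __name__ == "__main__":
    print(composition_alignment_score("ACGT", "TGCA"))
    print(composition_alignment_score("AAAA", "CCCC"))
```

With the default `weight(l) = l`, a scrambled pair such as `ACGT` and `TGCA` scores as well as an exact match, which is the behavior that distinguishes composition alignment from position-by-position alignment. Other weighting functions can be passed in through the `weight` parameter.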







Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Gary Benson, Department of Biomathematical Sciences, The Mount Sinai School of Medicine, New York, USA
