Gapped Extension for Local Multiple Alignment of Interspersed DNA Repeats

  • Todd J. Treangen
  • Aaron E. Darling
  • Mark A. Ragan
  • Xavier Messeguer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4983)


The identification of homologous DNA is a fundamental building block of comparative genomic and molecular evolution studies. To date, pairwise local sequence alignment methods have been the prevailing technique for identifying homologous nucleotides. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with a previously described efficient filtration method for local multiple alignment. During gapped extension, we use the MUSCLE implementation of progressive multiple alignment with iterative refinement. The resulting gapped extensions may contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any strand/species-symmetric nucleotide substitution matrix, and we have developed a method to adapt an arbitrary substitution matrix (e.g. HOXD) to organisms with different G+C content. We evaluate the performance of our method and previous approaches on a hybrid dataset of real genomic DNA with simulated interspersed repeats. Our method outperforms existing methods in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in the free, open-source procrastAligner software, available from: procrastination
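The idea of using an HMM to score each alignment column by its posterior probability of homology can be illustrated with a minimal sketch. This is not the procrastAligner implementation: it assumes a pairwise alignment reduced to match/mismatch columns, a two-state model (homologous vs. unrelated), and illustrative parameter values (`p_match_h`, `p_match_u`, `stay`) that are not taken from the paper. The paper's model derives its emissions from a nucleotide substitution matrix rather than from a single match probability.

```python
def posterior_homology(col_is_match, p_match_h=0.8, p_match_u=0.25, stay=0.95):
    """Posterior P(homologous | all columns) per column, via forward-backward.

    All parameters are hypothetical illustration values, not the paper's.
    """
    n = len(col_is_match)
    if n == 0:
        return []
    # Emission probability of each column under (homologous, unrelated).
    emit = [
        (p_match_h if m else 1.0 - p_match_h,
         p_match_u if m else 1.0 - p_match_u)
        for m in col_is_match
    ]
    # trans[r][s]: probability of moving from state r to state s.
    trans = ((stay, 1.0 - stay), (1.0 - stay, stay))

    # Forward pass, rescaled at every column to avoid underflow.
    fwd = []
    prev = (0.5, 0.5)  # uniform start distribution (an assumption)
    for t, e in enumerate(emit):
        if t == 0:
            cur = [prev[s] * e[s] for s in range(2)]
        else:
            cur = [e[s] * sum(prev[r] * trans[r][s] for r in range(2))
                   for s in range(2)]
        z = cur[0] + cur[1]
        prev = (cur[0] / z, cur[1] / z)
        fwd.append(prev)

    # Backward pass, also rescaled per column.
    bwd = [None] * n
    nxt = (1.0, 1.0)
    bwd[n - 1] = nxt
    for t in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[t + 1][r] * nxt[r] for r in range(2))
               for s in range(2)]
        z = cur[0] + cur[1]
        nxt = (cur[0] / z, cur[1] / z)
        bwd[t] = nxt

    # Per-column normalisation cancels the scaling factors.
    post = []
    for f, b in zip(fwd, bwd):
        h, u = f[0] * b[0], f[1] * b[1]
        post.append(h / (h + u))
    return post


# A well-conserved block followed by a mostly mismatching tail.
columns = [True] * 20 + [False, True, False, False, True,
                         False, False, False, True, False] * 2
posteriors = posterior_homology(columns)
# Columns whose posterior falls below a chosen cutoff would be trimmed.
kept = [i for i, p in enumerate(posteriors) if p >= 0.5]
```

In this sketch, columns with a posterior below the cutoff are the ones a filter would remove as likely alignments of unrelated sequence. The cutoff of 0.5 is likewise an arbitrary choice for illustration.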


Positive Predictive Value, Hidden Markov Model, Input Sequence, Pairwise Alignment, Gapped Extension
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Todd J. Treangen (1)
  • Aaron E. Darling (2)
  • Mark A. Ragan (2)
  • Xavier Messeguer (1)

  1. Dept. of Computer Science, Polytechnic University of Catalonia, Barcelona, Spain
  2. ARC Centre of Excellence in Bioinformatics, and Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
