Pair HMM Based Gap Statistics for Re-evaluation of Indels in Alignments with Affine Gap Penalties

  • Alexander Schönhuth
  • Raheleh Salari
  • S. Cenk Sahinalp
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6293)


Although computationally aligning sequences is a crucial step in the vast majority of comparative genomics studies, our understanding of alignment biases still needs to be improved. To infer true structural or homologous regions, computational alignments need further evaluation. It has been shown that the accuracy of aligned positions can drop substantially, in particular around gaps. Here we focus on re-evaluation of score-based alignments with affine gap penalty costs. We exploit their relationship with pair hidden Markov models and develop efficient algorithms to identify gaps that are significant in terms of length and multiplicity. We evaluate our statistics with respect to the well-established structural alignments from SABmark and find that indel reliability increases substantially with significance, in particular in worst-case twilight zone alignments. This indicates that our statistics can reliably complement other methods, which mostly focus on the reliability of match positions.
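
The link between affine gap penalties and pair hidden Markov models mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm. It only shows the textbook observation that an affine cost grows linearly with gap length, which matches a gap state that is extended with a fixed probability and therefore produces geometrically distributed gap lengths. The function names, the convention open + (length - 1) * extend, and the parameter extend_prob are assumptions made for this illustration.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a single gap of `length` >= 1.

    Assumed convention (one of several in use): the first gap position
    costs `gap_open`, every further position costs `gap_extend`.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (length - 1) * gap_extend


def geometric_gap_tail(length, extend_prob):
    """P(gap length >= `length`) for a gap state that is extended
    independently with probability `extend_prob` at each step.

    Hypothetical helper: a simple tail probability under a geometric
    length model, not the significance statistic of the paper.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    if not 0.0 <= extend_prob < 1.0:
        raise ValueError("extend_prob must lie in [0, 1)")
    return extend_prob ** (length - 1)


if __name__ == "__main__":
    # Longer gaps cost linearly more and become geometrically less likely.
    for k in (1, 2, 4, 8):
        print(k, affine_gap_cost(k, 10, 1), geometric_gap_tail(k, 0.5))
```

Under this simplified model a long gap is rare by chance, which gives an intuition for why gap length can serve as a measure of significance; the statistics in the paper itself also account for multiplicity and are derived differently.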





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Alexander Schönhuth (1)
  • Raheleh Salari (2)
  • S. Cenk Sahinalp (2)
  1. Department of Mathematics, University of California at Berkeley, USA
  2. School of Computing Science, Simon Fraser University, Burnaby
