A Probabilistic Model for Sequence Alignment with Context-Sensitive Indels

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2011)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6577)

Abstract

Probabilistic approaches to sequence alignment are usually based on pair Hidden Markov Models (HMMs) or Stochastic Context-Free Grammars (SCFGs). Recent studies have shown a significant correlation between the content of short indels and their flanking regions, which by definition cannot be modelled by either of these approaches. In this work, we present a context-sensitive indel model based on a pair Tree-Adjoining Grammar (TAG), along with accompanying algorithms for efficient alignment and parameter estimation. The increased precision and statistical power of this model are shown on simulated and real genomic data. As the cost of sequencing plummets, the usefulness of comparative analysis is becoming limited by alignment accuracy rather than data availability. Our results will therefore have an impact on any downstream comparative genomics analysis that relies on alignments. Fine-grained studies of small functional regions or disease markers, for example, could be significantly improved by our method. The implementation is available at http://www.mcb.mcgill.ca/~blanchem/software.html


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hickey, G., Blanchette, M. (2011). A Probabilistic Model for Sequence Alignment with Context-Sensitive Indels. In: Bafna, V., Sahinalp, S.C. (eds) Research in Computational Molecular Biology. RECOMB 2011. Lecture Notes in Computer Science, vol 6577. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20036-6_10

  • DOI: https://doi.org/10.1007/978-3-642-20036-6_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20035-9

  • Online ISBN: 978-3-642-20036-6

  • eBook Packages: Computer Science, Computer Science (R0)