Skip to main content

Circular Sequence Comparison with q-grams

  • Conference paper
  • First Online:
Algorithms in Bioinformatics (WABI 2015)

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 9289))

Included in the following conference series:

Abstract

Sequence comparison is a fundamental step in many important tasks in bioinformatics. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. As circular genome structure is a common phenomenon in nature, a caveat of specialized alignment techniques for circular sequence comparison is that they are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. In this paper, we introduce a new distance measure based on q-grams, and show how it can be computed efficiently for circular sequence comparison. Experimental results, using real and synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining an accuracy very competitive to the state of the art.

R. Mercaş—Supported by the P.R.I.M.E. programme of DAAD co-funded by BMBF and EU’s 7th Framework Programme (grant 605728).

S.P. Pissis—Supported by a Research Grant (#RG130720) awarded by the Royal Society.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Barton, C., Iliopoulos, C.S., Kundu, R., Pissis, S.P., Retha, A., Vayani, F.: Accurate and efficient methods to improve multiple circular sequence alignment. In: Bampis, E. (ed.) SEA 2015. LNCS, vol. 9125, pp. 247–258. Springer, Heidelberg (2015)

    Chapter  Google Scholar 

  2. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A., Wheeler, D.L.: GenBank. Nucleic Acids Res. 28(1), 15–18 (2000)

    Article  Google Scholar 

  3. Bray, N., Pachter, L.: MAVID: constrained ancestral alignment of multiple sequences. Genome Res. 14(4), 693–699 (2004)

    Article  Google Scholar 

  4. Brodie, R., Smith, A.J., Roper, R.L., Tcherepanov, V., Upton, C.: Base-By-Base: single nucleotide-level analysis of whole viral genome alignments. BMC Bioinform. 5(1), 96 (2004)

    Article  Google Scholar 

  5. Bunke, H., Buhler, U.: Applications of approximate string matching to 2D shape recognition. Pattern Recogn. 26(12), 1797–1812 (1993)

    Article  Google Scholar 

  6. Burcsi, P., Cicalese, F., Fici, G., Lipták, Z.: Algorithms for jumbled pattern matching in strings. Int. J. Found Comput. Sci. 23(2), 357–374 (2012)

    Article  MATH  Google Scholar 

  7. Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.P., Rivals, E., Vingron, M.: \(q\)-gram based database searching using a suffix array (QUASAR). In: 3rd RECOMB, pp. 77–83 (1999)

    Google Scholar 

  8. Chao, K.M., Zhang, J., Ostell, J., Miller, W.: A tool for aligning very similar DNA sequences. CABIOS 13(1), 75–80 (1997)

    Google Scholar 

  9. Cohen, S., Houben, A., Segal, D.: Extrachromosomal circular DNA derived from tandemly repeated genomic sequences in plants. Plant J. 53(6), 1027–1034 (2008)

    Article  Google Scholar 

  10. Craik, D.J., Allewell, N.M.: Thematic minireview series on circular proteins. J. Biol. Chem. 287(32), 26999–27000 (2012)

    Article  Google Scholar 

  11. Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on Strings. Cambridge University Press, New York (2007)

    Book  MATH  Google Scholar 

  12. del Castillo, C.S., Hikima, J.I., Jang, H.B., Nho, S.W., Jung, T.S., Wongtavatchai, J., Kondo, H., Hirono, I., Takeyama, H., Aoki, T.: Comparative sequence analysis of a multidrug-resistant plasmid from Aeromonas hydrophila. Antimicrob. Agents Chemother. 57(1), 120–129 (2013)

    Article  Google Scholar 

  13. Ehlers, T., Manea, F., Mercaş, R., Nowotka, D.: k-Abelian pattern matching. In: Shur, A.M., Volkov, M.V. (eds.) DLT 2014. LNCS, vol. 8633, pp. 178–190. Springer, Heidelberg (2014)

    Google Scholar 

  14. Fernandes, F., Pereira, L., Freitas, A.T.: CSA: an efficient algorithm to improve circular DNA multiple alignment. BMC Bioinform. 10(1), 1–13 (2009)

    Article  Google Scholar 

  15. Fischer, J.: Inducing the LCP-array. In: Dehne, F., Iacono, J., Sack, J.-R. (eds.) WADS 2011. LNCS, vol. 6844, pp. 374–385. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  16. Fletcher, W., Yang, Z.: INDELible: a flexible simulator of biological sequence evolution. Mol. Biol. Evol. 26(8), 1879–1888 (2009)

    Article  Google Scholar 

  17. Goios, A., Pereira, L., Bogue, M., Macaulay, V., Amorim, A.: mtDNA phylogeny and evolution of laboratory mouse strains. Genome Res. 17(3), 293–298 (2007)

    Article  Google Scholar 

  18. Gotoh, O.: An improved algorithm for matching biological sequences. J. Mol. Biol. 162(3), 705–708 (1982)

    Article  Google Scholar 

  19. Helinski, D.R., Clewell, D.B.: Circular DNA. Annu. Rev. Biochem. 40(1), 899–942 (1971)

    Article  Google Scholar 

  20. Lee, T., Na, J.C., Park, H., Park, K., Sim, J.S.: Finding consensus and optimal alignment of circular strings. Theor. Comput. Sci. 468, 92–101 (2013)

    Article  MathSciNet  MATH  Google Scholar 

  21. Maes, M.: On a cyclic string-to-string correction problem. IPL 35(2), 73–78 (1990)

    Article  MathSciNet  MATH  Google Scholar 

  22. Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

    Article  MathSciNet  MATH  Google Scholar 

  23. Marzal, A., Barrachina, S.: Speeding up the computation of the edit distance for cyclic strings. In: 15th ICPR, vol. 2, pp. 891–894 (2000)

    Google Scholar 

  24. Mosig, A., Hofacker, I.L., Stadler, P.F.: Comparative analysis of cyclic sequences: viroids and other small circular RNAs. In: GCB. LNI, vol. 83, pp. 93–102. GI (2006)

    Google Scholar 

  25. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–453 (1970)

    Article  Google Scholar 

  26. Peterlongo, P., Sacomoto, G.T., do Lago, A.P., Pisanti, N., Sagot, M.F.: Lossless filter for multiple repeats with bounded edit distance. Algorithm Mol. Biol. 4(3), 1–20 (2009)

    Google Scholar 

  27. Peterlongo, P., Pisanti, N., Boyer, F., do Lago, A.P., Sagot, M.F.: Lossless filter for multiple repetitions with Hamming distance. JDA 6(3), 497–509 (2008)

    MathSciNet  MATH  Google Scholar 

  28. Pisanti, N., Giraud, M., Peterlongo, P.: Filters and seeds approaches for fast homology searches in large datasets. In: Elloumi, M., Zomaya, A.Y. (eds.) Algorithms in computational molecular biology, chap. 15, pp. 299–320. John Wiley & sons (2010)

    Google Scholar 

  29. Ponting, C.P., Russell, R.B.: Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem. Sci. 20(5), 179–180 (1995)

    Article  Google Scholar 

  30. Rasmussen, K., Stoye, J., Myers, E.: Efficient \(q\)-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol. 13(2), 296–308 (2006)

    Article  MathSciNet  Google Scholar 

  31. Rice, P., Longden, I., Bleasby, A.: EMBOSS: the european molecular biology open software suite. Trends Genet. 16(6), 276–277 (2000)

    Article  Google Scholar 

  32. Ukkonen, E.: Approximate string-matching with \(q\)-grams and maximal matches. Theor. Comput. Sci. 92(1), 191–211 (1992)

    Article  MathSciNet  MATH  Google Scholar 

  33. Wang, Z., Wu, M.: Phylogenomic reconstruction indicates mitochondrial ancestor was an energy parasite. PLoS ONE 10(9), e110685 (2014)

    Article  Google Scholar 

  34. Weiner, J., Bornberg-Bauer, E.: Evolution of circular permutations in multidomain proteins. Mol. Biol. Evol. 23(4), 734–743 (2006)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Solon P. Pissis .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Grossi, R. et al. (2015). Circular Sequence Comparison with q-grams. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science(), vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-662-48221-6_15

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48220-9

  • Online ISBN: 978-3-662-48221-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics