Abstract
Sequence comparison is a fundamental step in many important tasks in bioinformatics. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. Circular genome structure is a common phenomenon in nature, yet the specialized alignment techniques for circular sequence comparison are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. In this paper, we introduce a new distance measure based on q-grams, and show how it can be computed efficiently for circular sequence comparison. Experimental results on real and synthetic data show that our approach is orders of magnitude more efficient, while maintaining accuracy competitive with the state of the art.
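To make the idea of a q-gram-based comparison of circular sequences concrete, the following is a minimal sketch of the classic q-gram distance (the L1 distance between q-gram count profiles) applied to sequences read circularly. It is not the new measure introduced in the paper, which the abstract does not define; the function names and the wrap-around treatment are assumptions chosen for illustration only.

```python
from collections import Counter


def circular_qgram_profile(s: str, q: int) -> Counter:
    """Count every q-gram of s when s is read circularly.

    Hypothetical helper: a window starting near the end of s wraps
    around to the beginning, so there are exactly len(s) windows.
    """
    if not 0 < q <= len(s):
        raise ValueError("q must satisfy 0 < q <= len(s)")
    # Appending the first q-1 letters lets the last windows wrap around.
    extended = s + s[: q - 1]
    return Counter(extended[i : i + q] for i in range(len(s)))


def qgram_distance(p: Counter, r: Counter) -> int:
    """L1 distance between two q-gram count profiles."""
    # A Counter returns 0 for q-grams it has never seen.
    return sum(abs(p[g] - r[g]) for g in p.keys() | r.keys())


def circular_qgram_distance(x: str, y: str, q: int) -> int:
    """Illustrative q-gram distance between two circular sequences."""
    return qgram_distance(
        circular_qgram_profile(x, q), circular_qgram_profile(y, q)
    )


if __name__ == "__main__":
    # "GTACAC" is a rotation of "ACGTAC", so the circular profiles agree.
    print(circular_qgram_distance("ACGTAC", "GTACAC", 2))
    # A single substitution changes the q-grams that overlap it.
    print(circular_qgram_distance("AAAA", "AAAT", 2))
```

Because the profile is built from all wrap-around windows, any two rotations of the same sequence have distance zero in this sketch, and the computation takes time linear in the sequence lengths for fixed q. A zero distance does not imply that two sequences are rotations of each other, so a profile-based measure of this kind is an approximation rather than a replacement for alignment.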
R. Mercaş—Supported by the P.R.I.M.E. programme of DAAD co-funded by BMBF and EU’s 7th Framework Programme (grant 605728).
S.P. Pissis—Supported by a Research Grant (#RG130720) awarded by the Royal Society.
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grossi, R. et al. (2015). Circular Sequence Comparison with q-grams. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_15
DOI: https://doi.org/10.1007/978-3-662-48221-6_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48220-9
Online ISBN: 978-3-662-48221-6
eBook Packages: Computer Science (R0)