Skip to main content

Indexing DNA Sequences Using q-Grams

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2005)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3453))

Included in the following conference series:

Abstract

We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to the interest, this paper presents a method for indexing the DNA sequences efficiently based on q-grams to facilitate similarity search in a DNA database and sidestep the need for linear scan of the entire database. Two level index – hash table and c-trees – are proposed based on the q-grams of DNA sequences. The proposed data structures allow the quick detection of sequences within a certain distance to the query sequence. Experimental results show that our method is efficient in detecting similarity regions in a DNA sequence database with high sensitivity.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: A basic local alignment search tool. Journal of Molecular Biology (1990)

    Google Scholar 

  2. Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.P., Vingron, M.: q-gram based database searching using a suffix array (quasar). In: Int. Conf. RECOMB, Lyon (April 1999)

    Google Scholar 

  3. Cao, X., Li, S.C., Ooi, B.C., Tung, A.: Piers: An efficient model for similarity search in dna sequence databases. ACM Sigmod Record 33 (2004)

    Google Scholar 

  4. Giladi, E., Walker, M., Wang, J., Volkmuth, W.: Sst: An algorithm for searching sequence databases in time proportional to the logarithm of the database size. In: Int. Conf. RECOMB, Japan (2000)

    Google Scholar 

  5. Hunt, E., Atkinson, M.P., Irving, R.W.: A database index to large biological sequences. International Journal on VLDB, 139–148 (September 2001)

    Google Scholar 

  6. Jokinen, P., Ukkonen, E.: Two algorithm for approximate string matching in static texts. In: Proc. of the 16th Symposium on Mathematical Foundataions of Computer Science, pp. 240–248 (1991)

    Google Scholar 

  7. Kahveci, T., Singh, A.: An efficient index structure for string databases. In: Proc. 2001 Int. Conf. Very Large Data Bases (VLDB 2001), Roma, Italy (2001)

    Google Scholar 

  8. Ma, B., Tromp, J., Li, M.: Patternhunter: faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)

    Article  Google Scholar 

  9. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string search. SIAM Journal on Computing 22, 935–948 (1993)

    Article  MATH  MathSciNet  Google Scholar 

  10. Meek, C., Patel, J.M., Kasetty, S.: Oasis: An online and accurate technique for local-alignment searches on biological sequences. In: Proc. 2003 Int. Conf. Very Large Data Bases (VLDB 2003), Berlin, Germany, September 2003, pp. 910–921 (2003)

    Google Scholar 

  11. Muthukrishnan, S., Sahinalp, S.C.: Approximate nearest neighbors and sequence comparison with block operation. In: STOC, Portland, Or (2000)

    Google Scholar 

  12. Ozturk, O., Ferhatosmanoglu, H.: Effective indexing and filtering for similarity search in large biosequence datasbases. In: Third IEEE Symposium on BioInformatics and BioEngineering (BIBE 2003), Bethesda, Maryland (2003)

    Google Scholar 

  13. Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. Proceedings Natl. Acad. Sci. USA 85, 2444–2448 (1988)

    Article  Google Scholar 

  14. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. Molecular Biology 147, 195–197 (1981)

    Article  Google Scholar 

  15. Tan, Z., Cao, X., Ooi, B.C., Tung, A.: The ed-tree: an index for large dna sequence databases. In: Proc. 15th Int. Conf. on Scientific and Statistical Database Management, pp. 151–160 (2003)

    Google Scholar 

  16. Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th IEEE Symp. On Switching and Automata Theory, pp. 1–11 (1973)

    Google Scholar 

  17. Williams, H.E., Zobel, J.: Indexing and retrieval for genomic databases. IEEE Transactions on Knowledge and Data Engineering 14, 63–78 (2002)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cao, X., Li, S.C., Tung, A.K.H. (2005). Indexing DNA Sequences Using q-Grams. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_4

Download citation

  • DOI: https://doi.org/10.1007/11408079_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25334-1

  • Online ISBN: 978-3-540-32005-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics