Indexing DNA Sequences Using q-Grams

Cao, Xia; Li, Shuai Cheng; Tung, Anthony K. H.

doi:10.1007/11408079_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3453)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

We have observed growing interest in recent years in similarity search on large collections of biological sequences. This paper presents a method for indexing DNA sequences efficiently based on q-grams, which facilitates similarity search in a DNA database and avoids a linear scan of the entire database. A two-level index, consisting of a hash table and c-trees, is proposed based on the q-grams of the DNA sequences. The proposed data structures allow quick detection of sequences within a certain distance of the query sequence. Experimental results show that our method detects similar regions in a DNA sequence database efficiently and with high sensitivity.
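
As a rough illustration of the hash-table level only, the Python sketch below maps every q-gram (a substring of length q) to the positions where it occurs, then uses the table to shortlist sequences that share many q-gram hits with a query, without scanning the whole database. It is a generic q-gram index under stated assumptions, not the authors' implementation: the function names, the toy sequences, and the `min_shared` threshold are hypothetical, and the c-tree level is omitted because the abstract does not describe it.

```python
from collections import defaultdict


def qgrams(seq, q):
    """Yield (offset, q-gram) for every window of length q in seq."""
    for i in range(len(seq) - q + 1):
        yield i, seq[i:i + q]


def build_index(sequences, q):
    """Hash table: q-gram -> list of (sequence id, offset) occurrences."""
    index = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for off, gram in qgrams(seq, q):
            index[gram].append((sid, off))
    return index


def candidates(index, query, q, min_shared):
    """Ids of sequences with at least min_shared q-gram hits for the query.

    Hits are counted with multiplicity, so this is a coarse filter;
    surviving candidates would still need verification by alignment.
    """
    hits = defaultdict(int)
    for _, gram in qgrams(query, q):
        # .get avoids inserting empty entries into the defaultdict
        for sid, _ in index.get(gram, ()):
            hits[sid] += 1
    return sorted(sid for sid, n in hits.items() if n >= min_shared)


# Toy database (hypothetical data, for illustration only)
db = ["ACGTACGTGA", "TTTTGGGGCC", "ACGTTCGTGA"]
index = build_index(db, q=3)
result = candidates(index, "ACGTACGT", q=3, min_shared=3)
print(result)
```

Only the table entries for the query's own q-grams are visited, which is what lets an index of this kind skip sequences that share nothing with the query.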

Author information

Authors and Affiliations

Department of Computer Science, National University of Singapore
Xia Cao, Shuai Cheng Li & Anthony K. H. Tung

Editor information

Editors and Affiliations

Research Institute of Information Technology, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Lizhu Zhou
National University of Singapore, Singapore
Beng Chin Ooi
School of Information, Renmin University of China
Xiaofeng Meng

About this paper

Cite this paper

Cao, X., Li, S.C., Tung, A.K.H. (2005). Indexing DNA Sequences Using q-Grams. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_4

DOI: https://doi.org/10.1007/11408079_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25334-1
Online ISBN: 978-3-540-32005-0
eBook Packages: Computer Science, Computer Science (R0)
