Efficient and Scalable Indexing Techniques for Biological Sequence Data

  • Conference paper
Bioinformatics Research and Development (BIRD 2007)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4414)

Abstract

We investigate indexing techniques for sequence data, which are crucial in a wide variety of applications that require efficient, scalable, and versatile search algorithms. Recent research has focused on suffix trees (ST) and suffix arrays (SA) as desirable index representations. For very long sequences, however, existing solutions provide either efficient index construction or efficient search, but not both. We propose a new ST representation, STTD64, which has reasonable construction time and storage requirements and supports efficient search. We have implemented the construction and search algorithms for the proposed technique and conducted numerous experiments to evaluate its performance on various types of real sequence data. Our results show that while the construction time of STTD64 is comparable to that of current ST-based techniques, it outperforms them in search. Compared to ESA, the best known SA technique, STTD64 is slower to construct but has a similar space requirement and comparable search time. Unlike ESA, which is memory-based, STTD64 is scalable and can handle very long sequences.
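To make the idea of an index over sequence data concrete, the following is a minimal sketch of exact-match search with a plain suffix array. It is not the STTD64 representation or the ESA technique discussed in the abstract; the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction is assumed only for illustration and would not scale to very long sequences.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction (assumption for illustration): sort positions by the
    suffix they start. Fine for short strings, far too slow for genomes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions where pattern occurs in text.

    Suffixes sharing the prefix `pattern` form one contiguous block in the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    dna = "ACGTACGTGA"
    sa = build_suffix_array(dna)
    print(find_occurrences(dna, sa, "ACG"))  # [0, 4]
    print(find_occurrences(dna, sa, "TT"))   # []
```

Once the index is built, each query costs a number of string comparisons logarithmic in the sequence length, which is the kind of construction-versus-search trade-off the abstract refers to.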

Editor information

Sepp Hochreiter, Roland Wagner

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Halachev, M., Shiri, N., Thamildurai, A. (2007). Efficient and Scalable Indexing Techniques for Biological Sequence Data. In: Hochreiter, S., Wagner, R. (eds) Bioinformatics Research and Development. BIRD 2007. Lecture Notes in Computer Science, vol 4414. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71233-6_36

  • DOI: https://doi.org/10.1007/978-3-540-71233-6_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71232-9

  • Online ISBN: 978-3-540-71233-6

  • eBook Packages: Computer Science, Computer Science (R0)
