Efficient and Scalable Indexing Techniques for Biological Sequence Data

Halachev, Mihail; Shiri, Nematollaah; Thamildurai, Anand

doi:10.1007/978-3-540-71233-6_36

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4414)

Included in the following conference series:

International Conference on Bioinformatics Research and Development

Abstract

We investigate indexing techniques for sequence data, which are crucial in a wide variety of applications that require efficient, scalable, and versatile search algorithms. Recent research has focused on suffix trees (ST) and suffix arrays (SA) as desirable index representations. Existing solutions for very long sequences, however, provide either efficient index construction or efficient search, but not both. We propose a new ST representation, STTD64, which has reasonable construction time and storage requirements and is efficient in search. We have implemented the construction and search algorithms for the proposed technique and conducted numerous experiments to evaluate its performance on various types of real sequence data. Our results show that while the construction time of STTD64 is comparable to that of current ST-based techniques, it outperforms them in search. Compared to ESA, the best known SA technique, STTD64 is slower to construct but has a similar space requirement and comparable search time. Unlike ESA, which is memory based, STTD64 is scalable and can handle very long sequences.
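
To make the notion of a suffix-based index concrete, the following is a minimal sketch of a plain suffix array with exact-match search by binary search. It is not STTD64 or ESA, neither of which is specified in this abstract; the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction is an assumption chosen for brevity that would not scale to the very long sequences the paper targets.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction (sorting suffix strings); suitable for short inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of every exact occurrence of pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    dna = "ACGTACGT"  # hypothetical toy sequence
    index = build_suffix_array(dna)
    print(find_occurrences(dna, index, "CGT"))
```

Each query costs O(m log n) character comparisons for a pattern of length m over a text of length n, which illustrates why index construction cost and search cost are treated as separate concerns in the abstract.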

Author information

Authors and Affiliations

Dept. of Computer Science and Software Engineering, Concordia University, Montreal, Canada
Mihail Halachev, Nematollaah Shiri & Anand Thamildurai

Editor information

Sepp Hochreiter, Roland Wagner

About this paper

Cite this paper

Halachev, M., Shiri, N., Thamildurai, A. (2007). Efficient and Scalable Indexing Techniques for Biological Sequence Data. In: Hochreiter, S., Wagner, R. (eds) Bioinformatics Research and Development. BIRD 2007. Lecture Notes in Computer Science, vol 4414. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71233-6_36

DOI: https://doi.org/10.1007/978-3-540-71233-6_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71232-9
Online ISBN: 978-3-540-71233-6
eBook Packages: Computer Science, Computer Science (R0)
