Abstract
We investigate indexing techniques for sequence data, which are crucial in a wide variety of applications that require efficient, scalable, and versatile search algorithms. Recent research has focused on suffix trees (ST) and suffix arrays (SA) as desirable index representations. Existing solutions for very long sequences, however, provide either efficient index construction or efficient search, but not both. We propose a new ST representation, STTD64, which has reasonable construction time and storage requirements and is efficient in search. We have implemented the construction and search algorithms for the proposed technique and conducted numerous experiments to evaluate its performance on various types of real sequence data. Our results show that the construction time for STTD64 is comparable with that of current ST-based techniques, while its search performance is better. Compared to ESA, the best known SA technique, STTD64 is slower to construct but has a similar space requirement and comparable search time. Unlike ESA, which is memory-based, STTD64 is scalable and can handle very long sequences.
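For readers unfamiliar with index-based search, the following is a minimal sketch of exact-match search over a plain suffix array. It is not STTD64 and not ESA; it only illustrates the general idea of building an index once and then answering pattern queries by binary search. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction shown here is an assumption chosen for brevity that would not scale to the very long sequences the abstract discusses.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction by sorting the suffixes directly; suitable for
    short illustrative inputs only, not for genome-scale sequences.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions at which `pattern` occurs in `text`."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    sequence = "ACGTACGA"
    index = build_suffix_array(sequence)
    print(find_occurrences(sequence, index, "ACG"))
```

Each query costs a logarithmic number of prefix comparisons once the index exists, which is why construction time and search time are weighed separately when comparing index representations.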
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Halachev, M., Shiri, N., Thamildurai, A. (2007). Efficient and Scalable Indexing Techniques for Biological Sequence Data. In: Hochreiter, S., Wagner, R. (eds) Bioinformatics Research and Development. BIRD 2007. Lecture Notes in Computer Science, vol 4414. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71233-6_36
DOI: https://doi.org/10.1007/978-3-540-71233-6_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71232-9
Online ISBN: 978-3-540-71233-6
eBook Packages: Computer Science, Computer Science (R0)