Definition
Biological sequence databases are mainly composed of DNA, RNA, and protein sequences. DNA and RNA sequences are polymers of nucleotides, whereas proteins are polymers of amino acids. A database of biological sequences contains a set of biological sequences of the same type. The length of each sequence varies from fewer than a hundred to several hundred million bases. An index structure on a database of biological sequences helps to quickly identify the sequences in that database that are similar to a given query sequence. The definition of similarity depends on two orthogonal parameters: the similarity function and the length of the similarity of interest.
The simplest similarity function is the edit distance, which measures the minimum number of substitutions, insertions, and deletions needed to transform one sequence into the other. More complex functions involve variable gap penalties and substitution scores based on how frequently substitutions are observed in nature. The length of the...
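As an illustration of the edit distance described above, the following is a minimal sketch of the standard dynamic-programming computation with unit costs. The function name `edit_distance` and the choice of a cost of 1 for every substitution, insertion, and deletion are assumptions made for this example; the entry does not prescribe an implementation, and the more complex scoring schemes it mentions (variable gap penalties, substitution matrices) are not covered here.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between sequences a and b (illustrative sketch)."""
    # prev[j] = distance between the prefix of a processed so far and b[:j].
    # Initially the prefix of a is empty, so the distance is j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Transforming a[:i] into the empty string takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # assumed unit substitution cost
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("ACGT", "AGT")` returns 1, since deleting the single base `C` transforms the first sequence into the second. The computation takes time proportional to the product of the two sequence lengths, which is one reason an index is useful for avoiding a full comparison of the query against every database sequence.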
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Kahveci, T. (2018). Index Structures for Biological Sequences. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_1434
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9