Index Structures for Biological Sequences
Biological sequence databases are mainly composed of DNA, RNA, and protein sequences. DNA and RNA sequences are polymers of nucleotides, whereas proteins are polymers of amino acids. A database of biological sequences contains a set of sequences of the same type. The length of each sequence varies from less than a hundred to several hundred million bases. An index structure on such a database helps to quickly identify the sequences in it that are similar to a given query sequence. The definition of similarity depends on two orthogonal parameters: the similarity function and the length of the similarity of interest.
The simplest similarity function is the edit distance, which measures the minimum number of substitutions, insertions, and deletions needed to transform one sequence into the other. More complex functions involve variable gap penalties and substitution scores based on how frequently substitutions are observed in nature. The length of the...
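As an illustration of the edit distance described above, the following is a minimal sketch of the standard dynamic-programming computation with unit costs for substitutions, insertions, and deletions. The function name `edit_distance` and the two-row memory layout are choices made for this example, not something prescribed by the text, and the sketch does not cover variable gap penalties or substitution scores.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit (Levenshtein) distance between sequences a and b."""
    # Keep the shorter sequence in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # prev[j] holds the distance between the first i-1 symbols of a
    # and the first j symbols of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i symbols of a into the empty prefix takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
    print(edit_distance("kitten", "sitting"))
```

Computing this for a query against every database sequence takes time proportional to the product of the two lengths per comparison, which is the cost an index structure is meant to reduce.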