Storage and Retrieval of Individual Genomes
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the Human Genome. However, this reduction factor remains constant when more sequences are added to the collection.
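As a small illustration of the setting described above, the following sketch stores a collection as one base sequence plus a list of edit operations per copy, and rebuilds the copies on demand. The edit format `(position, kind, payload)`, the function name `apply_edits`, and the toy sequences are assumptions made for this example only; they are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical edit format: (position in the base, kind, payload).
#   "sub": replace base[position] with payload
#   "ins": insert payload before base[position]
#   "del": delete base[position] (payload is ignored)
Edit = Tuple[int, str, str]


def apply_edits(base: str, edits: List[Edit]) -> str:
    """Rebuild one copy from the base sequence.

    Assumes edits refer to distinct base positions; they are applied
    in increasing order of position.
    """
    out = []
    cursor = 0  # next base position that has not been copied yet
    for pos, kind, payload in sorted(edits):
        out.append(base[cursor:pos])  # unchanged stretch of the base
        if kind == "sub":
            out.append(payload)
            cursor = pos + 1
        elif kind == "ins":
            out.append(payload)
            cursor = pos
        elif kind == "del":
            cursor = pos + 1
        else:
            raise ValueError(f"unknown edit kind: {kind}")
    out.append(base[cursor:])
    return "".join(out)


base = "ACGTACGTAC"  # toy base sequence, n = 10
variants = [
    [],                                # an exact copy
    [(3, "sub", "A")],                 # one substitution
    [(5, "ins", "TT"), (8, "del", "")],  # one insertion, one deletion
]

copies = [apply_edits(base, edits) for edits in variants]
n = len(base)
N = sum(len(c) for c in copies)        # total length of the collection
s = sum(len(edits) for edits in variants)  # number of variations
print(copies, n, N, s)
```

In this representation the explicit data is the base sequence and the s edits, while the collection itself has total length N. An index whose size tracks n and s rather than N exploits exactly this gap.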
We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N/n.
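One way to see why a dependence on n and s rather than N is plausible is to count the runs of equal symbols in the Burrows-Wheeler transform of a repetitive collection. The sketch below is only an illustration of that effect, not necessarily the construction developed in this work. The naive rotation-sorting `bwt`, the `$` separator between copies, the random toy data, and all parameter values are assumptions chosen for the example.

```python
import random


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations.

    Quadratic in memory, so only suitable for small illustrative inputs.
    """
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    # The last symbol of the rotation starting at i is text[i - 1].
    return "".join(text[i - 1] for i in order)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive symbols."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def mutate(seq: str, rng: random.Random) -> str:
    """Return a copy of seq with one random substitution."""
    pos = rng.randrange(len(seq))
    new = rng.choice([c for c in "ACGT" if c != seq[pos]])
    return seq[:pos] + new + seq[pos + 1:]


rng = random.Random(0)   # fixed seed so the experiment is repeatable
n, copies = 200, 20      # toy sizes, chosen only to keep the run fast

base = "".join(rng.choice("ACGT") for _ in range(n))
# Each copy differs from the base by a single substitution; '$' separates copies.
collection = "".join(mutate(base, rng) + "$" for _ in range(copies))
N = len(collection)

runs_base = count_runs(bwt(base + "$"))
runs_collection = count_runs(bwt(collection))
print(f"n = {n}, N = {N}")
print(f"BWT runs, base only:  {runs_base}")
print(f"BWT runs, collection: {runs_collection}")
```

In such an experiment the collection is about 20 times longer than the base, yet the number of runs grows far more slowly than N, because suffixes of near-identical copies sort next to each other and are preceded by the same symbol. A representation that stores one entry per run therefore scales with the base sequence and the variations rather than with the total length.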
We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.
Keywords: Comparative genomics, full-text indexing, suffix tree, compressed data structures