Abstract
We study a method of seed-based lossless filtration for approximate string matching and related applications. The method is based on the simultaneous use of several spaced seeds, rather than a single seed as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
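To make the notion of a lossless seed family concrete, the following is a minimal sketch, not the algorithms of the paper. It assumes the common convention that a spaced seed is written as a string over '#' (position must match) and '-' (don't care), and that a family is lossless for a pair (m, k) when every alignment of length m with at most k mismatches is hit by at least one seed of the family. The function names `seed_hits` and `is_lossless` are hypothetical, and the check is a plain brute-force enumeration that is only practical for small m and k.

```python
from itertools import combinations


def seed_hits(seed, alignment):
    """Return True if `seed` matches `alignment` at some offset.

    `seed` is a string over '#' (must match) and '-' (don't care);
    `alignment` is a list of booleans, True meaning a matching position.
    """
    required = [i for i, c in enumerate(seed) if c == "#"]
    for offset in range(len(alignment) - len(seed) + 1):
        if all(alignment[offset + i] for i in required):
            return True
    return False


def is_lossless(family, m, k):
    """Brute-force check that `family` is lossless for (m, k).

    Only alignments with exactly min(k, m) mismatches are enumerated:
    turning a mismatch into a match can never destroy a hit, so these
    are the hardest cases.
    """
    for mismatches in combinations(range(m), min(k, m)):
        alignment = [True] * m
        for pos in mismatches:
            alignment[pos] = False
        if not any(seed_hits(seed, alignment) for seed in family):
            return False
    return True


if __name__ == "__main__":
    # Neither seed alone is lossless for (5, 2), but the pair is.
    print(is_lossless(["##"], 5, 2))
    print(is_lossless(["#-#"], 5, 2))
    print(is_lossless(["##", "#-#"], 5, 2))
```

The small example illustrates why a family can succeed where each single seed fails: with m = 5 and k = 2, the alignment with matches at positions 0, 2, 4 escapes `##`, and the one with matches at positions 0, 1, 4 escapes `#-#`, yet any three matching positions among five must contain two at distance 1 or 2.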
References
Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fundamenta Informaticae 56, 51–70 (2003); Preliminary version in Combinatorial Pattern Matching (2001)
Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences, p. 280. Cambridge University Press, Cambridge (2002) ISBN 0-521-81307-7
Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997)
Ma, B., Tromp, J., Li, M.: PatternHunter: Faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)
Schwartz, S., Kent, J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., Miller, W.: Human–mouse alignments with BLASTZ. Genome Research 13, 103–107 (2003)
Noé, L., Kucherov, G.: YASS: Similarity search in DNA sequences. Research Report RR-4852, INRIA (2003), http://www.inria.fr/rrrt/rr-4852.html
Pevzner, P., Waterman, M.: Multiple filtration and approximate pattern matching. Algorithmica 13, 135–154 (1995)
Califano, A., Rigoutsos, I.: Flash: A fast look-up algorithm for string homology. In: Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology, pp. 56–64 (1993)
Buhler, J.: Provably sensitive indexing strategies for biosequence similarity search. In: Proceedings of the 6th Annual International Conference on Computational Molecular Biology (RECOMB 2002), pp. 90–99. ACM Press, Washington (2002)
Keich, U., Li, M., Ma, B., Tromp, J.: On spaced seeds for similarity search. Discrete Applied Mathematics (2004) (to appear)
Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB 2003), pp. 67–75. ACM Press, Berlin (2003)
Brejova, B., Brown, D., Vinar, T.: Vector seeds: An extension to spaced seeds allows substantial improvements in sensitivity and specificity. In: Benson, G., Page, R.D.M. (eds.) WABI 2003. LNCS (LNBI), vol. 2812, pp. 39–54. Springer, Heidelberg (2003)
Kucherov, G., Noé, L., Ponty, Y.: Estimating seed sensitivity on homogeneous alignments. In: Proceedings of the IEEE 4th Symposium on Bioinformatics and Bioengineering (BIBE 2004), May 19–21. IEEE Computer Society Press, Los Alamitos (2004)
Choi, K., Zhang, L.: Sensitivity analysis and efficient method for identifying optimal spaced seeds. Journal of Computer and System Sciences (2003) (to appear)
Li, F., Stormo, G.: Selection of optimal DNA oligos for gene expression arrays. Bioinformatics 17, 1067–1076 (2001)
Kaderali, L., Schliep, A.: Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics 18, 1340–1349 (2002)
Rahmann, S.: Fast large scale oligonucleotide selection using the longest common factor approach. Journal of Bioinformatics and Computational Biology 1, 343–361 (2003)
Zheng, J., Close, T., Jiang, T., Lonardi, S.: Efficient selection of unique and popular oligos for large EST databases. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 384–401. Springer, Heidelberg (2003)
Burkhardt, S., Kärkkäinen, J.: One-gapped q-gram filters for Levenshtein distance. In: Apostolico, A., Takeda, M. (eds.) CPM 2002. LNCS, vol. 2373, pp. 225–234. Springer, Heidelberg (2002)
Li, M., Ma, B., Kisman, D., Tromp, J.: PatternHunter II: Highly sensitive and fast homology search. Journal of Bioinformatics and Computational Biology (2004); Earlier version in GIW 2003 (International Conference on Genome Informatics)
Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity search. In: Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), ACM Press, New York (2004)
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kucherov, G., Noé, L., Roytberg, M. (2004). Multi-seed Lossless Filtration. In: Sahinalp, S.C., Muthukrishnan, S., Dogrusoz, U. (eds) Combinatorial Pattern Matching. CPM 2004. Lecture Notes in Computer Science, vol 3109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27801-6_22
DOI: https://doi.org/10.1007/978-3-540-27801-6_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22341-2
Online ISBN: 978-3-540-27801-6
eBook Packages: Springer Book Archive