Indexing nucleotide databases for fast query evaluation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1057)

Abstract

A query to a nucleotide database is a DNA sequence. Answers are similar sequences, that is, sequences with a high-quality local alignment. Existing techniques for finding answers use exhaustive search, but it is likely that, with increasing database size, these algorithms will become prohibitively expensive. We have developed a partitioned search approach, in which local alignment string matching techniques are used in tandem with an index. We show that fixed-length substrings, or intervals, are a suitable basis for indexing in conjunction with local alignment on likely answers. By use of suitable compression techniques, the index size is held to an acceptable level, and queries can be evaluated several times more quickly than with exhaustive search techniques.
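
The partitioned search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the interval length `k`, the ranking by shared-interval count, the Smith-Waterman scoring parameters, and all function names are assumptions made for illustration, and index compression is omitted. The sketch builds an inverted index over overlapping fixed-length intervals, uses it to select likely answers, and then runs local alignment only on those candidates.

```python
from collections import Counter, defaultdict


def intervals(seq, k):
    """Yield every overlapping length-k substring (interval) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(database, k):
    """Map each interval to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in enumerate(database):
        for iv in intervals(seq, k):
            index[iv].add(seq_id)
    return index


def coarse_search(index, query, k, top):
    """Rank sequences by how many distinct query intervals they share.

    Assumption: a simple count is used here; the abstract does not
    specify the ranking function.
    """
    counts = Counter()
    for iv in set(intervals(query, k)):
        for seq_id in index.get(iv, ()):
            counts[seq_id] += 1
    return [seq_id for seq_id, _ in counts.most_common(top)]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def search(database, index, query, k=4, top=2):
    """Coarse index lookup, then local alignment on the candidates only."""
    candidates = coarse_search(index, query, k, top)
    scored = [(local_alignment_score(query, database[c]), c) for c in candidates]
    return sorted(scored, reverse=True)


# Toy database: sequences 0 and 2 contain the query, sequence 1 does not.
database = ["ACGTACGTGACC", "TTTTGGGGCCCC", "GGACGTACGAAT"]
index = build_index(database, k=4)
results = search(database, index, "ACGTACG")
```

Only the sequences returned by `coarse_search` are aligned, which is where the saving over exhaustive search would come from; the cost is that a similar sequence sharing no exact interval with the query is never examined.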

Editor information

Peter Apers, Mokrane Bouzeghoub, Georges Gardarin

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Williams, H., Zobel, J. (1996). Indexing nucleotide databases for fast query evaluation. In: Apers, P., Bouzeghoub, M., Gardarin, G. (eds) Advances in Database Technology — EDBT '96. EDBT 1996. Lecture Notes in Computer Science, vol 1057. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0014158

  • DOI: https://doi.org/10.1007/BFb0014158

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-61057-1

  • Online ISBN: 978-3-540-49943-5

  • eBook Packages: Springer Book Archive
