Data Structures for Accelerating Tanimoto Queries on Real Valued Vectors

Kristensen, Thomas G.; Pedersen, Christian N. S.

doi:10.1007/978-3-642-15294-8_3

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6293)

Abstract

Previous methods for accelerating Tanimoto queries have relied on representing molecules as bit strings. No work has examined how to accelerate Tanimoto queries on real valued descriptors, even though these offer a much more fine-grained measure of similarity between molecules. This study uses a recently discovered reduction from Tanimoto queries to distance queries in Euclidean space to accelerate Tanimoto queries with standard metric data structures. The experiments show that a significant speedup is possible and that, on vectors generated from molecular data, general metric data structures are better suited than a data structure tailored for Euclidean space.
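
The reduction mentioned in the abstract can be illustrated with a small sketch. It assumes the usual Tanimoto coefficient for real valued vectors, T(A, B) = A·B / (|A|² + |B|² − A·B), and a threshold t with 0 < t ≤ 1. Rearranging T(A, B) ≥ t and completing the square gives |B − cA|² ≤ (c² − 1)|A|² with c = (1 + t) / (2t), so a Tanimoto query becomes a Euclidean ball query. This derivation and all function names below are illustrative assumptions, not taken from the paper, whose formulation may differ. The linear scan merely stands in for a metric data structure.

```python
import math


def tanimoto(a, b):
    """Tanimoto coefficient of two real valued vectors (not both zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom


def query_ball(a, t):
    """Centre and radius of the Euclidean ball equivalent to T(a, b) >= t.

    Hypothetical helper: derived by rearranging the Tanimoto inequality,
    valid for 0 < t <= 1.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("threshold t must satisfy 0 < t <= 1")
    c = (1.0 + t) / (2.0 * t)
    center = [c * x for x in a]
    norm_a = math.sqrt(sum(x * x for x in a))
    # max() guards against a tiny negative value from rounding near t = 1.
    radius = math.sqrt(max(0.0, c * c - 1.0)) * norm_a
    return center, radius


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def tanimoto_range_query(database, a, t):
    """Return all b in database with T(a, b) >= t via the ball query.

    The linear scan is a placeholder; a metric tree could answer the
    same range query while pruning most of the database.
    """
    center, radius = query_ball(a, t)
    return [b for b in database if euclidean(b, center) <= radius]
```

Because the query is now an ordinary range query under Euclidean distance, any index that supports range queries in a metric space can be substituted for the scan without changing the result set.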

Author information

Authors and Affiliations

Bioinformatics Research Center, Aarhus University, C. F. Møllers Alle 8, 8000, Aarhus C., Denmark
Thomas G. Kristensen & Christian N. S. Pedersen

Editor information

Editors and Affiliations

School of Computing Sciences, University of East Anglia, NR 7TJ, Norwich, UK
Vincent Moulton
Lewis-Sigler Institute for Integrative Genomics, Department of Computer Science, Princeton University, NJ 08544, Princeton, USA
Mona Singh

About this paper

Cite this paper

Kristensen, T.G., Pedersen, C.N.S. (2010). Data Structures for Accelerating Tanimoto Queries on Real Valued Vectors. In: Moulton, V., Singh, M. (eds) Algorithms in Bioinformatics. WABI 2010. Lecture Notes in Computer Science, vol 6293. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15294-8_3

DOI: https://doi.org/10.1007/978-3-642-15294-8_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15293-1
Online ISBN: 978-3-642-15294-8
eBook Packages: Computer Science, Computer Science (R0)
