Encyclopedia of Big Data Technologies

2019 Edition
Editors: Sherif Sakr, Albert Y. Zomaya

Similarity Sketching

  • Rasmus Pagh
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_58

Synonyms

Overview

Similarity between a pair of objects, usually expressed as a similarity score in [0, 1], is a key concept when dealing with noisy or uncertain data, as is common in big data applications.

The aim of similarity sketching is to estimate similarities in a (high-dimensional) space using fewer computational resources (time and/or storage) than a naïve approach that stores unprocessed objects. This is achieved using a form of lossy compression that produces succinct representations of objects in the space, from which similarities can be estimated. In some spaces, it is more natural to consider distances rather than similarities; we will consider both of these measures of proximity in the following.
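As a concrete illustration of this idea, the following is a minimal sketch, assuming that the objects are sets and that the similarity of interest is the Jaccard similarity |A ∩ B| / |A ∪ B|. It uses minwise hashing in the spirit of Broder (1997): each set is compressed to k minimum hash values, and the fraction of positions where two sketches agree is an estimate of the Jaccard similarity. The function names, the default k = 256, and the use of salted BLAKE2b as a stand-in for random hash functions are choices made for this example only and are not taken from the entry.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |a ∩ b| / |a ∪ b| (the naive baseline)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item, i):
    """i-th hash of an item as a 64-bit integer.

    Assumption: salted BLAKE2b behaves like an independent random
    hash function for each value of i.
    """
    salt = i.to_bytes(8, "little")
    data = str(item).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, k=256):
    """Compress a non-empty set to k integers: the minimum of each hash."""
    items = list(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(x, i) for x in items) for i in range(k)]


def estimate_similarity(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of agreeing positions."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(u == v for u, v in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    a = set(range(0, 300))
    b = set(range(150, 450))
    print("exact:   ", jaccard(a, b))
    print("estimate:", estimate_similarity(minhash_sketch(a), minhash_sketch(b)))
```

Each sketch has a fixed size of k integers regardless of how large the original set is, which is where the storage saving comes from. The price is an estimation error whose standard deviation shrinks roughly as 1/√k.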

Definitions

Formally, consider a space X of objects and a function d : X × X → ℝ₊. We refer to d as a distance function for X. Similarity sketching with respect to (X, d) is done by using a sketching function c...

Notes

Acknowledgements

This work received support from the European Research Council under the European Union’s 7th Framework Programme (FP7/2007-2013)/ ERC grant agreement no. 614331.

References

  1. Andoni A, Indyk P (2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun ACM 51(1):117–122
  2. Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of compression and complexity of sequences. IEEE, pp 21–29
  3. Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the web. Comput Netw ISDN Syst 29(8):1157–1166
  4. Charikar M (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of symposium on theory of computing (STOC), pp 380–388
  5. Chierichetti F, Kumar R (2015) LSH-preserving functions and their applications. J ACM 62(5):33
  6. Dahlgaard S, Knudsen MBT, Thorup M (2017) Fast similarity sketching. In: Proceedings of symposium on foundations of computer science (FOCS), pp 663–671
  7. Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of conference on very large databases (VLDB), pp 518–529
  8. Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128
  9. Li P, König AC (2011) Theory and applications of b-bit minwise hashing. Commun ACM 54(8):101–109
  10. Li P, Owen AB, Zhang C (2012) One permutation hashing. In: Advances in neural information processing systems (NIPS), pp 3122–3130
  11. Mitzenmacher M, Pagh R, Pham N (2014) Efficient estimation for high similarities using odd sketches. In: Proceedings of international world wide web conference (WWW), pp 109–118
  12. Rahimi A, Recht B (2007) Random features for large-scale kernel machines. In: Advances in neural information processing systems (NIPS), pp 1177–1184
  13. Thorup M (2013) Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In: Proceedings of symposium on theory of computing (STOC). ACM, pp 371–380
  14. Wang J, Zhang T, Song J, Sebe N, Shen HT (2017) A survey on learning to hash. IEEE Trans Pattern Anal Mach Intell 13(9). https://doi.org/10.1109/TPAMI.2017.2699960

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Computer Science Department, IT University of Copenhagen, Copenhagen S, Denmark