A unifying framework for ℓ₀-sampling algorithms

Distributed and Parallel Databases

Abstract

The problem of building an ℓ₀-sampler is to sample near-uniformly from the support set of a dynamic multiset. This problem has a variety of applications within data analysis, computational geometry and graph algorithms. In this paper, we abstract a set of steps for building an ℓ₀-sampler, based on sampling, recovery and selection. We analyze the implementation of an ℓ₀-sampler within this framework, and show how prior constructions of ℓ₀-samplers can all be expressed in terms of these steps. Our experimental contribution is to provide a first detailed study of the accuracy and computational cost of ℓ₀-samplers.
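The three steps named in the abstract (sampling, recovery, selection) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pairwise-independent linear hash, exact 1-sparse recovery at each level in place of the more general sparse recovery a full construction would use, and non-negative integer item identifiers. The names `OneSparse` and `L0Sampler` and all parameters are hypothetical.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as modulus for hashing and fingerprints


class OneSparse:
    """Linear 1-sparse recovery: three counters updated per (item, delta)."""

    def __init__(self, z):
        self.z = z  # random evaluation point for the fingerprint
        self.w = 0  # sum of counts
        self.s = 0  # sum of count * item
        self.f = 0  # sum of count * z**item, kept modulo P

    def update(self, item, delta):
        self.w += delta
        self.s += delta * item
        self.f = (self.f + delta * pow(self.z, item, P)) % P

    def recover(self):
        """Return (item, count) if the summarized vector passes the 1-sparse test."""
        if self.w == 0 or self.s % self.w != 0:
            return None
        item = self.s // self.w
        if item < 0 or (self.w * pow(self.z, item, P)) % P != self.f:
            return None
        return item, self.w


class L0Sampler:
    """Illustrative sampler: geometric subsampling plus 1-sparse recovery."""

    def __init__(self, levels=32, seed=0):
        rng = random.Random(seed)
        self.a = rng.randrange(1, P)
        self.b = rng.randrange(P)
        z = rng.randrange(2, P)
        self.levels = [OneSparse(z) for _ in range(levels)]

    def update(self, item, delta):
        h = (self.a * item + self.b) % P
        # Sampling: level j keeps the item only if its hash is below P / 2**j.
        for j, sketch in enumerate(self.levels):
            if h >= (P >> j):
                break
            sketch.update(item, delta)

    def sample(self):
        # Recovery and selection: return the item from the first level
        # whose surviving vector is recognized as 1-sparse.
        for sketch in self.levels:
            hit = sketch.recover()
            if hit is not None:
                return hit[0]
        return None  # the sampler may fail; callers should handle this


sampler = L0Sampler(seed=1)
for i in range(100):
    sampler.update(i, 1)       # insert items 0..99
for i in range(100):
    if i != 42:
        sampler.update(i, -1)  # delete everything except item 42
print(sampler.sample())
```

When a level is recognized as 1-sparse, its single surviving item is the one with the smallest hash value in the support, which is why the output is close to uniform over the support under the randomness of the hash function. A practical construction would also bound the failure probability, which this sketch does not attempt.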

Notes

  1. More generally, we also seek solutions so that, given sketches of vectors a and b, we can form a sketch of (a+b) and sample from the ℓ₀-distribution on (a+b). All the algorithms that we discuss have this property.

  2. We note that tighter bounds are possible via a similar construction and a more involved analysis: adapting the approach of [11] improves the log term from log(s/δ_r) to log(1/δ_r), and the analysis of [26] further improves it to log_s(1/δ_r).

  3. Jowhari et al. [18] first present their algorithm assuming a random oracle, and then they remove this assumption through the use of the pseudo-random generator of Nisan [23].

  4. This level is ⌈log(2N/k)⌉ for the ℓ₀-sampler with k-wise independence, and ⌈log N/ϵ⌉ for the variant with pairwise independence.
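The mergeability property in note 1 follows whenever a sketch is a linear function of the input vector. The following minimal sketch illustrates this with a three-counter 1-sparse recovery summary; it is an assumption-laden illustration rather than any of the constructions analyzed in the paper. The function names `sketch`, `merge` and `recover`, and the fixed evaluation point `Z`, are hypothetical; in practice `Z` would be chosen at random and shared by all parties.

```python
P = (1 << 61) - 1  # prime modulus for the fingerprint
Z = 1234567        # shared evaluation point (illustrative fixed value)


def sketch(vec):
    """Summarize a dict {item: count} as (sum of counts, weighted sum, fingerprint)."""
    w = sum(vec.values())
    s = sum(c * i for i, c in vec.items())
    f = sum(c * pow(Z, i, P) for i, c in vec.items()) % P
    return (w, s, f)


def merge(x, y):
    """Sketch of a+b, computed from the sketches of a and b alone."""
    return (x[0] + y[0], x[1] + y[1], (x[2] + y[2]) % P)


def recover(sk):
    """Return the single item if the summarized vector passes the 1-sparse test."""
    w, s, f = sk
    if w == 0 or s % w != 0:
        return None
    i = s // w
    if i < 0 or (w * pow(Z, i, P)) % P != f:
        return None
    return i


a = {5: 1, 9: 1}  # one party observes insertions of items 5 and 9
b = {9: -1}       # another party observes the deletion of item 9
merged = merge(sketch(a), sketch(b))

print(merged == sketch({5: 1}))  # the merged sketch equals the sketch of a+b
print(recover(sketch(a)))        # a alone is not 1-sparse, so recovery declines
print(recover(merged))           # a+b has support {5}
```

Because every counter is a sum over updates, the order of updates and the party that observed them do not matter, which is what allows sampling from the support of (a+b) without access to a or b.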

References

  1. Achlioptas, D.: Database-friendly random projections. In: ACM Principles of Database Systems, pp. 274–281 (2001)

  2. Ahn, K.J., Guha, S., McGregor, A.: Analyzing graph structure via linear measurements. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 459–467 (2012)

  3. Barkay, N., Porat, E., Shalem, B.: Feasible Sampling of Non-strict Turnstile Data Streams (2012). arXiv:1209.5566

  4. Beyer, K., Gemulla, R., Haas, P.J., Reinwald, B., Sismanis, Y.: Distinct-value synopses for multiset operations. Commun. ACM 52(10), 87–95 (2009)

  5. Cormode, G., Firmani, D.: On unifying the space of ℓ₀-sampling algorithms. In: Meeting on Algorithm Engineering & Experiments, pp. 163–172 (2013)

  6. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: International Conference on Very Large Data Bases, pp. 3–20 (2008)

  7. Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: ACM SIGMOD International Conference on Management of Data, pp. 35–46 (2004)

  8. Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: International Conference on Very Large Data Bases, pp. 25–36 (2005)

  9. Cormode, G., Garofalakis, M., Haas, P., Jermaine, C.: Synopses for Massive Data: Samples, Histograms, Wavelets and Sketches. Now Publishers, Hanover (2012)

  10. Dasgupta, S., Gupta, A.: An Elementary Proof of the Johnson–Lindenstrauss Lemma. International Computer Science Institute, Berkeley (1999). Tech. Rep. TR-99-006

  11. Eppstein, D., Goodrich, M.T.: Space-efficient straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters. In: Workshop on Algorithms and Data Structures, pp. 637–648 (2007)

  12. Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. In: Symposium on Computational Geometry, pp. 142–149 (2005)

  13. Ganguly, S.: Counting distinct items over update streams. Theor. Comput. Sci. 378(3), 211–222 (2007)

  14. Gilbert, A.C., Strauss, M.J., Tropp, J.A., Vershynin, R.: One sketch for all: fast algorithms for compressed sensing. In: ACM Symposium on Theory of Computing, pp. 237–246 (2007)

  15. Indyk, P.: A small approximately min-wise independent family of hash functions. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 454–456 (1999)

  16. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM Symposium on Theory of Computing, pp. 604–613 (1998)

  17. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mapping into Hilbert space. Contemp. Math. 26, 189–206 (1984)

  18. Jowhari, H., Sağlam, M., Tardos, G.: Tight bounds for L_p samplers, finding duplicates in streams, and related problems. In: ACM Principles of Database Systems, pp. 49–58 (2011)

  19. Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: ACM Principles of Database Systems, pp. 41–52 (2010)

  20. Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng. 68(4), 415–430 (2009)

  21. Metwally, A., Agrawal, D., El Abbadi, A.: Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In: EDBT, pp. 618–629 (2008)

  22. Monemizadeh, M., Woodruff, D.P.: 1-pass relative-error L_p-sampling with applications. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1143–1160 (2010)

  23. Nisan, N.: Pseudorandom generators for space-bounded computations. In: ACM Symposium on Theory of Computing, pp. 204–212 (1990)

  24. Patrascu, M., Thorup, M.: The power of simple tabulation hashing. In: ACM Symposium on Theory of Computing, pp. 1–10 (2011)

  25. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with Sawzall. Sci. Program. 13(4), 277–298 (2005)

  26. Price, E.: Efficient sketches for the set query problem. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 41–56 (2011)

  27. Schmidt, J.P., Siegel, A., Srinivasan, A.: Chernoff–Hoeffding bounds for applications with limited independence. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 331–340 (1993)
Author information

Corresponding author

Correspondence to Graham Cormode.

Additional information

Communicated by: Feifei Li and Suman Nath.

This paper is an extended version of [5].

About this article

Cite this article

Cormode, G., Firmani, D. A unifying framework for ℓ₀-sampling algorithms. Distrib Parallel Databases 32, 315–335 (2014). https://doi.org/10.1007/s10619-013-7131-9

  • DOI: https://doi.org/10.1007/s10619-013-7131-9