Distributed and Parallel Databases

, Volume 32, Issue 3, pp 315–335 | Cite as

A unifying framework for 0-sampling algorithms



The problem of building an 0-sampler is to sample near-uniformly from the support set of a dynamic multiset. This problem has a variety of applications within data analysis, computational geometry and graph algorithms. In this paper, we abstract a set of steps for building an 0-sampler, based on sampling, recovery and selection. We analyze the implementation of an 0-sampler within this framework, and show how prior constructions of 0-samplers can all be expressed in terms of these steps. Our experimental contribution is to provide a first detailed study of the accuracy and computational cost of 0-samplers.


Data summarization Sampling Distinct elements Graph sketches 


  1. 1.
    Achlioptas, D.: Database-friendly random projections. In: ACM Principles of Database Systems, pp. 274–281 (2001) Google Scholar
  2. 2.
    Ahn, K.J., Guha, S., McGregor, A.: Analyzing graph structure via linear measurements. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 459–467 (2012) CrossRefGoogle Scholar
  3. 3.
    Barkay, N., Porat, E., Shalem, B.: Feasible Sampling of Non-strict Turnstile Data Streams (2012). arXiv:1209.5566
  4. 4.
    Beyer, K., Gemulla, R., Haas, P.J., Reinwald, B., Sismanis, Y.: Distinct-value synopses for multiset operations. Commun. ACM 52(10), 87–95 (2009) CrossRefGoogle Scholar
  5. 5.
    Cormode, G., Firmani, D.: On unifying the space of 0 sampling algorithms. In: Meeting on Algorithm Engineering & Experiments, pp. 163–172 (2013) Google Scholar
  6. 6.
    Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: International Conference on Very Large Data Bases, pp. 3–20 (2008) Google Scholar
  7. 7.
    Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: ACM SIGMOD International Conference on Management of Data, pp. 35–46 (2004) Google Scholar
  8. 8.
    Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: International Conference on Very Large Data Bases, pp. 25–36 (2005) Google Scholar
  9. 9.
    Cormode, G., Garofalakis, M., Haas, P., Jermaine, C.: Synposes for Massive Data: Samples, Histograms, Wavelets and Sketches. Now Publishers, Hanover (2012) Google Scholar
  10. 10.
    Dasgupta, S., Gupta, A.: An Elementary Proof of the Johnson–Lindenstrauss Lemma. International Computer Science Institute, Berkeley (1999). Tech. Rep. TR-99-006 Google Scholar
  11. 11.
    Eppstein, D., Goodrich, M.T.: Space-efficient straggler identification in round-trip data streams via Newton’s identitities and invertible Bloom filters. In: Workshop on Algorithms and Data Structures, pp. 637–648 (2007) CrossRefGoogle Scholar
  12. 12.
    Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. In: Symposium on Computational Geometry, pp. 142–149 (2005) Google Scholar
  13. 13.
    Ganguly, S.: Counting distinct items over update streams. Theor. Comput. Sci. 378(3), 211–222 (2007) CrossRefMATHMathSciNetGoogle Scholar
  14. 14.
    Gilbert, A.C., Strauss, M.J., Tropp, J.A., Vershynin, R.: One sketch for all: fast algorithms for compressed sensing. In: ACM Symposium on Theory of Computing, pp. 237–246 (2007) Google Scholar
  15. 15.
    Indyk, P.: A small approximately min-wise independent family of hash functions. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 454–456 (1999) Google Scholar
  16. 16.
    Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM Symposium on Theory of Computing, pp. 604–613 (1998) Google Scholar
  17. 17.
    Johnson, W., Lindenstrauss, J.: Extensions of Lipshitz mapping into Hilbert space. Contemp. Math. 26, 189–206 (1984) CrossRefMATHMathSciNetGoogle Scholar
  18. 18.
    Jowhari, H., Sağlam, M., Tardos, G.: Tight bounds for l p samplers, finding duplicates in streams, and related problems. In: ACM Principles of Database Systems, pp. 49–58 (2011) Google Scholar
  19. 19.
    Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: ACM Principles of Database Systems, pp. 41–52 (2010) Google Scholar
  20. 20.
    Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng. 68(4), 415–430 (2009) CrossRefGoogle Scholar
  21. 21.
    Metwally, A., Agrawal, D., El Abbadi, A.: Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In: EDBT, pp. 618–629 (2008) CrossRefGoogle Scholar
  22. 22.
    Monemizadeh, M., Woodruff, D.P.: 1-pass relative-error l p-sampling with applications. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1143–1160 (2010) CrossRefGoogle Scholar
  23. 23.
    Nisan, N.: Pseudorandom generators for space-bounded computations. In: ACM Symposium on Theory of Computing, pp. 204–212 (1990) Google Scholar
  24. 24.
    Patrascu, M., Thorup, M.: The power of simple tabulation hashing. In: ACM Symposium on Theory of Computing, pp. 1–10 (2011) Google Scholar
  25. 25.
    Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with sawzall. Sci. Program. 13(4), 277–298 (2005) Google Scholar
  26. 26.
    Price, E.: Efficient sketches for the set query problem. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 41–56 (2011) CrossRefGoogle Scholar
  27. 27.
    Schmidt, J.P., Siegel, A., Srinivasan, A.: Chernoff–Hoeffding bounds for applications with limited independence. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 331–340 (1993) Google Scholar

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. 1.University of WarwickCoventryUK
  2. 2.Sapienza University of RomeRomeItaly

Personalised recommendations