Abstract

Sample coordination, in which similar instances have similar samples, was proposed by statisticians four decades ago as a way to maximize overlap in repeated surveys. Coordinated sampling has since been used for summarizing massive data sets.
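
As an illustration only, the following is a minimal sketch of one common way to coordinate samples: every key receives a reproducible seed from a shared hash, and each instance keeps a key when its weight is large relative to that seed. The function names, the `salt` parameter, and the threshold `tau` are assumptions made for this sketch and are not taken from the paper.

```python
from __future__ import annotations

import hashlib


def shared_seed(key: str, salt: str = "demo") -> float:
    """Map a key to a reproducible pseudo-random number in [0, 1).

    The same (salt, key) pair always yields the same seed, which is
    what ties the samples of different instances together.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    # Keep the top 53 bits so the quotient is an exact float below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def coordinated_sample(
    instance: dict[str, float], tau: float, salt: str = "demo"
) -> dict[str, float]:
    """Keep a key when weight >= seed * tau.

    For a uniform seed this includes a key with probability
    min(1, weight / tau). Because the seed depends only on the key,
    instances that assign similar weights to a key make the same
    inclusion decision for it most of the time.
    """
    return {k: w for k, w in instance.items() if w >= shared_seed(k, salt) * tau}


if __name__ == "__main__":
    # Two hypothetical instances over the same keys with similar weights.
    monday = {f"k{i}": 1.0 + (i % 5) for i in range(200)}
    tuesday = {k: w + 0.5 for k, w in monday.items()}

    s1 = coordinated_sample(monday, tau=10.0)
    s2 = coordinated_sample(tuesday, tau=10.0)
    print(len(s1), len(s2), len(set(s1) & set(s2)))
```

In this sketch, using a different `salt` for each instance would give independent samples instead, and the overlap between the two samples would typically be much smaller.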

The usefulness of a sampling scheme hinges on the scope and accuracy within which queries posed over the original data can be answered from the sample. We aim here to gain a fundamental understanding of the limits and potential of coordination. Our main result is a precise characterization, in terms of simple properties of the estimated function, of queries for which estimators with desirable properties exist. We consider unbiasedness, nonnegativity, finite variance, and bounded estimates.

Since a single estimator generally cannot be optimal (that is, minimize variance simultaneously) for all data, we propose variance competitiveness: the expectation of the square of the estimate on any data is not too far from the minimum possible for that data. Perhaps surprisingly, we show how to construct a variance-competitive estimator for any function for which an unbiased nonnegative estimator exists.
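
One way to read the informal definition above is sketched below. The notation is an assumption made for illustration: $\mathbf{v}$ denotes the data, $S$ the sample drawn for it, $\hat f$ the estimator of $f$, and $c \ge 1$ a constant; the abstract itself fixes none of these symbols.

```latex
% A possible formalization (assumed notation): \hat f is c-competitive if
\[
  \mathbb{E}_{S \mid \mathbf{v}}\!\left[\hat f(S)^{2}\right]
  \;\le\;
  c \cdot \inf_{\hat f'} \mathbb{E}_{S \mid \mathbf{v}}\!\left[\hat f'(S)^{2}\right]
  \qquad \text{for all data } \mathbf{v},
\]
% where the infimum ranges over unbiased nonnegative estimators \hat f' of f.
% For an unbiased estimator,
\[
  \operatorname{Var}\!\left[\hat f \mid \mathbf{v}\right]
  = \mathbb{E}_{S \mid \mathbf{v}}\!\left[\hat f(S)^{2}\right] - f(\mathbf{v})^{2},
\]
% so minimizing the expectation of the square on given data also minimizes
% the variance on that data.
```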

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cohen, E., Kaplan, H. (2013). What You Can Do with Coordinated Samples. In: Raghavendra, P., Raskhodnikova, S., Jansen, K., Rolim, J.D.P. (eds) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. APPROX RANDOM 2013. Lecture Notes in Computer Science, vol 8096. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40328-6_32

  • DOI: https://doi.org/10.1007/978-3-642-40328-6_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40327-9

  • Online ISBN: 978-3-642-40328-6

  • eBook Packages: Computer Science, Computer Science (R0)
