On the Variance of Subset Sum Estimation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4698)

Abstract

For high-volume data streams and large data warehouses, sampling is used to obtain efficient approximate answers to aggregate queries over selected subsets. We consider a possibly heavy-tailed set of weighted items and address the question:

Which sampling scheme should we use to get the most accurate subset sum estimates?

We present a simple theorem on the variance of subset sum estimation and use it to prove optimality and near-optimality of different known sampling schemes. The performance measure suggested in this paper is the average variance over all subsets of any given size. By optimal we mean there is no set of input weights for which any sampling scheme can have a better average variance. For example, we show that appropriately weighted systematic sampling is simultaneously optimal for all subset sizes. More standard schemes such as uniform sampling and probability-proportional-to-size sampling with replacement can be arbitrarily bad.
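
To make the performance measure concrete, the following is a sketch of how an average variance over all subsets of a fixed size decomposes. The notation is assumed here for illustration and is not quoted from the paper: there are $n \ge 2$ items, $\hat w_i$ is the estimate of item $i$'s weight (zero if the item is not sampled), $m$ is the subset size, $\Sigma V = \sum_i \operatorname{Var}[\hat w_i]$ is the sum of the individual variances, and $V\Sigma = \operatorname{Var}[\sum_i \hat w_i]$ is the variance of the estimated total.

```latex
\begin{align*}
V_{m:n}
  &= \binom{n}{m}^{-1} \sum_{|I| = m} \operatorname{Var}\Big[\sum_{i \in I} \hat w_i\Big] \\
  &= \frac{m}{n} \sum_{i} \operatorname{Var}[\hat w_i]
     + \frac{m(m-1)}{n(n-1)} \sum_{i \neq j} \operatorname{Cov}[\hat w_i, \hat w_j] \\
  &= \frac{m}{n} \left( \frac{n-m}{n-1}\, \Sigma V + \frac{m-1}{n-1}\, V\Sigma \right)
\end{align*}
```

The second line uses the fact that a fixed item lies in a fraction $m/n$ of the size-$m$ subsets and a fixed ordered pair of distinct items in a fraction $m(m-1)/(n(n-1))$; the third uses $\sum_{i \neq j} \operatorname{Cov}[\hat w_i, \hat w_j] = V\Sigma - \Sigma V$. Under this reading, a scheme does well for every subset size at once if it keeps both $\Sigma V$ and $V\Sigma$ small.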

Knowing the variance optimality of different sampling schemes can help in deciding which sampling scheme to apply in a given context.
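
As an illustration of the kind of scheme the abstract refers to, here is a minimal Python sketch of weighted systematic sampling with Horvitz-Thompson style estimates. It rests on assumptions not stated in the abstract: that "appropriately weighted" means inclusion probabilities p_i = min(1, w_i / tau), with the threshold tau chosen so that the probabilities sum to the sample size k, and that all weights are positive. The function names (`threshold`, `systematic_sample`, `subset_sum_estimate`) and the `offset` parameter are hypothetical and chosen for this example only.

```python
import math
import random


def threshold(weights, k):
    """Return tau such that sum(min(1, w / tau) for w in weights) equals k.

    Assumes positive weights and 0 < k < len(weights).
    """
    if not 0 < k < len(weights):
        raise ValueError("need 0 < k < len(weights)")
    desc = sorted(weights, reverse=True)
    rest = sum(desc)  # total weight of items not yet declared "large"
    for t in range(k):
        tau = rest / (k - t)
        if desc[t] <= tau:  # largest remaining item fits under the threshold
            return tau
        rest -= desc[t]  # item t is large: it gets inclusion probability 1
    raise AssertionError("unreachable for positive weights")


def systematic_sample(weights, k, offset=None):
    """Draw a systematic sample; return {item index: weight estimate}.

    Item i owns an interval of length p_i on the line [0, k). It is sampled
    when that interval, shifted by a single random offset in [0, 1), contains
    an integer. A sampled item is estimated as w_i / p_i = max(w_i, tau).
    Floating-point rounding can perturb the integer crossings in rare cases.
    """
    tau = threshold(weights, k)
    x = random.random() if offset is None else offset
    estimates = {}
    cum = 0.0
    for i, w in enumerate(weights):
        p = min(1.0, w / tau)
        lo, cum = cum, cum + p
        if math.floor(cum + x) > math.floor(lo + x):
            estimates[i] = w / p
    return estimates


def subset_sum_estimate(estimates, subset):
    """Estimate the total weight of `subset` from the sampled estimates."""
    return sum(estimates.get(i, 0.0) for i in subset)
```

For the weights [8, 1, 1, 1, 1] with k = 3 the threshold is 2, so the heavy item is always kept and each light item is kept with probability 1/2. With `offset=0.25` the sketch samples items 0, 2 and 4 with estimates 8, 2 and 2, and their sum equals the exact total of 12. In this construction the estimated total does not depend on the offset (up to rounding), which is the property that keeps the variance of the total at zero.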


Editor information

Lars Arge, Michael Hoffmann, Emo Welzl

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Szegedy, M., Thorup, M. (2007). On the Variance of Subset Sum Estimation. In: Arge, L., Hoffmann, M., Welzl, E. (eds) Algorithms – ESA 2007. ESA 2007. Lecture Notes in Computer Science, vol 4698. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75520-3_9

  • DOI: https://doi.org/10.1007/978-3-540-75520-3_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75519-7

  • Online ISBN: 978-3-540-75520-3

  • eBook Packages: Computer Science, Computer Science (R0)
