On the Variance of Subset Sum Estimation

Szegedy, Mario; Thorup, Mikkel

doi:10.1007/978-3-540-75520-3_9

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4698)

Abstract

For high-volume data streams and large data warehouses, sampling is used to obtain efficient approximate answers to aggregate queries over selected subsets. We consider a possibly heavy-tailed set of weighted items and address the question:

Which sampling scheme should we use to get the most accurate subset sum estimates?

We present a simple theorem on the variance of subset sum estimation and use it to prove optimality and near-optimality of different known sampling schemes. The performance measure suggested in this paper is the average variance over all subsets of any given size. By optimal we mean there is no set of input weights for which any sampling scheme can have a better average variance. For example, we show that appropriately weighted systematic sampling is simultaneously optimal for all subset sizes. More standard schemes such as uniform sampling and probability-proportional-to-size sampling with replacement can be arbitrarily bad.

Knowing the variance optimality of different sampling schemes can help in deciding which sampling scheme to apply in a given context.
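The contrast drawn in the abstract between weighted systematic sampling and uniform sampling can be illustrated with a small sketch. It rests on assumptions that the abstract does not spell out: "appropriately weighted" is read here as threshold inclusion probabilities p_i = min(1, w_i / tau), with tau chosen so that the probabilities sum to the sample size k, and subset sums are estimated with the standard Horvitz-Thompson estimator (each sampled item counts as w_i / p_i). The function names are hypothetical, and the variance comparison covers only subsets of size 1, where the variance depends on the inclusion probabilities alone.

```python
import math


def threshold_probabilities(weights, k):
    """Inclusion probabilities p_i = min(1, w_i / tau) that sum to k.

    Assumes positive weights and an integer k >= 1.
    """
    n = len(weights)
    if k >= n:
        return [1.0] * n
    order = sorted(weights, reverse=True)
    rest = sum(order)
    for t in range(k):
        # Candidate threshold if the t largest items are sampled with certainty.
        tau = rest / (k - t)
        if order[t] <= tau:
            break
        rest -= order[t]
    return [min(1.0, w / tau) for w in weights]


def systematic_sample(probs, u):
    """Systematic sample for an offset u in [0, 1).

    Item i is picked when its interval of cumulative probability
    contains one of the points u, u + 1, u + 2, ...
    """
    picked = []
    cum = 0.0
    for i, p in enumerate(probs):
        lo, cum = cum, cum + p
        if math.ceil(cum - u) > math.ceil(lo - u):
            picked.append(i)
    return picked


def subset_sum_estimate(weights, probs, sample, subset):
    """Horvitz-Thompson estimate of the total weight of `subset`."""
    return sum(weights[i] / probs[i] for i in sample if i in subset)


def avg_singleton_variance(weights, probs):
    """Average estimator variance over all subsets of size 1.

    A single item contributes w^2 * (1/p - 1), whatever the joint
    distribution of the sampling scheme is.
    """
    return sum(w * w * (1.0 / p - 1.0) for w, p in zip(weights, probs)) / len(weights)


if __name__ == "__main__":
    weights = [1, 1, 1, 1, 100]          # one heavy item, illustrative data
    k = 2
    probs = threshold_probabilities(weights, k)
    print(probs)                          # [0.25, 0.25, 0.25, 0.25, 1.0]
    sample = systematic_sample(probs, 0.5)
    print(sample)                         # [2, 4]
    print(subset_sum_estimate(weights, probs, sample, {0, 1, 2}))
    uniform = [k / len(weights)] * len(weights)
    print(avg_singleton_variance(weights, probs))
    print(avg_singleton_variance(weights, uniform))
```

In this toy input the heavy item is always sampled under the threshold probabilities, so it contributes no variance, whereas uniform sampling misses it in most samples and its average singleton variance is larger by roughly three orders of magnitude. In practice the offset u would be drawn uniformly at random from [0, 1); it is a parameter here only to keep the example deterministic.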

Author information

Authors and Affiliations

Department of Computer Science, Rutgers, the State University of New Jersey, 110 Frelinghuysen Road, Piscataway, NJ 08854-8019, USA
Mario Szegedy
AT&T Labs—Research, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932, USA
Mikkel Thorup

Editor information

Lars Arge, Michael Hoffmann, Emo Welzl

Cite this paper

Szegedy, M., Thorup, M. (2007). On the Variance of Subset Sum Estimation. In: Arge, L., Hoffmann, M., Welzl, E. (eds) Algorithms – ESA 2007. ESA 2007. Lecture Notes in Computer Science, vol 4698. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75520-3_9

DOI: https://doi.org/10.1007/978-3-540-75520-3_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75519-7
Online ISBN: 978-3-540-75520-3
eBook Packages: Computer Science, Computer Science (R0)
