Abstract
For high-volume data streams and large data warehouses, sampling is used to obtain efficient approximate answers to aggregate queries over selected subsets. We consider a set of weighted items whose weights may be heavy-tailed, and address the question:
Which sampling scheme should we use to get the most accurate subset sum estimates?
We present a simple theorem on the variance of subset sum estimation and use it to prove optimality and near-optimality of different known sampling schemes. The performance measure suggested in this paper is the average variance over all subsets of any given size. By optimal we mean there is no set of input weights for which any sampling scheme can have a better average variance. For example, we show that appropriately weighted systematic sampling is simultaneously optimal for all subset sizes. More standard schemes such as uniform sampling and probability-proportional-to-size sampling with replacement can be arbitrarily bad.
Knowing the variance optimality of different sampling schemes can help in deciding which sampling scheme to apply in a given context.
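To make the abstract's "appropriately weighted systematic sampling" concrete, here is a minimal Python sketch of one plausible reading: each item is included with probability min(1, w_i / tau), the threshold tau is chosen so that the expected sample size is k, and the sample is then drawn systematically with a single random offset. The function names, the parameter k, and the use of Horvitz-Thompson style estimates w_i / p_i are assumptions made for illustration; the abstract itself does not spell out these details.

```python
import math
import random


def threshold_probabilities(weights, k):
    """Return (tau, probs) with probs[i] = min(1, w_i / tau) and sum(probs) = k.

    Assumes positive weights and 0 < k < len(weights) (illustrative restriction).
    """
    if not 0 < k < len(weights):
        raise ValueError("need 0 < k < number of items")
    if min(weights) <= 0:
        raise ValueError("weights must be positive")
    ordered = sorted(weights, reverse=True)
    rest = float(sum(ordered))
    # Try t = 0, 1, ... items with probability 1 (the t largest weights).
    for t in range(k):
        tau = rest / (k - t)
        if ordered[t] <= tau:
            break  # every remaining weight is at most tau
        rest -= ordered[t]
    return tau, [min(1.0, w / tau) for w in weights]


def systematic_sample(probs, rng=random):
    """Pick item i iff a point u + j (j integer) falls in its interval
    [P_{i-1}, P_i) of cumulative probabilities, for one random offset u."""
    u = rng.random()
    picked, cum = [], 0.0
    for i, p in enumerate(probs):
        lo, hi = cum, cum + p
        # Number of integers j with lo <= u + j < hi is ceil(hi-u) - ceil(lo-u).
        if math.ceil(hi - u) > math.ceil(lo - u):
            picked.append(i)
        cum = hi
    return picked


def subset_sum_estimate(weights, probs, picked, subset):
    """Estimate the total weight of `subset` from the sampled items only."""
    return sum(weights[i] / probs[i] for i in picked if i in subset)


if __name__ == "__main__":
    weights = [8.0, 1.0, 1.0, 1.0, 1.0]   # one heavy item, four light ones
    tau, probs = threshold_probabilities(weights, k=2)
    picked = systematic_sample(probs)
    print(tau, probs, picked)
    print(subset_sum_estimate(weights, probs, picked, {1, 2, 3, 4}))
```

In this toy input the heavy item gets inclusion probability 1 and is always kept, while exactly one of the four light items is drawn and scaled up to the threshold. Uniform sampling of two items would miss the heavy item most of the time, which illustrates the abstract's warning about heavy-tailed weights.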
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Szegedy, M., Thorup, M. (2007). On the Variance of Subset Sum Estimation. In: Arge, L., Hoffmann, M., Welzl, E. (eds) Algorithms – ESA 2007. ESA 2007. Lecture Notes in Computer Science, vol 4698. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75520-3_9
DOI: https://doi.org/10.1007/978-3-540-75520-3_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75519-7
Online ISBN: 978-3-540-75520-3
eBook Packages: Computer Science, Computer Science (R0)