Space-Efficient Estimation of Statistics Over Sub-Sampled Streams

Abstract

In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to sub-sample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished through simply estimating them on the sampled stream, followed by a normalization. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.
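To illustrate why normalizing a sampled estimate can fail, here is a minimal sketch for the second frequency moment \(F_2\). It assumes Bernoulli sampling, where each element of the original stream is kept independently with probability \(p\); the abstract does not state the sampling model, so this is an assumption. The function names are hypothetical, and the exact-counting approach shown is not the space-efficient algorithm of the paper. It only shows the bias of naive normalization and one textbook correction.

```python
import random
from collections import Counter


def bernoulli_sample(stream, p, rng):
    """Keep each element independently with probability p (assumed sampling model)."""
    return [x for x in stream if rng.random() < p]


def second_moment(items):
    """F2: the sum of squared item frequencies, computed exactly."""
    return sum(c * c for c in Counter(items).values())


def naive_f2(sample, p):
    """Estimate F2 of the original stream by scaling the sampled F2 by 1/p^2.

    This is biased upward, because a sampled frequency g ~ Binomial(f, p)
    satisfies E[g^2] = p^2 * f^2 + p * (1 - p) * f, not p^2 * f^2.
    """
    return second_moment(sample) / (p * p)


def corrected_f2(sample, p):
    """Remove the binomial variance term before scaling.

    E[F2(sample)] = p^2 * F2 + p * (1 - p) * F1 and E[len(sample)] = p * F1,
    so F2(sample) - (1 - p) * len(sample) has expectation p^2 * F2.
    """
    return (second_moment(sample) - (1 - p) * len(sample)) / (p * p)


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed, for repeatability only
    stream = list(range(100_000))   # all-distinct stream, so true F2 = 100000
    sample = bernoulli_sample(stream, 0.1, rng)
    print("naive    :", naive_f2(sample, 0.1))
    print("corrected:", corrected_f2(sample, 0.1))
```

For a stream of \(n\) distinct elements, the true value is \(F_2 = n\). The naive estimate has expectation \(n/p\), so it overshoots by a factor of \(1/p\), while the corrected estimate has expectation \(n\). The sketch stores exact counts of the sampled stream and therefore says nothing about space usage.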

Notes

  1. The \(\tilde{O}\) notation suppresses factors polynomial in \(1/\varepsilon \) and \(1/\delta \) and factors logarithmic in \(m\) and \(n\).

Author information

Correspondence to Srikanta Tirthapura.

Additional information

McGregor is supported in part by NSF CAREER Award CCF-0953754. Pavan is supported in part by NSF grant CCF-0916797. Tirthapura is supported in part by NSF grants CNS-0834743 and CNS-0831903.

About this article

Cite this article

McGregor, A., Pavan, A., Tirthapura, S. et al. Space-Efficient Estimation of Statistics Over Sub-Sampled Streams. Algorithmica 74, 787–811 (2016). https://doi.org/10.1007/s00453-015-9974-0

Keywords

  • Data streams
  • Frequency moments
  • Sub-sampling