Skip to main content

Optimal Random Sampling from Distributed Streams Revisited

  • Conference paper
Distributed Computing (DISC 2011)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 6950))

Included in the following conference series:

Abstract

We give an improved algorithm for drawing a random sample from a large data stream when the input elements are distributed across multiple sites which communicate via a central coordinator. At any point in time the set of elements held by the coordinator represent a uniform random sample from the set of all the elements observed so far. When compared with prior work, our algorithms asymptotically improve the total number of messages sent in the system as well as the computation required of the coordinator. We also present a matching lower bound, showing that our protocol sends the optimal number of messages up to a constant factor with large probability. As a byproduct, we obtain an improved algorithm for finding the heavy hitters across multiple distributed sites.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Aggarwal, C.C.: On biased reservoir sampling in the presence of stream evolution. In: VLDB, pp. 607–618 (2006)

    Google Scholar 

  2. Arackaparambil, C., Brody, J., Chakrabarti, A.: Functional monitoring without monotonicity. In: Albers, S., Marchetti-Spaccamela, A., Matias, Y., Nikoletseas, S., Thomas, W. (eds.) ICALP 2009, Part 1. LNCS, vol. 5555, pp. 95–106. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  3. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA, pp. 633–634 (2002)

    Google Scholar 

  4. Babcock, B., Olston, C.: Distributed top-k monitoring. In: SIGMOD Conference, pp. 28–39 (2003)

    Google Scholar 

  5. Braverman, V., Ostrovsky, R.: Effective computations on sliding windows. SIAM Journal on Computing 39(6), 2113–2131 (2010)

    Article  MathSciNet  MATH  Google Scholar 

  6. Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. In: PODS, pp. 147–156 (2009)

    Google Scholar 

  7. Cormode, G., Garofalakis, M.N.: Sketching streams through the net: Distributed approximate query tracking. In: VLDB, pp. 13–24 (2005)

    Google Scholar 

  8. Cormode, G., Muthukrishnan, S., Yi, K.: Algorithms for distributed functional monitoring. In: SODA, pp. 1076–1085 (2008)

    Google Scholar 

  9. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Optimal sampling from distributed streams. In: PODS, pp. 77–86 (2010)

    Google Scholar 

  10. Dash, M., Ng, W.: Efficient reservoir sampling for transactional data streams. In: Sixth IEEE International Conference on Data Mining (Workshops), pp. 662–666 (2006)

    Google Scholar 

  11. Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: SPAA, pp. 281–291 (2001)

    Google Scholar 

  12. Gibbons, P.B., Tirthapura, S.: Distributed streams algorithms for sliding windows. In: SPAA, pp. 63–72 (2002)

    Google Scholar 

  13. Huang, L., Nguyen, X., Garofalakis, M.N., Hellerstein, J.M., Jordan, M.I., Joseph, A.D., Taft, N.: Communication-efficient online detection of network-wide anomalies. In: INFOCOM, pp. 134–142 (2007)

    Google Scholar 

  14. Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: SIGMOD Conference, pp. 289–300 (2006)

    Google Scholar 

  15. Knuth, D.E.: The Art of Computer Programming, 2nd edn. Seminumerical Algorithms, vol. II. Addison-Wesley, Reading (1981)

    MATH  Google Scholar 

  16. Malbasa, V., Vucetic, S.: A reservoir sampling algorithm with adaptive estimation of conditional expectation. In: IJCNN 2007, International Joint Conference on Neural Networks, pp. 2200–2204 (2007)

    Google Scholar 

  17. Manjhi, A., Shkapenyuk, V., Dhamdhere, K., Olston, C.: Finding (recently) frequent items in distributed data streams. In: ICDE, pp. 767–778 (2005)

    Google Scholar 

  18. Muthukrishnan, S.: Data Streams: Algorithms and Applications. In: Foundations and Trends in Theoretical Computer Science, Now Publishers (August 2005)

    Google Scholar 

  19. Vitter, J.S.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11(1), 37–57 (1985)

    Article  MathSciNet  MATH  Google Scholar 

  20. Xu, B., Tirthapura, S., Busch, C.: Sketching asynchronous data streams over sliding windows. Distributed Computing 20(5), 359–374 (2008)

    Article  MATH  Google Scholar 

  21. Yi, K., Zhang, Q.: Optimal tracking of distributed heavy hitters and quantiles. In: PODS, pp. 167–174 (2009)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tirthapura, S., Woodruff, D.P. (2011). Optimal Random Sampling from Distributed Streams Revisited. In: Peleg, D. (eds) Distributed Computing. DISC 2011. Lecture Notes in Computer Science, vol 6950. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24100-0_27

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-24100-0_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24099-7

  • Online ISBN: 978-3-642-24100-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics