Adaptive Grid-Based k-median Clustering of Streaming Data with Accuracy Guarantee

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9049)

Abstract

Data stream clustering has wide applications, such as online financial transactions, telephone records, and network monitoring. Grid-based clustering partitions stream data into cells, derives statistical information about each cell, and then clusters this much smaller statistical summary without referring back to the input data. Grid-based clustering is therefore efficient and well suited to high-throughput data streams, which are continuous, time-varying, and possibly unpredictable. Various grid-based clustering schemes have been proposed. However, to the best of our knowledge, none of them provides an accuracy guarantee for its clustering output. To fill this gap, in this paper we study grid-based k-median clustering. We first develop an accuracy guarantee on the cost difference between the grid-based solution and the optimum. Based on this theoretical analysis, we then propose a general and adaptive solution, which partitions stream data into cells of dynamically determined granularity and runs k-median clustering on the statistical information of the cells with an accuracy guarantee. Extensive experiments on three real datasets show that our solution produces high-quality clustering outputs efficiently.
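The general grid-based pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm. It assumes a fixed cell width (the paper determines granularity adaptively), uses each cell's count and centroid as its statistical information, and substitutes a simple alternating heuristic that picks centers from among the cell centroids. The names `grid_summary`, `cost`, and `weighted_k_median` are hypothetical. The final comment states only an elementary triangle-inequality bound, not the accuracy guarantee developed in the paper.

```python
import math


def grid_summary(points, cell_width):
    """Bucket points into axis-aligned cells; keep one (centroid, count) pair per cell."""
    cells = {}
    for p in points:
        key = tuple(math.floor(x / cell_width) for x in p)
        count, total = cells.get(key, (0, [0.0] * len(p)))
        cells[key] = (count + 1, [t + x for t, x in zip(total, p)])
    return [(tuple(t / c for t in total), c) for c, total in cells.values()]


def cost(summary, centers):
    """Weighted k-median cost: sum of weight * distance to the nearest center."""
    return sum(w * min(math.dist(rep, c) for c in centers) for rep, w in summary)


def weighted_k_median(summary, k, max_iter=50):
    """Alternating heuristic (assumption: centers are restricted to cell centroids)."""
    reps = [rep for rep, _ in summary]
    # Seed with the heaviest cell, then add the farthest remaining centroids.
    centers = [max(summary, key=lambda s: s[1])[0]]
    while len(centers) < min(k, len(reps)):
        centers.append(max(reps, key=lambda r: min(math.dist(r, c) for c in centers)))
    for _ in range(max_iter):
        # Assignment step: attach every cell to its nearest center.
        clusters = [[] for _ in centers]
        for rep, w in summary:
            i = min(range(len(centers)), key=lambda j: math.dist(rep, centers[j]))
            clusters[i].append((rep, w))
        # Update step: per cluster, pick the member centroid with the lowest weighted cost.
        new_centers = []
        for center, members in zip(centers, clusters):
            if not members:
                new_centers.append(center)
                continue
            new_centers.append(min(
                (rep for rep, _ in members),
                key=lambda cand: sum(w * math.dist(rep, cand) for rep, w in members),
            ))
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [(0.1, 0.2), (0.4, 0.3), (0.2, 0.8), (1.2, 0.5),
          (10.1, 10.2), (10.6, 10.4), (9.8, 10.1), (10.3, 9.7)]
summary = grid_summary(points, cell_width=1.0)
centers = weighted_k_median(summary, k=2)

# Elementary bound: each point lies within one cell diagonal of its cell centroid,
# so for any fixed centers the raw cost and the summary cost differ by at most
# n * cell_width * sqrt(dimension).
raw_cost = cost([(p, 1) for p in points], centers)
print(centers, raw_cost, cost(summary, centers))
```

Clustering runs over the cell summaries only, so its running time depends on the number of non-empty cells rather than on the number of stream points. Smaller cells tighten the bound in the final comment but produce more cells to cluster.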

Corresponding author

Correspondence to Jianneng Cao.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Cao, J., Zhou, Y., Wu, M. (2015). Adaptive Grid-Based k-median Clustering of Streaming Data with Accuracy Guarantee. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9049. Springer, Cham. https://doi.org/10.1007/978-3-319-18120-2_5

  • DOI: https://doi.org/10.1007/978-3-319-18120-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18119-6

  • Online ISBN: 978-3-319-18120-2

  • eBook Packages: Computer Science (R0)
