Stream Clustering Algorithms: A Primer

Part of the book series: Studies in Big Data (SBD, volume 9)

Abstract

Stream data has become ubiquitous owing to advances in acquisition technology and now pervades numerous applications. Such massive data, gathered as a continuous flow, often demand real-time processing. One aspect of data streams concerns storage management and the processing of continuous queries for aggregation. Another significant aspect pertains to the discovery and understanding of hidden patterns, using mining approaches to derive actionable knowledge. This chapter focuses on stream clustering and presents a primer on clustering algorithms in the data stream environment.

Clustering of data streams has gained importance because of its ability to capture natural structures from unlabeled, non-stationary data. A single scan of the data, bounded memory usage, and capturing data evolution are the key challenges in clustering streaming data. We elaborate on and compare the algorithms on the basis of these constraints. We also propose a taxonomy of algorithms based on the fundamental approaches used for clustering. For each approach, a systematic description of contemporary, well-known algorithms is presented. We place special emphasis on the synopsis data structure used for consolidating the characteristics of streaming data and feature it as an important issue in the design of a stream clustering algorithm. We argue that a number of functional and operational characteristics of a clustering algorithm (e.g., quality of clustering, handling of outliers, and number of parameters) are influenced by the choice of synopsis. A summary of the clustering features supported by different algorithms is given. Finally, research directions for improving the usability of stream clustering algorithms are suggested.
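
To make the idea of a synopsis more concrete, here is a minimal Python sketch of one well-known kind: an additive cluster-feature summary (count, linear sum, squared sum) of the type introduced by BIRCH. It is an illustration only and is not taken from the chapter. The names `ClusterFeature` and `summarize`, the distance threshold, and the rule for handling a full memory budget are all assumptions chosen for brevity.

```python
import math


class ClusterFeature:
    """Additive synopsis of the points absorbed so far: (n, linear sum, squared sum)."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.squared_sum = [0.0] * dim

    def add(self, point):
        # Constant-time update; the raw point is not stored.
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # Root-mean-square deviation from the centroid, derived from the sums alone.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.linear_sum, self.squared_sum))
        return math.sqrt(max(var, 0.0))


def summarize(stream, dim, max_summaries, threshold):
    """Single pass over `stream`, keeping at most `max_summaries` synopses.

    Hypothetical policy: a point joins the nearest synopsis if it lies within
    `threshold` of its centroid; otherwise it starts a new synopsis while the
    memory budget allows, and is folded into the nearest one once it does not.
    """
    if max_summaries < 1:
        raise ValueError("max_summaries must be at least 1")
    summaries = []
    for point in stream:
        nearest = min(summaries,
                      key=lambda cf: math.dist(cf.centroid(), point),
                      default=None)
        if nearest is not None and math.dist(nearest.centroid(), point) <= threshold:
            nearest.add(point)
        elif len(summaries) < max_summaries:
            cf = ClusterFeature(dim)
            cf.add(point)
            summaries.append(cf)
        else:
            nearest.add(point)  # budget exhausted: absorb into the closest synopsis
    return summaries


points = [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0), (10.2, 10.0), (0.0, 0.2)]
result = summarize(points, dim=2, max_summaries=2, threshold=1.0)
for cf in result:
    print(cf.n, cf.centroid(), cf.radius())
```

Each point is read once and then discarded, and memory is bounded by `max_summaries`, which mirrors the single-scan and bounded-memory constraints named above. This sketch has no mechanism for data evolution. Real algorithms typically add one, for example time stamps or decay weights on the synopsis.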



Author information

Corresponding author

Correspondence to Sharanjit Kaur.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Kaur, S., Bhatnagar, V., Chakravarthy, S. (2015). Stream Clustering Algorithms: A Primer. In: Hassanien, A., Azar, A., Snasael, V., Kacprzyk, J., Abawajy, J. (eds) Big Data in Complex Systems. Studies in Big Data, vol 9. Springer, Cham. https://doi.org/10.1007/978-3-319-11056-1_4

  • DOI: https://doi.org/10.1007/978-3-319-11056-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11055-4

  • Online ISBN: 978-3-319-11056-1

  • eBook Packages: Engineering, Engineering (R0)
