Data stream clustering: a review

Abstract

The number of connected devices is steadily increasing, and these devices continuously generate data streams. Real-time processing of data streams is attracting growing interest despite its many challenges. Clustering is one of the most suitable methods for real-time data stream processing, because it can be applied with less prior information about the data and does not need labeled instances. However, data stream clustering differs from traditional clustering in many respects and poses several challenges of its own. Here, we introduce the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. We compare these algorithms, list popular data stream repositories and datasets as well as stream processing tools and platforms, and discuss open problems in data stream clustering.




Corresponding author

Correspondence to Alaettin Zubaroğlu.


Cite this article

Zubaroğlu, A., Atalay, V. Data stream clustering: a review. Artif Intell Rev 54, 1201–1236 (2021). https://doi.org/10.1007/s10462-020-09874-x


Keywords

  • Data streams
  • Data stream clustering
  • Real-time clustering