Data stream clustering: a review

Artificial Intelligence Review

Abstract

The number of connected devices is steadily increasing, and these devices continuously generate data streams. Real-time processing of data streams is attracting growing interest despite its many challenges. Clustering is one of the most suitable methods for real-time data stream processing because it requires little prior information about the data and does not need labeled instances. However, data stream clustering differs from traditional clustering in many respects and raises several challenging issues. Here, we describe the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of their base clustering technique, computational complexity and clustering accuracy, and we compare the algorithms with one another. We also list popular data stream repositories and datasets, as well as stream processing tools and platforms. Finally, we discuss open problems in data stream clustering.



Author information

Corresponding author

Correspondence to Alaettin Zubaroğlu.

About this article

Cite this article

Zubaroğlu, A., Atalay, V. Data stream clustering: a review. Artif Intell Rev 54, 1201–1236 (2021). https://doi.org/10.1007/s10462-020-09874-x

  • DOI: https://doi.org/10.1007/s10462-020-09874-x
