Abstract
The number of connected devices is steadily increasing, and these devices continuously generate data streams. Real-time processing of data streams is attracting growing interest despite its many challenges. Clustering is one of the most suitable methods for real-time data stream processing because it requires little prior information about the data and no labeled instances. However, data stream clustering differs from traditional clustering in many respects and poses several challenges of its own. Here, we describe the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of base clustering technique, computational complexity and clustering accuracy. We compare these algorithms and discuss problems that remain open. We also list popular data stream repositories and datasets, as well as stream processing tools and platforms.
Cite this article
Zubaroğlu, A., Atalay, V. Data stream clustering: a review. Artif Intell Rev 54, 1201–1236 (2021). https://doi.org/10.1007/s10462-020-09874-x
DOI: https://doi.org/10.1007/s10462-020-09874-x