Stream Clustering Algorithms: A Primer

Kaur, Sharanjit; Bhatnagar, Vasudha; Chakravarthy, Sharma

doi:10.1007/978-3-319-11056-1_4

Part of the book series: Studies in Big Data (SBD, volume 9)

Abstract

Stream data has become ubiquitous owing to advances in acquisition technology and now pervades numerous applications. These massive data, gathered as a continuous flow, often come with a pressing need for real-time processing. One aspect of data streams concerns storage management and the processing of continuous queries for aggregation. Another significant aspect pertains to the discovery and understanding of hidden patterns, using mining approaches to derive actionable knowledge. This chapter focuses on stream clustering and presents a primer on clustering algorithms in the data stream environment.

Clustering of data streams has gained importance because of its ability to capture natural structures in unlabeled, non-stationary data. A single scan of the data, bounded memory usage, and the capture of data evolution are the key challenges in clustering streaming data. We elaborate on and compare the algorithms on the basis of these constraints. We also propose a taxonomy of algorithms based on the fundamental approaches used for clustering. For each approach, a systematic description of contemporary, well-known algorithms is presented. We place special emphasis on the synopsis data structure used for consolidating the characteristics of streaming data and treat it as an important issue in the design of a stream clustering algorithm. We argue that a number of functional and operational characteristics of a clustering algorithm (e.g., quality of clustering, handling of outliers, number of parameters) are influenced by the choice of synopsis. A summary of the clustering features supported by different algorithms is given. Finally, research directions for improving the usability of stream clustering algorithms are suggested.
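
To make the notion of a synopsis concrete, the following is a minimal Python sketch of one common kind of summary, an additive cluster feature (weight, linear sum, squared sum) with optional exponential fading. It is an illustration under stated assumptions and not the chapter's own method: the names `ClusterFeature` and `cluster_stream`, the parameters `threshold`, `max_clusters`, and `decay`, and the simple "absorb into the nearest summary when the budget is full" policy are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass, field
from typing import Iterable, List, Sequence


@dataclass
class ClusterFeature:
    """Additive summary of a group of points: weight, linear sum, squared sum."""
    dim: int
    n: float = 0.0                                  # (faded) number of points
    ls: List[float] = field(default_factory=list)   # per-dimension linear sum
    ss: List[float] = field(default_factory=list)   # per-dimension squared sum
    last_t: float = 0.0                             # time of the last update

    def __post_init__(self) -> None:
        if not self.ls:
            self.ls = [0.0] * self.dim
        if not self.ss:
            self.ss = [0.0] * self.dim

    def fade(self, t: float, decay: float) -> None:
        # Down-weight old data so the summary can follow data evolution.
        factor = 2.0 ** (-decay * (t - self.last_t))
        self.n *= factor
        self.ls = [v * factor for v in self.ls]
        self.ss = [v * factor for v in self.ss]
        self.last_t = t

    def insert(self, x: Sequence[float], t: float, decay: float = 0.0) -> None:
        # Each point is seen once and folded into the sums; it is never stored.
        self.fade(t, decay)
        self.n += 1.0
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def radius(self) -> float:
        # Root of the summed per-dimension variance, clamped against rounding.
        var = sum(s / self.n - (l / self.n) ** 2
                  for s, l in zip(self.ss, self.ls))
        return math.sqrt(max(var, 0.0))


def cluster_stream(points: Iterable[Sequence[float]], threshold: float,
                   max_clusters: int, decay: float = 0.0) -> List[ClusterFeature]:
    """Single pass over the stream, keeping at most max_clusters summaries."""
    clusters: List[ClusterFeature] = []
    for t, x in enumerate(points):
        nearest = min(clusters,
                      key=lambda c: math.dist(c.centroid(), x),
                      default=None)
        close = (nearest is not None
                 and math.dist(nearest.centroid(), x) <= threshold)
        if nearest is not None and (close or len(clusters) >= max_clusters):
            # Hypothetical policy: absorb the point when it is close enough,
            # or when the memory budget leaves no room for a new summary.
            nearest.insert(x, t, decay)
        else:
            cf = ClusterFeature(dim=len(x))
            cf.insert(x, t, decay)
            clusters.append(cf)
    return clusters


if __name__ == "__main__":
    stream = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
    for cf in cluster_stream(stream, threshold=1.0, max_clusters=5):
        print(cf.n, cf.centroid(), cf.radius())
```

In this sketch, memory depends only on `max_clusters` and the dimensionality, not on the length of the stream, and a `decay` greater than zero lets recent points dominate each summary. The choice of synopsis also fixes which parameters the user must supply, which is the kind of dependency the abstract points to.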

Author information

Authors and Affiliations

Department of Computer Science, Acharya Narendra Dev College, University of Delhi, Delhi, India
Sharanjit Kaur
Department of Computer Science, University of Delhi, Delhi, India
Vasudha Bhatnagar
Computer Science and Engineering Department, University of Texas, Arlington, TX, USA
Sharma Chakravarthy

Corresponding author

Correspondence to Sharanjit Kaur.

Editor information

Editors and Affiliations

Cairo University, Cairo, Egypt
Aboul Ella Hassanien
Faculty of Computers and Information, Benha University, Benha, Egypt
Ahmad Taher Azar
Faculty of Elec. Eng. & Comp. Sci., Department of Computer Science, VSB-Technical University of Ostrava, Ostrava-Poruba, Czech Republic
Vaclav Snasael
Polish Academy of Sciences, Warsaw, Poland
Janusz Kacprzyk
School of Information Technology, Deakin University, Victoria, Australia
Jemal H. Abawajy

About this chapter

Cite this chapter

Kaur, S., Bhatnagar, V., Chakravarthy, S. (2015). Stream Clustering Algorithms: A Primer. In: Hassanien, A., Azar, A., Snasael, V., Kacprzyk, J., Abawajy, J. (eds) Big Data in Complex Systems. Studies in Big Data, vol 9. Springer, Cham. https://doi.org/10.1007/978-3-319-11056-1_4

DOI: https://doi.org/10.1007/978-3-319-11056-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11055-4
Online ISBN: 978-3-319-11056-1
eBook Packages: Engineering, Engineering (R0)