Abstract
This paper proposes a grid-based clustering method that dynamically partitions the range of a grid cell according to the distribution statistics of the data elements in a data stream. Initially, the multi-dimensional space of a data domain is partitioned into a set of mutually exclusive, equal-size initial cells. As new data elements are continuously generated, each cell monitors the distribution statistics of the data elements within its range. When the support of the data elements in a cell becomes high enough, the cell is dynamically divided into two mutually exclusive smaller cells, called intermediate cells, under the assumption that the data elements in the cell are normally distributed. The dense sub-range of an initial cell is thus recursively partitioned until it reaches the smallest cell size, called a unit cell. To minimize the number of cells, a sparse intermediate or unit cell can be pruned when its support falls well below a minimum support. The performance of the proposed method is analyzed comparatively through a series of experiments.
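The partitioning idea can be illustrated with a minimal one-dimensional sketch. It is not the authors' algorithm: the abstract does not state where a cell is cut or how statistics are handed to the new cells, so the choices below are assumptions made for illustration only. The sketch cuts a dense cell at its running mean, gives each half of the count to one child (a normal distribution is symmetric about its mean), and seeds each child with half-normal statistics. The names `Cell`, `Grid1D`, `split_support`, `unit_width`, and `min_count` are hypothetical, and pruning of sparse cells is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    lo: float            # inclusive lower bound of the cell's range
    hi: float            # exclusive upper bound of the cell's range
    count: float = 0.0   # elements seen (fractional after a split)
    mean: float = 0.0    # running mean of the elements in the cell
    m2: float = 0.0      # running sum of squared deviations (Welford's method)

    def add(self, x: float) -> None:
        """Update the cell's statistics with one new element."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 0.0


class Grid1D:
    """Hypothetical 1-D grid; the paper works in a multi-dimensional space."""

    def __init__(self, lo, hi, n_initial, split_support, unit_width, min_count=30):
        width = (hi - lo) / n_initial
        # Mutually exclusive, equal-size initial cells.
        self.cells = [Cell(lo + i * width, lo + (i + 1) * width)
                      for i in range(n_initial)]
        self.split_support = split_support  # assumed density threshold
        self.unit_width = unit_width        # cells this narrow are never split
        self.min_count = min_count          # assumed warm-up guard
        self.total = 0

    def insert(self, x: float) -> None:
        for i, cell in enumerate(self.cells):
            if cell.lo <= x < cell.hi:
                self.total += 1
                cell.add(x)
                if self._should_split(cell):
                    # Replace the dense cell with its two children.
                    self.cells[i:i + 1] = self._split(cell)
                return
        # Elements outside the grid's domain are ignored in this sketch.

    def _should_split(self, cell: Cell) -> bool:
        return (cell.count >= self.min_count
                and cell.count / self.total >= self.split_support
                and cell.hi - cell.lo > self.unit_width)

    @staticmethod
    def _split(cell: Cell) -> list:
        # Assumption: cut at the running mean; fall back to the midpoint
        # if the mean does not lie strictly inside the cell.
        cut = cell.mean
        if not cell.lo < cut < cell.hi:
            cut = (cell.lo + cell.hi) / 2
        half = cell.count / 2
        # Mean offset and variance of one half of a normal distribution.
        shift = cell.std * math.sqrt(2 / math.pi)
        var = cell.std ** 2 * (1 - 2 / math.pi)
        left = Cell(cell.lo, cut, half, cell.mean - shift, half * var)
        right = Cell(cut, cell.hi, half, cell.mean + shift, half * var)
        return [left, right]


if __name__ == "__main__":
    grid = Grid1D(0.0, 8.0, n_initial=2, split_support=0.5,
                  unit_width=1.0, min_count=4)
    for value in [1.0, 1.5, 2.0, 2.5]:
        grid.insert(value)
    for cell in grid.cells:
        print(cell.lo, cell.hi, cell.count)
```

Keeping only a count, a mean, and a sum of squared deviations per cell means each element is processed once and then discarded, which matches the one-pass setting of a data stream.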
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Park, N.H., Lee, W.S. (2003). Statistical σ-Partition Clustering over Data Streams. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds) Knowledge Discovery in Databases: PKDD 2003. PKDD 2003. Lecture Notes in Computer Science, vol 2838. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39804-2_35
DOI: https://doi.org/10.1007/978-3-540-39804-2_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20085-7
Online ISBN: 978-3-540-39804-2
eBook Packages: Springer Book Archive