A Statistical μ-Partitioning Method for Clustering Data Streams
A data stream is a massive, unbounded sequence of data elements generated continuously at a rapid rate. For this reason, most algorithms for data streams sacrifice the correctness of their results in exchange for fast processing. This paper proposes a method for clustering a data stream based on statistical μ-partitioning. The multi-dimensional space of the data domain is divided into a set of mutually exclusive, equal-size initial cells. Each cell maintains the distribution statistics of the data elements in its range. Based on these statistics, a dense cell is dynamically split into two mutually exclusive smaller cells, called intermediate cells. The dense sub-range of an initial cell is partitioned recursively in this way until it reaches the smallest cell size, called a unit cell. A cluster of a data stream is a group of adjacent dense unit cells. The smaller the unit cell, the more accurately the resulting clusters are identified. The performance of the proposed algorithm is analyzed comparatively through a series of experiments.
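The cell-splitting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not state the split rule, the density test, or how statistics are passed to child cells. The sketch assumes that a cell is "dense" once its count reaches a hypothetical `min_count`, that the cut is placed at the mean (μ) of the dimension with the largest variance, and that child cells start with empty statistics. The names `Cell`, `split`, `process`, `min_count`, and `unit_size` are all illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cell:
    """Axis-aligned cell that keeps running statistics, not the raw elements."""
    low: list[float]
    high: list[float]
    count: int = 0
    total: list[float] = field(default_factory=list)     # per-dimension sum
    total_sq: list[float] = field(default_factory=list)  # per-dimension sum of squares

    def __post_init__(self):
        if not self.total:
            self.total = [0.0] * len(self.low)
            self.total_sq = [0.0] * len(self.low)

    def contains(self, x):
        return all(lo <= v < hi for lo, v, hi in zip(self.low, x, self.high))

    def add(self, x):
        self.count += 1
        for d, v in enumerate(x):
            self.total[d] += v
            self.total_sq[d] += v * v

    def mean(self, d):
        return self.total[d] / self.count

    def variance(self, d):
        m = self.mean(d)
        return max(self.total_sq[d] / self.count - m * m, 0.0)


def split(cell, unit_size):
    """Split a cell in two at the mean of its highest-variance dimension.

    Assumed rule, not taken from the paper. Returns None when no dimension
    is wide enough to yield two children of at least unit size.
    """
    dims = [d for d in range(len(cell.low))
            if cell.high[d] - cell.low[d] >= 2 * unit_size]
    if not dims:
        return None
    d = max(dims, key=cell.variance)
    # Clamp the cut so that neither child is narrower than a unit cell.
    cut = min(max(cell.mean(d), cell.low[d] + unit_size), cell.high[d] - unit_size)
    left_high = list(cell.high)
    left_high[d] = cut
    right_low = list(cell.low)
    right_low[d] = cut
    # Assumption: children start with empty statistics.
    return [Cell(list(cell.low), left_high), Cell(right_low, list(cell.high))]


def process(cells, x, min_count, unit_size):
    """Route one stream element to its cell and split that cell if it is dense."""
    for i, cell in enumerate(cells):
        if cell.contains(x):
            cell.add(x)
            if cell.count >= min_count:
                children = split(cell, unit_size)
                if children:
                    cells[i:i + 1] = children
            return
```

Starting from a list of initial cells, each arriving element is passed once to `process`; dense regions are refined over time while sparse regions stay coarse. Grouping adjacent dense unit cells into clusters is left out of this sketch.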