Clustering by Pattern Similarity
The task of clustering is to identify classes of similar objects among a set of objects. The definition of similarity varies from one clustering model to another. However, in most of these models, similarity is based on metrics such as Manhattan distance, Euclidean distance, or other Lp distances. In other words, similar objects must have close values on at least some subset of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. The new similarity concept models a wide range of applications. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitudes of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovering such clusters of genes is essential to revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, because it captures not only the closeness of values of certain leading indicators but also the closeness of the (purchasing, browsing, etc.) patterns exhibited by customers. In addition to the novel similarity model, this paper introduces an effective and efficient algorithm to detect such clusters, and we evaluate its performance on several real and synthetic data sets.
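The difference between distance-based and pattern-based similarity can be illustrated with a small sketch. It assumes the pairwise coherence measure usually associated with the pCluster model, in which the score of two objects x, y on two dimensions a, b is |(x_a - x_b) - (y_a - y_b)| and two objects count as similar on a dimension subset when every such score is at most a threshold delta. The function names, the threshold values, and the toy expression vectors are hypothetical and chosen only for illustration; this is not the detection algorithm introduced in the paper.

```python
from itertools import combinations


def p_score(x, y, a, b):
    """Coherence score of objects x and y on dimensions a and b (assumed definition)."""
    return abs((x[a] - x[b]) - (y[a] - y[b]))


def is_pattern_similar(x, y, dims, delta):
    """True if x and y show a coherent pattern on every pair of dimensions in dims."""
    return all(p_score(x, y, a, b) <= delta for a, b in combinations(dims, 2))


def euclidean(x, y, dims):
    """Euclidean distance between x and y restricted to dims."""
    return sum((x[d] - y[d]) ** 2 for d in dims) ** 0.5


# Hypothetical expression levels of two genes under four conditions.
gene1 = [10.0, 50.0, 30.0, 999.0]
gene2 = [110.0, 150.0, 130.0, 5.0]

subset = [0, 1, 2]
# On the first three conditions the genes rise and fall together (shifted by 100),
# so they are pattern-similar even though their values are far apart.
coherent_on_subset = is_pattern_similar(gene1, gene2, subset, delta=1.0)
distance_on_subset = euclidean(gene1, gene2, subset)
# Including the fourth condition breaks the coherent pattern.
coherent_on_all = is_pattern_similar(gene1, gene2, [0, 1, 2, 3], delta=1.0)
```

In this toy case the two vectors are far apart under Euclidean distance on the chosen subset, yet they follow the same shifted pattern there, which is the kind of similarity that distance-based models miss.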
Keywords: data mining, clustering, pattern similarity