Journal of Computer Science and Technology, Volume 23, Issue 4, pp 481–496

Clustering by Pattern Similarity

  • Haixun Wang
  • Jian Pei
Regular Paper

Abstract

The task of clustering is to identify classes of similar objects among a set of objects. The definition of similarity varies from one clustering model to another. In most of these models, however, similarity is based on metrics such as Manhattan distance, Euclidean distance, or other Lp distances. In other words, similar objects must have close values on at least some set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we propose, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. The new similarity concept models a wide range of applications. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitudes of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovering such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, because it captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by customers. In addition to the novel similarity model, this paper introduces an effective and efficient algorithm to detect such clusters, and we evaluate its performance on several real and synthetic data sets.
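The contrast between distance-based and pattern-based similarity can be illustrated with a small sketch. The abstract does not give a formula, so the score used here is an assumption: it measures, for two objects and two dimensions, how much their changes between those dimensions differ, and calls a group coherent when that difference stays within a threshold for every pair of objects and every pair of dimensions. The names `pscore`, `is_coherent`, `euclidean`, `delta`, and the toy gene vectors are hypothetical and chosen only for illustration; this is not the detection algorithm the paper introduces.

```python
from itertools import combinations


def pscore(x, y, a, b):
    """Difference between the changes of x and y from dimension b to dimension a.

    Assumed, illustrative score: 0 means the two objects move in lockstep.
    """
    return abs((x[a] - x[b]) - (y[a] - y[b]))


def is_coherent(rows, dims, delta):
    """True if every pair of rows and every pair of dims scores at most delta."""
    return all(
        pscore(x, y, a, b) <= delta
        for x, y in combinations(rows, 2)
        for a, b in combinations(dims, 2)
    )


def euclidean(x, y, dims):
    """Euclidean distance between x and y restricted to the given dimensions."""
    return sum((x[d] - y[d]) ** 2 for d in dims) ** 0.5


# Made-up expression levels: gene_2 equals gene_1 shifted by 100 on
# dimensions 0-3 and is unrelated on dimension 4.
gene_1 = [10, 50, 30, 70, 5]
gene_2 = [110, 150, 130, 170, 900]

print(euclidean(gene_1, gene_2, [0, 1, 2, 3]))            # far apart in value
print(is_coherent([gene_1, gene_2], [0, 1, 2, 3], 0))     # same pattern
print(is_coherent([gene_1, gene_2], [0, 1, 2, 3, 4], 0))  # pattern breaks on dim 4
```

On dimensions 0 to 3 the two toy genes are far apart under Euclidean distance yet rise and fall together, so a distance-based model would separate them while a pattern-based one would group them. Adding dimension 4 breaks the coherence, which is why the similarity is defined over a subset of dimensions.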

Keywords

data mining, clustering, pattern similarity

Copyright information

© Springer 2008

Authors and Affiliations

  1. IBM T. J. Watson Research Center, Hawthorne, U.S.A.
  2. Simon Fraser University, British Columbia, Canada