Data Mining and Knowledge Discovery, Volume 15, Issue 3, pp 321–348

CrossClus: user-guided multi-relational clustering

  • Xiaoxin Yin
  • Jiawei Han
  • Philip S. Yu
Article

Abstract

Most structured data in real-life applications are stored in relational databases containing multiple semantically linked relations. Unlike clustering in a single table, clustering objects in a relational database usually involves a large number of features that convey very different semantic information, and using all features indiscriminately is unlikely to generate meaningful results. Because the user knows her clustering goal, we propose a new approach called CrossClus, which performs multi-relational clustering under the user's guidance. Unlike semi-supervised clustering, which requires the user to provide a training set, we minimize the user's effort by using a very simple form of user guidance. The user is only required to select one feature or a small set of features pertinent to the clustering goal, and CrossClus searches for other pertinent features in multiple relations. Each feature is evaluated by whether it clusters objects in a way similar to the user-specified features. We design efficient and accurate approaches for both feature selection and object clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of CrossClus.
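
The idea of evaluating a feature by whether it groups objects like the user-specified feature can be illustrated with a minimal sketch. The abstract does not give the actual measure, so the sketch assumes one plausible choice: each categorical feature is turned into a vector that records, for every pair of objects, whether the two objects share a value, and two features are compared by the cosine similarity of those vectors. The function names, the toy data, and the 0.5 threshold are all hypothetical and are not taken from the article.

```python
from math import sqrt


def pairwise_agreement(values):
    """For every unordered pair of objects, 1.0 if they share a feature value, else 0.0."""
    n = len(values)
    return [1.0 if values[i] == values[j] else 0.0
            for i in range(n) for j in range(i + 1, n)]


def feature_similarity(f, g):
    """Cosine similarity between the pairwise-agreement vectors of two features.

    Assumed measure for illustration only; a high score means the two
    features group the objects in a similar way.
    """
    vf, vg = pairwise_agreement(f), pairwise_agreement(g)
    dot = sum(a * b for a, b in zip(vf, vg))
    norm = sqrt(sum(a * a for a in vf)) * sqrt(sum(b * b for b in vg))
    return dot / norm if norm else 0.0


def select_pertinent(user_feature, candidates, threshold=0.5):
    """Keep candidate features whose grouping resembles the user-specified one."""
    return [name for name, values in candidates.items()
            if feature_similarity(user_feature, values) >= threshold]


# Toy data: four objects, one user-specified feature, two candidate features.
user_feature = ["db", "db", "ai", "ai"]
candidates = {
    "venue_area": ["x", "x", "y", "y"],             # same grouping as the user feature
    "birth_month": ["jan", "feb", "jan", "feb"],    # unrelated grouping
}
pertinent = select_pertinent(user_feature, candidates)
```

In this toy example only `venue_area` passes the threshold, because it partitions the four objects exactly as the user-specified feature does, while `birth_month` agrees with it on no pair.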

Keywords

Relational data mining, Clustering

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA
  2. IBM T.J. Watson Research Center, Yorktown Heights, USA