Data Mining and Knowledge Discovery, Volume 11, Issue 1, pp 5–33

Automatic Subspace Clustering of High Dimensional Data

  • Rakesh Agrawal
  • Johannes Gehrke
  • Dimitrios Gunopulos
  • Prabhakar Raghavan

Abstract

Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
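The abstract does not spell out how dense regions are located, so the following is only a minimal sketch under stated assumptions: each dimension is cut into `xi` equal-width intervals, a grid unit counts as dense when it holds more than a fraction `tau` of the records, and subspaces are explored bottom-up, extending only subspaces whose lower-dimensional projections all contain dense units. The names `dense_units`, `describe`, `xi`, and `tau` are illustrative, the data is assumed to be scaled to [0, 1), and the DNF string produced here is not minimized.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, xi, tau):
    """Map each subspace (tuple of dimension indices) to its set of dense units.

    Assumption: every coordinate lies in [0, 1). A unit is a tuple of
    interval indices, one per dimension of the subspace.
    """
    n = len(points)
    d = len(points[0])
    # Discretise every record once into its full-dimensional grid cell.
    cells = [tuple(min(int(x * xi), xi - 1) for x in p) for p in points]
    threshold = tau * n

    def count(subspace):
        # Project the cells onto the subspace and keep units above the threshold.
        hits = Counter(tuple(cell[j] for j in subspace) for cell in cells)
        return {unit for unit, k in hits.items() if k > threshold}

    result = {}
    # Level 1: one-dimensional subspaces that contain at least one dense unit.
    level = {}
    for j in range(d):
        units = count((j,))
        if units:
            level[(j,)] = units

    while level:
        result.update(level)
        next_level = {}
        for a, b in combinations(sorted(level), 2):
            if a[:-1] != b[:-1]:
                continue  # join only subspaces sharing their first k-1 dimensions
            cand = a + (b[-1],)
            # Prune: every (k-1)-dimensional projection must itself hold dense units.
            if all(s in level for s in combinations(cand, len(cand) - 1)):
                units = count(cand)
                if units:
                    next_level[cand] = units
        level = next_level
    return result


def describe(subspace, units, xi):
    """Render dense units as an (unminimized) DNF string of interval predicates."""
    terms = []
    for unit in sorted(units):
        conj = " AND ".join(
            f"{u / xi:g} <= x{j} < {(u + 1) / xi:g}" for j, u in zip(subspace, unit)
        )
        terms.append(f"({conj})")
    return " OR ".join(terms)


# Hypothetical data: six records clustered in dimensions 0 and 1, spread over dimension 2.
points = [(0.11, 0.52, z) for z in (0.05, 0.25, 0.45, 0.65, 0.85, 0.95)]
points += [(0.35, 0.05, 0.15), (0.65, 0.95, 0.35), (0.85, 0.25, 0.55), (0.95, 0.75, 0.75)]
found = dense_units(points, xi=10, tau=0.3)
top = max(found, key=len)  # a subspace of maximum dimensionality
print(top, describe(top, found[top], 10))
```

Because the sketch only counts records per grid unit, its output does not depend on the order of the input records, which mirrors the order-insensitivity property claimed above. The pruning here works at the level of whole subspaces rather than individual candidate units, a simplification chosen to keep the example short.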

Keywords

subspace clustering, clustering, dimensionality reduction

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • Rakesh Agrawal (1)
  • Johannes Gehrke (2)
  • Dimitrios Gunopulos (3)
  • Prabhakar Raghavan (4)

  1. IBM Almaden Research Center, San Jose
  2. Computer Science Department, Cornell University, Ithaca
  3. Department of Computer Science and Eng., University of California Riverside, Riverside
  4. Verity, Inc., Germany
