SCALE: a scalable framework for efficiently clustering transactional data

  • Hua Yan
  • Keke Chen
  • Ling Liu
  • Zhang Yi
Article

Abstract

This paper presents SCALE, a fully automated framework for clustering transactional data. The SCALE design has three distinguishing features. First, we introduce Weighted Coverage Density (WCD) as a categorical similarity measure for efficient clustering of transactional datasets. The measure is intuitive, and it allows the weight of each item in a cluster to change dynamically according to the item's occurrences. Second, we develop a WCD-based clustering algorithm that is fast, memory-efficient, and scalable for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific evaluation metrics are critical for capturing transactional semantics in clustering analysis. The SCALE framework combines WCD-based clustering over a sample dataset with self-configuring methods. These self-configuring methods automatically handle two important aspects of our clustering algorithm: (1) identifying the candidates for the best number K of clusters; and (2) applying the two domain-specific cluster validity measures to find the best result from the set of clustering results. We have conducted extensive experimental evaluation on both synthetic and real datasets, and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner.
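
The abstract does not give a formula for weighted coverage density, so the following Python sketch is only one possible reading of the idea, under a stated assumption: each item's density in a cluster (its occurrence count divided by the number of transactions) is averaged using weights proportional to that item's occurrences. The function name `weighted_coverage_density` and the input layout (a list of item sets) are hypothetical and chosen for illustration only.

```python
from collections import Counter


def weighted_coverage_density(cluster):
    """Illustrative occurrence-weighted density of one cluster.

    `cluster` is a list of transactions, each an iterable of items.
    Assumption (not taken from the abstract): the weight of an item is
    its share of all item occurrences in the cluster.
    """
    n_transactions = len(cluster)
    if n_transactions == 0:
        return 0.0

    # Count in how many transactions each item occurs.
    occur = Counter(item for t in cluster for item in set(t))
    total_occurrences = sum(occur.values())
    if total_occurrences == 0:
        return 0.0

    # Weighted average of per-item density occur/n with weight occur/total.
    return sum(c * c for c in occur.values()) / (total_occurrences * n_transactions)


if __name__ == "__main__":
    identical = [{"a", "b"}, {"a", "b"}]
    mixed = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
    print(weighted_coverage_density(identical))
    print(weighted_coverage_density(mixed))
```

Under this reading, a cluster of identical transactions scores 1.0, and frequent items contribute more to the score than rare ones, which matches the abstract's description of item weights that follow item occurrences.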

Keywords

Transactional data clustering; Cluster assessment; Cluster validation; Frequent itemset mining; Weighted coverage density

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, People’s Republic of China
  2. Department of Computer Science and Engineering, Wright State University, Dayton, USA
  3. Georgia Institute of Technology, College of Computing, Atlanta, USA
  4. Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, People’s Republic of China