On Mining Maximal Pattern-Based Clusters

  • Jian Pei
  • Xiaoling Zhang
  • Moonjung Cho
  • Haixun Wang
  • Philip S. Yu

Pattern-based clustering is important in many applications, such as DNA micro-array data analysis in bio-informatics, as well as automatic recommendation systems and target marketing systems in e-business. However, pattern-based clustering in large databases is still challenging. On the one hand, there can be a huge number of clusters, and many of them can be redundant, which makes pattern-based clustering ineffective. On the other hand, the previously proposed methods may not be efficient or scalable in mining large databases.

In this paper, we study the problem of maximal pattern-based clustering. The major idea is that redundant clusters are avoided completely by mining only the maximal pattern-based clusters. We show that maximal pattern-based clusters are skylines of all pattern-based clusters. Two efficient algorithms, MaPle and MaPle+ (MaPle stands for Maximal Pattern-based Clustering), are developed. The algorithms conduct a depth-first, progressively refining search and prune unpromising branches. MaPle+ further integrates several interesting heuristics. Our extensive performance study on both synthetic and real data sets shows that maximal pattern-based clustering is effective: it reduces the number of clusters substantially. Moreover, MaPle and MaPle+ are more efficient and scalable than the previously proposed pattern-based clustering methods in mining large databases, and MaPle+ often performs better than MaPle.
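The skyline idea can be illustrated with a small sketch. It assumes that a cluster is represented as a pair (set of objects, set of attributes) and that a cluster is redundant when another cluster contains it on both sets. The names `Cluster`, `is_sub_cluster`, and `maximal_clusters` are hypothetical and chosen for illustration. This is a brute-force filter over an already-mined list of clusters, not the depth-first search that MaPle or MaPle+ performs.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: (objects, attributes), both as frozensets of labels.
Cluster = Tuple[FrozenSet[str], FrozenSet[str]]


def is_sub_cluster(a: Cluster, b: Cluster) -> bool:
    """Return True if a differs from b and is contained in b on both sets."""
    return a != b and a[0] <= b[0] and a[1] <= b[1]


def maximal_clusters(clusters: List[Cluster]) -> List[Cluster]:
    """Keep only clusters that are not contained in any other cluster.

    Quadratic in the number of clusters; intended only to show what
    'maximal' means, not as an efficient mining method.
    """
    unique = list(dict.fromkeys(clusters))  # drop duplicates, keep order
    return [
        c for c in unique
        if not any(is_sub_cluster(c, other) for other in unique)
    ]


# Hypothetical example data: four clusters over objects o1..o3, attributes a..c.
example: List[Cluster] = [
    (frozenset({"o1", "o2"}), frozenset({"a", "b"})),
    (frozenset({"o1", "o2", "o3"}), frozenset({"a", "b"})),
    (frozenset({"o1", "o2"}), frozenset({"a", "b", "c"})),
    (frozenset({"o1"}), frozenset({"a", "b"})),
]

maximal = maximal_clusters(example)
```

In this example the first and fourth clusters are contained in the second, so only the second and third survive. Neither of those two contains the other: one has more objects, the other more attributes, which is what makes the maximal clusters a skyline.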







Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Jian Pei
  • Xiaoling Zhang (1)
  • Moonjung Cho
  • Haixun Wang
  • Philip S. Yu

  1. Boston University, Boston, USA
