Abstract
Clustering results become more comprehensible and usable when each group is associated with a characteristic description. However, clustering followed by characterization of the resulting clusters does not always yield clusters associated with distinctive features, because the clustering step and the subsequent classification step are performed independently. This calls for a method that combines clustering and classification and executes both simultaneously.
In this paper, we focus on itemsets as the features for characterizing groups and present a technique called “itemset classified clustering,” which divides data into groups under the restriction that only divisions expressible by a common itemset are allowed, and computes the optimal itemset that maximizes the interclass variance between the groups. Although this optimization problem is intractable in general, we develop techniques that effectively prune the search space and compute optimal solutions efficiently in practice. We observe that itemset classified clusters are likely to be overlooked by traditional clustering algorithms such as two-clustering or k-means, and we demonstrate the scalability of our algorithm with respect to data size by applying it to real biological datasets.
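To make the objective concrete, the following is a minimal brute-force sketch, not the pruned search developed in the paper. It assumes each record pairs a set of items with a numeric vector, that an itemset splits the records into those containing it and those not containing it, and that "interclass variance" means the usual between-group sum of squares (group size times squared distance from the group centroid to the overall centroid). The names `centroid`, `interclass_variance`, `best_itemset`, and `max_size` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def interclass_variance(inside, outside):
    """Between-group sum of squares for a two-way split (assumed definition)."""
    overall = centroid(inside + outside)
    total = 0.0
    for group in (inside, outside):
        if group:
            g = centroid(group)
            total += len(group) * sum((a - b) ** 2 for a, b in zip(g, overall))
    return total


def best_itemset(records, max_size=2):
    """Exhaustively try every itemset of up to max_size items.

    records: list of (set_of_items, numeric_vector) pairs.
    Returns (score, itemset); ties keep the itemset found first.
    """
    items = sorted(set().union(*(s for s, _ in records)))
    best = (0.0, None)
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            itemset = set(combo)
            inside = [v for s, v in records if itemset <= s]
            outside = [v for s, v in records if not itemset <= s]
            if not inside or not outside:
                continue  # the itemset does not actually divide the data
            score = interclass_variance(inside, outside)
            if score > best[0]:
                best = (score, frozenset(combo))
    return best


# Toy data: records containing item "b" have clearly different vectors.
records = [
    ({"a", "b"}, [10.0, 10.0]),
    ({"a", "b", "c"}, [10.0, 10.0]),
    ({"a"}, [0.0, 0.0]),
    ({"c"}, [0.0, 0.0]),
]
score, itemset = best_itemset(records)
print(score, sorted(itemset))
```

The exhaustive loop examines a number of candidate itemsets that grows exponentially with `max_size`, which is why the abstract emphasizes pruning the search space; this sketch only illustrates what is being optimized.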
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sese, J., Morishita, S. (2004). Itemset Classified Clustering. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_37
DOI: https://doi.org/10.1007/978-3-540-30116-5_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive