Itemset Classified Clustering

Sese, Jun; Morishita, Shinichi

doi:10.1007/978-3-540-30116-5_37


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3202)

Abstract

Clustering results are more comprehensible and usable when individual groups are associated with characteristic descriptions. However, characterizing clusters after clustering does not always yield clusters with distinctive features, because the clustering step and the subsequent classification step are performed independently. This calls for a method that combines clustering and classification and executes both simultaneously.

In this paper, we focus on itemsets as the features for characterizing groups and present a technique called “itemset classified clustering,” which divides data into groups under the restriction that only divisions expressible by a common itemset are allowed, and computes the optimal itemset, that is, the one maximizing the interclass variance between the groups. Although this optimization problem is generally intractable, we develop techniques that effectively prune the search space and efficiently compute optimal solutions in practice. We remark that itemset classified clusters are likely to be overlooked by traditional clustering algorithms such as two-clustering or k-means, and we demonstrate the scalability of our algorithm with respect to the amount of data by applying our method to real biological datasets.
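
To make the optimization concrete, here is a minimal brute-force sketch in Python. It is not the authors' algorithm: it enumerates every candidate itemset up to a small size and applies none of the pruning the paper develops. The record layout (an item set paired with a numeric vector), the two-group form of the interclass variance (group size times squared distance of the group centroid from the overall centroid, summed over both groups), the function names, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def interclass_variance(inside, outside):
    """Assumed definition: sum over both groups of
    |group| * squared distance between group centroid and overall centroid."""
    if not inside or not outside:
        return 0.0  # a split with an empty side separates nothing
    overall = mean(inside + outside)
    total = 0.0
    for group in (inside, outside):
        centroid = mean(group)
        total += len(group) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
    return total


def best_itemset(records, max_size=2):
    """Exhaustively score every itemset of up to max_size items.

    Each record is (items, vector). An itemset splits the data into the
    records that contain all of its items and the records that do not.
    """
    items = sorted(set().union(*(its for its, _ in records)))
    best = (0.0, frozenset())
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            inside = [vec for its, vec in records if itemset <= its]
            outside = [vec for its, vec in records if not itemset <= its]
            score = interclass_variance(inside, outside)
            if score > best[0]:
                best = (score, itemset)
    return best


# Hypothetical toy data: records containing both "a" and "b" sit far from the rest.
records = [
    (frozenset({"a", "b"}), (1.0, 1.0)),
    (frozenset({"a", "b"}), (1.2, 0.8)),
    (frozenset({"a"}), (0.0, 0.1)),
    (frozenset({"b"}), (0.1, 0.0)),
    (frozenset(), (0.0, 0.0)),
]
score, itemset = best_itemset(records)
print(sorted(itemset), round(score, 4))
```

On this toy data the itemset {a, b} scores highest, since neither item alone separates the two distant records from the rest as cleanly. The exhaustive loop grows exponentially with the number of items, which is why search-space pruning matters on real datasets.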


Author information

Authors and Affiliations

Undergraduate Program for Bioinformatics and Systems Biology, Graduate School of Information Science and Technology, University of Tokyo
Jun Sese
Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, and Institute for Bioinformatics Research and Development, Japan Science and Technology Corporation
Shinichi Morishita

Editor information

Editors and Affiliations

INSA-Lyon, LIRIS CNRS UMR5205, F-69621, Villeurbanne, France
Jean-François Boulicaut
Dipartimento di Informatica, Università degli Studi di Bari
Floriana Esposito
Pisa KDD Laboratory, ISTI - CNR, Area della Ricerca di Pisa, Via Giuseppe Moruzzi 1, Pisa, Italy
Fosca Giannotti
Dipartimento di Informatica, Via F. Buonarroti 2, 56127, Pisa, Italy
Dino Pedreschi

About this paper

Cite this paper

Sese, J., Morishita, S. (2004). Itemset Classified Clustering. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_37

DOI: https://doi.org/10.1007/978-3-540-30116-5_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive
