Clustering Algorithms Optimizer: A Framework for Large Datasets

Varshavsky, Roy; Horn, David; Linial, Michal

doi:10.1007/978-3-540-72031-7_8

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4463)

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

Abstract

Clustering algorithms are employed in many bioinformatics tasks, including categorization of protein sequences and analysis of gene-expression data. Although these algorithms are routinely applied, many of them suffer from two limitations: (i) they rely on predetermined parameter settings, such as a priori knowledge of the number of clusters; and (ii) they involve nondeterministic procedures that yield inconsistent outcomes. A framework that addresses these shortcomings is therefore desirable. We provide a data-driven framework that consists of two interrelated steps: the first is SVD-based dimension reduction, and the second is automated tuning of the algorithm’s parameter(s). The dimension reduction step is adapted to run efficiently on very large datasets. The optimal parameter setting is identified according to an internal evaluation criterion, the Bayesian Information Criterion (BIC). This framework can incorporate most clustering algorithms and improve their performance. In this study we illustrate the effectiveness of this platform by incorporating the standard K-Means and Quantum Clustering algorithms. The implementations are applied to several gene-expression benchmarks with significant success.
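As an illustration of the second step only, the following is a minimal sketch of BIC-driven selection of the number of clusters for K-Means. It is not the authors' implementation. The BIC form used here (hard assignments, spherical Gaussian clusters with one shared variance, lower score is better), the deterministic farthest-point initialisation, the function names `kmeans`, `bic` and `select_k`, and the toy data are all assumptions made for the example. The SVD-based reduction step is omitted, so the input points stand in for data that has already been projected to a low-dimensional space.

```python
import math
import random


def kmeans(points, k, iters=100):
    """Lloyd's K-Means with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest already-chosen center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))

    def assign(cs):
        return [min(range(len(cs)), key=lambda j: math.dist(p, cs[j])) for p in points]

    for _ in range(iters):
        labels = assign(centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster happens to lose all its members.
            updated.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            )
        if updated == centers:
            break
        centers = updated
    return assign(centers), centers


def bic(points, labels, centers):
    """BIC under an assumed spherical, shared-variance Gaussian model (lower is better)."""
    n, d, k = len(points), len(points[0]), len(centers)
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    var = max(sse / (n * d), 1e-12)  # maximum-likelihood variance, floored to avoid log(0)
    sizes = [labels.count(j) for j in range(k)]
    log_lik = sum(m * math.log(m / n) for m in sizes if m > 0)  # mixing-weight term
    log_lik -= 0.5 * n * d * math.log(2 * math.pi * var) + sse / (2 * var)
    n_params = k * d + (k - 1) + 1  # means + mixing weights + shared variance
    return -2 * log_lik + n_params * math.log(n)


def select_k(points, k_values):
    """Run K-Means for each candidate k and return the k with the lowest BIC."""
    scores = {}
    for k in k_values:
        labels, centers = kmeans(points, k)
        scores[k] = bic(points, labels, centers)
    return min(scores, key=scores.get), scores


# Toy data (hypothetical): three well-separated 2-D blobs of 30 points each.
rng = random.Random(0)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
data = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)) for cx, cy in blobs for _ in range(30)]

best_k, scores = select_k(data, range(1, 7))
print("selected k:", best_k)
```

The deterministic initialisation is used so that repeated runs give the same partition, which speaks to the nondeterminism concern raised in the abstract. Other BIC conventions (for example, a sign flip so that higher is better, or per-cluster variances) would change the scores but not the overall sweep-and-select structure.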

Author information

Authors and Affiliations

School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
Roy Varshavsky
School of Physics and Astronomy, Tel Aviv University, Israel
David Horn
Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
Michal Linial

Editor information

Ion Măndoiu, Alexander Zelikovsky

About this paper

Cite this paper

Varshavsky, R., Horn, D., Linial, M. (2007). Clustering Algorithms Optimizer: A Framework for Large Datasets. In: Măndoiu, I., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72031-7_8

DOI: https://doi.org/10.1007/978-3-540-72031-7_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72030-0
Online ISBN: 978-3-540-72031-7
eBook Packages: Computer Science, Computer Science (R0)
