Abstract
Clustering is routinely used in gene expression data analysis to mine groups of co-expressed genes. Commonly used clustering algorithms require the user to specify the number of clusters a priori. We have developed a method that identifies, from a set of candidate partitions, the one with the maximal number of distinct clusters. Principal component analysis is used to characterize each cluster by its dominant eigenvectors, which describe the correlation between the constituent genes. Similarity between each pair of clusters is measured as the angle between their principal component subspaces. A cluster is deemed ‘distinct’ if it shows low similarity to all other clusters in that partition. The method assigns each candidate partition a cumulative measure of the distinctness of all its clusters, called the Net Principal Subspace Information (NEPSI) index. The candidate partition with the highest NEPSI index value has the maximal number of distinct clusters and is selected as the ‘best’. We illustrate the efficacy of the proposed method using two gene expression datasets and two different clustering algorithms: k-means and model-based clustering. A comparison of the results with those from the Bayesian Information Criterion is also given.
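The subspace comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each cluster is a genes-by-conditions matrix, that both clusters are summarized by the same number k of principal components, and that similarity is summarized as the mean squared cosine of the principal angles between the two subspaces (a Krzanowski-style similarity factor). The function names `principal_subspace` and `subspace_similarity` are hypothetical, and the NEPSI index itself is not computed because its formula is not given here.

```python
import numpy as np


def principal_subspace(X, k):
    """Orthonormal basis (as columns) for the top-k principal directions of X.

    X is assumed to be an (n_genes, n_conditions) matrix for one cluster.
    """
    Xc = X - X.mean(axis=0)  # center each condition (column)
    # Rows of vt are the PCA loading vectors, ordered by explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k].T


def subspace_similarity(A, B):
    """Mean squared cosine of the principal angles between span(A) and span(B).

    A and B must have orthonormal columns. The result is 1.0 for identical
    subspaces and 0.0 for mutually orthogonal ones.
    """
    # Singular values of A^T B are the cosines of the principal angles.
    cosines = np.linalg.svd(A.T @ B, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against rounding error
    return float(np.mean(cosines ** 2))


if __name__ == "__main__":
    # Synthetic data for illustration only, not a real expression dataset.
    rng = np.random.default_rng(0)
    cluster_a = rng.normal(size=(50, 8))
    cluster_b = rng.normal(size=(40, 8))
    basis_a = principal_subspace(cluster_a, k=2)
    basis_b = principal_subspace(cluster_b, k=2)
    print(subspace_similarity(basis_a, basis_b))
```

Under these assumptions, a cluster would count as distinct when its similarity to every other cluster in the partition is low. The threshold for "low" and the way per-cluster values are accumulated into a partition-level score are left to the method itself.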
Cite this article
Jonnalagadda, S., Srinivasan, R. Determining distinct clusters in gene expression data using similarity in principal component subspaces. Int J Adv Eng Sci Appl Math 4, 41–51 (2012). https://doi.org/10.1007/s12572-012-0055-1