Determining distinct clusters in gene expression data using similarity in principal component subspaces

International Journal of Advances in Engineering Sciences and Applied Mathematics

Abstract

Clustering is routinely used in gene expression data analysis to mine groups of co-expressed genes. Commonly used clustering algorithms require the user to specify the number of clusters a priori. We have developed a method that identifies, from a set of candidate partitions, the one with the maximal number of distinct clusters. Principal component analysis is used to characterize each cluster by its dominant eigenvectors, which describe the correlation between the constituent genes. Similarity between each pair of clusters is measured as the angle between their principal component subspaces. A cluster is deemed ‘distinct’ if it shows low similarity to all other clusters in that partition. The method assigns each candidate partition a cumulative measure of the distinctness of all its clusters, called the Net Principal Subspace Information (NEPSI) Index. The candidate partition with the highest NEPSI index value has the maximal number of distinct clusters and is selected as the ‘best’. We illustrate the efficacy of the proposed method using two gene expression datasets and two different clustering algorithms, k-means and model-based clustering. A comparison of the results with those from the Bayesian Information Criterion is also given.
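
The pairwise comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it does not compute the NEPSI index, whose formula the abstract does not give. The sketch assumes that rows are genes and columns are experimental conditions, that each cluster is summarized by its top k principal directions, and that similarity is the mean squared cosine of the principal angles between two subspaces (a Krzanowski-style measure). The function names, the value of k, and the distinctness threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def principal_subspace(X, k):
    """Orthonormal (p, k) basis for the top-k principal directions of X (n genes x p conditions)."""
    Xc = X - X.mean(axis=0)  # center each condition (column)
    # Right singular vectors of the centered data are the covariance eigenvectors,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k].T

def subspace_similarity(A, B):
    """Mean squared cosine of the principal angles between span(A) and span(B); 0 = orthogonal, 1 = identical."""
    # For orthonormal bases, the singular values of A^T B are the cosines of the principal angles.
    cosines = np.linalg.svd(A.T @ B, compute_uv=False)
    return float(np.mean(cosines ** 2))

def distinct_clusters(X, labels, k=2, threshold=0.5):
    """Flag each cluster as distinct if its similarity to every other cluster is below `threshold`.

    `k` and `threshold` are hypothetical illustration values, not taken from the paper.
    Each cluster is assumed to contain more than k genes.
    """
    ids = np.unique(labels)
    bases = [principal_subspace(X[labels == c], k) for c in ids]
    flags = {}
    for i, c in enumerate(ids):
        sims = [subspace_similarity(bases[i], bases[j])
                for j in range(len(ids)) if j != i]
        flags[int(c)] = all(s < threshold for s in sims)
    return flags
```

Under these assumptions, a candidate partition could be scored by how many of its clusters are flagged as distinct. The index used in the paper is a cumulative measure of distinctness and may weight clusters differently.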

Author information

Corresponding author

Correspondence to Rajagopalan Srinivasan.

About this article

Cite this article

Jonnalagadda, S., Srinivasan, R. Determining distinct clusters in gene expression data using similarity in principal component subspaces. Int J Adv Eng Sci Appl Math 4, 41–51 (2012). https://doi.org/10.1007/s12572-012-0055-1

  • DOI: https://doi.org/10.1007/s12572-012-0055-1