A Systems Biology Approach for Unsupervised Clustering of High-Dimensional Data

  • Diana Diaz
  • Tin Nguyen
  • Sorin Draghici
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10122)


One main challenge in modern medicine is the discovery of molecular disease subtypes characterized by relevant clinical differences, such as survival. However, clustering high-dimensional expression data is difficult because of noise and the curse of dimensionality. This article describes a disease subtyping pipeline that exploits the information available in pathway databases and clinical variables. The pipeline consists of a new feature selection procedure and existing clustering methods. Our procedure partitions a set of patients using the set of genes in each pathway as clustering features. To select the best features, the procedure estimates the relevance of each pathway and fuses the relevant pathways. By analyzing a TCGA colon cancer gene expression dataset, we show that our pipeline finds subtypes of patients with more distinctive survival profiles than traditional subtyping methods. We also demonstrate that our pipeline improves three different clustering methods: k-means, SNF, and hierarchical clustering.
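
The three steps named in the abstract (cluster patients per pathway, score each pathway, fuse the relevant ones) can be illustrated with a minimal Python sketch. The abstract does not say how relevance is estimated or how pathways are fused, so everything below is an assumption for illustration only: the function names (kmeans, relevance, subtype) are hypothetical, relevance is approximated by the gap in mean survival time between clusters (a crude stand-in that ignores censoring, not a proper survival test), and fusion is taken to be the union of the genes of the top-scoring pathways. It is not the authors' implementation.

```python
import numpy as np

def kmeans(X, k=2, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every patient to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def relevance(labels, survival):
    """Hypothetical score: gap between the largest and smallest
    mean survival time across clusters (ignores censoring)."""
    means = [survival[labels == j].mean() for j in np.unique(labels)]
    return max(means) - min(means)

def subtype(expr, pathways, survival, k=2, top=1):
    """expr: patients x genes matrix; pathways: name -> gene column indices."""
    scores = {}
    for name, genes in pathways.items():
        # Step 1: partition patients using only this pathway's genes.
        labels = kmeans(expr[:, genes], k)
        # Step 2: estimate how clinically relevant that partition is.
        scores[name] = relevance(labels, survival)
    # Step 3 (assumed fusion rule): pool the genes of the top pathways
    # and cluster once more on the pooled features.
    best = sorted(scores, key=scores.get, reverse=True)[:top]
    fused = sorted({g for name in best for g in pathways[name]})
    return best, kmeans(expr[:, fused], k)
```

Calling subtype(expr, {"A": [0, 1, 2], "B": [3, 4, 5]}, survival) returns the names of the selected pathways and a final cluster label per patient. In practice the relevance score would more plausibly come from a survival test that handles censored observations, and the final clustering step could be swapped for SNF or hierarchical clustering, the other two methods the abstract mentions.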


Feature Selection; Hierarchical Cluster; Gene Expression Data; Feature Selection Method; Pathway Database



This study used data generated by the TCGA Research Network; we thank donors and research groups for sharing these valuable data. This research was supported in part by the following grants: NIH R01 DK089167, R42 GM087013 and NSF DBI-0965741, and by the Robert J. Sokol Endowment in Systems Biology. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Wayne State University, Computer Science, Detroit, USA
  2. Wayne State University, Obstetrics and Gynecology, Detroit, USA
