Clustering Procedures and Module Detection

  • Steve Horvath


Detecting clusters (also referred to as groups or modules) of closely related objects is an important problem in data mining, and network modules are often defined as clusters. Two procedures are widely used in network applications: partitioning-around-medoids (PAM) clustering and hierarchical clustering. PAM (also known as k-medoids clustering) yields relatively robust clusters but requires the user to specify the number k of clusters. Hierarchical clustering is attractive in network applications because (a) it does not require the number of clusters to be specified and (b) it works well when there are many singleton clusters and when cluster sizes vary greatly. However, hierarchical clustering requires the user to decide how to cut the branches of the resulting cluster tree. Toward this end, one can use the Dynamic Tree Cut method, which is implemented in the dynamicTreeCut R library. The dynamic hybrid method combines the advantages of hierarchical clustering and PAM clustering. Network concepts are useful for defining cluster quality statistics (e.g., to measure the density or separability of clusters). To determine whether the cluster structure is preserved in another data set, one can use cross-tabulation-based preservation statistics. To measure the agreement between two clusterings, one can use the Rand index and other cross-tabulation-based statistics.
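As an illustration of the cross-tabulation-based agreement statistics mentioned above, the following is a minimal Python sketch of the Rand index and of its chance-corrected variant, the adjusted Rand index. It is not taken from the chapter: the function names `pair_counts`, `rand_index`, and `adjusted_rand_index` are hypothetical, and the adjusted form follows the standard pair-counting definition, which is assumed here to match the one the chapter uses.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count object pairs from the cross-tabulation of two clusterings.

    Returns (total_pairs, same_in_both, same_in_a, same_in_b).
    Hypothetical helper, written for illustration only.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are needed to form a pair")
    table = Counter(zip(labels_a, labels_b))  # cells of the cross-tabulation
    rows = Counter(labels_a)                  # cluster sizes in clustering A
    cols = Counter(labels_b)                  # cluster sizes in clustering B
    same_in_both = sum(comb(c, 2) for c in table.values())
    same_in_a = sum(comb(c, 2) for c in rows.values())
    same_in_b = sum(comb(c, 2) for c in cols.values())
    return comb(n, 2), same_in_both, same_in_a, same_in_b


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which the two clusterings agree."""
    total, both, same_a, same_b = pair_counts(labels_a, labels_b)
    # Pairs separated in both clusterings, by inclusion-exclusion.
    apart_in_both = total - same_a - same_b + both
    return (both + apart_in_both) / total


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for the agreement expected by chance."""
    total, both, same_a, same_b = pair_counts(labels_a, labels_b)
    expected = same_a * same_b / total
    maximum = (same_a + same_b) / 2
    if maximum == expected:
        # Degenerate case (e.g., both clusterings are a single cluster).
        return 1.0
    return (both - expected) / (maximum - expected)


if __name__ == "__main__":
    a = [1, 1, 2, 2, 3, 3]
    b = ["x", "x", "y", "y", "y", "z"]
    print(rand_index(a, b), adjusted_rand_index(a, b))
```

Both statistics depend only on the cross-tabulation of the two label vectors, so they are unaffected by how the clusters are named. The adjusted version is close to 0 for unrelated clusterings and equals 1 for identical ones.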


Keywords: Dissimilarity Measure; Rand Index; Cluster Assignment; Adjusted Rand Index; Cluster Label



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. University of California, Los Angeles, Los Angeles, USA
