Clustering Procedures and Module Detection
Detecting clusters (also referred to as groups or modules) of closely related objects is an important problem in data mining in general. Network modules are often defined as clusters. Partitioning-around-medoids (PAM) clustering and hierarchical clustering are often used in network applications. PAM (also known as k-medoid clustering) yields relatively robust clusters but requires the user to specify the number k of clusters. Hierarchical clustering is attractive in network applications because (a) it does not require the number of clusters to be specified and (b) it works well when there are many singleton clusters and when cluster sizes vary greatly. However, hierarchical clustering requires the user to decide how to cut the branches of the resulting cluster tree. To this end, one can use the dynamic tree cut method, implemented in the dynamicTreeCut R package. The dynamic hybrid method combines the advantages of hierarchical clustering and PAM clustering. Network concepts are useful for defining cluster quality statistics (e.g., to measure the density or separability of clusters). To determine whether the cluster structure is preserved in another data set, one can use cross-tabulation-based preservation statistics. To measure the agreement between two clusterings, one can use the Rand index and other cross-tabulation-based statistics.
Keywords: Dissimilarity Measure, Rand Index, Cluster Assignment, Adjusted Rand Index, Cluster Label
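As an illustration of the cross-tabulation-based agreement statistics mentioned in the abstract, the following minimal Python sketch computes the Rand index and the adjusted Rand index of two clusterings from their contingency table. The function name `rand_indices` and the convention of returning 1.0 for the adjusted index in the degenerate case are assumptions made for this example; they are not taken from the chapter or from any particular R package.

```python
from collections import Counter
from math import comb


def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two clusterings.

    labels_a and labels_b hold the cluster label of each object, in the
    same object order. Hypothetical helper written for illustration only.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are required")

    # Cross-tabulation: cell counts plus row and column totals.
    cells = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)

    # Numbers of object pairs placed together in both clusterings,
    # in the first clustering, and in the second clustering.
    pairs_both = sum(comb(c, 2) for c in cells.values())
    pairs_a = sum(comb(c, 2) for c in rows.values())
    pairs_b = sum(comb(c, 2) for c in cols.values())
    pairs_total = comb(n, 2)

    # Rand index: fraction of pairs on which the clusterings agree
    # (together in both, or separated in both).
    agreements = pairs_total + 2 * pairs_both - pairs_a - pairs_b
    rand = agreements / pairs_total

    # Adjusted Rand index: correct pairs_both for chance agreement.
    expected = pairs_a * pairs_b / pairs_total
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Degenerate case (e.g., a single cluster in both clusterings);
        # returning 1.0 here is a convention assumed for this sketch.
        adjusted = 1.0
    else:
        adjusted = (pairs_both - expected) / (maximum - expected)
    return rand, adjusted


if __name__ == "__main__":
    # Identical partitions with different label names agree perfectly.
    print(rand_indices([1, 1, 2, 2], ["x", "x", "y", "y"]))
    # Partitions that split every pair differently agree poorly.
    print(rand_indices([1, 1, 2, 2], [1, 2, 1, 2]))
```

Because both statistics depend only on the cross-tabulation, they are invariant to how the cluster labels are named. The adjusted version is close to 0 for chance-level agreement and can be negative, whereas the unadjusted Rand index always lies between 0 and 1.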