An Overview of Clustering Applied to Molecular Biology

  • Rebecca Nugent
  • Marina Meila
Part of the Methods in Molecular Biology book series (MIMB, volume 620)


In molecular biology, we are often interested in determining the group structure in, e.g., a population of cells or microarray gene expression data. Clustering methods identify groups of similar observations, but the results can depend on the chosen method’s assumptions and starting parameter values. In this chapter, we give a broad overview of both attribute- and similarity-based clustering, describing the methods and their performance. The parametric and nonparametric approaches presented differ in whether they require the number of clusters to be known in advance and in the shapes of the clusters they estimate. Additionally, we include a biclustering algorithm that incorporates variable selection into the clustering procedure. We finish with a discussion of some common methods for comparing two clustering solutions (possibly from different methods). The user is advised to devote time and attention to determining the appropriate clustering approach (and any corresponding parameter values) for the specific application prior to analysis.
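As one illustration of what comparing two clustering solutions can involve, the sketch below computes the Rand index: the fraction of observation pairs that two partitions treat the same way, either grouped together in both or separated in both. The function name `rand_index` and the toy label vectors are assumptions made for this example; they are not taken from the chapter, and the index shown is only one of several possible comparison criteria.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two
    observations in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same observations")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two observations")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical cluster labels for six observations from two different methods.
method_1 = [0, 0, 0, 1, 1, 1]
method_2 = [1, 1, 0, 0, 0, 0]

print(rand_index(method_1, method_1))  # identical partitions agree on every pair
print(rand_index(method_1, method_2))
```

Because only pair co-membership is compared, the value does not change when cluster labels are renamed, which matters when the two solutions come from different methods or from different starting values of the same method.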

Key words

Cluster analysis, K-means, model-based clustering, EM algorithm, similarity-based clustering, spectral clustering, nonparametric clustering, hierarchical clustering, biclustering, comparing partitions



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Rebecca Nugent (1)
  • Marina Meila (2)
  1. Department of Statistics, Carnegie Mellon University, Pittsburgh, USA
  2. Department of Statistics, University of Washington, Seattle, USA
