Clustering dependent observations with copula functions
This paper addresses the problem of clustering dependent observations according to their underlying complex generating process. Di Lascio and Giannerini (Journal of Classification 29(1):50–75, 2012) introduced the CoClust, a clustering algorithm based on copula functions that accomplishes this task but carries a high computational burden. Moreover, the CoClust automatically allocates every observation to a cluster, so it cannot discard potentially irrelevant observations. In this paper we introduce an improved version of the CoClust that overcomes both issues and performs better in many respects. By means of a Monte Carlo study we investigate the features of the algorithm and show that it consistently improves on the original CoClust. The validity of our proposal is further supported by applications to real data sets of human breast tumor samples, for which the algorithm provides a meaningful biological interpretation. The new algorithm is implemented and made available through an updated version of the R package CoClust.
Keywords: Copula function; Multivariate dependence structure; Clustering; Biological tumor sample
Mathematics Subject Classification: 62H30, 62H20, 62P10
F. Marta L. Di Lascio acknowledges the support of Free University of Bozen-Bolzano, Faculty of Economics and Management, via the project “Multivariate analysis techniques based on copula function”.
- Brechmann E, Schepsmeier U (2013) Modeling dependence with C- and D-vine copulas: the R package CDVine. J Stat Softw 52(3):1–27
- Clarke K (2007) A simple distribution-free test for non-nested model selection. Polit Anal 15:347–363
- Di Lascio FML, Giannerini S (2015) CoClust. R package version 0.3-1
- Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95:14863–14868
- Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Sauter G (2001) Gene-expression profiles in hereditary breast cancer. N Engl J Med 344(8):539–548
- Joe H, Xu J (1996) The estimation method of inference functions for margins for multivariate models. Technical Report 166, Department of Statistics, University of British Columbia
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, Golub T (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96(6):2907–2912
- Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987