Determining the Number of Clusters Using Multivariate Ranks
Determining the number of clusters in multivariate data has become one of the most important issues in many diverse scientific disciplines. The forward search algorithm is a graphical approach that helps with this task. The traditional forward search based on Mahalanobis distances was introduced by Hadi (1992) and Atkinson (1994), while Atkinson et al. (2004) used it as a clustering method. But like many other Mahalanobis distance-based methods, it cannot be correctly applied to asymmetric distributions or, more generally, to distributions that depart from the elliptical symmetry assumption. We propose a new forward search methodology based on spatial ranks, in which clusters are grown sequentially, one data point at a time, using spatial ranks with respect to the points already in the subsample. The algorithm starts from a randomly chosen initial subsample. We illustrate with simulated data that the proposed algorithm is robust to the choice of initial subsample and performs well for different multivariate mixture distributions. We also propose a modified algorithm based on the volume of central rank regions. Our numerical examples show that it produces the best results under elliptical symmetry.
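The rank-based growth step described above can be sketched in a few lines of Python. This is only an illustrative reading of the abstract, not the authors' implementation: it assumes the usual definition of the spatial rank of a point x with respect to a subsample (the average of the unit vectors pointing from each subsample point to x), and it assumes that each step adds the outside point whose spatial rank has the smallest norm. The function names `spatial_rank_norms` and `forward_search`, the `initial_size` and `seed` parameters, and the simulated two-cluster data are all hypothetical choices made for the example.

```python
import numpy as np


def spatial_rank_norms(points, reference):
    """Norm of the spatial rank of each row of `points` w.r.t. `reference`.

    Assumed definition: R(x) = mean_i (x - X_i) / ||x - X_i||.
    The norm is near 0 for central points and near 1 for outlying ones.
    """
    points = np.asarray(points, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Pairwise differences, shape (n_points, n_reference, dim).
    diffs = points[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diffs, axis=2, keepdims=True)
    # Unit vectors; a zero difference contributes a zero vector.
    units = np.divide(diffs, dists, out=np.zeros_like(diffs), where=dists > 0)
    return np.linalg.norm(units.mean(axis=1), axis=1)


def forward_search(data, initial_size=5, seed=0):
    """Grow a subsample one point at a time from a random start.

    Assumption: the point added at each step is the outside point with the
    smallest spatial rank norm relative to the current subsample.
    Returns the order in which points were added and the rank norm of each
    added point, which can be plotted against the subsample size.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    inside = np.zeros(n, dtype=bool)
    inside[rng.choice(n, size=initial_size, replace=False)] = True

    order, min_norms = [], []
    while not inside.all():
        outside_idx = np.flatnonzero(~inside)
        norms = spatial_rank_norms(data[outside_idx], data[inside])
        best = int(np.argmin(norms))
        order.append(int(outside_idx[best]))
        min_norms.append(float(norms[best]))
        inside[outside_idx[best]] = True
    return order, min_norms


# Simulated example: two well-separated bivariate Gaussian groups.
demo_rng = np.random.default_rng(1)
demo_data = np.vstack([
    demo_rng.normal(loc=0.0, scale=1.0, size=(30, 2)),
    demo_rng.normal(loc=10.0, scale=1.0, size=(30, 2)),
])
demo_order, demo_norms = forward_search(demo_data, initial_size=5, seed=0)
```

Plotting `demo_norms` against the subsample size gives a forward plot in the spirit of the graphical approach. Under the assumptions above, one would look for a pronounced peak where the search exhausts one group and has to reach into the next, although how clearly it shows depends on the data and on the starting subsample.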
Keywords: Gaussian Mixture Model, Mahalanobis Distance, Subset Size, Forward Search, Clear Maximum
The authors would like to greatly thank the editors of ICORS 2015 and the two referees for their helpful remarks and comments on an earlier version of the manuscript. The research of Mohammed Baragilly is partially supported by the Egyptian Government and he would like to express his greatest appreciation to the Egyptian Cultural Centre and Educational Bureau in London and to the Department of Applied Statistics, Helwan University.