Determining the Number of Clusters Using Multivariate Ranks

  • Mohammed Baragilly
  • Biman Chakraborty
Conference paper


Determining the number of clusters in multivariate data has become one of the most important problems across a wide range of scientific disciplines. The forward search algorithm is a graphical approach that helps with this task. The traditional forward search based on Mahalanobis distances was introduced by Hadi (1992) and Atkinson (1994), while Atkinson et al. (2004) used it as a clustering method. Like many other methods based on Mahalanobis distances, however, it cannot be applied correctly to asymmetric distributions or, more generally, to distributions that depart from elliptical symmetry. We propose a new forward search methodology based on spatial ranks, in which clusters are grown sequentially, one data point at a time, using spatial ranks with respect to the points already in the subsample. The algorithm starts from a randomly chosen initial subsample. We illustrate with simulated data that the proposed algorithm is robust to the choice of initial subsample and that it performs well for different multivariate mixture distributions. We also propose a modified algorithm based on the volume of central rank regions. Our numerical examples show that it produces the best results under elliptic symmetry.
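The abstract does not spell out the algorithm, so the following is only a minimal sketch of a rank-based forward search, not the authors' implementation. It assumes the standard definition of the spatial rank of a point x with respect to a sample as the average of the unit vectors pointing from the sample points to x. The function names `spatial_rank` and `forward_search`, the parameter `m0`, and the rule "add the outside point with the smallest rank norm" are assumptions made for illustration.

```python
import numpy as np


def spatial_rank(x, sample, eps=1e-12):
    """Spatial rank of x with respect to the rows of `sample`:
    the average of the unit vectors pointing from each sample point to x.
    Its Euclidean norm lies in [0, 1]; small values mean x is central."""
    diffs = x - sample                       # shape (m, d)
    norms = np.linalg.norm(diffs, axis=1)    # shape (m,)
    keep = norms > eps                       # points equal to x contribute zero
    units = diffs[keep] / norms[keep, None]
    return units.sum(axis=0) / len(sample)


def forward_search(data, m0=5, seed=None):
    """Hypothetical rank-based forward search (illustrative assumption).

    Starts from a random subsample of size m0 and repeatedly adds the
    outside point whose spatial rank norm, computed with respect to the
    current subsample, is smallest. Returns the order in which points
    entered and the trace of the minimum rank norm at each step."""
    rng = np.random.default_rng(seed)
    n = len(data)
    inside = [int(i) for i in rng.choice(n, size=m0, replace=False)]
    chosen = set(inside)
    outside = [i for i in range(n) if i not in chosen]
    trace = []
    while outside:
        sub = data[inside]
        rank_norms = [np.linalg.norm(spatial_rank(data[i], sub)) for i in outside]
        j = int(np.argmin(rank_norms))       # most central outside point
        trace.append(rank_norms[j])
        inside.append(outside.pop(j))
    return inside, np.array(trace)


if __name__ == "__main__":
    # Two well-separated simulated Gaussian clusters (made-up example data).
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
    order, trace = forward_search(data, m0=5, seed=1)
    print(order[:10])
    print(np.round(trace[:10], 3))
```

Under these assumptions, plotting `trace` against the subset size gives a monitoring plot in the spirit of the forward search: a pronounced rise in the minimum rank norm may indicate that the subsample has exhausted one group and is about to absorb points from another. Because the spatial rank uses only directions, the sketch does not rely on a covariance estimate, which is the motivation the abstract gives for moving away from Mahalanobis distances.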


Keywords: Gaussian Mixture Model; Mahalanobis Distance; Subset Size; Forward Search; Clear Maximum
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



The authors would like to greatly thank the editors of ICORS 2015 and the two referees for their helpful remarks and comments on an earlier version of the manuscript. The research of Mohammed Baragilly is partially supported by the Egyptian Government and he would like to express his greatest appreciation to the Egyptian Cultural Centre and Educational Bureau in London and to the Department of Applied Statistics, Helwan University.


  1. Atkinson AC (1994) Fast very robust methods for the detection of multiple outliers. J Am Stat Assoc 89:1329–1339
  2. Atkinson AC, Mulira H (1993) The stalactite plot for the detection of multivariate outliers. Stat Comput 3:27–35
  3. Atkinson AC, Riani M (2007) Exploratory tools for clustering multivariate data. Comput Stat Data Anal 52:272–285
  4. Atkinson AC, Riani M (2012) Discussion on the paper by Spiegelhalter, Sherlaw-Johnson, Bardsley, Blunt, Wood and Grigg. J Roy Stat Soc 175
  5. Atkinson AC, Riani M, Cerioli A (2004) Exploring multivariate data with the forward search. Springer, New York
  6. Atkinson AC, Riani M, Cerioli A (2006) Random start forward searches with envelopes for detecting clusters in multivariate data. Springer, Berlin, pp 163–171
  7. Atkinson AC, Riani M, Cerioli A (2010) The forward search: theory and data analysis. J Korean Stat Soc 39:117–134
  8. Azzalini A, Bowman A (1990) A look at some data on the Old Faithful geyser. J Roy Stat Soc 39(3):357–365
  9. Banfield J, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821
  10. Barber CB, Dobkin DP, Huhdanpaa H (1996) The quickhull algorithm for convex hulls. ACM Trans Math Softw 22(4):469–483
  11. Beale EML (1969) Euclidean cluster analysis. ISI, Voorburg, Netherlands
  12. Calinski RB, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3:1–27
  13. Chakraborty B (2001) On affine equivariant multivariate quantiles. Ann Inst Stat Math 53:380–403
  14. Chaudhuri P (1996) On a geometric notion of multivariate data. J Am Stat Assoc 90:862–872
  15. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
  16. Everitt B, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley, Chichester
  17. Fraley C, Raftery A (2003) Enhanced model-based clustering, density estimation and discriminant analysis: Mclust. J Classif 20:263–286
  18. Friedman HP, Rubin J (1967) On some invariant criteria for grouping data. J Am Stat Assoc 62:1159–1178
  19. Gan G, Ma C, Wu J (2007) Data clustering theory, algorithms, and applications. ASA-SIAM series on statistics and applied probability. Philadelphia
  20. Gordon AD (1998) Cluster validation. In: C Hayashi KYeae, N Ohsumi (eds) Data science, classification and related methods. Springer, Tokyo, pp 22–39
  21. Hadi AS (1992) Identifying multiple outliers in multivariate data. J Roy Stat Soc 54:761–771
  22. Hadi AS, Simonoff JS (1993) Procedures for the identification of multiple outliers in linear models. J Am Stat Assoc 88(424):1264–1272
  23. Hartigan JA (1975) Clustering algorithms. Wiley, New York
  24. Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York
  25. Koltchinskii V (1997) M-estimation, convexity and quantiles. Ann Stat 25:435–477
  26. Krzanowski WJ, Lai YT (1985) A criterion for determining the number of clusters in a data set. Biometrics 44:23–34
  27. Marriott FHC (1971) Practical problems in a method of cluster analysis. Biometrics 27:501–514
  28. Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50:159–179
  29. Mojena R (1977) Hierarchical grouping methods and stopping rules: an evaluation. Comput J 20:359–363
  30. Overall JE, Magee KN (1992) Replication as a rule for determining the number of clusters in hierarchical cluster analysis. Appl Psychol Measur 16:119–128
  31. Serfling R (2002) A depth function and a scale curve based on spatial quantiles. In: Dodge Y (ed) Statistical data analysis based on the L1-norm and related methods. Birkhaeuser, pp 25–38
  32. Sugar CA, James GM (2003) Finding the number of clusters in a data set: an information theoretic approach. J Am Stat Assoc 98:750–763
  33. Thorndike RL (1953) Who belongs in a family? Psychometrika 18:267–276
  34. Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J Roy Stat Soc 63:411–423
  35. Venables W, Ripley B (2002) Modern applied statistics with S, 4th edn. Springer, New York

Copyright information

© Springer India 2016

Authors and Affiliations

  1. School of Mathematics, University of Birmingham, Birmingham, UK
  2. Department of Applied Statistics, Helwan University, Cairo, Egypt
