Null Models in Cluster Validation

  • A. D. Gordon
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Summary

A brief overview is given of the problem of validation in classification studies. Attention is concentrated on the specification of appropriate null models for data, with respect to which one may assess some cluster structure that has been obtained as the output of a clustering algorithm. In addition to standard null models, a discussion is given of ‘data-influenced’ null models, in which the precise form of the null hypothesis is influenced by characteristics of the data set under investigation. To illustrate the importance of specifying relevant null models, the behaviour of U-statistics under these null models is used to assess individual clusters found when data were classified using some standard clustering criteria implemented in an agglomerative algorithm.
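The general idea of assessing a cluster against a null model can be illustrated with a small Monte Carlo sketch. It is not the chapter's procedure: the summary does not specify the form of the U-statistic, the clustering criteria, or how the null distribution is obtained, so everything below is an assumption made for illustration. The sketch assumes a Mann-Whitney-style U that counts how often a within-cluster distance exceeds a distance between a member and a non-member, single linkage as the agglomerative algorithm, and a uniform distribution on the unit square as a stand-in for a Poisson null model. The function names `u_statistic`, `single_linkage_clusters`, and `monte_carlo_p_value` are hypothetical.

```python
import random
from itertools import combinations
from math import dist


def u_statistic(points, members):
    """Mann-Whitney-style count (assumed form): number of (within, between)
    distance pairs in which the within-cluster distance is the larger one.
    Ties count one half. Small values suggest a compact, isolated cluster."""
    members = set(members)
    inside = [i for i in range(len(points)) if i in members]
    outside = [i for i in range(len(points)) if i not in members]
    within = [dist(points[i], points[j]) for i, j in combinations(inside, 2)]
    between = [dist(points[i], points[j]) for i in inside for j in outside]
    return sum((w > b) + 0.5 * (w == b) for w in within for b in between)


def single_linkage_clusters(points):
    """Naive single-linkage agglomeration. Returns every cluster formed by a
    merge, as a frozenset of point indices, in order of formation."""
    def gap(pair):
        a, b = pair
        return min(dist(points[i], points[j]) for i in a for j in b)

    clusters = [frozenset([i]) for i in range(len(points))]
    formed = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        formed.append(merged)
    return formed


def monte_carlo_p_value(points, members, n_sim=99, seed=0):
    """Compare the observed U with the smallest U among same-sized clusters
    found in data simulated under the null model (uniform on the unit square,
    an assumption). Simulations with no cluster of that size are skipped,
    which is a simplification."""
    rng = random.Random(seed)
    n, k = len(points), len(members)
    observed = u_statistic(points, members)
    null_values = []
    for _ in range(n_sim):
        sim = [(rng.random(), rng.random()) for _ in range(n)]
        same_size = [c for c in single_linkage_clusters(sim) if len(c) == k]
        if same_size:
            null_values.append(min(u_statistic(sim, c) for c in same_size))
    rank = sum(u <= observed for u in null_values)
    return observed, (rank + 1) / (len(null_values) + 1)


# Illustrative data (invented): five tightly grouped points plus a coarse grid.
tight = [(0.50, 0.50), (0.51, 0.50), (0.50, 0.51), (0.51, 0.51), (0.505, 0.505)]
grid = [(x, y) for x in (0.1, 0.3, 0.7, 0.9) for y in (0.1, 0.3, 0.7, 0.9)]
data = tight + grid

candidate = frozenset(range(5))  # the tight group, as found by single linkage
observed_u, p_value = monte_carlo_p_value(data, candidate)
print(f"U = {observed_u}, Monte Carlo p-value = {p_value:.3f}")
```

A 'data-influenced' null model in the sense of the summary would replace the fixed unit square with a region or distribution chosen from the data themselves, for example by changing only the line that generates `sim`.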

Keywords

  • Null Model
  • Poisson Model
  • Multivariate Normal Distribution
  • Hierarchical Classification
  • Cluster Validation

Copyright information

© Springer-Verlag Berlin · Heidelberg 1996

Authors and Affiliations

  • A. D. Gordon
    Mathematical Institute, University of St Andrews, St Andrews, Scotland