Summary
This paper surveys various ways in which probabilistic approaches can be useful in partitional (‘non-hierarchical’) cluster analysis. Four basic distribution models for ‘clustering structures’ are described in order to derive suitable clustering strategies. They are exemplified for various special distribution cases, including dissimilarity data and random similarity relations. A special section describes statistical tests for checking the relevance of a calculated classification (e.g., the max-F test, convex cluster tests) and comparing it to standard clustering situations (comparative assessment of classifications, CAC).
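As a rough illustration of the kind of test mentioned above, the following sketch computes a max-F statistic for one-dimensional data split into two classes and calibrates it by simulation. It rests on assumptions not taken from the chapter: the null hypothesis is a single normal population, the null distribution is approximated by Monte Carlo rather than by theoretical results, and the function names `max_f_two_clusters` and `monte_carlo_p_value` are hypothetical.

```python
import numpy as np


def max_f_two_clusters(x):
    """Largest F ratio over all splits of 1-D data into two non-empty classes.

    In one dimension a two-class partition minimizing the within-class sum of
    squares is contiguous in sorted order, so scanning the n - 1 split points
    of the sorted sample is exhaustive.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 3:
        raise ValueError("need at least 3 observations")
    total_ss = np.sum((x - x.mean()) ** 2)
    csum = np.cumsum(x)
    k = np.arange(1, n)                      # size of the left class
    left_mean = csum[:-1] / k
    right_mean = (csum[-1] - csum[:-1]) / (n - k)
    # Between-class sum of squares for a split into two classes.
    between = k * (n - k) / n * (left_mean - right_mean) ** 2
    # Guard against a zero within-class sum of squares.
    within = np.maximum(total_ss - between, np.finfo(float).tiny)
    f = between / (within / (n - 2))         # numerator has m - 1 = 1 d.f.
    return float(f.max())


def monte_carlo_p_value(x, n_sim=999, seed=0):
    """Monte Carlo p-value of the max-F statistic.

    Assumption: the homogeneity hypothesis is a single normal distribution.
    The statistic is location and scale invariant, so standard normal
    samples of the same size suffice for the simulation.
    """
    rng = np.random.default_rng(seed)
    observed = max_f_two_clusters(x)
    n = len(x)
    null = np.array(
        [max_f_two_clusters(rng.standard_normal(n)) for _ in range(n_sim)]
    )
    return float((1 + np.sum(null >= observed)) / (n_sim + 1))
```

A small p-value suggests that the best two-class partition separates the data more sharply than would be expected from a homogeneous sample. The ordinary F distribution is not appropriate here, because the partition is chosen to maximize the ratio.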
© 1998 Springer Japan
About this paper
Cite this paper
Bock, H.H. (1998). Probabilistic Aspects in Classification. In: Hayashi, C., Yajima, K., Bock, HH., Ohsumi, N., Tanaka, Y., Baba, Y. (eds) Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Tokyo. https://doi.org/10.1007/978-4-431-65950-1_1
DOI: https://doi.org/10.1007/978-4-431-65950-1_1
Publisher Name: Springer, Tokyo
Print ISBN: 978-4-431-70208-5
Online ISBN: 978-4-431-65950-1
eBook Packages: Springer Book Archive