Probabilistic Aspects in Classification

Bock, Hans H.

doi:10.1007/978-4-431-65950-1_1

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Summary

This paper surveys various ways in which probabilistic approaches can be useful in partitional (‘non-hierarchical’) cluster analysis. Four basic distribution models for ‘clustering structures’ are described in order to derive suitable clustering strategies. They are exemplified for various special distribution cases, including dissimilarity data and random similarity relations. A special section describes statistical tests for checking the relevance of a calculated classification (e.g., the max-F test, convex cluster tests) and comparing it to standard clustering situations (comparative assessment of classifications, CAC).

Author information

Authors and Affiliations

Institute of Statistics, Technical University of Aachen, Wüllnerstr. 3, D-52056, Aachen, Germany
Hans H. Bock

Editor information

Editors and Affiliations

The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan
Chikio Hayashi, Noboru Ohsumi & Yasumasa Baba
School of Management, Science University of Tokyo, 500 Shimokiyoku, Kuki, Saitama 346, Japan
Keiji Yajima
Institut für Statistik, Rheinisch-Westfälische Technische Hochschule (RWTH), D-52056, Aachen, Germany
Hans-Hermann Bock
Faculty of Environmental Science & Technology, Okayama University, 2-1-1 Tsushima-naka, Okayama 700, Japan
Yutaka Tanaka

About this paper

Cite this paper

Bock, H.H. (1998). Probabilistic Aspects in Classification. In: Hayashi, C., Yajima, K., Bock, HH., Ohsumi, N., Tanaka, Y., Baba, Y. (eds) Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Tokyo. https://doi.org/10.1007/978-4-431-65950-1_1

DOI: https://doi.org/10.1007/978-4-431-65950-1_1
Publisher Name: Springer, Tokyo
Print ISBN: 978-4-431-70208-5
Online ISBN: 978-4-431-65950-1
eBook Packages: Springer Book Archive
