Probabilistic Aspects in Classification

  • Conference paper
Data Science, Classification, and Related Methods

Summary

This paper surveys various ways in which probabilistic approaches can be useful in partitional (‘non-hierarchical’) cluster analysis. Four basic distribution models for ‘clustering structures’ are described in order to derive suitable clustering strategies. They are exemplified for various special distribution cases, including dissimilarity data and random similarity relations. A special section describes statistical tests for checking the relevance of a calculated classification (e.g., the max-F test, convex cluster tests) and comparing it to standard clustering situations (comparative assessment of classifications, CAC).
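The idea behind testing the relevance of a calculated classification can be sketched in a few lines. The example below is only an illustration in the spirit of the max-F test named in the summary, not the procedure from the paper. It assumes one-dimensional data, two classes, and a single normal distribution as the homogeneity hypothesis, with the null distribution approximated by Monte Carlo simulation. The function names `max_f_statistic` and `monte_carlo_p_value` are hypothetical.

```python
import random


def max_f_statistic(values):
    """Largest between/within variance ratio over all 2-class splits of 1-D data.

    For one-dimensional data, a partition that is optimal for the variance
    criterion consists of contiguous intervals, so scanning the split points
    of the sorted sample covers all candidates.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 observations")
    total = sum(xs)
    grand_mean = total / n
    total_ss = sum((x - grand_mean) ** 2 for x in xs)
    best = 0.0
    left_sum = 0.0
    for i in range(1, n):  # first class = xs[:i], second class = xs[i:]
        left_sum += xs[i - 1]
        right_sum = total - left_sum
        between = (i * (left_sum / i - grand_mean) ** 2
                   + (n - i) * (right_sum / (n - i) - grand_mean) ** 2)
        within = total_ss - between
        if within <= 0:  # classes are perfectly separated point masses
            return float("inf")
        # Two classes: 1 degree of freedom between, n - 2 within.
        best = max(best, between / (within / (n - 2)))
    return best


def monte_carlo_p_value(values, n_sim=500, seed=0):
    """Approximate p-value of the observed statistic under a normal null model.

    Assumption: homogeneity means 'one normal population'. The statistic is
    invariant to location and scale, so standard normal samples suffice.
    """
    rng = random.Random(seed)
    observed = max_f_statistic(values)
    n = len(values)
    exceed = sum(
        max_f_statistic([rng.gauss(0.0, 1.0) for _ in range(n)]) >= observed
        for _ in range(n_sim)
    )
    return (exceed + 1) / (n_sim + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Two well-separated groups: the homogeneity hypothesis should look implausible.
    sample = ([rng.gauss(0.0, 1.0) for _ in range(20)]
              + [rng.gauss(10.0, 1.0) for _ in range(20)])
    print(monte_carlo_p_value(sample))
```

A small p-value suggests that the best two-class split separates the data more sharply than would be expected from a single normal population. The ordinary F distribution is not appropriate here, because the partition was chosen to maximize the ratio; this is why the sketch simulates the null distribution instead.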

© 1998 Springer Japan

About this paper

Cite this paper

Bock, H.H. (1998). Probabilistic Aspects in Classification. In: Hayashi, C., Yajima, K., Bock, H.H., Ohsumi, N., Tanaka, Y., Baba, Y. (eds) Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Tokyo. https://doi.org/10.1007/978-4-431-65950-1_1

  • DOI: https://doi.org/10.1007/978-4-431-65950-1_1

  • Publisher Name: Springer, Tokyo

  • Print ISBN: 978-4-431-70208-5

  • Online ISBN: 978-4-431-65950-1

  • eBook Packages: Springer Book Archive
