Abstract
Cluster analysis provides methods for subdividing a set of objects into a suitable number of ‘classes’, ‘groups’, or ‘types’ C_1,…,C_m such that each class is as homogeneous as possible and different classes are sufficiently separated. This paper shows how entropy and information measures have been, or can be, used in this framework. We present several probabilistic clustering approaches which are related to, or lead to, information and entropy criteria g(C) for selecting an optimum partition C = (C_1,…,C_m) of n data vectors, for qualitative and for quantitative data, assuming loglinear, logistic, and normal distribution models, together with appropriate iterative clustering algorithms. A new partitioning problem is considered in Section 5, where we look for a dissection (discretization) C of an arbitrary sample space Y (e.g. R^p or {0,1}^p) such that the φ-divergence I_φ(P_0, P_1) between two discretized distributions P_0(C_i), P_1(C_i) (i = 1,…,m) is maximized (e.g., Kullback-Leibler’s discrimination information or the χ² noncentrality parameter). We conclude with some comments on methods for selecting a suitable number of classes, e.g., by using Akaike’s information criterion AIC and its modifications.
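The Section 5 problem can be made concrete with a small sketch: given samples from two distributions P_0 and P_1, a partition C of the sample space induces discretized distributions P_0(C_i), P_1(C_i), and one searches for the partition maximizing their Kullback-Leibler divergence (the special case I_φ with φ(t) = t log t). The sketch below is illustrative only and not from the paper; the function names (`discretize`, `kl_divergence`, `best_single_cut`) and the restriction to a single cut point on the real line are assumptions made here for brevity.

```python
import math

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence sum_i p_i * log(p_i / q_i).

    Assumes q_i > 0 wherever p_i > 0 (every class has positive
    probability under P_1 if it does under P_0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def discretize(sample, cuts):
    """Class frequencies P(C_i) of a 1-d sample over the partition
    of the real line induced by the given cut points."""
    edges = [-math.inf] + sorted(cuts) + [math.inf]
    counts = [0] * (len(edges) - 1)
    for x in sample:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    n = len(sample)
    return [c / n for c in counts]

def best_single_cut(sample0, sample1, candidates):
    """Exhaustive search over candidate cut points for the two-class
    dissection maximizing the KL divergence between the two
    discretized empirical distributions."""
    return max(candidates,
               key=lambda t: kl_divergence(discretize(sample0, [t]),
                                           discretize(sample1, [t])))
```

For example, with `sample0 = [0, 1, 2, 3, 6]`, `sample1 = [1, 4, 5, 6, 7]`, and candidate cuts `[2.5, 3.5]`, the cut at 3.5 yields discretized distributions (0.8, 0.2) versus (0.2, 0.8) and the larger divergence, so it is selected. In the paper the partition ranges over dissections of a general sample space and over general φ-divergences; this restriction to one cut is only to keep the sketch short.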
References
Agresti, A.: Ordinal categorical data. Wiley, New York, 1990.
Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., and Csaki, F. (eds.): Second International Symposium on Information Theory. Akademiai Kiado, Budapest, 1973, 267–281.
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19 (1974) 716–723.
Akaike, H.: On entropy maximization principle. In: Krishnaiah, P.R. (ed.): Applications of statistics. North Holland, Amsterdam, 1977, 27–41.
Akaike, H.: A Bayesian analysis of the minimum AIC procedure. Ann. Inst. Stat. Math. A 30 (1979) 9–14.
Arnold, S.J.: A test for clusters. J. Marketing Research 16 (1979) 545–551.
Benzécri, J.P.: Théorie de l’information et classification d’après un tableau de contingence. In: Benzécri, J.P.: L’Analyse des Données, Vol. 1. Dunod, Paris, 1973, 207–236.
Binder, D.A.: Bayesian cluster analysis. Biometrika 65 (1978) 31–38.
Binder, D.A.: Approximations to Bayesian clustering rules. Biometrika 68 (1981) 275–286.
Bock, H.H.: Statistische Modelle für die einfache und doppelte Klassifikation von normalverteilten Beobachtungen. Dissertation, University of Freiburg, 1968.
Bock, H.H.: The equivalence of two extremal problems and its application to the iterative classification of multivariate data. Written version of a lecture given at the Conference on “Medizinische Statistik”, Forschungsinstitut Oberwolfach, February 23 – March 1, 1969, 10 pp.
Bock, H.H.: Statistische Modelle und Bayes’sche Verfahren zur Bestimmung einer unbekannten Klassifikation normalverteilter zufälliger Vektoren. Metrika 18 (1972) 120–132.
Bock, H.H.: Automatische Klassifikation. Theoretische und praktische Methoden zur Gruppierung und Strukturierung von Daten (Clusteranalyse). Vandenhoeck & Ruprecht, Göttingen, 1974.
Bock, H.H.: On tests concerning the existence of a classification. In: Proc. 1st Symp. Data Analysis and Informatics. Versailles, 1977. Institut de Recherche d’Informatique et d’Automatique (IRIA), Le Cesnay, France, 1977, 449–464.
Bock, H.H.: A clustering algorithm for choosing optimal classes for the chi-square test. Bull. 44th Session of the International Statistical Institute, Madrid, Contributed papers, Vol. 2 (1983) 758–762.
Bock, H.H.: Statistical testing and evaluation methods in cluster analysis. In: Ghosh, J.K. and Roy, J. (eds.): Golden Jubilee Conference in Statistics: Applications and new directions. Calcutta, December 1981. Indian Statistical Institute, Calcutta, 1984, 116–146.
Bock, H.H.: On some significance tests in cluster analysis. J. of Classification 2 (1985) 77–108.
Bock, H.H.: Loglinear models and entropy clustering methods for qualitative data. In: Gaul, W. and Schader, M. (eds.): Classification as a tool of research. Proc. 9th Annual Conference of the Gesellschaft für Klassifikation, Karlsruhe, 1985. North Holland, Amsterdam, 1986, 19–26.
Bock, H.H.: On the interface between cluster analysis, principal component analysis, and multidimensional scaling. In: Bozdogan, H. and Gupta, A.K. (eds.): Multivariate statistical modeling and data analysis. Reidel Publ., Dordrecht, 1987, 17–34.
Bock, H.H.: Probabilistic aspects in cluster analysis. In: O. Opitz (ed.): Conceptual and numerical analysis of data. Springer-Verlag, Heidelberg-Berlin, 1989, 12–44.
Bock, H.H.: A clustering technique for maximizing φ-divergence, noncentrality and discriminating power. In: Schader, M. (ed.): Analyzing and modeling data and knowledge. Proc. 15th Annual Conference of the Gesellschaft für Klassifikation, Salzburg, 1991, Vol. 1. Springer-Verlag, Heidelberg – New York, 1991, 19–36.
Boulton, D.M. and Wallace, C.S.: The information content of a multistate distribution. J. Theoretical Biology 23 (1969) 269–278.
Bozdogan, H.: ICOMP: A new model selection criterion. In: Bock, H.H. (ed.): Classification and related methods of data analysis. Proc. First Conference of the International Federation of Classification Societies, Aachen, 1987. North Holland, Amsterdam, 1988, 599–608.
Bozdogan, H.: On the information-based measure of covariance complexity and its application to the evaluation of multivariate linear models. Comm. Statist., Theory and Methods 19 (1990) 221–278.
Bozdogan, H.: Choosing the number of component clusters in the mixture model using a new informational complexity criterion of the inverse Fisher information matrix. In: O. Opitz, B. Lausen, R. Klar (eds.): Information and classification. Proc. 16th Annual Conference of the Gesellschaft für Klassifikation, Dortmund, April 1992. Springer-Verlag, Heidelberg, 1993 (to appear).
Bozdogan, H. and Gupta, A.K. (eds.): Multivariate statistical modeling and data analysis. Reidel Publ., Dordrecht, 1987.
Bozdogan, H. and Sclove, S.L.: Multi-sample cluster analysis using Akaike’s information criterion. Ann. Inst. Statist. Math. 36 (1984), Part B, 163–180.
Bryant, P.: On characterizing optimization-based clustering methods. J. of Classification 5 (1988) 81–84.
Carman, C.S., Merickel, M.B.: Supervising ISODATA with an information theoretic stopping rule. Pattern Recognition 23 (1990) 185.
Celeux, G.: Classification et modèles. Revue de Statistique Appliquée 36 (1988), no. 4, 43–58.
Celeux, G. and Govaert, G.: Clustering criteria for discrete data and latent class models. J. of Classification 8 (1991) 157–176.
Ciampi, A., Thiffault, J. and Sagman, U.: Évaluation de classifications par le critère d’Akaike et la validation croisée. Revue de Statistique Appliquée 13 (1988), no. 3, 33–50.
Csiszár, I.: Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica 2 (1967) 299–318.
Darroch, J.N., Lauritzen, S.L. and Speed, T.P.: Markov fields and log-linear interaction models for contingency tables. Ann. Statist. 8 (1980) 522–539.
Diday, E. and Schroeder, A.: A new approach in mixed distributions detection. Revue Française d’Automatique, Informatique et Recherche Opérationnelle 10 (1976), no. 6, 75–106.
Diday, E. and Simon, J.C.: Clustering analysis. In: K.S. Fu (ed.): Digital pattern recognition. Springer-Verlag, Berlin, 1976, 47–94.
Diday, E. and Govaert, G.: Classification automatique avec distances adaptatives. Revue Française d’Automatique, Informatique et Recherche Opérationnelle (R.A.I.R.O.), Série Informatique 11 (1977) 329–349.
Diday, E. et al. (eds.): Optimisation en classification automatique I, II. Institut National de Recherche en Informatique et en Automatique, Le Chesnay, 1979.
Eisenblätter, D. and Bozdogan, H.: Two-stage multi-sample cluster analysis as a general approach to discriminant analysis. In: Bozdogan, H., and A.K. Gupta (eds.): Multivariate statistical modeling and data analysis. Reidel Publ., Dordrecht, 1987, 95–119.
Eisenblätter, D. and Bozdogan, H.: Two-stage multi-sample cluster analysis. In: Bock, H.H. (ed.): Classification and related methods of data analysis. Proc. First Conference of the International Federation of Classification Societies, Aachen, 1987. North Holland, Amsterdam, 1988, 91–96.
Engelman, L. and Hartigan, J.A.: Percentage points of a test for clusters. J. Amer. Statist. Assoc. 64 (1969) 1647–1648.
Everitt, B.S.: A Monte Carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions. Multivariate Behavioral Research 16 (1981) 171–180.
Forst, H. T.: On the hierarchical classification of observation units according to comparative characteristics (in German). International Classification 5 (1978) 81–85.
Ghosh, J.K. and Sen, P.K.: On the asymptotic performance of the log-likelihood ratio statistic for the mixture model and related results. In: LeCam, L.M. and R.A. Olshen (eds.): Proc. Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer. Vol. II. Wadsworth, Monterey, California, 1985, 789–806.
Godehardt, E.: Graphs as structural models. The application of graphs and multigraphs in cluster analysis. 2nd ed. Vieweg Verlag, Braunschweig, 1990.
Govaert, G.: Classification avec distances adaptatives. Thèse de 3e cycle. Université Paris VI, 1975.
Govaert, G.: Classification binaire et modèles. Revue Statistique Appliquée 38 (1990), no.1, 67–81.
Haberman, S.J.: Log-linear models for frequency data: sufficient statistics and likelihood equations. Ann. Statist. 1 (1973) 617–632.
Haberman, S.J.: The analysis of frequency data. University of Chicago Press, Chicago, 1974.
Haberman, S.J.: Log-linear models and frequency tables with small expected cell counts. Ann. Statist. 5 (1977) 1148–1169.
Hall, P.: Akaike’s information criterion and Kullback-Leibler loss for histogram density estimation. Theory of Probability and Related Fields 85 (1990) 449–467.
Hartigan, J.A.: Asymptotic distributions for clustering criteria. Ann. Statist. 6 (1978) 117–131.
Haughton, D.M.A.: On the choice of a model to fit data from an exponential family. Ann. Statist. 16 (1988) 342–355.
Haughton, D., Haughton, J. and Izenman, A.J.: Information criteria and harmonic models in time series analysis. J. Statist. Comput. Simul. 35 (1990) 187–207.
Hyvärinen, L.: Classification of qualitative data. Nord. Tidskrift Info. Behandling (BIT) 2 (1962), no. 2, 83–89.
Jacobsen, M.: Existence and unicity of MLEs in discrete exponential family distributions. Scand. J. Statist. 16 (1989) 335–350.
Jain, A.K. and Dubes, R.C.: Algorithms for cluster analysis. Prentice Hall, Englewood Cliffs NJ, 1988.
Jones, L.K. et al.: General entropy criteria for inverse problems, with applications to data compression, pattern classification and cluster analysis. IEEE Trans. Inform. Theory IT-36 (1990) 23–30.
Kashyap, R.L.: Optimal choice of AR and MA parts in autoregressive moving average models. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI 4 (1982) 99–104.
Khouas, S. and Parodi, A.: Towards natural clustering through entropy minimization. In: Diday, E., Lechevallier, Y. (eds.): Symbolic-numeric data analysis and learning. Nova Science Publishers, New York, 1991, 429–442.
Koziol, J.A.: Cluster analysis of antigenic profiles of tumors: Selection of number of clusters using Akaike’s information criterion. Methods of Information in Medicine 29 (1990) 200–204.
Lambert, J.M. and Williams, W.T.: Multivariate methods in plant ecology. VI. Comparison of information analysis and association analysis. J. Ecology 54 (1966) 635–664.
Lance, G.N. and Williams, W.T.: Mixed data classificatory programs. I. Agglomerative systems. Australian Computer J. 1 (1967) 15–20.
Lance, G.N. and Williams, W.T.: Note on a new information statistic classificatory program. Computer J. 11 (1968) 195.
Lauritzen, S.L. and Wermuth, N.: Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Statist. 17 (1989) 31–57.
Lee, K.L.: Multivariate tests for clusters. J. Amer. Statist. Assoc. 74 (1979) 708–714.
Macnaughton-Smith, P.: Some statistical and other numerical techniques for classifying individuals. Home Office Research Unit Report No. 6, H.M.S.O. London, 1965.
McLachlan, G.J.: On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics 36 (1987) 318–324.
McLachlan, G.J. and Basford, K.E.: Mixture models. Inference and applications to clustering. Marcel Dekker, New York — Basel, 1988.
Nishii, R.: Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist. 12 (1984) 758–765.
Orloci, L.: Information analysis in phytosociology: Partition, classification and prediction. J. Theoret. Biology 20 (1968) 271–284.
Orloci, L.: Information theory models for hierarchic and non-hierarchic classification. In: A.J. Cole (ed.): Numerical taxonomy. Academic Press, New York, 1969, 148–165.
Rissanen, J.: Modeling by shortest data description. Automatica 14 (1978) 465–471.
Rissanen, J.: Stochastic complexity and modeling. Ann. Statist. 14 (1986) 1080–1100.
Rousseau, P.: Analyse de données binaires. Ph. D. thesis, Université de Montréal, 1978.
Rousseau, P. and Sankoff, D.: A solution to the problem of grouping speakers. In: Sankoff, D.: Linguistic variation: models and methods. Academic Press, New York, 1978, 97–117.
Schroeder, A.: Analyse d’un mélange de distributions de probabilité de même type. Revue de Statistique Appliquée 24 (1976), no. 1, 39–62.
Schwarz, G.: Estimating the dimension of a model. Ann. Statist. 6 (1978) 461–464.
Silverman, B.W.: Using kernel density estimates to investigate multimodality. J. Roy. Statist. Soc. B 43 (1981) 97–99.
Späth, H.: Cluster dissection and analysis. Theory, FORTRAN programs, examples. Ellis Horwood Ltd./Wiley, Chichester, 1985.
Spruill, M.C.: Cell selection in the Chernoff-Lehmann chi-square statistics. Ann. Statist. 4 (1976) 375–383.
Thode, H.C., Finch, S.J. and Mendell, N.R.: Simulated percentage points for the null distribution of the likelihood ratio test for a mixture of two normals. Biometrics 44 (1988) 1195–1201.
Titterington, D.M., Smith, A.F.M. and Makov, U.E.: Statistical analysis of finite mixture distributions. Wiley, Chichester, 1985.
Vogel, F.: Ein Streuungsmaß für komparative Merkmale. Jahrbücher für Nationalökonomie und Statistik 197 (1982), no. 2, 145–157.
Wallace, C.S. and Boulton, D.M.: An information measure for classification. Computer J. 11 (1968) 185–194.
Williams, W.T. and Dale, M.B.: Fundamental problems in numerical taxonomy. Advances Botanical Research 2 (1965) 35–68.
Williams, W.T., Lambert, J.M. and Lance, G.N.: Multivariate methods in plant ecology V. Similarity analysis and information analysis. J. Ecology 54 (1966) 427–445.
Whittaker, J.: Graphical models in applied multivariate statistics. Wiley, New York, 1989.
Windham, M.P.: Parameter modification for clustering. J. of Classification 4 (1987) 191–214.
© 1994 Springer Science+Business Media Dordrecht
Cite this chapter
Bock, H.H. (1994). Information and Entropy in Cluster Analysis. In: Bozdogan, H., et al. Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach. Springer, Dordrecht. https://doi.org/10.1007/978-94-011-0800-3_4
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-010-4344-1
Online ISBN: 978-94-011-0800-3