A model for correlation within clusters and its use in pollen analysis
Many methods of cluster analysis do not explicitly account for correlation between attributes. In this paper we model any such correlation explicitly using a single factor within each cluster; that is, we assume the correlation of attributes within each cluster is adequately described by a single component axis. However, a factor is not required in every cluster. Using a Minimum Message Length criterion, we can determine the number of clusters and also whether the model of any cluster is improved by introducing a factor. The technique allows us to seek clusters which reflect directional changes rather than imposing a zonation constrained by spatial (and implicitly temporal) position. Minimum message length is a means of applying Ockham’s Razor in inductive analysis. The ‘best’ model is the one that allows the greatest compression of the data, and hence gives the shortest message describing them. Fit to the data is not a sufficient criterion for choosing between models, because more complicated models will almost always fit better. Minimum message length combines fit to the data with an encoding of the model and provides a Bayesian probability criterion for choosing between models (and classes of model). Applying the analysis to a pollen diagram from Southern Chile, we find that the introduction of factors does not improve the overall quality of the mixture model: the solution without an axis in any cluster is the most parsimonious. In the cluster with the strongest case for incorporating a factor, the attributes highly loaded on the axis represent a contrast between herbaceous vegetation and dominant forest types. This contrast is also found when fitting the entire population as a single group, and in that case the factor solution is the preferred model. Overall, the cluster solution without factors is much preferred. Thus, in this case classification is preferred to ordination, although more data are desirable to confirm such a conclusion.
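The trade-off between model cost and fit described above can be illustrated with a minimal sketch. It is not the estimator used in the paper: the model part of the message is approximated here by a BIC-style cost of (k/2) ln n nats for k parameters, which is an assumption standing in for a proper MML encoding, and the single-factor fit uses a plain EM iteration. The function names (`gaussian_nll`, `fit_single_factor`, `message_lengths`) and the synthetic data are hypothetical and chosen only for illustration; they are not the Chilean pollen data.

```python
import numpy as np

def gaussian_nll(Xc, cov):
    """Negative log-likelihood (nats) of centred rows Xc under N(0, cov)."""
    n, d = Xc.shape
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ij,jk,ik->", Xc, np.linalg.inv(cov), Xc)
    return 0.5 * (n * (d * np.log(2 * np.pi) + logdet) + quad)

def fit_single_factor(Xc, iters=200):
    """EM for x = a*z + e with z ~ N(0, 1), e ~ N(0, diag(psi)); returns (a, psi)."""
    n, d = Xc.shape
    S = Xc.T @ Xc / n
    vals, vecs = np.linalg.eigh(S)
    a = vecs[:, -1] * np.sqrt(vals[-1])     # start from the leading principal axis
    psi = np.diag(S).copy()
    for _ in range(iters):
        cov = np.outer(a, a) + np.diag(psi)
        beta = np.linalg.solve(cov, a)      # Sigma^{-1} a
        ez = Xc @ beta                      # E[z | x] for each row
        ezz = 1.0 - beta @ a + ez ** 2      # E[z^2 | x] for each row
        xz = Xc.T @ ez                      # sum over rows of x * E[z | x]
        a = xz / ezz.sum()
        psi = np.maximum(np.diag(S) - a * xz / n, 1e-6)  # floor guards against round-off
    return a, psi

def message_lengths(X):
    """Approximate two-part lengths (nats) for the no-factor and single-factor models."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    # Assumed model cost: (k / 2) * ln(n) nats for k free parameters.
    cost_no_factor = 0.5 * (2 * d) * np.log(n)   # d means + d variances
    cost_factor = 0.5 * (3 * d) * np.log(n)      # plus d factor loadings
    no_factor = cost_no_factor + gaussian_nll(Xc, np.diag(Xc.var(axis=0)))
    a, psi = fit_single_factor(Xc)
    factor = cost_factor + gaussian_nll(Xc, np.outer(a, a) + np.diag(psi))
    return no_factor, factor

# Synthetic illustration: one group with independent attributes, one with a shared factor.
rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 5))
z = rng.normal(size=(500, 1))
correlated = z * np.ones(5) + 0.5 * rng.normal(size=(500, 5))

for name, X in [("independent", independent), ("correlated", correlated)]:
    no_factor, factor = message_lengths(X)
    print(name, "-> shorter message:", "factor" if factor < no_factor else "no factor")
```

In this sketch the factor is retained only when the improvement in fit outweighs the cost of stating the extra loadings, which is the sense in which a factor may be worthwhile in one cluster but not in another.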
Keywords: Clustering; Correlation within clusters; Minimum message length; Pollen analysis
Abbreviations: MDL, Minimum Description Length; MML, Minimal Message Length
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.