Abstract
The flexible Dirichlet (FD) distribution (Ongaro and Migliorati in J. Multivar. Anal. 114: 412–426, 2013) preserves many theoretical properties of the Dirichlet distribution without inheriting its lack of flexibility in modeling the various independence concepts appropriate for compositional data, i.e. data representing vectors of proportions. In this paper we explore the potential of the FD from an inferential and applicative viewpoint. In this regard, the key feature appears to be the special structure defining its Dirichlet mixture representation. This structure determines a simple and clearly interpretable differentiation among mixture components which can capture the main features of a large variety of data sets. Furthermore, it allows substantially greater flexibility than the Dirichlet, including both unimodality and a varying number of modes. Very importantly, this increased flexibility is obtained without sharing many of the inferential difficulties typical of general mixtures. Indeed, the FD displays the identifiability and likelihood behavior proper to common (non-mixture) models. Moreover, thanks to a novel non-random initialization based on the special FD mixture structure, an efficient and sound estimation procedure can be devised which suitably combines EM-type algorithms. Reliable complete-data likelihood-based estimators for standard errors can be provided as well.
References
Aitchison, J.: The Statistical Analysis of Compositional Data. Chapman & Hall, London (2003)
Azzalini, A., Menardi, G., Rosolin, T.: R package pdfCluster: cluster analysis via nonparametric density estimation (version 1.0-0). Università di Padova, Italia (2012). http://cran.r-project.org/web/packages/pdfCluster/index.html
Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)
Barndorff-Nielsen, O., Jørgensen, B.: Some parametric models on the simplex. J. Multivar. Anal. 39(1), 106–116 (1991)
Biernacki, C., Celeux, G., Govaert, G.: Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput. Stat. Data Anal. 41, 561–575 (2003)
Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic versions. Comput. Stat. Data Anal. 14, 315–332 (1992)
Celeux, G., Chauveau, D., Diebolt, J.: Stochastic versions of the EM algorithm: an experimental study in the mixture case. J. Stat. Comput. Simul. 55, 287–314 (1996)
Connor, R.J., Mosimann, J.E.: Concepts of independence for proportions with a generalization of the Dirichlet distribution. J. Am. Stat. Assoc. 64(325), 194–206 (1969)
Coxeter, H.: Regular Polytopes. Dover Publications, New York (1973)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser B 39(1), 1–38 (1977)
Diebolt, J., Ip, E.: Stochastic EM: method and application. In: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice, pp. 259–273. Chapman & Hall, London (1996)
Efron, B.: Missing data, imputation, and the bootstrap. J. Am. Stat. Assoc. 89(426), 463–475 (1994)
Favaro, S., Hadjicharalambous, G., Prünster, I.: On a class of distributions on the simplex. J. Stat. Plan. Inference 141, 2987–3004 (2011)
Feng, Z., McCulloch, C.: Using bootstrap likelihood ratio in finite mixture models. J. R. Stat. Soc. B 58, 609–617 (1996)
Forina, M., Armanino, C., Lanteri, S., Tiscornia, E.: Classification of olive oils from their fatty acid composition. In: Martens, Russwurm (eds.) Food Research and Data Analysis. Dip. Chimica e Tecnologie Farmaceutiche ed Alimentari, University of Genova (1983)
Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, New York (2006)
Gupta, R.D., Richards, D.S.P.: Multivariate Liouville distributions. J. Multivar. Anal. 23, 233–256 (1987)
Gupta, R.D., Richards, D.S.P.: Multivariate Liouville distributions, II. Probab. Math. Stat. 12, 291–309 (1991)
Gupta, R.D., Richards, D.S.P.: Multivariate Liouville distributions, III. J. Multivar. Anal. 43, 29–57 (1992)
Gupta, R.D., Richards, D.S.P.: Multivariate Liouville distributions, IV. J. Multivar. Anal. 54, 1–17 (1995)
Gupta, R.D., Richards, D.S.P.: Multivariate Liouville distributions, V. In: Johnson, N.L., Balakrishnan, N. (eds.) Advances in the Theory and Practice of Statistics: A Volume in Honour of Samuel Kotz, pp. 377–396. Wiley, New York (1997)
Gupta, R.D., Richards, D.S.P.: The covariance structure of the multivariate Liouville distributions. Contemp. Math. 287, 125–138 (2001a)
Gupta, R.D., Richards, D.S.P.: The history of the Dirichlet and Liouville distributions. Int. Stat. Rev. 69(3), 433–446 (2001b)
Hathaway, R.J.: A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann. Stat. 13(2), 795–800 (1985)
Kiefer, J., Wolfowitz, J.: Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Stat. 27(4), 887–906 (1956)
Lehmann, E., Casella, G.: Theory of Point Estimation. Springer, New York (1998)
Louis, T.A.: Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B 44(2), 226–233 (1982)
McLachlan, G., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
Meilijson, I.: A fast improvement to the EM algorithm on its own terms. J. R. Stat. Soc. Ser. B 51(1), 127–138 (1989)
Meng, X.L., Rubin, D.B.: Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. J. Am. Stat. Assoc. 86(416), 899–909 (1991)
O’Hagan, A., Murphy, T.B., Gormley, I.C.: Computational aspects of fitting mixture models via the expectation-maximization algorithm. Comput. Stat. Data Anal. 56(12), 3843–3864 (2012)
Ongaro, A., Migliorati, S.: A generalization of the Dirichlet distribution. J. Multivar. Anal. 114, 412–426 (2013)
Palarea-Albaladejo, J., Martín-Fernández, J., Soto, J.: Dealing with distances and transformations for fuzzy c-means clustering of compositional data. J. Classif. 29, 144–169 (2012)
Pawlowsky-Glahn, V., Egozcue, J., Tolosana-Delgado, R.: Modeling and Analysis of Compositional Data. Wiley, New York (2015)
Peters, B.C., Walker, H.F.: An iterative procedure for obtaining maximum-likelihood estimates of the parameters for a mixture of normal distributions. SIAM J. Appl. Math. 35(2), 362–378 (1978)
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2015). http://www.R-project.org/
Rayens, W.S., Srinivasan, C.: Dependence properties of generalized Liouville distributions on the simplex. J. Am. Stat. Assoc. 89(428), 1465–1470 (1994)
Redner, R.: Note on the consistency of the maximum likelihood estimate for non-identifiable distributions. Ann. Stat. 9, 225–228 (1981)
Rothenberg, T.: Identification in parametric models. Econometrica 39(3), 577–591 (1971)
Smith, B., Rayens, W.: Conditional generalized Liouville distributions on the simplex. Statistics 36(2), 185–194 (2002)
Wald, A.: Note on the consistency of the maximum likelihood estimate. Ann. Math. Stat. 20, 595–601 (1949)
Acknowledgments
We are grateful to the referees and to the editor for their constructive comments, which helped to improve the paper. Research partially supported by the Italian Ministry of University and Research and by Grants F.A. 2014 from the University of Milano-Bicocca.
Appendix
1.1 Appendix 1: Vertexes of \(\textit{RSP}^{D}\) (see Sect. 3.1)
Having chosen as one-dimensional simplex \(\textit{RSP}^{1}\) the line segment with vertexes \(\mathbf v_0=0\) and \(\mathbf v_1=1\), let us recursively determine the vertexes of \(\textit{RSP}^{n}\) (\(n\le D-1\)). Suppose we know the vertexes of \(\textit{RSP}^{n-1}\), that is \(\mathbf v_{0}^{n-1},\mathbf v_{1}^{n-1},\ldots ,\mathbf v_{n-1}^{n-1}\in {\mathbb {R}} ^{n-1}\), \(n\ge 2\). Then \(\textit{RSP}^{n}\) can be obtained by adding to the n vertexes of \(\textit{RSP}^{n-1}\) a new vertex \(\mathbf v_n\) at distance 1 from all old vertexes. Therefore, the first n vertexes \(\mathbf v_{0}^{n},\mathbf v_{1}^{n},\ldots ,\mathbf v_{n-1}^{n}\) \(\in {\mathbb {R}} ^{n}\) of \(\textit{RSP}^{n}\) are obtained by appending a further coordinate equal to 0 to the \(\textit{RSP}^{n-1}\) vertexes (geometrically, the first n vertexes of \(\textit{RSP}^{n}\) coincide with the vertexes of \(\textit{RSP}^{n-1}\)).
The last vertex \(\mathbf v_{n}^{n}\), having the same distance from all previous vertexes, has the first \(n-1\) coordinates equal to the barycenter \(\mathbf B^{n-1}\) of \(\textit{RSP}^{n-1}\). The last coordinate is then obtained by imposing that the distance between \(\mathbf v_{n}^{n}\) and one of the previous vertexes is one. In particular, we can choose \(\left\| \mathbf v_{n}^{n}- \mathbf v_{0}^{n}\right\| =\left\| \mathbf v_{n}^{n}\right\| =1\).
It is left to determine the coordinates of the generic barycenter \(\mathbf B^{n}\). Again we proceed recursively, starting from \(\mathbf B^{1}=1/2\). As \(\mathbf B^{n}\) has the same distance from all vertexes of \(\textit{RSP}^{n}\), its first \(n-1\) coordinates must be equal to \(\mathbf B^{n-1}\). The last coordinate can be obtained by imposing that \(\left\| \mathbf B^{n}-\mathbf v_{n}^{n}\right\| \) is equal to the distance between \(\mathbf B^{n}\) and one of the first n vertexes of \(\textit{RSP}^{n}\). For example, we can set \(\left\| \mathbf B^{n}-\mathbf v_{n}^{n}\right\| =\left\| \mathbf B^{n}-\mathbf v_{0}^{n}\right\| =\left\| \mathbf B^{n}\right\| \). As the first \(n-1\) coordinates of \(\mathbf v_{n}^{n}\) are equal to \(\mathbf B^{n-1}\) and \(\left\| \mathbf v_{n}^{n}\right\| =1\), we can write

\[ \left\| \mathbf B^{n}-\mathbf v_{n}^{n}\right\| ^2=\left( \sqrt{1-\left\| \mathbf B^{n-1}\right\| ^2}-B_n^n\right) ^2 =\left\| \mathbf B^{n}\right\| ^2=\left\| \mathbf B^{n-1}\right\| ^2+\left( B_n^n\right) ^2 \qquad (16) \]

where \(B_n^n\) is the last coordinate of \(\mathbf B^{n}\). One can then find \(B_n^n\) as a function of \(\left\| \mathbf B^{n-1}\right\| \) by using the second equality in (16). By plugging this expression in the last equality of (16), after some manipulation one arrives at the following recursive relation:

\[ \left\| \mathbf B^{n}\right\| ^2=\left\| \mathbf B^{n-1}\right\| ^2+\frac{\left( 1-2\left\| \mathbf B^{n-1}\right\| ^2\right) ^2}{4\left( 1-\left\| \mathbf B^{n-1}\right\| ^2\right) }. \]

This recursive equation, together with the initial value \(\left\| \mathbf B^{1}\right\| ^2=1/4\), admits the explicit solution given by \(\left\| \mathbf B^{n}\right\| ^2=n/[2(1+n)]\). By (16) one then has

\[ B_n^n=\frac{1-2\left\| \mathbf B^{n-1}\right\| ^2}{2\sqrt{1-\left\| \mathbf B^{n-1}\right\| ^2}}=\frac{1}{\sqrt{2n(1+n)}}, \]

which coincides with the quantity \(a_i\) with \(i=n\) defined in Sect. 3.1. As the first \(n-1\) coordinates of \(\mathbf B^{n}\) are equal to \(\mathbf B^{n-1}\), by induction one explicitly determines \(\mathbf B^{n}\): its i-th coordinate is \(B_i^n=\left[ 2 i (1+i)\right] ^{-1/2}=a_i\), \(i=1,\ldots ,n\), \(n\ge 1\). By recursively applying the above procedure for deriving the vertexes \(\mathbf v_{0}^{n},\mathbf v_{1}^{n},\ldots ,\mathbf v_{n}^{n}\) of \(\textit{RSP}^{n}\) from the vertexes \(\mathbf v_{0}^{n-1},\mathbf v_{1}^{n-1},\ldots ,\mathbf v_{n-1}^{n-1}\) of \(\textit{RSP}^{n-1}\) one obtains the result.
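The recursive construction above is easy to check numerically. The following minimal sketch (the function name is ours, not from the paper) builds the vertexes of \(\textit{RSP}^{n}\) exactly as described: old vertexes get a trailing zero coordinate, the new vertex sits above the old barycenter, and the barycenter coordinates are \(a_i=[2i(1+i)]^{-1/2}\).

```python
import math

def rsp_vertexes(n):
    """Build the n+1 vertexes of the regular unit-edge simplex RSP^n in R^n,
    following the recursive construction above (v0 = 0, v1 = 1 for RSP^1).
    Returns (vertexes, barycenter)."""
    verts = [[0.0], [1.0]]      # RSP^1
    bary = [0.5]                # B^1
    for m in range(2, n + 1):
        # embed old vertexes with a trailing 0 coordinate
        verts = [v + [0.0] for v in verts]
        # new vertex: first m-1 coordinates equal B^{m-1},
        # last coordinate fixed by ||v_m^m|| = 1
        verts.append(bary + [math.sqrt(1.0 - sum(b * b for b in bary))])
        # barycenter gains the coordinate a_m = [2m(1+m)]^{-1/2}
        bary = bary + [1.0 / math.sqrt(2 * m * (m + 1))]
    return verts, bary
```

For \(n=3\) this reproduces the regular tetrahedron with edge length 1, and the returned barycenter coincides with the mean of the four vertexes.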
1.2 Appendix 2: Lemma 1—Rate of divergence of Dirichlet log-likelihood
Let \(\mathbf x_1,\ldots ,\mathbf x_n\) be an i.i.d. sample from a Dirichlet \(\mathcal{D}(\mathbf x;\varvec{\alpha })\). Then the log-likelihood diverges to \(-\infty \) when at least one of the \(\alpha _i\)’s goes to zero.
Moreover, suppose \(n\ge 2\). Then a.s. the log-likelihood:

1. is bounded from above;
2. diverges to \(-\infty \) when at least one of the \(\alpha _i\)’s diverges to \(+\infty \), with a rate of divergence not smaller than \(n\, \alpha ^+ \log \sum _{i=1}^D x_i^{(0)}\).

Here \(x_i^{(0)}=\prod _{j=1}^n x_{ji}^{1/n}\) denotes the geometric mean of the i-th elements of the observations.
Suppose now \(n=1\). Then the log-likelihood:

1. is unbounded (from above and from below);
2. can diverge to \(+\infty \) only if all the \(\alpha _i\)’s diverge to \(+\infty \), in which case the rate of divergence is not larger than \(\frac{1}{2}(D-1) \log \alpha ^+\);
3. diverges to \(-\infty \) if at least one of the \(\alpha _i\)’s diverges to \(+\infty \) and at least one does not.
Proof
The Dirichlet log-likelihood \(l(\varvec{\alpha })\) can be written as

\[ l(\varvec{\alpha })=n\left[ \log \varGamma (\alpha ^+)-\sum _{i=1}^D\log \varGamma (\alpha _i)+\sum _{i=1}^D(\alpha _i-1)\log x_i^{(0)}\right] . \]
As \(\log \varGamma (y)\approx -\log y\) as \(y\rightarrow 0\), \(l(\varvec{\alpha })\rightarrow -\infty \) when one or more of the \(\alpha _i\)’s go to 0 and the others are fixed. Because \(l(\varvec{\alpha })\) is a regular function, this implies that it is bounded on the set \(\{\alpha ^+\in (0,k]\}\) for any \(k>0\) and, therefore, it may diverge only if \(\alpha ^+\rightarrow +\infty \). Thus, suppose that M (\(1\le M\le D\)) of the \(\alpha _i\)’s go to \(+\infty \), say \(\alpha _1,\ldots ,\alpha _M\) without loss of generality, denote by \(\alpha _1^+\) their sum and set \(\alpha _2^+=\alpha ^+-\alpha _1^+\). Then, the following approximation holds

\[ l(\varvec{\alpha })=n\left[ \alpha _1^+\sum _{i=1}^M\eta _i\log \frac{x_i^{(0)}}{\eta _i}+\left( \alpha _2^++\frac{M-1}{2}\right) \log \alpha _1^+ +\frac{1}{2}\sum _{i=1}^M\log \eta _i\right] +O(1) \qquad (17) \]

where \(\eta _i=\alpha _i/\alpha _1^+\). Formula (17) can be obtained by means of a careful expansion of the terms of \(l(\varvec{\alpha })\) based on the following two approximations, valid as \(y\rightarrow \infty \):

\[ \log \varGamma (y)=\left( y-\frac{1}{2}\right) \log y-y+\frac{1}{2}\log (2\pi )+o(1), \qquad \log \varGamma (y+a)-\log \varGamma (y)=a\log y+o(1), \]

the latter holding for fixed positive a.
The relation between (weighted) geometric and arithmetic means implies that

\[ \sum _{i=1}^M\eta _i\log \frac{x_i^{(0)}}{\eta _i}=\log \prod _{i=1}^M\left( \frac{x_i^{(0)}}{\eta _i}\right) ^{\eta _i}\le \log \sum _{i=1}^M x_i^{(0)} \]

and therefore:

\[ l(\varvec{\alpha })\le n\left[ \alpha _1^+\log \sum _{i=1}^M x_i^{(0)}+\left( \alpha _2^++\frac{M-1}{2}\right) \log \alpha _1^+\right] +O(1). \]
Now, suppose \(n\ge 2\). Then all elements of the observations \(\mathbf x_1,\ldots ,\mathbf x_n\) are a.s. distinct. Thus \(\sum _{i=1}^M x_i^{(0)}<\sum _{i=1}^M \bar{x}_i\le 1\). It follows that \(l(\varvec{\alpha })\) goes to \(-\infty \) when at least one \(\alpha _i\rightarrow +\infty \), with a rate of divergence not smaller than \(n\, \alpha ^+\log \sum _{i=1}^D x_i^{(0)}\). It also follows that \(l(\varvec{\alpha })\) is a.s. bounded.
Suppose now that \(n=1\), so that \(x_i^{(0)}=\bar{x}_i=x_{1i}\) (\(i=1,\ldots ,D\)). If \(1\le M<D\), then \(\sum _{i=1}^M x_i^{(0)}=\sum _{i=1}^M x_{1i}<1\), and therefore the log-likelihood decreases to \(-\infty \). If \(M=D\), then \(\sum _{i=1}^D\eta _i\log \frac{x_{1i}}{\eta _i}\) achieves its maximum, equal to zero, at \(\eta _i=x_{1i}\) (\(i=1,\ldots ,D\)). Thus, if \(M=D\), the behavior of \(l(\varvec{\alpha })\) is determined by the term \((\sum _{i=1}^D\log \alpha _i-\log \alpha ^+)/2\). The latter can be shown to be smaller than or equal to \([(D-1)\log \alpha ^+-D\log D]/2\), again by using the relation between geometric and arithmetic means. Hence, the rate of divergence of \(l(\varvec{\alpha })\) is not larger than \(\frac{1}{2}(D-1)\log \alpha ^+\). This rate is exactly achieved when \(\eta _i=x_{1i}\) (\(i=1,\ldots ,D\)), which also shows that \(l(\varvec{\alpha })\) is indeed unbounded.
Note that the above arguments also imply that the log-likelihood diverges to \(-\infty \) when at least one of the \(\alpha _i\)’s goes to zero for any n even if \(M<D\) of the other \(\alpha _i\)’s diverge to \(+\infty \). \(\square \)
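The divergence behavior established in Lemma 1 can be illustrated numerically. The sketch below (helper name ours; it assumes only the standard Dirichlet density) evaluates the Dirichlet log-likelihood via `math.lgamma` for a sample of size \(n=2\): scaling all the \(\alpha_i\)'s up drives the log-likelihood to \(-\infty\), and so does sending a single \(\alpha_i\) to zero.

```python
import math

def dirichlet_loglik(alpha, sample):
    """Dirichlet log-likelihood of an i.i.d. sample of D-part compositions."""
    n = len(sample)
    ll = n * (math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha))
    for x in sample:
        ll += sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))
    return ll

sample = [[0.2, 0.3, 0.5], [0.4, 0.4, 0.2]]   # n = 2, D = 3
base = dirichlet_loglik([2.0, 3.0, 4.0], sample)

# scaling all alphas up drives the log-likelihood down without bound (n >= 2)
scaled = [dirichlet_loglik([t * 2.0, t * 3.0, t * 4.0], sample)
          for t in (10, 100, 1000)]

# sending one alpha to zero also drives it to -infinity
tiny = dirichlet_loglik([1e-8, 3.0, 4.0], sample)
```

Here the leading term of the decay is governed by \(n\,\alpha ^+\log \sum _i x_i^{(0)}\), which is negative because the geometric means sum to less than 1 for distinct observations.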
1.3 Appendix 3: Proof of Proposition 3
To prove a.s. boundedness of the log-likelihood \(l(\varvec{\theta })\) and the existence of a maximum we shall use the following upper bound:

\[ l(\varvec{\theta })\le l_U(\varvec{\alpha },\tau )=\sum _{j=1}^n\log \mathcal{D}\left( \mathbf {x}_j;\varvec{\alpha }+\tau \mathbf e_{I_j}\right) ,\qquad I_j=\mathop {\mathrm{argmax}}_{i\in \{1,\ldots ,D\}}\ \mathcal{D}\left( \mathbf {x}_j;\varvec{\alpha }+\tau \mathbf e_{i}\right) , \qquad (18) \]

where \(\mathbf e_i\) denotes the i-th canonical basis vector and \(I_j\in \{1,\ldots ,D\}\) can be interpreted as the cluster to which observation \(\mathbf {x}_j\) has been assigned. The bound follows from the FD mixture representation, since \(\sum _i p_i\,\mathcal{D}(\mathbf x_j;\varvec{\alpha }+\tau \mathbf e_i)\le \max _i\mathcal{D}(\mathbf x_j;\varvec{\alpha }+\tau \mathbf e_i)\).
For ease of exposition, let us show boundedness first. By formula (18), we have:

\[ l(\varvec{\theta })\le l_U(\varvec{\alpha },\tau )\le \max _{(I_1,\ldots ,I_n)\in \{1,\ldots ,D\}^n}\ \sum _{j=1}^n\log \mathcal{D}\left( \mathbf {x}_j;\varvec{\alpha }+\tau \mathbf e_{I_j}\right) . \]
Therefore, it is enough to show that, for any given allocation of the observations to the D Dirichlet clusters (i.e. for any \(I_j, j=1,\ldots ,n\)), the corresponding log-likelihood is bounded from above. Indeed, such a log-likelihood coincides with the classified log-likelihood given by (14) and can be viewed as the sum of D Dirichlet log-likelihoods:

\[ \sum _{i=1}^D\sum _{j\in A_i}\log \mathcal{D}\left( \mathbf {x}_j;\varvec{\alpha }+\tau \mathbf e_i\right) , \]

where \(A_i=\{j:z_j=i\}\) identifies the observations assigned to cluster i. By Lemma 1, a necessary condition for any of the D Dirichlet log-likelihoods to be unbounded is that there exists at least one cluster with only one observation and all \(\alpha _i\)’s diverge to \(+\infty \) except at most one (in which case \(\tau \) goes to \(+\infty \) as well). In this case the sum of the log-likelihoods of the clusters (at most \(D-1\)) with only one observation diverges with a rate not larger than \(c_1\log (\alpha ^++\tau )\), where \(c_1\) is a positive constant. On the other hand, there must exist at least one cluster with two or more observations. Therefore, when some \(\alpha _i\)’s go to \(+\infty \), by Lemma 1 the corresponding log-likelihood tends a.s. to \(-\infty \) with a rate not smaller than \(-c_2(\alpha ^++\tau )\), where \(c_2\) is a positive constant. Thus, in this case, the classified likelihood diverges to \(-\infty \), which implies that \(l(\varvec{\theta })\) is a.s. bounded.
Let us now prove existence of a maximum. As the log-likelihood \(l(\varvec{\theta })\) is a regular and differentiable function, the existence of a global maximum can be proved by showing that the supremum is not reached at the boundary of the parameter space. More precisely, consider the frontier of \(\varvec{\varTheta }\) defined as the set of boundary points which are not actually in \(\varvec{\varTheta }\). We shall show that, when \(\varvec{\theta }\) tends to the frontier (i.e. \(\alpha _i\rightarrow 0\) or \(\alpha _i\rightarrow \infty \), \(p_i\rightarrow 1\) \(i=1,\ldots ,D\), \(\tau \rightarrow 0\) or \(\tau \rightarrow \infty \)), then the log-likelihood tends either to \(-\infty \) or to values not larger than the log-likelihood (based on the whole sample) of the Dirichlet distribution. As this distribution corresponds to interior points of the parameter space of the FD, such limiting values are dominated by \(l(\varvec{\theta })\) computed at those interior points. They can therefore be discarded.
To obtain the above limits, we shall study the upper bound \(l_U(\varvec{\alpha } , \tau )\) of \(l(\varvec{\theta })\) given in (18).
Suppose first that at least one of the \(\alpha _i\)’s goes to \(+\infty \), irrespective of the behavior of the other parameters. For any given allocation of the observations to clusters (i.e. for any given \(I_1,\ldots ,I_n\)), there must exist at least one cluster with two or more observations. By Lemma 1, the corresponding Dirichlet log-likelihood tends to \(-\infty \) with a rate dominating the log-likelihood of possible one-observation clusters. Thus, for any given allocation, the classified log-likelihood tends to \(-\infty \) and so does the upper bound \(l_U(\varvec{\alpha } , \tau )\). An analogous argument shows that \(l(\varvec{\theta })\) tends to \(-\infty \) even when \(\tau \) tends to \(+\infty \).
Suppose now that two or more \(\alpha _i\)’s go to zero. Then, whatever the allocation of the observations, in all Dirichlet cluster log-likelihoods there exists one parameter going to zero. Therefore, each Dirichlet log-likelihood goes to \(-\infty \), implying that \( l_U(\varvec{\alpha } , \tau )\) diverges as well.
Consider, instead, the case of a single \(\alpha \), say \(\alpha _1\), going to zero. Then, for all allocations with at least one observation not assigned to the first cluster, the corresponding classified log-likelihood tends to \(-\infty \). This is because the term corresponding to the first cluster tends to a finite value while all the others tend to \(-\infty \). On the other hand, if all observations are assigned to the first cluster, then the classified log-likelihood tends to a Dirichlet log-likelihood, computed on the whole sample, with parameter \((\tau , \alpha _2, \ldots , \alpha _D)\). It follows that \( l_U(\varvec{\alpha } , \tau )\) tends to the same limit as well. The latter limit is dominated by \(l(\varvec{\theta })\) computed at an interior point of \(\varvec{\varTheta }\).
Finally, if \(p_i\rightarrow 1\), \(i=1,\ldots ,D\), or \(\tau \rightarrow 0\), then it is straightforward to see that \(l(\varvec{\theta })\) converges to a Dirichlet log-likelihood and, again, it is dominated by the value of \(l(\varvec{\theta })\) at an interior point.
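As a numerical check of the argument, one can compare the FD log-likelihood, written through its Dirichlet mixture representation with component parameters \(\varvec{\alpha }+\tau \mathbf e_i\), against the classified upper bound obtained by assigning each observation to its best-fitting component. The helper names below are ours; the inequality \(l(\varvec{\theta })\le l_U(\varvec{\alpha },\tau )\) holds because \(\sum _i p_i f_i\le \max _i f_i\) whenever the \(p_i\) sum to one.

```python
import math

def dirichlet_logpdf(x, alpha):
    """Log-density of a Dirichlet with parameter vector alpha at composition x."""
    return (math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
            + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x)))

def component_logpdfs(x, alpha, tau):
    # log-densities of the D Dirichlet mixture components alpha + tau * e_i
    D = len(alpha)
    return [dirichlet_logpdf(x, [a + (tau if k == i else 0.0)
                                 for k, a in enumerate(alpha)])
            for i in range(D)]

def fd_loglik(sample, alpha, p, tau):
    """FD log-likelihood via the mixture representation (log-sum-exp for stability)."""
    ll = 0.0
    for x in sample:
        comps = component_logpdfs(x, alpha, tau)
        m = max(comps)
        ll += m + math.log(sum(pi * math.exp(c - m) for pi, c in zip(p, comps)))
    return ll

def fd_upper_bound(sample, alpha, tau):
    """Classified upper bound l_U: each observation goes to its best-fitting cluster."""
    return sum(max(component_logpdfs(x, alpha, tau)) for x in sample)
```

Note that the upper bound no longer depends on the mixing weights \(p_i\), which is what allows the proof to reduce boundedness of \(l(\varvec{\theta })\) to boundedness of classified Dirichlet log-likelihoods.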
1.4 Appendix 4: Score statistic and information matrix of the complete-data likelihood
The elements \(s_c(\theta _r)=\partial \log L_c \left( \varvec{\theta }\right) /\partial \theta _r\) (\(r=1,\ldots ,2D\)) of the score statistic \(\mathbf {S}_c\left( \varvec{\theta };\mathbf {X}_c\right) \) computed from (12) have the form:

\[ \begin{aligned} s_c(\alpha _i)&=n\,\psi (\alpha ^++\tau )-z_{.i}\,\psi (\alpha _i+\tau )-(n-z_{.i})\,\psi (\alpha _i)+\sum _{j=1}^n\log x_{ji},\qquad i=1,\ldots ,D,\\ s_c(p_i)&=\frac{z_{.i}}{p_i}-\frac{z_{.D}}{p_D},\qquad i=1,\ldots ,D-1,\\ s_c(\tau )&=n\,\psi (\alpha ^++\tau )-\sum _{i=1}^D z_{.i}\,\psi (\alpha _i+\tau )+\sum _{j=1}^n\sum _{i=1}^D z_{ji}\log x_{ji}, \end{aligned} \]

where \(\psi (\cdot )\) denotes the digamma function and \(z_{.i}=\sum _{j=1}^n z_{ji}\) (\(i=1,\ldots ,D\)).
The elements \(i_c(\theta _r,\theta _p)=\partial ^2 \log L_c \left( \varvec{\theta }\right) /\partial \theta _r\partial \theta _p\) (r, \(p=1,\ldots ,2D\)) of the \(2D\times 2D\) matrix \(\mathbf {I}_c\left( \varvec{\theta };\mathbf {X}_c\right) \) assume the following form:

\[ \begin{aligned} i_c(\alpha _i,\alpha _k)&=n\,\psi ^{\prime }(\alpha ^++\tau )-\delta _{ik}\left[ z_{.i}\,\psi ^{\prime }(\alpha _i+\tau )+(n-z_{.i})\,\psi ^{\prime }(\alpha _i)\right] ,\\ i_c(\alpha _i,\tau )&=n\,\psi ^{\prime }(\alpha ^++\tau )-z_{.i}\,\psi ^{\prime }(\alpha _i+\tau ),\\ i_c(\tau ,\tau )&=n\,\psi ^{\prime }(\alpha ^++\tau )-\sum _{i=1}^D z_{.i}\,\psi ^{\prime }(\alpha _i+\tau ),\\ i_c(p_i,p_k)&=-\delta _{ik}\frac{z_{.i}}{p_i^2}-\frac{z_{.D}}{p_D^2},\\ i_c(p_i,\alpha _k)&=i_c(p_i,\tau )=0, \end{aligned} \]

where \(\psi ^{\prime }(\cdot )\) denotes the trigamma function and \(\delta _{ik}\) the Kronecker delta.
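These derivatives can be checked by finite differences, under the reading that the complete-data log-likelihood (12) has the usual mixture form \(\sum _j\sum _i z_{ji}\left[ \log p_i+\log \mathcal{D}(\mathbf x_j;\varvec{\alpha }+\tau \mathbf e_i)\right] \) with hard assignments. All names in the sketch below are ours; the digamma function is approximated from `math.lgamma`, since the Python standard library does not provide it.

```python
import math

def dirichlet_logpdf(x, alpha):
    return (math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
            + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x)))

def complete_loglik(sample, z, alpha, p, tau):
    """Assumed form of (12): hard assignments z[j] in {0, ..., D-1}."""
    ll = 0.0
    for x, zj in zip(sample, z):
        shifted = [a + (tau if i == zj else 0.0) for i, a in enumerate(alpha)]
        ll += math.log(p[zj]) + dirichlet_logpdf(x, shifted)
    return ll

def digamma(x, h=1e-5):
    # central difference of lgamma (the standard library has no digamma)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def score_tau(sample, z, alpha, tau):
    # s_c(tau) = n psi(alpha^+ + tau) - sum_i z_.i psi(alpha_i + tau)
    #            + sum_j log x_{j, z_j}
    n = len(sample)
    counts = [z.count(i) for i in range(len(alpha))]
    return (n * digamma(sum(alpha) + tau)
            - sum(c * digamma(a + tau) for c, a in zip(counts, alpha))
            + sum(math.log(x[zj]) for x, zj in zip(sample, z)))
```

The analytic score in \(\tau \) should then agree with a numerical derivative of `complete_loglik` up to discretization error.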
Migliorati, S., Ongaro, A. & Monti, G.S. A structured Dirichlet mixture model for compositional data: inferential and applicative issues. Stat Comput 27, 963–983 (2017). https://doi.org/10.1007/s11222-016-9665-y