An accelerated EM algorithm for mixture models with uncertainty for rating data

Abstract

The paper builds on Louis' identity for the observed information matrix in incomplete data problems, with a focus on the acceleration of maximum likelihood estimation for mixture models that it implies. The goal is twofold: to obtain direct expressions for the standard errors of parameter estimates from the EM algorithm, and to reduce the computational burden of the estimation procedure for a class of mixture models with uncertainty for rating variables. This makes best-subset variable selection feasible, an advisable strategy for identifying response patterns through regression models in Mixture of Experts systems. The discussion is supported by simulation experiments and a real case study.


References

  1. Agresti A (2010) Analysis of ordinal categorical data, 2nd edn. Wiley, Hoboken
  2. Allik J (2014) A mixed-binomial model for Likert-type personality measure. Front Psychol 5:1–13
  3. Baker SG (1992) A simple method for computing the observed information matrix when using the EM algorithm with categorical data. J Comput Graph Statist 1(1):63–76
  4. Buckland ST, Burnham KP, Augustin NH (1997) Model selection: an integral part of inference. Biometrics 53:603–618
  5. Burnham KP, Anderson DR (2003) Model selection and multimodel inference: a practical information-theoretic approach, 2nd edn. Springer, New York
  6. Capecchi S, Piccolo D (2017) Dealing with heterogeneity in ordinal responses. Qual Quant 51:2375–2393
  7. Cappelli C, Simone R, Di Iorio F (2019) CUBREMOT: a tool for building model-based trees for ordinal responses. Expert Syst Appl 124:39–49
  8. Colombi R, Giordano S (2016) A class of mixture models for multidimensional ordinal data. Statist Model 16(4):322–340
  9. Corduas M (2011) Assessing similarity of rating distributions by Kullback-Leibler divergence. In: Fichet A et al (eds) Classification and multivariate analysis for complex data structures, studies in classification, data analysis, and knowledge organization. Springer, Berlin, Heidelberg, pp 221–228
  10. D'Elia A, Piccolo D (2005) A mixture model for preference data analysis. Comput Stat Data Anal 49:917–934
  11. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Statist Soc Ser B 39(1):1–38
  12. GESIS Leibniz Institute for the Social Sciences (2016) German General Social Survey (ALLBUS)—Cumulation 1980–2014, GESIS Data Archive, Cologne. ZA4584 Data file version 1.0.0. https://doi.org/10.4232/1.12574
  13. Gormley IC, Frühwirth-Schnatter S (2019) Mixture of experts models. In: Frühwirth-Schnatter S, Celeux G, Robert CP (eds) Handbook of mixture analysis, 1st edn, chap 12. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. https://doi.org/10.1201/9780429055911
  14. Gottard A, Iannario M, Piccolo D (2016) Varying uncertainty in cub models. Adv Data Anal Classif 10(2):225–244
  15. Iannario M (2008) Selecting feeling covariates in rating surveys. Statist Appl 20(2):121–134
  16. Iannario M (2010) On the identifiability of a mixture model for ordinal data. Metron LXVIII(1):87–94
  17. Iannario M (2012) Preliminary estimators for a mixture model of ordinal data. Adv Data Anal Classif 6(3):163–184
  18. Iannario M, Monti AC, Piccolo D, Ronchetti E (2017) Robust inference for ordinal response models. Electron J Statist 11:3407–3445
  19. Iannario M, Piccolo D, Simone R (2018) CUB: a class of mixture models for ordinal data. R package version 1.1.3. http://CRAN.R-project.org/package=CUB
  20. Ibrahim JG (1990) Incomplete data in generalized linear models. J Am Statist Assoc 85:765–769
  21. Louis TA (1976) Maximum likelihood estimation using pseudo-data interactions. Boston University Research Report No. 2-76
  22. Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J R Statist Soc Ser B 44:226–233
  23. Khalili A, Chen J (2007) Variable selection in finite mixture of regression models. J Am Statist Assoc 102(479):1025–1038
  24. Mahalanobis PC (1936) On the generalised distance in statistics. Proc National Inst Sci India 2(1):49–55
  25. Manisera M, Zuccolotto P (2014) Modeling rating data with Non Linear CUB models. Comput Stat Data Anal 78:100–118
  26. McCullagh P (1980) Regression models for ordinal data. J R Statist Soc Ser B 42(2):109–142
  27. McLachlan GJ, Krishnan T (1997) The EM algorithm and extensions, 2nd edn. Wiley Series in Probability and Statistics, Wiley
  28. Meilijson I (1989) A fast improvement of the EM algorithm on its own terms. J R Statist Soc Ser B 51:127–138
  29. Meng X, Rubin DB (1991) Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. J Am Statist Assoc 86(416):899–909
  30. Miller K (1981) On the inverse of the sum of matrices. Math Mag 54(2):67–72
  31. Oakes D (1999) Direct calculation of the information matrix via the EM. J R Statist Soc Ser B 61(2):479–482
  32. Orchard T, Woodbury MA (1972) A missing information principle: theory and applications. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, vol 1. University of California Press, pp 697–715
  33. Piccolo D (2003) On the moments of a mixture of uniform and shifted binomial random variables. Quaderni di Statistica 5:85–104
  34. Piccolo D (2006) Observed information matrix for MUB models. Quaderni di Statistica 8:33–78
  35. Piccolo D, Simone R (2019a) The class of cub models: statistical foundations, inferential issues and empirical evidence. Statist Method Appl 28:389–435 (with discussions)
  36. Piccolo D, Simone R (2019b) Rejoinder to the discussion of "The class of cub models: statistical foundations, inferential issues and empirical evidence". Statist Method Appl 28:477–493
  37. Piccolo D, Simone R, Iannario M (2019) Cumulative and cub models for rating data: a comparative analysis. Int Statist Rev 87(2):207–236
  38. Pinto da Costa JF, Alonso H, Cardoso JS (2008) The unimodal model for the classification of ordinal data. Neural Networks 21:78–91. Corrigendum (2014): Neural Networks 59:73–75
  39. Simone R (2020) FastCUB: fast EM and best-subset selection for CUB models for rating data. R package version 0.0.2. https://CRAN.R-project.org/package=FastCUB
  40. Simone R, Cappelli C, Di Iorio F (2019) Modelling marginal ranking distributions: the uncertainty tree. Pattern Recognit Lett 125(1):278–288
  41. Simone R, Tutz G (2018) Modelling uncertainty and response styles in ordinal data. Statist Neerlandica 72(3):224–245
  42. Simone R, Tutz G, Iannario M (2020) Subjective heterogeneity in response attitude for multivariate ordinal outcomes. Econ Statist 14:145–158
  43. Sundberg R (1976) An iterative method for solution of the likelihood equations for incomplete data from exponential families. Commun Statist Simul Comput B5(1):55–64
  44. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Statist Soc Ser B 58:267–288
  45. Tutz G (2012) Regression for categorical data. Cambridge University Press, Cambridge
  46. Zhou H, Lange K (2009) Rating movies and rating the raters who rate them. Am Stat 63:297–307

Download references

Acknowledgements

The research has been partially funded by the ‘cub  Regression Model Trees project’ (project number: 000025_ALTRI_DR_1043_2017-C-CAPPELLI) of the University of Naples Federico II, Italy.

Author information

Corresponding author

Correspondence to Rosaria Simone.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix: the EM algorithm for CUB models

Given the notation set in Sects. 2 and 4, for a sample \(\varvec{R} = (R_1,\ldots ,R_n)^{\prime }\) of ordinal scores to be collected on a scale with m categories, consider the full cub specification with covariates given in (1)–(2). Then, \(\varvec{R}\) denotes the so-called incomplete data; let \(\varvec{X} = (\varvec{R}^{\prime }, \varvec{Z}^{\prime })^{\prime }\) be the complete data, with missing data \(\varvec{Z} = (Z_{1},\ldots , Z_n)^{\prime }\) given by:

$$\begin{aligned} Z_{i}= {\left\{ \begin{array}{ll} 1 &{}\quad \text {if the i-th observation is drawn from the feeling component} \\ 0 &{}\quad \text {otherwise}. \end{array}\right. } \end{aligned}$$

To be more specific, one should set \(\varvec{Z}_{1} = \varvec{Z}\) and \(\varvec{Z}_{2} = 1 - \varvec{Z}_1\) for the uncertainty component. Then, with obvious notation, consider the complete log-likelihood:

$$\begin{aligned} l_c(\varvec{\theta }; \varvec{R}, \varvec{Z}) = \sum _{i=1}^n Z_{1i}\, \log \big (\pi _i\;b_{R_i}(\xi _i) \big ) + \sum _{i=1}^n (1-Z_{1i})\, \log \big ((1-\pi _i)\dfrac{1}{m} \big ). \end{aligned}$$
(23)
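
For concreteness, the complete-data log-likelihood (23) can be evaluated as in the following R sketch. The shifted binomial probabilities \(b_r(\xi)\), the design matrices Ybar and Wbar (each with a leading column of ones), and all function names are illustrative choices, not part of the CUB or FastCUB packages.

```r
## Shifted binomial probabilities b_r(xi) for r = 1, ..., m (illustrative helper)
bshift <- function(r, m, xi) choose(m - 1, r - 1) * xi^(m - r) * (1 - xi)^(r - 1)

## Complete-data log-likelihood (23): z1 = 1 for draws from the feeling component
loglik_c <- function(r, z1, m, Ybar, Wbar, beta, gamma) {
  pii <- as.numeric(plogis(Ybar %*% beta))    # logit(pi_i) = ybar_i' beta
  xii <- as.numeric(plogis(Wbar %*% gamma))   # logit(xi_i) = wbar_i' gamma
  sum(z1 * log(pii * bshift(r, m, xii)) + (1 - z1) * log((1 - pii) / m))
}
```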

At the k-th iteration, for a realization \(\varvec{r} = (r_1,\ldots ,r_n)^{\prime }\) of \(\varvec{R}\), the procedure first computes the posterior probability that each observation has been drawn from each of the two components. In particular:

$$\begin{aligned} \tau _{1i}^{(k)}= & {} \dfrac{\pi _i^{(k)} b_{r_i}(\xi _i^{(k)})}{Pr(R_i=r_i \vert \varvec{\theta }^{(k)}, \varvec{y}_i, \varvec{w}_i)}, \qquad \end{aligned}$$
(24)
$$\begin{aligned} \tau _{2\,i}^{(k)}= & {} 1- \tau _{1\,i}^{(k)} = \dfrac{1}{m}\dfrac{1-\pi _i^{(k)}}{Pr(R_i=r_i \vert \varvec{\theta }^{(k)}, \varvec{y}_i, \varvec{w}_i)} \end{aligned}$$
(25)

where one sets:

$$\begin{aligned} {\text {logit}}(\pi _i^{(k)}) = \bar{\varvec{y}_i}\, \varvec{\beta }^{(k)} \qquad {\text {logit}}(\xi _i^{(k)}) = \bar{\varvec{w}_i}\, \varvec{\gamma }^{(k)} \;\qquad i=1,\ldots ,n. \end{aligned}$$
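
With the illustrative helpers introduced above (not the actual CUB or FastCUB routines), the E-step in (24)–(25) amounts to the following computation:

```r
## E-step: posterior probability tau_{1i} that observation i comes from the feeling component
estep_cub <- function(r, m, Ybar, Wbar, beta, gamma) {
  pii <- as.numeric(plogis(Ybar %*% beta))
  xii <- as.numeric(plogis(Wbar %*% gamma))
  num <- pii * bshift(r, m, xii)        # pi_i * b_{r_i}(xi_i)
  num / (num + (1 - pii) / m)           # divided by Pr(R_i = r_i | theta, y_i, w_i)
}
```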

Thus, at the k-th step, the conditional expectation of the complete log-likelihood (23) to be maximized over \(\varvec{\theta }\) is given by:

$$\begin{aligned} Q(\varvec{\theta }; \varvec{\theta }^{(k)})= & {} {\mathbb {E}}_{\varvec{\theta }^{(k)}}[l_c(\varvec{\theta }; \varvec{R}, \varvec{Z})| \varvec{R} = \varvec{r}] \\= & {} \sum _{i=1}^{n} \tau _{1i}^{(k)} \log (\pi _i(\varvec{\beta })) + \sum _{i=1}^{n}(1- \tau _{1i}^{(k)}) \log (1 - \pi _i(\varvec{\beta })) \\&\qquad + \sum _{i=1}^{n} \tau _{1i}^{(k)} \log (b_{r_i}(\xi _i(\varvec{\gamma }))) \; + \;\sum _{i=1}^{n} (1- \tau _{1i}^{(k)}) \log \big (\dfrac{1}{m}\big ) \end{aligned}$$

yielding the updated estimate \(\varvec{\theta }^{(k+1)}\). These steps are iterated until convergence is attained within a given numerical tolerance. The Nelder–Mead algorithm is used for the optimization steps in \(\varvec{\theta } = (\varvec{\beta },\varvec{\gamma })^{\prime }\).
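
Putting the two steps together, a minimal (non-accelerated) EM loop could look as follows. It relies on the illustrative helpers above and on optim() with the Nelder–Mead method for the M-step; it is a sketch to fix ideas, not the FastCUB implementation.

```r
em_cub <- function(r, m, Ybar, Wbar, beta, gamma, tol = 1e-6, maxit = 500) {
  p <- ncol(Ybar); q <- ncol(Wbar)
  for (k in seq_len(maxit)) {
    tau <- estep_cub(r, m, Ybar, Wbar, beta, gamma)       # E-step
    negQ <- function(theta) {                             # minus Q(theta; theta^(k)), constant term dropped
      pii <- as.numeric(plogis(Ybar %*% theta[1:p]))
      xii <- as.numeric(plogis(Wbar %*% theta[(p + 1):(p + q)]))
      -sum(tau * log(pii) + (1 - tau) * log(1 - pii) + tau * log(bshift(r, m, xii)))
    }
    fit   <- optim(c(beta, gamma), negQ, method = "Nelder-Mead")   # M-step
    theta <- fit$par
    if (max(abs(theta - c(beta, gamma))) < tol) break     # convergence within tolerance
    beta  <- theta[1:p]; gamma <- theta[(p + 1):(p + q)]
  }
  list(beta = theta[1:p], gamma = theta[(p + 1):(p + q)], tau = tau, iter = k)
}
```

Notice that, as is apparent from the expression of \(Q(\varvec{\theta }; \varvec{\theta }^{(k)})\) above, the objective separates into a \(\varvec{\beta }\)-part and a \(\varvec{\gamma }\)-part, so the two blocks can also be maximized separately.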

Appendix: Louis’ identity for CUB models

In full generality, consider the cub model specification with \(p\ge 0\) and \(q\ge 0\) covariates for the uncertainty and feeling parameters, respectively. Since \({\text {logit}}(\pi _i) = \bar{\varvec{y}}_i \cdot \varvec{\beta }\) and \({\text {logit}}(\xi _i) = \bar{\varvec{w}}_i \cdot \varvec{\gamma }\), it holds that

$$\begin{aligned} \dfrac{\partial \pi _i}{\partial \beta _j} = {\bar{y}}_{ij}\,\pi _i\,(1-\pi _i), \qquad \dfrac{\partial \xi _i}{\partial \gamma _j} = {\bar{w}}_{ij}\,\xi _i\,(1-\xi _i). \end{aligned}$$

For the complete information matrix (17), the first derivatives with respect to \(\beta _j\) and \(\gamma _l\) of the complete log-likelihood (23) are given by:

$$\begin{aligned} \dfrac{\partial \ell _c(\varvec{\theta })}{\partial \beta _j} = \sum _{i=1}^n {\bar{y}}_{ij} \big (Z_{i1} - \pi _i\big ), \qquad \dfrac{\partial \ell _c(\varvec{\theta })}{\partial \gamma _l} = \sum _{i=1}^n {\bar{w}}_{il} \,Z_{i1}\big ( m - R_i - \xi _i\,(m-1) \big )\quad \end{aligned}$$
(26)

from which it follows:

$$\begin{aligned}&\dfrac{\partial ^2 \ell _c(\varvec{\theta })}{\partial \beta _j\, \partial \beta _k} = - \sum _{i=1}^n {\bar{y}}_{ij}\,{\bar{y}}_{ik} \,\pi _i\,(1-\pi _i), \\&\quad \dfrac{\partial ^2 \ell _c(\varvec{\theta })}{\partial \gamma _h\partial \gamma _l} = - (m-1) \sum _{i=1}^n {\bar{w}}_{il} {\bar{w}}_{ih}\,Z_{i1}\,\xi _i (1-\xi _i) \end{aligned}$$

Then, taking the conditional expectation, given \(\varvec{R} = \varvec{r}\), of the negative of these second-order derivatives yields the block-wise definition in (17). The matrix specification (19) in Louis’ identity then follows straightforwardly, since \({\mathbb {E}}[Z_{i1}|\varvec{R}=\varvec{r}] = \tau _i\).
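
As an illustration, the resulting blocks of the expected complete information can be computed as follows (same illustrative notation as above); the mixed \(\varvec{\beta }\)–\(\varvec{\gamma }\) block is zero because the complete log-likelihood (23) separates in \(\varvec{\beta }\) and \(\varvec{\gamma }\).

```r
## Conditional expectation of the complete information, with Z_{i1} replaced by tau_i
complete_info_cub <- function(m, Ybar, Wbar, beta, gamma, tau) {
  pii <- as.numeric(plogis(Ybar %*% beta))
  xii <- as.numeric(plogis(Wbar %*% gamma))
  Ibb <- t(Ybar) %*% (Ybar * (pii * (1 - pii)))                  # sum_i ybar_i ybar_i' pi_i (1 - pi_i)
  Igg <- (m - 1) * t(Wbar) %*% (Wbar * (tau * xii * (1 - xii)))  # (m-1) sum_i wbar_i wbar_i' tau_i xi_i (1 - xi_i)
  rbind(cbind(Ibb, matrix(0, nrow(Ibb), ncol(Igg))),
        cbind(matrix(0, nrow(Igg), ncol(Ibb)), Igg))             # beta-gamma block is zero
}
```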

Starting from the complete score vector (26), matrix \({\mathcal {V}}_c\) in (18) can be obtained as follows:

$$\begin{aligned} {\mathcal {V}}_c[j,l]&= {\mathbb {E}}\big [ \big ( \sum _{i=1}^n {\bar{y}}_{ij}(Z_{i1} - \pi _i)\big )\big (\sum _{t=1}^n {\bar{y}}_{tl}(Z_{t1} - \pi _t)\big ) \big | \varvec{R} = \varvec{r} \big ] \\&= \sum _{i=1}^n \sum _{t=1}^n {\bar{y}}_{ij}{\bar{y}}_{tl}Cov(Z_{i1}-\pi _i,Z_{t1} - \pi _t\big | \varvec{R} = \varvec{r} \big ) + \\&\quad + {\mathbb {E}}\big [\sum _{i=1}^n {\bar{y}}_{ij}(Z_{i1} - \pi _i)\big ]\,{\mathbb {E}}\big [\sum _{t=1}^n {\bar{y}}_{tl}(Z_{t1} - \pi _t)\big ] \\&= \sum _{i=1}^n {\bar{y}}_{ij}{\bar{y}}_{il} \tau _{i1}(1-\tau _{i1}) + \big (\sum _{i=1}^n {\bar{y}}_{ij}(\tau _{i1}-\pi _i)\big ) \big (\sum _{i=1}^n {\bar{y}}_{il}(\tau _{i1}-\pi _i)\big ) \\&= {\varvec{Y}}_{\varvec{\tau }}[,j]\cdot \varvec{Y}[,l] + {\mathcal {V}}[j,l] \end{aligned}$$

according to the notation introduced in Sect. 4 (notice that, in the second-to-last row of the above identity, one uses the fact that the \(Z_{i1}\)’s are conditionally independent Bernoulli random variables, given \(\varvec{R}=\varvec{r}\), with probability parameter \(\tau _{1i}\)). Similar steps can be easily pursued for the other blocks of the matrix (18).

Finally, the score vector for the incomplete data problem is obtained by taking the first partial derivatives of the log-likelihood (3) with respect to \(\beta _j\), \(j=0,\ldots ,p\), and \(\gamma _l\), \(l=0,\ldots ,q\):

$$\begin{aligned} \dfrac{\partial \ell (\varvec{\theta })}{\partial \beta _j} = \sum _{i=1}^n {\bar{y}}_{ij} \big ( \tau _i - \pi _i \big ), \qquad \dfrac{\partial \ell (\varvec{\theta })}{\partial \gamma _l} = \sum _{i=1}^n {\bar{w}}_{il} \,\tau _i\, a_i \end{aligned}$$

with \(a_i = 1- r_i + (m-1)(1-\xi _i)\). Accordingly, matrix (19) is obtained from the column-by-row product of the incomplete score vector with its transpose.
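
To fix ideas on how these pieces combine, the sketch below assembles the \(\varvec{\beta }\)-block of the observed information using Louis’ identity in its standard form (observed information equal to the expected complete information, minus \({\mathbb {E}}[S_c S_c^{\prime }|\varvec{r}]\), plus the outer product of the incomplete score, the last term vanishing at the maximum likelihood estimate). All names are again illustrative, not the FastCUB implementation.

```r
## Louis' identity, beta block (illustrative); tau are the E-step weights at (beta, gamma)
louis_beta_block <- function(Ybar, beta, tau) {
  pii  <- as.numeric(plogis(Ybar %*% beta))
  Ibb  <- t(Ybar) %*% (Ybar * (pii * (1 - pii)))                 # expected complete information, beta block
  s    <- colSums(Ybar * (tau - pii))                            # incomplete score for beta
  Vbb  <- t(Ybar) %*% (Ybar * (tau * (1 - tau))) + tcrossprod(s) # E[S_c S_c' | r], beta block
  Iobs <- Ibb - Vbb + tcrossprod(s)                              # Louis' identity
  list(info = Iobs, se = sqrt(diag(solve(Iobs))))                # block-wise standard errors only
}
```

In practice the \(\varvec{\gamma }\)-block and the mixed block are assembled along the same lines, and the inverse of the full observed information matrix provides the standard errors of all parameter estimates.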

About this article

Cite this article

Simone, R. An accelerated EM algorithm for mixture models with uncertainty for rating data. Comput Stat 36, 691–714 (2021). https://doi.org/10.1007/s00180-020-01004-z

Keywords

  • Louis’ Identity
  • Accelerated EM algorithm
  • cub Mixture models
  • Rating data
  • Standard errors