Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition

Abstract

A central challenge in cluster analysis is evaluating the clustering obtained without recourse to auxiliary information. A common approach is to use internal validity criteria. For mixtures of linear regressions whose parameters are estimated by maximum likelihood, we propose a three-term decomposition of the total sum of squares as a starting point for defining internal validity criteria. Three types of mixtures of regressions are considered: with fixed covariates, with concomitant variables, and with random covariates. A ternary diagram is suggested to ease the joint interpretation of the three terms of the decomposition. Local and overall coefficients of determination are also defined to judge how well the model fits the data group by group and as a whole, respectively. Artificial data, including cases that violate the model assumptions, are used to study the behavior of the proposed decomposition. Finally, an application to real data illustrates the use and usefulness of these proposals.
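
The abstract does not state the formulas, so the following Python sketch is only one plausible reading of a three-term decomposition, not the authors' definition. It assumes the three terms are a between-group term, a within-group term explained by the regressions, and a within-group residual term, all computed with soft membership weights. The function name `tss_decomposition`, the weighted least-squares refit inside each group, and the simulated data are illustrative assumptions; in practice the weights would be the posterior probabilities from the fitted mixture.

```python
import numpy as np

def tss_decomposition(y, X, z):
    """Split the total sum of squares into three soft-weighted terms.

    y : (n,) response; X : (n, p) covariates without an intercept column;
    z : (n, G) membership weights whose rows sum to one (assumed given).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    z = np.asarray(z, dtype=float)
    n = y.shape[0]
    design = np.column_stack([np.ones(n), X])  # add an intercept
    y_bar = y.mean()
    tss = np.sum((y - y_bar) ** 2)
    bss = rwss = ewss = 0.0
    for g in range(z.shape[1]):
        w = z[:, g]
        n_g = w.sum()                      # soft group size
        y_bar_g = np.sum(w * y) / n_g      # weighted group mean
        # Weighted least squares via row scaling by sqrt(weights).
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted = design @ beta
        bss += n_g * (y_bar_g - y_bar) ** 2          # between groups
        rwss += np.sum(w * (fitted - y_bar_g) ** 2)  # explained within groups
        ewss += np.sum(w * (y - fitted) ** 2)        # residual within groups
    return tss, bss, rwss, ewss

# Illustrative data: two regression lines and arbitrary soft weights.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
labels = rng.integers(0, 2, size=n)
y = np.where(labels == 0, 1 + 2 * x, -1 - x) + rng.normal(scale=0.3, size=n)
z = rng.dirichlet([1.0, 1.0], size=n)  # stand-in for posterior probabilities

tss, bss, rwss, ewss = tss_decomposition(y, x[:, None], z)
shares = np.array([bss, rwss, ewss]) / tss  # three proportions summing to one
```

Under these assumptions the three terms add up to the total sum of squares for any weights whose rows sum to one, so `shares` is a three-part composition, which is the kind of quantity a ternary diagram displays.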



Author information

Corresponding author

Correspondence to Antonio Punzo.

Cite this article

Ingrassia, S., Punzo, A. Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition. J Classif (2019). https://doi.org/10.1007/s00357-019-09326-4

Keywords

  • Cluster validation
  • EM algorithm
  • Maximum likelihood
  • Mixtures of regressions
  • Model-based clustering
  • Ternary diagram