A robust approach to model-based classification based on trimming and constraints

Semi-supervised learning in presence of outliers and label noise

Abstract

In a standard classification framework, a set of trustworthy learning data is employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Unreliable labelled observations, namely outliers and data with incorrect labels, can therefore strongly undermine the classifier performance, especially when the training size is small. The present work introduces a robust modification to the model-based classification framework, employing impartial trimming and constraints on the ratio between the maximum and the minimum eigenvalue of the group scatter matrices. The proposed method effectively handles noise in both the response and the explanatory variables, providing reliable classification even for contaminated datasets. A robust information criterion is proposed for model selection. Experiments on real and simulated data, artificially adulterated, illustrate the benefits of the proposed method.

References

  1. Aitken AC (1926) A series formula for the roots of algebraic and transcendental equations. Proc R Soc Edinb 45(01):14–22

  2. Alimentarius C (2001) Revised codex standard for honey. Codex stan 12:1982

  3. Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3):803

  4. Bensmail H, Celeux G (1996) Regularized Gaussian discriminant analysis through eigenvalue decomposition. J Am Stat Assoc 91(436):1743–1748

  5. Böhning D, Dietz E, Schaub R, Schlattmann P, Lindsay BG (1994) The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Ann Inst Stat Math 46(2):373–388

  6. Bouveyron C, Girard S (2009) Robust supervised classification with mixture models: learning from data with uncertain labels. Pattern Recognit 42(11):2649–2658

  7. Browne RP, McNicholas PD (2014) Estimating common principal components in high dimensions. Adv Data Anal Classif 8:217–226

  8. Cattell RB (1966) The scree test for the number of factors. Multivar Behav Res 1(2):245–276

  9. Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recognit 28(5):781–793

  10. Cerioli A, García-Escudero LA, Mayo-Iscar A, Riani M (2018) Finding the number of normal groups in model-based clustering via constrained likelihoods. J Comput Gr Stat 27(2):404–416

  11. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  12. Cuesta-Albertos JA, Gordaliza A, Matrán C (1997) Trimmed k-means: an attempt to robustify quantizers. Ann Stat 25(2):553–576

  13. Dean N, Murphy TB, Downey G (2006) Using unlabelled data to update classification rules with applications in food authenticity studies. J R Stat Soc Ser C Appl Stat 55(1):1–14

  14. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 39(1):1–38

  15. Dotto F, Farcomeni A (2019) Robust inference for parsimonious model-based clustering. J Stat Comput Simul 89(3):414–442

  16. Dotto F, Farcomeni A, García-Escudero LA, Mayo-Iscar A (2018) A reweighting approach to robust clustering. Stat Comput 28(2):477–493

  17. Downey G (1996) Authentication of food and food ingredients by near infrared spectroscopy. J Near Infrared Spectrosc 4(1):47

  18. Scrucca L, Fop M, Murphy TB, Raftery AE (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):289–317

  19. Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631

  20. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139

  21. Fritz H, García-Escudero LA, Mayo-Iscar A (2012) tclust: an R package for a trimming approach to cluster analysis. J Stat Softw 47(12):1–26

  22. Fritz H, García-Escudero LA, Mayo-Iscar A (2013) A fast algorithm for robust constrained clustering. Comput Stat Data Anal 61:124–136

  23. Gallegos MT (2002) Maximum likelihood clustering with outliers. In: Classification, clustering, and data analysis, Springer, pp 247–255

  24. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2008) A general trimming approach to robust cluster Analysis. Ann Stat 36(3):1324–1345

  25. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2010) A review of robust clustering methods. Adv Data Anal Classif 4(2–3):89–109

  26. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2011) Exploring the number of groups in robust model-based clustering. Stat Comput 21(4):585–599

  27. García-Escudero LA, Gordaliza A, Mayo-Iscar A (2014) A constrained robust proposal for mixture modeling avoiding spurious solutions. Adv Data Anal Classif 8(1):27–43

  28. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2015) Avoiding spurious local maximizers in mixture modeling. Stat Comput 25(3):619–633

  29. García-Escudero LA, Gordaliza A, Greselin F, Ingrassia S, Mayo-Iscar A (2016) The joint role of trimming and constraints in robust estimation for mixtures of Gaussian factor analyzers. Comput Stat Data Anal 99:131–147

  30. García-Escudero LA, Gordaliza A, Greselin F, Ingrassia S, Mayo-Iscar A (2017) Eigenvalues and constraints in mixture modeling: geometric and computational issues. Adv Data Anal Classif 12:1–31

  31. Gordaliza A (1991a) Best approximations to random variables based on trimming procedures. J Approx Theory 64(2):162–180

  32. Gordaliza A (1991b) On the breakdown point of multivariate location estimators based on trimming procedures. Stat Probab Lett 11(5):387–394

  33. Hastie T, Tibshirani R (1996) Discriminant analysis by Gaussian mixtures. J R Stat Soc Ser B (Methodol) 58(1):155–176

  34. Hawkins DM, McLachlan GJ (1997) High-breakdown linear discriminant analysis. J Am Stat Assoc 92(437):136

  35. Hickey RJ (1996) Noise modelling and evaluating learning from examples. Artif Intell 82(1–2):157–179

  36. Hubert M, Debruyne M, Rousseeuw PJ (2018) Minimum covariance determinant and extensions. Wiley Interdiscip Rev Comput Stat 10(3):1–11

  37. Ingrassia S (2004) A likelihood-based constrained algorithm for multivariate normal mixture models. Stat Methods Appl 13(2):151–166

  38. Kelly JD, Petisco C, Downey G (2006) Application of Fourier transform midinfrared spectroscopy to the discrimination between Irish artisanal honey and such honey adulterated with various sugar syrups. J Agric Food Chem 54(17):6166–6171

  39. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, New York

  40. Maronna R, Jacovkis PM (1974) Multivariate clustering procedures with variable metrics. Biometrics 30(3):499

  41. McLachlan GJ (1992) Discriminant analysis and statistical pattern recognition, vol 544. Wiley series in probability and statistics. Wiley, Hoboken

  42. McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions, vol 54. Wiley series in probability and statistics. Wiley, Hoboken

  43. McLachlan GJ, Peel D (1998) Robust cluster analysis via mixtures of multivariate t-distributions. In: Joint IAPR international workshops on statistical techniques in pattern recognition and structural and syntactic pattern recognition. Springer, Berlin, pp 658–666

  44. McNicholas PD (2016) Mixture model-based classification. CRC Press, Boca Raton

  45. Menardi G (2011) Density-based Silhouette diagnostics for clustering methods. Stat Comput 21(3):295–308

  46. Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52(1):299–308

  47. Peel D, McLachlan GJ (2000) Robust mixture modelling using the t distribution. Stat Comput 10(4):339–348

  48. Prati RC, Luengo J, Herrera F (2019) Emerging topics and challenges of learning from noisy data in nonstandard classification: a survey beyond binary class noise. Knowl Inf Syst 60(1):63–97

  49. R Core Team (2018) R: a language and environment for statistical computing

  50. Rousseeuw PJ, Driessen KV (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223

  51. Russell N, Cribbin L, Murphy TB (2014) upclass: an R package for updating model-based classification rules. Cran R-Project Org

  52. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  53. Thomson G (1939) The factorial analysis of human ability. Br J Educ Psychol 9(2):188–195

  54. Vanden Branden K, Hubert M (2005) Robust classification in high dimensions based on the SIMCA Method. Chemom Intell Lab Syst 79(1–2):10–21

  55. Wu X (1995) Knowledge acquisition from databases. Intellect books, Westport

  56. Zhu X, Wu X (2004) Class noise vs. attribute noise: a quantitative study. Artif Intell Rev 22(3):177–210

Acknowledgements

The authors are very grateful to Agustin Mayo-Iscar and Luis Angel García Escudero for stimulating discussions and advice on how to enforce the eigenvalue-ratio constraints under the different patterned models. Andrea Cappozzo deeply thanks Michael Fop for his endless patience and guidance in helping him with the methodological and computational issues encountered while drafting the present manuscript. Brendan Murphy’s work is supported by the Science Foundation Ireland Insight Research Centre (SFI/12/RC/2289_P2).

Author information

Corresponding author

Correspondence to Andrea Cappozzo.

Appendices

Appendix A

Proof of Proposition 1

Considering the random variable \({\mathcal {Z}}_{mg}\) corresponding to \(z_{mg}\), the E-step on the \((k+1)\)th iteration requires the calculation of the conditional expectation of \({\mathcal {Z}}_{mg}\) given \({\mathbf {y}}_m\):

$$\begin{aligned} \begin{aligned} E _{\hat{\varvec{\theta }}^{(k)}}({\mathcal {Z}}_{mg}|{\mathbf {y}}_m)&={\mathbb {P}}\left( {\mathcal {Z}}_{mg}=1|{\mathbf {y}}_m;{\hat{\theta }}^{(k)}\right) \\&=\frac{{\mathbb {P}}\left( {\mathbf {y}}_m|{\mathcal {Z}}_{mg}=1;{\hat{\theta }}^{(k)}\right) {\mathbb {P}}\left( {\mathcal {Z}}_{mg}=1;{\hat{\theta }}^{(k)}\right) }{\sum _{j=1}^G {\mathbb {P}}\left( {\mathbf {y}}_m|{\mathcal {Z}}_{mj}=1;{\hat{\theta }}^{(k)}\right) {\mathbb {P}}\left( {\mathcal {Z}}_{mj}=1;{\hat{\theta }}^{(k)}\right) }\\&=\frac{{\hat{\tau }}^{(k)}_g \phi \left( {\mathbf {y}}_m; \hat{\varvec{\mu }}^{(k)}_g, \hat{\varvec{\varSigma }}^{(k)}_g \right) }{\sum _{j=1}^G{\hat{\tau }}_j^{(k)} \phi \left( {\mathbf {y}}_m; \hat{\varvec{\mu }}^{(k)}_j, \hat{\varvec{\varSigma }}^{(k)}_j\right) }\\&={\hat{z}}_{mg}^{(k+1)} \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, g=1,\ldots , G; \,\,\,\, m=1,\ldots , M. \end{aligned} \end{aligned}$$
(23)
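In code, the E-step in (23) amounts to computing the posterior membership probabilities on the log scale. The following is a minimal numpy sketch, not the authors' implementation; function names and interfaces are our own, and the max-shift makes the normalisation numerically stable:

```python
import numpy as np

def log_gauss(Y, mu, Sigma):
    # log-density log phi(y_m; mu, Sigma) for each row of Y
    p = Y.shape[1]
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, (Y - mu).T)           # p x M
    maha = np.sum(sol ** 2, axis=0)                # Mahalanobis distances
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

def e_step(Y, tau, mus, Sigmas):
    # posterior probabilities z_hat[m, g] as in Eq. (23)
    logp = np.column_stack([np.log(tau[g]) + log_gauss(Y, mus[g], Sigmas[g])
                            for g in range(len(tau))])
    logp -= logp.max(axis=1, keepdims=True)        # log-sum-exp stabilisation
    z = np.exp(logp)
    return z / z.sum(axis=1, keepdims=True)
```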

Therefore, the Q function, to be maximized with respect to \(\varvec{\theta }\) in the M-step, is given by

$$\begin{aligned} \begin{aligned} Q(\varvec{\theta };\hat{\varvec{\theta }}^{(k)})&= \sum _{n=1}^N \zeta ({\mathbf {x}}_n) \sum _{g=1}^G l_{ng} \log {\left[ \tau _g \phi ({\mathbf {x}}_n; \varvec{\mu }_g, \varvec{\varSigma }_g)\right] } \\&\quad +\, \sum _{m=1}^M \varphi ({\mathbf {y}}_m) \sum _{g=1}^G \hat{z}_{mg}^{(k+1)} \log {\left[ \tau _g \phi ({\mathbf {y}}_m; \varvec{\mu }_g, \varvec{\varSigma }_g)\right] .} \end{aligned} \end{aligned}$$
(24)

The maximization of (24) with respect to the mixing proportions \(\tau _g\), under the constraint \(\sum _{j=1}^G\tau _j=1\), is carried out by considering the Lagrangian \({\mathcal {L}}(\varvec{\theta }, \kappa )\):

$$\begin{aligned} {\mathcal {L}}(\varvec{\theta }, \kappa )=Q\left( \varvec{\theta };\hat{\varvec{\theta }}^{(k)}\right) -\kappa \left( \sum _{j=1}^G\tau _j-1\right) \end{aligned}$$
(25)

with \(\kappa \) the Lagrange multiplier. The partial derivative of (25) with respect to \(\tau _g\) has the form:

$$\begin{aligned} \frac{\partial }{\partial \tau _g}{\mathcal {L}}(\varvec{\theta }, \kappa )=\frac{\sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}}{\tau _g}+ \frac{\sum _{m=1}^M \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}}{\tau _g}-\kappa \end{aligned}$$
(26)

and setting (26) equal to 0 for all \(g=1,\ldots , G\) we obtain:

$$\begin{aligned} \sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}+ \sum _{m=1}^M \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}-\kappa \tau _g=0. \end{aligned}$$
(27)

Summing (27) over g, \(g=1,\ldots , G\), provides the value of \(\kappa =\lceil N(1-\alpha _{l})\rceil +\lceil M(1-\alpha _{u})\rceil \) and substituting it in the previous expression yields the ML estimate for \(\tau _g\):

$$\begin{aligned} {\hat{\tau }}_g^{(k+1)}=\frac{\sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}+ \sum _{m=1}^M \varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}}{\lceil N(1-\alpha _{l})\rceil +\lceil M(1-\alpha _{u})\rceil }\,\,\,\,\, g=1,\ldots , G. \end{aligned}$$
(28)

The partial derivative of (24) with respect to the mean vector \(\varvec{\mu }_g\) reads:

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial \varvec{\mu }_g}Q\left( \varvec{\theta };\hat{\varvec{\theta }}^{(k)}\right)&= \varvec{\varSigma }_g^{-1}\left[ \sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}\left( {\mathbf {x}}_n-\varvec{\mu }_g\right) +\sum _{m=1}^M \varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}\left( {\mathbf {y}}_m-\varvec{\mu }_g\right) \right] \\&=\varvec{\varSigma }_g^{-1}\left[ \sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}{\mathbf {x}}_n + \sum _{m=1}^M \varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}{\mathbf {y}}_m \right. \\&\left. \quad -\varvec{\mu }_g\left( \sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^M \varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)} \right) \right] . \end{aligned} \end{aligned}$$
(29)

Equating (29) to 0 and rearranging terms provides the ML estimate of \(\varvec{\mu }_g\):

$$\begin{aligned} \hat{\varvec{\mu }}_g^{(k+1)}=\frac{\sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}{\mathbf {x}}_n+\sum _{m=1}^M\varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}{\mathbf {y}}_m}{\sum _{n=1}^N\zeta ({\mathbf {x}}_n)l_{ng}+\sum _{m=1}^M\varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}}\,\,\,\,\, g=1,\ldots , G. \end{aligned}$$
(30)
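The closed-form updates (28) and (30) can be sketched as follows. This is a hypothetical numpy illustration under our own naming conventions: `zeta` and `phi` are the 0/1 trimming indicators \(\zeta ({\mathbf {x}}_n)\) and \(\varphi ({\mathbf {y}}_m)\), `L` the label matrix \((l_{ng})\) and `Z` the posterior matrix \(({\hat{z}}_{mg})\):

```python
import numpy as np

def m_step_tau_mu(X, L, zeta, Y, Z, phi, alpha_l, alpha_u):
    # tau_g update of Eq. (28) and mu_g update of Eq. (30)
    N, M = X.shape[0], Y.shape[0]
    wX = zeta[:, None] * L          # zeta(x_n) * l_ng
    wY = phi[:, None] * Z           # phi(y_m) * z_hat_mg
    denom = np.ceil(N * (1 - alpha_l)) + np.ceil(M * (1 - alpha_u))
    tau = (wX.sum(axis=0) + wY.sum(axis=0)) / denom
    # weighted means of retained labelled and unlabelled units
    mu = (wX.T @ X + wY.T @ Y) / (wX.sum(axis=0) + wY.sum(axis=0))[:, None]
    return tau, mu
```

With no trimming (\(\alpha _l=\alpha _u=0\)) the estimated proportions sum to one, as they should.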

Discarding quantities that do not depend on \(\varvec{\varSigma }_g\), we can rewrite (24) as follows:

$$\begin{aligned}&\sum _{n=1}^{N} \sum _{g=1}^{G} \zeta ({\mathbf {x}}_n)l_{ng}\left[ -\log \left| \varvec{\varSigma }_{g}\right| ^{1 / 2}-\frac{1}{2}\left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) ^{\prime } \varvec{\varSigma }_{g}^{-1}\left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) \right] \nonumber \\&\qquad +\sum _{m=1}^{M} \sum _{g=1}^{G} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}\left[ -\log \left| \varvec{\varSigma }_{g}\right| ^{1 / 2}-\frac{1}{2}\left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) ^{\prime } \varvec{\varSigma }_{g}^{-1}\left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) \right] \nonumber \\&\quad =-\frac{1}{2}\left[ \sum _{n=1}^{N} \sum _{g=1}^{G} \zeta ({\mathbf {x}}_n)l_{ng} \log \left| \varvec{\varSigma }_{g}\right| +\sum _{n=1}^{N} \sum _{g=1}^{G} \zeta ({\mathbf {x}}_n)l_{ng} \underbrace{\left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) ^{\prime } \varvec{\varSigma }_{g}^{-1}\left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) }_{\text{ a } \text{ scalar } }\right. \nonumber \\&\qquad +\sum _{m=1}^{M} \sum _{g=1}^{G} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \log \left| \varvec{\varSigma }_{g}\right| \nonumber \\&\qquad \left. +\sum _{m=1}^{M} \sum _{g=1}^{G} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}\underbrace{\left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) ^{\prime } \varvec{\varSigma }_{g}^{-1}\left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) }_{ \text{ a } \text{ scalar } }\right] \nonumber \\&\quad =-\frac{1}{2}\left[ \sum _{g=1}^{G} \log \left| \varvec{\varSigma }_{g}\right| \left( \sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \right) \right. \nonumber \\&\qquad +\sum _{n=1}^{N} \sum _{g=1}^{G} \zeta ({\mathbf {x}}_n)l_{ng}\, tr \left[ \varvec{\varSigma }_{g}^{-1} \left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) \left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) ^{\prime }\right] \nonumber \\&\qquad \left. + \sum _{m=1}^{M} \sum _{g=1}^{G} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}\, tr \left[ \varvec{\varSigma }_{g}^{-1} \left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) \left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) ^{\prime }\right] \right] \nonumber \\&\quad =-\frac{1}{2}\left[ \sum _{g=1}^{G} \log \left| \varvec{\varSigma }_{g}\right| \left( \sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \right) +\sum _{g=1}^{G} tr \left[ \varvec{\varSigma }^{-1}_{g}{\varvec{W}}_g^{X} \right] + \sum _{g=1}^{G} tr \left[ \varvec{\varSigma }_{g}^{-1}{\varvec{W}}_g^{Y} \right] \right] \nonumber \\&\quad =-\frac{1}{2}\left[ \sum _{g=1}^{G} \log \left| \varvec{\varSigma }_{g}\right| \left( \sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \right) + \sum _{g=1}^{G} tr \left[ \varvec{\varSigma }^{-1}_{g}\left( {\varvec{W}}_g^{X} + {\varvec{W}}_g^{Y}\right) \right] \right] \end{aligned}$$
(31)

where \({\varvec{W}}_g^{X}=\sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng}\left[ \left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) \left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) ^{\prime }\right] \) and \({\varvec{W}}_g^{Y}=\sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}\left[ \left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) \left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) ^{\prime }\right] \). Finally, considering the eigenvalue decomposition \(\varvec{\varSigma }_g=\lambda _g{\varvec{D}}_g{\varvec{A}}_g{\varvec{D}}^{'}_g\), (31) simplifies to:

$$\begin{aligned} \begin{aligned}&-\frac{1}{2}\left[ \sum _{g=1}^{G} p \log \lambda _g \left( \sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \right) \right. \\&\quad + \left. \sum _{g=1}^{G} \frac{1}{\lambda _g} tr \left[ {\varvec{D}}_{g} {\varvec{A}}_g^{-1} {\varvec{D}}_{g}^{\prime } \left( {\varvec{W}}_g^{X} + {\varvec{W}}_g^{Y}\right) \right] \right] \end{aligned} \end{aligned}$$
(32)

The partial derivative of (32) with respect to \(\left( \lambda _g,{\varvec{A}}_g, {\varvec{D}}_g\right) \) depends on the patterned structure under consideration: for a thorough derivation the reader is referred to Bensmail and Celeux (1996). If (8) is not satisfied, the constraints are enforced as detailed in “Appendix C”. Lastly, notice that in the concentration step the optimal observations of both training and test sets are retained, namely those with the highest contribution to the objective function.
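The concentration step can be sketched as follows; this is an illustrative numpy fragment of our own, assuming the per-observation contributions to the objective function have already been computed:

```python
import numpy as np

def concentration_step(contrib, alpha):
    # keep the ceil(n * (1 - alpha)) observations with the largest
    # contribution to the objective function; returns a 0/1 indicator
    n = len(contrib)
    keep = int(np.ceil(n * (1 - alpha)))
    ind = np.zeros(n)
    ind[np.argsort(contrib)[::-1][:keep]] = 1.0
    return ind
```

Applied separately to the training and test contributions, this yields the indicators \(\zeta (\cdot )\) and \(\varphi (\cdot )\) with trimming levels \(\alpha _l\) and \(\alpha _u\).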

The procedure described above falls within the structure of a general EM algorithm, for which the likelihood function does not decrease after an EM iteration, as shown in Dempster et al. (1977) and reported on page 78 of McLachlan and Krishnan (2008). \(\square \)

Appendix B

This appendix details the structure of the simulation study in Sect. 4.2.1. We consider a data generating process given by a mixture of \(G=4\) multivariate t-distributed components (McLachlan and Peel 1998; Peel and McLachlan 2000), with the following parameters:

$$\begin{aligned}&\varvec{\tau }=(0.2, 0.4, 0.1, 0.3)', \quad \nu =6, \\&\varvec{\mu }_1=(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)', \\&\varvec{\mu }_2=(4, -4, 4, -4, 4, -4, 4, -4, 4, -4)', \\&\varvec{\mu }_3=(0, 0, 7, 7, 7, 3, 6, 8, -4, -4)', \\&\varvec{\mu }_4=(8, 0, 8, 0, 8, 0, 8, 0, 8, 0)', \\&\varvec{\varSigma }_1 = diag(1,1,1,1,1,1,1,1,1,1), \\&\varvec{\varSigma }_2 = diag(2,2,2,2,2,2,2,2,2,2), \\&\varvec{\varSigma }_3 = \varvec{\varSigma }_4 = \begin{bmatrix} 5.05&1.26&-0.35&-0.00&-1.04&-1.35&0.29&0.07&0.69&1.17 \\ 1.26&2.57&0.17&0.00&0.27&0.11&0.61&0.11&0.59&0.89 \\ -0.35&0.17&6.74&-0.00&-0.26&-0.31&-0.01&0.00&0.08&0.14 \\ -0.00&0.00&-0.00&5.47&-0.00&-0.00&0.00&0.00&0.00&0.00 \\ -1.04&0.27&-0.26&-0.00&6.80&-0.76&-0.12&-0.01&0.09&0.21 \\ -1.35&0.11&-0.31&-0.00&-0.76&7.75&-0.26&-0.04&-0.03&0.03 \\ 0.29&0.61&-0.01&0.00&-0.12&-0.26&4.76&0.06&0.38&0.60 \\ 0.07&0.11&0.00&0.00&-0.01&-0.04&0.06&4.18&0.07&0.11 \\ 0.69&0.59&0.08&0.00&0.09&-0.03&0.38&0.07&3.23&0.60 \\ 1.17&0.89&0.14&0.00&0.21&0.03&0.60&0.11&0.60&3.24 \\ \end{bmatrix}. \end{aligned}$$

Fig. 7 Generalized pairs plot of the simulated data under the simulation setup described in Sect. 4.2.1. Both label noise and outliers are present in the data units

A generalized pairs plot of contaminated labelled units under the afore-described Simulation Setup is reported in Fig. 7.
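The data generating process above can be sketched via the Gaussian scale-mixture representation of the multivariate t-distribution. This is a numpy sketch with our own function names, not the code used in the paper; the seed is arbitrary:

```python
import numpy as np

def rmvt(n, mu, Sigma, nu, rng):
    # multivariate t draws: y = mu + z / sqrt(w / nu),
    # with z ~ N(0, Sigma) and w ~ chi^2_nu
    p = len(mu)
    z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    w = rng.chisquare(nu, size=n) / nu
    return mu + z / np.sqrt(w)[:, None]

def sample_mixture(n, tau, mus, Sigmas, nu, rng):
    # draw component labels, then one t-variate per unit
    g = rng.choice(len(tau), size=n, p=tau)
    X = np.vstack([rmvt(1, mus[k], Sigmas[k], nu, rng) for k in g])
    return X, g
```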

Appendix C

This final section presents feasible and computationally efficient algorithms for enforcing the eigenvalue-ratio constraint under the different patterned models in Table 1. In the M-step of the \((k+1)\)th iteration, the goal is to update the estimates of the covariance matrices \(\hat{\varvec{\varSigma }}_g^{(k+1)}={\hat{\lambda }}_g^{(k+1)}\hat{{\varvec{D}}}_g^{(k+1)}\hat{{\varvec{A}}}_g^{(k+1)}\hat{{\varvec{D}}}^{'(k+1)}_g\), \(g=1,\ldots ,G\), such that

$$\begin{aligned} \frac{\max _{g=1\ldots G}\max _{l=1\ldots p}{\hat{\lambda }}_g^{(k+1)}{\hat{a}}_{lg}^{(k+1)}}{\min _{g=1\ldots G}\min _{l=1\ldots p}{\hat{\lambda }}_g^{(k+1)}{\hat{a}}_{lg}^{(k+1)}} \le c \end{aligned}$$
(33)

where \({\hat{a}}_{lg}^{(k+1)}\) indicates the diagonal entries of matrix \(\hat{{\varvec{A}}}_g^{(k+1)}\).
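Checking condition (33) is straightforward; the following is a minimal numpy helper of our own, not part of any package cited here:

```python
import numpy as np

def eigenvalue_ratio(Sigmas):
    # ratio between the largest and smallest eigenvalue across the whole
    # set of group scatter matrices, to be compared with the bound c
    eigs = np.concatenate([np.linalg.eigvalsh(S) for S in Sigmas])
    return eigs.max() / eigs.min()
```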

Denote with \(\hat{\varvec{\varSigma }}_g^{U}={\hat{\lambda }}_g^{U} \hat{{\varvec{D}}}_g^{U}\hat{{\varvec{A}}}_g^{U}\hat{{\varvec{D}}}_g^{'U}\) the estimates of the variance-covariance matrices obtained following Bensmail and Celeux (1996) without enforcing the eigenvalue-ratio restriction in (33). Further, denote with \(\hat{\varvec{\varDelta }}^U_g={\hat{\lambda }}_g^{U}\hat{{\varvec{A}}}_g^{U}\) the matrix of eigenvalues of \(\hat{\varvec{\varSigma }}_g^{U}\), with diagonal entries \({\hat{d}}_{lg}^U={\hat{\lambda }}_g^{U}{\hat{a}}_{lg}^{U}\), \(l=1,\ldots ,p\).

Constrained maximization for VII, VVI and VVV models

  1. Compute \(\varvec{\varDelta }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \hat{\varvec{\varDelta }}^U_1,\ldots ,\hat{\varvec{\varDelta }}^U_G\right\} \), under condition (33).

  2. Set \({\hat{\lambda }}_g^{(k+1)}=|\varvec{\varDelta }_g|^{1/p}\), \(\hat{{\varvec{A}}}_g^{(k+1)}=\frac{1}{{\hat{\lambda }}_g^{(k+1)}}\varvec{\varDelta }_g\), \(\hat{{\varvec{D}}}_g^{(k+1)}=\hat{{\varvec{D}}}_g^{U}.\)
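The truncation step common to the algorithms in this appendix can be illustrated as follows. This sketch replaces each eigenvalue \(d_{lg}\) by \(\min (\max (d_{lg}, m), c\,m)\), choosing the threshold \(m\) by a plain candidate search over \(\{d_{lg}, d_{lg}/c\}\); Fritz et al. (2013) derive the exact closed-form optimum, so this is a simplified stand-in, not their algorithm:

```python
import numpy as np

def truncate_eigenvalues(d, weights, c):
    # d: (G, p) array of unconstrained eigenvalues d_lg
    # weights: group sizes n_g; c: eigenvalue-ratio bound
    cand = np.unique(np.concatenate([d.ravel(), d.ravel() / c]))
    best, best_obj = None, np.inf
    for m in cand:
        dm = np.clip(d, m, c * m)                      # truncated eigenvalues
        obj = np.sum(weights[:, None] * (np.log(dm) + d / dm))
        if obj < best_obj:
            best, best_obj = dm, obj
    return best
```

If the input already satisfies the ratio bound, the identity truncation is among the candidates and is returned unchanged.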

Constrained maximization for VVE model

  1. Compute \(\varvec{\varDelta }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \hat{\varvec{\varDelta }}^U_1,\ldots ,\hat{\varvec{\varDelta }}^U_G\right\} \), under condition (33).

  2. Given \(\varvec{\varDelta }_g\), compute the common principal components \({\varvec{D}}\) via, for example, a majorization-minimization (MM) algorithm (Browne and McNicholas 2014).

  3. Set \({\hat{\lambda }}_g^{(k+1)}=|\varvec{\varDelta }_g|^{1/p}\), \(\hat{{\varvec{A}}}_g^{(k+1)}=\frac{1}{{\hat{\lambda }}_g^{(k+1)}}\varvec{\varDelta }_g\), \(\hat{{\varvec{D}}}_g^{(k+1)}={\varvec{D}}.\)

Constrained maximization for EVI, EVV models

  1. Compute \(\varvec{\varDelta }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \hat{\varvec{\varDelta }}^U_1,\ldots ,\hat{\varvec{\varDelta }}^U_G\right\} \), under condition (33).

  2. Compute \(\varvec{\varDelta }^{\star }_g\) constraining \(\varvec{\varDelta }_g\) such that \(\varvec{\varDelta }^{\star }_g=\lambda ^{\star }{\varvec{A}}_g^{\star }\). That is, constraining \(|\varvec{\varDelta }^{\star }_g|\) to be equal across groups (Maronna and Jacovkis 1974; Gallegos 2002). Details are given in Sect. 3.2 of Fritz et al. (2012).

  3. Iterate 1–2 until (33) is satisfied.

  4. Set \({\hat{\lambda }}_g^{(k+1)}=\lambda ^{\star }\), \(\hat{{\varvec{A}}}_g^{(k+1)}={\varvec{A}}_g^{\star }\), \(\hat{{\varvec{D}}}_g^{(k+1)}=\hat{{\varvec{D}}}_g^{U}.\)

Constrained maximization for EVE model

  1. Compute \(\varvec{\varDelta }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \hat{\varvec{\varDelta }}^U_1,\ldots ,\hat{\varvec{\varDelta }}^U_G\right\} \), under condition (33).

  2. Compute \(\varvec{\varDelta }^{\star }_g\) constraining \(\varvec{\varDelta }_g\) such that \(\varvec{\varDelta }^{\star }_g=\lambda ^{\star }{\varvec{A}}_g^{\star }\). Details are given in Sect. 3.2 of Fritz et al. (2012).

  3. Iterate 1–2 until (33) is satisfied.

  4. Given \({\varvec{A}}^{\star }_g\), compute the common principal components \({\varvec{D}}\) via, for example, a majorization-minimization (MM) algorithm (Browne and McNicholas 2014).

  5. Set \({\hat{\lambda }}_g^{(k+1)}=\lambda ^{\star }\), \(\hat{{\varvec{A}}}_g^{(k+1)}={\varvec{A}}_g^{\star }\), \(\hat{{\varvec{D}}}_g^{(k+1)}={\varvec{D}}\).

Constrained maximization for VEI, VEV models

  1. Set \(\varvec{\varDelta }_g=\hat{\varvec{\varDelta }}^U_g\).

  2. Set \(\lambda _g^{\star }={\hat{\lambda }}_g^{U}\), \(g=1,\ldots ,G\).

  3. Compute \(\varvec{\varDelta }^{\star }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \varvec{\varDelta }_1,\ldots ,\varvec{\varDelta }_G\right\} \), under condition (33).

  4. Compute \({\varvec{A}}^{\star }=\left. \sum _{g=1}^G\frac{1}{\lambda _g^{\star }}\varvec{\varDelta }^{\star }_g \Bigg / \left| \sum _{g=1}^G\frac{1}{\lambda _g^{\star }}\varvec{\varDelta }^{\star }_g \right| ^{1/p} \right. \).

  5. Compute \(\lambda _g^{\star }=\frac{1}{p}tr\left( \varvec{\varDelta }^{\star }_g {{\varvec{A}}^{\star }}^{-1}\right) .\)

  6. Set \(\varvec{\varDelta }_g=\lambda _g^{\star }{\varvec{A}}^{\star }\).

  7. Iterate 3–6 until (33) is satisfied.

  8. Set \({\hat{\lambda }}_g^{(k+1)}=\lambda ^{\star }_g\), \(\hat{{\varvec{A}}}_g^{(k+1)}={\varvec{A}}^{\star }\), \(\hat{{\varvec{D}}}_g^{(k+1)}=\hat{{\varvec{D}}}_g^{U}\).
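Steps 4–5 above admit a direct translation into code. The following is an illustrative numpy sketch of a single pass, omitting the truncation of step 3; function and variable names are our own:

```python
import numpy as np

def pooled_shape_and_volumes(Deltas, lams):
    # steps 4-5 for the VEI / VEV models: pooled shape matrix A* with
    # unit determinant, then updated volumes lambda*_g
    p = Deltas[0].shape[0]
    S = sum(D / l for D, l in zip(Deltas, lams))
    A_star = S / np.linalg.det(S) ** (1.0 / p)     # |A*| = 1
    lam_star = [np.trace(D @ np.linalg.inv(A_star)) / p for D in Deltas]
    return A_star, lam_star
```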

Constrained maximization for VEE model

  1. Set \({\varvec{K}}_g=\hat{\varvec{\varSigma }}^U_g\).

  2. Set \(\lambda _g^{\star }={\hat{\lambda }}_g^{U}\), \(g=1,\ldots ,G\).

  3. Compute \({\varvec{K}}_g^{\star }\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ {\varvec{K}}_1,\ldots ,{\varvec{K}}_G \right\} \), under condition (33).

  4. Compute \({\varvec{C}}^{\star }=\left. \sum _{g=1}^G\frac{1}{\lambda _g^{\star }}{\varvec{K}}^{\star }_g \Bigg / \left| \sum _{g=1}^G\frac{1}{\lambda _g^{\star }}{\varvec{K}}^{\star }_g \right| ^{1/p} \right. \).

  5. Compute \(\lambda _g^{\star }=\frac{1}{p}tr\left( {\varvec{K}}^{\star }_g {{\varvec{C}}^{\star }}^{-1}\right) \).

  6. Set \({\varvec{K}}_g=\lambda _g^{\star }{\varvec{C}}^{\star }\).

  7. Iterate 3–6 until (33) is satisfied.

  8. Considering the spectral decomposition \({\varvec{C}}^{\star }={\varvec{D}}^{\star }{\varvec{A}}^{\star }{{\varvec{D}}^{\star }}^{'}\), set \({\hat{\lambda }}_g^{(k+1)}=\lambda ^{\star }_g\), \(\hat{{\varvec{A}}}_g^{(k+1)}={\varvec{A}}^{\star }\), \(\hat{{\varvec{D}}}_g^{(k+1)}={\varvec{D}}^{\star }\).

Cite this article

Cappozzo, A., Greselin, F. & Murphy, T.B. A robust approach to model-based classification based on trimming and constraints. Adv Data Anal Classif 14, 327–354 (2020). https://doi.org/10.1007/s11634-019-00371-w

Keywords

  • Model-based classification
  • Label noise
  • Outliers detection
  • Impartial trimming
  • Eigenvalues restrictions
  • Robust estimation

Mathematics Subject Classification

  • 62H30
  • 62F35