A robust approach to model-based classification based on trimming and constraints

Semi-supervised learning in presence of outliers and label noise

Abstract

In a standard classification framework, a set of trustworthy learning data is employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Unreliable labelled observations, namely outliers and data with incorrect labels, can therefore strongly undermine the classifier performance, especially when the training size is small. The present work introduces a robust modification to the model-based classification framework, employing impartial trimming and constraints on the ratio between the maximum and the minimum eigenvalue of the group scatter matrices. The proposed method effectively handles noise in both the response and the explanatory variables, providing reliable classification even for contaminated datasets. A robust information criterion is proposed for model selection. Experiments on real and simulated data, artificially adulterated, illustrate the benefits of the proposed method.

References

  1. Aitken AC (1926) A series formula for the roots of algebraic and transcendental equations. Proc R Soc Edinb 45(01):14–22

  2. Alimentarius C (2001) Revised codex standard for honey. Codex stan 12:1982

  3. Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3):803

  4. Bensmail H, Celeux G (1996) Regularized Gaussian discriminant analysis through eigenvalue decomposition. J Am Stat Assoc 91(436):1743–1748

  5. Böhning D, Dietz E, Schaub R, Schlattmann P, Lindsay BG (1994) The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Ann Inst Stat Math 46(2):373–388

  6. Bouveyron C, Girard S (2009) Robust supervised classification with mixture models: learning from data with uncertain labels. Pattern Recognit 42(11):2649–2658

  7. Browne RP, McNicholas PD (2014) Estimating common principal components in high dimensions. Adv Data Anal Classif 8:217–226

  8. Cattell RB (1966) The scree test for the number of factors. Multivar Behav Res 1(2):245–276

  9. Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recognit 28(5):781–793

  10. Cerioli A, García-Escudero LA, Mayo-Iscar A, Riani M (2018) Finding the number of normal groups in model-based clustering via constrained likelihoods. J Comput Gr Stat 27(2):404–416

  11. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  12. Cuesta-Albertos JA, Gordaliza A, Matrán C (1997) Trimmed k-means: an attempt to robustify quantizers. Ann Stat 25(2):553–576

  13. Dean N, Murphy TB, Downey G (2006) Using unlabelled data to update classification rules with applications in food authenticity studies. J R Stat Soc Ser C Appl Stat 55(1):1–14

  14. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 39(1):1–38

  15. Dotto F, Farcomeni A (2019) Robust inference for parsimonious model-based clustering. J Stat Comput Simul 89(3):414–442

  16. Dotto F, Farcomeni A, García-Escudero LA, Mayo-Iscar A (2018) A reweighting approach to robust clustering. Stat Comput 28(2):477–493

  17. Downey G (1996) Authentication of food and food ingredients by near infrared spectroscopy. J Near Infrared Spectrosc 4(1):47

  18. Scrucca L, Fop M, Murphy TB, Raftery AE (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):289–317

  19. Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631

  20. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139

  21. Fritz H, García-Escudero LA, Mayo-Iscar A (2012) tclust: an R package for a trimming approach to cluster analysis. J Stat Softw 47(12):1–26

  22. Fritz H, García-Escudero LA, Mayo-Iscar A (2013) A fast algorithm for robust constrained clustering. Comput Stat Data Anal 61:124–136

  23. Gallegos MT (2002) Maximum likelihood clustering with outliers. In: Classification, clustering, and data analysis, Springer, pp 247–255

  24. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2008) A general trimming approach to robust cluster Analysis. Ann Stat 36(3):1324–1345

  25. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2010) A review of robust clustering methods. Adv Data Anal Classif 4(2–3):89–109

  26. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2011) Exploring the number of groups in robust model-based clustering. Stat Comput 21(4):585–599

  27. García-Escudero LA, Gordaliza A, Mayo-Iscar A (2014) A constrained robust proposal for mixture modeling avoiding spurious solutions. Adv Data Anal Classif 8(1):27–43

  28. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2015) Avoiding spurious local maximizers in mixture modeling. Stat Comput 25(3):619–633

  29. García-Escudero LA, Gordaliza A, Greselin F, Ingrassia S, Mayo-Iscar A (2016) The joint role of trimming and constraints in robust estimation for mixtures of Gaussian factor analyzers. Comput Stat Data Anal 99:131–147

  30. García-Escudero LA, Gordaliza A, Greselin F, Ingrassia S, Mayo-Iscar A (2017) Eigenvalues and constraints in mixture modeling: geometric and computational issues. Adv Data Anal Classif 12:1–31

  31. Gordaliza A (1991a) Best approximations to random variables based on trimming procedures. J Approx Theory 64(2):162–180

  32. Gordaliza A (1991b) On the breakdown point of multivariate location estimators based on trimming procedures. Stat Probab Lett 11(5):387–394

  33. Hastie T, Tibshirani R (1996) Discriminant analysis by Gaussian mixtures. J R Stat Soc Ser B (Methodol) 58(1):155–176

  34. Hawkins DM, McLachlan GJ (1997) High-breakdown linear discriminant analysis. J Am Stat Assoc 92(437):136

  35. Hickey RJ (1996) Noise modelling and evaluating learning from examples. Artif Intell 82(1–2):157–179

  36. Hubert M, Debruyne M, Rousseeuw PJ (2018) Minimum covariance determinant and extensions. Wiley Interdiscip Rev Comput Stat 10(3):1–11

  37. Ingrassia S (2004) A likelihood-based constrained algorithm for multivariate normal mixture models. Stat Methods Appl 13(2):151–166

  38. Kelly JD, Petisco C, Downey G (2006) Application of Fourier transform midinfrared spectroscopy to the discrimination between Irish artisanal honey and such honey adulterated with various sugar syrups. J Agric Food Chem 54(17):6166–6171

  39. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, New York

  40. Maronna R, Jacovkis PM (1974) Multivariate clustering procedures with variable metrics. Biometrics 30(3):499

  41. McLachlan GJ (1992) Discriminant analysis and statistical pattern recognition, vol 544. Wiley series in probability and statistics. Wiley, Hoboken

  42. McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions, vol 54. Wiley series in probability and statistics. Wiley, Hoboken

  43. McLachlan GJ, Peel D (1998) Robust cluster analysis via mixtures of multivariate t-distributions. In: Joint IAPR international workshops on statistical techniques in pattern recognition and structural and syntactic pattern recognition. Springer, Berlin, pp 658–666

  44. McNicholas PD (2016) Mixture model-based classification. CRC Press, Boca Raton

  45. Menardi G (2011) Density-based Silhouette diagnostics for clustering methods. Stat Comput 21(3):295–308

  46. Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52(1):299–308

  47. Peel D, McLachlan GJ (2000) Robust mixture modelling using the t distribution. Stat Comput 10(4):339–348

  48. Prati RC, Luengo J, Herrera F (2019) Emerging topics and challenges of learning from noisy data in nonstandard classification: a survey beyond binary class noise. Knowl Inf Syst 60(1):63–97

  49. R Core Team (2018) R: a language and environment for statistical computing

  50. Rousseeuw PJ, Driessen KV (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223

  51. Russell N, Cribbin L, Murphy TB (2014) upclass: an R package for updating model-based classification rules. Cran R-Project Org

  52. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  53. Thomson G (1939) The factorial analysis of human ability. Br J Educ Psychol 9(2):188–195

  54. Vanden Branden K, Hubert M (2005) Robust classification in high dimensions based on the SIMCA Method. Chemom Intell Lab Syst 79(1–2):10–21

  55. Wu X (1995) Knowledge acquisition from databases. Intellect books, Westport

  56. Zhu X, Wu X (2004) Class noise vs. attribute noise: a quantitative study. Artif Intell Rev 22(3):177–210

Acknowledgements

The authors are very grateful to Agustin Mayo-Iscar and Luis Angel García Escudero for stimulating discussions and advice on how to enforce the eigenvalue-ratio constraints under the different patterned models. Andrea Cappozzo deeply thanks Michael Fop for his endless patience and guidance in helping him with the methodological and computational issues encountered while drafting the present manuscript. Brendan Murphy’s work is supported by the Science Foundation Ireland Insight Research Centre (SFI/12/RC/2289_P2).

Author information

Corresponding author

Correspondence to Andrea Cappozzo.

Appendices

Appendix A

Proof of Proposition 1

Considering the random variable \({\mathcal {Z}}_{mg}\) corresponding to \(z_{mg}\), the E-step on the \((k+1)\)th iteration requires the calculation of the conditional expectation of \({\mathcal {Z}}_{mg}\) given \({\mathbf {y}}_m\):

$$\begin{aligned} \begin{aligned} E _{\hat{\varvec{\theta }}^{(k)}}({\mathcal {Z}}_{mg}|{\mathbf {y}}_m)&={\mathbb {P}}\left( {\mathcal {Z}}_{mg}=1|{\mathbf {y}}_m;{\hat{\theta }}^{(k)}\right) \\&=\frac{{\mathbb {P}}\left( {\mathbf {y}}_m|{\mathcal {Z}}_{mg}=1;{\hat{\theta }}^{(k)}\right) {\mathbb {P}}\left( {\mathcal {Z}}_{mg}=1;{\hat{\theta }}^{(k)}\right) }{\sum _{j=1}^G {\mathbb {P}}\left( {\mathbf {y}}_m|{\mathcal {Z}}_{mj}=1;{\hat{\theta }}^{(k)}\right) {\mathbb {P}}\left( {\mathcal {Z}}_{mj}=1;{\hat{\theta }}^{(k)}\right) }\\&=\frac{{\hat{\tau }}^{(k)}_g \phi \left( {\mathbf {y}}_m; \hat{\varvec{\mu }}^{(k)}_g, \hat{\varvec{\varSigma }}^{(k)}_g \right) }{\sum _{j=1}^G{\hat{\tau }}_j^{(k)} \phi \left( {\mathbf {y}}_m; \hat{\varvec{\mu }}^{(k)}_j, \hat{\varvec{\varSigma }}^{(k)}_j\right) }\\&={\hat{z}}_{mg}^{(k+1)} \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, g=1,\ldots , G; \,\,\,\, m=1,\ldots , M. \end{aligned} \end{aligned}$$
(23)
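In code, the E-step in (23) amounts to computing the posterior membership probabilities on the log scale. The following is a minimal numpy sketch, not the authors' implementation; function names and interfaces are our own, and the max-shift makes the normalisation numerically stable:

```python
import numpy as np

def log_gauss(Y, mu, Sigma):
    # log-density log phi(y_m; mu, Sigma) for each row of Y
    p = Y.shape[1]
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, (Y - mu).T)           # p x M
    maha = np.sum(sol ** 2, axis=0)                # Mahalanobis distances
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

def e_step(Y, tau, mus, Sigmas):
    # posterior probabilities z_hat[m, g] as in Eq. (23)
    logp = np.column_stack([np.log(tau[g]) + log_gauss(Y, mus[g], Sigmas[g])
                            for g in range(len(tau))])
    logp -= logp.max(axis=1, keepdims=True)        # log-sum-exp stabilisation
    z = np.exp(logp)
    return z / z.sum(axis=1, keepdims=True)
```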

Therefore, the Q function, to be maximized with respect to \(\varvec{\theta }\) in the M-step, is given by

$$\begin{aligned} \begin{aligned} Q(\varvec{\theta };\hat{\varvec{\theta }}^{(k)})&= \sum _{n=1}^N \zeta ({\mathbf {x}}_n) \sum _{g=1}^G l_{ng} \log {\left[ \tau _g \phi ({\mathbf {x}}_n; \varvec{\mu }_g, \varvec{\varSigma }_g)\right] } \\&\quad +\, \sum _{m=1}^M \varphi ({\mathbf {y}}_m) \sum _{g=1}^G \hat{z}_{mg}^{(k+1)} \log {\left[ \tau _g \phi ({\mathbf {y}}_m; \varvec{\mu }_g, \varvec{\varSigma }_g)\right] .} \end{aligned} \end{aligned}$$
(24)

The maximization of (24) with respect to the mixing proportions \(\tau _g\), under the constraint \(\sum _{j=1}^G\tau _j=1\), is carried out by considering the Lagrangian \({\mathcal {L}}(\varvec{\theta }, \kappa )\):

$$\begin{aligned} {\mathcal {L}}(\varvec{\theta }, \kappa )=Q\left( \varvec{\theta };\hat{\varvec{\theta }}^{(k)}\right) -\kappa \left( \sum _{j=1}^G\tau _j-1\right) \end{aligned}$$
(25)

with \(\kappa \) the Lagrange multiplier. The partial derivative of (25) with respect to \(\tau _g\) has the form:

$$\begin{aligned} \frac{\partial }{\partial \tau _g}{\mathcal {L}}(\varvec{\theta }, \kappa )=\frac{\sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}}{\tau _g}+ \frac{\sum _{m=1}^M \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}}{\tau _g}-\kappa \end{aligned}$$
(26)

and setting (26) equal to 0 for all \(g=1,\ldots , G\) we obtain:

$$\begin{aligned} \sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}+ \sum _{m=1}^M \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}-\kappa \tau _g=0. \end{aligned}$$
(27)

Summing (27) over g, \(g=1,\ldots , G\), provides the value of \(\kappa =\lceil N(1-\alpha _{l})\rceil +\lceil M(1-\alpha _{u})\rceil \) and substituting it in the previous expression yields the ML estimate for \(\tau _g\):

$$\begin{aligned} {\hat{\tau }}_g^{(k+1)}=\frac{\sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}+ \sum _{m=1}^M \varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}}{\lceil N(1-\alpha _{l})\rceil +\lceil M(1-\alpha _{u})\rceil }\,\,\,\,\, g=1,\ldots , G. \end{aligned}$$
(28)

The partial derivative of (24) with respect to the mean vector \(\varvec{\mu }_g\) reads:

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial \varvec{\mu }_g}Q\left( \varvec{\theta };\hat{\varvec{\theta }}^{(k)}\right)&= \varvec{\varSigma }_g^{-1}\left[ \sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}\left( {\mathbf {x}}_n-\varvec{\mu }_g\right) +\sum _{m=1}^M \varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}\left( {\mathbf {y}}_m-\varvec{\mu }_g\right) \right] \\&=\varvec{\varSigma }_g^{-1}\left[ \sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}{\mathbf {x}}_n + \sum _{m=1}^M \varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}{\mathbf {y}}_m \right. \\&\left. \quad -\varvec{\mu }_g\left( \sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^M \varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)} \right) \right] . \end{aligned} \end{aligned}$$
(29)

Equating (29) to 0 and rearranging terms provides the ML estimate of \(\varvec{\mu }_g\):

$$\begin{aligned} \hat{\varvec{\mu }}_g^{(k+1)}=\frac{\sum _{n=1}^N \zeta ({\mathbf {x}}_n)l_{ng}{\mathbf {x}}_n+\sum _{m=1}^M\varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}{\mathbf {y}}_m}{\sum _{n=1}^N\zeta ({\mathbf {x}}_n)l_{ng}+\sum _{m=1}^M\varphi ({\mathbf {y}}_m){\hat{z}}_{mg}^{(k+1)}}\,\,\,\,\, g=1,\ldots , G. \end{aligned}$$
(30)
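The closed-form updates (28) and (30) can be sketched as follows. This is a hypothetical numpy illustration under our own naming conventions: `zeta` and `phi` are the 0/1 trimming indicators \(\zeta ({\mathbf {x}}_n)\) and \(\varphi ({\mathbf {y}}_m)\), `L` the label matrix \((l_{ng})\) and `Z` the posterior matrix \(({\hat{z}}_{mg})\):

```python
import numpy as np

def m_step_tau_mu(X, L, zeta, Y, Z, phi, alpha_l, alpha_u):
    # tau_g update of Eq. (28) and mu_g update of Eq. (30)
    N, M = X.shape[0], Y.shape[0]
    wX = zeta[:, None] * L          # zeta(x_n) * l_ng
    wY = phi[:, None] * Z           # phi(y_m) * z_hat_mg
    denom = np.ceil(N * (1 - alpha_l)) + np.ceil(M * (1 - alpha_u))
    tau = (wX.sum(axis=0) + wY.sum(axis=0)) / denom
    # weighted means of retained labelled and unlabelled units
    mu = (wX.T @ X + wY.T @ Y) / (wX.sum(axis=0) + wY.sum(axis=0))[:, None]
    return tau, mu
```

With no trimming (\(\alpha _l=\alpha _u=0\)) the estimated proportions sum to one, as they should.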

Discarding quantities that do not depend on \(\varvec{\varSigma }_g\), we can rewrite (24) as follows:

$$\begin{aligned}&\sum _{n=1}^{N} \sum _{g=1}^{G} \zeta ({\mathbf {x}}_n)l_{ng}\left[ -\log \left| \varvec{\varSigma }_{g}\right| ^{1 / 2}-\frac{1}{2}\left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) ^{\prime } \varvec{\varSigma }_{g}^{-1}\left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) \right] \nonumber \\&\qquad +\sum _{m=1}^{M} \sum _{g=1}^{G} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}\left[ -\log \left| \varvec{\varSigma }_{g}\right| ^{1 / 2}-\frac{1}{2}\left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) ^{\prime } \varvec{\varSigma }_{g}^{-1}\left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) \right] \nonumber \\&\quad =-\frac{1}{2}\left[ \sum _{n=1}^{N} \sum _{g=1}^{G} \zeta ({\mathbf {x}}_n)l_{ng} \log \left| \varvec{\varSigma }_{g}\right| +\sum _{n=1}^{N} \sum _{g=1}^{G} \zeta ({\mathbf {x}}_n)l_{ng} \underbrace{\left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) ^{\prime } \varvec{\varSigma }_{g}^{-1}\left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) }_{\text{ a } \text{ scalar } }\right. \nonumber \\&\qquad +\sum _{m=1}^{M} \sum _{g=1}^{G} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \log \left| \varvec{\varSigma }_{g}\right| \nonumber \\&\qquad \left. +\sum _{m=1}^{M} \sum _{g=1}^{G} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}\underbrace{\left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) ^{\prime } \varvec{\varSigma }_{g}^{-1}\left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) }_{ \text{ a } \text{ scalar } }\right] \nonumber \\&\quad =-\frac{1}{2}\left[ \sum _{g=1}^{G} \log \left| \varvec{\varSigma }_{g}\right| \left( \sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \right) \right. \nonumber \\&\qquad +\sum _{n=1}^{N} \sum _{g=1}^{G} \zeta ({\mathbf {x}}_n)l_{ng}\, tr \left[ \varvec{\varSigma }_{g}^{-1} \left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) \left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) ^{\prime }\right] \nonumber \\&\qquad \left. + \sum _{m=1}^{M} \sum _{g=1}^{G} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}\, tr \left[ \varvec{\varSigma }_{g}^{-1} \left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) \left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) ^{\prime }\right] \right] \nonumber \\&\quad =-\frac{1}{2}\left[ \sum _{g=1}^{G} \log \left| \varvec{\varSigma }_{g}\right| \left( \sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \right) +\sum _{g=1}^{G} tr \left[ \varvec{\varSigma }^{-1}_{g}{\varvec{W}}_g^{X} \right] + \sum _{g=1}^{G} tr \left[ \varvec{\varSigma }_{g}^{-1}{\varvec{W}}_g^{Y} \right] \right] \nonumber \\&\quad =-\frac{1}{2}\left[ \sum _{g=1}^{G} \log \left| \varvec{\varSigma }_{g}\right| \left( \sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \right) + \sum _{g=1}^{G} tr \left[ \varvec{\varSigma }^{-1}_{g}\left( {\varvec{W}}_g^{X} + {\varvec{W}}_g^{Y}\right) \right] \right] \end{aligned}$$
(31)

where \({\varvec{W}}_g^{X}=\sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng}\left[ \left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) \left( {\mathbf {x}}_{n}-\varvec{\mu }_{g}\right) ^{\prime }\right] \) and \({\varvec{W}}_g^{Y}=\sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)}\left[ \left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) \left( {\mathbf {y}}_{m}-\varvec{\mu }_{g}\right) ^{\prime }\right] \). Finally, considering the eigenvalue decomposition \(\varvec{\varSigma }_g=\lambda _g{\varvec{D}}_g{\varvec{A}}_g{\varvec{D}}^{'}_g\), (31) simplifies to:

$$\begin{aligned} \begin{aligned}&-\frac{1}{2}\left[ \sum _{g=1}^{G} p \log \lambda _g \left( \sum _{n=1}^{N} \zeta ({\mathbf {x}}_n)l_{ng} + \sum _{m=1}^{M} \varphi ({\mathbf {y}}_m)\hat{z}_{mg}^{(k+1)} \right) \right. \\&\quad + \left. \sum _{g=1}^{G} \frac{1}{\lambda _g} tr \left[ {\varvec{D}}_{g} {\varvec{A}}_g^{-1} {\varvec{D}}_{g}^{\prime } \left( {\varvec{W}}_g^{X} + {\varvec{W}}_g^{Y}\right) \right] \right] \end{aligned} \end{aligned}$$
(32)

The partial derivative of (32) with respect to \(\left( \lambda _g,{\varvec{A}}_g, {\varvec{D}}_g\right) \) depends on the patterned structure under consideration: for a thorough derivation the reader is referred to Bensmail and Celeux (1996). If (8) is not satisfied, the constraints are enforced as detailed in “Appendix C”. Lastly, notice that in the concentration step the optimal observations of both training and test sets are retained, namely those with the highest contribution to the objective function.
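The concentration step can be sketched as follows; this is an illustrative numpy fragment of our own, assuming the per-observation contributions to the objective function have already been computed:

```python
import numpy as np

def concentration_step(contrib, alpha):
    # keep the ceil(n * (1 - alpha)) observations with the largest
    # contribution to the objective function; returns a 0/1 indicator
    n = len(contrib)
    keep = int(np.ceil(n * (1 - alpha)))
    ind = np.zeros(n)
    ind[np.argsort(contrib)[::-1][:keep]] = 1.0
    return ind
```

Applied separately to the training and test contributions, this yields the indicators \(\zeta (\cdot )\) and \(\varphi (\cdot )\) with trimming levels \(\alpha _l\) and \(\alpha _u\).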

The procedure described above falls within the structure of a general EM algorithm, for which the likelihood function does not decrease after an EM iteration, as shown in Dempster et al. (1977) and reported on page 78 of McLachlan and Krishnan (2008). \(\square \)

Appendix B

This appendix details the structure of the simulation study in Sect. 4.2.1. We consider a data generating process given by a mixture of \(G=4\) multivariate t-distributed components (McLachlan and Peel 1998; Peel and McLachlan 2000), with the following parameters:

$$\begin{aligned}&\varvec{\tau }=(0.2, 0.4, 0.1, 0.3)', \quad \nu =6, \\&\varvec{\mu }_1=(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)', \\&\varvec{\mu }_2=(4, -4, 4, -4, 4, -4, 4, -4, 4, -4)', \\&\varvec{\mu }_3=(0, 0, 7, 7, 7, 3, 6, 8, -4, -4)', \\&\varvec{\mu }_4=(8, 0, 8, 0, 8, 0, 8, 0, 8, 0)', \\&\varvec{\varSigma }_1 = diag(1,1,1,1,1,1,1,1,1,1), \\&\varvec{\varSigma }_2 = diag(2,2,2,2,2,2,2,2,2,2), \\&\varvec{\varSigma }_3 = \varvec{\varSigma }_4 = \begin{bmatrix} 5.05&1.26&-0.35&-0.00&-1.04&-1.35&0.29&0.07&0.69&1.17 \\ 1.26&2.57&0.17&0.00&0.27&0.11&0.61&0.11&0.59&0.89 \\ -0.35&0.17&6.74&-0.00&-0.26&-0.31&-0.01&0.00&0.08&0.14 \\ -0.00&0.00&-0.00&5.47&-0.00&-0.00&0.00&0.00&0.00&0.00 \\ -1.04&0.27&-0.26&-0.00&6.80&-0.76&-0.12&-0.01&0.09&0.21 \\ -1.35&0.11&-0.31&-0.00&-0.76&7.75&-0.26&-0.04&-0.03&0.03 \\ 0.29&0.61&-0.01&0.00&-0.12&-0.26&4.76&0.06&0.38&0.60 \\ 0.07&0.11&0.00&0.00&-0.01&-0.04&0.06&4.18&0.07&0.11 \\ 0.69&0.59&0.08&0.00&0.09&-0.03&0.38&0.07&3.23&0.60 \\ 1.17&0.89&0.14&0.00&0.21&0.03&0.60&0.11&0.60&3.24 \\ \end{bmatrix}. \end{aligned}$$

Fig. 7 Generalized pairs plot of the simulated data under the simulation setup described in Sect. 4.2.1. Both label noise and outliers are present in the data units

A generalized pairs plot of contaminated labelled units under the afore-described Simulation Setup is reported in Fig. 7.
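The data generating process above can be sketched via the Gaussian scale-mixture representation of the multivariate t-distribution. This is a numpy sketch with our own function names, not the code used in the paper; the seed is arbitrary:

```python
import numpy as np

def rmvt(n, mu, Sigma, nu, rng):
    # multivariate t draws: y = mu + z / sqrt(w / nu),
    # with z ~ N(0, Sigma) and w ~ chi^2_nu
    p = len(mu)
    z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    w = rng.chisquare(nu, size=n) / nu
    return mu + z / np.sqrt(w)[:, None]

def sample_mixture(n, tau, mus, Sigmas, nu, rng):
    # draw component labels, then one t-variate per unit
    g = rng.choice(len(tau), size=n, p=tau)
    X = np.vstack([rmvt(1, mus[k], Sigmas[k], nu, rng) for k in g])
    return X, g
```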

Appendix C

This final section presents feasible and computationally efficient algorithms for enforcing the eigenvalue-ratio constraint under the different patterned models in Table 1. In the M-step of the \((k+1)\)th iteration, the goal is to update the estimates of the covariance matrices \(\hat{\varvec{\varSigma }}_g^{(k+1)}={\hat{\lambda }}_g^{(k+1)}\hat{{\varvec{D}}}_g^{(k+1)}\hat{{\varvec{A}}}_g^{(k+1)}\hat{{\varvec{D}}}^{'(k+1)}_g\), \(g=1,\ldots ,G\), such that

$$\begin{aligned} \frac{\max _{g=1\ldots G}\max _{l=1\ldots p}{\hat{\lambda }}_g^{(k+1)}{\hat{a}}_{lg}^{(k+1)}}{\min _{g=1\ldots G}\min _{l=1\ldots p}{\hat{\lambda }}_g^{(k+1)}{\hat{a}}_{lg}^{(k+1)}} \le c \end{aligned}$$
(33)

where \({\hat{a}}_{lg}^{(k+1)}\) indicates the diagonal entries of matrix \(\hat{{\varvec{A}}}_g^{(k+1)}\).
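Checking condition (33) is straightforward; the following is a minimal numpy helper of our own, not part of any package cited here:

```python
import numpy as np

def eigenvalue_ratio(Sigmas):
    # ratio between the largest and smallest eigenvalue across the whole
    # set of group scatter matrices, to be compared with the bound c
    eigs = np.concatenate([np.linalg.eigvalsh(S) for S in Sigmas])
    return eigs.max() / eigs.min()
```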

Denote with \(\hat{\varvec{\varSigma }}_g^{U}={\hat{\lambda }}_g^{U} \hat{{\varvec{D}}}_g^{U}\hat{{\varvec{A}}}_g^{U}\hat{{\varvec{D}}}_g^{'U}\) the estimates of the variance-covariance matrices obtained following Bensmail and Celeux (1996) without enforcing the eigenvalue-ratio restriction in (33). Further, denote with \(\hat{\varvec{\varDelta }}^U_g={\hat{\lambda }}_g^{U}\hat{{\varvec{A}}}_g^{U}\) the matrix of eigenvalues of \(\hat{\varvec{\varSigma }}_g^{U}\), with diagonal entries \({\hat{d}}_{lg}^U={\hat{\lambda }}_g^{U}{\hat{a}}_{lg}^{U}\), \(l=1,\ldots ,p\).

Constrained maximization for VII, VVI and VVV models

  1. Compute \(\varvec{\varDelta }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \hat{\varvec{\varDelta }}^U_1,\ldots ,\hat{\varvec{\varDelta }}^U_G\right\} \), under condition (33).

  2. Set \({\hat{\lambda }}_g^{(k+1)}=|\varvec{\varDelta }_g|^{1/p}\), \(\hat{{\varvec{A}}}_g^{(k+1)}=\frac{1}{{\hat{\lambda }}_g^{(k+1)}}\varvec{\varDelta }_g\), \(\hat{{\varvec{D}}}_g^{(k+1)}=\hat{{\varvec{D}}}_g^{U}.\)
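The truncation step common to the algorithms in this appendix can be illustrated as follows. This sketch replaces each eigenvalue \(d_{lg}\) by \(\min (\max (d_{lg}, m), c\,m)\), choosing the threshold \(m\) by a plain candidate search over \(\{d_{lg}, d_{lg}/c\}\); Fritz et al. (2013) derive the exact closed-form optimum, so this is a simplified stand-in, not their algorithm:

```python
import numpy as np

def truncate_eigenvalues(d, weights, c):
    # d: (G, p) array of unconstrained eigenvalues d_lg
    # weights: group sizes n_g; c: eigenvalue-ratio bound
    cand = np.unique(np.concatenate([d.ravel(), d.ravel() / c]))
    best, best_obj = None, np.inf
    for m in cand:
        dm = np.clip(d, m, c * m)                      # truncated eigenvalues
        obj = np.sum(weights[:, None] * (np.log(dm) + d / dm))
        if obj < best_obj:
            best, best_obj = dm, obj
    return best
```

If the input already satisfies the ratio bound, the identity truncation is among the candidates and is returned unchanged.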

Constrained maximization for VVE model

  1. Compute \(\varvec{\varDelta }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \hat{\varvec{\varDelta }}^U_1,\ldots ,\hat{\varvec{\varDelta }}^U_G\right\} \), under condition (33).

  2. Given \(\varvec{\varDelta }_g\), compute the common principal components \({\varvec{D}}\) via, for example, a majorization-minimization (MM) algorithm (Browne and McNicholas 2014).

  3. Set \({\hat{\lambda }}_g^{(k+1)}=|\varvec{\varDelta }_g|^{1/p}\), \(\hat{{\varvec{A}}}_g^{(k+1)}=\frac{1}{{\hat{\lambda }}_g^{(k+1)}}\varvec{\varDelta }_g\), \(\hat{{\varvec{D}}}_g^{(k+1)}={\varvec{D}}.\)

Constrained maximization for EVI, EVV models

  1. Compute \(\varvec{\varDelta }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \hat{\varvec{\varDelta }}^U_1,\ldots ,\hat{\varvec{\varDelta }}^U_G\right\} \), under condition (33).

  2. Compute \(\varvec{\varDelta }^{\star }_g\) constraining \(\varvec{\varDelta }_g\) such that \(\varvec{\varDelta }^{\star }_g=\lambda ^{\star }{\varvec{A}}_g^{\star }\). That is, constraining \(|\varvec{\varDelta }^{\star }_g|\) to be equal across groups (Maronna and Jacovkis 1974; Gallegos 2002). Details are given in Sect. 3.2 of Fritz et al. (2012).

  3. Iterate 1–2 until (33) is satisfied.

  4. Set \({\hat{\lambda }}_g^{(k+1)}=\lambda ^{\star }\), \(\hat{{\varvec{A}}}_g^{(k+1)}={\varvec{A}}_g^{\star }\), \(\hat{{\varvec{D}}}_g^{(k+1)}=\hat{{\varvec{D}}}_g^{U}.\)

Constrained maximization for EVE model

  1. Compute \(\varvec{\varDelta }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \hat{\varvec{\varDelta }}^U_1,\ldots ,\hat{\varvec{\varDelta }}^U_G\right\} \), under condition (33).

  2. Compute \(\varvec{\varDelta }^{\star }_g\) constraining \(\varvec{\varDelta }_g\) such that \(\varvec{\varDelta }^{\star }_g=\lambda ^{\star }{\varvec{A}}_g^{\star }\). Details are given in Sect. 3.2 of Fritz et al. (2012).

  3. Iterate 1–2 until (33) is satisfied.

  4. Given \({\varvec{A}}^{\star }_g\), compute the common principal components \({\varvec{D}}\) via, for example, a majorization-minimization (MM) algorithm (Browne and McNicholas 2014).

  5. Set \({\hat{\lambda }}_g^{(k+1)}=\lambda ^{\star }\), \(\hat{{\varvec{A}}}_g^{(k+1)}={\varvec{A}}_g^{\star }\), \(\hat{{\varvec{D}}}_g^{(k+1)}={\varvec{D}}\).

Constrained maximization for VEI, VEV models

  1. Set \(\varvec{\varDelta }_g=\hat{\varvec{\varDelta }}^U_g\).

  2. Set \(\lambda _g^{\star }={\hat{\lambda }}_g^{U}\), \(g=1,\ldots ,G\).

  3. Compute \(\varvec{\varDelta }^{\star }_g\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ \varvec{\varDelta }_1,\ldots ,\varvec{\varDelta }_G\right\} \), under condition (33).

  4. Compute \({\varvec{A}}^{\star }=\left. \sum _{g=1}^G\frac{1}{\lambda _g^{\star }}\varvec{\varDelta }^{\star }_g \Bigg / \left| \sum _{g=1}^G\frac{1}{\lambda _g^{\star }}\varvec{\varDelta }^{\star }_g \right| ^{1/p} \right. \).

  5. Compute \(\lambda _g^{\star }=\frac{1}{p}tr\left( \varvec{\varDelta }^{\star }_g {{\varvec{A}}^{\star }}^{-1}\right) .\)

  6. Set \(\varvec{\varDelta }_g=\lambda _g^{\star }{\varvec{A}}^{\star }\).

  7. Iterate 3–6 until (33) is satisfied.

  8. Set \({\hat{\lambda }}_g^{(k+1)}=\lambda ^{\star }_g\), \(\hat{{\varvec{A}}}_g^{(k+1)}={\varvec{A}}^{\star }\), \(\hat{{\varvec{D}}}_g^{(k+1)}=\hat{{\varvec{D}}}_g^{U}\).
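Steps 4–5 above admit a direct translation into code. The following is an illustrative numpy sketch of a single pass, omitting the truncation of step 3; function and variable names are our own:

```python
import numpy as np

def pooled_shape_and_volumes(Deltas, lams):
    # steps 4-5 for the VEI / VEV models: pooled shape matrix A* with
    # unit determinant, then updated volumes lambda*_g
    p = Deltas[0].shape[0]
    S = sum(D / l for D, l in zip(Deltas, lams))
    A_star = S / np.linalg.det(S) ** (1.0 / p)     # |A*| = 1
    lam_star = [np.trace(D @ np.linalg.inv(A_star)) / p for D in Deltas]
    return A_star, lam_star
```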

Constrained maximization for VEE model

  1. Set \({\varvec{K}}_g=\hat{\varvec{\varSigma }}^U_g\).

  2. Set \(\lambda _g^{\star }={\hat{\lambda }}_g^{U}\), \(g=1,\ldots ,G\).

  3. Compute \({\varvec{K}}_g^{\star }\) applying the optimal truncation operator defined in Fritz et al. (2013) to \(\left\{ {\varvec{K}}_1,\ldots ,{\varvec{K}}_G \right\} \), under condition (33).

  4. Compute \({\varvec{C}}^{\star }=\left. \sum _{g=1}^G\frac{1}{\lambda _g^{\star }}{\varvec{K}}^{\star }_g \Bigg / \left| \sum _{g=1}^G\frac{1}{\lambda _g^{\star }}{\varvec{K}}^{\star }_g \right| ^{1/p} \right. \).

  5. Compute \(\lambda _g^{\star }=\frac{1}{p}tr\left( {\varvec{K}}^{\star }_g {{\varvec{C}}^{\star }}^{-1}\right) \).

  6. Set \({\varvec{K}}_g=\lambda _g^{\star }{\varvec{C}}^{\star }\).

  7. Iterate 3–6 until (33) is satisfied.

  8. Considering the spectral decomposition \({\varvec{C}}^{\star }={\varvec{D}}^{\star }{\varvec{A}}^{\star }{{\varvec{D}}^{\star }}^{'}\), set \({\hat{\lambda }}_g^{(k+1)}=\lambda ^{\star }_g\), \(\hat{{\varvec{A}}}_g^{(k+1)}={\varvec{A}}^{\star }\), \(\hat{{\varvec{D}}}_g^{(k+1)}={\varvec{D}}^{\star }\).

Cite this article

Cappozzo, A., Greselin, F. & Murphy, T.B. A robust approach to model-based classification based on trimming and constraints. Adv Data Anal Classif 14, 327–354 (2020). https://doi.org/10.1007/s11634-019-00371-w

Keywords

  • Model-based classification
  • Label noise
  • Outliers detection
  • Impartial trimming
  • Eigenvalues restrictions
  • Robust estimation

Mathematics Subject Classification

  • 62H30
  • 62F35