Robust finite mixture regression for heterogeneous targets

Abstract

Finite Mixture Regression (FMR) refers to the mixture modeling scheme that learns multiple regression models from the training data set, each responsible for a subset of the samples. FMR is an effective scheme for handling sample heterogeneity, where a single regression model cannot capture the complexity of the conditional distribution of the observed samples given the features. In this paper, we propose an FMR model that (1) finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously, (2) achieves shared feature selection among tasks and cluster components, and (3) detects anomaly tasks or clustered structure among tasks, and accommodates outlier samples. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The proposed model is evaluated on both synthetic and real-world data sets. The results show that our model achieves state-of-the-art performance.
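
The following minimal sketch (an added illustration, assuming a plain two-component Gaussian mixture of linear regressions fitted by EM rather than the robust model proposed in this paper; all names are hypothetical) shows the basic FMR idea of each component taking responsibility for a subset of the samples.

```python
import numpy as np

# Minimal FMR sketch: a k-component Gaussian mixture of linear regressions
# fitted by EM. Illustrative only; not the robust model proposed in the paper.
def fmr_em(X, y, k=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = rng.normal(size=(k, d))              # per-component regression coefficients
    sigma2 = np.full(k, y.var() + 1e-8)         # per-component noise variances
    pi = np.full(k, 1.0 / k)                    # mixing proportions
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        resid = y[:, None] - X @ beta.T                                   # (n, k)
        logp = np.log(pi) - 0.5 * np.log(2 * np.pi * sigma2) - 0.5 * resid**2 / sigma2
        logp -= logp.max(axis=1, keepdims=True)                           # numerical stability
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: weighted least squares and variance update per component
        for r in range(k):
            w = gamma[:, r]
            Xw = X * w[:, None]
            beta[r] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(d), Xw.T @ y)
            sigma2[r] = (w * (y - X @ beta[r]) ** 2).sum() / w.sum() + 1e-12
            pi[r] = w.mean()
    return beta, sigma2, pi, gamma
```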

Notes

  1. https://www.cdc.gov/nchs/lsoa/lsoa2.htm.

  2. http://www.share-project.org/data-access-documentation.html.

References

  • Aho K, Derryberry D, Peterson T (2014) Model selection for ecologists: the worldviews of AIC and BIC. Ecology 95(3):631–636

  • Alfò M, Salvati N, Ranalli MG (2016) Finite mixtures of quantile and M-quantile regression models. Stat Comput 27:1–24

  • Argyriou A, Evgeniou T, Pontil M (2007a) Multi-task feature learning. In: Advances in neural information processing systems, pp 41–48

  • Argyriou A, Pontil M, Ying Y, Micchelli CA (2007b) A spectral regularization framework for multi-task structure learning. In: Advances in neural information processing systems, pp 25–32

  • Bai X, Chen K, Yao W (2016) Mixture of linear mixed models using multivariate t distribution. J Stat Comput Simul 86(4):771–787

  • Bartolucci F, Scaccia L (2005) The use of mixtures for dealing with non-normal regression errors. Comput Stat Data Anal 48(4):821–834

  • Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141–148

  • Becker SR, Candès EJ, Grant MC (2011) Templates for convex cone problems with applications to sparse signal recovery. Math Program Comput 3(3):165–218

  • Bhat HS, Kumar N (2010) On the derivation of the Bayesian information criterion. School of Natural Sciences, University of California, Oakland

  • Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37(4):1705–1732

  • Bishop CM (2006) Pattern recognition and machine learning. Springer, New York

  • Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends® Mach Learn 3(1):1–122

  • Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717–772

  • Chen X, Kim S, Lin Q, Carbonell JG, Xing EP (2010) Graph-structured multi-task regression and an efficient optimization method for general fused lasso. ArXiv preprint arXiv:1005.3579

  • Chen J, Zhou J, Ye J (2011) Integrating low-rank and group-sparse structures for robust multi-task learning. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 42–50

  • Chen J, Liu J, Ye J (2012a) Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Trans Knowl Discov Data (TKDD) 5(4):22

  • Chen K, Chan KS, Stenseth NC (2012b) Reduced rank stochastic regression with a sparse singular value decomposition. J R Stat Soc Ser B (Stat Methodol) 74(2):203–221

  • Cover TM, Thomas JA (2012) Elements of information theory. Wiley, Hoboken

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39:1–38

  • Doğru FZ, Arslan O (2016) Robust mixture regression using mixture of different distributions. In: Agostinelli C, Basu A, Filzmoser P, Mukherjee D (eds) Recent advances in robust statistics: theory and applications. Springer, New Delhi, pp 57–79

  • Doğru FZ, Arslan O (2017) Parameter estimation for mixtures of skew Laplace normal distributions and application in mixture regression modeling. Commun Stat Theory Methods 46(21):10,879–10,896

  • Fahrmeir L, Kneib T, Lang S, Marx B (2013) Regression: models, methods and applications. Springer, Berlin

  • Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20(1):101–148

  • Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 186–193

  • Gong P, Ye J, Zhang C (2012a) Robust multi-task feature learning. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 895–903

  • Gong P, Ye J, Zhang C (2012b) Multi-stage multi-task feature learning. In: Advances in neural information processing systems, pp 1988–1996

  • Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36

  • He J, Lawrence R (2011) A graph-based framework for multi-task multi-view learning. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 25–32

  • Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481–499

  • Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3(1):79–87

  • Jacob L, Vert J, Bach FR (2009) Clustered multi-task learning: a convex formulation. In: Advances in neural information processing systems, pp 745–752

  • Jalali A, Sanghavi S, Ruan C, Ravikumar PK (2010) A dirty model for multi-task learning. In: Advances in neural information processing systems, pp 964–972

  • Ji S, Ye J (2009) An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 457–464

  • Jin R, Goswami A, Agrawal G (2006) Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10(1):17–40

  • Jin X, Zhuang F, Pan SJ, Du C, Luo P, He Q (2015) Heterogeneous multi-task semantic feature learning for classification. In: Proceedings of the 24th ACM international on conference on information and knowledge management. ACM, pp 1847–1850

  • Jorgensen B (1987) Exponential dispersion models. J R Stat Soc Ser B (Methodol) 49:127–162

  • Khalili A (2011) An overview of the new feature selection methods in finite mixture of regression models. J Iran Stat Soc 10(2):201–235

  • Khalili A, Chen J (2007) Variable selection in finite mixture of regression models. J Am Stat Assoc 102(479):1025–1038

  • Koller D (1996) Toward optimal feature selection. In: Proceedings of the 13th international conference on machine learning, pp 284–292

  • Kubat M (2015) An introduction to machine learning. Springer, Berlin

  • Kumar A, Daumé III H (2012) Learning task grouping and overlap in multi-task learning. In: Proceedings of the 29th international conference on machine learning. Omnipress, pp 1723–1730

  • Lim H, Narisetty NN, Cheon S (2016) Robust multivariate mixture regression models with incomplete data. J Stat Comput Simul 87:1–20

  • Law MH, Jain AK, Figueiredo M (2002) Feature selection in mixture-based clustering. In: Advances in neural information processing systems, pp 625–632

  • Li S, Liu ZQ, Chan AB (2014) Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 482–489

  • Liu J, Ji S, Ye J (2009) Multi-task feature learning via efficient \(\ell _{2,1}\)-norm minimization. In: Proceedings of the 25th conference on uncertainty in artificial intelligence. AUAI Press, pp 339–348

  • McLachlan G, Peel D (2004) Finite mixture models. Wiley, Hoboken

  • Neal RM, Hinton GE (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in graphical models. Springer, Dordrecht, pp 355–368

  • Nelder JA, Baker RJ (1972) Generalized linear models. Encyclopedia of statistical sciences. Wiley, Hoboken

  • Nesterov Y et al (2007) Gradient methods for minimizing composite objective function. Technical report, UCL

  • Passos A, Rai P, Wainer J, Daumé III H (2012) Flexible modeling of latent task structures in multitask learning. In: Proceedings of the 29th international conference on machine learning. Omnipress, pp 1283–1290

  • Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319

  • She Y, Chen K (2017) Robust reduced-rank regression. Biometrika 104(3):633–647

  • She Y, Owen AB (2011) Outlier detection using nonconvex penalized regression. J Am Stat Assoc 106(494):626–639

  • Städler N, Bühlmann P, Van De Geer S (2010) \(\ell _1\)-penalization for mixture regression models. Test 19(2):209–256

  • Strehl A, Ghosh J (2002a) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3(Dec):583–617

  • Strehl A, Ghosh J (2002b) Cluster ensembles: a knowledge reuse framework for combining partitionings. In: 18th national conference on artificial intelligence. American Association for Artificial Intelligence, pp 93–98

  • Tan Z, Kaddoum R, Wang LY, Wang H (2010) Decision-oriented multi-outcome modeling for anesthesia patients. Open Biomed Eng J 4:113

  • Van de Geer SA (2000) Applications of empirical process theory, vol 91. Cambridge University Press, Cambridge

  • Van Der Maaten L, Postma E, Van den Herik J (2009) Dimensionality reduction: a comparative review. J Mach Learn Res 10:66–71

  • Van Der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes. Springer, Berlin

  • Wedel M, DeSarbo WS (1995) A mixture likelihood approach for generalized linear models. J Classif 12(1):21–55

  • Weruaga L, Vía J (2015) Sparse multivariate gaussian mixture regression. IEEE Trans Neural Netw Learn Syst 26(5):1098–1108

  • Wang HX, Zhang QB, Luo B, Wei S (2004) Robust mixture modelling using multivariate t-distribution with missing information. Pattern Recognit Lett 25(6):701–710

  • Yang X, Kim S, Xing EP (2009) Heterogeneous multitask learning with joint sparsity constraints. In: Advances in neural information processing systems, pp 2151–2159

  • Yuksel SE, Wilson JN, Gader PD (2012) Twenty years of mixture of experts. IEEE Trans Neural Netw Learn Syst 23(8):1177–1193

  • Zhang D, Shen D, Initiative ADN et al (2012) Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59(2):895–907

  • Zhang Y, Yeung DY (2011) Multi-task learning in heterogeneous feature spaces. In: Proceedings of the 25th AAAI conference on artificial intelligence (AAAI-11), San Francisco, CA, p 574

  • Zhou J, Chen J, Ye J (2011) Clustered multi-task learning via alternating structure optimization. In: Advances in neural information processing systems, pp 702–710

Acknowledgements

The authors would like to thank the editors and reviewers for their valuable suggestions on improving this paper. The work of Jian Liang and Changshui Zhang is (jointly or partly) funded by the National Natural Science Foundation of China under Grant No. 61473167 and the Beijing Natural Science Foundation under Grant No. L172037. Kun Chen’s work is partially supported by the U.S. National Science Foundation under Grants DMS-1613295 and IIS-1718798. The work of Fei Wang is supported by the National Science Foundation under Grants IIS-1650723 and IIS-1716432.

Author information

Corresponding author

Correspondence to Jian Liang.

Additional information

Responsible editor: Pauli Miettinen.

Appendices

Appendix A: Definitions

Definition 1

\(Z = (Z_1,\ldots ,Z_{m'})^{\mathrm{T}}\in \mathbb {R}^{m'} \) has a sub-exponential distribution with parameters \((\sigma ,v,t)\) if for \(M>t\), it holds

$$\begin{aligned} \mathbb {P}(\Vert Z\Vert _{\infty }>M)\le \left\{ \begin{array}{ll} \exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ), &{} \quad t\le M\le \frac{\sigma ^2}{v}\\ \exp \biggl (-\frac{M }{v }\biggr ), &{}\quad M>\frac{\sigma ^2}{v}. \end{array} \right. \end{aligned}$$
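
As a simple illustration (an added example, not part of the original definition), a scalar exponential random variable \(Z\) with mean \(v\), for which \(\mathbb {P}(Z>M)=\exp (-M/v)\) for \(M\ge 0\), is sub-exponential with parameters \((\sigma ,v,0)\) for any \(\sigma >0\): indeed

$$\begin{aligned} \exp \biggl (-\frac{M}{v}\biggr )\le \exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ) \quad \text{ for } \ 0< M\le \frac{\sigma ^2}{v}, \end{aligned}$$

and the second branch of the bound holds with equality for \(M>\frac{\sigma ^2}{v}\).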

Appendix B: The empirical process

To prove the first part of Theorem 1, namely that the bound in (26) holds with the probability in (25), we first follow Städler et al. (2010) and define the empirical process for fixed data points \(\mathbf {x}_1,\ldots ,\mathbf {x}_n\). For \(\tilde{\mathbf {y}}_i = (y_{ij}, j\in {\varOmega }_i)^{\mathrm{T}}\in \mathbb {R}^{|{\varOmega }_i|}\) and \(X = (X_1,\ldots ,X_d)\), let

$$\begin{aligned} V_n(\theta ) = \frac{1}{n}\sum _{i=1}^n\left( \ell _{\theta }(\mathbf {x}_i,\tilde{\mathbf {y}}_i)-\mathbb {E}[\ell _{\theta }(\mathbf {x}_i,\tilde{\mathbf {y}}_i)\mid X=\mathbf {x}_i]\right) . \end{aligned}$$

Fixing some \(T\ge 1\) and \(\lambda _0\ge 0\), we define below an event \(\mathcal {T}\) on which the bound in (26) can be proved; the probability of the event \(\mathcal {T}\) is therefore the probability in (25).

$$\begin{aligned} \mathcal {T} = \left\{ \sup _{\theta \in \tilde{{\varTheta }}} \frac{|V_n(\theta )-V_n(\theta _0)|}{(\Vert \varvec{\beta }-\varvec{\beta }_0\Vert _1 + \Vert \eta -\eta _0\Vert _2 )\vee \lambda _0}\le T\lambda _0 \right\} . \end{aligned}$$
(21)

Note that (21) involves all parameters \(\theta \) in the set \(\tilde{{\varTheta }}\), and the bound in (26) will be proved with \(\hat{\theta }\) in this set.

For the group-lasso type estimator, we define an event similar to that in (21) as follows.

$$\begin{aligned} \mathcal {T}_{group} = \left\{ \sup _{\theta \in \tilde{{\varTheta }}} \frac{|V_n(\theta )-V_n(\theta _0)|}{\left( \sum _p\Vert \varvec{\beta }_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _2 + \Vert \eta -\eta _0\Vert _2 \right) \vee \lambda _0}\le T\lambda _0 \right\} . \end{aligned}$$
(22)

Appendix C: Lemmas

To show that the event \(\mathcal {T}\) has large probability, we first invoke the following lemma.

Lemma 2

Under Condition 2, for model (1) with \(\theta _0 \in \tilde{{\varTheta }}\) and with \(M_n\) and \(\lambda _0\) defined in (24), there exist constants \(c_6,c_7\) depending on K such that, for \(n\ge c_7\), we have

$$\begin{aligned} \mathbb {P}_{\mathbf {X}}\left( \frac{1}{n}\sum _{i=1}^nF(\tilde{\mathbf {y}}_i)>c_6\lambda _0^2/(mk)\right) \le \frac{1}{n}, \end{aligned}$$

where \(\mathbb {P}_{\mathbf {X}}\) denotes the conditional probability given \((X_1^{\mathrm{T}},\ldots ,X_n^{\mathrm{T}})^{\mathrm{T}}=(\mathbf {x}_1^{\mathrm{T}},\ldots ,\mathbf {x}_n^{\mathrm{T}})^{\mathrm{T}}= \mathbf {X}\), and \(F(\tilde{\mathbf {y}}_i) = G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\} + \mathbb {E}[G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\}\mid X=\mathbf {x}_i],\forall i\).

A proof is given in “Appendix F” section.

Then we can follow Corollary 1 in Städler et al. (2010) to show below that the event \(\mathcal {T}\) has large probability.

Lemma 3

Assume Lemma 2 holds. For model (1) with \(\theta _0 \in \tilde{{\varTheta }}\), with \(\mathcal {T}\) defined in (21), there exist constants \(c_7,c_8,c_9,c_{10}\) depending on K such that for all \(T\ge c_{10}\) we have

$$\begin{aligned} \mathbb {P}_{\mathbf {X}}(\mathcal {T})\ge 1 - c_9\exp \left( -\frac{T^2(\log n)^2\log (d\vee n)}{c_8}\right) - \frac{1}{n}, \forall n\ge c_7. \end{aligned}$$

A proof is given in “Appendix G” section.

Appendix D: Corollaries for models considering outlier samples

When outlier samples are considered and the natural parameter model is modified as in (11), similar results can be shown; we do so in this section.

First, as \(\varvec{\beta }\) and \(\varvec{\zeta }\) are treated in a similar way, we denote them jointly by \(\varvec{\xi }\in \mathbb {R}^{((d+n)\times m)\times k}\), and \(\xi = vec(\varvec{\xi }) \in \mathbb {R}^{(d+n)mk}\), such that for all \(r = 1,\ldots ,k\),

$$\begin{aligned} \varvec{\varphi }_r= & {} \mathbf {X}\varvec{\beta }_r + \varvec{\zeta }_r \ \Rightarrow \varvec{\varphi }_r = \mathbf {A}\varvec{\xi }_r,\\ \mathbf {A}= & {} [\mathbf {X}, \mathbf {I}_{n}]\in \mathbb {R}^{n\times (d+n)}, \ \varvec{\xi }_r = [\varvec{\beta }_r^{\mathrm{T}},\varvec{\zeta }_r^{\mathrm{T}}]^{\mathrm{T}}\in \mathbb {R}^{ (d+n)\times m}, \end{aligned}$$

where \(\mathbf {I}_{n}\in \mathbb {R}^{n\times n}\) is the identity matrix.

Thus the modification only results in a new design matrix and a new regression coefficient matrix; therefore, we can apply Theorems 1–3 to obtain similar results for the modified models.
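
As an aside, the augmented design can be formed explicitly; the following small sketch (an added illustration with hypothetical names, assuming a NumPy environment) builds \(\mathbf {A} = [\mathbf {X}, \mathbf {I}_{n}]\) and verifies the decomposition \(\mathbf {A}\varvec{\xi }_r = \mathbf {X}\varvec{\beta }_r + \varvec{\zeta }_r\).

```python
import numpy as np

# Illustrative sketch only: augment the design with an identity block so that
# per-sample outlier shifts zeta_r act as extra columns, A = [X, I_n].
def augment_design(X):
    n = X.shape[0]
    return np.hstack([X, np.eye(n)])          # A has shape (n, d + n)

n, d, m = 50, 10, 3                           # hypothetical sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
beta_r = rng.normal(size=(d, m))
zeta_r = np.zeros((n, m))
zeta_r[:2] = 5.0                              # mean shifts marking two outlier samples
A = augment_design(X)
xi_r = np.vstack([beta_r, zeta_r])            # stacked coefficients, shape (d + n, m)
assert np.allclose(A @ xi_r, X @ beta_r + zeta_r)
```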

For lasso-type penalties, denote the set of indices of non-zero entries of \(\beta _0\) by \(S_{\beta }\), and the set of indices of non-zero entries of \(\zeta _0\) by \(S_{\zeta }\), where \(\zeta = \text{ vec }(\varvec{\zeta }_1,\ldots ,\varvec{\zeta }_k)\). Let \(s = |S_{\beta }| + |S_{\zeta }|\). Then for the entry-wise \(\ell _1\) penalties in (5) (for \(\varvec{\beta }\)) with \(\gamma = 0\) and \(\mathcal {R}(\varvec{\zeta }) = \lambda \Vert \zeta \Vert _1\) (for \(\varvec{\zeta }\)), we need the following modified restricted eigenvalue condition.

Condition 6

For all \( \beta \in \mathbb {R}^{dmk}\) and all \( \zeta \in \mathbb {R}^{nmk}\) satisfying \(\Vert \beta _{S_{\beta }^c}\Vert _1 + \Vert \zeta _{S_{\zeta }^c}\Vert _1 \le 6(\Vert \beta _{S_{\beta }}\Vert _1+\Vert \zeta _{S_{\zeta }}\Vert _1)\), it holds for some constant \(\kappa \ge 1\) that,

$$\begin{aligned} \Vert \beta _{S_{\beta }}\Vert _2^2 + \Vert \zeta _{S_{\zeta }}\Vert _2^2 \le \kappa ^2 \Vert \varphi \Vert _{Q_n}^2 = \frac{\kappa ^2}{n}\sum _{i=1}^n\sum _{j\in {\varOmega }_i}\sum _{r=1}^k (\mathbf {x}_i\varvec{\beta }_{jr}+\zeta _{ijr})^2. \end{aligned}$$

Corollary 1

Consider the Hermit  model in (1) with \(\theta _0\in \tilde{{\varTheta }}\), and consider the penalized estimator (12) with the \(\ell _1\) penalties in (5) and \(\mathcal {R}(\varvec{\zeta }) = \lambda \Vert \zeta \Vert _1\).

  1. (a)

    Assume Conditions 1–3 and 6 hold. Suppose \(\sqrt{mk} \lesssim n/M_n\), and take \(\lambda > 2T\lambda _0\) for some constant \(T>1\). For some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{(\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have

    $$\begin{aligned} \bar{\varepsilon }(\hat{\theta }\mid \theta _0) + 2(\lambda -T\lambda _0) \left( \Vert \hat{\beta }_{S_{\beta }^c}\Vert _1 + \Vert \hat{\zeta }_{S_{\zeta }^c}\Vert _1\right) \le 4(\lambda +T\lambda _0)^2\kappa ^2 c_0^2s, \end{aligned}$$
  2. (b)

    Assume Conditions 1–3 hold (without Condition 6), and assume

    $$\begin{aligned} \Vert \beta _0\Vert _1 + \Vert \zeta _0\Vert _1&= o\left( \sqrt{n/((\log n)^{2+2c_1} \log (d\vee n)mk)}\right) ,\\ \sqrt{mk}&= o\left( \sqrt{n/((\log n)^{2+2c_1}\log (d\vee n))}\right) \end{aligned}$$

    as \(n\rightarrow \infty \). If \(\lambda = C\sqrt{(\log n)^{2+2c_1}\log (d\vee n)mk/n}\) for some \(C>0\) sufficiently large, then for some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{ (\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have \(\bar{\varepsilon }(\hat{\theta }\mid \theta _0) = o_P(1)\).

For group-lasso type penalties, denote

$$\begin{aligned}&\mathcal {I}_{\beta } = \{p: \varvec{\beta }_{0,\mathcal {G}_{\beta ,p}} = \mathbf {0}\}, \ \mathcal {I}_{\beta }^c = \{p: \varvec{\beta }_{0,\mathcal {G}_{\beta ,p}} \ne \mathbf {0}\},\\&\mathcal {I}_{\zeta } = \{q: \varvec{\zeta }_{0,\mathcal {G}_{\zeta ,q}} = \mathbf {0}\}, \ \mathcal {I}_{\zeta }^c = \{q: \varvec{\zeta }_{0,\mathcal {G}_{\zeta ,q}} \ne \mathbf {0}\}, \end{aligned}$$

where \(\varvec{\beta }_{0,\mathcal {G}_{\beta ,p}}\) and \(\varvec{\zeta }_{0,\mathcal {G}_{\zeta ,q}}\) denote the pth group of \(\varvec{\beta }_0\) and the qth group of \(\varvec{\zeta }_0\), respectively. Now denote \(s = |\mathcal {I}_{\beta }| + |\mathcal {I}_{\zeta }|\) with some abuse of notation.

Then for group \(\ell _1\) penalties in (27) (for \(\varvec{\beta }\)) and \(\mathcal {R}(\varvec{\zeta }) = \sum _q^Q\Vert \varvec{\zeta }_{\mathcal {G}_{\zeta ,q}}\Vert _F\) (for \(\varvec{\zeta }\)), we need the following modified restricted eigenvalue condition.

Condition 7

For all \( \varvec{\beta }\in \mathbb {R}^{d\times mk}\) and all \( \varvec{\zeta }\in \mathbb {R}^{n\times mk}\) satisfying

$$\begin{aligned} \sum _{p\in \mathcal {I}_{\beta }^c}\Vert \varvec{\beta }_{\mathcal {G}_{\beta ,p}}\Vert _F + \sum _{q\in \mathcal {I}_{\zeta }^c}\Vert \varvec{\zeta }_{\mathcal {G}_{\zeta ,q}}\Vert _F \le 6\left( \sum _{p\in \mathcal {I}_{\beta }}\Vert \varvec{\beta }_{\mathcal {G}_{\beta ,p}}\Vert _F + \sum _{q\in \mathcal {I}_{\zeta }}\Vert \varvec{\zeta }_{\mathcal {G}_{\zeta ,q}}\Vert _F \right) , \end{aligned}$$

it holds that for some constant \(\kappa \ge 1\),

$$\begin{aligned} \sum _{p\in \mathcal {I}_{\beta }}\Vert \varvec{\beta }_{\mathcal {G}_{\beta ,p}}\Vert _F^2 + \sum _{q\in \mathcal {I}_{\zeta }}\Vert \varvec{\zeta }_{\mathcal {G}_{\zeta ,q}}\Vert _F^2 \le \kappa ^2 \Vert \varphi \Vert _{Q_n}^2 =\frac{\kappa ^2}{n}\sum _{i=1}^n\sum _{j\in {\varOmega }_i}\sum _{r=1}^k (\mathbf {x}_i\varvec{\beta }_{jr}+\zeta _{ijr})^2. \end{aligned}$$

Corollary 2

Consider the Hermit  model in (1) with \(\theta _0\in \tilde{{\varTheta }}\), and consider estimator (12) with the group \(\ell _1\) penalties in (27) and \(\mathcal {R}(\varvec{\zeta }) = \sum _q^Q\Vert \varvec{\zeta }_{\mathcal {G}_{\zeta ,q}}\Vert _F\).

  1. (a)

    Assume Conditions 1–3 and 7 hold. Suppose \(\sqrt{mk} \lesssim n/M_n\), and take \(\lambda > 2T\lambda _0\) for some constant \(T>1\). For some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{(\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have

    $$\begin{aligned} \bar{\varepsilon }(\hat{\theta }\mid \theta _0) + 2(\lambda -T\lambda _0)\biggl (\sum _{p\in \mathcal {I}_{\beta }^c}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_{\beta ,p}}\Vert _F+\sum _{q\in \mathcal {I}_{\zeta }^c}\Vert \hat{\varvec{\zeta }}_{\mathcal {G}_{\zeta ,q}}\Vert _F\biggr ) \le 4(\lambda +T\lambda _0)^2\kappa ^2 c_0^2s, \end{aligned}$$
  2. (b)

    Assume Conditions 1–3 hold (without Condition 7), and assume

    $$\begin{aligned} \sum _{p=1}^P\Vert \varvec{\beta }_{0,\mathcal {G}_{\beta ,p}}\Vert _F + \sum _{q=1}^Q\Vert \varvec{\zeta }_{0,\mathcal {G}_{\zeta ,q}}\Vert _F&= o\left( \sqrt{n/((\log n)^{2+2c_1}\log (d\vee n)mk)}\right) ,\\ \sqrt{mk}&= o\left( \sqrt{n/((\log n)^{2+2c_1}\log (d\vee n))}\right) \end{aligned}$$

    as \(n\rightarrow \infty \). If \(\lambda = C\sqrt{(\log n)^{2+2c_1}\log (d\vee n)mk/n}\) for some \(C>0\) sufficiently large, then for some constant \(c>0\) and large enough n, with probability \(1 - c\exp \left( -\frac{ (\log n)^2\log (d\vee n)}{c}\right) - \frac{1}{n}\), we have \(\bar{\varepsilon }(\hat{\theta }\mid \theta _0) = o_P(1)\).

Appendix E: Proof of Lemma 1

Proof

For a non-negative continuous random variable X, we have

$$\begin{aligned} \mathbb {E}[X1\{X>M\}]&= \int _M^{\infty }tf_X(t)dt = \int _M^{\infty }\int _0^tf_X(t)dxdt \nonumber \\&= \int _0^M\int _M^{\infty }f_X(t)dtdx + \int _M^{\infty }\int _x^{\infty }f_X(t)dtdx \nonumber \\&= M\mathbb {P}(X>M) + \int _M^{\infty }\mathbb {P}(X>x)dx. \end{aligned}$$

Similarly, we have \(\mathbb {E}[X^21\{X>M\}] = M^2\mathbb {P}(X>M) + \int _M^{\infty }2x\mathbb {P}(X>x)dx\).
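
As a quick sanity check of the first identity (an added example), take X exponential with unit rate, so that \(\mathbb {P}(X>x)=e^{-x}\); then

$$\begin{aligned} M\mathbb {P}(X>M) + \int _M^{\infty }\mathbb {P}(X>x)dx = Me^{-M} + e^{-M} = (M+1)e^{-M} = \int _M^{\infty }te^{-t}dt, \end{aligned}$$

which agrees with direct integration by parts.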

For X sub-exponential with parameters \((\sigma ,v ,t) \) such that for \(M>t \)

$$\begin{aligned} \mathbb {P}(X>M)\le \left\{ \begin{array}{ll} \exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ), &{} t \le M\le \frac{\sigma ^2}{v}\\ \exp \biggl (-\frac{M }{v }\biggr ), &{} M\ge \frac{\sigma ^2}{v}, \end{array} \right. \end{aligned}$$

we have the following.

If \(M\le \frac{\sigma ^2}{v} \), we have

$$\begin{aligned} \mathbb {E}[X1\{X>M\}]&= M\mathbb {P}(X>M) + \int _M^{\infty }\mathbb {P}(X>x)dx\\&\le M\exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ) + \int _M^{\frac{\sigma ^2}{v}}\exp \biggl (-\frac{x^2}{\sigma ^2}\biggr )dx + \int _{\frac{\sigma ^2}{v}}^{\infty }\exp \biggl (-\frac{x }{v }\biggr )dx\\&\le M\exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ) + \bigg (\frac{\sigma ^2}{v} - M\bigg ) \exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ) + v\exp \biggl (-\frac{M}{v}\biggr )\\&= M \exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ) + v\exp \biggl (-\frac{M}{v}\biggr )\le (M+v) \exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ), \end{aligned}$$

and similarly, \(\mathbb {E}[X^21\{X>M\}] \le \biggl (M^2+ 2v^2+2\sigma ^2\biggr )\exp \biggl (-\frac{M^2}{\sigma ^2}\biggr ).\)

If \(M> \frac{\sigma ^2}{v} \), we have \(\mathbb {E}[X1\{X>M\}] \le (M+v)\exp \biggl (-\frac{M }{v }\biggr )\) and \(\mathbb {E}[X^21\{X>M\}] \le (M^2+2v^2+2vM)\exp \biggl (-\frac{M }{v }\biggr )\).

Then for some constants \(c_1,c_2,c_3,c_4,c_5>0\), for non-negative continuous variable X which is sub-exponential with parameters \((\sigma ,v,t)\), for \(M>c_4>t\) and \(c' = 2+\frac{3}{c_1}\), we have

$$\begin{aligned}&\mathbb {E}[X1\{X>M\}] \le \biggl [ c_3\biggl (\frac{M}{c_2}\biggr )^{c'}+ c_5 \biggr ]\exp \biggl \{-\biggl (\frac{M}{c_2}\biggr )^{1/c_1}\biggr \},\\&\mathbb {E}[X^21\{X>M\}] \le \biggl [ c_3\biggl (\frac{M}{c_2}\biggr )^{c'}+ c_5 \biggr ]^2\exp \biggl \{-2\biggl (\frac{M}{c_2}\biggr )^{1/c_1}\biggr \}. \end{aligned}$$

If \(t \le M\le \frac{\sigma ^2}{v}\), \(c_1 =1/2, c_2 = \sqrt{2}\sigma , c_3 = 16\sigma ^8\). And if \(M\ge \frac{\sigma ^2}{v}\), \(c_1 = 1,c_2 = 2v,c_3 = 32v^5\). And \(c_5 = \sqrt{2}(v + \sigma )\).

For non-negative discrete variables, the result is the same.

The result of Lemma 1 then follows from the result above, from the fact that \(\tilde{\mathbf {y}}_i\) has a finite mixture distribution for \(i=1,\ldots ,n\), and from the following.

When the dispersion parameter \(\phi \) is known, for a constant \(c_K\) depending on K, we have

$$\begin{aligned} G_1(\tilde{\mathbf {y}}_i) = e^K \max _{j\in {\varOmega }_i}|y_{ij}| + c_K, \ i=1,\ldots ,n. \end{aligned}$$

\(\square \)

Appendix F: Proof of Lemma 2

Proof

Under Condition 2, with \(M_n = c_2(\log n)^{c_1}\) and \(\lambda _0\) defined in (24), for a constant \(c_6\) depending on K and for \(i=1,\ldots ,n\), we have

$$\begin{aligned}&\mathbb {E}[|G_1(\tilde{\mathbf {y}}_i)|1\{|G_1(\tilde{\mathbf {y}}_i)|>M_n\}] \le c_6\lambda _0^2/(mk), \\&\mathbb {E}[|G_1(\tilde{\mathbf {y}}_i)|^21\{|G_1(\tilde{\mathbf {y}}_i)|>M_n\}] \le c_6^2\lambda _0^4/(mk)^2. \end{aligned}$$

Then, using these expectation bounds and Chebyshev's inequality, we get

$$\begin{aligned}&\mathbb {P}_{\mathbf {X}}\biggl (\frac{1}{n}\sum _{i=1}^nG_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\} + \mathbb {E}[G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\}]>3c_6\lambda _0^2/(mk) \biggr )\\&\quad \le \mathbb {P}_{\mathbf {X}}\biggl (\frac{1}{n}\sum _{i=1}^nG_1(\tilde{\mathbf {y}}_i)1 \{G_1(\tilde{\mathbf {y}}_i)>M_n\} - \mathbb {E}[G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\}]>c_6\lambda _0^2/(mk) \biggr )\\&\quad \le \frac{\mathbb {E}[|G_1(\tilde{\mathbf {y}}_i)|^21\{|G_1(\tilde{\mathbf {y}}_i)|>M_n\}]}{n}\frac{m^2k^2}{c_6^2\lambda _0^4} \le \frac{1}{n}. \end{aligned}$$

\(\square \)

Appendix G: Proof of Lemma 3

Proof

We follow Städler et al. (2010) to give an Entropy Lemma and then prove Lemma 3.

We use the following norm \(\Vert \cdot \Vert _{P_n}\), introduced in the proof of Lemma 2 in Städler et al. (2010), and write \(H(\cdot ,\mathcal {H},\Vert \cdot \Vert _{P_n})\) for the entropy (log covering number) [see Van de Geer (2000)] of a collection \(\mathcal {H}\) of functions on \(\mathcal {X}\times \mathcal {Y}\), equipped with the metric induced by the norm

$$\begin{aligned} \Vert h(\cdot ,\cdot )\Vert _{P_n} = \sqrt{\frac{1}{n}\sum _{i=1}^nh^2(\mathbf {x}_i,\tilde{\mathbf {y}}_i)}. \end{aligned}$$

Define \(\tilde{{\varTheta }}(\epsilon ) = \{\theta \in \tilde{{\varTheta }}: \Vert \varvec{\beta }-\varvec{\beta }_0\Vert _1 + \Vert \eta - \eta _0\Vert _2 \le \epsilon \}\).

Lemma 4

(Entropy Lemma) For a constant \(c_{12}>0\), for all \(u>0\) and \(M_n>0\), we have

$$\begin{aligned}&H\biggl (u,\biggl \{(\ell _{\theta } - \ell _{\theta ^{\star }})1\{G_1\le M_n\}: \theta \in \tilde{{\varTheta }}(\epsilon )\biggr \}, \Vert \cdot \Vert _{P_n}\biggr ) \\&\quad \le c_{12}\frac{mk\epsilon ^2M_n^2}{u^2}\log \biggl (\frac{\sqrt{mk}\epsilon M_n}{u}\biggr ). \end{aligned}$$

Proof

(of the Entropy Lemma) The difference between this proof and that of the Entropy Lemma in the proof of Lemma 2 of Städler et al. (2010) lies in the notation and in the effect of multivariate responses.

For multivariate responses we have for \(i=1,\ldots ,n\),

$$\begin{aligned} |\ell _{\theta }(\mathbf {x}_i,\tilde{\mathbf {y}}_i) - \ell _{\theta '} (\mathbf {x}_i,\tilde{\mathbf {y}}_i)|^2&\le G_1^2(\tilde{\mathbf {y}}_i)\Vert \psi _i - \psi '_i \Vert _1^2 \le d_{\psi }G_1^2(\tilde{\mathbf {y}}_i)\Vert \psi _i- \psi '_i\Vert _2^2\\&= d_{\psi }G_1^2(\tilde{\mathbf {y}}_i) \biggl [\sum _{r=1}^k\sum _{j\in {\varOmega }_i}|\mathbf {x}_i(\varvec{\beta }_{rj}- \varvec{\beta }'_{rj}) |^2 +\Vert \eta - \eta '\Vert _2^2\biggr ], \end{aligned}$$

where \(d_{\psi } = (2m+1)k\) is the maximum dimension of \(\psi _i\) over \(i=1,\ldots ,n\).

Under the definition of the norm \(\Vert \cdot \Vert _{P_n}\) we have

$$\begin{aligned}&\Vert (\ell _{\theta } - \ell _{\theta '})1\{G_1\le M_n\}\Vert _{P_n}^2 \\&\quad \le d_{\psi }M_n^2\left[ \frac{1}{n}\sum _{i=1}^n\sum _{r=1}^k\sum _{j\in {\varOmega }_i}|\mathbf {x}_i(\varvec{\beta }_{rj}- \varvec{\beta }'_{rj}) |^2 + \Vert \eta - \eta '\Vert _2^2 \right] . \end{aligned}$$

Then by the result of Städler et al. (2010) we have

$$\begin{aligned} H (u,\{\eta \in \mathbb {R}^{d_{\eta }}: \Vert \eta -\eta _0\Vert _2\le \epsilon \},\Vert \cdot \Vert _2 )\le d_{\eta }\log \biggl (\frac{5\epsilon }{u}\biggr ), \end{aligned}$$

where \(d_{\eta } = (m+1)k\) is the dimension of \(\eta \).
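
For completeness (an added remark), this is the standard volumetric covering bound for a Euclidean ball: the set \(\{\eta \in \mathbb {R}^{d_{\eta }}: \Vert \eta -\eta _0\Vert _2\le \epsilon \}\) can be covered by at most \((1+2\epsilon /u)^{d_{\eta }}\) balls of radius u, and

$$\begin{aligned} d_{\eta }\log \biggl (1+\frac{2\epsilon }{u}\biggr )\le d_{\eta }\log \biggl (\frac{5\epsilon }{u}\biggr ) \quad \text{ for } \ u\le \epsilon . \end{aligned}$$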

We then follow Städler et al. (2010) and apply Lemma 2.6.11 of Van Der Vaart and Wellner (1996) to obtain the bound

$$\begin{aligned}&H \biggl (2u,\biggl \{ \sum _{r=1}^k\sum _{j\in {\varOmega }_i}\mathbf {x}_i(\varvec{\beta }_{rj}- {\varvec{\beta }}_{0,rj}) : \Vert \varvec{\beta }-\varvec{\beta }_0\Vert _1\le \epsilon \biggr \}, \Vert \cdot \Vert _{P_n}\biggr ) \\&\quad \le \biggl (\frac{\epsilon ^2}{u^2}+1\biggr )\log (1+kmd). \end{aligned}$$

Thus we obtain

$$\begin{aligned}&H \biggl (3\sqrt{d_{\psi }}M_nu,\biggl \{ (\ell _{\theta } - \ell _{\theta _0})1\{G_1\le M_n\} : \theta \in \tilde{{\varTheta }}(\epsilon ) \biggr \}, \Vert \cdot \Vert _{P_n}\biggr ) \\&\quad \le \biggl (\frac{\epsilon ^2}{u^2}+1+d_{\eta }\biggr )\biggl (\log (1+kmd)+\log \biggl (\frac{5\epsilon }{u}\biggr )\biggr ). \end{aligned}$$

\(\square \)

Now we turn to prove Lemma 3.

We follow Städler et al. (2010) and use the truncated version of the empirical process below.

$$\begin{aligned}&V_n^{trunc}(\theta ) \\&\quad = \frac{1}{n}\sum _{i=1}^n\biggl ( \ell _{\theta }(\mathbf {x}_i,\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)\le M_n\} - \mathbb {E}[\ell _{\theta }(\mathbf {x}_i,\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)\le M_n\}\mid X=\mathbf {x}_i] \biggr ). \end{aligned}$$

We follow Städler et al. (2010) and apply Lemma 3.2 of Van de Geer (2000) and a conditional version of Lemma 3.3 of Van de Geer (2000) to the class

$$\begin{aligned} \biggl \{ (\ell _{\theta } - \ell _{\theta _0})1\{G_1\le M_n\} : \theta \in \tilde{{\varTheta }}(\epsilon ) \biggr \}, \forall \epsilon >0. \end{aligned}$$

For some constants \(\{c_{t}\}_{t>12}\) depending on K and \({\varLambda }_{\max }\) in Condition 2 of Städler et al. (2010), using the notation of Lemma 3.2 in Van de Geer (2000), we follow Städler et al. (2010) to choose \(\delta = c_{13} T\epsilon \lambda _0\) and \(R = c_{14}(\sqrt{mk}\epsilon \wedge 1)M_n\).

Thus, by choosing \(M_n = c_2(\log n)^{c_1}\), we can satisfy the condition of Lemma 3.2 of Van de Geer (2000) and obtain

$$\begin{aligned}&\int _{\epsilon /c_{15}}^R H^{1/2} \biggl (u,\biggl \{(\ell _{\theta } - \ell _{\theta ^{\star }})1\{G_1\le M_n\}: \theta \in \tilde{{\varTheta }}(\epsilon )\biggr \}, \Vert \cdot \Vert _{P_n}\biggr ) du \vee R \\&\quad =\int _{\epsilon /c_{15}}^{c_{14}\sqrt{mk}(\epsilon \wedge 1)M_n} c_{12}\biggl (\frac{\sqrt{mk}\epsilon M_n}{u}\biggr )\log ^{1/2}\biggl (\frac{\sqrt{mk}\epsilon M_n}{u}\biggr )du \vee (c_{14}(\epsilon \wedge 1)M_n)\\&\quad \le \frac{2}{3}c_{12}\sqrt{mk}\epsilon M_n \left[ \log ^{3/2} (c_{15}\sqrt{mk}M_n) - \log ^{3/2} \left( \frac{\sqrt{mk} \epsilon M_n}{c_{14}\sqrt{mk}(\epsilon \wedge 1)M_n}\right) \right] \\&\qquad \vee (c_{14}\sqrt{mk}(\epsilon \wedge 1)M_n) \\&\quad \le \frac{2}{3}c_{12}\sqrt{mk}\epsilon M_n\log ^{3/2} (c_{15}\sqrt{mk}M_n)\\&\quad \le c_{16} \sqrt{mk}\epsilon M_n\log ^{3/2} (n) \quad \left( \text{ by } \text{ choosing } \ M_n = c_2(\log n)^{c_1}, \text{ and } \ \sqrt{mk} \le c_{17}\frac{n}{M_n}\right) \\&\quad \le c_{18} \sqrt{n} T\epsilon \lambda _0\le \sqrt{n}(\delta - \epsilon ). \end{aligned}$$

For the rest, we can apply Lemma 3.2 of Van de Geer (2000) to obtain the same result as Lemma 2 of Städler et al. (2010).

So we have

$$\begin{aligned} \sup _{\theta \in \tilde{{\varTheta }}} \frac{|V_n^{trunc}(\theta ) - V_n^{trunc}(\theta _0)|}{(\Vert \varvec{\beta }-\varvec{\beta }_0\Vert _1 + \Vert \eta -\eta _0\Vert _2 )\vee \lambda _0} \le 2c_{23}T \lambda _0 \end{aligned}$$

with probability at least \(1 - c_{9}\exp \biggl [- \frac{T^2(\log n)^2\log (d\vee n) }{c_{8}^2}\biggr ]\).

Finally, for the case where \(G_1(\tilde{\mathbf {y}}_i)>M_n\), we have, for \(i=1,\ldots ,n\),

$$\begin{aligned} | (\ell _{\theta }(\mathbf {x}_i,\tilde{\mathbf {y}}_i) - \ell _{\theta _0}(\mathbf {x}_i,\tilde{\mathbf {y}}_i))1\{G_1(\tilde{\mathbf {y}}_i)> M_n\} |\le d_{\psi }KG_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)> M_n\}, \end{aligned}$$

and

$$\begin{aligned}&\frac{|(V_n^{trunc}(\theta ) - V_n^{trunc}(\theta _0)) -(V_n(\theta )-V_n(\theta _0)) |}{(\Vert \varvec{\beta }-\varvec{\beta }_0\Vert _1 + \Vert \eta -\eta _0\Vert _2 )\vee \lambda _0}\\&\quad \le \frac{d_{\psi }K}{n\lambda _0}\sum _{i=1}^n \biggl ( G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\} + \mathbb {E}[G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\}\mid X=\mathbf {x}_i] \biggr ). \end{aligned}$$

The probability that the following inequality holds under our model is then given by Lemma 2.

$$\begin{aligned}&\frac{d_{\psi }K}{n\lambda _0}\sum _{i=1}^n \biggl ( G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\} + \mathbb {E}[G_1(\tilde{\mathbf {y}}_i)1\{G_1(\tilde{\mathbf {y}}_i)>M_n\}\mid X=\mathbf {x}_i] \biggr ) \\&\quad \le c_{23}T \lambda _0, \end{aligned}$$

where \(d_{\psi } = 2(m+1)k\). \(\square \)

Appendix H: Proof of Theorem 1

Proof

This proof mostly follows that of Theorem 3 of Städler et al. (2010); the only difference is in the notation, so we omit the details. \(\square \)

Appendix I: Proof of Theorem 2

Proof

This proof also mostly follows that of Theorem 5 of Städler et al. (2010); the difference lies in the notation and in the choice of \(M_n\).

If the event \(\mathcal {T}\) happens, then with \(M_n = c_2(\log n)^{c_1}\) for some constants \(0\le c_1,c_2<\infty \), where \(c_2\) depends on K, and with

$$\begin{aligned} \lambda _0 = \sqrt{mk}\, M_n\log n\sqrt{\log (d\vee n)/n} = c_2\sqrt{mk(\log n)^{2+2c_1}\log (d\vee n)/n}, \end{aligned}$$

we have

$$\begin{aligned} \bar{\varepsilon }(\hat{\psi }\mid \psi _0) + \lambda \Vert \hat{\beta }\Vert _1&\le T\lambda _0[(\Vert \hat{\beta }-\beta _0\Vert _1 + \Vert \eta -\eta _0\Vert _2 )\vee \lambda _0] \\&\quad + \lambda \Vert \beta _0\Vert _1 + \bar{\varepsilon }(\psi _0\mid \psi _0). \end{aligned}$$

By the definition of \(\theta \in \tilde{{\varTheta }}\) in (23) we have \(\Vert \eta -\eta _0\Vert _2\le 2K\), and since \( \bar{\varepsilon }(\psi _0\mid \psi _0) =0\), we have, for n sufficiently large,

$$\begin{aligned}&\bar{\varepsilon }(\hat{\psi }\mid \psi _0) + \lambda \Vert \hat{\beta }\Vert _1 \le T\lambda _0(\Vert \hat{\beta }\Vert _1 +\Vert \beta _0\Vert _1 + 2K ) + \lambda \Vert \beta _0\Vert _1\\&\rightarrow \bar{\varepsilon }(\hat{\psi }\mid \psi _0) + (\lambda -T\lambda _0)\Vert \hat{\beta }\Vert _1 \le T\lambda _0 2K + (\lambda +T\lambda _0)\Vert \beta _0\Vert _1 \end{aligned}$$

Since \(C>0\) is sufficiently large, we have \(\lambda \ge 2T\lambda _0\).

Using the condition on \(\Vert \beta _0\Vert _1\) and \(\sqrt{mk}\), we have both \(2KT\lambda _0 = o(1)\) and \((\lambda +T\lambda _0)\Vert \beta _0\Vert _1 = o(1)\), so \(\bar{\varepsilon }(\hat{\psi }\mid \psi _0)\rightarrow 0 \ (n\rightarrow \infty )\).

Finally, since the event \(\mathcal {T}\) has large probability, we have \(\bar{\varepsilon }(\hat{\theta }_{\lambda }\mid \theta _0) = o_P(1) \ (n\rightarrow \infty )\). \(\square \)

Appendix J: Proof of Theorem 3

Proof

First we discuss the bound for the probability of \(\mathcal {T}_{group}\) in (22).

The difference between \(\mathcal {T}_{group}\) and \(\mathcal {T}\) in (21) only concerns the following entropy term of the Entropy Lemma in the proof of Lemma 3.

$$\begin{aligned}&H \biggl (2u,\biggl \{ \sum _{r=1}^k\sum _{j\in {\varOmega }_i}\mathbf {x}_i(\varvec{\beta }_{rj}- {\varvec{\beta }}_{0,rj}) : \sum _p\Vert \varvec{\beta }_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F\le \epsilon \biggr \}, \Vert \cdot \Vert _{P_n}\biggr ) , \\&\quad \text{ for } \ i = 1\ldots ,n, \end{aligned}$$

where the constraint \(\sum _p\Vert \varvec{\beta }_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F\le \epsilon \) still maintains a convex hull for \(\varvec{\beta }\) in the metric space equipped with the metric induced by the norm \(\Vert \cdot \Vert _{P_n}\) defined in the proof of Lemma 3. Thus it still satisfies the condition of Lemma 2.6.11 of Van Der Vaart and Wellner (1996), which can be applied to give

$$\begin{aligned}&H \biggl (2u,\biggl \{ \sum _{r=1}^k\sum _{j\in {\varOmega }_i}\mathbf {x}_i(\varvec{\beta }_{rj}- {\varvec{\beta }}_{0,rj}) : \sum _p\Vert \varvec{\beta }_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F\le \epsilon \biggr \}, \Vert \cdot \Vert _{P_n}\biggr ) \\&\quad \le \biggl (\frac{\epsilon ^2}{u^2}+1\biggr )\log (1+kmd), \ \text{ for } \ i = 1\ldots ,n. \end{aligned}$$

So the probability bound for the event \(\mathcal {T}_{group}\) has the same form as that in Lemma 3.

Then we discuss the bound for the average excess risk and feature selection.

If the event \(\mathcal {T}_{group}\) happens, we have

$$\begin{aligned} \bar{\varepsilon }(\hat{\psi }\mid \psi _0) + \lambda \sum _p\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}\Vert _F&\le T\lambda _0\biggl [\biggl (\sum _{\mathcal {G}_p}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F + \Vert \eta -\eta _0\Vert _2 \biggr )\vee \lambda _0\biggr ]\\&\quad + \lambda \sum _p\Vert \varvec{\beta }_{0,\mathcal {G}_p}\Vert _F + \bar{\varepsilon }(\psi _0\mid \psi _0). \end{aligned}$$

Using Condition 3 we have \( \bar{\varepsilon }(\psi _0\mid \psi _0) =0\) and \(\bar{\varepsilon }(\hat{\psi }\mid \psi _0) \ge {\Vert \hat{\psi }-\psi _0\Vert _{Q_n}^2}/{c_0^2}\).

Case 1 When the following is true:

$$\begin{aligned} \sum _p\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F + \Vert \hat{\eta }-\eta _0\Vert _2 \le \lambda _0, \end{aligned}$$

we have

$$\begin{aligned} \bar{\varepsilon }(\hat{\psi }\mid \psi _0)&\le T\lambda _0^2 + \lambda \sum _p\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F + \bar{\varepsilon }(\psi _0\mid \psi _0) \le (\lambda +T\lambda _0)\lambda _0. \end{aligned}$$

Case 2 When the following is true:

$$\begin{aligned}&\sum _p\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F + \Vert \hat{\eta }-\eta _0\Vert _2 \ge \lambda _0,\\&T\lambda _0\Vert \hat{\eta }-\eta _0\Vert _2 \ge (\lambda +T\lambda _0)\sum _{p\in \mathcal {I}}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F. \end{aligned}$$

As \(\sum _{p\in \mathcal {I}^c}\Vert \varvec{\beta }_{0,\mathcal {G}_p}\Vert _F=0\), we have

$$\begin{aligned}&\bar{\varepsilon }(\hat{\psi }\mid \psi _0) + (\lambda -T\lambda _0)\sum _{p\in \mathcal {I}^c}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}\Vert _F \le 2T\lambda _0\Vert \hat{\eta }-\eta _0\Vert _2\\&\quad \le 2T^2\lambda _0^2c_0^2 + \Vert \hat{\eta }-\eta _0\Vert _2^2/(2c_0^2) \le 2T^2\lambda _0^2c_0^2 + \bar{\varepsilon }(\hat{\psi }\mid \psi _0)/2. \end{aligned}$$

Then we get

$$\begin{aligned} \bar{\varepsilon }(\hat{\psi }\mid \psi _0) + 2(\lambda -T\lambda _0)\sum _{p\in \mathcal {I}^c}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}\Vert _F\le 4T^2\lambda _0^2c_0^2. \end{aligned}$$

Case 3 When the following is true:

$$\begin{aligned}&\sum _p\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F + \Vert \hat{\eta }-\eta _0\Vert _2 \ge \lambda _0,\\&T\lambda _0\Vert \hat{\eta }-\eta _0\Vert _2 \le (\lambda +T\lambda _0)\sum _{p\in \mathcal {I}}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F, \end{aligned}$$

we have

$$\begin{aligned}&\bar{\varepsilon }(\hat{\psi }\mid \psi _0) + (\lambda -T\lambda _0)\sum _{p\in \mathcal {I}^c}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}\Vert _F \le 2(\lambda +T\lambda _0)\sum _{p\in \mathcal {I}}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F. \end{aligned}$$

Thus we have

$$\begin{aligned} \sum _{p\in \mathcal {I}^c}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}\Vert _F \le 6 \sum _{p\in \mathcal {I}}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F, \end{aligned}$$

so we can use Condition 5 for \(\hat{\varvec{\beta }} -\varvec{\beta }_0\) to obtain

$$\begin{aligned}&\bar{\varepsilon }(\hat{\psi }\mid \psi _0) + (\lambda -T\lambda _0)\sum _{p\in \mathcal {I}^c}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}\Vert _F \le 2(\lambda +T\lambda _0)\sqrt{s}\sum _{p\in \mathcal {I}}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}-\varvec{\beta }_{0,\mathcal {G}_p}\Vert _F\\&\quad \le 2(\lambda +T\lambda _0)\sqrt{s}\kappa \Vert \hat{\varphi }-\varphi _0\Vert _{Q_n} \le 2(\lambda +T\lambda _0)^2 s \kappa ^2 c_0^2 + \Vert \hat{\varphi }-\varphi _0\Vert _{Q_n}^2/(2c_0^2)\\&\quad \le 2(\lambda +T\lambda _0)^2 s \kappa ^2 c_0^2 +\bar{\varepsilon }(\hat{\psi }\mid \psi _0)/2. \end{aligned}$$

So we have

$$\begin{aligned} \bar{\varepsilon }(\hat{\psi }\mid \psi _0) + 2(\lambda -T\lambda _0)\sum _{p\in \mathcal {I}^c}\Vert \hat{\varvec{\beta }}_{\mathcal {G}_p}\Vert _F\le 4(\lambda +T\lambda _0)^2s \kappa ^2 c_0^2. \end{aligned}$$

Without the restricted eigenvalue Condition 5, we can argue similarly as in the "Appendix I" section, assuming the event \(\mathcal {T}_{group}\) happens and using the condition on \(\sum _p\Vert \varvec{\beta }_{0,\mathcal {G}_p}\Vert _F\) and \(\sqrt{mk}\). \(\square \)

Cite this article

Liang, J., Chen, K., Lin, M. et al. Robust finite mixture regression for heterogeneous targets. Data Min Knowl Disc 32, 1509–1560 (2018). https://doi.org/10.1007/s10618-018-0564-z
