Robust variable selection for finite mixture regression models

Annals of the Institute of Statistical Mathematics

Abstract

Finite mixture regression (FMR) models are frequently used in statistical modeling, often with many covariates, some of which have little influence on the response. Variable selection techniques can be employed to identify such covariates. We study the problem of variable selection in FMR models. Penalized likelihood-based approaches are sensitive to data contamination, and their efficiency may be reduced substantially when the model is slightly misspecified. We propose a new robust variable selection procedure for FMR models based on minimum-distance techniques, which appear to possess a degree of automatic robustness to model misspecification. We show that the proposed estimator is variable selection consistent and has the oracle property. We also establish the finite-sample breakdown point of the estimator to demonstrate its robustness. The small-sample and robustness properties of the estimator are examined in a Monte Carlo study, and a real data set is analyzed.

References

  • Basu, A., Harris, I. R., Hjort, N. L., Jones, M. C. (1998). Robust and efficient estimation by minimizing a density power divergence. Biometrika, 85, 549–559.

  • Beran, R. (1977). Minimum Hellinger distance estimators for parametric models. Annals of Statistics, 5, 445–463.

  • Beran, R. (1978). An efficient and robust adaptive estimator of location. Annals of Statistics, 6, 292–313.

  • Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

  • Broniatowski, M., Toma, A., Vajda, I. (2012). Decomposable pseudodistances and applications in statistical estimation. Journal of Statistical Planning and Inference, 142, 2574–2585.

  • Chen, S. X. (1999). Beta kernel estimators for density functions. Computational Statistics and Data Analysis, 31, 131–145.

  • Cutler, A., Cordero-Braña, O. I. (1996). Minimum Hellinger distance estimation for finite mixture models. Journal of the American Statistical Association, 91, 1716–1723.

  • Devlin, S. J., Gnanadesikan, R., Kettenring, J. R. (1981). Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association, 76, 354–362.

  • Devroye, L. P., Wagner, T. J. (1979). The \(L_{1}\) convergence of kernel density estimates. Annals of Statistics, 7, 1136–1139.

  • Donoho, D. (1982). Breakdown properties of multivariate location estimators. Unpublished qualifying paper. Cambridge, Massachusetts, USA: Harvard University, Department of Statistics.

  • Donoho, D., Huber, P. (1983). The notion of breakdown point. In P. J. Bickel, K. A. Doksum, J. L. Hodges Jr. (Eds.), A Festschrift for E. L. Lehmann (pp. 157–184). Belmont, CA: Wadsworth.

  • Donoho, D. L., Liu, R. C. (1988). The “automatic” robustness of minimum distance functionals. Annals of Statistics, 16, 552–586.

  • Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

  • Fan, J., Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In Proceedings of the International Congress of Mathematicians, 3, 595–622.

  • Fan, J., Lv, J. (2011). Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory, 57, 5467–5484.

  • Fan, J., Xue, L., Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics, 42, 819–849.

  • Frank, I., Friedman, J. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35, 109–135.

  • Friedman, J., Hastie, T., Höfling, H., Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1, 302–332.

  • Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.

  • Hennig, C. (2000). Identifiability of models for clusterwise linear regression. Journal of Classification, 17, 273–296.

  • Jiang, W., Tanner, M. A. (1999). On the approximation rate of hierarchical mixtures-of-experts for generalized linear models. Neural Computation, 11, 1183–1198.

  • Karlis, D., Xekalaki, E. (2001). Robust inference for finite mixtures. Journal of Statistical Planning and Inference, 93, 93–115.

  • Karunamuni, R. J., Wu, J. (2011). One-step minimum Hellinger distance estimation. Computational Statistics and Data Analysis, 55, 3148–3164.

  • Khalili, A. (2010). New estimation and feature selection methods in mixture-of-experts models. The Canadian Journal of Statistics, 38, 519–539.

  • Khalili, A., Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102, 1025–1038.

  • Khalili, A., Lin, S. (2013). Regularization in finite mixture of regression models with diverging number of parameters. Biometrics, 69, 436–446.

  • Khalili, A., Chen, J., Lin, S. (2011). Feature selection in finite mixture of sparse normal linear models in high-dimensional feature space. Biostatistics, 12, 156–172.

  • Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 1–18.

  • Lindsay, B. G. (1994). Efficiency versus robustness: The case for minimum Hellinger distance and related methods. Annals of Statistics, 22, 1081–1114.

  • Lu, Z., Hui, Y. V., Lee, A. H. (2003). Minimum Hellinger distance estimation for finite mixtures of Poisson regression models and its applications. Biometrics, 59, 1016–1026.

  • Lv, J., Fan, J. (2009). A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics, 37, 3498–3528.

  • Markatou, M. (2000). Mixture models, robustness and the weighted likelihood methodology. Biometrics, 56, 483–486.

  • Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. Annals of Statistics, 4, 51–67.

  • McLachlan, G. J., Peel, D. (2000). Finite mixture models. New York: Wiley.

  • Pollard, D. (1981). Strong consistency of k-means clustering. Annals of Statistics, 9, 135–140.

  • Shen, L. Z. (1995). On optimal B-robust influence functions in semiparametric models. Annals of Statistics, 23, 968–989.

  • Städler, N., Bühlmann, P., van de Geer, S. (2010). \(l_{1}\)-penalization for mixture regression models. Test, 19, 209–256.

  • Tamura, R., Boos, D. D. (1986). Minimum Hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association, 81, 223–229.

  • Tang, Q., Karunamuni, R. J. (2013). Minimum distance estimation in a finite mixture regression model. Journal of Multivariate Analysis, 120, 185–204.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 267–288.

  • Titterington, D. M., Smith, A. F. M., Makov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.

  • Toma, A. (2008). Minimum Hellinger distance estimators for multivariate distributions from Johnson system. Journal of Statistical Planning and Inference, 138, 803–816.

  • van der Vaart, A. (1996). Efficient maximum likelihood estimation in semiparametric models. Annals of Statistics, 24, 862–878.

  • Wang, H., Li, G., Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business and Economic Statistics, 25, 347–355.

  • Wang, X., Jiang, Y., Huang, M., Zhang, H. (2013). Robust variable selection with exponential squared loss. Journal of the American Statistical Association, 108, 632–643.

  • Wu, J., Karunamuni, R. J. (2012). Efficient Hellinger distance estimates for semiparametric models. Journal of Multivariate Analysis, 107, 1–23.

  • Wu, J., Karunamuni, R. J. (2015). Profile Hellinger distance estimation. Statistics, 49(4), 711–740.

  • Wu, J., Karunamuni, R. J., Zhang, B. (2010). Minimum Hellinger distance estimation in a two-sample semiparametric model. Journal of Multivariate Analysis, 101, 1102–1122.

  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, 894–942.

  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.

  • Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67, 301–320.

  • Zou, H., Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1509–1533.

  • Zou, H., Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37, 1733–1751.

Acknowledgements

We wish to thank the Chief Editor, Professor Kenji Fukumizu, an Associate Editor, and two reviewers for their helpful comments and suggestions that led to substantial improvements in this paper. Q. Tang’s research was supported in part by the National Social Science Foundation of China (16BTJ019) and Jiangsu Natural Science Foundation of China (BK20151481). R.J. Karunamuni’s research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Author information

Corresponding author

Correspondence to R. J. Karunamuni.

Appendix

In this Appendix, we list the conditions used in the theorems and outline the proofs of the main results. For notational convenience, we write \(f_{\varvec{\theta }}(y)\) for \(f_{\varvec{\theta },\eta }(y)\) defined by (4). The following technical conditions are imposed:

(C1) \(E\Vert \mathbf {X}\Vert ^{2}<+\infty \) and \(\max _{1\le i\le n}\Vert X_{i}\Vert =O_{p}(1)\). \(\int \sup _{t\in \Theta }|\frac{\partial f_{t} (y)}{\partial t}|{\text{ d }}y<+\infty \), where \(|b|=\max _{1\le i\le k}|b_{i}|\) for a vector \(b=(b_{1},\ldots ,b_{k})^{T}\). \(\ddot{S}_{\varvec{\theta }}^{(l,m)}(y)\in L_{2}\) for \(1\le l,m\le J(p+2)-1\) and \(H(\varvec{\theta }_{0})=-\int \ddot{S}_{\varvec{\theta }_{0}}(y)f_{\varvec{\theta }_{0}}^{1/2}(y){\text{ d }}y\) is a positive definite matrix, where \(\ddot{S}_{\varvec{\theta }}^{(l,m)}(y)\) denotes the \((l,m)^{th}\) component of \(\ddot{S}_{\varvec{\theta }}(y)\), \(\ddot{S} _{_{\varvec{\theta }}}(y)=\frac{\partial ^{2}S_{\varvec{\theta }}(y)}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}}\), and \(S_{\varvec{\theta }} (y)=f_{\varvec{\theta }}^{1/2}(y)\).

(C2) The second and third continuous partial derivatives of \(g(y,z,u)\) exist w.r.t. \(y\) and \(z,\,u\), respectively. For a given \(\tilde{L}\,>0\) and some \(\epsilon \)-neighborhood of \(\theta ,B(\theta ,\epsilon ),\) define \(\tilde{g}(y)=\inf _{\Vert x\Vert \le \tilde{L},t\in B(\theta ,\epsilon )}\min _{1\le j\le k}g(y,\mathbf {x}^{T}t_{j},u_{j}).\) Suppose that \(1/\tilde{g}(y)\) is bounded on any compact subset of R and that, as \(L\rightarrow \infty \),

$$\begin{aligned} \begin{array}{ll} \int _{|y|>L}\int _{\Vert x\Vert \le \tilde{L}}|x_{r}|\breve{g}_{z} (y,\mathbf {x}){\text{ d }}\eta (\mathbf {x}){\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\int _{\Vert x\Vert \le \tilde{L}}\breve{g}_{u}(y,\mathbf {x}){\text{ d }}\eta (\mathbf {x}){\text{ d }}y\rightarrow 0,\\ \int _{|y|>L}\int _{\Vert x\Vert \le \tilde{L}}g^{*}(y,\mathbf {x} ){\text{ d }}\eta (\mathbf {x}){\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}x_{r}^{2}(\dot{g}_{z}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)}{\text{ d }}y\rightarrow 0,\\ \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}(\dot{g}_{u}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)}{\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}x_{r}^{4}(\dot{g}_{z}^{*}(y,\mathbf {x}))^{4}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}^{3}(y)}{\text{ d }}y\rightarrow 0,\\ \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}(\dot{g}_{u}^{*}(y,\mathbf {x}))^{4}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}^{3}(y)}{\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}x_{r}^{2}x_{q}^{2}(\ddot{g}_{zz}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)} {\text{ d }}y\rightarrow 0,\\ \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}(\ddot{g}_{uu}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)}{\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}x_{r}^{2}(\ddot{g} _{zu}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)}{\text{ d }}y\rightarrow 0 \end{array} \end{aligned}$$

for \(r,q=0,\ldots ,p\), where \(x_{0}=1\), \(\breve{g}_{z}(y,\mathbf {x})=\sup _{t\in \Theta }\max _{1\le j\le k}|\dot{g}_{z}(y,\mathbf {x}^{T}t_{j},u_{j})|\), \(\breve{g}_{u}(y,\mathbf {x})=\sup _{t\in \Theta }\max _{1\le j\le k}|\dot{g} _{u}(y,\mathbf {x}^{T}t_{j},u_{j})|\),

$$\begin{aligned} g^{*}(y,\mathbf {x})=\sup _{t\in B(\theta ,\epsilon )}\max _{1\le j\le k}g(y,x^{T}t_{j},u_{j}),\ \ \dot{g}_{z}^{*}(y,x)=\sup _{t\in B(\theta ,\epsilon )}\max _{1\le j\le k}|\dot{g}_{z}(y,x^{T}t_{j},u_{j})|, \end{aligned}$$

\(\ddot{g}_{zz}^{*}(y,\mathbf {x})=\sup _{t\in B(\theta ,\epsilon )}\max _{1\le j\le k}|\ddot{g}_{zz}(y,\mathbf {x}^{T}t_{j},u_{j})|\), \(\dot{g}_{u}^{*}(y,\mathbf {x})\), \(\ddot{g}_{zu}^{*}(y,\mathbf {x})\), and \(\ddot{g} _{uu}^{*}(y,\mathbf {x})\) are defined in a similar fashion, \(\dot{g} _{z}(y,z,u)=\frac{\partial g(y,z,u)}{\partial z},\dot{g}_{u}(y,z,u)=\frac{\partial g(y,z,u)}{\partial u}\), \(\ddot{g}_{zz}(y,z,u)=\frac{\partial ^{2}g(y,z,u)}{\partial z^{2}},\ \)and \(\ddot{g}_{zu}(y,z,u)=\frac{\partial ^{2}g(y,z,u)}{\partial z\partial u}\).

(C3) The kernel function \(K(\cdot )\) is a bounded symmetric density with compact support \([-M,M]\).

(C4) \(\sup _{y\in \mathbb {R}}\sup _{|v|\le M}\frac{f_{\varvec{\theta }_{0}}(y+v)}{f_{\varvec{\theta }_{0} }^{1/2}(y)}=O(1)\), \(\sup _{y\in \mathbb {R}}\sup _{|v|\le M}\frac{[f_{\varvec{\theta }_{0}}^{\prime \prime }(y+v)]^{2}}{f_{\varvec{\theta }_{0}}^{7/4}(y)}=O(1)\), and as \(L\rightarrow \infty \,\int _{|y|>L}\frac{\dot{S}_{\varvec{\theta }_{0} q}^{2}(y)}{f_{\varvec{\theta }_{0} }^{1/2}(y)}{\text{ d }}y\rightarrow 0\) for \(q=1,\ldots ,J(p+2)-1\), where \(f_{\varvec{\theta }} ^{\prime \prime }(y)=\frac{\partial ^{2}f_{\varvec{\theta }}(y)}{\partial y^{2}}\) and \(\dot{S}_{\varvec{\theta } q}(y)\) is the \(q^{th}\) entry of the vector \(\dot{S}_{\varvec{\theta }}(y)\).

(C5) The bandwidth \(h_{n}=b_{0}n^{-\gamma }\) for some \(\gamma \in (1/8,1/2)\) and constant \(b_{0}>0\). \(E|Y|^{s}<+\infty \) for \(s>6/(1-2\gamma )\). There exists some \(l,\,1/s<l<(1-2\gamma )/6,\) satisfying \(\sup _{|y|\le n^{l}}\sup _{|v|\le M}\frac{f_{\varvec{\theta }_{0}}(y+v)}{f_{\varvec{\theta }_{0} }(y)}=O(1)\), and as \(n\rightarrow \infty \)

$$\begin{aligned} (n^{1/2}h_{n})^{-1}\int _{|y|\le n^{l}}\frac{|\dot{g}_{jq}(y)|}{g_{j} (y)}{\text{ d }}y\rightarrow 0,\ \ \ (n^{1/2}h_{n})^{-1}\int _{|y|\le n^{l}}\frac{|\dot{g}_{j\gamma }(y)|}{g_{j}(y)}{\text{ d }}y\rightarrow 0, \end{aligned}$$

where \(g_{j}(y)=g_{j}(y,\beta _{j},\gamma _{j})=\int g(y,x^{T}\beta _{j} ,\gamma _{j}){\text{ d }}\eta (x)\); \(\dot{g}_{jq}(y)=\frac{\partial g_{j}(y)}{\partial \beta _{jq}}\) for \(j=1,\ldots ,J\), \(q=1,\ldots ,p\); and \(\dot{g}_{j\gamma }(y)=\frac{\partial g_{j}(y)}{\partial \gamma _{j}}\).

(C6) \(\int \frac{\int g^{4}(y,\mathbf {x}^{T}\varvec{\beta }_{j},\gamma _{j}){\text{ d }}\eta (x)}{g_{j}^{3}(y)}{\text{ d }}y<+\infty \); \(\int \frac{\int x_{r}^{2}\dot{g} _{z}^{2}(y,\mathbf {x}^{T}\varvec{\beta }_{j},\gamma _{j}){\text{ d }}\eta (x)}{g_{j} (y)}{\text{ d }}y<+\infty \) for \(j=1,\ldots ,J\), \(r=1,\ldots ,p\); and \(\int \frac{\int \dot{g}_{u}^{2}(y,\mathbf {x}^{T}\varvec{\beta }_{j},\gamma _{j}){\text{ d }}\eta (x)}{g_{j} (y)}{\text{ d }}y<+\infty .\)

Conditions (C1) and (C2) guarantee that (2.5) and (2.6) of Beran (1977), which concern an expansion of the first and second partial derivatives in a neighborhood of \(\varvec{\theta }_{0}\), hold. Condition (C3) is a typical assumption on kernels; it is satisfied, for example, by the family of symmetric beta kernel functions (Chen 1999). Conditions (C4)–(C6) are used to derive the asymptotic normality of the MHD estimators. When X is bounded and \(g(y,x^{T}\beta _{j},\gamma _{j})=\exp \{-(y-x^{T}\beta _{j})^{2}/(2\gamma _{j}^{2})\}/(\sqrt{2\pi }\gamma _{j})\), or X is a normal random variable and \(g(y,x,\beta _{j1},\beta _{j2},\gamma _{j})=\exp \{-[y-(\beta _{j1}+\beta _{j2}x)]^{2} /(2\gamma _{j}^{2})\}/(\sqrt{2\pi }\gamma _{j})\) for \(j=1,\ldots ,J\), the above conditions are satisfied; see Remark 3.4 of Tang and Karunamuni (2013) for details.
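
For instance, in the second case, if the covariate distribution \(\eta \) is taken to be standard normal (an illustrative choice, not one imposed in the paper), the component marginal \(g_{j}\) defined in (C5) is again a normal density:

$$\begin{aligned} g_{j}(y)=\int \frac{1}{\sqrt{2\pi }\gamma _{j}}\exp \left\{ -\frac{[y-(\beta _{j1}+\beta _{j2}x)]^{2}}{2\gamma _{j}^{2}}\right\} \frac{1}{\sqrt{2\pi }}e^{-x^{2}/2}{\text{ d }}x=\frac{1}{\sqrt{2\pi (\gamma _{j}^{2}+\beta _{j2}^{2})}}\exp \left\{ -\frac{(y-\beta _{j1})^{2}}{2(\gamma _{j}^{2}+\beta _{j2}^{2})}\right\} , \end{aligned}$$

so the integrability and tail requirements in (C4)–(C6) reduce to elementary Gaussian tail calculations.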

Lemma 1

Under the assumptions of Theorem 3, there exists a local maximizer \(\hat{\varvec{\theta }}\) of (9) such that \(\left\| \hat{\varvec{\theta }} -\varvec{\theta }_{0}\right\| =O_{p}(n^{-1/2})\).

Proof

Let

$$\begin{aligned} P_{n}(\varvec{\theta })=\sum _{j=1}^{J}\alpha _{j}^{1/2}\sum _{k=1}^{p} p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\beta _{jk}|,\ \ P_{n1} (\varvec{\theta })=\sum _{j=1}^{J}\alpha _{j}^{1/2}\sum _{k=1}^{d_{j}}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\beta _{jk}| \end{aligned}$$

and \(D_{n}(\varvec{\theta })=\int S_{\varvec{\theta },n}(y)\hat{f}_{n}^{1/2} (y){\text{ d }}y-P_{n}(\varvec{\theta })\). It suffices to prove that for any given \(\varepsilon >0\), there exists a constant C such that

$$\begin{aligned} P\left\{ \sup _{\Vert v\Vert =C}D_{n}(\varvec{\theta }_{0}+n^{-1/2}v)<D_{n} (\varvec{\theta }_{0})\right\} \ge 1-\varepsilon . \end{aligned}$$
(20)

Indeed, since \(D_{n}\) is continuous, (20) implies that, with probability at least \(1-\varepsilon \), \(D_{n}\) has a local maximum inside the ball \(\{\varvec{\theta }_{0}+n^{-1/2}v:\Vert v\Vert \le C\}\), that is, a local maximizer \(\hat{\varvec{\theta }}\) with \(\Vert \hat{\varvec{\theta }}-\varvec{\theta }_{0}\Vert =O_{p}(n^{-1/2})\). Note that

$$\begin{aligned} D_{n}(\varvec{\theta }_{0}+n^{-1/2}v)-D_{n}(\varvec{\theta }_{0})\le & {} \int [S_{\varvec{\theta }_{0}+n^{-1/2}v,n}(y)-S_{\varvec{\theta }_{0},n}(y)]\hat{f} _{n}^{1/2}(y){\text{ d }}y\nonumber \\&-[P_{n1}(\varvec{\theta }_{01}+n^{-1/2}v_{1})-P_{n1}(\varvec{\theta }_{01})]. \end{aligned}$$
(21)

By a Taylor expansion,

$$\begin{aligned}&\int [S_{\varvec{\theta }_{0}+n^{-1/2}v,n}(y)-S_{\varvec{\theta }_{0},n}(y)]\hat{f} _{n}^{1/2}(y){\text{ d }}y\\&\quad =n^{-1/2}v\int \dot{S}_{\varvec{\theta }_{0},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}y+\frac{1}{2n}v^{T}\int \ddot{S}_{\varvec{\theta }^{*},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}yv, \end{aligned}$$

where \(\varvec{\theta }^{*}\) is between \(\varvec{\theta }_{0}\) and \(\varvec{\theta }_{0}+n^{-1/2}v\). As in the proof of Lemma 3.1 of Tang and Karunamuni (2013), we have

$$\begin{aligned} \int [\ddot{S}_{\varvec{\theta }^{*},n}(y)-\ddot{S}_{\varvec{\theta }_{0},n} (y)]\hat{f}_{n}^{1/2}(y){\text{ d }}y\le \left( \int [\ddot{S}_{\varvec{\theta }^{*},n} (y)-\ddot{S}_{\varvec{\theta }_{0},n}(y)]^{2}{\text{ d }}y\right) ^{1/2}\left( \int \hat{f}_{n} (y){\text{ d }}y\right) ^{1/2}=o_{p}(1). \end{aligned}$$

Similar to the proof of Theorem 3.2 of Tang and Karunamuni (2013), we obtain

$$\begin{aligned} \int \ddot{S}_{\varvec{\theta }_{0},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}y=-H(\varvec{\theta }_{0} )+o_{p}(1), \end{aligned}$$

where \(H(\varvec{\theta }_{0})=-\int \ddot{S}_{\varvec{\theta }_{0}} (y)f_{\varvec{\theta }_{0}}^{1/2}(y){\text{ d }}y\). Hence

$$\begin{aligned}&\int [S_{\varvec{\theta }_{0}+n^{-1/2}v,n}(y)-S_{\varvec{\theta }_{0},n}(y)]\hat{f} _{n}^{1/2}(y){\text{ d }}y\nonumber \\&\quad =n^{-1/2}v\int \dot{S}_{\varvec{\theta }_{0},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}y-\frac{1}{2n}v^{T}H(\varvec{\theta }_{0})v[1+o_{p}(1)]. \end{aligned}$$
(22)

By (A.26) of Tang and Karunamuni (2013), it follows that \(\int \dot{S}_{\varvec{\theta }_{0},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}y=O_{p}(n^{-1/2})\). Since \(\varvec{\theta }^{(0)}\rightarrow _{P}\varvec{\theta }_{0}\), we have \(P\{P_{n1} (\varvec{\theta }_{01}+n^{-1/2}v_{1})-P_{n1}(\varvec{\theta }_{01})=0\}\rightarrow 1\) as \(n\rightarrow \infty \). Hence, for sufficiently large C, (20) follows from (21) and (22) and the fact that \(H(\varvec{\theta }_{0})\) is positive definite. The proof of Lemma 1 is complete. \(\square \)
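
To make the objective \(D_{n}(\varvec{\theta })\) appearing in these proofs concrete, the following is a minimal numerical sketch (ours, not the authors' implementation) of the penalized Hellinger affinity for Gaussian components. The Gaussian kernel, the grid-based integration, the SCAD derivative used as the weight \(p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)\), and all function names are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    # Derivative of the SCAD penalty (Fan and Li 2001); an illustrative choice of p_lambda.
    t = np.abs(np.asarray(t, dtype=float))
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def model_density(y_grid, X, alpha, beta, gamma):
    # f_{theta,n}(y): Gaussian-component FMR marginal, with eta replaced by the empirical law of X.
    dens = np.zeros_like(y_grid, dtype=float)
    for a_j, b_j, g_j in zip(alpha, beta, gamma):
        mu = X @ np.asarray(b_j)                       # component means x_i^T beta_j, shape (n,)
        z = (y_grid[:, None] - mu[None, :]) / g_j
        dens += a_j * np.exp(-0.5 * z ** 2).mean(axis=1) / (np.sqrt(2 * np.pi) * g_j)
    return dens

def kde(y_grid, y, h):
    # Kernel density estimate of the response density (hat f_n). A Gaussian kernel is used here
    # for simplicity, whereas condition (C3) assumes a compactly supported kernel.
    z = (y_grid[:, None] - np.asarray(y, dtype=float)[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (np.sqrt(2 * np.pi) * h)

def penalized_affinity(y, X, alpha, beta, gamma, beta0, lam, h, grid_size=400):
    # D_n(theta): Hellinger affinity between the model and the kernel estimate, minus the
    # adaptive penalty P_n(theta); both are approximated on an equally spaced grid.
    y = np.asarray(y, dtype=float)
    y_grid = np.linspace(y.min() - 3 * h, y.max() + 3 * h, grid_size)
    dy = y_grid[1] - y_grid[0]
    affinity = np.sum(np.sqrt(model_density(y_grid, X, alpha, beta, gamma) * kde(y_grid, y, h))) * dy
    penalty = sum(np.sqrt(a_j) * np.sum(scad_deriv(b0_j, l_j) * np.abs(np.asarray(b_j)))
                  for a_j, b_j, b0_j, l_j in zip(alpha, beta, beta0, lam))
    return affinity - penalty
```

Maximizing this quantity over \(\varvec{\theta }\) (for instance by a coordinate-wise search started from an initial estimate \(\varvec{\theta }^{(0)}\)) gives a crude numerical analogue of the estimator studied in Lemma 1; it is not a reproduction of the algorithm used in the paper.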

Lemma 2

Under the assumptions of Theorem 3, for any \(\varvec{\theta }=(\varvec{\theta }_{1}^{T},\varvec{\theta }_{2}^{T})^{T}\) such that \(\Vert \varvec{\theta }-\varvec{\theta }_{0}\Vert =O(n^{-1/2})\) and \(\varvec{\theta }_{2} \ne \varvec{0}\), with probability tending to 1, we have

$$\begin{aligned} D_{n}((\varvec{\theta }_{1},\varvec{\theta }_{2}))<D_{n}((\varvec{\theta }_{1},\varvec{0})). \end{aligned}$$

Proof

By a Taylor expansion, we obtain

$$\begin{aligned} S_{(\varvec{\theta }_{1},\varvec{\theta }_{2}),n}(y)=S_{(\varvec{\theta }_{1} ,\varvec{0}),n}(y)+\varvec{\theta }_{2}^{T}\frac{\partial S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }_{2}}\left| _{\varvec{\theta }=(\varvec{\theta }_{1} ,\varvec{0})}+\frac{1}{2}\varvec{\theta }_{2}^{T}\frac{\partial ^{2}S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }_{2}\partial \varvec{\theta }_{2}^{T}}\right| _{\varvec{\theta }=(\varvec{\theta }_{1},\varvec{\theta }_{2}^{*})}\varvec{\theta }_{2}, \end{aligned}$$

where \(\varvec{\theta }_{2}^{*}\) is between \(\varvec{0}\) and \(\varvec{\theta }_{2}\). As in the proof of (22), it follows that

$$\begin{aligned} \int \frac{\partial ^{2}S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2} \partial \varvec{\theta }_{2}^{T}}\left| _{\varvec{\theta }=(\varvec{\theta }_{1} ,\varvec{\theta }_{2}^{*})}\hat{f}_{n}^{1/2}(y){\text{ d }}y=\int \frac{\partial ^{2}S_{\varvec{\theta }}(y)}{\partial \varvec{\theta }_{2}\partial \varvec{\theta }_{2}^{T} }\right| _{\varvec{\theta }=\varvec{\theta }_{0}}f_{\varvec{\theta }_{0}} ^{1/2}(y){\text{ d }}y[1+o_{p}(1)]=O_{p}(1). \end{aligned}$$

By (A.26) of Tang and Karunamuni (2013), we have

$$\begin{aligned} \int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2}}\left| _{\varvec{\theta }=(\varvec{\theta }_{1},\varvec{0})}\hat{f}_{n}^{1/2}(y){\text{ d }}y=O_{p} (n^{-1/2})\right. . \end{aligned}$$

Using the fact that \(\Vert \varvec{\theta }_{2}\Vert =O(n^{-1/2})\) and \(n^{1/2}\lambda _{nj}\rightarrow +\infty \), we deduce that, with probability tending to 1,

$$\begin{aligned}&D_{n}((\varvec{\theta }_{1},\varvec{\theta }_{2}))-D_{n}((\varvec{\theta }_{1},\varvec{0}))\\&\quad =\int [S_{(\varvec{\theta }_{1},\varvec{\theta }_{2}),n}(y)-S_{(\varvec{\theta }_{1} ,\varvec{0}),n}(y)]\hat{f}_{n}^{1/2}(y){\text{ d }}y-\sum _{j=1}^{J}\alpha _{j}^{1/2} \sum _{k=d_{j}}^{p}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\beta _{jk}|\\&\quad =O_{p}(n^{-1/2})\sum _{j=1}^{J}\sum _{k=d_{j}}^{p}|\beta _{jk}|-\sum _{j=1} ^{J}\alpha _{j}^{1/2}\sum _{k=d_{j}}^{p}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\beta _{jk}|\\&\quad =\sum _{j=1}^{J}\lambda _{nj}\sum _{k=d_{j}}^{p}\left[ O_{p}((n^{1/2}\lambda _{nj} )^{-1})-\alpha _{j}^{1/2}\lambda _{nj}^{-1}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)\right] |\beta _{jk}|<0. \end{aligned}$$

This completes the proof of Lemma 2. \(\square \)
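
Remark. The sign of the final display in the proof of Lemma 2 can be made explicit for a concrete penalty. For instance, for the SCAD penalty of Fan and Li (2001) (one admissible choice of \(p_{\lambda }\), used here purely for illustration), \(p_{\lambda }^{\prime }(t)=\lambda \) for \(0<t\le \lambda \). If, in addition, the initial estimator \(\varvec{\theta }^{(0)}\) is \(\sqrt{n}\)-consistent, then \(\beta _{jk}^{(0)}=O_{p}(n^{-1/2})=o_{p}(\lambda _{nj})\) for the zero coefficients, since \(n^{1/2}\lambda _{nj}\rightarrow +\infty \); hence, with probability tending to 1,

$$\begin{aligned} \alpha _{j}^{1/2}\lambda _{nj}^{-1}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)=\alpha _{j}^{1/2}, \end{aligned}$$

which dominates the \(O_{p}((n^{1/2}\lambda _{nj})^{-1})\) term, so the bracketed factor in the last display is negative.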

Proof of Theorem 3

By Lemmas 1 and 2, there exists a \(\sqrt{n} \)-consistent local maximizer \(\check{\varvec{\theta }}=(\check{\varvec{\theta }}_{1},\varvec{0}^{T})^{T}\) of (9). By a Taylor expansion, with probability tending to 1, we have

$$\begin{aligned}&D_{n}((\hat{\varvec{\theta }}_{1},\hat{\varvec{\theta }}_{2}))\\&\quad =D_{n}((\check{\varvec{\theta }}_{1},\varvec{0}))+(\hat{\varvec{\theta }}-\check{\varvec{\theta }})^{T}\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }}\left| _{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)}\hat{f}_{n}^{1/2}(y){\text{ d }}y\right. \\&\qquad +\,\frac{1}{2}(\hat{\varvec{\theta }}-\check{\varvec{\theta }})^{T}\int \frac{\partial ^{2}S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}}\left| _{\varvec{\theta }=\varvec{\theta }^{*}}\hat{f} _{n}^{1/2}(y){\text{ d }}y(\hat{\varvec{\theta }}-\check{\varvec{\theta }})-\sum _{j=1}^{J} \alpha _{j}^{1/2}\sum _{k=d_{j}+1}^{p}p_{\lambda _{nj}}^{\prime }\left( |\beta _{jk}^{(0)}|\right) |\hat{\beta }_{jk}|,\right. \end{aligned}$$

where \(\varvec{\theta }^{*}\) is between \(\hat{\varvec{\theta }}\) and \(\check{\varvec{\theta }}\). By Theorem 2, it follows that \(\hat{\varvec{\theta }} \rightarrow _{P}\varvec{\theta }_{0}\). Using an argument similar to the one used in the proof of (22), we obtain \(\int \frac{\partial ^{2}S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}} _{\varvec{\theta }=\varvec{\theta }^{*}}\hat{f}_{n}^{1/2}(y){\text{ d }}y=-H(\varvec{\theta }_{0} )[1+o_{p}(1)]\). Noting that with probability tending to 1, \(\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}_{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)}\hat{f}_{n}^{1/2}(y){\text{ d }}y=0\), we have

$$\begin{aligned}&D_{n}((\hat{\varvec{\theta }}_{1},\hat{\varvec{\theta }}_{2}))\nonumber \\&\quad =D_{n}((\check{\varvec{\theta }}_{1},\varvec{0}))+\hat{\varvec{\theta }}_{2}^{T}\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2}}\left| _{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)}\hat{f}_{n}^{1/2}(y){\text{ d }}y\right. \nonumber \\&\qquad -\,\frac{1}{2}(\hat{\varvec{\theta }}-\check{\varvec{\theta }})^{T}H(\varvec{\theta }_{0} )(\hat{\varvec{\theta }}-\check{\varvec{\theta }})[1+o_{p}(1)]-\sum _{j=1}^{J} \alpha _{j}^{1/2}\sum _{k=d_{j}+1}^{p}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\hat{\beta }_{jk}|.\nonumber \\ \end{aligned}$$
(23)

Using a Taylor expansion, we have

$$\begin{aligned} \int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2}}\bigg |_{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)}\hat{f}_{n}^{1/2}(y){\text{ d }}y=\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2}}\bigg |_{\varvec{\theta }=\varvec{\theta }_{0}}\hat{f}_{n}^{1/2}(y){\text{ d }}y+H_{21}(\varvec{\theta }_{0})\check{\varvec{\theta }}_{1}[1+o_{p}(1)], \end{aligned}$$

where \(H_{21}(\varvec{\theta }_{0})=\int \frac{\partial ^{2}S_{\varvec{\theta }} (y)}{\partial \varvec{\theta }_{2}\partial \varvec{\theta }_{1}^{T}} _{\varvec{\theta }=\varvec{\theta }_{0}}f_{\varvec{\theta }_{0}}^{1/2}(y){\text{ d }}y\). By (A.26) of Tang and Karunamuni (2013), we have \(\int \frac{\partial S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }_{2}}\left| _{\varvec{\theta }=\varvec{\theta }_{0}} \hat{f}_{n}^{1/2}(y){\text{ d }}y=O_{p}(n^{-1/2})\right. \). Then \(\check{\varvec{\theta }} _{1}=O_{p}(n^{-1/2})\) implies that \(\int \frac{\partial S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }_{2}}_{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)} \hat{f}_{n}^{1/2}(y){\text{ d }}y=O_{p}(n^{-1/2})\). If \(\hat{\varvec{\theta }}\ne \check{\varvec{\theta }}\), then by (23), we have \(D_{n}((\hat{\varvec{\theta }} _{1},\hat{\varvec{\theta }}_{2}))<D_{n}((\check{\varvec{\theta }}_{1},\varvec{0}))\). This is a contradiction to the fact that \(\hat{\varvec{\theta }}\) is a maximizer of (10). So \(\hat{\varvec{\theta }}_{2}=0\) and \(\hat{\varvec{\theta }}_{1}=\check{\varvec{\theta }}_{1}\).

We now prove the asymptotic normality part. Consider \(D_{n}((\varvec{\theta }_{1},\varvec{0}))\) as a function of \(\varvec{\theta }_{1}\). Note that, with probability tending to 1, \(\hat{\varvec{\theta }}_{1}\) is the \(\sqrt{n}\)-consistent maximizer of \(D_{n}((\varvec{\theta }_{1},\varvec{0}))\) and satisfies

$$\begin{aligned} \frac{\partial D_{n}((\varvec{\theta }_{1},\varvec{0}))}{\partial \varvec{\theta }_{1} }\left| _{\varvec{\theta }_{1}=\hat{\varvec{\theta }}_{1}}=\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}\right| _{\varvec{\theta }=(\hat{\varvec{\theta }}_{1},\varvec{0})}\hat{f}_{n}^{1/2}(y){\text{ d }}y=0. \end{aligned}$$

By an argument similar to the one used in the proof of (22), we obtain

$$\begin{aligned} \int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}\left| _{\varvec{\theta }=(\hat{\varvec{\theta }}_{1},\varvec{0})}\hat{f}_{n}^{1/2} (y){\text{ d }}y=\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1} }\right| _{\varvec{\theta }={\varvec{\theta }}_{0}}\hat{f}_{n}^{1/2} (y){\text{ d }}y-H_{1}(\varvec{\theta }_{01})(\hat{\varvec{\theta }}_{1}-\varvec{\theta }_{01} )[1+o_{p}(1)]. \end{aligned}$$

Hence

$$\begin{aligned} H_{1}(\varvec{\theta }_{01})(\hat{\varvec{\theta }}_{1}-\varvec{\theta }_{01} )[1+o_{p}(1)]=\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}\left| _{\varvec{\theta }={\varvec{\theta }}_{0}}\hat{f} _{n}^{1/2}(y){\text{ d }}y\right. . \end{aligned}$$
(24)

Using an argument similar to the one used in the proof of Theorem 3.3 of Tang and Karunamuni (2013), we obtain

$$\begin{aligned} n^{1/2}\left( \int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}\left| _{\varvec{\theta }={\varvec{\theta }}_{0}}\hat{f} _{n}^{1/2}(y){\text{ d }}y-A_{n1}(\varvec{\theta }_{01})\right. \right) \rightarrow _{d}N\left( \varvec{0},\Sigma _{1}(\varvec{\theta }_{01})\right) . \end{aligned}$$

Now (11) follows from the preceding expression and (24). This completes the proof of Theorem 3. \(\square \)

Proof of Theorem 5

Note that

$$\begin{aligned}&2\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-\hat{f}_{n+n_{1} }^{1/2}(h_{n+n_{1}})\right\| ^{2}\nonumber \\&\quad \ge 2\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-f_{n}^{1/2} (\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}-2\left\| f_{n}^{1/2}(\hat{\varvec{\theta }} (h_{n}))-\hat{f}_{n+n_{1}}^{1/2}(h_{n+n_{1}})\right\| ^{2}.\nonumber \\ \end{aligned}$$
(25)

Since

$$\begin{aligned}&\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-\hat{f}_{n+n_{1} }^{1/2}(h_{n+n_{1}})\right\| ^{2}\\&\qquad +\,2\sum _{j=1}^{J}\hat{\alpha }_{j,n+n_{1}}^{1/2}\sum _{k=1}^{p}p_{\lambda _{(n+n_{1})j}}^{\prime }(|\beta _{jk,n+n_{1}}^{(0)}|)|\hat{\beta }_{jk,n+n_{1}}|\\&\quad \le \left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}} ^{1/2}(h_{n+n_{1}})\right\| ^{2}+2\sum _{j=1}^{J}\hat{\alpha }_{j,n}^{1/2}\sum _{k=1}^{p}p_{\lambda _{(n+n_{1})j}}^{\prime }(|\beta _{jk,n+n_{1}}^{(0)} |)|\hat{\beta }_{jk,n}|, \end{aligned}$$

we have

$$\begin{aligned}&\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-\hat{f}_{n+n_{1} }^{1/2}(h_{n+n_{1}})\right\| ^{2}\nonumber \\&\quad \le \left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}}^{1/2} (h_{n+n_{1}})\right\| ^{2}+\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }} (h_{n}))-f^{1/2}(\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}+2P_{n,n_{1}}^{*}.\nonumber \\ \end{aligned}$$
(26)

Clearly

$$\begin{aligned}&\left\| f_{n}^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}}^{1/2} (h_{n+n_{1}})\right\| ^{2}\nonumber \\&\quad \le \left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}}^{1/2} (h_{n+n_{1}})\right\| ^{2}+\left\| f_{n}^{1/2}(\hat{\varvec{\theta }}(h_{n} ))-f^{1/2}(\hat{\varvec{\theta }}(h_{n}))\right\| ^{2} \end{aligned}$$
(27)

and

$$\begin{aligned}&\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-f_{n}^{1/2} (\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}\nonumber \\&\ge \left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-f^{1/2}(\hat{\varvec{\theta }} (h_{n}))\right\| ^{2}-\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))\right. \nonumber \\&\left. -f^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))\right\| ^{2}-\left\| f_{n}^{1/2} (\hat{\varvec{\theta }}(h_{n}))-f^{1/2}(\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}. \end{aligned}$$
(28)

Combining (25)–(28), we conclude that

$$\begin{aligned} 4\left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}}^{1/2}(h_{n+n_{1}})\right\| ^{2}\ge \left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-f^{1/2} (\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}-\varepsilon _{n,n_{1}}. \end{aligned}$$

If \(\hat{\varvec{\theta }}(h_{n})\) breaks down, then \(\sup _{\#\varvec{V}_{n_{1} }=n_{1}}d(\hat{\varvec{\theta }}(h_{n}),\hat{\varvec{\theta }}(h_{n+n_{1}}))=\infty \). So by the definition of \(\delta ^{*}\), \(\left\| f^{1/2}(\hat{\varvec{\theta }} (h_{n+n_{1}}))-f^{1/2}(\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}\ge \delta ^{*}\). Hence, \(\left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\right. \left. \hat{f}_{n+n_{1}} ^{1/2}(h_{n+n_{1}})\right\| ^{2}\ge (\delta ^{*}-\varepsilon _{n,n_{1}})/4.\) The rest of the proof is similar to that of Tamura and Boos (1986) and is therefore omitted to save space. This completes the proof of Theorem 5. \(\square \)

Cite this article

Tang, Q., Karunamuni, R. J. Robust variable selection for finite mixture regression models. Ann Inst Stat Math 70, 489–521 (2018). https://doi.org/10.1007/s10463-017-0602-4
