Robust variable selection for finite mixture regression models

Tang, Qingguo; Karunamuni, R.  J.

doi:10.1007/s10463-017-0602-4

Robust variable selection for finite mixture regression models

Published: 25 February 2017

Volume 70, pages 489–521, (2018)
Cite this article

Annals of the Institute of Statistical Mathematics Aims and scope Submit manuscript

Qingguo Tang¹ &
R. J. Karunamuni²

670 Accesses
7 Citations
Explore all metrics

Abstract

Finite mixture regression (FMR) models are frequently used in statistical modeling, often with many covariates with low significance. Variable selection techniques can be employed to identify the covariates with little influence on the response. The problem of variable selection in FMR models is studied here. Penalized likelihood-based approaches are sensitive to data contamination, and their efficiency may be significantly reduced when the model is slightly misspecified. We propose a new robust variable selection procedure for FMR models. The proposed method is based on minimum-distance techniques, which seem to have some automatic robustness to model misspecification. We show that the proposed estimator has the variable selection consistency and oracle property. The finite-sample breakdown point of the estimator is established to demonstrate its robustness. We examine small-sample and robustness properties of the estimator using a Monte Carlo study. We also analyze a real data set.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Estimating power in (generalized) linear mixed models: An open introduction and tutorial in R

Article Open access 05 May 2021

Violating the normality assumption may be the lesser of two evils

Article Open access 07 May 2021

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC

Article 30 August 2016

References

Basu, A., Harris, I. R., Hjort, N. L., Jones, M. C. (1998). Robust and efficient estimation by minimizing a density power divergence. Biometrika, 85, 549–559.
Beran, R. (1977). Minimum Hellinger distance estimators for parametric models. Annals of Statistics, 5, 445–463.
Article MathSciNet MATH Google Scholar
Beran, R. (1978). An efficient and robust adaptive estimator of location. Annals of Statistics, 6, 292–313.
Article MathSciNet MATH Google Scholar
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
MATH Google Scholar
Broniatowski, M., Toma, A., Vajda, I. (2012). Decomposable pseudodistances and applications in statistical estimation. Journal of Statistical Planning and Inference, 142, 2574–2585.
Chen, S. X. (1999). Beta kernel estimators for density functions. Computational Statistics and Data Analysis, 31, 131–145.
Article MathSciNet MATH Google Scholar
Cutler, A., Cordero-Braña, O. I. (1996). Minimum Hellinger distance estimation for finite mixture models. Journal of the American Statistical Association, 91, 1716–1723.
Devlin, S. J., Gnandesikan, R., Kettenring, J. R. (1981). Robust Estimation of Dispersion Matrices and Principal Components. Journal of the American Statistical Association, 76, 354–362.
Devroye, L. P., Wagner, T. J. (1979). The $L_{1}$ convergence of kernel density estimates. Annals of Statistics, 7, 1136–1139.
Donoho, D. (1982). Breakdown properties of multivariate location estimators. Unpublished qualifying paper. Cambridge, Massachusetts, USA: Harvard University, Department of Statistics.
Donoho, D., Huber, P. (1983). The notion of breakdown point. In P. J. Bickel, K. A. Doksum, J. L. Hodges Jr. (Eds.), A Festschrift for E. L. Lehmann (pp. 157–184). Belmont, CA: Wadsworth.
Donoho, D. L., Liu, R. C. (1988). The “automatic” robustness of minimum distance functionals. Annals of Statistics, 16, 552–586.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In International Congress of Mathematicians, 3, 595–622.
Fan, J., Lv, J. (2011). Non-concave penalized likelihood with np-dimensionality. IEEE Transaction Information Theory, 57, 5467–5484.
Fan, J., Xue, L., Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics, 42, 819–849.
Frank, I., Friedman, J. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35, 109–135.
Friedman, J., Hastie, T., Höflinng, H., Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1, 302–332.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.
Hennig, C. (2000). Identifiability of models for clusterwise linear regression. Journal of Classification, 17, 273–296.
Article MathSciNet MATH Google Scholar
Jiang, W., Tanner, M. A. (1999). On the approximation rate of hierarchical mixtures-of-experts for generalized linear models. Machine Learning, 11, 1183–1198.
Karlis, D., Xekalaki, E. (2001). Robust inference for finite mixtures. Journal of Statistical Planning and Inference, 93, 93–115.
Karunamuni, R. J., Wu, J. (2011). One-step minimum Hellinger distance estimation. Computational Statistics and Data Analysis, 55, 3148–3164.
Khalili, A. (2010). New estimation and feature selection methods in mixture-of-experts models. The Canadian Journal of Statistics, 38, 519–539.
Article MathSciNet MATH Google Scholar
Khalili, A., Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102, 1025–1038.
Khalili, A., Lin, S. (2013). Regularization in finite mixture of regression models with diverging number of parameters. Biometrics, 69, 436–446.
Khalili, A., Chen, J., Lin, S. (2011). Feature selection in finite mixture of sparse normal linear models in high-dimensional feature space. Biostatistics, 12, 156–172.
Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 1–18.
Lindsay, B. G. (1994). Efficiency versus robustness: The case for minimum Hellinger distance and related methods. Annals of Statistics, 22, 1081–1114.
Article MathSciNet MATH Google Scholar
Lu, Z., Hui, Y. V., Lee, A. H. (2003). Minimum Hellinger distance estimation for finite mixtures of Poisson regression models and its applications. Biometrics, 59, 1016–1026.
Lv, J., Fan, J. (2009). A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics, 37, 3498–3528.
Markatou, M. (2000). Mixture models, robustness and the weighted likelihood methodology. Biometrics, 56, 483–486.
Article MATH Google Scholar
Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. Annals of Statistics, 4, 51–67.
Article MathSciNet MATH Google Scholar
McLachlan, G. J., Peel, D. (2000). Finite Mixture Models. New York: Wiley.
Pollard, D. (1981). Stong consistency of k-means clustering. Annals of Statistics, 9, 135–140.
Article MathSciNet MATH Google Scholar
Shen, L. Z. (1995). On optimal B-robust influence functions in semiparametric models. Annals of Statistics, 23, 968–989.
Article MathSciNet MATH Google Scholar
Städler, N., Bühlmann, P., van de Geer, S. (2010). $l_{1}$-penalization for mixture regression models. Test, 19, 209–256.
Tamura, R., Boos, D. D. (1986). Minimum Hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association, 81, 223–229.
Tang, Q., Karunamuni, R. J. (2013). Minimum distance estimation in a finite mixture regression model. Journal of Multivariate Analysis, 120, 185–204.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 267–288.
Titterington, D. M., Smith, A. F. M., Markov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.
Toma, A. (2008). Minimum Hellinger distance estimators for multivariate distributions from Johnson system. Journal of Statistical Planning and Inference, 138, 803–816.
Article MathSciNet MATH Google Scholar
van der Vaart, A. (1996). Efficient maximum likelihood estimation in semiparametric models. Annals of Statistics, 24, 862–878.
Article MathSciNet MATH Google Scholar
Wang, H., Li, G., Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business and Economic Statistics, 25, 347–355.
Wang, X., Jiang, Y., Huang, M., Zhang, H. (2013). Robust variable selection with exponential squared loss. Journal of the American Statistical Association, 108, 632–643.
Wu, J., Karunamuni, R. J. (2012). Efficient Hellinger distance estimates for semiparametric models. Journal of Multivariate Analysis, 107, 1–23.
Wu, J., Karunamuni, R. J. (2015). Profile Hellinger distance estimation. Statistics, 49(4), 711–740.
Wu, J., Karunamuni, R. J., Zhang, B. (2010). Minimum Hellinger distance estimation in a two-sample semiparametric model. Journal of Multivariate Analysis, 101, 1102–1122.
Zhang, C.-H. (2010). Nearly unbiased variable selection under mini-max concave penalty. Annals of Statistics, 38, 894–942.
Article MathSciNet MATH Google Scholar
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Article MathSciNet MATH Google Scholar
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67, 301–320.
Zou, H., Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1509–1533.
Zou, H., Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37, 1733–1751.

Download references

Acknowledgements

We wish to thank the Chief Editor, Professor Kenji Fukumizu, an Associate Editor, and two reviewers for their helpful comments and suggestions that led to substantial improvements in this paper. Q. Tang’s research was supported in part by the National Social Science Foundation of China (16BTJ019) and Jiangsu Natural Science Foundation of China (BK20151481). R.J. Karunamuni’s research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Author information

Authors and Affiliations

School of Economics and Management, Nanjing University of Science and Technology, Nanjing, 210094, China
Qingguo Tang
Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, T6G 2G1, Canada
R. J. Karunamuni

Authors

Qingguo Tang
View author publications
You can also search for this author in PubMed Google Scholar
R. J. Karunamuni
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to R. J. Karunamuni.

Appendix

In this “Appendix”, we list the conditions used in the theorems and outline the proofs of main results. For convenience of notation, we write $f_{\varvec{\theta }}(y)\ $for $f_{\varvec{\theta },\eta }(y)$ defined by (4). The following technical conditions are imposed:

(C1) $E\Vert \mathbf {X}\Vert ^{2}<+\infty $ and $\max _{1\le i\le n}\Vert X_{i}\Vert =O_{p}(1)$. $\int \sup _{t\in \Theta }|\frac{\partial f_{t} (y)}{\partial t}|{\text{ d }}y<+\infty $, where $|b|=\max _{1\le i\le k}|b_{i}|$ for a vector $b=(b_{1},\ldots ,b_{k})^{T}$. $\ddot{S}_{\varvec{\theta }}^{(l,m)}(y)\in L_{2}$ for $1\le l,m\le J(p+2)-1$ and $H(\varvec{\theta }_{0})=-\int \ddot{S}_{\varvec{\theta }_{0}}(y)f_{\varvec{\theta }_{0}}^{1/2}(y){\text{ d }}y$ is a positive definite matrix, where $\ddot{S}_{\varvec{\theta }}^{(l,m)}(y)$ denotes the $(l,m)^{th}$ component of $\ddot{S}_{\varvec{\theta }}(y)$, $\ddot{S} _{_{\varvec{\theta }}}(y)=\frac{\partial ^{2}S_{\varvec{\theta }}(y)}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}}$, and $S_{\varvec{\theta }} (y)=f_{\varvec{\theta }}^{1/2}(y)$.

(C2) The second and third continuous partial derivatives of g(y, z, u) exist w.r.t. y and $z,\,u,$ respectively. For a given $\tilde{L}\,>0$ and some $\epsilon $-neighborhood of $\theta ,B(\theta ,\epsilon ),$ define $\tilde{g}(y)=\inf _{\Vert x\Vert \le \tilde{L},t\in B(\theta ,\epsilon )}\min _{1\le j\le k}g(y,\mathbf {x}^{T}t_{j},u_{j}).$ Suppose that $1/\tilde{g}(y)$ is bounded on any compact subset of R and that, as $L\rightarrow \infty $,

$$\begin{aligned} \begin{array}{ll} \int _{|y|>L}\int _{\Vert x\Vert \le \tilde{L}}|x_{r}|\breve{g}_{z} (y,\mathbf {x}){\text{ d }}\eta (\mathbf {x}){\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\int _{\Vert x\Vert \le \tilde{L}}\breve{g}_{u}(y,\mathbf {x}){\text{ d }}\eta (\mathbf {x}){\text{ d }}y\rightarrow 0,\\ \int _{|y|>L}\int _{\Vert x\Vert \le \tilde{L}}g^{*}(y,\mathbf {x} ){\text{ d }}\eta (\mathbf {x}){\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}x_{r}^{2}(\dot{g}_{z}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)}{\text{ d }}y\rightarrow 0,\\ \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}(\dot{g}_{u}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)}{\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}x_{r}^{4}(\dot{g}_{z}^{*}(y,\mathbf {x}))^{4}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}^{3}(y)}{\text{ d }}y\rightarrow 0,\\ \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}(\dot{g}_{u}^{*}(y,\mathbf {x}))^{4}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}^{3}(y)}{\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}x_{r}^{2}x_{q}^{2}(\ddot{g}_{zz}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)} {\text{ d }}y\rightarrow 0,\\ \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}(\ddot{g}_{uu}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)}{\text{ d }}y\rightarrow 0, &{} \int _{|y|>L}\frac{\int _{\Vert x\Vert \le \tilde{L}}x_{r}^{2}(\ddot{g} _{zu}^{*}(y,\mathbf {x}))^{2}{\text{ d }}\eta (\mathbf {x})}{\tilde{g}(y)}{\text{ d }}y\rightarrow 0 \end{array} \end{aligned}$$

for $r,q=0,\ldots ,p$, where $x_{0}=1$, $\breve{g}_{z}(y,\mathbf {x})=\sup _{t\in \Theta }\max _{1\le j\le k}|\dot{g}_{z}(y,\mathbf {x}^{T}t_{j},u_{j})|$, $\breve{g}_{u}(y,\mathbf {x})=\sup _{t\in \Theta }\max _{1\le j\le k}|\dot{g} _{u}(y,\mathbf {x}^{T}t_{j},u_{j})|$,

$$\begin{aligned} g^{*}(y,\mathbf {x})=\sup _{t\in B(\theta ,\epsilon )}\max _{1\le j\le k}g(y,x^{T}t_{j},u_{j}),\ \ \dot{g}_{z}^{*}(y,x)=\sup _{t\in B(\theta ,\epsilon )}\max _{1\le j\le k}|\dot{g}_{z}(y,x^{T}t_{j},u_{j})|, \end{aligned}$$

$\ddot{g}_{zz}^{*}(y,\mathbf {x})=\sup _{t\in B(\theta ,\epsilon )}\max _{1\le j\le k}|\ddot{g}_{zz}(y,\mathbf {x}^{T}t_{j},u_{j})|$, $\dot{g}_{u}^{*}(y,\mathbf {x})$, $\ddot{g}_{zu}^{*}(y,\mathbf {x})$, and $\ddot{g} _{uu}^{*}(y,\mathbf {x})$ are defined in a similar fashion, $\dot{g} _{z}(y,z,u)=\frac{\partial g(y,z,u)}{\partial z},\dot{g}_{u}(y,z,u)=\frac{\partial g(y,z,u)}{\partial u}$, $\ddot{g}_{zz}(y,z,u)=\frac{\partial ^{2}g(y,z,u)}{\partial z^{2}},\ $and $\ddot{g}_{zu}(y,z,u)=\frac{\partial ^{2}g(y,z,u)}{\partial z\partial u}$.

(C3) The kernel function $K(\cdot )$ is a bounded symmetric density with compact support $[-M,M]$.

(C4) $\sup _{y\in \mathbb {R}}\sup _{|v|\le M}\frac{f_{\varvec{\theta }_{0}}(y+v)}{f_{\varvec{\theta }_{0} }^{1/2}(y)}=O(1)$, $\sup _{y\in \mathbb {R}}\sup _{|v|\le M}\frac{[f_{\varvec{\theta }_{0}}^{\prime \prime }(y+v)]^{2}}{f_{\varvec{\theta }_{0}}^{7/4}(y)}=O(1)$, and as $L\rightarrow \infty \,\int _{|y|>L}\frac{\dot{S}_{\varvec{\theta }_{0} q}^{2}(y)}{f_{\varvec{\theta }_{0} }^{1/2}(y)}{\text{ d }}y\rightarrow 0$ for $q=1,\ldots ,J(p+2)-1$, where $f_{\varvec{\theta }} ^{\prime \prime }(y)=\frac{\partial ^{2}f_{\varvec{\theta }}(y)}{\partial y^{2}}$ and $\dot{S}_{\varvec{\theta } q}(y)$ is the $q^{th}$ entry of the vector $\dot{S}_{\varvec{\theta }}(y)$.

(C5) The bandwidth $h_{n}=b_{0}n^{-\gamma }$ for some $\gamma \in (1/8,1/2)$ and constant $b_{0}>0$. $E|Y|^{s}<+\infty $ for $s>6/(1-2\gamma )$. There exists some $l,\,1/s<l<(1-2\gamma )/6,$ satisfying $\sup _{|y|\le n^{l}}\sup _{|v|\le M}\frac{f_{\varvec{\theta }_{0}}(y+v)}{f_{\varvec{\theta }_{0} }(y)}=O(1)$, and as $n\rightarrow \infty $

$$\begin{aligned} (n^{1/2}h_{n})^{-1}\int _{|y|\le n^{l}}\frac{|\dot{g}_{jq}(y)|}{g_{j} (y)}{\text{ d }}y\rightarrow 0,\ \ \ (n^{1/2}h_{n})^{-1}\int _{|y|\le n^{l}}\frac{|\dot{g}_{j\gamma }(y)|}{g_{j}(y)}{\text{ d }}y\rightarrow 0, \end{aligned}$$

where $g_{j}(y)=g_{j}(y,\beta _{j},\gamma _{j})=\int g(y,x^{T}\beta _{j} ,\gamma _{j}){\text{ d }}\eta (x)$; $\dot{g}_{jq}(y)=\frac{\partial g_{j}(y)}{\partial \beta _{jq}}$ for $j=1,\ldots ,J$, $q=1,\ldots ,p$; and $\dot{g}_{j\gamma }(y)=\frac{\partial g_{j}(y)}{\partial \gamma _{j}}$.

(C6) $\int \frac{\int g^{4}(y,\mathbf {x}^{T}\varvec{\beta }_{j},\gamma _{j}){\text{ d }}\eta (x)}{g_{j}^{3}(y)}{\text{ d }}y<+\infty $; $\int \frac{\int x_{r}^{2}\dot{g} _{z}^{2}(y,\mathbf {x}^{T}\varvec{\beta }_{j},\gamma _{j}){\text{ d }}\eta (x)}{g_{j} (y)}{\text{ d }}y<+\infty $ for $j=1,\ldots ,J$, $r=1,\ldots ,p$; and $\int \frac{\int \dot{g}_{u}^{2}(y,\mathbf {x}^{T}\varvec{\beta }_{j},\gamma _{j}){\text{ d }}\eta (x)}{g_{j} (y)}{\text{ d }}y<+\infty .$

Conditions (C1) and (C2) guarantee that (2.5) and (2.6) of Beran (1977) hold about an expansion of the first and second partial derivatives in some neighborhood of $\varvec{\theta }_{0}$. Condition (C3) is also a typical assumption on kernels, including the family of symmetric beta kernel functions (Chen 1999). Conditions (C4)–(C6) are used to derive the asymptotic normality of the MHD estimators. When X is bounded and $g(y,x^{T}\beta _{j},\gamma _{j})=\exp \{-(y-x^{T}\beta _{j})^{2}/(2\gamma _{j}^{2})\}/(\sqrt{2\pi }\gamma _{j})$, or X is a normal random variable and $g(y,x,\beta _{j1},\beta _{j2},\gamma _{j})=\exp \{-[y-(\beta _{j1}+\beta _{j2}x)]^{2} /(2\gamma _{j}^{2})\}/(\sqrt{2\pi }\gamma _{j})$ for $j=1,\ldots ,J$, the above conditions are satisfied, see Remarks 3.4 and 3.4 of Tang and Karunamuni (2013) for details.

Lemma 1

Under the assumptions of Theorem 3, there exists a local minimizer $\hat{\varvec{\theta }}$ of (9) such that $\left\| \hat{\varvec{\theta }} -\varvec{\theta }_{0}\right\| =O_{p}(n^{-1/2})$.

Proof

Let

$$\begin{aligned} P_{n}(\varvec{\theta })=\sum _{j=1}^{J}\alpha _{j}^{1/2}\sum _{k=1}^{p} p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\beta _{jk}|,\ \ P_{n1} (\varvec{\theta })=\sum _{j=1}^{J}\alpha _{j}^{1/2}\sum _{k=1}^{d_{j}}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\beta _{jk}| \end{aligned}$$

and $D_{n}(\varvec{\theta })=\int S_{\varvec{\theta },n}(y)\hat{f}_{n}^{1/2} (y){\text{ d }}y-P_{n}(\varvec{\theta })$. It suffices to prove that for any given $\varepsilon >0$, there exists a constant C such that

$$\begin{aligned} P\left\{ \sup _{\Vert v\Vert =C}D_{n}(\varvec{\theta }_{0}+n^{-1/2}v)<D_{n} (\varvec{\theta }_{0})\right\} \ge 1-\varepsilon . \end{aligned}$$

(20)

Note that

$$\begin{aligned} D_{n}(\varvec{\theta }_{0}+n^{-1/2}v)-D_{n}(\varvec{\theta }_{0})\le & {} \int [S_{\varvec{\theta }_{0}+n^{-1/2}v,n}(y)-S_{\varvec{\theta }_{0},n}(y)]\hat{f} _{n}^{1/2}(y){\text{ d }}y\nonumber \\&-[P_{n1}(\varvec{\theta }_{01}+n^{-1/2}v_{1})-P_{n1}(\varvec{\theta }_{01})]. \end{aligned}$$

(21)

By a Taylor expansion,

$$\begin{aligned}&\int [S_{\varvec{\theta }_{0}+n^{-1/2}v,n}(y)-S_{\varvec{\theta }_{0},n}(y)]\hat{f} _{n}^{1/2}(y){\text{ d }}y\\&\quad =n^{-1/2}v\int \dot{S}_{\varvec{\theta }_{0},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}y+\frac{1}{2n}v^{T}\int \ddot{S}_{\varvec{\theta }^{*},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}yv, \end{aligned}$$

where $\varvec{\theta }^{*}$ is between $\varvec{\theta }_{0}$ and $\varvec{\theta }_{0}+n^{-1/2}v$. As in the proof of Lemma 3.1 of Tang and Karunamuni (2013), we have

$$\begin{aligned} \int [\ddot{S}_{\varvec{\theta }^{*},n}(y)-\ddot{S}_{\varvec{\theta }_{0},n} (y)]\hat{f}_{n}^{1/2}(y){\text{ d }}y\le \left( \int [\ddot{S}_{\varvec{\theta }^{*},n} (y)-\ddot{S}_{\varvec{\theta }_{0},n}(y)]^{2}{\text{ d }}y\right) ^{1/2}\left( \int \hat{f}_{n} (y){\text{ d }}y\right) ^{1/2}=o_{p}(1). \end{aligned}$$

Similar to the proof of Theorem 3.2 of Tang and Karunamuni (2013), we obtain

$$\begin{aligned} \int \ddot{S}_{\varvec{\theta }_{0},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}y=-H(\varvec{\theta }_{0} )+o_{p}(1), \end{aligned}$$

where $H(\varvec{\theta }_{0})=-\int \ddot{S}_{\varvec{\theta }_{0}} (y)f_{\varvec{\theta }_{0}}^{1/2}(y){\text{ d }}y$. Hence

$$\begin{aligned}&\int [S_{\varvec{\theta }_{0}+n^{-1/2}v,n}(y)-S_{\varvec{\theta }_{0},n}(y)]\hat{f} _{n}^{1/2}(y){\text{ d }}y\nonumber \\&\quad =n^{-1/2}v\int \dot{S}_{\varvec{\theta }_{0},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}y-\frac{1}{2n}v^{T}H(\varvec{\theta }_{0})v[1+o_{p}(1)]. \end{aligned}$$

(22)

By (A.26) of Tang and Karunamuni (2013), it follows that $\int \dot{S}_{\varvec{\theta }_{0},n}(y)\hat{f}_{n}^{1/2}(y){\text{ d }}y=O_{p}(n^{-1/2})$. Since $\varvec{\theta }^{(0)}\rightarrow _{P}\varvec{\theta }_{0}$, we have $P\{P_{n1} (\varvec{\theta }_{01}+n^{-1/2}v_{1})-P_{n1}(\varvec{\theta }_{01})=0\}\rightarrow 1$ as $n\rightarrow \infty $. Hence, for sufficiently large C, (20) follows from (21) and (22) and the fact that $H(\varvec{\theta }_{0})$ is positive definite. The proof of Lemma 1 is complete. $\square $

Lemma 2

Under the assumptions of Theorem 3, for any $\varvec{\theta }=(\varvec{\theta }_{1}^{T},\varvec{\theta }_{2}^{T})^{T}$ such that $\Vert \varvec{\theta }-\varvec{\theta }_{0}\Vert =O(n^{-1/2})$ and $\varvec{\theta }_{2} \ne \varvec{0}$, with probability tending to 1, we have

$$\begin{aligned} D_{n}((\varvec{\theta }_{1},\varvec{\theta }_{2}))<D_{n}((\varvec{\theta }_{1},\varvec{0})). \end{aligned}$$

Proof

By a Taylor expansion, we obtain

$$\begin{aligned} S_{(\varvec{\theta }_{1},\varvec{\theta }_{2}),n}(y)=S_{(\varvec{\theta }_{1} ,\varvec{0}),n}(y)+\varvec{\theta }_{2}^{T}\frac{\partial S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }_{2}}\left| _{\varvec{\theta }=(\varvec{\theta }_{1} ,\varvec{0})}+\frac{1}{2}\varvec{\theta }_{2}^{T}\frac{\partial ^{2}S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }_{2}\partial \varvec{\theta }_{2}^{T}}\right| _{\varvec{\theta }=(\varvec{\theta }_{1},\varvec{\theta }_{2}^{*})}\varvec{\theta }_{2}, \end{aligned}$$

where $\varvec{\theta }_{2}^{*}$ is between $\varvec{0}$ and $\varvec{\theta }_{2}$. As in the proof of (22), it follows that

$$\begin{aligned} \int \frac{\partial ^{2}S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2} \partial \varvec{\theta }_{2}^{T}}\left| _{\varvec{\theta }=(\varvec{\theta }_{1} ,\varvec{\theta }_{2}^{*})}\hat{f}_{n}^{1/2}(y){\text{ d }}y=\int \frac{\partial ^{2}S_{\varvec{\theta }}(y)}{\partial \varvec{\theta }_{2}\partial \varvec{\theta }_{2}^{T} }\right| _{\varvec{\theta }=\varvec{\theta }_{0}}f_{\varvec{\theta }_{0}} ^{1/2}(y){\text{ d }}y[1+o_{p}(1)]=O_{p}(1). \end{aligned}$$

By (A.26) of Tang and Karunamuni (2013), we have

$$\begin{aligned} \int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2}}\left| _{\varvec{\theta }=(\varvec{\theta }_{1},\varvec{0})}\hat{f}_{n}^{1/2}(y){\text{ d }}y=O_{p} (n^{-1/2})\right. . \end{aligned}$$

Using the fact that $\Vert \varvec{\theta }_{2}\Vert =O(n^{-1/2})$ and $n^{1/2}\lambda _{nj}\rightarrow +\infty $, we deduce that with probability tending to 1, it holds that

$$\begin{aligned}&D_{n}((\varvec{\theta }_{1},\varvec{\theta }_{2}))-D_{n}((\varvec{\theta }_{1},\varvec{0}))\\&\quad =\int [S_{(\varvec{\theta }_{1},\varvec{\theta }_{2}),n}(y)-S_{(\varvec{\theta }_{1} ,\varvec{0}),n}(y)]\hat{f}_{n}^{1/2}(y){\text{ d }}y-\sum _{j=1}^{J}\alpha _{j}^{1/2} \sum _{k=d_{j}}^{p}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\beta _{jk}|\\&\quad =O_{p}(n^{-1/2})\sum _{j=1}^{J}\sum _{k=d_{j}}^{p}|\beta _{jk}|-\sum _{j=1} ^{J}\alpha _{j}^{1/2}\sum _{k=d_{j}}^{p}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\beta _{jk}|\\&\quad =\sum _{j=1}^{J}\lambda _{nj}\sum _{k=d_{j}}^{p}\left[ O_{p}((n^{1/2}\lambda _{nj} )^{-1})-\alpha _{j}^{1/2}\lambda _{nj}^{-1}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)\right] |\beta _{jk}|<0. \end{aligned}$$

This completes the proof of Lemma 2. $\square $

Proof of Theorem 3

By Lemmas 1 and 2, there exists a $\sqrt{n} $-consistent local maximizer $\check{\varvec{\theta }}=(\check{\varvec{\theta }} _{1},\varvec{0}^{T})^{T}$ of (9). By a Taylor expansion, with probability tending 1, we have

$$\begin{aligned}&D_{n}((\hat{\varvec{\theta }}_{1},\hat{\varvec{\theta }}_{2}))\\&\quad =D_{n}((\check{\varvec{\theta }}_{1},\varvec{0}))+(\hat{\varvec{\theta }}-\check{\varvec{\theta }})^{T}\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }}\left| _{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)}\hat{f}_{n}^{1/2}(y){\text{ d }}y\right. \\&\qquad +\,\frac{1}{2}(\hat{\varvec{\theta }}-\check{\varvec{\theta }})^{T}\int \frac{\partial ^{2}S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}}\left| _{\varvec{\theta }=\varvec{\theta }^{*}}\hat{f} _{n}^{1/2}(y){\text{ d }}y(\hat{\varvec{\theta }}-\check{\varvec{\theta }})-\sum _{j=1}^{J} \alpha _{j}^{1/2}\sum _{k=d_{k}+1}^{p}p_{\lambda _{nj}}^{\prime }\left( |\beta _{jk}^{(0)}|\right) |\hat{\beta }_{jk}|,\right. \end{aligned}$$

where $\varvec{\theta }^{*}$ is between $\hat{\varvec{\theta }}$ and $\check{\varvec{\theta }}$. By Theorem 2, it follows that $\hat{\varvec{\theta }} \rightarrow _{P}\varvec{\theta }_{0}$. Using an argument similar to the one used in the proof of (22), we obtain $\int \frac{\partial ^{2}S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }\partial \varvec{\theta }^{T}} _{\varvec{\theta }=\varvec{\theta }^{*}}\hat{f}_{n}^{1/2}(y){\text{ d }}y=-H(\varvec{\theta }_{0} )[1+o_{p}(1)]$. Noting that with probability tending to 1, $\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}_{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)}\hat{f}_{n}^{1/2}(y){\text{ d }}y=0$, we have

$$\begin{aligned}&D_{n}((\hat{\varvec{\theta }}_{1},\hat{\varvec{\theta }}_{2}))\nonumber \\&\quad =D_{n}((\check{\varvec{\theta }}_{1},\varvec{0}))+\hat{\varvec{\theta }}_{2}^{T}\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2}}\left| _{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)}\hat{f}_{n}^{1/2}(y){\text{ d }}y\right. \nonumber \\&\qquad -\,\frac{1}{2}(\hat{\varvec{\theta }}-\check{\varvec{\theta }})^{T}H(\varvec{\theta }_{0} )(\hat{\varvec{\theta }}-\check{\varvec{\theta }})[1+o_{p}(1)]-\sum _{j=1}^{J} \alpha _{j}^{1/2}\sum _{k=d_{k}+1}^{p}p_{\lambda _{nj}}^{\prime }(|\beta _{jk}^{(0)}|)|\hat{\beta }_{jk}|,\nonumber \\ \end{aligned}$$

(23)

Using a Taylor expansion, we have

$$\begin{aligned}&\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2}}\left| _{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)}\hat{f}_{n}^{1/2}(y){\text{ d }}y=\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{2}}\right| _{\varvec{\theta }=\varvec{\theta }_{0}}\nonumber \\&\quad \hat{f}_{n}^{1/2}(y){\text{ d }}y+H_{21}(\varvec{\theta }_{0} )\check{\varvec{\theta }}_{1}[1+o_{p}(1)], \end{aligned}$$

where $H_{21}(\varvec{\theta }_{0})=\int \frac{\partial ^{2}S_{\varvec{\theta }} (y)}{\partial \varvec{\theta }_{2}\partial \varvec{\theta }_{1}^{T}} _{\varvec{\theta }=\varvec{\theta }_{0}}f_{\varvec{\theta }_{0}}^{1/2}(y){\text{ d }}y$. By (A.26) of Tang and Karunamuni (2013), we have $\int \frac{\partial S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }_{2}}\left| _{\varvec{\theta }=\varvec{\theta }_{0}} \hat{f}_{n}^{1/2}(y){\text{ d }}y=O_{p}(n^{-1/2})\right. $. Then $\check{\varvec{\theta }} _{1}=O_{p}(n^{-1/2})$ implies that $\int \frac{\partial S_{\varvec{\theta },n} (y)}{\partial \varvec{\theta }_{2}}_{\varvec{\theta }=(\check{\varvec{\theta }}_{1},0)} \hat{f}_{n}^{1/2}(y){\text{ d }}y=O_{p}(n^{-1/2})$. If $\hat{\varvec{\theta }}\ne \check{\varvec{\theta }}$, then by (23), we have $D_{n}((\hat{\varvec{\theta }} _{1},\hat{\varvec{\theta }}_{2}))<D_{n}((\check{\varvec{\theta }}_{1},\varvec{0}))$. This is a contradiction to the fact that $\hat{\varvec{\theta }}$ is a maximizer of (10). So $\hat{\varvec{\theta }}_{2}=0$ and $\hat{\varvec{\theta }}_{1}=\check{\varvec{\theta }}_{1}$.

We now prove the asymptotic normality part. Consider $D_{n}((\varvec{\theta }_{1} ,\varvec{0}))$ as a function of $\varvec{\theta }_{1}$. Noting that with probability tending 1, $\hat{\varvec{\theta }}_{1}$ is the $\sqrt{n}$-consistent maximizer of $D_{n}((\varvec{\theta }_{1},\varvec{0}))$ and satisfies

$$\begin{aligned} \frac{\partial D_{n}((\varvec{\theta }_{1},\varvec{0}))}{\partial \varvec{\theta }_{1} }\left| _{\varvec{\theta }_{1}=\hat{\varvec{\theta }}_{1}}=\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}\right| _{\varvec{\theta }=(\hat{\varvec{\theta }}_{1},\varvec{0})}\hat{f}_{n}^{1/2}(y){\text{ d }}y=0. \end{aligned}$$

By an argument similar to the one used in the proof of (22), we obtain

$$\begin{aligned} \int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}\left| _{\varvec{\theta }=(\hat{\varvec{\theta }}_{1},\varvec{0})}\hat{f}_{n}^{1/2} (y){\text{ d }}y=\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1} }\right| _{\varvec{\theta }={\varvec{\theta }}_{0}}\hat{f}_{n}^{1/2} (y){\text{ d }}y-H_{1}(\varvec{\theta }_{01})(\hat{\varvec{\theta }}_{1}-\varvec{\theta }_{01} )[1+o_{p}(1)]. \end{aligned}$$

Hence

$$\begin{aligned} H_{1}(\varvec{\theta }_{01})(\hat{\varvec{\theta }}_{1}-\varvec{\theta }_{01} )[1+o_{p}(1)]=\int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}\left| _{\varvec{\theta }={\varvec{\theta }}_{0}}\hat{f} _{n}^{1/2}(y){\text{ d }}y\right. . \end{aligned}$$

(24)

Using an argument similar to the one used in the proof of Theorem 3.3 of Tang and Karunamuni (2013), we obtain

$$\begin{aligned} n^{1/2}\left( \int \frac{\partial S_{\varvec{\theta },n}(y)}{\partial \varvec{\theta }_{1}}\left| _{\varvec{\theta }={\varvec{\theta }}_{0}}\hat{f} _{n}^{1/2}(y){\text{ d }}y-A_{n1}(\varvec{\theta }_{01})\right. \right) \rightarrow _{d}N\left( \varvec{0},\Sigma _{1}(\varvec{\theta }_{01})\right) . \end{aligned}$$

Now (11) follows from the preceding expression and (24). This completes the proof of Theorem 3. $\square $

Proof of Theorem 5

Note that

$$\begin{aligned}&2\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-\hat{f}_{n+n_{1} }^{1/2}(h_{n+n_{1}})\right\| ^{2}\nonumber \\&\quad \ge 2\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-f_{n}^{1/2} (\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}-2\left\| f_{n}^{1/2}(\hat{\varvec{\theta }} (h_{n}))-\hat{f}_{n+n_{1}}^{1/2}(h_{n+n_{1}})\right\| ^{2}.\nonumber \\ \end{aligned}$$

(25)

Since

$$\begin{aligned}&\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-\hat{f}_{n+n_{1} }^{1/2}(h_{n+n_{1}})\right\| ^{2}\\&\qquad +\,2\sum _{j=1}^{J}\hat{\alpha }_{j,n+n_{1}}^{1/2}\sum _{k=1}^{p}p_{\lambda _{(n+n_{1})j}}^{\prime }(|\beta _{jk,n+n_{1}}^{(0)}|)|\hat{\beta }_{jk,n+n_{1}}|\\&\quad \le \left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}} ^{1/2}(h_{n+n_{1}})\right\| ^{2}+2\sum _{j=1}^{J}\hat{\alpha }_{j,n}^{1/2}\sum _{k=1}^{p}p_{\lambda _{(n+n_{1})j}}^{\prime }(|\beta _{jk,n+n_{1}}^{(0)} |)|\hat{\beta }_{jk,n}|, \end{aligned}$$

we have

$$\begin{aligned}&\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-\hat{f}_{n+n_{1} }^{1/2}(h_{n+n_{1}})\right\| ^{2}\nonumber \\&\quad \le \left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}}^{1/2} (h_{n+n_{1}})\right\| ^{2}+\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }} (h_{n}))-f^{1/2}(\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}+2P_{n,n_{1}}^{*}.\nonumber \\ \end{aligned}$$

(26)

Clearly

$$\begin{aligned}&\left\| f_{n}^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}}^{1/2} (h_{n+n_{1}})\right\| ^{2}\nonumber \\&\quad \le \left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}}^{1/2} (h_{n+n_{1}})\right\| ^{2}+\left\| f_{n}^{1/2}(\hat{\varvec{\theta }}(h_{n} ))-f^{1/2}(\hat{\varvec{\theta }}(h_{n}))\right\| ^{2} \end{aligned}$$

(27)

and

$$\begin{aligned}&\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-f_{n}^{1/2} (\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}\nonumber \\&\ge \left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-f^{1/2}(\hat{\varvec{\theta }} (h_{n}))\right\| ^{2}-\left\| f_{n+n_{1}}^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))\right. \nonumber \\&\left. -f^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))\right\| ^{2}-\left\| f_{n}^{1/2} (\hat{\varvec{\theta }}(h_{n}))-f^{1/2}(\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}. \end{aligned}$$

(28)

Combining (25)–(28), we conclude that

$$\begin{aligned} 4\left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\hat{f}_{n+n_{1}}^{1/2}(h_{n+n_{1} })\right\| ^{2}\ge \left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n+n_{1}}))-f^{1/2} (\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}-\varepsilon _{n,n_{1}} \end{aligned}$$

If $\hat{\varvec{\theta }}(h_{n})$ breaks down, then $\sup _{\#\varvec{V}_{n_{1} }=n_{1}}d(\hat{\varvec{\theta }}(h_{n}),\hat{\varvec{\theta }}(h_{n+n_{1}}))=\infty $. So by the definition of $\delta ^{*}$, $\left\| f^{1/2}(\hat{\varvec{\theta }} (h_{n+n_{1}}))-f^{1/2}(\hat{\varvec{\theta }}(h_{n}))\right\| ^{2}\ge \delta ^{*}$. Hence, $\left\| f^{1/2}(\hat{\varvec{\theta }}(h_{n}))-\right. \left. \hat{f}_{n+n_{1}} ^{1/2}(h_{n+n_{1}})\right\| ^{2}\ge (\delta ^{*}-\varepsilon _{n,n_{1}})/4.$ The rest of the proof is similar to that of Tamura and Boos (1986) and is therefore omitted to save space. This completes the proof of Theorem 5. $\square $

About this article

Cite this article

Tang, Q., Karunamuni, R. . Robust variable selection for finite mixture regression models. Ann Inst Stat Math 70, 489–521 (2018). https://doi.org/10.1007/s10463-017-0602-4

Download citation

Received: 18 May 2016
Revised: 29 September 2016
Published: 25 February 2017
Issue Date: June 2018
DOI: https://doi.org/10.1007/s10463-017-0602-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Robust variable selection for finite mixture regression models

Abstract

Access this article

Similar content being viewed by others

Estimating power in (generalized) linear mixed models: An open introduction and tutorial in R

Violating the normality assumption may be the lesser of two evils

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Appendix

Lemma 1

Proof

Lemma 2

Proof

Proof of Theorem 3

Proof of Theorem 5

About this article

Cite this article

Keywords

Navigation

Robust variable selection for finite mixture regression models

Abstract

Access this article

Similar content being viewed by others

Estimating power in (generalized) linear mixed models: An open introduction and tutorial in R

Violating the normality assumption may be the lesser of two evils

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Appendix

Appendix

Lemma 1

Proof

Lemma 2

Proof

Proof of Theorem 3

Proof of Theorem 5

About this article

Cite this article

Share this article

Keywords

Search

Navigation