Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data


The threshold regression model is an effective alternative to the Cox proportional hazards regression model when the proportional hazards assumption is not met. This paper considers variable selection for threshold regression. This model has separate regression functions for the initial health status and the speed of degradation in health. This flexibility is an important advantage when considering relevant risk factors for a complex time-to-event model where one needs to decide which variables should be included in the regression function for the initial health status, in the function for the speed of degradation in health, or in both functions. In this paper, we extend the broken adaptive ridge (BAR) method, originally designed for variable selection for one regression function, to simultaneous variable selection for both regression functions needed in the threshold regression model. We establish variable selection consistency of the proposed method and asymptotic normality of the estimator of non-zero regression coefficients. Simulation results show that our method outperformed threshold regression without variable selection and variable selection based on the Akaike information criterion. We apply the proposed method to data from an HIV drug adherence study in which electronic monitoring of drug intake is used to identify risk factors for non-adherence.



References

  1. Akaike H (1974) Stochastic theory of minimal realization. IEEE Trans Autom Control 19:667–674

  2. Cambiano V, Lampe FC, Rodger AJ, Smith CJ, Geretti AM, Lodwick RK, Puradiredja DI, Johnson M, Swaden L, Phillips AN (2010) Long-term trends in adherence to antiretroviral therapy from start of HAART. AIDS 24(8):1153–1162

  3. Candes E, Tao T (2007) The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann Stat 35(6):2313–2351

  4. Chen L, Huang JZ (2012) Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J Am Stat Assoc 107(500):1533–1545

  5. Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B 34:187–220

  6. Cox DR, Miller HD (1965) The theory of stochastic processes. Wiley, New York

  7. Dai L, Chen K, Sun Z, Liu Z, Li G (2018) Broken adaptive ridge regression and its asymptotic properties. J Multivariate Anal 168:334–351

  8. Denison JA, Packer C, Stalter RM, Banda H, Mercer S, Nyambe N, Katayamoyo P, Mwansa JK, McCarraher DR (2018) Factors related to incomplete adherence to antiretroviral therapy among adolescents attending three HIV clinics in the Copperbelt, Zambia. AIDS Behav 22(3):996–1005

  9. Donoho DL, Johnstone IM (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3):425–455

  10. Du P, Ma S, Liang H (2010) Penalized variable selection procedure for Cox models with semiparametric relative risk. Ann Stat 38(4):2092–2117

  11. Fan J (2005) A selective overview of nonparametric methods in financial econometrics. Stat Sci 20(4):317–357

  12. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

  13. Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sinica 20(1):101–148

  14. Frank IE, Friedman JH (1993) A statistical view of some chemometrics regression tools. Technometrics 35(2):109–135

  15. Frommlet F, Nuel G (2016) An adaptive ridge procedure for L0 regularization. PLoS ONE

  16. Glass TR, Battegay M, Cavassini M, De Geest S, Furrer H, Vernazza PL, Hirschel B, Bernasconi E, Rickenbach M, Günthard HF, Bucher HC (2010) Longitudinal analysis of patterns and predictors of changes in self-reported adherence to antiretroviral therapy: Swiss HIV Cohort Study. J Acquir Immune Defic Syndr 54(2):197–203

  17. Gulick RM, Wilkin TJ, Chen YQ, Landovitz RJ, Amico KR, Young AM, Richardson P, Marzinke MA, Hendrix CW, Eshleman SH, McGowan I, Cottle LM, Andrade A, Marcus C, Klingman KL, Chege W, Rinehart AR, Rooney JF, Andrew P, Salata RA, Magnus M, Farley JE, Liu A, Frank I, Ho K, Santana J, Stekler JD, McCauley M, Mayer KH (2017) Phase 2 study of the safety and tolerability of maraviroc-containing regimens to prevent HIV infection in men who have sex with men (HPTN 069/ACTG A5305). J Infect Dis 215(2):238–246

  18. Huang J, Ma S, Xie H, Zhang CH (2009) A group bridge approach for variable selection. Biometrika 96(2):339–355

  19. Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481–499

  20. Huang J, Liu L, Liu Y, Zhao X (2014) Group selection in the Cox model with a diverging number of covariates. Stat Sinica 24(4):1787–1810

  21. Kawaguchi ES, Suchard MA, Liu Z, Li G (2017) Scalable sparse Cox’s regression for large-scale survival data via broken adaptive ridge. arXiv:1712.00561

  22. Kim J, Sohn I, Jung SH, Kim S, Park C (2012) Analysis of survival data with group lasso. Commun Stat 41(9):1593–1605

  23. Lawson C (1961) Contribution to the theory of linear least maximum approximation. PhD thesis, University of California, Los Angeles

  24. Lee MLT, Whitmore GA (2006) Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary. Stat Sci 21(4):501–513

  25. Lee MLT, Whitmore GA (2010) Proportional hazards and threshold regression: their theoretical and practical connections. Lifetime Data Anal 16(2):196–214

  26. Mallows CL (1973) Some comments on \(C_p\). Technometrics 15(4):661–675

  27. Mittal S, Madigan D, Cheng JQ, Burd RS (2013) Large-scale parametric survival analysis. Stat Med 32(23):3955–3971

  28. Peng J, Zhu J, Bergamaschi A, Han W, Noh DY, Pollack JR, Wang P (2010) Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann Appl Stat 4(1):53–77

  29. Rothman AJ, Levina E, Zhu J (2010) Sparse multivariate regression with covariance estimation. J Comput Graph Stat 19(4):947–962

  30. Saegusa T, Lee MLT, Chen YQ (2020) Short- and long-term adherence patterns to antiretroviral drugs and prediction of time to non-adherence based on electronic drug monitoring devices

  31. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  32. Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39(5):1–13. doi:10.18637/jss.v039.i05

  33. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288

  34. Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 16(4):385–395

  35. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B 67(1):91–108

  36. van der Vaart A, Wellner JA (2000) Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. In: High dimensional probability, II (Seattle, WA, 1999), Progr. Probab., vol 47, Birkhäuser Boston, Boston, MA, pp 115–133

  37. van der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes. Springer series in statistics. Springer, New York

  38. Xiao T, Whitmore G, He X, Lee ML (2012) Threshold regression for time-to-event analysis: the stthreg package. Stata J 12(2):257–283

  39. Xiao T, Whitmore G, He X, Lee ML (2015) The R package threg to implement threshold regression models. J Stat Softw 66(8):1–16. doi:10.18637/jss.v066.i08

  40. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68(1):49–67

  41. Zhang CH (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942

  42. Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429

  43. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67(2):301–320


Author information



Corresponding author

Correspondence to Takumi Saegusa.




A. Proofs

Let \({{\varvec{\alpha }}}_{0,1} = (\alpha _{0,1,j})_{j=1}^{s_1}\in {\mathbb {R}}^{s_1}\), \({{\varvec{\beta }}}_{0,1}= (\beta _{0,1,j})_{j=1}^{s_2}\in {\mathbb {R}}^{s_2}\), and \({{\varvec{\theta }}} = ({{\varvec{\alpha }}}^t,{{\varvec{\beta }}}^t)^t = ({{\varvec{\alpha }}}_1^t,{{\varvec{\alpha }}}_2^t,{{\varvec{\beta }}}_1^t,{{\varvec{\beta }}}_2^t)^t\) with \({{\varvec{\alpha }}}_1= (\alpha _{1,j})_{j=1}^{s_1},{{\varvec{\alpha }}}_2= (\alpha _{2,j})_{j=1}^{p-s_1},{{\varvec{\beta }}}_1= (\beta _{1,j})_{j=1}^{s_2},{{\varvec{\beta }}}_2= (\beta _{2,j})_{j=1}^{p-s_2}\). Let \(r_n^{ (0)} = 1/\sqrt{n}+\lambda _{n,1}^{ (0)}/n+\lambda _{n,2}^{ (0)}/n\) and \(r_n = 1/\sqrt{n}+\lambda _{n,1}/n+\lambda _{n,2}/n\).

Lemma 1

The initial estimator \(\hat{{{\varvec{\theta }}}}^{ (0)}\) satisfies

$$\begin{aligned} \Vert \hat{{{\varvec{\theta }}}}^{ (0)}-{{\varvec{\theta }}}_0\Vert = O_P\left ( r_n^{ (0)} \right) . \end{aligned}$$


Proof

Let the criterion function for the initial estimator be

$$\begin{aligned} \mathbb {M}_n ({\varvec{\theta }})\equiv \left\{ -2\ell _n ({\varvec{\theta }})+ \lambda _{n,1}^{ (0)}\sum _{j=1}^p\alpha ^2_j + \lambda _{n,2}^{ (0)}\sum _{j=1}^p\beta _j^2\right\} . \end{aligned}$$

The initial estimator \(\hat{{\varvec{\theta }}}^{ (0)}\) minimizes \(\mathbb {M}_n ({\varvec{\theta }})\).
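As a rough illustration of a ridge-type initial estimator and the rate \(r_n^{(0)}\), the sketch below replaces the threshold regression likelihood with a toy model in which each coordinate has independent \(N(\theta_j,1)\) observations. This stand-in, the single tuning parameter `lam0`, and all function names are assumptions made for illustration only; they are not the model or the code used in the paper.

```python
import math
import random

def neg2_loglik(theta, z):
    """-2 * log-likelihood (up to a constant) of the toy N(theta_j, 1) model."""
    return sum((zij - t) ** 2 for t, col in zip(theta, z) for zij in col)

def ridge_criterion(theta, z, lam0):
    """Toy analogue of M_n: -2 * loglik + lam0 * sum of squared coefficients."""
    return neg2_loglik(theta, z) + lam0 * sum(t * t for t in theta)

def ridge_estimate(z, lam0):
    """Closed-form minimizer of the toy criterion: sum(z_j) / (n + lam0)."""
    return [sum(col) / (len(col) + lam0) for col in z]

random.seed(0)
theta0 = [1.0, -0.5, 0.0, 0.0]          # hypothetical true values, two of them zero
n, lam0 = 400, 2.0                      # hypothetical sample size and ridge parameter
z = [[random.gauss(t, 1.0) for _ in range(n)] for t in theta0]

theta_init = ridge_estimate(z, lam0)
rate = 1 / math.sqrt(n) + lam0 / n      # toy version of r_n^(0) with one penalty
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_init, theta0)))
print(theta_init, error, rate)
```

In this toy setting the estimation error is of the same order as `rate`, which is what Lemma 1 asserts for the actual model under Conditions (C1) and (C2).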

We claim that for any \(\epsilon >0\) there exists a constant \(M>0\) such that

$$\begin{aligned} P\left ( \inf _{{{\varvec{u}}}:\Vert {{\varvec{u}}}\Vert =M}\mathbb {M}_n ({\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}})>\mathbb {M}_n ({\varvec{\theta }}_0)\right) \ge 1-\epsilon . \end{aligned}$$

If the claim holds, the local minimizer \(\hat{{\varvec{\theta }}}^{ (0)}\) is in the ball \(\{{\varvec{\theta }}_0+r_n^{ (0)}{{\varvec{u}}}: \Vert {{\varvec{u}}}\Vert \le M\}\) with probability at least \(1-\epsilon\). This implies \(P\left ( \Vert \hat{{\varvec{\theta }}}^{ (0)}-{\varvec{\theta }}_0\Vert \le r_n^{ (0)}M\right) \ge 1-\epsilon\) and hence \(\Vert \hat{{\varvec{\theta }}}^{ (0)}-{\varvec{\theta }}_0\Vert = O_P (r_n^{ (0)})\).

We have

$$\begin{aligned}&\frac{1}{n}\mathbb {M}_n\left ( {\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}}\right) -\frac{1}{n}\mathbb {M}_n ({\varvec{\theta }}_0)\\&\quad =-\frac{2\ell _n\left ( {\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}}\right) }{n} + \frac{2\ell _n ({\varvec{\theta }}_0)}{n} +\frac{\lambda _{n,1}^{ (0)}}{n} \sum _{j=1}^p\left\{ \alpha _{0,j}+r_n^{ (0)}u_j\right\} ^2 -\frac{\lambda _{n,1}^{ (0)}}{n}\sum _{j=1}^p\alpha _{0,j}^2\\&\qquad + \frac{\lambda _{n,2}^{ (0)}}{n}\sum _{j=1}^p \left\{ \beta _{0,j}+r_n^{ (0)}u_{p+j}\right\} ^2- \frac{\lambda _{n,2}^{ (0)}}{n}\sum _{j=1}^p \beta _{0,j}^2\\&\quad \ge -\frac{2\ell _n\left ( {\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}}\right) }{n} + \frac{2\ell _n ({\varvec{\theta }}_0)}{n} +\frac{\lambda _{n,1}^{ (0)}}{n} \sum _{j=1}^{s_1}\left\{ \left ( \alpha _{0,j}+r_n^{ (0)}u_j\right) ^2-\alpha _{0,j}^2\right\} \\&\qquad + \frac{\lambda _{n,2}^{ (0)}}{n}\sum _{j=1}^{s_2} \left\{ \left ( \beta _{0,j}+r_n^{ (0)}u_{p+j}\right) ^2-\beta _{0,j}^2\right\} . \end{aligned}$$

Note that the class of functions \(\{-E\ddot{\ell }_1 ({\varvec{\theta }}) :{\varvec{\theta }}\in \varTheta \}\) is a Glivenko-Cantelli class by Lemma 2.6.15 of van der Vaart and Wellner [37] and the Glivenko-Cantelli preservation theorem of van der Vaart and Wellner [36] under Conditions (C1) and (C2). Since \(r_n^{ (0)}\rightarrow 0\), for any vector \({\varvec{\theta }}^*\) between \({\varvec{\theta }}_0\) and \({\varvec{\theta }}_0+r_n^{ (0)}{{\varvec{u}}}\) we have \(-\ddot{\ell }_n ({\varvec{\theta }}^*)/n=I ({\varvec{\theta }}_0) +o_P (1)\). Note also that the central limit theorem with Condition (C2) implies that \(n^{-1/2}\dot{\ell }_n ({\varvec{\theta }}_0)\) converges in distribution to a zero-mean normal vector \(Z\). Thus, a Taylor expansion, together with the fact that \(r_n^{ (0)}>n^{-1/2}\) and the Cauchy–Schwarz inequality, yields

$$\begin{aligned}&-\frac{2\ell _n\left ( {\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}}\right) }{n}+ \frac{2\ell _n ({\varvec{\theta }}_0) }{n} =-\frac{2r_n^{ (0)}{{\varvec{u}}}^t \dot{\ell }_n ({\varvec{\theta }}_0)}{n}- 2\left ( r_n^{ (0)}\right) ^2{{\varvec{u}}}^t\frac{\ddot{\ell }_n ({\varvec{\theta }}^*)}{n}{\varvec{u}}\\&\quad =-2r_n^{ (0)}n^{-1/2}\{{{\varvec{u}}}^tZ+o_P (1)\} +2\left ( r_n^{ (0)}\right) ^2{{\varvec{u}}}^t\{I ({\varvec{\theta }}_0)+o_P (1)\}{{\varvec{u}}}\\&\quad \ge -2\left ( r_n^{ (0)}\right) ^2\{|{{\varvec{u}}}^tZ|+|o_P (1)|\} +2\left ( r_n^{ (0)}\right) ^2{{\varvec{u}}}^t\{I ({\varvec{\theta }}_0)-|o_P (1)|\}{{\varvec{u}}}. \end{aligned}$$

Since \(I ({\varvec{\theta }}_0)\) is a strictly positive definite matrix, \({{\varvec{u}}}^tI ({\varvec{\theta }}_0){{\varvec{u}}}\ge c_1\Vert {{\varvec{u}}}\Vert ^2\) for some constant \(c_1>0\). Tightness of Z implies that there exists a large constant \(c_2=\Vert {{\varvec{u}}}\Vert >0\) such that for n sufficiently large,

$$\begin{aligned} P\left ( A\cap \{|{{\varvec{u}}}^tZ|> c_1\Vert {{\varvec{u}}}\Vert ^2/4\}\right) \le P\left ( A\cap \{\Vert Z\Vert >c_1c_2/4\}\right) \le \epsilon \end{aligned}$$

where \(A\) is the event that the sum of the two \(o_P (1)\) terms above is larger than \(c_1c_2^2/4\). Thus, the difference in the log-likelihoods above is at least \(\left ( r_n^{ (0)}\right) ^2c_1c_2^2/2\) with probability at least \(1-\epsilon\).

Next, we bound the difference in the penalty terms regarding \({{\varvec{\alpha }}}\) to obtain

$$\begin{aligned}&\frac{\lambda _{n,1}^{ (0)}}{n}\sum _{j=1}^{s_1} \left\{ 2r_n^{ (0)}\alpha _{0,j}u_j+ \left ( r_n^{ (0)}\right) ^2u_j^2\right\} \\&\quad \ge -2\frac{\lambda _{n,1}^{ (0)}}{n}r_n^{ (0)} \sqrt{s_1}\max _j|\alpha _{0,j}| \Vert {{\varvec{u}}}\Vert \ge -2\left ( r_n^{ (0)}\right) ^2\sqrt{s_1}\max _j|\alpha _{0,j}|c_2. \end{aligned}$$

Here we use the inequality \(\sum _{j=1}^pa_j \ge -\sum _{j=1}^p|a_j|\ge -\sqrt{p}\Vert {{\varvec{a}}}\Vert\) and the fact \(r_n^{ (0)}>\lambda _{n,1}^{ (0)}/n\). Similarly, we obtain

$$\begin{aligned} \frac{\lambda _{n,2}^{ (0)}}{n}\sum _{j=1}^{s_2}\left\{ 2r_n^{ (0)}\beta _{0,j}u_{p+j}+ \left ( r_n^{ (0)}\right) ^2u_{p+j}^2\right\} \ge -2 (r_n^{ (0)})^2\sqrt{s_2}\max _j|\beta _{0,j}|c_2. \end{aligned}$$

Because \({\varvec{\theta }}_0\) is bounded, we can choose a large constant \(M>c_2\) such that \(c_1M^2/2 >2 (\sqrt{s_1}+\sqrt{s_2})\max _j|{\varvec{\theta }}_{0,j}|M\).

Now replace \(c_2\) by M and follow the same argument to conclude that for n sufficiently large, \(\mathbb {M}_n\left ( {\varvec{\theta }}_0+r_n^{ (0)}{{\varvec{u}}}\right) - \mathbb {M}_n ({\varvec{\theta }}_0)\) is strictly positive for all \({{\varvec{u}}}\) satisfying \(\Vert {{\varvec{u}}}\Vert =M\) with probability at least \(1-\epsilon\). Hence the claim is proved. □

To analyze the broken adaptive ridge estimator, we study the criterion function in each step of optimization. Let

$$\begin{aligned} {\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}) = -2\ell _n ({{\varvec{\eta }}}) + {{\varvec{\eta }}}^tD ({{\varvec{\theta }}}){{\varvec{\eta }}} \end{aligned}$$

where \({{\varvec{\eta }}} \in {\mathbb {R}}^{2p}\) and \(D ({{\varvec{\theta }}})\) is a block diagonal matrix with diagonal blocks

$$\begin{aligned}&D_1 ({{\varvec{\theta }}}) =\lambda _{n,1}\mathrm {diag} (|\alpha _1|^{-2},\ldots ,|\alpha _{s_1}|^{-2} ),\\&D_2 ({{\varvec{\theta }}}) =\lambda _{n,1}\mathrm {diag} (|\alpha _{s_1+1}|^{-2},\ldots ,|\alpha _{p}|^{-2} ),\\&D_3 ({{\varvec{\theta }}}) =\lambda _{n,2}\mathrm {diag} (|\beta _1|^{-2},\ldots ,|\beta _{s_2}|^{-2} ),\\&D_4 ({{\varvec{\theta }}}) =\lambda _{n,2}\mathrm {diag} (|\beta _{s_2+1}|^{-2},\ldots ,|\beta _{p}|^{-2} ). \end{aligned}$$

Note that \({\mathbb {Q}}_n\left ( {{\varvec{\theta }}}\left| \hat{{{\varvec{\theta }}}}_n^{ (k-1)}\right. \right)\) is the criterion function in the kth step. Viewing \({\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})\) as a function of \({{\varvec{\eta }}}\), the first and second derivatives are

$$\begin{aligned} \dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}) = -2{\dot{\ell }}_n ({{\varvec{\eta }}}) +2D ({{\varvec{\theta }}}) {{\varvec{\eta }}},\quad \ddot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}) = -2\ddot{\ell }_n ({{\varvec{\eta }}}) +2D ({{\varvec{\theta }}}). \end{aligned}$$
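The iteration that minimizes \({\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})\) at each step can be sketched in a toy model where the score equation has a closed-form solution. The sketch assumes independent \(N(\eta_j,1)\) observations per coordinate, so that \(-2\dot{\ell}_n\) has entries \(-2(\sum_i z_{ij}-n\eta_j)\). The model, the single tuning parameter `lam`, the truncation level `tol`, and the function names are all hypothetical choices for illustration, not the authors' implementation.

```python
import random

def q_gradient(eta, theta, sums, n, lam):
    """Toy gradient of Q_n(eta | theta): -2 * score(eta) + 2 * D(theta) * eta,
    with D(theta) = lam * diag(theta_j ** -2)."""
    return [-2.0 * (s - n * e) + 2.0 * lam * e / t ** 2
            for e, t, s in zip(eta, theta, sums)]

def bar_step(theta, sums, n, lam, tol=1e-10):
    """Solve the gradient equation coordinatewise. Coordinates that are
    numerically zero are kept at zero, mirroring the convention that zero
    elements of theta map to zero elements of the solution."""
    return [0.0 if abs(t) < tol else s / (n + lam / t ** 2)
            for t, s in zip(theta, sums)]

random.seed(1)
theta0 = [1.0, -0.5, 0.0, 0.0]          # hypothetical truth, last two are zero
n, lam = 400, 5.0                       # hypothetical sample size and penalty
sums = [sum(random.gauss(t, 1.0) for _ in range(n)) for t in theta0]

theta = [s / (n + 1.0) for s in sums]   # ridge-type starting value
for _ in range(50):
    theta = bar_step(theta, sums, n, lam)
print(theta)
```

In this toy run the coordinates whose true value is zero are driven to exactly zero, while the remaining coordinates settle near their true values with a small shrinkage bias.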

Lemma 2

Let

$$\begin{aligned} {\mathcal {M}}_n= & {} \left\{ {{\varvec{\theta }}}:m_0\le |\alpha _{1j}|\le M_0,j=1,\ldots ,s_1,m_0\le |\beta _{1j}|\le M_0,\right. \\&\ \left. j=1,\ldots ,s_2,\Vert {{\varvec{\alpha }}}_2\Vert \le \delta _nn^{-1/2},\Vert {{\varvec{\beta }}}_2\Vert \le \delta _nn^{-1/2}\right\} \end{aligned}$$

where \(m_0=\min _k \{|\alpha _{0,1,k}|,|\beta _{0,1,k}|\}/2\), \(M_0=2\max _k\{|\alpha _{0,1,k}|,|\beta _{0,1,k}|\}\), and \(\delta _n = O (n^{1/4-\epsilon })\). Let \(g ({{\varvec{\theta }}}) = (g_1 ({{\varvec{\theta }}})^t,g_2 ({{\varvec{\theta }}})^t,g_3 ({{\varvec{\theta }}})^t, g_4 ({{\varvec{\theta }}})^t)^t,g_1 ({{\varvec{\theta }}})\in {\mathbb {R}}^{s_1}, g_2 ({{\varvec{\theta }}})\in {\mathbb {R}}^{p-s_1},g_3 ({{\varvec{\theta }}}) \in {\mathbb {R}}^{s_2},g_4 ({{\varvec{\theta }}})\in {\mathbb {R}}^{p-s_2}\) be solutions to \(\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})=0\) (when some elements of \({{\varvec{\theta }}}\) are zero, the corresponding elements of \(g ({{\varvec{\theta }}})\) are set to zero and the equation should be understood to be with respect to the remaining elements). Then the following holds with probability tending to 1 as \(n\rightarrow \infty\) for some \(c>0\):

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n} \frac{\Vert g_2 ({{\varvec{\theta }}})\Vert }{\Vert {{\varvec{\alpha }}}_2\Vert }\le c<1,\quad \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n} \frac{\Vert g_4 ({{\varvec{\theta }}})\Vert }{\Vert {{\varvec{\beta }}}_2\Vert }\le c<1,\quad \end{aligned}$$


$$\begin{aligned} (g_1 ({{\varvec{\theta }}})^t,g_2 ({{\varvec{\theta }}})^t,g_3 ({{\varvec{\theta }}})^t, g_4 ({{\varvec{\theta }}})^t)^t\in {\mathcal {M}}_n. \end{aligned}$$


Proof

Without loss of generality, we assume that all elements of \({{\varvec{\theta }}}\) are non-zero. Let \(m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\in {\mathbb {R}}^{p-s_1}\) be the vector obtained by the element-wise division of \(g_2 ({{\varvec{\theta }}})\) by \({{\varvec{\alpha }}}_2\). Because \({{\varvec{\theta }}}\in {\mathcal {M}}_n\) implies \(\max _j|\alpha _{s_1+j}|\le \delta _n/n^{1/2}\), we have

$$\begin{aligned} \Vert D_2 ({{\varvec{\theta }}})g_2 ({{\varvec{\theta }}})\Vert \ge \frac{\lambda _{n,1}n^{1/2}}{\delta _n} \Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert . \end{aligned}$$

To analyze \(D_2 ({{\varvec{\theta }}})g_2 ({{\varvec{\theta }}})\) we apply the Taylor expansion of \(\dot{{\mathbb {Q}}}_n (g ({{\varvec{\theta }}})|{{\varvec{\theta }}})\) around \({{\varvec{\theta }}}_0\) to obtain

$$\begin{aligned} 0=\dot{{\mathbb {Q}}}_n (g ({{\varvec{\theta }}})|{{\varvec{\theta }}}) =\dot{{\mathbb {Q}}}_n ({{\varvec{\theta }}}_0|{{\varvec{\theta }}}) +\ddot{{\mathbb {Q}}}_n ({{\varvec{\theta }}}^*|{{\varvec{\theta }}}) \left\{ g ({{\varvec{\theta }}})-{{\varvec{\theta }}}_0\right\} \end{aligned}$$

where \({{\varvec{\theta }}}^*\) is a convex combination of \({{\varvec{\theta }}}_0\) and \(g ({{\varvec{\theta }}})\). We can rewrite this in terms of \({\dot{\ell }}_n\), \(\ddot{\ell }_n\), and D by

$$\begin{aligned} \left\{ -2\ddot{\ell }_n ({{\varvec{\theta }}}^*) +2D ({{\varvec{\theta }}})\right\} \left\{ g ({{\varvec{\theta }}})-{{\varvec{\theta }}}_0\right\} = 2{\dot{\ell }}_n ({{\varvec{\theta }}}_0) -2D ({{\varvec{\theta }}}) {{\varvec{\theta }}}_0. \end{aligned}$$

Rearrangement yields

$$\begin{aligned} \frac{1}{n}{\dot{\ell }}_n ({{\varvec{\theta }}}_0)= & {} -\frac{\ddot{\ell }_n ({{\varvec{\theta }}}^*)}{n}\left\{ g ({{\varvec{\theta }}}) - {{\varvec{\theta }}}_0\right\} + \frac{1}{n}D ({{\varvec{\theta }}})g ({{\varvec{\theta }}}). \end{aligned}$$

Since the central limit theorem yields \({\dot{\ell }}_n ({{\varvec{\theta }}}_0)/n=O_P (n^{-1/2})\), the triangle inequality yields

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\left\Vert \frac{1}{n} D_2 ({{\varvec{\theta }}})g_2 ({{\varvec{\theta }}})\right\Vert \le O_P (n^{-1/2})+ \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\left\Vert \frac{\ddot{\ell }_n ({{\varvec{\theta }}}^*)}{n}g_2 ({{\varvec{\theta }}}) \right\Vert . \end{aligned}$$

Since each element of \(I ({{\varvec{\theta }}})\) is bounded and continuous in \({{\varvec{\theta }}}\in \varTheta\), the Glivenko–Cantelli theorem together with the dominated convergence theorem yields

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\left\Vert \frac{\ddot{\ell }_n ({{\varvec{\theta }}})}{n} \right\Vert =O_P (1) \end{aligned}$$

where \(\Vert \cdot \Vert\) is the operator norm for a matrix. Thus, if we show that

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert g_2 ({{\varvec{\theta }}}) \Vert =O_P (n^{-1/2}), \end{aligned}$$

it follows from (7), (8), and the definition of \(O_P (n^{-1/2})\) that with probability tending to 1 we have

$$\begin{aligned} \frac{\lambda _{n,1}n^{1/2}}{n\delta _n} \Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \le \delta _nn^{-1/2}. \end{aligned}$$

Since \(\delta _n^2/\lambda _{n,1}\rightarrow 0\), we have that with probability tending to 1,

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \le c<1. \end{aligned}$$


Since

$$\begin{aligned} \Vert g_2 ({{\varvec{\theta }}})\Vert \le \Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \max _{s_1+1\le j\le p}|\alpha _j| \le \Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \Vert {{\varvec{\alpha }}}_2\Vert , \end{aligned}$$

we obtain

$$\begin{aligned} P\left ( \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\frac{\Vert g_2 ({{\varvec{\theta }}})\Vert }{\Vert {{\varvec{\alpha }}}_2\Vert }<c\right) \rightarrow 1. \end{aligned}$$
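The shrinkage bound on \(\Vert g_2 ({{\varvec{\theta }}})\Vert /\Vert {{\varvec{\alpha }}}_2\Vert\) can be checked numerically in a toy setting. The sketch below again assumes independent \(N(\eta_j,1)\) observations with true mean zero for the null block, so that \(g_{2,j}=s_j/(n+\lambda/\alpha_j^2)\) with \(s_j\) the coordinate sum. The values of `n`, `lam`, and `delta_n`, and the random search over the ball \(\Vert {{\varvec{\alpha }}}_2\Vert \le \delta_n n^{-1/2}\), are hypothetical choices; the code illustrates the statement and does not prove it.

```python
import math
import random

def g2(alpha2, sums2, n, lam):
    """Null-block solution of the toy gradient equation for a given alpha_2."""
    return [s / (n + lam / a ** 2) for a, s in zip(alpha2, sums2)]

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

random.seed(2)
n, lam = 400, 20.0                      # hypothetical sample size and penalty
delta_n = n ** 0.2                      # hypothetical delta_n = n^(1/4 - 0.05)
radius = delta_n / math.sqrt(n)         # bound on ||alpha_2|| inside M_n
sums2 = [sum(random.gauss(0.0, 1.0) for _ in range(n)) for _ in range(3)]

worst = 0.0
for _ in range(2000):
    direction = [random.gauss(0.0, 1.0) for _ in sums2]
    # Random point in the ball, kept away from the origin to avoid 0/0.
    scale = radius * (0.05 + 0.95 * random.random()) / norm(direction)
    alpha2 = [scale * d for d in direction]
    worst = max(worst, norm(g2(alpha2, sums2, n, lam)) / norm(alpha2))
print(worst)
```

In this toy model each coordinate ratio is bounded by \(|s_j|/(2\sqrt{n\lambda})\), so the largest ratio found stays below 1 whenever the coordinate sums are of their typical order \(\sqrt{n}\).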

Now we show (9). Let \(g_{j,k} ({{\varvec{\theta }}}), j=1,2,3,4,\) be the kth element of \(g_j ({{\varvec{\theta }}})\). Because the likelihood is not concave, \(g ({{\varvec{\theta }}})\) may take multiple values. With probability tending to 1, we can choose one solution \({\tilde{g}} ({{\varvec{\theta }}}_0)\) to \(\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}_0)=0\) such that \({\tilde{g}} ({{\varvec{\theta }}}_0)-{{\varvec{\theta }}}_0=O_P (r_n)\). Because \({\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})\) is continuous in \({{\varvec{\eta }}}\) and \({{\varvec{\theta }}}\), there is a continuous map \({\mathcal {M}}_n\mapsto \{g ({{\varvec{\theta }}}):{{\varvec{\theta }}}\in {\mathcal {M}}_n\}\) that passes through \({\tilde{g}} ({{\varvec{\theta }}}_0)\). To see this, we show that \({{\varvec{\theta }}}\rightarrow {{\varvec{\theta }}}'\) implies \({\tilde{g}}_2 ({{\varvec{\theta }}})\rightarrow {\tilde{g}}_2 ({{\varvec{\theta }}}').\) Since \(\dot{{\mathbb {Q}}}_n ({\tilde{g}} ({{\varvec{\theta }}})|{{\varvec{\theta }}}) =\dot{{\mathbb {Q}}}_n ({\tilde{g}} ({{\varvec{\theta }}}')|{{\varvec{\theta }}}')=0\), the Taylor expansion yields

$$\begin{aligned}&\frac{\ddot{\ell }_n ({{\varvec{\theta }}}^*)}{n}\left\{ {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\right\} =\frac{1}{n}D_2 ({{\varvec{\theta }}}'){\tilde{g}}_2 ({{\varvec{\theta }}}') - \frac{1}{n}D_2 ({{\varvec{\theta }}}){\tilde{g}}_2 ({{\varvec{\theta }}}) \end{aligned}$$

where \({{\varvec{\theta }}}^*\) is some convex combination of \({\tilde{g}}_2 ({{\varvec{\theta }}})\) and \({\tilde{g}}_2 ({{\varvec{\theta }}}').\) Since \({\tilde{g}} ({{\varvec{\theta }}})\rightarrow _P{{\varvec{\theta }}}_0, {\tilde{g}} ({{\varvec{\theta }}}')\rightarrow _P{{\varvec{\theta }}}_0\) and \(I ({{\varvec{\theta }}}_0)\) is strictly positive definite, it follows from the Glivenko-Cantelli theorem that the following inequality holds almost surely:

$$\begin{aligned} \{c_1-o (1)\} \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\Vert \le \left\Vert -\frac{\ddot{\ell }_n ({{\varvec{\theta }}}^*)}{n}\left\{ {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\right\} \right\Vert \end{aligned}$$

for some constant \(c_1>0\). The right-hand side of (10) is bounded above by

$$\begin{aligned} \frac{1}{n}\Vert D_2 ({{\varvec{\theta }}}') -D_2 ({{\varvec{\theta }}})\Vert \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')\Vert + \frac{1}{n}\Vert D_2 ({{\varvec{\theta }}}) \Vert \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\Vert . \end{aligned}$$

Since \({{\varvec{\theta }}}\in {\mathcal {M}}_n\) and \(\lambda _{n,1}/n\rightarrow 0\), \(\Vert D_2 ({{\varvec{\theta }}})\Vert /n =o (1)\). Thus, we have almost surely

$$\begin{aligned} \{c_1-o (1)\} \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\Vert \le \frac{1}{n}\Vert D_2 ({{\varvec{\theta }}}') -D_2 ({{\varvec{\theta }}})\Vert \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')\Vert . \end{aligned}$$

If every element of \({{\varvec{\alpha }}}_2'\) is non-zero, then \(\Vert D_2 ({{\varvec{\theta }}}') -D_2 ({{\varvec{\theta }}})\Vert \rightarrow 0\) as \({{\varvec{\theta }}}\rightarrow {{\varvec{\theta }}}'\), and hence \(\Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\Vert \rightarrow 0\). Suppose the jth element \(\alpha _{2,j}'\) of \({{\varvec{\alpha }}}'_2\) is zero. Let \({\tilde{g}}_{2,j} ({{\varvec{\theta }}})\) be the jth element of \({\tilde{g}}_{2} ({{\varvec{\theta }}})\). Recall that \(\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}) =-2{\dot{\ell }}_n ({{\varvec{\eta }}}) +2D ({{\varvec{\theta }}}){{\varvec{\eta }}}\). Since \({\dot{\ell }}_n ({{\varvec{\theta }}})\) is uniformly bounded in \({{\varvec{\theta }}}\), \(D ({{\varvec{\theta }}})g ({{\varvec{\theta }}})\) must be uniformly bounded. That is, \(\lambda _{n,1}{\tilde{g}}_{2,j} ({{\varvec{\theta }}})/\alpha _{2,j}^2\) is bounded by a constant that does not depend on \({{\varvec{\theta }}}\). Thus, \(\alpha _{2,j}\rightarrow 0=\alpha _{2,j}'\) implies \({\tilde{g}}_{2,j} ({{\varvec{\theta }}})\rightarrow 0 = {\tilde{g}}_{2,j} ({{\varvec{\theta }}}')\) as expected. The case where the jth element \(\alpha _{2,j}'\) is non-zero reduces to the previous case where every element of \({{\varvec{\alpha }}}_2'\) is non-zero. Now, the continuity of \({\tilde{g}}_2\) and the compactness of \({\mathcal {M}}_n\) imply that the image of \({\tilde{g}}_2\) is compact. This image can be covered by finitely many balls with radius \(o (n^{-1/2})\) and centers \({\tilde{g}}_2 ({{\varvec{\theta }}}^{ (j)}),j=1,\ldots ,J,\) each of which is \(O_P (r_n)=O_P (n^{-1/2})\). Then for any \({{\varvec{\theta }}}\in {\mathcal {M}}_n\) the triangle inequality gives

$$\begin{aligned} \Vert {\tilde{g}}_2 ({{\varvec{\theta }}})\Vert \le o (n^{-1/2}) + O_P (n^{-1/2}) = O_P (n^{-1/2}). \end{aligned}$$

If some solution \(g_2 ({{\varvec{\theta }}})\) is not included in \({\tilde{g}}_2 ({{\varvec{\theta }}})\), one can construct another such function and repeat the argument above. This proves (9).

The case for \(\sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert g_4 ({{\varvec{\theta }}})\Vert /\Vert {{\varvec{\beta }}}_2\Vert\) is similar. For \(g_1 ({{\varvec{\theta }}})\) and \(g_3 ({{\varvec{\theta }}})\) it can be shown in the same way as above that \(g_1\) and \(g_3\) (with appropriate modification into functions) are continuous. Because an argument similar to that in the proof of Lemma 1 implies

$$\begin{aligned} \Vert g_1 ({{\varvec{\theta }}})-{{\varvec{\alpha }}}_1\Vert =O_P (r_n) = O_P (n^{-1/2}),\quad \Vert g_3 ({{\varvec{\theta }}})-{{\varvec{\beta }}}_1\Vert = O_P (n^{-1/2}), \end{aligned}$$

an argument similar to that used for bounding \(\sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert {\tilde{g}}_2 ({{\varvec{\theta }}})\Vert\) yields

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert g_1 ({{\varvec{\theta }}})-{{\varvec{\alpha }}}_1\Vert \le O_P (n^{-1/2})\rightarrow _P0, \quad \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert g_3 ({{\varvec{\theta }}}) -{{\varvec{\beta }}}_1\Vert \le O_P (n^{-1/2})\rightarrow _P0. \end{aligned}$$

This proves that with probability tending to 1, \(g ({{\varvec{\theta }}})\in {\mathcal {M}}_n\). □

We will show with the help of Lemma 2 that \(\hat{{{\varvec{\alpha }}}}_2=0\) and \(\hat{{{\varvec{\beta }}}}_2=0\) with probability tending to 1. Then the analysis of \(\hat{{{\varvec{\theta }}}}\) reduces to the analysis of the oracle estimator based on the model where \({{\varvec{\alpha }}}_2=0\) and \({{\varvec{\beta }}}_2=0\). Let \({{\varvec{\theta }}}^{o} = ({{\varvec{\alpha }}}_1^t,{{\varvec{\beta }}}^t_1)^t \in {\mathbb {R}}^{s_1+s_2}\) and \({{\varvec{\theta }}}_0^{o} = ({{\varvec{\alpha }}}_{0,1}^t,{{\varvec{\beta }}}^t_{0,1})^t\). With abuse of notation, let \(\ell _n ({{\varvec{\theta }}}^o)\) be the likelihood evaluated at \(({{\varvec{\alpha }}}_1^t,{\varvec{0}}^t,{{\varvec{\beta }}}^t_1,{\varvec{0}}^t)^t\) with the first and second derivatives \({\dot{\ell }}_n ({{\varvec{\theta }}}^o)\) and \(\ddot{\ell }_n ({{\varvec{\theta }}}^o)\) with respect to \({{\varvec{\theta }}}^o\) and the corresponding Fisher information matrix \(I ({{\varvec{\theta }}}^o)\). The criterion functions for the oracle estimator are

$$\begin{aligned}&{\mathbb {M}}_n ({{\varvec{\theta }}}^o)= -2\ell _n ({{\varvec{\theta }}}^o)+ \lambda _{n,1}^{ (0)} \sum _{j=1}^{s_1}\alpha ^2_{1,j} + \lambda _{n,2}^{ (0)}\sum _{j=1}^{s_2}\beta _{1,j}^2,\\&{\mathbb {Q}}_n ({{\varvec{\eta }}}^o|{{\varvec{\theta }}}^o) = -2\ell _n ({{\varvec{\eta }}}^o) + {{\varvec{\alpha }}}_1^tD_1 ({{\varvec{\theta }}}^o){{\varvec{\alpha }}}_1 +{{\varvec{\beta }}}_1^tD_3 ({{\varvec{\theta }}}^o){{\varvec{\beta }}}_1, \end{aligned}$$


$$\begin{aligned}&D_1 ({{\varvec{\theta }}}^o) =\lambda _{n,1} \mathrm {diag} (|\alpha _{1,1}|^{-2},\ldots ,|\alpha _{1,s_1}|^{-2}),\\&D_3 ({{\varvec{\theta }}}^o) =\lambda _{n,2} \mathrm {diag} (|\beta _{1,1}|^{-2},\ldots ,|\beta _{1,s_2}|^{-2}),\\&\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}^o|{{\varvec{\theta }}}^o) = -2{\dot{\ell }}_n ({{\varvec{\eta }}}^o) +2\mathrm {diag}\left ( D_1 ({{\varvec{\theta }}}^o) ,D_3 ({{\varvec{\theta }}}^o)\right) ( {{\varvec{\alpha }}}_1^t,{{\varvec{\beta }}}^t_1)^t,\\&\ddot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}^o|{{\varvec{\theta }}}^o) = -2\ddot{\ell }_n ({{\varvec{\eta }}}^o) +2\mathrm {diag}\left ( D_1 ({{\varvec{\theta }}}^o),D_3 ({{\varvec{\theta }}}^o)\right) . \end{aligned}$$

Lemma 3 Let


$$\begin{aligned} {\mathcal {M}}_n^o= \left\{ {{\varvec{\theta }}}^o:m_0\le \min _j|\alpha _{1j}|\le \max _j|\alpha _{1j}|\le M_0,m_0\le \min _j|\beta _{1j}|\le \max _j|\beta _{1j}|\le M_0\right\} . \end{aligned}$$

Let \(h ({{\varvec{\theta }}}^o) = (h_1 ({{\varvec{\theta }}}^o)^t, h_3 ({{\varvec{\theta }}}^o)^t)^t\), with \(h_1 ({{\varvec{\theta }}}^o)\in {\mathbb {R}}^{s_1}\) and \(h_3 ({{\varvec{\theta }}}^o)\in {\mathbb {R}}^{s_2}\), be the solution to \(\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}^o|{{\varvec{\theta }}}^o)=0\). Then the map \(h\) on \({\mathcal {M}}_n^o\) is a contraction map with probability tending to 1, with a unique fixed point \(\tilde{{{\varvec{\theta }}}}^o\) satisfying

$$\begin{aligned} \sqrt{n} (\tilde{{{\varvec{\theta }}}}^o-{{\varvec{\theta }}}_0^o)\rightarrow _d Z\sim N\left ( 0,I^{-1} ({{\varvec{\theta }}}_0^o)\right) . \end{aligned}$$


Proof An argument similar to that in the proof of Lemma 2 shows that \(h\) maps \({\mathcal {M}}_n^o\) into itself. Since \(\dot{{\mathbb {Q}}}_n (h ({{\varvec{\theta }}}^o)|{{\varvec{\theta }}}^o)=\dot{{\mathbb {Q}}}_n (h ({{\varvec{\theta }}'}^o)|{{\varvec{\theta }}'}^o)=0\) for \({{\varvec{\theta }}}^o\ne {{\varvec{\theta }}'}^o\), an argument similar to that in the proof of Lemma 2 yields

$$\begin{aligned} \{c+o (1)\}\Vert h_1 ({{\varvec{\theta }}}^o)-h_1 ({{\varvec{\theta }}'}^o)\Vert \le \frac{1}{n}\left\Vert D_1 ({{\varvec{\theta }}'}^o) -D_1 ({{\varvec{\theta }}}^o)\right\Vert \left\Vert h_1 ({{\varvec{\theta }}}^o) \right\Vert \end{aligned}$$

for some constant \(c>0\). By a Taylor expansion of the function \(y=1/x^2\), the right-hand side of this inequality satisfies

$$\begin{aligned} \frac{1}{n}\left\Vert D_1 ({{\varvec{\theta }}'}^o) -D_1 ({{\varvec{\theta }}}^o)\right\Vert \left\Vert h_1 ({{\varvec{\theta }}}^o) \right\Vert \le \frac{2\lambda _{n,1}\left\Vert h_1 ({{\varvec{\theta }}}^o) \right\Vert }{m_0^{3}n} \Vert {{\varvec{\theta }}}^o-{{\varvec{\theta }}'}^o\Vert . \end{aligned}$$
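
To make the Taylor step explicit, here is a short worked bound. It is a sketch that assumes the matrix norm is the spectral or the Frobenius norm (the text does not say which), and \(a\) and \(b\) are generic real numbers introduced only for this illustration. For \(m_0\le |a|\) and \(m_0\le |b|\),

$$\begin{aligned} \left| \frac{1}{a^2}-\frac{1}{b^2}\right| = \frac{\left| |b|-|a|\right| (|a|+|b|)}{a^2b^2} = \left| |b|-|a|\right| \left( \frac{1}{|a|b^2}+\frac{1}{a^2|b|}\right) \le \frac{2|a-b|}{m_0^{3}}, \end{aligned}$$

where the last step uses \(\left| |b|-|a|\right| \le |a-b|\). Taking \(a\) and \(b\) to be the \(j\)th entries of the \({{\varvec{\alpha }}}_1\)-parts of \({{\varvec{\theta }}}^o\) and \({{\varvec{\theta }}'}^o\) for each diagonal entry, and multiplying by \(\lambda _{n,1}\), gives \(\Vert D_1 ({{\varvec{\theta }}'}^o) -D_1 ({{\varvec{\theta }}}^o)\Vert \le 2\lambda _{n,1}m_0^{-3}\Vert {{\varvec{\theta }}}^o-{{\varvec{\theta }}'}^o\Vert\), which is the factor appearing in the bound above.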

Since

$$\begin{aligned} h_1 ({{\varvec{\theta }}}^o) =h_1 ({{\varvec{\theta }}}^o)-{{\varvec{\alpha }}}_1+{{\varvec{\alpha }}}_1 = O_P (r_n)+ O (1), \end{aligned}$$

we have

$$\begin{aligned} \frac{2\lambda _{n,1}\left\Vert h_1 ({{\varvec{\theta }}}^o) \right\Vert }{m_0^{3}n} \le \frac{2\lambda _{n,1}}{m_0^{3}n} (O_P (r_n)+O (1)) \rightarrow _P0. \end{aligned}$$

Thus, the map \(h_1\) is a contraction map with probability tending to 1. The case for \(h_3\) is similar. Thus, it follows from the fixed point theorem that with probability tending to 1 there is a unique fixed point \(\tilde{{{\varvec{\theta }}}}^o\) such that \(h (\tilde{{{\varvec{\theta }}}}^o)=\tilde{{{\varvec{\theta }}}}^o\).

As in the proof of Lemma 2, a Taylor expansion of the first-order condition \(\dot{{\mathbb {Q}}}_n (h (\tilde{{{\varvec{\theta }}}}^o)|\tilde{{{\varvec{\theta }}}}^o)=0\) around \({{\varvec{\theta }}}_0^o\) yields

$$\begin{aligned} -\frac{1}{n}\ddot{\ell }_n ({{\varvec{\theta }}}^{*o})\sqrt{n} (\tilde{{{\varvec{\theta }}}}^o-{{\varvec{\theta }}}_0^o) = \frac{1}{n^{1/2}}{\dot{\ell }}_n ({{\varvec{\theta }}}_0^o) -\frac{1}{n^{1/2}}D_1 (\tilde{{{\varvec{\theta }}}}^o)h_1 (\tilde{{{\varvec{\theta }}}}^o) -\frac{1}{n^{1/2}}D_3 (\tilde{{{\varvec{\theta }}}}^o)h_3 (\tilde{{{\varvec{\theta }}}}^o) \end{aligned}$$

where \({{\varvec{\theta }}}^{*o}\) is some convex combination of \(\tilde{{{\varvec{\theta }}}}^o\) and \({{\varvec{\theta }}}_0^o\). Since \(h (\tilde{{{\varvec{\theta }}}}^o)-{{\varvec{\theta }}}_0^o=O_P (r_n)=o_P (1)\), \(\lambda _{n,1}/\sqrt{n}\rightarrow 0\) and \(\lambda _{n,2}/\sqrt{n}\rightarrow 0\), the second and third terms on the right-hand side of the last display are \(o_P (1)\). Since \(-\ddot{\ell }_n (\tilde{{{\varvec{\theta }}}}^o)/n\rightarrow _PI ({{\varvec{\theta }}}^o_0)\) as in the proof of Lemma 2, the central limit theorem applied to \({\dot{\ell }}_n ({{\varvec{\theta }}}^o_0)/\sqrt{n}\) together with the Slutsky theorem yields

$$\begin{aligned} \sqrt{n} (\tilde{{{\varvec{\theta }}}}^o-{{\varvec{\theta }}}_0^o)\rightarrow _d Z\sim N\left ( 0,I^{-1} ({{\varvec{\theta }}}_0^o)\right) . \end{aligned}$$
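
To see why the penalty terms in the Taylor expansion vanish, a short worked step may help. It is a sketch under two assumptions: that \(m_0\) is a fixed positive constant, and that the fixed point \(\tilde{{{\varvec{\theta }}}}^o\) lies in \({\mathcal {M}}_n^o\), as it must when \(h\) maps \({\mathcal {M}}_n^o\) into itself. Write \(\tilde{\alpha }_{1,j}\) for the \(j\)th entry of the \({{\varvec{\alpha }}}_1\)-part of \(\tilde{{{\varvec{\theta }}}}^o\) (notation introduced here for illustration). Since \(h (\tilde{{{\varvec{\theta }}}}^o)=\tilde{{{\varvec{\theta }}}}^o\), the \(j\)th entry of \(n^{-1/2}D_1 (\tilde{{{\varvec{\theta }}}}^o)h_1 (\tilde{{{\varvec{\theta }}}}^o)\) is

$$\begin{aligned} \frac{\lambda _{n,1}}{\sqrt{n}}\,\frac{\tilde{\alpha }_{1,j}}{|\tilde{\alpha }_{1,j}|^{2}} = \frac{\lambda _{n,1}}{\sqrt{n}\,\tilde{\alpha }_{1,j}}, \qquad \left| \frac{\lambda _{n,1}}{\sqrt{n}\,\tilde{\alpha }_{1,j}}\right| \le \frac{\lambda _{n,1}}{\sqrt{n}\,m_0}\rightarrow 0. \end{aligned}$$

The same computation with \(\lambda _{n,2}\) and the \({{\varvec{\beta }}}_1\)-part handles the term involving \(D_3\), so only the score term contributes to the limit distribution.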

Proof of Theorems 1 and 2 Since the initial estimator \(\hat{{{\varvec{\theta }}}}^{ (0)}\) is in \({\mathcal {M}}_n\) with probability tending to 1 by Lemma 1, it follows from Lemma 2 that

$$\begin{aligned} \Vert \hat{{{\varvec{\alpha }}}}_2^{ (k)}\Vert = \frac{\Vert \hat{{{\varvec{\alpha }}}}_2^{ (k)}\Vert }{\Vert \hat{{{\varvec{\alpha }}}}_2^{ (k-1)}\Vert } \times \ldots \times \frac{\Vert \hat{{{\varvec{\alpha }}}}_2^{ (1)}\Vert }{\Vert \hat{{{\varvec{\alpha }}}}_2^{ (0)}\Vert } \times \Vert \hat{{{\varvec{\alpha }}}}_2^{ (0)}\Vert \rightarrow 0 \end{aligned}$$

as \(k\rightarrow \infty\) so that \(\hat{{{\varvec{\alpha }}}}_2 = {\varvec{0}}\) and, similarly, \(\hat{{{\varvec{\beta }}}}_2 = {\varvec{0}}\) with probability tending to 1. This together with the consistency of \(\hat{{{\varvec{\alpha }}}}_1\) and \(\hat{{{\varvec{\beta }}}}_1\) proves the model selection consistency of \(\hat{{{\varvec{\theta }}}}\).

Since an argument similar to that in the proof of Lemma 2 shows that the map \(g\) from \({\mathcal {M}}_n^0\) to itself is continuous, \(\hat{{{\varvec{\alpha }}}}_2^{ (k)}\rightarrow {\varvec{0}}\) and \(\hat{{{\varvec{\beta }}}}_2^{ (k)}\rightarrow {\varvec{0}}\) imply

$$\begin{aligned} \Vert g (\hat{{{\varvec{\theta }}}}^{ (k)}) - h (\hat{{{\varvec{\theta }}}}^{ (k)o}) \Vert \rightarrow 0 \end{aligned}$$

as \(k\rightarrow \infty\), where \(\hat{{{\varvec{\theta }}}}^{ (k)o} = ( (\hat{{{\varvec{\alpha }}}}^{ (k)}_1)^t, (\hat{{{\varvec{\beta }}}}^{ (k)}_1)^t)^t\). Since \(h\) is a contraction map by Lemma 3,

$$\begin{aligned} \hat{{{\varvec{\theta }}}}^{ (k)o}\rightarrow \tilde{{{\varvec{\theta }}}}^o = (\tilde{{{\varvec{\alpha }}}}_1^t,\tilde{{{\varvec{\beta }}}}_1^t)^t \end{aligned}$$

as \(k\rightarrow \infty\), where \(\tilde{{{\varvec{\theta }}}}^o\) is the unique fixed point of \(h\). Thus the triangle inequality yields

$$\begin{aligned} \Vert \hat{{{\varvec{\alpha }}}}_1-\tilde{{{\varvec{\alpha }}}}_1\Vert \le \lim _{k\rightarrow \infty }\Vert g_1 (\hat{{{\varvec{\theta }}}}^{ (k)})-h_1 (\hat{{{\varvec{\theta }}}}^{ (k)o})\Vert + \lim _{k\rightarrow \infty }\Vert h_1 (\hat{{{\varvec{\theta }}}}^{ (k)o}) -\tilde{{{\varvec{\alpha }}}}_1\Vert =0. \end{aligned}$$

Similarly, \(\hat{{{\varvec{\beta }}}}_1^{ (k)}\rightarrow \tilde{{{\varvec{\beta }}}}_1\) as \(k\rightarrow \infty\). Thus, \(\hat{{{\varvec{\alpha }}}}_1=\tilde{{{\varvec{\alpha }}}}_1\) and \(\hat{{{\varvec{\beta }}}}_1=\tilde{{{\varvec{\beta }}}}_1\) with probability tending to 1. This implies that the asymptotic properties of \(\hat{{{\varvec{\alpha }}}}_1\) and \(\hat{{{\varvec{\beta }}}}_1\) reduce to those of \(\tilde{{{\varvec{\theta }}}}^o\) derived in Lemma 3. This completes the proof. □
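
As a rough numerical illustration of the iteration analyzed in this proof, the sketch below runs a BAR-type update on a Gaussian linear model with unit error variance. This is a stand-in, not the threshold regression likelihood of the paper: with it, \(-2\ell _n\) is a residual sum of squares up to a constant and each update has a closed form. The function names `bar_step` and `bar_fit`, the single penalty `lam`, the ridge parameter `lam0`, and the simulated data are all assumptions made for illustration. The code tracks how the block of truly zero coefficients shrinks along the iterations and checks that the remaining block is a fixed point of the update restricted to the non-zero coefficients, the analogue of the map \(h\).

```python
import numpy as np


def bar_step(theta, X, y, lam):
    """One BAR update for a Gaussian linear model (illustrative stand-in).

    Minimizes ||y - X eta||^2 + lam * sum_j eta_j^2 / theta_j^2 over eta.
    """
    # The minimizer is (X'X + lam * diag(theta^-2))^{-1} X'y. It is computed
    # in the algebraically equivalent form G (G X'X G + lam I)^{-1} G X'y with
    # G = diag(theta), which never divides by a coefficient close to zero.
    G = np.diag(theta)
    A = G @ X.T @ X @ G + lam * np.eye(len(theta))
    return G @ np.linalg.solve(A, G @ X.T @ y)


def bar_fit(X, y, lam0=1.0, lam=5.0, n_iter=50):
    """Iterate the BAR map from a ridge initial estimator; return all iterates."""
    p = X.shape[1]
    theta = np.linalg.solve(X.T @ X + lam0 * np.eye(p), X.T @ y)
    path = [theta]
    for _ in range(n_iter):
        theta = bar_step(theta, X, y, lam)
        path.append(theta)
    return np.array(path)


# Simulated data (hypothetical): two non-zero and two zero coefficients.
rng = np.random.default_rng(0)
n = 400
beta0 = np.array([1.5, -1.0, 0.0, 0.0])
X = rng.standard_normal((n, 4))
y = X @ beta0 + rng.standard_normal(n)

path = bar_fit(X, y)
theta_hat = path[-1]

# Oracle map: the same update restricted to the truly non-zero coefficients.
oracle_next = bar_step(theta_hat[:2], X[:, :2], y, lam=5.0)

print("final iterate:", theta_hat)
print("norm of zero block, first iterations:", np.linalg.norm(path[:6, 2:], axis=1))
print("oracle fixed-point residual:", np.linalg.norm(oracle_next - theta_hat[:2]))
```

The reparametrized solve is a design choice for numerical stability: the weights \(|\theta _j|^{-2}\) blow up exactly for the coefficients that the iteration drives to zero, so the update is written so that those coefficients enter only through multiplication.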

About this article

Cite this article

Saegusa, T., Ma, T., Li, G. et al. Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data. Stat Biosci (2020).

Keywords


  • HIV
  • Survival analysis
  • Threshold regression
  • Variable selection