Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data

Saegusa, Takumi; Ma, Tianzhou; Li, Gang; Chen, Ying Qing; Lee, Mei-Ling Ting

doi:10.1007/s12561-020-09284-1


Published: 17 June 2020

Volume 12, pages 376–398, (2020)

Statistics in Biosciences

Takumi Saegusa ORCID: orcid.org/0000-0001-6869-2451¹,
Tianzhou Ma²,
Gang Li³,
Ying Qing Chen⁴ &
Mei-Ling Ting Lee²


Abstract

The threshold regression model is an effective alternative to the Cox proportional hazards regression model when the proportional hazards assumption is not met. This paper considers variable selection for threshold regression. This model has separate regression functions for the initial health status and the speed of degradation in health. This flexibility is an important advantage when considering relevant risk factors for a complex time-to-event model where one needs to decide which variables should be included in the regression function for the initial health status, in the function for the speed of degradation in health, or in both functions. In this paper, we extend the broken adaptive ridge (BAR) method, originally designed for variable selection for one regression function, to simultaneous variable selection for both regression functions needed in the threshold regression model. We establish variable selection consistency of the proposed method and asymptotic normality of the estimator of non-zero regression coefficients. Simulation results show that our method outperformed threshold regression without variable selection and variable selection based on the Akaike information criterion. We apply the proposed method to data from an HIV drug adherence study in which electronic monitoring of drug intake is used to identify risk factors for non-adherence.

References

Akaike H (1974) Stochastic theory of minimal realization. IEEE Trans Autom Control 19:667–674. https://doi.org/10.1109/tac.1974.1100707
Cambiano V, Lampe FC, Rodger AJ, Smith CJ, Geretti AM, Lodwick RK, Puradiredja DI, Johnson M, Swaden L, Phillips AN (2010) Long-term trends in adherence to antiretroviral therapy from start of HAART. AIDS 24(8):1153–1162
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when $p$ is much larger than $n$. Ann Stat 35(6):2313–2351. https://doi.org/10.1214/009053606000001523
Chen L, Huang JZ (2012) Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J Am Stat Assoc 107(500):1533–1545. https://doi.org/10.1080/01621459.2012.734178
Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B 34:187–220
Cox DR, Miller HD (1965) The theory of stochastic processes. Wiley, New York
Dai L, Chen K, Sun Z, Liu Z, Li G (2018) Broken adaptive ridge regression and its asymptotic properties. J Multivariate Anal 168:334–351. https://doi.org/10.1016/j.jmva.2018.08.007
Denison JA, Packer C, Stalter RM, Banda H, Mercer S, Nyambe N, Katayamoyo P, Mwansa JK, McCarraher DR (2018) Factors related to incomplete adherence to antiretroviral therapy among adolescents attending three HIV clinics in the copperbelt, Zambia. AIDS Behav 22(3):996–1005
Donoho DL, Johnstone IM (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3):425–455. https://doi.org/10.1093/biomet/81.3.425
Du P, Ma S, Liang H (2010) Penalized variable selection procedure for Cox models with semiparametric relative risk. Ann Stat 38(4):2092–2117. https://doi.org/10.1214/09-AOS780
Fan J (2005) A selective overview of nonparametric methods in financial econometrics. Stat Sci 20(4):317–357. https://doi.org/10.1214/088342305000000412
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360. https://doi.org/10.1198/016214501753382273
Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sinica 20(1):101–148
Frank IE, Friedman JH (1993) A statistical view of some chemometrics regression tools. Technometrics 35(2):109–135
Frommlet F, Nuel G (2016) An adaptive ridge procedure for l0 regularization. PLoS ONE. https://doi.org/10.1371/journal.pone.0148620
Glass TR, Battegay M, Cavassini M, De Geest S, Furrer H, Vernazza PL, Hirschel B, Bernasconi E, Rickenbach M, Günthard HF, Bucher HC (2010) Longitudinal analysis of patterns and predictors of changes in self-reported adherence to antiretroviral therapy: Swiss HIV Cohort Study. J Acquir Immune Defic Syndr 54(2):197–203
Gulick RM, Wilkin TJ, Chen YQ, Landovitz RJ, Amico KR, Young AM, Richardson P, Marzinke MA, Hendrix CW, Eshleman SH, McGowan I, Cottle LM, Andrade A, Marcus C, Klingman KL, Chege W, Rinehart AR, Rooney JF, Andrew P, Salata RA, Magnus M, Farley JE, Liu A, Frank I, Ho K, Santana J, Stekler JD, McCauley M, Mayer KH (2017) Phase 2 study of the safety and tolerability of maraviroc-containing regimens to prevent HIV infection in men who have sex with men (HPTN 069/ACTG A5305). J Infect Dis 215(2):238–246
Huang J, Ma S, Xie H, Zhang CH (2009) A group bridge approach for variable selection. Biometrika 96(2):339–355. https://doi.org/10.1093/biomet/asp020
Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481–499. https://doi.org/10.1214/12-STS392
Huang J, Liu L, Liu Y, Zhao X (2014) Group selection in the Cox model with a diverging number of covariates. Stat Sinica 24(4):1787–1810
Kawaguchi ES, Suchard MA, Liu Z, Li G (2017) Scalable Sparse Cox’s regression for large-scale survival data via broken adaptive ridge. arXiv:1712.00561
Kim J, Sohn I, Jung SH, Kim S, Park C (2012) Analysis of survival data with group lasso. Commun Stat 41(9):1593–1605. https://doi.org/10.1080/03610918.2011.611311
Lawson C (1961) Contribution to the theory of linear least maximum approximation. PhD thesis, University of California, Los Angeles
Lee MLT, Whitmore GA (2006) Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary. Stat Sci 21(4):501–513. https://doi.org/10.1214/088342306000000330
Lee MLT, Whitmore GA (2010) Proportional hazards and threshold regression: their theoretical and practical connections. Lifetime Data Anal 16(2):196–214. https://doi.org/10.1007/s10985-009-9138-0
Mallows CL (1973) Some comments on $C_p$. Technometrics 15(4):661–675. https://doi.org/10.1080/00401706.1973.10489103
Mittal S, Madigan D, Cheng JQ, Burd RS (2013) Large-scale parametric survival analysis. Stat Med 32(23):3955–3971. https://doi.org/10.1002/sim.5817
Peng J, Zhu J, Bergamaschi A, Han W, Noh DY, Pollack JR, Wang P (2010) Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann Appl Stat 4(1):53–77. https://doi.org/10.1214/09-AOAS271
Rothman AJ, Levina E, Zhu J (2010) Sparse multivariate regression with covariance estimation. J Comput Graph Stat 19(4):947–962. https://doi.org/10.1198/jcgs.2010.09188
Saegusa T, Lee MLT, Chen YQ (2020) Short- and long-term adherence patterns to antiretroviral drugs and prediction of time to non-adherence based on electronic drug monitoring devices
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39(5):1–13. https://doi.org/10.18637/jss.v039.i05
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288
Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 16(4):385–395
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B 67(1):91–108. https://doi.org/10.1111/j.1467-9868.2005.00490.x
van der Vaart A, Wellner JA (2000) Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. In: High dimensional probability, II (Seattle, WA, 1999), Progr. Probab., vol 47, Birkhäuser Boston, Boston, MA, pp 115–133
van der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes. Springer series in statistics. Springer, New York
Xiao T, Whitmore G, He X, Lee ML (2012) Threshold regression for time-to-event analysis: the stthreg package. Stat J 12(2):257–283
Xiao T, Whitmore G, He X, Lee ML (2015) The R package threg to implement threshold regression models. J Stat Softw 66(8):1–16. https://doi.org/10.18637/jss.v066.i08
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68(1):49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x
Zhang CH (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942. https://doi.org/10.1214/09-AOS729
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429. https://doi.org/10.1198/016214506000000735
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67(2):301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x


Author information

Authors and Affiliations

Department of Biostatistics, University of Maryland, College Park, MD, 20742, USA
Takumi Saegusa
Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD, 20742, USA
Tianzhou Ma & Mei-Ling Ting Lee
Department of Biostatistics, University of California, Los Angeles, CA, 90095, USA
Gang Li
Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
Ying Qing Chen

Corresponding author

Correspondence to Takumi Saegusa.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A. Proofs

Let ${{\varvec{\alpha }}}_{0,1} = (\alpha _{0,1,j})_{j=1}^{s_1}\in {\mathbb {R}}^{s_1}$, ${{\varvec{\beta }}}_{0,1}= (\beta _{0,1,j})_{j=1}^{s_2}\in {\mathbb {R}}^{s_2}$, and

${{\varvec{\theta }}} = ({{\varvec{\alpha }}}^t,{{\varvec{\beta }}}^t)^t = ({{\varvec{\alpha }}}_1^t,{{\varvec{\alpha }}}_2^t,{{\varvec{\beta }}}_1^t,{{\varvec{\beta }}}_2^t)^t$ with ${{\varvec{\alpha }}}_1= (\alpha _{1,j})_{j=1}^{s_1},{{\varvec{\alpha }}}_2= (\alpha _{2,j})_{j=1}^{p-s_1},{{\varvec{\beta }}}_1= (\beta _{1,j})_{j=1}^{s_2},{{\varvec{\beta }}}_2= (\beta _{2,j})_{j=1}^{p-s_2}$. Let $r_n^{ (0)} = 1/\sqrt{n}+\lambda _{n,1}^{ (0)}/n+\lambda _{n,2}^{ (0)}/n$ and $r_n = 1/\sqrt{n}+\lambda _{n,1}/n+\lambda _{n,2}/n$.

Lemma 1

The initial estimator $\hat{{{\varvec{\theta }}}}^{ (0)}$ satisfies

$$\begin{aligned} \Vert \hat{{{\varvec{\theta }}}}^{ (0)}-{{\varvec{\theta }}}_0\Vert = O_P\left ( r_n^{ (0)} \right) . \end{aligned}$$

Proof

Let the criterion function for the initial estimator be

$$\begin{aligned} \mathbb {M}_n ({\varvec{\theta }})\equiv \left\{ -2\ell _n ({\varvec{\theta }})+ \lambda _{n,1}^{ (0)}\sum _{j=1}^p\alpha ^2_j + \lambda _{n,2}^{ (0)}\sum _{j=1}^p\beta _j^2\right\} . \end{aligned}$$

The initial estimator $\hat{{\varvec{\theta }}}^{ (0)}$ minimizes $\mathbb {M}_n ({\varvec{\theta }})$.

We claim that for any $\epsilon >0$ there exists a constant $M>0$ such that

$$\begin{aligned} P\left ( \inf _{{{\varvec{u}}}:\Vert {{\varvec{u}}}\Vert =M}\mathbb {M}_n ({\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}})>\mathbb {M}_n ({\varvec{\theta }}_0)\right) \ge 1-\epsilon . \end{aligned}$$

If the claim holds, there is a local minimizer $\hat{{\varvec{\theta }}}^{ (0)}$ in the ball $\{{\varvec{\theta }}_0+r_n^{ (0)}{{\varvec{u}}}: \Vert {{\varvec{u}}}\Vert \le M\}$ with probability at least $1-\epsilon$. This implies $P\left ( \Vert \hat{{\varvec{\theta }}}^{ (0)}-{\varvec{\theta }}_0\Vert \le r_n^{ (0)}M\right) \ge 1-\epsilon$ and hence $\Vert \hat{{\varvec{\theta }}}^{ (0)}-{\varvec{\theta }}_0\Vert = O_P (r_n^{ (0)})$.

We have

$$\begin{aligned}&\frac{1}{n}\mathbb {M}_n\left ( {\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}}\right) -\frac{1}{n}\mathbb {M}_n ({\varvec{\theta }}_0)\\&\quad =-\frac{2\ell _n\left ( {\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}}\right) }{n} + \frac{2\ell _n ({\varvec{\theta }}_0)}{n} +\frac{\lambda _{n,1}^{ (0)}}{n} \sum _{j=1}^p\left\{ \alpha _{0,j}+r_n^{ (0)}u_j\right\} ^2 -\frac{\lambda _{n,1}^{ (0)}}{n}\sum _{j=1}^p\alpha _{0,j}^2\\&\qquad + \frac{\lambda _{n,2}^{ (0)}}{n}\sum _{j=1}^p \left\{ \beta _{0,j}+r_n^{ (0)}u_{p+j}\right\} ^2- \frac{\lambda _{n,2}^{ (0)}}{n}\sum _{j=1}^p \beta _{0,j}^2\\&\quad \ge -\frac{2\ell _n\left ( {\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}}\right) }{n} + \frac{2\ell _n ({\varvec{\theta }}_0)}{n} +\frac{\lambda _{n,1}^{ (0)}}{n} \sum _{j=1}^{s_1}\left\{ \left ( \alpha _{0,j}+r_n^{ (0)}u_j\right) ^2-\alpha _{0,j}^2\right\} \\&\qquad + \frac{\lambda _{n,2}^{ (0)}}{n}\sum _{j=1}^{s_2} \left\{ \left ( \beta _{0,j}+r_n^{ (0)}u_{p+j}\right) ^2-\beta _{0,j}^2\right\} . \end{aligned}$$

Note that the class of functions $\{-E\ddot{\ell }_1 ({\varvec{\theta }}) :{\varvec{\theta }}\in \varTheta \}$ is a Glivenko-Cantelli class by Lemma 2.6.15 of van der Vaart and Wellner [37] and the Glivenko-Cantelli preservation theorem of van der Vaart and Wellner [36] under Conditions (C1) and (C2). Since $r_n^{ (0)}\rightarrow 0$, for any vector ${\varvec{\theta }}^*$ between ${\varvec{\theta }}_0$ and ${\varvec{\theta }}_0+r_n^{ (0)}{{\varvec{u}}}$ we have $-\ddot{\ell }_n ({\varvec{\theta }}^*)/n=I ({\varvec{\theta }}_0) +o_P (1)$. Note also that the central limit theorem with Condition (C2) implies that $n^{-1/2}\dot{\ell }_n ({\varvec{\theta }}_0)$ converges in distribution to a zero-mean normal vector Z. Thus, using the fact that $r_n^{ (0)}>n^{-1/2}$ and the Cauchy–Schwarz inequality, a Taylor expansion yields

$$\begin{aligned}&-\frac{2\ell _n\left ( {\varvec{\theta }}_0+r_n^{ (0)} {{\varvec{u}}}\right) }{n}+ \frac{2\ell _n ({\varvec{\theta }}_0) }{n} =-\frac{2r_n^{ (0)}{{\varvec{u}}}^t \dot{\ell }_n ({\varvec{\theta }}_0)}{n}- 2\left ( r_n^{ (0)}\right) ^2{{\varvec{u}}}^t\frac{\ddot{\ell }_n ({\varvec{\theta }}^*)}{n}{\varvec{u}}\\&\quad =-2r_n^{ (0)}n^{-1/2}\{{{\varvec{u}}}^tZ+o_P (1)\} +2\left ( r_n^{ (0)}\right) ^2{{\varvec{u}}}^t\{I ({\varvec{\theta }}_0)+o_P (1)\}{{\varvec{u}}}\\&\quad \ge -2\left ( r_n^{ (0)}\right) ^2\{|{{\varvec{u}}}^tZ|+|o_P (1)|\} +2\left ( r_n^{ (0)}\right) ^2{{\varvec{u}}}^t\{I ({\varvec{\theta }}_0)-|o_P (1)|\}{{\varvec{u}}}. \end{aligned}$$

Since $I ({\varvec{\theta }}_0)$ is a strictly positive definite matrix, ${{\varvec{u}}}^tI ({\varvec{\theta }}_0){{\varvec{u}}}\ge c_1\Vert {{\varvec{u}}}\Vert ^2$ for some constant $c_1>0$. Tightness of Z implies that there exists a large constant $c_2=\Vert {{\varvec{u}}}\Vert >0$ such that for n sufficiently large,

$$\begin{aligned} P\left ( A\cap \{|{{\varvec{u}}}^tZ|> c_1\Vert {{\varvec{u}}}\Vert ^2/4\}\right) \le P\left ( A\cap \{\Vert Z\Vert >c_1c_2/4\}\right) \le \epsilon \end{aligned}$$

where A is the event that the sum of two $o_P (1)$ terms above is larger than $c_1c_2^2/4$. Thus, the difference in the log-likelihoods above is bounded below by $\left ( r_n^{ (0)}\right) ^2c_1c_2^2/2$ with probability at least $1-\epsilon$.

Next, we bound the difference in the penalty terms regarding ${{\varvec{\alpha }}}$ to obtain

$$\begin{aligned}&\frac{\lambda _{n,1}^{ (0)}}{n}\sum _{j=1}^{s_1} \left\{ 2r_n^{ (0)}\alpha _{0,j}u_j+ \left ( r_n^{ (0)}\right) ^2u_j^2\right\} \\&\quad \ge -2\frac{\lambda _{n,1}^{ (0)}}{n}r_n^{ (0)} \sqrt{s_1}\max _j|\alpha _{0,j}| \Vert {{\varvec{u}}}\Vert \ge -2\left ( r_n^{ (0)}\right) ^2\sqrt{s_1}\max _j|\alpha _{0,j}|c_2. \end{aligned}$$

Here we use the inequality $\sum _{j=1}^pa_j \ge -\sum _{j=1}^p|a_j|\ge -\sqrt{p}\Vert {{\varvec{a}}}\Vert$ and the fact $r_n^{ (0)}>\lambda _{n,1}^{ (0)}/n$. Similarly, we obtain

$$\begin{aligned} \frac{\lambda _{n,2}^{ (0)}}{n}\sum _{j=1}^{s_2}\left\{ 2r_n^{ (0)}\beta _{0,j}u_{p+j}+ \left ( r_n^{ (0)}\right) ^2u_{p+j}^2\right\} \ge -2 (r_n^{ (0)})^2\sqrt{s_2}\max _j|\beta _{0,j}|c_2. \end{aligned}$$
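The first inequality in the bound for the ${{\varvec{\alpha }}}$-penalty can be unpacked as follows. This is only a sketch of the intermediate steps, using quantities already defined above. Dropping the nonnegative terms $( r_n^{ (0)}) ^2u_j^2$ and applying the Cauchy–Schwarz inequality to $\sum _{j=1}^{s_1}|u_j|$ gives

$$\begin{aligned} \sum _{j=1}^{s_1}\left\{ 2r_n^{ (0)}\alpha _{0,j}u_j+\left ( r_n^{ (0)}\right) ^2u_j^2\right\}&\ge -2r_n^{ (0)}\max _j|\alpha _{0,j}|\sum _{j=1}^{s_1}|u_j|\\&\ge -2r_n^{ (0)}\sqrt{s_1}\max _j|\alpha _{0,j}|\,\Vert {{\varvec{u}}}\Vert . \end{aligned}$$

Multiplying by $\lambda _{n,1}^{ (0)}/n<r_n^{ (0)}$ and setting $\Vert {{\varvec{u}}}\Vert =c_2$ yields the stated bound. The bound for the ${{\varvec{\beta }}}$-penalty follows in the same way with $u_{p+j}$ in place of $u_j$.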

Because ${\varvec{\theta }}_0$ is bounded, we can choose a large constant $M>c_2$ such that $c_1M^2/2 >2 (\sqrt{s_1}+\sqrt{s_2})\max _j|{\varvec{\theta }}_{0,j}|M$.

Now replace $c_2$ by M and follow the same argument to conclude that for n sufficiently large, $\mathbb {M}_n\left ( {\varvec{\theta }}_0+r_n^{ (0)}{{\varvec{u}}}\right) - \mathbb {M}_n ({\varvec{\theta }}_0)$ is strictly positive for all ${{\varvec{u}}}$ satisfying $\Vert {{\varvec{u}}}\Vert =M$ with probability at least $1-\epsilon$. Hence the claim is proved. □

To analyze the broken adaptive ridge estimator, we study the criterion function in each step of optimization. Let

$$\begin{aligned} {\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}) = -2\ell _n ({{\varvec{\eta }}}) + {{\varvec{\eta }}}^tD ({{\varvec{\theta }}}){{\varvec{\eta }}} \end{aligned}$$

where ${{\varvec{\eta }}} \in {\mathbb {R}}^{2p}$ and $D ({{\varvec{\theta }}})$ is a block diagonal matrix with diagonal blocks

$$\begin{aligned}&D_1 ({{\varvec{\theta }}}) =\lambda _{n,1}\mathrm {diag} (|\alpha _1|^{-2},\ldots ,|\alpha _{s_1}|^{-2} ),\\&D_2 ({{\varvec{\theta }}}) =\lambda _{n,1}\mathrm {diag} (|\alpha _{s_1+1}|^{-2},\ldots ,|\alpha _{p}|^{-2} ),\\&D_3 ({{\varvec{\theta }}}) =\lambda _{n,2}\mathrm {diag} (|\beta _1|^{-2},\ldots ,|\beta _{s_2}|^{-2} ),\\&D_4 ({{\varvec{\theta }}}) =\lambda _{n,2}\mathrm {diag} (|\beta _{s_2+1}|^{-2},\ldots ,|\beta _{p}|^{-2} ). \end{aligned}$$

Note that ${\mathbb {Q}}_n\left ( {{\varvec{\theta }}}\left| \hat{{{\varvec{\theta }}}}_n^{ (k-1)}\right. \right)$ is the criterion function in the kth step. Viewing ${\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})$ as a function of ${{\varvec{\eta }}}$, the first and second derivatives are

$$\begin{aligned} \dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}) = -2{\dot{\ell }}_n ({{\varvec{\eta }}}) +2D ({{\varvec{\theta }}}) {{\varvec{\eta }}},\quad \ddot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}) = -2\ddot{\ell }_n ({{\varvec{\eta }}}) +2D ({{\varvec{\theta }}}). \end{aligned}$$
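To make the iteration concrete, the following is a minimal sketch of the two-stage scheme (ridge start, then repeated minimization of ${\mathbb {Q}}_n (\cdot |\hat{{{\varvec{\theta }}}}^{ (k-1)})$). It is not the threshold regression implementation: as a stand-in, it assumes a Gaussian linear model with unit error variance, so that $-2\ell _n ({{\varvec{\eta }}})$ equals $\Vert y-X{{\varvec{\eta }}}\Vert ^2$ up to a constant and each step has a closed form. The names `bar_step` and `bar_fit`, the simulated data, and the tuning values are hypothetical choices made for illustration.

```python
import numpy as np


def bar_step(X, y, theta, lam_vec):
    """One update: argmin over eta of ||y - X eta||^2 + sum_j lam_j * eta_j^2 / theta_j^2.

    Uses the algebraically equivalent form
        eta = G (G X'X G + Lam)^{-1} G X'y,  with G = diag(|theta_j|),
    which stays finite when some theta_j are exactly zero.
    """
    g = np.abs(theta)
    A = g[:, None] * (X.T @ X) * g[None, :] + np.diag(lam_vec)
    return g * np.linalg.solve(A, g * (X.T @ y))


def bar_fit(X, y, p, lam0=(1.0, 1.0), lam=(5.0, 5.0), n_iter=200, tol=1e-10):
    """Ridge start followed by iterated reweighted ridge updates.

    The first p columns of X play the role of alpha and the remaining
    columns the role of beta, each block with its own tuning parameter.
    """
    q = X.shape[1]
    lam0_vec = np.concatenate([np.full(p, lam0[0]), np.full(q - p, lam0[1])])
    lam_vec = np.concatenate([np.full(p, lam[0]), np.full(q - p, lam[1])])
    # Initial estimator: minimizer of ||y - X theta||^2 + sum_j lam0_j * theta_j^2.
    theta = np.linalg.solve(X.T @ X + np.diag(lam0_vec), X.T @ y)
    for _ in range(n_iter):
        new = bar_step(X, y, theta, lam_vec)
        converged = np.max(np.abs(new - theta)) < tol
        theta = new
        if converged:
            break
    return theta


# Hypothetical simulated data: two nonzero coefficients, four zero coefficients.
rng = np.random.default_rng(0)
n, p = 400, 3
X = rng.standard_normal((n, 2 * p))
theta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])
y = X @ theta_true + rng.standard_normal(n)
theta_hat = bar_fit(X, y, p)
print(np.round(theta_hat, 3))
```

In this sketch the weights $\lambda /\theta _j^2$ grow without bound as a coefficient shrinks, which is what drives small coefficients to zero across iterations. Rewriting the update in terms of $\mathrm {diag} (|\theta _j|)$ is a numerical convenience chosen here to avoid dividing by zero.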

Lemma 2

Let

$$\begin{aligned} {\mathcal {M}}_n= & {} \left\{ {{\varvec{\theta }}}:m_0\le |\alpha _{1j}|\le M_0,j=1,\ldots ,s_1,m_0\le |\beta _{1j}|\le M_0,\right. \\&\ \left. j=1,\ldots ,s_2,\Vert {{\varvec{\alpha }}}_2\Vert \le \delta _nn^{-1/2},\Vert {{\varvec{\beta }}}_2\Vert \le \delta _nn^{-1/2}\right\} \end{aligned}$$

where $m_0=\min _k \{|\alpha _{0,1,k}|,|\beta _{0,1,k}|\}/2$, $M_0=2\max _k\{|\alpha _{0,1,k}|,|\beta _{0,1,k}|\}$ and $\delta _n = O (n^{1/4-\epsilon })$. Let $g ({{\varvec{\theta }}}) = (g_1 ({{\varvec{\theta }}})^t,g_2 ({{\varvec{\theta }}})^t,g_3 ({{\varvec{\theta }}})^t, g_4 ({{\varvec{\theta }}})^t)^t$, with $g_1 ({{\varvec{\theta }}})\in {\mathbb {R}}^{s_1}, g_2 ({{\varvec{\theta }}})\in {\mathbb {R}}^{p-s_1},g_3 ({{\varvec{\theta }}}) \in {\mathbb {R}}^{s_2},g_4 ({{\varvec{\theta }}})\in {\mathbb {R}}^{p-s_2}$, be a solution to $\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})=0$ in ${{\varvec{\eta }}}$ (when some elements of ${{\varvec{\theta }}}$ are zero, the corresponding elements of $g ({{\varvec{\theta }}})$ are set to zero and the equation is understood with respect to the remaining elements). Then the following holds with probability tending to 1 as $n\rightarrow \infty$ for some $c>0$:

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n} \frac{\Vert g_2 ({{\varvec{\theta }}})\Vert }{\Vert {{\varvec{\alpha }}}_2\Vert }\le c<1,\quad \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n} \frac{\Vert g_4 ({{\varvec{\theta }}})\Vert }{\Vert {{\varvec{\beta }}}_2\Vert }\le c<1,\quad \end{aligned}$$

and

$$\begin{aligned} (g_1 ({{\varvec{\theta }}})^t,g_2 ({{\varvec{\theta }}})^t,g_3 ({{\varvec{\theta }}})^t, g_4 ({{\varvec{\theta }}})^t)^t\in {\mathcal {M}}_n. \end{aligned}$$

Proof

Without loss of generality, we assume all elements of ${{\varvec{\theta }}}$ are not zero. Let $m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\in {\mathbb {R}}^{p-s_1}$ be a vector obtained by the element-wise division of $g_2 ({{\varvec{\theta }}})$ by ${{\varvec{\alpha }}}_2$. Because ${{\varvec{\theta }}}\in {\mathcal {M}}_n$ implies $\max _j|\alpha _{s_1+j}|\le \delta _n/n^{1/2}$, we have

$$\begin{aligned} \Vert D_2 ({{\varvec{\theta }}})g_2 ({{\varvec{\theta }}})\Vert \ge \frac{\lambda _{n,1}n^{1/2}}{\delta _n} \Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert . \end{aligned}$$

(7)
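Inequality (7) can be checked coordinate by coordinate. As a sketch, write $g_{2,j} ({{\varvec{\theta }}})$ for the jth element of $g_2 ({{\varvec{\theta }}})$ and $m_j=g_{2,j} ({{\varvec{\theta }}})/\alpha _{s_1+j}$ for the jth element of $m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}$ (the shorthand $m_j$ is introduced here for illustration only). Then

$$\begin{aligned} \left| \{D_2 ({{\varvec{\theta }}})g_2 ({{\varvec{\theta }}})\}_j\right| =\frac{\lambda _{n,1}|g_{2,j} ({{\varvec{\theta }}})|}{\alpha _{s_1+j}^2} =\frac{\lambda _{n,1}}{|\alpha _{s_1+j}|}|m_j| \ge \frac{\lambda _{n,1}n^{1/2}}{\delta _n}|m_j|. \end{aligned}$$

Squaring and summing over $j=1,\ldots ,p-s_1$ gives (7).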

To analyze $D_2 ({{\varvec{\theta }}})g_2 ({{\varvec{\theta }}})$ we apply the Taylor expansion of $\dot{{\mathbb {Q}}}_n (g ({{\varvec{\theta }}})|{{\varvec{\theta }}})$ around ${{\varvec{\theta }}}_0$ to obtain

$$\begin{aligned} 0=\dot{{\mathbb {Q}}}_n (g ({{\varvec{\theta }}})|{{\varvec{\theta }}}) =\dot{{\mathbb {Q}}}_n ({{\varvec{\theta }}}_0|{{\varvec{\theta }}}) +\ddot{{\mathbb {Q}}}_n ({{\varvec{\theta }}}^*|{{\varvec{\theta }}}) \left\{ g ({{\varvec{\theta }}})-{{\varvec{\theta }}}_0\right\} \end{aligned}$$

where ${{\varvec{\theta }}}^*$ is a convex combination of ${{\varvec{\theta }}}_0$ and $g ({{\varvec{\theta }}})$. We can rewrite this in terms of ${\dot{\ell }}_n$, $\ddot{\ell }_n$, and D by

$$\begin{aligned} \left\{ -2\ddot{\ell }_n ({{\varvec{\theta }}}^*) +2D ({{\varvec{\theta }}})\right\} \left\{ g ({{\varvec{\theta }}})-{{\varvec{\theta }}}_0\right\} = 2{\dot{\ell }}_n ({{\varvec{\theta }}}_0) -2D ({{\varvec{\theta }}}) {{\varvec{\theta }}}_0. \end{aligned}$$

Rearrangement yields

$$\begin{aligned} \frac{1}{n}{\dot{\ell }}_n ({{\varvec{\theta }}}_0)= & {} -\frac{\ddot{\ell }_n ({{\varvec{\theta }}}^*)}{n}\left\{ g ({{\varvec{\theta }}}) - {{\varvec{\theta }}}_0\right\} + \frac{1}{n}D ({{\varvec{\theta }}})g ({{\varvec{\theta }}}). \end{aligned}$$

Since the central limit theorem yields ${\dot{\ell }}_n ({{\varvec{\theta }}}_0)/n=O_P (n^{-1/2})$, the triangle inequality yields

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\left\Vert \frac{1}{n} D_2 ({{\varvec{\theta }}})g_2 ({{\varvec{\theta }}})\right\Vert \le O_P (n^{-1/2})+ \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\left\Vert \frac{\ddot{\ell }_n ({{\varvec{\theta }}}^*)}{n}g_2 ({{\varvec{\theta }}}) \right\Vert . \end{aligned}$$

(8)

Since each element of $I ({{\varvec{\theta }}})$ is bounded and continuous in ${{\varvec{\theta }}}\in \varTheta$, the Glivenko–Cantelli theorem together with the dominated convergence theorem yields

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\left\Vert \frac{\ddot{\ell }_n ({{\varvec{\theta }}})}{n} \right\Vert =O_P (1) \end{aligned}$$

where $\Vert \cdot \Vert$ is the operator norm for a matrix. Thus, if we show that

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert g_2 ({{\varvec{\theta }}}) \Vert =O_P (n^{-1/2}), \end{aligned}$$

(9)

it follows from (7), (8), and the definition of $O_P (n^{-1/2})$ that with probability tending to 1 we have

$$\begin{aligned} \frac{\lambda _{n,1}n^{1/2}}{n\delta _n} \Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \le \delta _nn^{-1/2}. \end{aligned}$$

Since $\delta _n^2/\lambda _{n,1}\rightarrow 0$, we have that with probability tending to 1,

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \le c<1. \end{aligned}$$
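The step from the preceding bound can be made explicit. As a sketch, multiplying both sides of the inequality $\frac{\lambda _{n,1}n^{1/2}}{n\delta _n}\Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \le \delta _nn^{-1/2}$ by $n\delta _n/ (\lambda _{n,1}n^{1/2})$ gives

$$\begin{aligned} \Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \le \delta _nn^{-1/2}\cdot \frac{n\delta _n}{\lambda _{n,1}n^{1/2}} =\frac{\delta _n^2}{\lambda _{n,1}}, \end{aligned}$$

which tends to zero and is therefore eventually smaller than any fixed constant in $(0,1)$.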

Since

$$\begin{aligned} \Vert g_2 ({{\varvec{\theta }}})\Vert \le \Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \max _{s_1+1\le j\le p}|\alpha _j| \le \Vert m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\Vert \Vert {{\varvec{\alpha }}}_2\Vert , \end{aligned}$$

we obtain

$$\begin{aligned} P\left ( \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\frac{\Vert g_2 ({{\varvec{\theta }}})\Vert }{\Vert {{\varvec{\alpha }}}_2\Vert }<c\right) \rightarrow 1. \end{aligned}$$

Now we show (9). Let $g_{j,k} ({{\varvec{\theta }}})$, $j=1,2,3,4$, be the kth element of $g_j ({{\varvec{\theta }}})$. Because the likelihood is not concave, $g ({{\varvec{\theta }}})$ may take multiple values. With probability tending to 1, we can choose one solution ${\tilde{g}} ({{\varvec{\theta }}}_0)$ to $\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}_0)=0$ such that ${\tilde{g}} ({{\varvec{\theta }}}_0)-{{\varvec{\theta }}}_0=O_P (r_n)$. Because ${\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})$ is continuous in ${{\varvec{\eta }}}$ and ${{\varvec{\theta }}}$, there is a continuous map ${\mathcal {M}}_n\mapsto \{g ({{\varvec{\theta }}}):{{\varvec{\theta }}}\in {\mathcal {M}}_n\}$ that passes through ${\tilde{g}} ({{\varvec{\theta }}}_0)$. To see this, we show that ${{\varvec{\theta }}}\rightarrow {{\varvec{\theta }}}'$ implies ${\tilde{g}}_2 ({{\varvec{\theta }}})\rightarrow {\tilde{g}}_2 ({{\varvec{\theta }}}').$ Since $\dot{{\mathbb {Q}}}_n ({\tilde{g}} ({{\varvec{\theta }}})|{{\varvec{\theta }}}) =\dot{{\mathbb {Q}}}_n ({\tilde{g}} ({{\varvec{\theta }}}')|{{\varvec{\theta }}}')=0$, the Taylor expansion yields

$$\begin{aligned}&\frac{\ddot{\ell }_n ({{\varvec{\theta }}}^*)}{n}\left\{ {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\right\} =\frac{1}{n}D_2 ({{\varvec{\theta }}}'){\tilde{g}}_2 ({{\varvec{\theta }}}') - \frac{1}{n}D_2 ({{\varvec{\theta }}}){\tilde{g}}_2 ({{\varvec{\theta }}}) \end{aligned}$$

(10)

where ${{\varvec{\theta }}}^*$ is some convex combination of ${\tilde{g}}_2 ({{\varvec{\theta }}})$ and ${\tilde{g}}_2 ({{\varvec{\theta }}}').$ Since ${\tilde{g}} ({{\varvec{\theta }}})\rightarrow _P{{\varvec{\theta }}}_0, {\tilde{g}} ({{\varvec{\theta }}}')\rightarrow _P{{\varvec{\theta }}}_0$ and $I ({{\varvec{\theta }}}_0)$ is strictly positive definite, it follows from the Glivenko-Cantelli theorem that the following inequality holds almost surely:

$$\begin{aligned} \{c_1-o (1)\} \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\Vert \le \left\Vert -\frac{\ddot{\ell }_n ({{\varvec{\theta }}}^*)}{n}\left\{ {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\right\} \right\Vert \end{aligned}$$

for some constant $c_1>0$. The right-hand side of (10) is bounded above by

$$\begin{aligned} \frac{1}{n}\Vert D_2 ({{\varvec{\theta }}}') -D_2 ({{\varvec{\theta }}})\Vert \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')\Vert + \frac{1}{n}\Vert D_2 ({{\varvec{\theta }}}) \Vert \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\Vert . \end{aligned}$$

Since ${{\varvec{\theta }}}\in {\mathcal {M}}_n$ and $\lambda _{n,1}/n\rightarrow 0$, $\Vert D_2 ({{\varvec{\theta }}})\Vert /n =o (1)$. Thus, we have almost surely

$$\begin{aligned} \{c_1-o (1)\} \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\Vert \le \frac{1}{n}\Vert D_2 ({{\varvec{\theta }}}') -D_2 ({{\varvec{\theta }}})\Vert \Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')\Vert . \end{aligned}$$

If every element of ${{\varvec{\alpha }}}_2'$ is non-zero, then $\Vert D_2 ({{\varvec{\theta }}}') -D_2 ({{\varvec{\theta }}})\Vert \rightarrow 0$ as ${{\varvec{\theta }}}\rightarrow {{\varvec{\theta }}}'$, and hence $\Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\Vert \rightarrow 0$. Suppose the jth element $\alpha _{2,j}'$ of ${{\varvec{\alpha }}}'_2$ is zero. Let ${\tilde{g}}_{2,j} ({{\varvec{\theta }}})$ be the jth element of ${\tilde{g}}_{2} ({{\varvec{\theta }}})$. Recall that $\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}) =-2{\dot{\ell }}_n ({{\varvec{\eta }}}) +2D ({{\varvec{\theta }}}){{\varvec{\eta }}}$. Since ${\dot{\ell }}_n ({{\varvec{\eta }}})$ is uniformly bounded in ${{\varvec{\eta }}}$, $D ({{\varvec{\theta }}}){\tilde{g}} ({{\varvec{\theta }}})$ must be uniformly bounded. That is, $\lambda _{n,1}{\tilde{g}}_{2,j} ({{\varvec{\theta }}})/\alpha _{2,j}^2$ is bounded by a constant which does not depend on ${{\varvec{\theta }}}$. Thus, $\alpha _{2,j}\rightarrow 0=\alpha _{2,j}'$ implies ${\tilde{g}}_{2,j} ({{\varvec{\theta }}})\rightarrow 0 = {\tilde{g}}_{2,j} ({{\varvec{\theta }}}')$ as expected. The case where the jth element $\alpha _{2,j}'$ is non-zero reduces to the previous case where every element of ${{\varvec{\alpha }}}_2'$ is non-zero. Now, the continuity of ${\tilde{g}}_2$ and compactness of ${\mathcal {M}}_n$ imply that the image of ${\tilde{g}}_2$ is compact. This image can be covered by finitely many balls with radius $o (n^{-1/2})$ and centers ${\tilde{g}}_2 ({{\varvec{\theta }}}^{ (j)}),j=1,\ldots ,J,$ each of which is $O_P (r_n)=O_P (n^{-1/2})$. Then for any ${{\varvec{\theta }}}\in {\mathcal {M}}_n$ the triangle inequality gives

$$\begin{aligned} \Vert {\tilde{g}}_2 ({{\varvec{\theta }}})\Vert \le o (n^{-1/2}) + O_P (n^{-1/2}) = O_P (n^{-1/2}). \end{aligned}$$

If some solution $g_2 ({{\varvec{\theta }}})$ is not included in ${\tilde{g}}_2 ({{\varvec{\theta }}})$, one can construct another such function and repeat the argument above. This proves (9).

The case for $\sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert g_4 ({{\varvec{\theta }}})\Vert /\Vert {{\varvec{\beta }}}_2\Vert$ is similar. For $g_1 ({{\varvec{\theta }}})$ and $g_3 ({{\varvec{\theta }}})$ it can be shown in a similar way as above that $g_1$ and $g_3$ (with appropriate modification into functions) are continuous. Because an argument similar to that in the proof of Lemma 1 implies

$$\begin{aligned} \Vert g_1 ({{\varvec{\theta }}})-{{\varvec{\alpha }}}_1\Vert =O_P (r_n) = O_P (n^{-1/2}),\quad \Vert g_3 ({{\varvec{\theta }}})-{{\varvec{\beta }}}_1\Vert = O_P (n^{-1/2}), \end{aligned}$$

an argument similar to the one used for bounding $\sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert {\tilde{g}}_2 ({{\varvec{\theta }}})\Vert$ yields

$$\begin{aligned} \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert g_1 ({{\varvec{\theta }}})-{{\varvec{\alpha }}}_1\Vert \le O_P (n^{-1/2})\rightarrow _P0, \quad \sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert g_3 ({{\varvec{\theta }}}) -{{\varvec{\beta }}}_1\Vert \le O_P (n^{-1/2})\rightarrow _P0. \end{aligned}$$

This proves that with probability tending to 1, $g ({{\varvec{\theta }}})\in {\mathcal {M}}_n$. □

We will show with the help of Lemma 2 that $\hat{{{\varvec{\alpha }}}}_2=0$ and $\hat{{{\varvec{\beta }}}}_2=0$ with probability tending to 1. Then the analysis of $\hat{{{\varvec{\theta }}}}$ reduces to the analysis of the oracle estimator based on the model where ${{\varvec{\alpha }}}_2=0$ and ${{\varvec{\beta }}}_2=0$. Let ${{\varvec{\theta }}}^{o} = ({{\varvec{\alpha }}}_1^t,{{\varvec{\beta }}}^t_1)^t \in {\mathbb {R}}^{s_1+s_2}$ and ${{\varvec{\theta }}}_0^{o} = ({{\varvec{\alpha }}}_{0,1}^t,{{\varvec{\beta }}}^t_{0,1})^t$. With abuse of notation, let $\ell _n ({{\varvec{\theta }}}^o)$ be the likelihood evaluated at $({{\varvec{\alpha }}}_1^t,{\varvec{0}}^t,{{\varvec{\beta }}}^t_1,{\varvec{0}}^t)^t$ with the first and second derivatives ${\dot{\ell }}_n ({{\varvec{\theta }}}^o)$ and $\ddot{\ell }_n ({{\varvec{\theta }}}^o)$ with respect to ${{\varvec{\theta }}}^o$ and the corresponding Fisher information matrix $I ({{\varvec{\theta }}}^o)$. The criterion functions for the oracle estimator are

$$\begin{aligned}&{\mathbb {M}}_n ({{\varvec{\theta }}}^o)= -2\ell _n ({{\varvec{\theta }}}^o)+ \lambda _{n,1}^{ (0)} \sum _{j=1}^{s_1}\alpha ^2_{1,j} + \lambda _{n,2}^{ (0)}\sum _{j=1}^{s_2}\beta _{1,j}^2,\\&{\mathbb {Q}}_n ({{\varvec{\eta }}}^o|{{\varvec{\theta }}}^o) = -2\ell _n ({{\varvec{\eta }}}^o) + {{\varvec{\alpha }}}_1^tD_1 ({{\varvec{\theta }}}^o){{\varvec{\alpha }}}_1 +{{\varvec{\beta }}}_1^tD_3 ({{\varvec{\theta }}}^o){{\varvec{\beta }}}_1, \end{aligned}$$

with

$$\begin{aligned}&D_1 ({{\varvec{\theta }}}^o) =\lambda _{n,1} \mathrm {diag} (|\alpha _{1,1}|^{-2},\ldots ,|\alpha _{1,s_1}|^{-2}),\\&D_3 ({{\varvec{\theta }}}^o) =\lambda _{n,2} \mathrm {diag} (|\beta _{1,1}|^{-2},\ldots ,|\beta _{1,s_2}|^{-2}),\\&\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}^o|{{\varvec{\theta }}}^o) = -2{\dot{\ell }}_n ({{\varvec{\eta }}}^o) +2\mathrm {diag}\left ( D_1 ({{\varvec{\theta }}}^o) ,D_3 ({{\varvec{\theta }}}^o)\right) ( {{\varvec{\alpha }}}_1^t,{{\varvec{\beta }}}^t_1)^t,\\&\ddot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}^o|{{\varvec{\theta }}}^o) = -2\ddot{\ell }_n ({{\varvec{\eta }}}^o) +2\mathrm {diag}\left ( D_1 ({{\varvec{\theta }}}^o),D_3 ({{\varvec{\theta }}}^o)\right) . \end{aligned}$$

Lemma 3

Let

$$\begin{aligned} {\mathcal {M}}_n^o= \left\{ {{\varvec{\theta }}}^o:m_0\le \min _j|\alpha _{1j}|\le \max _j|\alpha _{1j}|\le M_0,m_0\le \min _j|\beta _{1j}|\le \max _j|\beta _{1j}|\le M_0\right\} . \end{aligned}$$

Let $h ({{\varvec{\theta }}}^o) = (h_1 ({{\varvec{\theta }}}^o)^t, h_3 ({{\varvec{\theta }}}^o)^t)^t,h_1 ({{\varvec{\theta }}}^o)\in {\mathbb {R}}^{s_1}, h_3 ({{\varvec{\theta }}}^o)\in {\mathbb {R}}^{s_2},$ be solutions to $\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}^o|{{\varvec{\theta }}}^o)=0$. Then, the map h on ${\mathcal {M}}_n^o$ is a contraction map with probability tending to 1 with the unique fixed point $\tilde{{{\varvec{\theta }}}}^o$ satisfying

$$\begin{aligned} \sqrt{n} (\tilde{{{\varvec{\theta }}}}^o-{{\varvec{\theta }}}_0^o)\rightarrow _d Z\sim N\left ( 0,I^{-1} ({{\varvec{\theta }}}_0^o)\right) . \end{aligned}$$

Proof

An argument similar to that in Lemma 2 implies that h maps ${\mathcal {M}}_n^o$ into itself. Since $\dot{{\mathbb {Q}}}_n (h ({{\varvec{\theta }}}^o)|{{\varvec{\theta }}}^o)=\dot{{\mathbb {Q}}}_n (h ({{\varvec{\theta }}'}^o)|{{\varvec{\theta }}'}^o)=0$ for ${{\varvec{\theta }}}^o\ne {{\varvec{\theta }}'}^o$, an argument similar to that in Lemma 2 yields

$$\begin{aligned} \{c+o (1)\}\Vert h_1 ({{\varvec{\theta }}}^o)-h_1 ({{\varvec{\theta }}'}^o)\Vert \le \frac{1}{n}\left\Vert D_1 ({{\varvec{\theta }}'}^o) -D_1 ({{\varvec{\theta }}}^o)\right\Vert \left\Vert h_1 ({{\varvec{\theta }}}^o) \right\Vert \end{aligned}$$

for some constant $c>0$. By a Taylor expansion of the function $y=1/x^2$, the right-hand side of the inequality is bounded by

$$\begin{aligned} \frac{1}{n}\left\Vert D_1 ({{\varvec{\theta }}'}^o) -D_1 ({{\varvec{\theta }}}^o)\right\Vert \left\Vert h_1 ({{\varvec{\theta }}}^o) \right\Vert \le \frac{2\lambda _{n,1}\left\Vert h_1 ({{\varvec{\theta }}}^o) \right\Vert }{m_0^{3}n} \Vert {{\varvec{\theta }}}^o-{{\varvec{\theta }}'}^o\Vert . \end{aligned}$$
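The constant $2/m_0^{3}$ comes from the mean value theorem: for $a,b\ge m_0>0$ there is some $\xi$ between $a$ and $b$ with $|a^{-2}-b^{-2}|=2\xi ^{-3}|a-b|\le 2m_0^{-3}|a-b|$. As a numerical sanity check (not part of the original proof), the Python sketch below samples pairs in $[m_0,M_0]$ and verifies this entrywise bound with the tuning parameter set to 1. The function names and the values of `m0` and `M0` are illustrative assumptions.

```python
import random


def inv_sq_gap(a: float, b: float) -> float:
    """Entrywise gap |a^{-2} - b^{-2}| between two diagonal weights (lambda = 1)."""
    return abs(1.0 / a**2 - 1.0 / b**2)


def lipschitz_bound(a: float, b: float, m0: float) -> float:
    """Mean-value bound 2|a - b| / m0^3, valid when a, b >= m0 > 0."""
    return 2.0 * abs(a - b) / m0**3


def check_bound(m0: float = 0.5, M0: float = 3.0,
                trials: int = 10_000, seed: int = 0) -> bool:
    """Return True if the bound holds for every sampled pair in [m0, M0]."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.uniform(m0, M0)
        b = rng.uniform(m0, M0)
        # Small tolerance absorbs floating-point rounding when a is close to b.
        if inv_sq_gap(a, b) > lipschitz_bound(a, b, m0) + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print("bound holds on all samples:", check_bound())
```

The lower bound $m_0$ on the non-zero coefficients is what keeps the weights $|\alpha _{1,j}|^{-2}$ Lipschitz; without it the weights blow up near zero and the bound fails.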

Since

$$\begin{aligned} h_1 ({{\varvec{\theta }}}^o) =h_1 ({{\varvec{\theta }}}^o)-{{\varvec{\alpha }}}_1+{{\varvec{\alpha }}}_1 = O_P (r_n)+ O (1), \end{aligned}$$

we have

$$\begin{aligned} \frac{2\lambda _{n,1}\left\Vert h_1 ({{\varvec{\theta }}}^o) \right\Vert }{m_0^{3}n} \le \frac{2\lambda _{n,1}}{m_0^{3}n} (O_P (r_n)+O (1)) \rightarrow _P0. \end{aligned}$$

Thus, the map $h_1$ is a contraction map with probability tending to 1. The case for $h_3$ is similar. Thus, it follows from the fixed point theorem that with probability tending to 1 there is a unique fixed point $\tilde{{{\varvec{\theta }}}}^o$ such that $h (\tilde{{{\varvec{\theta }}}}^o)=\tilde{{{\varvec{\theta }}}}^o$.
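To make the contraction argument concrete, the following Python sketch uses a deliberately simplified stand-in for the threshold regression likelihood: a single coefficient under least squares with an orthonormal design. In that toy setting the reweighted ridge update has the closed form $h(\theta )=b\theta ^2/(\theta ^2+c)$, where `b` plays the role of the unpenalized estimate and `c` stands for $\lambda _{n,1}/n$. The closed form, the function names, and the numerical values are all assumptions made for illustration, not the authors' implementation.

```python
def h(theta: float, b: float, c: float) -> float:
    """Toy reweighted-ridge update for one coefficient (orthonormal least squares)."""
    return b * theta**2 / (theta**2 + c)


def fixed_point(start: float, b: float, c: float,
                tol: float = 1e-14, max_iter: int = 1000) -> float:
    """Iterate theta <- h(theta) from `start` until successive values agree to `tol`."""
    theta = start
    for _ in range(max_iter):
        nxt = h(theta, b, c)
        if abs(nxt - theta) < tol:
            return nxt
        theta = nxt
    return theta


def lipschitz_on_grid(b: float, c: float, lo: float, hi: float,
                      points: int = 2000) -> float:
    """Largest secant slope of h on an even grid over [lo, hi]."""
    grid = [lo + (hi - lo) * i / (points - 1) for i in range(points)]
    vals = [h(t, b, c) for t in grid]
    return max(abs(vals[i + 1] - vals[i]) / (grid[i + 1] - grid[i])
               for i in range(points - 1))


if __name__ == "__main__":
    b, c = 1.0, 0.01           # hypothetical values: strong signal, small lambda/n
    lo, hi = 0.5, 2.0          # plays the role of [m0, M0]
    print("estimated Lipschitz constant:", lipschitz_on_grid(b, c, lo, hi))
    print("fixed point from lo:", fixed_point(lo, b, c))
    print("fixed point from hi:", fixed_point(hi, b, c))
```

In this toy case the estimated Lipschitz constant is below 1 on the interval, and iterations started at either endpoint reach the same limit, mirroring the uniqueness of the fixed point in the lemma. As `c` shrinks, the slope of `h` on the interval shrinks with it, which corresponds to the factor $\lambda _{n,1}/n$ in the bound above.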

As in the proof of Lemma 2, the Taylor expansion of $\dot{{\mathbb {Q}}}_n (h (\tilde{{{\varvec{\theta }}}}^o)|\tilde{{{\varvec{\theta }}}}^o)$ around ${{\varvec{\theta }}}_0^o$ yields

$$\begin{aligned} -\frac{1}{n}\ddot{\ell }_n ({{\varvec{\theta }}}^{*o})\sqrt{n} (\tilde{{{\varvec{\theta }}}}^o-{{\varvec{\theta }}}_0^o) = \frac{1}{n^{1/2}}{\dot{\ell }}_n ({{\varvec{\theta }}}_0^o) -\frac{1}{n^{1/2}}D_1 (\tilde{{{\varvec{\theta }}}}^o)h_1 (\tilde{{{\varvec{\theta }}}}^o) -\frac{1}{n^{1/2}}D_3 (\tilde{{{\varvec{\theta }}}}^o)h_3 (\tilde{{{\varvec{\theta }}}}^o) \end{aligned}$$

where ${{\varvec{\theta }}}^{*o}$ is some convex combination of $\tilde{{{\varvec{\theta }}}}^o$ and ${{\varvec{\theta }}}_0^o$. Since $h (\tilde{{{\varvec{\theta }}}}^o)-{{\varvec{\theta }}}_0^o=O_P (r_n)=o_P (1)$, $\lambda _{n,1}/\sqrt{n}\rightarrow 0$ and $\lambda _{n,2}/\sqrt{n}\rightarrow 0$, the second and third terms on the right-hand side of the last display are $o_P (1)$. Since $-\ddot{\ell }_n (\tilde{{{\varvec{\theta }}}}^o)/n\rightarrow _PI ({{\varvec{\theta }}}^o_0)$ as in the proof of Lemma 2, the central limit theorem applied to ${\dot{\ell }}_n ({{\varvec{\theta }}}^o_0)/\sqrt{n}$ and Slutsky's theorem yield

$$\begin{aligned} \sqrt{n} (\tilde{{{\varvec{\theta }}}}^o-{{\varvec{\theta }}}_0^o)\rightarrow _d Z\sim N\left ( 0,I^{-1} ({{\varvec{\theta }}}_0^o)\right) . \end{aligned}$$

□

Proof of Theorems 1 and 2

Since the initial estimator $\hat{{{\varvec{\theta }}}}^{ (0)}$ is in ${\mathcal {M}}_n$ with probability tending to 1 by Lemma 1, it follows from Lemma 2 that

$$\begin{aligned} \Vert \hat{{{\varvec{\alpha }}}}_2^{ (k)}\Vert = \frac{\Vert \hat{{{\varvec{\alpha }}}}_2^{ (k)}\Vert }{\Vert \hat{{{\varvec{\alpha }}}}_2^{ (k-1)}\Vert } \times \ldots \times \frac{\Vert \hat{{{\varvec{\alpha }}}}_2^{ (1)}\Vert }{\Vert \hat{{{\varvec{\alpha }}}}_2^{ (0)}\Vert } \rightarrow 0 \end{aligned}$$

as $k\rightarrow \infty$ so that $\hat{{{\varvec{\alpha }}}}_2 = {\varvec{0}}$ and, similarly, $\hat{{{\varvec{\beta }}}}_2 = {\varvec{0}}$ with probability tending to 1. This together with the consistency of $\hat{{{\varvec{\alpha }}}}_1$ and $\hat{{{\varvec{\beta }}}}_1$ proves the model selection consistency of $\hat{{{\varvec{\theta }}}}$.
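The telescoping product above can be illustrated numerically. The Python sketch below uses a one-coefficient least squares toy problem with an orthonormal design, where the iterative reweighted ridge step reduces to $\theta ^{(k)}=b\,(\theta ^{(k-1)})^2/\{(\theta ^{(k-1)})^2+c\}$, with `b` the unpenalized estimate and `c` standing for $\lambda _{n,1}/n$. This closed form, the starting value $\theta ^{(0)}=b$ (the paper starts from a ridge estimator), and the numbers are assumptions chosen for illustration only.

```python
from typing import List


def bar_path(b: float, c: float, steps: int) -> List[float]:
    """Iterates theta^(k) = b * theta^2 / (theta^2 + c), starting from theta^(0) = b."""
    path = [b]
    for _ in range(steps):
        t = path[-1]
        path.append(b * t * t / (t * t + c))
    return path


def successive_ratios(path: List[float]) -> List[float]:
    """Ratios |theta^(k)| / |theta^(k-1)| whose product telescopes to |theta^(k)| / |theta^(0)|."""
    return [abs(path[k]) / abs(path[k - 1])
            for k in range(1, len(path)) if path[k - 1] != 0.0]


if __name__ == "__main__":
    c = 0.01                                   # hypothetical lambda / n
    weak = bar_path(b=0.05, c=c, steps=6)      # weak coefficient: b^2 < 4c
    strong = bar_path(b=1.0, c=c, steps=50)    # strong coefficient: b^2 > 4c
    print("weak-coefficient ratios:", successive_ratios(weak))
    print("weak-coefficient iterate:", weak[-1])
    print("strong-coefficient iterate:", strong[-1])
```

In this toy map, a coefficient with $b^2<4c$ has no non-zero fixed point, so every ratio stays below 1 and the iterates collapse to zero, while a coefficient with $b^2>4c$ settles at a non-zero limit close to `b`. This matches the qualitative behavior the proof establishes for $\hat{{{\varvec{\alpha }}}}_2$ and $\hat{{{\varvec{\alpha }}}}_1$, but it is not a substitute for the general argument.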

Since an argument similar to that in the proof of Lemma 2 shows that the map g from ${\mathcal {M}}_n^0$ to itself is continuous, $\hat{{{\varvec{\alpha }}}}_2^{ (k)}\rightarrow {\varvec{0}}$ and $\hat{{{\varvec{\beta }}}}_2^{ (k)}\rightarrow {\varvec{0}}$ imply

$$\begin{aligned} \Vert g (\hat{{{\varvec{\theta }}}}^{ (k)}) - h (\hat{{{\varvec{\theta }}}}^{ (k)o}) \Vert \rightarrow 0 \end{aligned}$$

as $k\rightarrow \infty$ where $\hat{{{\varvec{\theta }}}}^{ (k)o} = ( (\hat{{{\varvec{\alpha }}}}^{ (k)}_1)^t, (\hat{{{\varvec{\beta }}}}^{ (k)}_1)^t)^t$. Since h is a contraction map by Lemma 3,

$$\begin{aligned} \hat{{{\varvec{\theta }}}}^{ (k)o}\rightarrow \tilde{{{\varvec{\theta }}}}^o = (\tilde{{{\varvec{\alpha }}}}_1^t,\tilde{{{\varvec{\beta }}}}_1^t)^t \end{aligned}$$

as $k\rightarrow \infty$ where $\tilde{{{\varvec{\theta }}}}^o$ is the unique fixed point of h. Thus the triangle inequality yields

$$\begin{aligned} \Vert \hat{{{\varvec{\alpha }}}}_1-\tilde{{{\varvec{\alpha }}}}_1\Vert \le \lim _{k\rightarrow \infty }\Vert g_1 (\hat{{{\varvec{\theta }}}}^{ (k)})-h_1 (\hat{{{\varvec{\theta }}}}^{ (k)o})\Vert + \lim _{k\rightarrow \infty }\Vert h_1 (\hat{{{\varvec{\theta }}}}^{ (k)o}) -\tilde{{{\varvec{\alpha }}}}_1\Vert =0. \end{aligned}$$

Similarly, $\hat{{{\varvec{\beta }}}}_1\rightarrow \tilde{{{\varvec{\beta }}}}_1$ as $k\rightarrow \infty$. Thus, $\hat{{{\varvec{\alpha }}}}_1=\tilde{{{\varvec{\alpha }}}}_1$ and $\hat{{{\varvec{\beta }}}}_1=\tilde{{{\varvec{\beta }}}}_1$ with probability tending to 1. This implies that the asymptotic properties of $\hat{{{\varvec{\alpha }}}}_1$ and $\hat{{{\varvec{\beta }}}}_1$ reduce to the properties of $\tilde{{{\varvec{\theta }}}}^o$ derived in Lemma 3. This completes the proof. □

About this article

Cite this article

Saegusa, T., Ma, T., Li, G. et al. Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data. Stat Biosci 12, 376–398 (2020). https://doi.org/10.1007/s12561-020-09284-1

Received: 01 September 2019
Revised: 13 April 2020
Accepted: 10 June 2020
Published: 17 June 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s12561-020-09284-1
