Abstract
The threshold regression model is an effective alternative to the Cox proportional hazards regression model when the proportional hazards assumption is not met. This paper considers variable selection for threshold regression. This model has separate regression functions for the initial health status and the speed of degradation in health. This flexibility is an important advantage when considering relevant risk factors for a complex time-to-event model where one needs to decide which variables should be included in the regression function for the initial health status, in the function for the speed of degradation in health, or in both functions. In this paper, we extend the broken adaptive ridge (BAR) method, originally designed for variable selection for one regression function, to simultaneous variable selection for both regression functions needed in the threshold regression model. We establish variable selection consistency of the proposed method and asymptotic normality of the estimator of non-zero regression coefficients. Simulation results show that our method outperformed threshold regression without variable selection and variable selection based on the Akaike information criterion. We apply the proposed method to data from an HIV drug adherence study in which electronic monitoring of drug intake is used to identify risk factors for non-adherence.
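For readers unfamiliar with the model, the first-hitting-time idea behind threshold regression can be sketched numerically. The following is a minimal illustration, not the authors' implementation: in the standard Wiener-process formulation of Lee and Whitmore (2006), a latent health process starts at a level \(y_0>0\) governed by one regression function and degrades with drift \(\mu\) governed by the other; the event occurs when the process first hits zero, giving an inverse Gaussian event-time density. Unit diffusion variance is assumed and all names are illustrative.

```python
import numpy as np

def fht_density(t, y0, mu):
    """First-hitting-time density at time t for a Wiener process with
    initial level y0 > 0, drift mu, and unit variance, absorbed at zero.
    For mu <= 0 this is a proper (inverse Gaussian) density."""
    t = np.asarray(t, dtype=float)
    return y0 / np.sqrt(2.0 * np.pi * t**3) * np.exp(-((y0 + mu * t) ** 2) / (2.0 * t))

# In threshold regression, y0 and mu each get their own linear predictor,
# e.g. y0 = exp(x @ alpha) (to keep it positive) and mu = x @ beta, so
# variable selection must be carried out separately for alpha and beta.
```

With \(\mu<0\) the process hits zero almost surely, so the density integrates to one over \((0,\infty)\).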
References
Akaike H (1974) Stochastic theory of minimal realization. IEEE Trans Autom Control 19:667–674. https://doi.org/10.1109/tac.1974.1100707
Cambiano V, Lampe FC, Rodger AJ, Smith CJ, Geretti AM, Lodwick RK, Puradiredja DI, Johnson M, Swaden L, Phillips AN (2010) Long-term trends in adherence to antiretroviral therapy from start of HAART. AIDS 24 (8):1153–1162
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann Stat 35 (6):2313–2351. https://doi.org/10.1214/009053606000001523
Chen L, Huang JZ (2012) Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J Am Stat Assoc 107 (500):1533–1545. https://doi.org/10.1080/01621459.2012.734178
Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B 34:187–220
Cox DR, Miller HD (1965) The theory of stochastic processes. Wiley, New York
Dai L, Chen K, Sun Z, Liu Z, Li G (2018) Broken adaptive ridge regression and its asymptotic properties. J Multivariate Anal 168:334–351. https://doi.org/10.1016/j.jmva.2018.08.007
Denison JA, Packer C, Stalter RM, Banda H, Mercer S, Nyambe N, Katayamoyo P, Mwansa JK, McCarraher DR (2018) Factors related to incomplete adherence to antiretroviral therapy among adolescents attending three HIV clinics in the copperbelt, Zambia. AIDS Behav 22 (3):996–1005
Donoho DL, Johnstone IM (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 (3):425–455. https://doi.org/10.1093/biomet/81.3.425
Du P, Ma S, Liang H (2010) Penalized variable selection procedure for Cox models with semiparametric relative risk. Ann Stat 38 (4):2092–2117. https://doi.org/10.1214/09-AOS780
Fan J (2005) A selective overview of nonparametric methods in financial econometrics. Stat Sci 20 (4):317–357. https://doi.org/10.1214/088342305000000412
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96 (456):1348–1360. https://doi.org/10.1198/016214501753382273
Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sinica 20 (1):101–148
Frank IE, Friedman JH (1993) A statistical view of some chemometrics regression tools. Technometrics 35 (2):109–135
Frommlet F, Nuel G (2016) An adaptive ridge procedure for \(L_0\) regularization. PLoS ONE. https://doi.org/10.1371/journal.pone.0148620
Glass TR, Battegay M, Cavassini M, De Geest S, Furrer H, Vernazza PL, Hirschel B, Bernasconi E, Rickenbach M, Günthard HF, Bucher HC (2010) Longitudinal analysis of patterns and predictors of changes in self-reported adherence to antiretroviral therapy: Swiss HIV Cohort Study. J Acquir Immune Defic Syndr 54 (2):197–203
Gulick RM, Wilkin TJ, Chen YQ, Landovitz RJ, Amico KR, Young AM, Richardson P, Marzinke MA, Hendrix CW, Eshleman SH, McGowan I, Cottle LM, Andrade A, Marcus C, Klingman KL, Chege W, Rinehart AR, Rooney JF, Andrew P, Salata RA, Magnus M, Farley JE, Liu A, Frank I, Ho K, Santana J, Stekler JD, McCauley M, Mayer KH (2017) Phase 2 study of the safety and tolerability of maraviroc-containing regimens to prevent HIV infection in men who have sex with men (HPTN 069/ACTG A5305). J Infect Dis 215 (2):238–246
Huang J, Ma S, Xie H, Zhang CH (2009) A group bridge approach for variable selection. Biometrika 96 (2):339–355. https://doi.org/10.1093/biomet/asp020
Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27 (4):481–499. https://doi.org/10.1214/12-STS392
Huang J, Liu L, Liu Y, Zhao X (2014) Group selection in the Cox model with a diverging number of covariates. Stat Sinica 24 (4):1787–1810
Kawaguchi ES, Suchard MA, Liu Z, Li G (2017) Scalable sparse Cox's regression for large-scale survival data via broken adaptive ridge. arXiv:1712.00561
Kim J, Sohn I, Jung SH, Kim S, Park C (2012) Analysis of survival data with group lasso. Commun Stat 41 (9):1593–1605. https://doi.org/10.1080/03610918.2011.611311
Lawson C (1961) Contribution to the theory of linear least maximum approximation. PhD thesis, University of California, Los Angeles
Lee MLT, Whitmore GA (2006) Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary. Stat Sci 21 (4):501–513. https://doi.org/10.1214/088342306000000330
Lee MLT, Whitmore GA (2010) Proportional hazards and threshold regression: their theoretical and practical connections. Lifetime Data Anal 16 (2):196–214. https://doi.org/10.1007/s10985-009-9138-0
Mallows CL (1973) Some comments on \(C_p\). Technometrics 15 (4):661–675. https://doi.org/10.1080/00401706.1973.10489103
Mittal S, Madigan D, Cheng JQ, Burd RS (2013) Large-scale parametric survival analysis. Stat Med 32 (23):3955–3971. https://doi.org/10.1002/sim.5817
Peng J, Zhu J, Bergamaschi A, Han W, Noh DY, Pollack JR, Wang P (2010) Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann Appl Stat 4 (1):53–77. https://doi.org/10.1214/09-AOAS271
Rothman AJ, Levina E, Zhu J (2010) Sparse multivariate regression with covariance estimation. J Comput Graph Stat 19 (4):947–962. https://doi.org/10.1198/jcgs.2010.09188 (supplementary materials available online)
Saegusa T, Lee MLT, Chen YQ (2020) Short- and long-term adherence patterns to antiretroviral drugs and prediction of time to non-adherence based on electronic drug monitoring devices
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6 (2):461–464
Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox's proportional hazards model via coordinate descent. J Stat Softw 39 (5):1–13. https://doi.org/10.18637/jss.v039.i05
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58 (1):267–288
Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 16 (4):385–395
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B 67 (1):91–108. https://doi.org/10.1111/j.1467-9868.2005.00490.x
van der Vaart A, Wellner JA (2000) Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. In: High dimensional probability, II (Seattle, WA, 1999), Progr. Probab., vol 47, Birkhäuser Boston, Boston, MA, pp 115–133
van der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes. Springer series in statistics. Springer, New York
Xiao T, Whitmore G, He X, Lee ML (2012) Threshold regression for time-to-event analysis: the stthreg package. Stata J 12 (2):257–283
Xiao T, Whitmore G, He X, Lee ML (2015) The R package threg to implement threshold regression models. J Stat Softw 66 (8):1–16. https://doi.org/10.18637/jss.v066.i08
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68 (1):49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x
Zhang CH (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38 (2):894–942. https://doi.org/10.1214/09-AOS729
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101 (476):1418–1429. https://doi.org/10.1198/016214506000000735
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67 (2):301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x
Appendix
A. Proofs
Let \({{\varvec{\alpha }}}_{0,1} = (\alpha _{0,1,j})_{j=1}^{s_1}\in {\mathbb {R}}^{s_1}\), \({{\varvec{\beta }}}_{0,1}= (\beta _{0,1,j})_{j=1}^{s_2}\in {\mathbb {R}}^{s_2}\), and
\({{\varvec{\theta }}} = ({{\varvec{\alpha }}}^t,{{\varvec{\beta }}}^t)^t = ({{\varvec{\alpha }}}_1^t,{{\varvec{\alpha }}}_2^t,{{\varvec{\beta }}}_1^t,{{\varvec{\beta }}}_2^t)^t\) with \({{\varvec{\alpha }}}_1= (\alpha _{1,j})_{j=1}^{s_1},{{\varvec{\alpha }}}_2= (\alpha _{2,j})_{j=1}^{p-s_1},{{\varvec{\beta }}}_1= (\beta _{1,j})_{j=1}^{s_2},{{\varvec{\beta }}}_2= (\beta _{2,j})_{j=1}^{p-s_2}\). Let \(r_n^{ (0)} = 1/\sqrt{n}+\lambda _{n,1}^{ (0)}/n+\lambda _{n,2}^{ (0)}/n\) and \(r_n = 1/\sqrt{n}+\lambda _{n,1}/n+\lambda _{n,2}/n\).
Lemma 1
The initial estimator \(\hat{{{\varvec{\theta }}}}^{ (0)}\) satisfies
Proof
Let the criterion function for the initial estimator be
The initial estimator \(\hat{{\varvec{\theta }}}^{ (0)}\) minimizes \(\mathbb {M}_n ({\varvec{\theta }})\).
We claim that for any \(\epsilon >0\) there exists a constant \(M>0\) such that
If the claim holds, the local minimizer \(\hat{{\varvec{\theta }}}^{ (0)}\) lies in the ball \(\{{\varvec{\theta }}_0+r_n^{ (0)}{{\varvec{u}}}: \Vert {{\varvec{u}}}\Vert \le M\}\) with probability at least \(1-\epsilon\). This implies \(P\left ( \Vert \hat{{\varvec{\theta }}}^{ (0)}-{\varvec{\theta }}_0\Vert \le r_n^{ (0)}M\right) \ge 1-\epsilon\) and hence \(\Vert \hat{{\varvec{\theta }}}^{ (0)}-{\varvec{\theta }}_0\Vert = O_P (r_n^{ (0)})\).
We have
Note that the class of functions \(\{-E\ddot{\ell }_1 ({\varvec{\theta }}) :{\varvec{\theta }}\in \varTheta \}\) is a Glivenko–Cantelli class by Lemma 2.6.15 of van der Vaart and Wellner [37] and the Glivenko–Cantelli preservation theorem of van der Vaart and Wellner [36] under Conditions (C1) and (C2). Since \(r_n^{ (0)}\rightarrow 0\), for any vector \({\varvec{\theta }}^*\) between \({\varvec{\theta }}_0\) and \({\varvec{\theta }}_0+r_n^{ (0)}{{\varvec{u}}}\) we have \(-\ddot{\ell } ({\varvec{\theta }}^*)/n=I ({\varvec{\theta }}_0) +o_P (1)\). Note also that, by the central limit theorem under Condition (C2), \(n^{-1/2}\dot{\ell }_n ({\varvec{\theta }}_0)\) converges in distribution to a zero-mean normal vector Z. Thus, since \(r_n^{ (0)}\ge n^{-1/2}\), the Cauchy–Schwarz inequality and a Taylor expansion yield
Since \(I ({\varvec{\theta }}_0)\) is a strictly positive definite matrix, \({{\varvec{u}}}^tI ({\varvec{\theta }}_0){{\varvec{u}}}\ge c_1\Vert {{\varvec{u}}}\Vert ^2\) for some constant \(c_1>0\). By tightness of Z, there exists a large constant \(c_2>0\) such that, for \(\Vert {{\varvec{u}}}\Vert =c_2\) and n sufficiently large,
where A is the event that the sum of the two \(o_P (1)\) terms above exceeds \(c_1c_2^2/4\). Thus, the difference in the log-likelihoods above is at least \(c_1c_2^2/4\) with probability at least \(1-\epsilon\).
Next, we bound the difference in the penalty terms regarding \({{\varvec{\alpha }}}\) to obtain
Here we use the inequality \(\sum _{j=1}^pa_j \ge -\sum _{j=1}^p|a_j|\ge -\sqrt{p}\Vert {{\varvec{a}}}\Vert\) and the fact that \(r_n^{ (0)}>\lambda _{n,1}^{ (0)}/n\). Similarly, we obtain
Because \({\varvec{\theta }}_0\) is bounded, we can choose a large constant \(M>c_2\) such that \(c_1M^2/2 >2 (\sqrt{s_1}+\sqrt{s_2})\max _j|{\varvec{\theta }}_{0,j}|M\).
Now replace \(c_2\) by M and follow the same argument to conclude that for n sufficiently large, \(\mathbb {M}_n\left ( {\varvec{\theta }}_0+r_n^{ (0)}{{\varvec{u}}}\right) - \mathbb {M}_n ({\varvec{\theta }}_0)\) is strictly positive for all \({{\varvec{u}}}\) satisfying \(\Vert {{\varvec{u}}}\Vert =M\) with probability at least \(1-\epsilon\). Hence the claim is proved. □
To analyze the broken adaptive ridge estimator, we study the criterion function in each step of optimization. Let
where \({{\varvec{\eta }}} \in {\mathbb {R}}^p\) and \(D ({{\varvec{\theta }}})\) is a block diagonal matrix with diagonal elements
Note that \({\mathbb {Q}}_n\left ( {{\varvec{\theta }}}\left| \hat{{{\varvec{\theta }}}}_n^{ (k-1)}\right. \right)\) is the criterion function in the kth step. Viewing \({\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})\) as a function of \({{\varvec{\eta }}}\), the first and second derivatives are
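In a simpler setting, the reweighted-ridge update defined by \({\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})\) can be made concrete. The sketch below applies the broken adaptive ridge iteration to Gaussian linear regression rather than to the threshold-regression likelihood: each step solves a ridge problem whose jth penalty weight is \(\lambda /\theta _j^2\) from the previous iterate, so small coefficients are driven to zero. Function names, tolerances, and the toy data are illustrative.

```python
import numpy as np

def bar_linear(X, y, lam=1.0, n_iter=50, tol=1e-10):
    """Broken adaptive ridge for least squares: iterate
    theta^{(k)} = argmin ||y - X eta||^2 + lam * sum_j eta_j^2 / (theta_j^{(k-1)})^2,
    starting from an ordinary ridge estimate."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    theta = np.linalg.solve(XtX + lam * np.eye(p), Xty)  # initial ridge fit
    for _ in range(n_iter):
        w = lam / np.maximum(theta**2, 1e-12)            # adaptive ridge weights
        theta_new = np.linalg.solve(XtX + np.diag(w), Xty)
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
    theta[np.abs(theta) < 1e-6] = 0.0                    # hard-threshold numerical zeros
    return theta
```

Large true coefficients receive vanishing penalty weight and stay close to their unpenalized values, while coefficients near zero receive exploding weight and collapse geometrically — the same mechanism the lemmas above make rigorous for the two regression functions of the threshold model.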
Lemma 2
Let
where \(m_0=\min _k \{|\alpha _{0,1,k}|,|\beta _{0,1,k}|\}/2\), \(M_0=2\max _k\{|\alpha _{0,1,k}|,|\beta _{0,1,k}|\}\) and \(\delta _n = O (n^{1/4-\epsilon })\). Let \(g ({{\varvec{\theta }}}) = (g_1 ({{\varvec{\theta }}})^t,g_2 ({{\varvec{\theta }}})^t,g_3 ({{\varvec{\theta }}})^t, g_4 ({{\varvec{\theta }}})^t)^t,g_1 ({{\varvec{\theta }}})\in {\mathbb {R}}^{s_1}, g_2 ({{\varvec{\theta }}})\in {\mathbb {R}}^{p-s_1},g_3 ({{\varvec{\theta }}}) \in {\mathbb {R}}^{s_2},g_4 ({{\varvec{\theta }}})\in {\mathbb {R}}^{p-s_2}\) be solutions to \(\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})=0\) (when some elements of \({{\varvec{\theta }}}\) are zero, the corresponding elements of \(g ({{\varvec{\theta }}})\) are set to zero, and the equation is understood to hold for the remaining elements). Then the following holds with probability tending to 1 as \(n\rightarrow \infty\) for some \(c>0\):
and
Proof
Without loss of generality, assume that no element of \({{\varvec{\theta }}}\) is zero. Let \(m_{g_2 ({{\varvec{\theta }}})/{{\varvec{\alpha }}}_2}\in {\mathbb {R}}^{p-s_1}\) be the vector obtained by the element-wise division of \(g_2 ({{\varvec{\theta }}})\) by \({{\varvec{\alpha }}}_2\). Because \({{\varvec{\theta }}}\in {\mathcal {M}}_n\) implies \(\max _j|\alpha _{s_1+j}|\le \delta _n/n^{1/2}\), we have
To analyze \(D_2 ({{\varvec{\theta }}})g_2 ({{\varvec{\theta }}})\) we apply the Taylor expansion of \(\dot{{\mathbb {Q}}}_n (g ({{\varvec{\theta }}})|{{\varvec{\theta }}})\) around \({{\varvec{\theta }}}_0\) to obtain
where \({{\varvec{\theta }}}^*\) is a convex combination of \({{\varvec{\theta }}}_0\) and \(g ({{\varvec{\theta }}})\). We can rewrite this in terms of \({\dot{\ell }}_n\), \(\ddot{\ell }_n\), and D by
Rearrangement yields
Since the central limit theorem gives \({\dot{\ell }}_n ({{\varvec{\theta }}}_0)/n=O_P (n^{-1/2})\), the triangle inequality yields
Since each element of \(I ({{\varvec{\theta }}})\) is bounded and continuous in \({{\varvec{\theta }}}\in \varTheta\), the Glivenko–Cantelli theorem together with the dominated convergence theorem yields
where \(\Vert \cdot \Vert\) is the operator norm for a matrix. Thus, if we show that
it follows from (7), (8), and the definition of \(O_P (n^{-1/2})\) that with probability tending to 1 we have
Since \(\delta _n^2/\lambda _{n,1}\rightarrow 0\), we have that with probability tending to 1,
Since
we obtain
Now we show (9). Let \(g_{j,k} ({{\varvec{\theta }}})\), \(j=1,2,3,4\), be the kth element of \(g_j ({{\varvec{\theta }}})\). Because the likelihood is not concave, \(g ({{\varvec{\theta }}})\) may take multiple values. With probability tending to 1, we can choose one solution \({\tilde{g}} ({{\varvec{\theta }}}_0)\) to \(\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}_0)=0\) such that \({\tilde{g}} ({{\varvec{\theta }}}_0)-{{\varvec{\theta }}}_0=O_P (r_n)\). Because \({\mathbb {Q}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}})\) is continuous in \({{\varvec{\eta }}}\) and \({{\varvec{\theta }}}\), there is a continuous map \({\mathcal {M}}_n\rightarrow \{g ({{\varvec{\theta }}}):{{\varvec{\theta }}}\in {\mathcal {M}}_n\}\) that passes through \({\tilde{g}} ({{\varvec{\theta }}}_0)\). To see this, we show that \({{\varvec{\theta }}}\rightarrow {{\varvec{\theta }}}'\) implies \({\tilde{g}}_2 ({{\varvec{\theta }}})\rightarrow {\tilde{g}}_2 ({{\varvec{\theta }}}')\). Since \(\dot{{\mathbb {Q}}}_n ({\tilde{g}} ({{\varvec{\theta }}})|{{\varvec{\theta }}}) =\dot{{\mathbb {Q}}}_n ({\tilde{g}} ({{\varvec{\theta }}}')|{{\varvec{\theta }}}')=0\), a Taylor expansion yields
where \({{\varvec{\theta }}}^*\) is some convex combination of \({\tilde{g}}_2 ({{\varvec{\theta }}})\) and \({\tilde{g}}_2 ({{\varvec{\theta }}}').\) Since \({\tilde{g}} ({{\varvec{\theta }}})\rightarrow _P{{\varvec{\theta }}}_0, {\tilde{g}} ({{\varvec{\theta }}}')\rightarrow _P{{\varvec{\theta }}}_0\) and \(I ({{\varvec{\theta }}}_0)\) is strictly positive definite, it follows from the Glivenko-Cantelli theorem that the following inequality holds almost surely:
for some constant \(c_1>0\). The right-hand side of (10) is bounded above by
Since \({{\varvec{\theta }}}\in {\mathcal {M}}_n\) and \(\lambda _{n,1}/n\rightarrow 0\), \(\Vert D_2 ({{\varvec{\theta }}})\Vert /n =o (1)\). Thus, we have almost surely
If every element of \({{\varvec{\alpha }}}_2'\) is non-zero, then \(\Vert D_2 ({{\varvec{\theta }}}') -D_2 ({{\varvec{\theta }}})\Vert \rightarrow 0\) as \({{\varvec{\theta }}}\rightarrow {{\varvec{\theta }}}'\), and hence \(\Vert {\tilde{g}}_2 ({{\varvec{\theta }}}')-{\tilde{g}}_2 ({{\varvec{\theta }}})\Vert \rightarrow 0\). Suppose the jth element \(\alpha _{2,j}'\) of \({{\varvec{\alpha }}}'_2\) is zero. Let \({\tilde{g}}_{2,j} ({{\varvec{\theta }}})\) be the jth element of \({\tilde{g}}_{2} ({{\varvec{\theta }}})\). Recall that \(\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}|{{\varvec{\theta }}}) =- {\dot{\ell }}_n ({{\varvec{\theta }}}) +D ({{\varvec{\theta }}}){{\varvec{\eta }}}\). Since \({\dot{\ell }}_n ({{\varvec{\theta }}})\) is uniformly bounded in \({{\varvec{\theta }}}\), \(D ({{\varvec{\theta }}})g ({{\varvec{\theta }}})\) must be uniformly bounded. That is, \(\lambda _{n,1}{\tilde{g}}_{2,j} ({{\varvec{\theta }}})/\alpha _{2,j}^2\) is bounded by a constant that does not depend on \({{\varvec{\theta }}}\). Thus, \(\alpha _{2,j}\rightarrow 0=\alpha _{2,j}'\) implies \({\tilde{g}}_{2,j} ({{\varvec{\theta }}})\rightarrow 0 = {\tilde{g}}_{2,j} ({{\varvec{\theta }}}')\), as desired. The case where the jth element \(\alpha _{2,j}'\) is non-zero reduces to the case, treated above, where every element of \({{\varvec{\alpha }}}_2'\) is non-zero. Now, the continuity of \({\tilde{g}}_2\) and the compactness of \({\mathcal {M}}_n\) imply that the image of \({\tilde{g}}_2\) is compact. This image can be covered by finitely many balls with radius \(o (n^{-1/2})\) and centers \({\tilde{g}}_2 ({{\varvec{\theta }}}^{ (j)})\), \(j=1,\ldots ,J\), each of which is \(O_P (r_n)=O_P (n^{-1/2})\). Then for any \({{\varvec{\theta }}}\in {\mathcal {M}}_n\) the triangle inequality gives
If some solution \(g_2 ({{\varvec{\theta }}})\) is not covered by \({\tilde{g}}_2\), one can construct another such continuous function and repeat the argument above. This proves (9).
The case for \(\sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert g_4 ({{\varvec{\theta }}})\Vert /\Vert {{\varvec{\beta }}}_2\Vert\) is similar. For \(g_1 ({{\varvec{\theta }}})\) and \(g_3 ({{\varvec{\theta }}})\), it can be shown as above that \(g_1\) and \(g_3\) (with appropriate modification into functions) are continuous. Because an argument similar to that in the proof of Lemma 1 implies
the argument used above for bounding \(\sup _{{{\varvec{\theta }}}\in {\mathcal {M}}_n}\Vert {\tilde{g}}_2 ({{\varvec{\theta }}})\Vert\) yields
This proves that with probability tending to 1, \(g ({{\varvec{\theta }}})\in {\mathcal {M}}_n\). □
We will show with the help of Lemma 2 that \(\hat{{{\varvec{\alpha }}}}_2=0\) and \(\hat{{{\varvec{\beta }}}}_2=0\) with probability tending to 1. Then the analysis of \(\hat{{{\varvec{\theta }}}}\) reduces to the analysis of the oracle estimator based on the model where \({{\varvec{\alpha }}}_2=0\) and \({{\varvec{\beta }}}_2=0\). Let \({{\varvec{\theta }}}^{o} = ({{\varvec{\alpha }}}_1^t,{{\varvec{\beta }}}^t_1)^t \in {\mathbb {R}}^{s_1+s_2}\) and \({{\varvec{\theta }}}_0^{o} = ({{\varvec{\alpha }}}_{0,1}^t,{{\varvec{\beta }}}^t_{0,1})^t\). With a slight abuse of notation, let \(\ell _n ({{\varvec{\theta }}}^o)\) be the likelihood evaluated at \(({{\varvec{\alpha }}}_1^t,{\varvec{0}}^t,{{\varvec{\beta }}}^t_1,{\varvec{0}}^t)^t\), with first and second derivatives \({\dot{\ell }}_n ({{\varvec{\theta }}}^o)\) and \(\ddot{\ell }_n ({{\varvec{\theta }}}^o)\) with respect to \({{\varvec{\theta }}}^o\) and corresponding Fisher information matrix \(I ({{\varvec{\theta }}}^o)\). The criterion functions for the oracle estimator are
with
Lemma 3
Let
Let \(h ({{\varvec{\theta }}}^o) = (h_1 ({{\varvec{\theta }}}^o)^t, h_3 ({{\varvec{\theta }}}^o)^t)^t,h_1 ({{\varvec{\theta }}}^o)\in {\mathbb {R}}^{s_1}, h_3 ({{\varvec{\theta }}}^o)\in {\mathbb {R}}^{s_2},\) be solutions to \(\dot{{\mathbb {Q}}}_n ({{\varvec{\eta }}}^o|{{\varvec{\theta }}}^o)=0\). Then, the map h on \({\mathcal {M}}_n^o\) is a contraction map with probability tending to 1 with the unique fixed point \(\tilde{{{\varvec{\theta }}}}^o\) satisfying
Proof
An argument similar to that in Lemma 2 implies that h maps \({\mathcal {M}}_n^o\) into itself. Since \(\dot{{\mathbb {Q}}}_n (h ({{\varvec{\theta }}}^o)|{{\varvec{\theta }}}^o)=\dot{{\mathbb {Q}}}_n (h ({{\varvec{\theta }}'}^o)|{{\varvec{\theta }}'}^o)=0\) for \({{\varvec{\theta }}}^o\ne {{\varvec{\theta }}'}^o\), a similar argument yields
for some constant \(c>0\). By a Taylor expansion of the function \(y=1/x^2\), the right-hand side of the inequality is bounded by
Since we have
we have
Thus, the map \(h_1\) is a contraction with probability tending to 1. The case for \(h_3\) is similar. Hence, by the Banach fixed-point theorem, with probability tending to 1 there is a unique fixed point \(\tilde{{{\varvec{\theta }}}}^o\) such that \(h (\tilde{{{\varvec{\theta }}}}^o)=\tilde{{{\varvec{\theta }}}}^o\).
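The appeal to the fixed-point theorem here is the standard Banach contraction argument: once h is a contraction on \({\mathcal {M}}_n^o\), iterating it from any starting point converges geometrically to the unique fixed point. A generic numerical sketch of this mechanism (using an arbitrary toy contraction, not the threshold-regression map h):

```python
import numpy as np

def iterate_to_fixed_point(h, x0, tol=1e-12, max_iter=1000):
    """Repeatedly apply a map h until successive iterates agree to tol.
    For a contraction, convergence to the unique fixed point is geometric."""
    x = float(x0)
    for _ in range(max_iter):
        x_new = h(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# cos is a contraction near its fixed point (|d/dx cos x| < 1 there)
root = iterate_to_fixed_point(np.cos, 1.0)
```

This is exactly how the BAR estimate is computed in practice: the k-step estimator is the kth iterate, and the contraction property guarantees it stabilizes at the fixed point analyzed in Lemma 3.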
As in the proof of Lemma 2, the Taylor expansion of \({\mathbb {Q}}_n (h (\tilde{{{\varvec{\theta }}}}^o)|\tilde{{{\varvec{\theta }}}}^o)\) around \({{\varvec{\theta }}}_0^o\) yields
where \({{\varvec{\theta }}}^{*o}\) is some convex combination of \(\tilde{{{\varvec{\theta }}}}^o\) and \({{\varvec{\theta }}}_0^o\). Since \(h (\tilde{{{\varvec{\theta }}}}^o)-{{\varvec{\theta }}}_0^o=O_P (r_n)=o_P (1)\), \(\lambda _{n,1}/\sqrt{n}\rightarrow 0\) and \(\lambda _{n,2}/\sqrt{n}\rightarrow 0\), the second and third terms on the right-hand side of the last display are \(o_P (1)\). Since \(-\ddot{\ell }_n (\tilde{{{\varvec{\theta }}}}^o)/n\rightarrow _PI ({{\varvec{\theta }}}^o_0)\) as in the proof of Lemma 2, the central limit theorem applied to \({\dot{\ell }}_n ({{\varvec{\theta }}}^o_0)/\sqrt{n}\) and Slutsky's theorem yield
□
Proof of Theorems 1 and 2. Since the initial estimator \(\hat{{{\varvec{\theta }}}}^{ (0)}\) lies in \({\mathcal {M}}_n\) with probability tending to 1 by Lemma 1, it follows from Lemma 2 that
as \(k\rightarrow \infty\) so that \(\hat{{{\varvec{\alpha }}}}_2 = {\varvec{0}}\) and, similarly, \(\hat{{{\varvec{\beta }}}}_2 = {\varvec{0}}\) with probability tending to 1. This together with the consistency of \(\hat{{{\varvec{\alpha }}}}_1\) and \(\hat{{{\varvec{\beta }}}}_1\) proves the model selection consistency of \(\hat{{{\varvec{\theta }}}}\).
Since an argument similar to that in the proof of Lemma 2 shows that the map g from \({\mathcal {M}}_n^o\) to itself is continuous, \(\hat{{{\varvec{\alpha }}}}_2^{ (k)}\rightarrow {\varvec{0}}\) and \(\hat{{{\varvec{\beta }}}}_2^{ (k)}\rightarrow {\varvec{0}}\) imply
as \(k\rightarrow \infty\), where \(\hat{{{\varvec{\theta }}}}^{ (k)o} = ( (\hat{{{\varvec{\alpha }}}}^{ (k)}_1)^t, (\hat{{{\varvec{\beta }}}}^{ (k)}_1)^t)^t\). Since h is a contraction map by Lemma 3,
as \(k\rightarrow \infty\), where \(\tilde{{{\varvec{\theta }}}}^o\) is the unique fixed point of h. Thus the triangle inequality yields
Similarly, \(\hat{{{\varvec{\beta }}}}_1^{ (k)}\rightarrow \tilde{{{\varvec{\beta }}}}_1\) as \(k\rightarrow \infty\). Thus, \(\hat{{{\varvec{\alpha }}}}_1=\tilde{{{\varvec{\alpha }}}}_1\) and \(\hat{{{\varvec{\beta }}}}_1=\tilde{{{\varvec{\beta }}}}_1\) with probability tending to 1. This implies that the asymptotic properties of \(\hat{{{\varvec{\alpha }}}}_1\) and \(\hat{{{\varvec{\beta }}}}_1\) reduce to those of \(\tilde{{{\varvec{\theta }}}}^o\) derived in Lemma 3. This completes the proof. □
Cite this article
Saegusa, T., Ma, T., Li, G. et al. Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data. Stat Biosci 12, 376–398 (2020). https://doi.org/10.1007/s12561-020-09284-1