Abstract
Logistic regression is used thousands of times a day to fit data, predict future outcomes, and assess the statistical significance of explanatory variables. When used for the purpose of statistical inference, logistic models produce p-values for the regression coefficients by using an approximation to the distribution of the likelihood-ratio test (LRT). Indeed, Wilks’ theorem asserts that whenever we have a fixed number p of variables, twice the log-likelihood ratio (LLR) \(2 \Lambda \) is distributed as a \(\chi ^2_k\) variable in the limit of large sample sizes n; here, \(\chi ^2_k\) is a Chi-square with k degrees of freedom and k the number of variables being tested. In this paper, we prove that when p is not negligible compared to n, Wilks’ theorem does not hold and that the Chi-square approximation is grossly incorrect; in fact, this approximation produces p-values that are far too small (under the null hypothesis). Assume that n and p grow large in such a way that \(p/n \rightarrow \kappa \) for some constant \(\kappa < 1/2\). (For \(\kappa > 1/2\), \(2\Lambda {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0\) so that the LRT is not interesting in this regime.) We prove that for a class of logistic models, the LLR converges to a rescaled Chi-square, namely, \(2\Lambda ~{\mathop {\rightarrow }\limits ^{\mathrm {d}}}~ \alpha (\kappa ) \chi _k^2\), where the scaling factor \(\alpha (\kappa )\) is greater than one as soon as the dimensionality ratio \(\kappa \) is positive. Hence, the LLR is larger than classically assumed. For instance, when \(\kappa = 0.3\), \(\alpha (\kappa ) \approx 1.5\). In general, we show how to compute the scaling factor by solving a nonlinear system of two equations with two unknowns. Our mathematical arguments are involved and use techniques from approximate message passing theory, from non-asymptotic random matrix theory and from convex geometry. 
We also complement our mathematical study by showing that the new limiting distribution is accurate for finite sample sizes. Finally, all the results from this paper extend to some other regression models such as the probit regression model.
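The inflation of the LLR is easy to observe numerically. The following is a minimal simulation sketch (our illustration, not the authors' code): it fits logistic MLEs by damped Newton's method under the global null with \(p/n = 0.3\) and records the LRT statistic \(2\Lambda \) for a single coefficient. Classical theory predicts an average near 1, whereas the rescaled limit predicts an average close to \(\alpha (0.3) \approx 1.5\).

```python
import numpy as np

def loglik(X, y, beta):
    """Logistic log-likelihood: sum_i [y_i x_i'beta - log(1 + exp(x_i'beta))]."""
    xb = X @ beta
    return float(np.sum(y * xb - np.logaddexp(0.0, xb)))

def fit_logistic(X, y, iters=50):
    """Damped Newton (IRLS with step-halving); returns the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    ll = loglik(X, y, beta)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities
        w = mu * (1.0 - mu)                          # Newton weights rho''
        step = np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - mu))
        t = 1.0
        while loglik(X, y, beta + t * step) < ll and t > 1e-8:
            t /= 2.0                                 # step-halving keeps the likelihood increasing
        beta = beta + t * step
        new_ll = loglik(X, y, beta)
        if new_ll - ll < 1e-10:
            return new_ll
        ll = new_ll
    return ll

rng = np.random.default_rng(0)
n, p, trials = 300, 90, 200                          # kappa = p/n = 0.3
lrs = []
for _ in range(trials):
    X = rng.standard_normal((n, p)) / np.sqrt(n)
    y = (rng.random(n) < 0.5).astype(float)          # global null: all coefficients zero
    # LRT for the first coefficient: full fit vs. fit without column 1
    lrs.append(2.0 * (fit_logistic(X, y) - fit_logistic(X[:, 1:], y)))

print(np.mean(lrs))  # classical chi-square_1 theory would predict a mean of about 1
```

Step-halving is used only for robustness of the sketch; plain Newton typically converges here since \(\kappa = 0.3\) is well below the \(\kappa = 1/2\) boundary, where the MLE ceases to exist.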
Notes
Such conditions would also typically imply asymptotic normality of the MLE.
The conjugate \(f^*\) of a function f is defined as \(f^{*} (x) = \sup _{u \in \mathrm {dom}(f)} \{\langle u, x \rangle - f(u) \}\).
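As a standard worked example (not from the paper), for the quadratic \(f(u)=\frac{1}{2}u^{2}\):

```latex
f^{*}(x) \;=\; \sup_{u}\Big\{ u x - \tfrac{1}{2}u^{2} \Big\} \;=\; \tfrac{1}{2}x^{2},
\qquad \text{with the supremum attained at } u = x .
```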
Note that the p-values obtained at each trial are not exactly independent. However, they are exchangeable, and weakly dependent (see the proof of Corollary 1 for a formal justification of this fact). Therefore, we expect the goodness of fit test to be an approximately valid procedure in this setting.
Mathematically, the convex geometry and the leave-one-out analyses employed in our proof naturally extend to the case where \(p=o(n)\). It remains to develop a formal AMP theory for the regime where \(p=o(n)\). Alternatively, we note that the AMP theory has been mainly invoked to characterize \(\Vert \hat{\varvec{\beta }}\Vert \), which can also be accomplished via the leave-one-out argument (cf. [25]). This alternative proof strategy can easily extend to the regime \(p=o(n)\).
Recall our earlier footnote about the use of a \(\chi ^2\) test.
When \(\varvec{X}_i\sim \mathcal {N}(\varvec{0},\varvec{\Sigma })\) for a general \(\varvec{\Sigma }\succ \varvec{0}\), one has \(\Vert \varvec{\Sigma }^{1/2}\hat{\varvec{\beta }}\Vert \lesssim {1}/{\epsilon ^{2}}\) with high probability.
References
Agresti, A., Kateri, M.: Categorical Data Analysis. Springer, Berlin (2011)
Alon, N., Spencer, J.H.: The Probabilistic Method, 3rd edn. Wiley, Hoboken (2008)
Amelunxen, D., Lotz, M., McCoy, M.B., Tropp, J.A.: Living on the edge: phase transitions in convex programs with random data. Inf. Inference 3, 224–294 (2014)
Baricz, Á.: Mills’ ratio: monotonicity patterns and functional inequalities. J. Math. Anal. Appl. 340(2), 1362–1370 (2008)
Bartlett, M.S.: Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 160, 268–282 (1937)
Bayati, M., Lelarge, M., Montanari, A.: Universality in polytope phase transitions and message passing algorithms. Ann. Appl. Probab. 25(2), 753–822 (2015)
Bayati, M., Montanari, A.: The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inf. Theory 57(2), 764–785 (2011)
Bayati, M., Montanari, A.: The LASSO risk for Gaussian matrices. IEEE Trans. Inf. Theory 58(4), 1997–2017 (2012)
Bickel, P.J., Ghosh, J.K.: A decomposition for the likelihood ratio statistic and the Bartlett correction—a Bayesian argument. Ann. Stat. 18, 1070–1090 (1990)
Boucheron, S., Massart, P.: A high-dimensional Wilks phenomenon. Probab. Theory Relat. Fields 150(3–4), 405–433 (2011)
Box, G.: A general distribution theory for a class of likelihood criteria. Biometrika 36(3/4), 317–346 (1949)
Candès, E., Fan, Y., Janson, L., Lv, J.: Panning for gold: model-free knockoffs for high-dimensional controlled variable selection (2016). ArXiv preprint arXiv:1610.02351
Chernoff, H.: On the distribution of the likelihood ratio. Ann. Math. Stat. 25, 573–578 (1954)
Cordeiro, G.M.: Improved likelihood ratio statistics for generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 25, 404–413 (1983)
Cordeiro, G.M., Cribari-Neto, F.: An Introduction to Bartlett Correction and Bias Reduction. Springer, New York (2014)
Cordeiro, G.M., Cribari-Neto, F., Aubin, E.C.Q., Ferrari, S.L.P.: Bartlett corrections for one-parameter exponential family models. J. Stat. Comput. Simul. 53(3–4), 211–231 (1995)
Cover, T.M.: Geometrical and statistical properties of linear threshold devices. Ph.D. thesis (1964)
Cover, T.M.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 3, 326–334 (1965)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Hoboken (2012)
Cribari-Neto, F., Cordeiro, G.M.: On Bartlett and Bartlett-type corrections. Econom. Rev. 15(4), 339–367 (1996)
Deshpande, Y., Montanari, A.: Finding hidden cliques of size \(\sqrt{N/e}\) in nearly linear time. Found. Comput. Math. 15(4), 1069–1128 (2015)
Donoho, D., Montanari, A.: High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probab. Theory Relat. Fields 3, 935–969 (2013)
Donoho, D., Montanari, A.: Variance breakdown of Huber (M)-estimators: \(n/p \rightarrow m \in (1,\infty )\). Technical report (2015)
El Karoui, N.: Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results (2013). ArXiv preprint arXiv:1311.2445
El Karoui, N.: On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields 170, 95–175 (2017)
El Karoui, N., Bean, D., Bickel, P.J., Lim, C., Yu, B.: On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. 110(36), 14557–14562 (2013)
Fan, J., Jiang, J.: Nonparametric inference with generalized likelihood ratio tests. Test 16(3), 409–444 (2007)
Fan, J., Lv, J.: Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inf. Theory 57(8), 5467–5484 (2011)
Fan, J., Zhang, C., Zhang, J.: Generalized likelihood ratio statistics and Wilks phenomenon. Ann. Stat. 29, 153–193 (2001)
Fan, Y., Demirkaya, E., Lv, J.: Nonuniformity of p-values can occur early in diverging dimensions (2017). arXiv:1705.03604
Hager, W.W.: Updating the inverse of a matrix. SIAM Rev. 31(2), 221–239 (1989)
Hanson, D.L., Wright, F.T.: A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Stat. 42(3), 1079–1083 (1971)
He, X., Shao, Q.-M.: On parameters of increasing dimensions. J. Multivar. Anal. 73(1), 120–135 (2000)
Hsu, D., Kakade, S., Zhang, T.: A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 17(52), 1–6 (2012)
Huber, P.J.: Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Stat. 1, 799–821 (1973)
Huber, P.J.: Robust Statistics. Springer, Berlin (2011)
Janková, J., Van de Geer, S.: Confidence regions for high-dimensional generalized linear models under sparsity (2016). ArXiv preprint arXiv:1610.01353
Javanmard, A., Montanari, A.: State evolution for general approximate message passing algorithms, with applications to spatial coupling. Inf. Inference 2, 115–144 (2013)
Javanmard, A., Montanari, A.: De-biasing the lasso: optimal sample size for Gaussian designs (2015). ArXiv preprint arXiv:1508.02757
Lawley, D.N.: A general method for approximating to the distribution of likelihood ratio criteria. Biometrika 43(3/4), 295–303 (1956)
Lehmann, E.L., Romano, J.P.: Testing Statistical Hypotheses. Springer, Berlin (2006)
Liang, H., Du, P.: Maximum likelihood estimation in logistic regression models with a diverging number of covariates. Electron. J. Stat. 6, 1838–1846 (2012)
Mammen, E.: Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Ann. Stat. 17, 382–400 (1989)
McCullagh, P., Nelder, J.A.: Generalized Linear Models. Monograph on Statistics and Applied Probability. Chapman & Hall, London (1989)
Moulton, L.H., Weissfeld, L.A., Laurent, R.T.S.: Bartlett correction factors in logistic regression models. Comput. Stat. Data Anal. 15(1), 1–11 (1993)
Oymak, S., Tropp, J.A.: Universality laws for randomized dimension reduction, with applications. Inf. Inference J. IMA 7, 337–446 (2015)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)
Portnoy, S.: Asymptotic behavior of M-estimators of \(p\) regression parameters when \(p^2/n\) is large. I. Consistency. Ann. Stat. 12, 1298–1309 (1984)
Portnoy, S.: Asymptotic behavior of M-estimators of \(p\) regression parameters when \(p^2/n\) is large; II. Normal approximation. Ann. Stat. 13, 1403–1417 (1985)
Portnoy, S.: Asymptotic behavior of the empiric distribution of m-estimated residuals from a regression model with many parameters. Ann. Stat. 14, 1152–1170 (1986)
Portnoy, S.: Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Stat. 16(1), 356–366 (1988)
Rudelson, M., Vershynin, R.: Hanson-Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab. 18(82), 1–9 (2013)
Sampford, M.R.: Some inequalities on Mill’s ratio and related functions. Ann. Math. Stat. 24(1), 130–132 (1953)
Spokoiny, V.: Penalized maximum likelihood estimation and effective dimension (2012). ArXiv preprint arXiv:1205.0498
Su, W., Bogdan, M., Candes, E.: False discoveries occur early on the Lasso path. Ann. Stat. 45, 2133–2150 (2017)
Sur, P., Candès, E.J.: Additional supplementary materials for: a modern maximum-likelihood theory for high-dimensional logistic regression. https://statweb.stanford.edu/~candes/papers/proofs_LogisticAMP.pdf (2018)
Sur, P., Candès, E.J.: A modern maximum-likelihood theory for high-dimensional logistic regression (2018). ArXiv preprint arXiv:1803.06964
Sur, P., Chen, Y., Candès, E.: Supplemental materials for “the likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square”. http://statweb.stanford.edu/~candes/papers/supplement_LRT.pdf (2017)
Tang, C.Y., Leng, C.: Penalized high-dimensional empirical likelihood. Biometrika 97, 905–919 (2010)
Tao, T.: Topics in Random Matrix Theory, vol. 132. American Mathematical Society, Providence (2012)
Thrampoulidis, C., Abbasi, E., Hassibi, B.: Precise error analysis of regularized m-estimators in high-dimensions (2016). ArXiv preprint arXiv:1601.06233
Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R.: On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat. 42(3), 1166–1202 (2014)
Van de Geer, S.A.: High-dimensional generalized linear models and the lasso. Ann. Stat. 36(2), 614–645 (2008)
Van der Vaart, A.W.: Asymptotic Statistics, vol. 3. Cambridge University Press, Cambridge (2000)
Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. In: Compressed Sensing: Theory and Applications, pp. 210–268 (2012)
Wilks, S.S.: The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 9(1), 60–62 (1938)
Yan, T., Li, Y., Xu, J., Yang, Y., Zhu, J.: High-dimensional Wilks phenomena in some exponential random graph models (2012). ArXiv preprint arXiv:1201.0058
Acknowledgements
E. C. was partially supported by the Office of Naval Research under grant N00014-16-1-2712, and by the Math + X Award from the Simons Foundation. P. S. was partially supported by the Ric Weiland Graduate Fellowship in the School of Humanities and Sciences, Stanford University. Y. C. is supported in part by the AFOSR YIP award FA9550-19-1-0030, by the ARO grant W911NF-18-1-0303, and by the Princeton SEAS innovation award. P. S. and Y. C. are grateful to Andrea Montanari for his help in understanding AMP and [22]. Y. C. thanks Kaizheng Wang and Cong Ma for helpful discussion about [25], and P. S. thanks Subhabrata Sen for several helpful discussions regarding this project. E. C. would like to thank Iain Johnstone for a helpful discussion as well.
Appendices
Proofs for eigenvalue bounds
1.1 Proof of Lemma 3
Fix \(\epsilon \ge 0\) sufficiently small. For any given \(S\subseteq [n]\) obeying \(|S|=(1-\epsilon )n\) and \(0 \le t \le \sqrt{1-\epsilon } - \sqrt{p/n}\) it follows from [65, Corollary 5.35] that
holds with probability at most \(2\exp \left( -\frac{t^{2}|S|}{2}\right) =2\exp \left( -\frac{\left( 1-\epsilon \right) t^{2}n}{2}\right) \). Taking the union bound over all possible subsets S of size \((1-\epsilon )n\) gives
where the last line is a consequence of the inequality \({n \atopwithdelims ()(1-\epsilon )n}\le e^{n H(\epsilon )}\) [19, Example 11.1.3].
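As a sanity check (not part of the proof), the entropy bound \({n \atopwithdelims ()(1-\epsilon )n}\le e^{n H(\epsilon )}\) used in the last step, with \(H(\epsilon )=-\epsilon \log \epsilon -(1-\epsilon )\log (1-\epsilon )\) the binary entropy in nats, can be verified numerically:

```python
import math

def H(eps):
    """Binary entropy in nats."""
    return -eps * math.log(eps) - (1 - eps) * math.log(1 - eps)

for n, eps in [(100, 0.1), (1000, 0.1), (1000, 0.25), (2000, 0.4)]:
    k = round((1 - eps) * n)
    # compare on the log scale to avoid overflow of the huge binomial coefficient
    assert math.log(math.comb(n, k)) <= n * H(eps)
    print(n, eps, math.log(math.comb(n, k)), n * H(eps))
```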
1.2 Proof of Lemma 4
Define
for any \(B>0\) and any \(\varvec{\beta }\). Then
If one also has \(|S_{B}\left( \varvec{\beta }\right) |\ge (1-\epsilon )n\) (for \(\epsilon \ge 0\) sufficiently small), then this together with Lemma 3 implies that
with probability at least \(1-2\exp \left( -\left( \frac{\left( 1-\epsilon \right) t^{2}}{2} - H\left( \epsilon \right) \right) n\right) \).
Thus if we can ensure that with high probability, \(|S_{B}\left( \varvec{\beta }\right) |\ge (1-\epsilon )n\) holds simultaneously for all \(\varvec{\beta }\), then we are done. From Lemma 2 we see that \( \frac{1}{n}\left\| \varvec{X}^{\top }\varvec{X}\right\| \le 9 \) with probability exceeding \(1-2\exp \left( -n/2\right) \). On this event,
On the other hand, the definition of \(S_B(\varvec{\beta })\) gives
Taken together, (114) and (115) yield
with probability at least \(1-2\exp (-n/2)\). Therefore, with probability \(1-2\exp (-n/2)\), \( \left| S_{3/\sqrt{\epsilon }}(\varvec{\beta })\right| \ge \left( 1-\epsilon \right) n \) holds simultaneously for all \(\varvec{\beta }\). Putting the above results together and setting \(t = 2\sqrt{\frac{H(\epsilon )}{1-\epsilon }}\) give
simultaneously for all \(\varvec{\beta }\) with probability at least \(1-2\exp \left( -nH\left( \epsilon \right) \right) -2\exp \left( -{n}/2\right) \).
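For intuition, the bound \(\frac{1}{n}\Vert \varvec{X}^{\top }\varvec{X}\Vert \le 9\) is comfortably loose: under the assumption of i.i.d. standard Gaussian entries, the largest singular value of an \(n\times p\) matrix concentrates around \(\sqrt{n}+\sqrt{p}\), so \(\frac{1}{n}\Vert \varvec{X}^{\top }\varvec{X}\Vert \approx (1+\sqrt{p/n})^{2}\le 4\) whenever \(p\le n\). A quick numerical illustration (ours, with one arbitrary aspect ratio):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 600                      # kappa = p/n = 0.3
X = rng.standard_normal((n, p))
s_max = np.linalg.norm(X, ord=2)      # largest singular value
print(s_max / np.sqrt(n), (1 + np.sqrt(p / n)) ** 2)
assert s_max**2 / n <= 9.0            # the bound used in the proof
```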
Proof of Lemma 5
Applying an integration by parts leads to
with \(\phi (z)=\frac{1}{\sqrt{2\pi }}\exp (-z^{2}/2)\). This reveals that
where the second identity comes from [22, Proposition 6.4], and the last identity holds since \(\phi '(z)=-\phi '(-z)\).
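As a generic numerical illustration of this kind of Gaussian integration by parts (Stein's identity \({\mathbb {E}}[Zf(Z)]={\mathbb {E}}[f'(Z)]\) for \(Z\sim \mathcal {N}(0,1)\); the test function \(f=\tanh \) below is our arbitrary choice, not an object from the proof):

```python
import numpy as np

z = np.linspace(-12.0, 12.0, 200001)
dz = z[1] - z[0]
phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density

f = np.tanh(z)
fprime = 1.0 / np.cosh(z) ** 2

lhs = np.sum(z * f * phi) * dz      # E[Z f(Z)] by quadrature
rhs = np.sum(fprime * phi) * dz     # E[f'(Z)] by quadrature
print(lhs, rhs)
assert abs(lhs - rhs) < 1e-6
```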
Next, we claim that
-
(a)
The function \(h\left( z\right) :=\frac{\rho '\left( z\right) }{1+b\rho ''\left( z\right) }\) is increasing in z;
-
(b)
\(\mathrm {prox}_{b\rho }(z)\) is increasing in z.
These two claims imply that
which combined with the fact \(\phi '(z)<0\) for \(z>0\) reveals
In other words, the integrand in (116) is positive, which allows one to conclude that \(G'(b)>0\).
We then move on to justify (a) and (b). For the first, the derivative of h is given by
Since \(\rho '\) is log-concave, \((\log \rho ')''\le 0\), or equivalently \((\rho '')^2 - \rho ' \rho ''' \ge 0\). As \(\rho '' > 0\) and \(b \ge 0\), the above implies \(h'(z) > 0\) for all z.
The second claim follows from \(\frac{\partial \mathrm {prox}_{b\rho }(z)}{\partial z}\ge \frac{1}{1+b\Vert \rho ''\Vert _{\infty }}>0\) (cf. [22, Equation (56)]).
It remains to analyze the behavior of G in the limits when \(b \rightarrow 0\) and \(b \rightarrow \infty \). From [22, Proposition 6.4], G(b) can also be expressed as
Since \(\rho ''\) is bounded and the integrand is at most 1, the dominated convergence theorem gives
When \(b \rightarrow \infty \), \(b\rho ''(\mathsf {prox}_{b \rho }(\tau z)) \rightarrow \infty \) for a fixed z. Again by applying the dominated convergence theorem,
It follows that \(\lim _{b \rightarrow 0}G(b)< \kappa < \lim _{b \rightarrow \infty }G(b)\) and, therefore, \(G(b) = \kappa \) has a unique positive solution.
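To make the objects in this proof concrete, here is a small sketch (our illustration, not the paper's code) for the logistic link \(\rho (t)=\log (1+e^{t})\). It computes \(\mathrm {prox}_{b\rho }(z)\) by solving \(b\rho '(x)+x=z\) with a scalar Newton iteration, then checks the optimality condition along with claim (b), monotonicity in z, including the slope lower bound \(1/(1+b\Vert \rho ''\Vert _{\infty })\) with \(\Vert \rho ''\Vert _{\infty }=1/4\):

```python
import numpy as np

def rho_p(x):
    """rho'(x), the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def prox(b, z, iters=60):
    """Solve b * rho'(x) + x = z for x, i.e. the proximal point of b*rho at z."""
    x = 0.0
    for _ in range(iters):
        g = b * rho_p(x) + x - z
        x -= g / (b * rho_p(x) * (1.0 - rho_p(x)) + 1.0)   # Newton step
    return x

b = 2.0
zs = np.linspace(-5.0, 5.0, 201)
ps = np.array([prox(b, z) for z in zs])

# optimality condition: b * rho'(prox) + prox = z
assert np.max(np.abs(b * rho_p(ps) + ps - zs)) < 1e-8
# claim (b): prox is increasing in z, with slope at least 1/(1 + b/4)
slopes = np.diff(ps) / np.diff(zs)
assert np.all(slopes > 0)
assert np.min(slopes) >= 1.0 / (1.0 + b / 4.0) - 1e-6
print(float(np.min(slopes)))
```

The Newton solve converges here because \(x\mapsto b\rho '(x)+x\) is strictly increasing with derivative in \([1, 1+b/4]\).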
Remark 3
Finally, we show that the logistic and the probit effective links obey the assumptions of Lemma 5. We work with a fixed \(\tau >0\).
-
A direct computation shows that \(\rho '\) is log-concave for the logistic model. For the probit, it is well-known that the reciprocal of the hazard function (also known as Mills’ ratio) is strictly log-convex [4].
-
To check the other condition, recall that the proximal mapping operator satisfies
$$\begin{aligned} b\rho '(\mathsf {prox}_{b \rho }(\tau z))+\mathsf {prox}_{b \rho }(\tau z) = \tau z. \end{aligned}$$(117)For a fixed z, we claim that \(\mathsf {prox}_{b \rho }(\tau z) \rightarrow -\infty \) as \(b \rightarrow \infty \). Suppose not; then \(\mathsf {prox}_{b \rho }(\tau z)\) either stays bounded or diverges to \(\infty \). If it stays bounded, the left-hand side of (117) diverges to \(\infty \) while the right-hand side stays fixed, a contradiction. If it diverges to \(\infty \), the left-hand side of (117) again diverges to \(\infty \) while the right-hand side stays fixed, which is also impossible. Hence, as \(b \rightarrow \infty \) we must have \(\mathsf {prox}_{b \rho }(\tau z) \rightarrow -\infty \) and \(b \rho '(\mathsf {prox}_{b \rho }(\tau z)) \rightarrow \infty \), with their sum remaining equal to \(\tau z\). Observe that for the logistic model, \(\rho ''(x)=\rho '(x)(1-\rho '(x))\), and for the probit, \(\rho ''(x) = \rho '(x)(\rho '(x)-x)\) [53]. Hence, combining the asymptotic behavior of \(\mathsf {prox}_{b \rho }(\tau z)\) and \(b \rho '(\mathsf {prox}_{b \rho }(\tau z))\), we obtain that \(b\rho ''(\mathsf {prox}_{b \rho }(\tau z))\) diverges to \(\infty \) in both models as \(b \rightarrow \infty \).
Proof of Lemma 6
1.1 Proof of Part (i)
Recall from [22, Proposition 6.4] that
If we denote \(c:= \mathsf {prox}_{b\rho }(0)\), then b(0) is given by the following relation:
as \(\rho ''(c)>0\) for any given \(c>0\). In addition, since \(\rho '(c) > 0\), we have
where (a) comes from (22).
1.2 Proof of Part (ii)
We defer the proof of this part to the supplemental materials [58].
Proof of Part (ii) of Theorem 4
As discussed in Sect. 5.2.2, it suffices to (1) construct a set \(\left\{ \mathcal {B}_{i}\mid 1\le i\le N\right\} \) that forms a cover of the cone \(\mathcal {A}\) defined in (52), and (2) upper bound \({\mathbb {P}}\{ \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \cap \mathcal {B}_{i}\ne \{ \varvec{0}\} \} \). In what follows, we elaborate on these two steps.
-
Step 1. Generate \(N=\exp \left( 2\epsilon ^{2}p\right) \) i.i.d. points \(\varvec{z}^{(i)}\sim \mathcal {N}(\varvec{0},\frac{1}{p}\varvec{I}_{p})\), \(1\le i\le N\), and construct a collection of convex cones
$$\begin{aligned} \mathcal {C}_{i}:=\left\{ \varvec{u}\in {\mathbb {R}}^{p}\left| \left\langle \varvec{u},\frac{\varvec{z}^{(i)}}{\Vert \varvec{z}^{(i)}\Vert }\right\rangle \ge \epsilon \Vert \varvec{u}\Vert \right. \right\} ,\qquad 1\le i\le N. \end{aligned}$$In words, \(\mathcal {C}_{i}\) consists of all directions that have nontrivial positive correlation with \(\varvec{z}^{(i)}\). With high probability, this collection \(\left\{ \mathcal {C}_{i}\mid 1\le i\le N\right\} \) forms a cover of \({\mathbb {R}}^{p}\), a fact which is an immediate consequence of the following lemma.
Lemma 12
Consider any given constant \(0<\epsilon <1\), and let \(N=\exp \left( 2\epsilon ^{2}p\right) \). Then there exist some positive universal constants \(c_{5},C_{5}>0\) such that with probability exceeding \(1-C_{5}\exp \left( -c_{5}\epsilon ^{2}p\right) \),
holds simultaneously for all \(\varvec{x} \in {\mathbb {R}}^p\).
With our family \(\left\{ \mathcal {C}_{i}\mid 1\le i\le N\right\} \) we can introduce
which in turn forms a cover of the nonconvex cone \(\mathcal {A}\) defined in (52). To justify this, note that for any \(\varvec{u}\in \mathcal {A}\), one can find \(i\in \{1,\ldots ,N\}\) obeying \(\varvec{u}\in \mathcal {C}_{i}\), or equivalently, \(\left\langle \varvec{u},\frac{\varvec{z}^{(i)}}{\Vert \varvec{z}^{(i)}\Vert }\right\rangle \ge \epsilon \Vert \varvec{u}\Vert \), with high probability. Combined with the membership to \(\mathcal {A}\) this gives
indicating that \(\varvec{u}\) is contained within some \(\mathcal {B}_{i}\).
-
Step 2. We now move on to control \({\mathbb {P}}\left\{ \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \cap \mathcal {B}_{i}\ne \left\{ \varvec{0}\right\} \right\} \). If the statistical dimensions of the two cones obey \(\delta \left( \mathcal {B}_{i}\right) <n-\delta \left( \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \right) =n-p\), then an application of [3, Theorem I] gives
$$\begin{aligned}&{\mathbb {P}}\left\{ \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \cap \mathcal {B}_{i}\ne \left\{ \varvec{0}\right\} \right\} \nonumber \\&\quad \le 4\exp \left\{ -\frac{1}{8}\left( \frac{n-\delta \left( \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \right) -\delta \left( \mathcal {B}_{i}\right) }{\sqrt{n}}\right) ^{2}\right\} \nonumber \\&\quad \le 4\exp \left\{ -\frac{\left( n-p-\delta (\mathcal {B}_{i})\right) ^{2}}{8n}\right\} . \end{aligned}$$(120)It then comes down to upper bounding \(\delta (\mathcal {B}_{i})\), which is the content of the following lemma.
Lemma 13
Fix \(\epsilon >0\). When n is sufficiently large, the statistical dimension of the convex cone \(\mathcal {B}_{i}\) defined in (119) obeys
where \(H(x):=-x\log x-(1-x)\log (1-x)\).
Substitution into (120) gives
Finally, we prove Lemmas 12 and 13 in the next subsections. These are the only remaining parts for the proof of Theorem 4.
1.1 Proof of Lemma 12
To begin with, observe that each \(\Vert \varvec{z}^{(i)}\Vert \) concentrates around 1. Specifically, apply [34, Proposition 1] to get
and set \(t=3\epsilon ^{2}p\) to reach
Taking the union bound we obtain
Next, we note that it suffices to prove Lemma 12 for all unit vectors \(\varvec{x}\). The following lemma provides a bound on \(\left\langle \varvec{z}^{(i)},\varvec{x}\right\rangle \) for any fixed unit vector \(\varvec{x}\in {\mathbb {R}}^{p}\).
Lemma 14
Consider any fixed unit vector \(\varvec{x}\in {\mathbb {R}}^{p}\) and any given constant \(0<\epsilon <1\), and set \(N=\exp \left( 2\epsilon ^{2}p\right) \). There exist positive universal constants \(c_{5},c_{6},C_{6}>0\) such that
Recognizing that Lemma 12 is a uniform result, we need to extend Lemma 14 to all \(\varvec{x}\) simultaneously, which we achieve via the standard covering argument. Specifically, one can find a set \(\mathcal {C}:=\left\{ \varvec{x}^{(j)}\in {\mathbb {R}}^{p}\mid 1\le j\le K\right\} \) of unit vectors with cardinality \(K=\left( 1+2p^{2}\right) ^{p}\) to form a cover of the unit ball of resolution \(p^{-2}\) [65, Lemma 5.2]; that is, for any unit vector \(\varvec{x}\in {\mathbb {R}}^{p}\), there exists an \(\varvec{x}^{(j)}\in \mathcal {C}\) such that
Apply Lemma 14 and take the union bound to arrive at
with probability exceeding \(1-K\exp \left\{ -2\exp \left( \left( 1-o(1)\right) \frac{7}{4}\epsilon ^{2}p\right) \right\} \ge 1-\exp \left\{ -2\left( 1-o\left( 1\right) \right) \exp \left( \left( 1-o(1)\right) \frac{7}{4}\epsilon ^{2}p\right) \right\} \). This guarantees that for each \(\varvec{x}^{(j)}\), one can find at least one \(\varvec{z}^{(i)}\) obeying
This result together with (123) yields that, with probability exceeding \(1-C\exp \left( -c\epsilon ^2 p\right) \) for some universal constants \(C,c>0\),
holds simultaneously for all unit vectors \(\varvec{x}\in {\mathbb {R}}^{p}\). Since \(\epsilon >0\) can be an arbitrary constant, this concludes the proof.
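The covering phenomenon behind Lemma 12 can be illustrated numerically in low dimension (a sketch with a small \(p\) of our choosing, not the paper's code): with \(N=e^{2\epsilon ^{2}p}\) random directions, every unit vector correlates with some \(\varvec{z}^{(i)}/\Vert \varvec{z}^{(i)}\Vert \) at level at least \(\epsilon \) with overwhelming probability.

```python
import numpy as np

rng = np.random.default_rng(2)
p, eps = 25, 0.4
N = int(np.exp(2 * eps**2 * p))                 # about 3000 directions
Z = rng.standard_normal((N, p))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # unit vectors z^(i) / ||z^(i)||

# every test direction x should correlate with some z^(i) at level >= eps
worst = 1.0
for _ in range(200):
    x = rng.standard_normal(p)
    x /= np.linalg.norm(x)
    worst = min(worst, float(np.max(Z @ x)))
print(worst)   # typically well above eps
assert worst >= eps
```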
Proof of Lemma 14
Without loss of generality, it suffices to consider \(\varvec{x}=\varvec{e}_{1}=[1,0,\ldots ,0]^{\top }\). For any \(t>0\) and any constant \(\zeta >0\), it comes from [2, Theorem A.1.4] that
Setting \(t=1-\Phi \left( \zeta \sqrt{p}\right) \) gives
Recall that for any \(t>1\), one has \((t^{-1} -t^{-3})\phi (t) \le 1-\Phi (t)\le t^{-1} \phi (t)\) which implies that
Taking \(\zeta =\frac{1}{2}\epsilon \), we arrive at
This justifies that
as claimed. \(\square \)
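The Mills-ratio bounds invoked above are classical (cf. [4, 53]); a quick numerical check, for illustration only:

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def tail(t):
    """1 - Phi(t) for the standard normal, via erfc."""
    return 0.5 * math.erfc(t / math.sqrt(2))

for t in [1.5, 2.0, 3.0, 5.0, 8.0]:
    lower = (1 / t - 1 / t**3) * phi(t)
    upper = (1 / t) * phi(t)
    assert lower <= tail(t) <= upper
    print(t, lower, tail(t), upper)
```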
1.2 Proof of Lemma 13
First of all, recall from the definition (19) that
where \(\varvec{g}\sim \mathcal {N}\left( \varvec{0},\varvec{I}_{n}\right) \), and \(\mathcal {D}_{i}\) is a superset of \(\mathcal {B}_{i}\) defined by
Recall from the triangle inequality that
Since \(\varvec{0}\in \mathcal {D}_{i}\), this implies that
revealing that
In what follows, it suffices to look at the set of \(\varvec{u}\)’s within \(\mathcal {D}_{i}\) obeying \(\Vert \varvec{u}\Vert \le 2\Vert \varvec{g}\Vert \), which satisfy
It is seen that
-
1.
Regarding the first term of (128), we first recognize that
$$\begin{aligned} \left| \left\{ i\mid u_{i}\le -\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert \right\} \right| \le \frac{\sum _{i:\text { }u_{i}<0}|u_{i}|}{\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }=\frac{\sum _{i=1}^{n}\max \left\{ -u_{i},0\right\} }{\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }\le 2\sqrt{\epsilon }n, \end{aligned}$$where the last inequality follows from the constraint (127). As a consequence,
$$\begin{aligned} \sum _{i:g_{i}<0,\text { }u_{i}>-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }g_{i}^{2}\ge & {} \sum _{i:g_{i}<0}g_{i}^{2}-\sum _{i:u_{i}\le -\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }g_{i}^{2}\\\ge & {} \sum _{i:g_{i}<0}g_{i}^{2}-\max _{S\subseteq [n]:\text { }|S|=2\sqrt{\epsilon }n}\sum _{i\in S}g_{i}^{2}. \end{aligned}$$ -
2.
Next, we turn to the second term of (128), which can be bounded by
$$\begin{aligned}&\sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}u_{i}g_{i}\\&\quad \le \sqrt{\left( \sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}u_{i}^{2}\right) \left( \sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}g_{i}^{2}\right) }\\&\quad \le \sqrt{\left( \max _{i:-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}|u_{i}|\right) \left( \sum _{i:u_{i}<0}|u_{i}|\right) \cdot \Vert \varvec{g}\Vert ^{2}}\\&\quad \le \sqrt{\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert \left( \sum _{i:u_{i}<0}|u_{i}|\right) \cdot \Vert \varvec{g}\Vert ^{2}}\le \sqrt{2}\epsilon ^{\frac{3}{4}}\Vert \varvec{g}\Vert ^{2}, \end{aligned}$$where the last inequality follows from the constraint (127).
Putting the above results together, we have
for any \(\varvec{u}\in \mathcal {D}_{i}\) obeying \(\Vert \varvec{u}\Vert \le 2\Vert \varvec{g}\Vert \), whence
Finally, it follows from [34, Proposition 1] that for any \(t>2\sqrt{\epsilon }n\),
which together with the union bound gives
This gives
for any given \(\epsilon >0\) with the proviso that n is sufficiently large. This combined with (129) yields
as claimed.
Proof of Lemma 8
Throughout, we shall restrict ourselves on the event \(\mathcal {A}_n\) as defined in (86), on which \(\tilde{\varvec{G}}\succeq \lambda _{\mathrm {lb}}\varvec{I}\). Recalling the definitions of \(\tilde{\varvec{G}}\) and \(\varvec{w}\) from (82) and (89), we see that
If we let the singular value decomposition of \(\frac{1}{\sqrt{n}}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \tilde{\varvec{X}}\) be \(\varvec{U}\varvec{\Sigma }\varvec{V}^{\top }\), then a little algebra gives \(\varvec{\Sigma }\succeq \sqrt{\lambda _{\mathrm {lb}} }\varvec{I}\) and
Substituting this into (131) and using the fact \(\Vert \varvec{X}_{\cdot 1}\Vert ^2 \lesssim n\) with high probability (by Lemma 2), we obtain
with probability at least \(1-\exp (-\Omega (n))\).
Proof of Lemma 9
Throughout this and the subsequent sections, we consider \(H_n\) and \(K_n\) to be two diverging sequences with the following properties:
for any constants \(c_i > 0\), \(i=1,2\) and any \(\epsilon >0\). This lemma is an analogue of [25, Proposition 3.18]. We modify and adapt the proof ideas to establish the result in our setup. Throughout we shall restrict ourselves to the event \(\mathcal {A}_n\), on which \(\tilde{\varvec{G}}\succeq \lambda _{\mathrm {lb}}\varvec{I}\).
Due to independence between \(\varvec{X}_{\cdot 1}\) and \(\{\varvec{D}_{\tilde{\varvec{\beta }}}, \varvec{H}\}\), one can invoke the Hanson-Wright inequality [52, Theorem 1.1] to yield
where \(\Vert .\Vert _{\mathrm {F}}\) denotes the Frobenius norm. Choose \(t = C^2 \big \Vert \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \big \Vert H_n/\sqrt{n}\) with \(C>0\) a sufficiently large constant, and take \(H_n\) to be as in (132). Substitution into the above inequality and unconditioning give
for some universal constants \(C, c>0\).
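For intuition, the Hanson-Wright inequality says that a quadratic form \(\varvec{x}^{\top }\varvec{A}\varvec{x}\) in independent sub-Gaussian entries concentrates around its mean \(\mathrm {Tr}(\varvec{A})\), with fluctuations on the order of \(\Vert \varvec{A}\Vert _{\mathrm {F}}\). A generic empirical check (with a random symmetric matrix of our choosing, not the matrices above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric test matrix

samples = 2000
q = np.empty(samples)
for k in range(samples):
    x = rng.standard_normal(n)         # sub-Gaussian (here Gaussian) entries
    q[k] = x @ A @ x

# mean is Tr(A); for Gaussian x the standard deviation is sqrt(2) * ||A||_F
print(q.mean(), np.trace(A), np.sqrt(2) * np.linalg.norm(A, "fro"))
assert abs(q.mean() - np.trace(A)) < 5 * np.sqrt(2) * np.linalg.norm(A, "fro") / np.sqrt(samples)
```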
It remains to analyze \(\mathrm {Tr}\big ( \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \big )\). Recall from the definition (92) of \(\varvec{H}\) that
and, hence,
This requires us to analyze \(\tilde{\varvec{G}}^{-1}\) carefully. To this end, recall that the matrix \(\tilde{\varvec{G}}_{(i)}\) defined in (83) obeys
Invoking the Sherman–Morrison–Woodbury formula (e.g. [31]), we have
It follows that
which implies that
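The Sherman–Morrison step can be sanity-checked numerically; below is a generic sketch of the rank-one version of the formula, \((\varvec{A}+\varvec{u}\varvec{v}^{\top })^{-1}=\varvec{A}^{-1}-\frac{\varvec{A}^{-1}\varvec{u}\varvec{v}^{\top }\varvec{A}^{-1}}{1+\varvec{v}^{\top }\varvec{A}^{-1}\varvec{u}}\), with random matrices in place of \(\tilde{\varvec{G}}\):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)            # positive definite and well conditioned
u = 0.1 * rng.standard_normal(n)       # small update keeps 1 + v'A^{-1}u away from 0
v = 0.1 * rng.standard_normal(n)

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)
assert np.allclose(lhs, rhs, atol=1e-8)
print(np.max(np.abs(lhs - rhs)))
```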
The relations (134) and (136) taken collectively reveal that
We shall show that the trace above is close to \(\mathrm {Tr}(\varvec{I}- \varvec{H})\) up to some factors. For this purpose we analyze the latter quantity in two different ways. To begin with, observe that
On the other hand, it directly follows from the definition of \(\varvec{H}\) and (136) that the \(i{\text {th}}\) diagonal entry of \(\varvec{H}\) is given by
Applying this relation, we can compute \(\mathrm {Tr}(\varvec{I}-\varvec{H})\) analytically as follows:
where \(\tilde{\alpha }:= \frac{1}{n} \mathrm {Tr}\left( \tilde{\varvec{G}}^{-1}\right) \).
Observe that the first quantity in the right-hand side above is simply \(\tilde{\alpha }\mathrm {Tr}\big (\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\big )\). For simplicity, denote
Note that \(\tilde{\varvec{G}}_{(i)}\succ \varvec{0}\) on \(\mathcal {A}_n \) and that \(\rho ''>0\). Hence the denominator in the second term in (140) is greater than 1 for all i. Comparing (138) and (140), we deduce that
on \(\mathcal {A}_n\). It thus suffices to control \(\sup _i|\eta _i|\). The above bounds together with (87) and the proposition below complete the proof.
Proposition 1
Let \(\eta _i\) be as defined in (141). Then there exist universal constants \(C_1,C_2, C_3 > 0\) such that
where \(K_n , H_n\) are diverging sequences as specified in (132).
Proof of Proposition 1
Fix any index i. Recall that \(\tilde{\varvec{\beta }}_{[-i]}\) is the MLE when the \(1{\text {st}}\) predictor and the \(i{\text {th}}\) observation are removed. Also recall the definition of \(\tilde{\varvec{G}}_{[-i]}\) in (85). The proof essentially follows three steps. First, note that \(\tilde{\varvec{X}}_i\) and \(\tilde{\varvec{G}}_{[-i]}\) are independent. Hence, an application of the Hanson–Wright inequality [52] yields
We choose \(t = C^2 \big \Vert \tilde{\varvec{G}}_{[-i]}^{-1} \big \Vert H_n/\sqrt{n}\), where \(C>0\) is a sufficiently large constant.
Now marginalizing gives
where \(C' > 0 \) is a sufficiently large constant. On \(\mathcal {A}_n\), the spectral norm \(\big \Vert \tilde{\varvec{G}}_{(i)}^{-1} \big \Vert \) is bounded above by \(\lambda _{\mathrm {lb}}^{-1}\) for all i. Invoking (87), we obtain that there exist universal constants \(C_1, C_2, C_3>0 \) such that
The next step consists of showing that \(\mathrm {Tr}\big (\tilde{\varvec{G}}_{[-i]}^{-1}\big ) \) (resp. \(\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{[-i]}^{-1}\tilde{\varvec{X}}_i\)) and \(\mathrm {Tr}\big (\tilde{\varvec{G}}_{(i)}^{-1}\big ) \) (resp. \(\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{(i)}^{-1} \tilde{\varvec{X}}_i\)) are uniformly close across all i. This is established in the following lemma.
Lemma 15
Let \(\tilde{\varvec{G}}_{(i)}\) and \(\tilde{\varvec{G}}_{[-i]}\) be defined as in (83) and (85), respectively. Then there exist universal constants \(C_1, C_2, C_3,C_4,c_2,c_3 > 0 \) such that
where \(K_n, H_n\) are diverging sequences as defined in (132).
This together with (143) yields that
The final ingredient is to establish that \(\frac{1}{n}\mathrm {Tr}\big (\tilde{\varvec{G}}_{(i)}^{-1}\big )\) and \(\frac{1}{n} \mathrm {Tr}\big (\tilde{\varvec{G}}^{-1}\big )\) are uniformly close across i.
Lemma 16
Let \(\tilde{\varvec{G}}\) and \(\tilde{\varvec{G}}_{(i)}\) be as defined in (82) and (83), respectively. Then one has
This completes the proof. \(\square \)
Proof of Lemma 15
For two invertible matrices \(\varvec{A}\) and \(\varvec{B}\) of the same dimensions, the difference of their inverses can be written as
Applying this identity, we have
From the definition of these matrices, it follows directly that
As \(\rho '''\) is bounded, by the mean-value theorem, it suffices to control the differences \(\varvec{X}_j^{\top } \tilde{\varvec{\beta }}_{[-i]}- \tilde{\varvec{X}}_j^{\top } \tilde{\varvec{\beta }}\) uniformly across all j. This is established in the following lemma, the proof of which is deferred to Appendix H.
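The difference-of-inverses step at the start of this proof is the standard resolvent identity \(\varvec{A}^{-1}-\varvec{B}^{-1}=\varvec{A}^{-1}(\varvec{B}-\varvec{A})\varvec{B}^{-1}\), which a short NumPy check confirms (generic well-conditioned matrices; names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
# two generic invertible matrices (diagonal shift keeps them well conditioned)
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, n)) + n * np.eye(n)

# resolvent identity: A^{-1} - B^{-1} = A^{-1} (B - A) B^{-1}
lhs = np.linalg.inv(A) - np.linalg.inv(B)
rhs = np.linalg.inv(A) @ (B - A) @ np.linalg.inv(B)

err = np.abs(lhs - rhs).max()
print(err)
```

The identity reduces the comparison of two inverses to the comparison of the matrices themselves, which is why it suffices to control \(\tilde{\varvec{G}}_{(i)}-\tilde{\varvec{G}}_{[-i]}\) in spectral norm.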
Lemma 17
Let \(\hat{\varvec{\beta }}\) be the full model MLE and \(\hat{\varvec{\beta }}_{[-i]}\) be the MLE when the \(i{\text {th}}\) observation is dropped. Let \(q_i\) be as described in Lemma 18 and \(K_n, H_n\) be as in (132). Then there exist universal constants \(C_1,C_2,C_3,C_4,c_2,c_3>0\) such that
Invoking this lemma, we see that the spectral norm of (148) is bounded above by some constant times
with high probability as specified in (149). From Lemma 2, the spectral norm here is bounded by some constant with probability at least \(1-c_1 \exp (-c_2n)\). These observations together with (87) and the fact that on \(\mathcal {A}_n\) the minimum eigenvalues of \(\tilde{\varvec{G}}_{(i)}\) and \(\tilde{\varvec{G}}_{[-i]}\) are bounded below by \(\lambda _{\mathrm {lb}}\) yield that
This is true for any i. Hence, taking the union bound we obtain
In order to establish the first result, note that
To obtain the second result, note that
Therefore, combining (151) and Lemma 2 gives the desired result. \(\square \)
Proof of Lemma 16
We restrict ourselves to the event \(\mathcal {A}_n\) throughout. Recalling (135), one has
In addition, on \(\mathcal {A}_n\) we have
Combining these results and recognizing that \(\rho ''>0\), we get
as claimed. \(\square \)
Proof of Lemma 11
Again, we restrict ourselves to the event \(\mathcal {A}_n\) on which \(\tilde{\varvec{G}}\succeq \lambda _{\mathrm {lb}}\varvec{I}\). Note that
Note that \(\{\tilde{\varvec{G}}, \tilde{\varvec{X}}\}\) and \(\varvec{X}_{\cdot 1}\) are independent. Conditional on \(\tilde{\varvec{X}}\), the left-hand side is Gaussian with mean zero and variance \(\frac{1}{n^2}\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}^2 \tilde{\varvec{X}}\tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}_i\). The variance is bounded above by
In turn, Lemma 2 asserts that \(n^{-1} \Vert \tilde{\varvec{X}}_i\Vert ^2\) is bounded by a constant with high probability. As a result, applying Gaussian concentration results [60, Theorem 2.1.12] gives
with probability exceeding \(1-C\exp \left( -cH_n^2\right) \), where \( C, c >0\) are universal constants.
In addition, \(\sup _i|X_{i1}| \lesssim H_n\) holds with probability exceeding \(1-C\exp \left( -cH_n^2\right) \). Putting the above results together, applying the triangle inequality \(|X_{i1}-\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1}\varvec{w}|\le |X_{i1}|+ |\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1}\varvec{w}|\), and taking the union bound, we obtain
Proof of Lemma 17
The goal of this section is to prove Lemma 17, which relates the full-model MLE \(\hat{\varvec{\beta }}\) and the MLE \(\hat{\varvec{\beta }}_{[-i]}\). To this end, we establish the key lemma below.
Lemma 18
Suppose \(\hat{\varvec{\beta }}_{[-i]}\) denotes the MLE when the \(i{\text {th}}\) observation is dropped. Further, let \(\varvec{G}_{[-i]}\) be as in (84), and define \(q_i\) and \(\hat{\varvec{b}}\) as follows:
Suppose \(K_n, H_n\) are diverging sequences as in (132). Then there exist universal constants \(C_1, C_2, C_3 > 0\) such that
The proof ideas are inspired by the leave-one-observation-out approach of [25]. We emphasize once more, however, that the adaptation of these ideas to our setup is not straightforward and crucially hinges on Theorem 4, Lemma 7, and properties of the effective link function.
Proof of Lemma 18
Invoking techniques similar to those used to establish Lemma 7, it can be shown that
with probability at least \(1- \exp (-\Omega (n))\), where \(\gamma _i^{*}\) is between \(\varvec{X}_i^{\top } \hat{\varvec{b}}\) and \(\varvec{X}_i^{\top } \hat{\varvec{\beta }}\). Denote by \(\mathcal {B}_n\) the event where (157) holds. Throughout this proof, we work on the event \(\mathcal {C}_n :=\mathcal {A}_n \cap \mathcal {B}_n\), which has probability \(1- \exp \left( -\Omega (n)\right) \). As in (107), then,
Next, we simplify (158). To this end, recall the defining relation of the proximal operator
which together with the definitions of \(\hat{\varvec{b}}\) and \(q_i\) gives
Now, let \(\ell _{[-i]}\) denote the negative log-likelihood function when the \(i{\text {th}}\) observation is dropped, so that \(\nabla \ell _{[-i]}\big ( \hat{\varvec{\beta }}_{[-i]}\big ) = \varvec{0}\). Expressing \(\nabla \ell (\hat{\varvec{b}})\) as \(\nabla \ell (\hat{\varvec{b}})-\nabla \ell _{[-i]}\big ( \hat{\varvec{\beta }}_{[-i]}\big )\), applying the mean value theorem, and using an analysis similar to that in [25, Proposition 3.4], we obtain
where \(\gamma _j^{*}\) is between \(\varvec{X}_j^{\top }\hat{\varvec{b}}\) and \(\varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]}\). Combining (158) and (160) leads to the upper bound
We need to control each term on the right-hand side. To start with, the first term is bounded by a universal constant with probability \(1- \exp (-\Omega (n))\) (Lemma 2). For the second term, since \(\gamma _{j}^{*}\) lies between \(\varvec{X}_j^{\top }\hat{\varvec{b}}\) and \(\varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]}\) and \(\Vert \rho '''\Vert _\infty <\infty \), we get
Given that \(\{\varvec{X}_j, \varvec{G}_{[-i]}\}\) and \(\varvec{X}_i\) are independent for all \(j \ne i\), conditional on \(\{\varvec{X}_j,\varvec{G}_{[-i]}\}\) one has
In addition, the variance satisfies
with probability at least \(1-\exp (-\Omega (n))\). Applying standard Gaussian concentration results [60, Theorem 2.1.12], we obtain
By the union bound
Consequently,
In addition, the third term on the right-hand side of (161) can be upper bounded as well, since
with high probability.
It remains to bound \(\left| \rho ' \left( \mathsf {prox}_{q_i \rho }(\varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}) \right) \right| \). To do this, we first consider \(\rho '(\mathsf {prox}_{c \rho }(Z))\) for a fixed constant \(c>0\) in place of the random variable \(q_i\). Recall that for any constant \(c > 0\) and any \(Z \sim {\mathcal {N}}(0,\sigma ^2)\) with finite variance, the random variable \(\rho '(\mathsf {prox}_{c \rho }(Z))\) is sub-Gaussian. Conditional on \(\hat{\varvec{\beta }}_{[-i]}\), one has \(\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}\sim {\mathcal {N}}\big ( 0,\Vert \hat{\varvec{\beta }}_{[-i]}\Vert ^2 \big )\). This yields
for some constants \(C_1,C_2,C_3,C_4, C_5>0\), since \(\Vert \hat{\varvec{\beta }}_{[-i]}\Vert \) is bounded with high probability (see Theorem 4).
Note that \(\frac{\partial \mathsf {prox}_{b\rho }(z)}{\partial b}\le 0\) by [22, Proposition 6.3]. Hence, to pass from the above concentration result, established for a fixed constant \(c\), to the random variables \(q_i\), it suffices to establish a uniform lower bound for \(q_i\) with high probability. Observe that for each i,
with probability \(1-\exp (-\Omega (n))\), where \(C^*\) is some universal constant. On this event, one has
This taken collectively with (170) yields
This controls the last term.
To summarize, if \(\{K_n\}\) and \(\{H_n\}\) are diverging sequences satisfying the assumptions in (132), combining (161) with the bounds for each term on the right-hand side finally gives (155). On the other hand, combining (167) and (172) yields (156). \(\square \)
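The proximal operator \(\mathsf {prox}_{b\rho }\) appearing throughout this proof has no closed form for the logistic link, but its defining first-order condition \(b\rho '(x)+x=z\) (from minimizing \(b\rho (x)+\tfrac{1}{2}(x-z)^2\)) can be solved by Newton's method. A minimal sketch, assuming the logistic link \(\rho (t)=\log (1+e^{t})\); `prox_logistic` is our own name:

```python
import numpy as np

def prox_logistic(z, b, iters=50):
    """Solve b*rho'(x) + x = z by Newton's method, where
    rho(t) = log(1 + e^t), so rho'(t) is the sigmoid."""
    x = z  # starting point
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-x))   # rho'(x)
        f = b * s + x - z              # first-order optimality residual
        fp = b * s * (1.0 - s) + 1.0   # derivative of the residual; >= 1
        x = x - f / fp
    return x

z, b = 1.3, 0.7
x = prox_logistic(z, b)
s = 1.0 / (1.0 + np.exp(-x))
residual = abs(b * s + x - z)
print(x, residual)
```

Since the residual's derivative is bounded below by 1, the root is unique and the iteration is numerically stable, which mirrors the well-posedness of the proximal mapping used in the text.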
With the help of Lemma 18 we are ready to prove Lemma 17. Indeed, observe that
and hence combining Lemmas 2 and 18 establishes the first claim (149). The second claim (150) follows directly from Lemmas 2 and 18 together with (159).
Proof of Theorem 7(b)
This section proves that the random sequence \(\tilde{\alpha }= \mathrm {Tr}\big (\tilde{\varvec{G}}^{-1}\big )/n\) converges in probability to the constant \(b_{*}\) defined by the system of Eqs. (25) and (26). To begin with, we claim that \(\tilde{\alpha }\) is close to a set of auxiliary random variables \(\{\tilde{q}_i\}\) defined below.
Lemma 19
Define \(\tilde{q}_i\) to be
where \(\tilde{\varvec{G}}_{[-i]}\) is defined in (85).
Then there exist universal constants \(C_1, C_2, C_3, C_4, c_2,c_3 >0\) such that
where \(K_n, H_n\) are as in (132).
Proof
This result follows directly from Proposition 1 and Eq. (144). \(\square \)
A consequence is that \(\mathsf {prox}_{\tilde{q}_{i} \rho }\left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) \) becomes close to \(\mathsf {prox}_{\tilde{\alpha }\rho } \left( \varvec{X}_i ^ {\top } \hat{\varvec{\beta }}_{[-i]}\right) \).
Lemma 20
Let \(\tilde{q}_i\) and \(\tilde{\alpha }\) be as defined earlier. Then one has
where \(K_n, H_n\) are as in (132).
The key idea behind studying \(\mathsf {prox}_{\tilde{\alpha }\rho } \left( \varvec{X}_i ^ {\top } \hat{\varvec{\beta }}_{[-i]}\right) \) is that it is connected to a random function \(\delta _n(\cdot )\) defined below, which is closely related to Eq. (26). In fact, we will show that \(\delta _n(\tilde{\alpha })\) converges in probability to 0; the proof relies on the connection between \(\mathsf {prox}_{\tilde{\alpha }\rho } \left( \varvec{X}_i ^ {\top } \hat{\varvec{\beta }}_{[-i]}\right) \) and the auxiliary quantity \(\mathsf {prox}_{\tilde{q}_{i} \rho }\left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) \). The formal result is this:
Proposition 2
For any index i, let \(\hat{\varvec{\beta }}_{[-i]}\) be the MLE obtained on dropping the \(i{\text {th}}\) observation. Define \(\delta _n(x)\) to be the random function
Then one has \( \delta _n(\tilde{\alpha }) ~{\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}~ 0\).
Furthermore, the random function \(\delta _n(x)\) converges to a deterministic function \(\Delta (x)\) defined by
where \(Z \sim {\mathcal {N}}(0,1)\), and \(\tau _{*}\) is such that \((\tau _{*},b_{*})\) is the unique solution to (25) and (26).
Proposition 3
With \(\Delta (x) \) as in (175), \(\Delta (\tilde{\alpha }) {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0.\)
In fact, one can easily verify that
and hence by Lemma 5, the solution to \(\Delta (x)=0\) is exactly \(b_{*}\). As a result, putting the above claims together, we show that \(\tilde{\alpha }\) converges in probability to \(b_{*}\).
It remains to formally prove the preceding lemmas and propositions, which is the goal of the rest of this section.
Proof of Lemma 20
By [22, Proposition 6.3], one has
which yields
where \(q_{\tilde{\alpha },i}\) is between \(\tilde{q}_{i}\) and \(\tilde{\alpha }\). Here, the last inequality holds since \(q_{\tilde{\alpha },i} \ge 0\) and \(\rho ''\ge 0 \).
In addition, just as in the proof of Lemma 18, one can show that \(q_{i}\) is bounded below by some constant \(C^{*}>0\) with probability \(1- \exp (- \Omega ( n) )\). Since \(q_{\tilde{\alpha },i} \ge \min \{\tilde{q}_i,\tilde{\alpha }\} \), on the event \(\sup _{i}|\tilde{q}_i - \tilde{\alpha }| \le C_1 K_n^2 H_n /\sqrt{n}\), which happens with high probability (Lemma 19), \(q_{\tilde{\alpha },i} \ge C_{\alpha }\) for some universal constant \(C_\alpha >0\). Hence, by an argument similar to that establishing (172), we have
This together with (177) and Lemma 19 concludes the proof. \(\square \)
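The monotonicity \(\partial \mathsf {prox}_{b\rho }(z)/\partial b\le 0\) from [22, Proposition 6.3], used in the proof above, can also be confirmed numerically for the logistic link. This is a finite grid check, not a proof; the Newton solver is our own helper:

```python
import numpy as np

def prox_logistic(z, b, iters=50):
    # Newton's method for b*rho'(x) + x = z, with rho'(x) the sigmoid
    x = z
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-x))
        x -= (b * s + x - z) / (b * s * (1.0 - s) + 1.0)
    return x

# prox_{b rho}(z) should be nonincreasing in b for every fixed z
zs = np.linspace(-4.0, 4.0, 41)
bs = np.linspace(0.1, 3.0, 30)
monotone = all(
    all(prox_logistic(z, b2) <= prox_logistic(z, b1) + 1e-9
        for b1, b2 in zip(bs, bs[1:]))
    for z in zs
)
print(monotone)
```

Implicit differentiation of \(b\rho '(x)+x=z\) gives \(\partial x/\partial b = -\rho '(x)/(1+b\rho ''(x))\le 0\) since \(\rho '\ge 0\), which is exactly what the grid check reflects.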
Proof of Proposition 2
To begin with, recall from (138) and (139) that on \(\mathcal {A}_n\),
Using the fact that \(\big | \frac{1}{1+x} - \frac{1}{1+y} \big | \le |x-y|\) for \(x, y \ge 0\), we obtain
with high probability (Proposition 1). This combined with (178) yields
The above bound concerns \(\frac{1}{n} \sum _{i=1}^n \frac{1}{1+ \rho ''(\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{\beta }}) \tilde{\alpha }} \), and it remains to relate it to \(\frac{1}{n} \sum _{i=1}^n \frac{1}{1+ \rho ''\left( \mathsf {prox}_{\tilde{\alpha }\rho } \left( \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{\beta }}\right) \right) \tilde{\alpha }} \). To this end, we first get from the uniform boundedness of \(\rho '''\) and Lemma 17 that
Note that
By the bound (179), an application of Lemma 20, and the fact that \(\tilde{\alpha }\le p/(n\lambda _{\mathrm {lb}})\) (on \(\mathcal {A}_n\)), we obtain
This establishes that \(\delta _n(\tilde{\alpha }) {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0\). \(\square \)
Proof of Proposition 3
Here we only provide the main steps of the proof. Note that since \(0 < \tilde{\alpha }\le p/(n\lambda _{\mathrm {lb}}):=B\) on \(\mathcal {A}_n\), it suffices to show that
We do this by following three steps. Below, \(M>0\) is some sufficiently large constant.
1. First, we truncate the random function \(\delta _n(x)\) and define
$$\begin{aligned} \tilde{\delta }_n(x) = \frac{p}{n} - 1+ \frac{1}{n} \sum _{i=1}^n \frac{1}{1+ x \rho ''\left( \mathsf {prox}_{x\rho }\left( \varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}\varvec{1}_{\{\Vert \hat{\varvec{\beta }}_{[-i]}\Vert \le M\}} \right) \right) } . \end{aligned}$$
The first step is to show that \(\sup _{x \in [0,B]} \left| \tilde{\delta }_n(x) - \delta _n(x) \right| {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0.\) This step can be established using Theorem 4 and some straightforward analysis. We stress that this truncation does not arise in [25], and we must keep track of it throughout the rest of the proof.
2. Show that \(\sup _{x \in [0,B]} \left| \tilde{\delta }_n(x) - {{\,\mathrm{{\mathbb {E}}}\,}}\big [\tilde{\delta }_n(x)\big ] \right| {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0\).
3. Show that \(\sup _{x \in [0,B]} \left| {{\,\mathrm{{\mathbb {E}}}\,}}\big [ \tilde{\delta }_n(x) \big ] - \Delta (x) \right| {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0\).
Steps 2 and 3 can be established by arguments similar to those in [25, Lemmas 3.24, 3.25], with necessary modifications for our setup. We skip the detailed arguments here and refer the reader to [58]. \(\square \)
Sur, P., Chen, Y. & Candès, E.J. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled Chi-square. Probab. Theory Relat. Fields 175, 487–558 (2019). https://doi.org/10.1007/s00440-018-00896-9
Keywords
- Logistic regression
- Likelihood-ratio tests
- Wilks’ theorem
- High-dimensionality
- Goodness of fit
- Approximate message passing
- Concentration inequalities
- Convex geometry
- Leave-one-out analysis