The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled Chi-square

Probability Theory and Related Fields

Abstract

Logistic regression is used thousands of times a day to fit data, predict future outcomes, and assess the statistical significance of explanatory variables. When used for the purpose of statistical inference, logistic models produce p-values for the regression coefficients by using an approximation to the distribution of the likelihood-ratio test (LRT). Indeed, Wilks’ theorem asserts that whenever we have a fixed number p of variables, twice the log-likelihood ratio (LLR) \(2 \Lambda \) is distributed as a \(\chi ^2_k\) variable in the limit of large sample sizes n; here, \(\chi ^2_k\) is a Chi-square with k degrees of freedom and k the number of variables being tested. In this paper, we prove that when p is not negligible compared to n, Wilks’ theorem does not hold and that the Chi-square approximation is grossly incorrect; in fact, this approximation produces p-values that are far too small (under the null hypothesis). Assume that n and p grow large in such a way that \(p/n \rightarrow \kappa \) for some constant \(\kappa < 1/2\). (For \(\kappa > 1/2\), \(2\Lambda {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0\) so that the LRT is not interesting in this regime.) We prove that for a class of logistic models, the LLR converges to a rescaled Chi-square, namely, \(2\Lambda ~{\mathop {\rightarrow }\limits ^{\mathrm {d}}}~ \alpha (\kappa ) \chi _k^2\), where the scaling factor \(\alpha (\kappa )\) is greater than one as soon as the dimensionality ratio \(\kappa \) is positive. Hence, the LLR is larger than classically assumed. For instance, when \(\kappa = 0.3\), \(\alpha (\kappa ) \approx 1.5\). In general, we show how to compute the scaling factor by solving a nonlinear system of two equations with two unknowns. Our mathematical arguments are involved and use techniques from approximate message passing theory, from non-asymptotic random matrix theory and from convex geometry. 
We also complement our mathematical study by showing that the new limiting distribution is accurate for finite sample sizes. Finally, all the results from this paper extend to some other regression models such as the probit regression model.
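
As a rough illustration of what the rescaling means in practice, the sketch below compares the classical and the rescaled p-values for a single tested coefficient (k = 1). It assumes the value \(\alpha (\kappa ) \approx 1.5\) quoted above for \(\kappa = 0.3\); the function names are hypothetical and the observed statistic is made up for illustration.

```python
import math

def chi2_1_sf(x: float) -> float:
    """Survival function P(chi^2_1 > x), using P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

def classical_p_value(two_llr: float) -> float:
    """p-value under Wilks' approximation 2*Lambda ~ chi^2_1."""
    return chi2_1_sf(two_llr)

def rescaled_p_value(two_llr: float, alpha: float) -> float:
    """p-value under the rescaled approximation 2*Lambda ~ alpha * chi^2_1."""
    return chi2_1_sf(two_llr / alpha)

if __name__ == "__main__":
    two_llr = 3.84  # illustrative observed value of 2*Lambda (not taken from the paper)
    alpha = 1.5     # approximate alpha(kappa) at kappa = 0.3, as quoted in the abstract
    print("classical:", classical_p_value(two_llr))
    print("rescaled: ", rescaled_p_value(two_llr, alpha))
```

For any fixed statistic and \(\alpha > 1\), the rescaled p-value is larger than the classical one, which matches the statement that the classical approximation yields p-values that are too small.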

Notes

  1. Such conditions would also typically imply asymptotic normality of the MLE.

  2. The separability results in [17, 18] do not imply the control on the norm of the MLE when \(\kappa < 1/2\).

  3. The conjugate \(f^*\) of a function f is defined as \(f^{*} (x) = \sup _{u \in \mathrm {dom}(f)} \{\langle u, x \rangle - f(u) \}\).

  4. Note that the p-values obtained at each trial are not exactly independent. However, they are exchangeable, and weakly dependent (see the proof of Corollary  1 for a formal justification of this fact). Therefore, we expect the goodness of fit test to be an approximately valid procedure in this setting.

  5. Mathematically, the convex geometry and the leave-one-out analyses employed in our proof naturally extend to the case where \(p=o(n)\). It remains to develop a formal AMP theory for the regime where \(p=o(n)\). Alternatively, we note that the AMP theory has been mainly invoked to characterize \(\Vert \hat{\varvec{\beta }}\Vert \), which can also be accomplished via the leave-one-out argument (cf. [25]). This alternative proof strategy can easily extend to the regime \(p=o(n)\).

  6. Recall our earlier footnote about the use of a \(\chi ^2\) test.

  7. When \(\varvec{X}_i\sim \mathcal {N}(\varvec{0},\varvec{\Sigma })\) for a general \(\varvec{\Sigma }\succ \varvec{0}\), one has \(\Vert \varvec{\Sigma }^{1/2}\hat{\varvec{\beta }}\Vert \lesssim {1}/{\epsilon ^{2}}\) with high probability.

References

  1. Agresti, A., Kateri, M.: Categorical Data Analysis. Springer, Berlin (2011)

  2. Alon, N., Spencer, J.H.: The Probabilistic Method, 3rd edn. Wiley, Hoboken (2008)

  3. Amelunxen, D., Lotz, M., McCoy, M.B., Tropp, J.A.: Living on the edge: phase transitions in convex programs with random data. Inf. Inference 3, 224–294 (2014)

  4. Baricz, Á.: Mills’ ratio: monotonicity patterns and functional inequalities. J. Math. Anal. Appl. 340(2), 1362–1370 (2008)

  5. Bartlett, M.S.: Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 160, 268–282 (1937)

  6. Bayati, M., Lelarge, M., Montanari, A., et al.: Universality in polytope phase transitions and message passing algorithms. Ann. Appl. Probab. 25(2), 753–822 (2015)

  7. Bayati, M., Montanari, A.: The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inf. Theory 57(2), 764–785 (2011)

  8. Bayati, M., Montanari, A.: The LASSO risk for Gaussian matrices. IEEE Trans. Inf. Theory 58(4), 1997–2017 (2012)

  9. Bickel, P.J., Ghosh, J.K.: A decomposition for the likelihood ratio statistic and the Bartlett correction—a Bayesian argument. Ann. Stat. 18, 1070–1090 (1990)

  10. Boucheron, S., Massart, P.: A high-dimensional Wilks phenomenon. Probab. Theory Relat. Fields 150(3–4), 405–433 (2011)

  11. Box, G.: A general distribution theory for a class of likelihood criteria. Biometrika 36(3/4), 317–346 (1949)

  12. Candès, E., Fan, Y., Janson, L., Lv, J.: Panning for gold: model-free knockoffs for high-dimensional controlled variable selection (2016). ArXiv preprint arXiv:1610.02351

  13. Chernoff, H.: On the distribution of the likelihood ratio. Ann. Math. Stat. 25, 573–578 (1954)

  14. Cordeiro, G.M.: Improved likelihood ratio statistics for generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 25, 404–413 (1983)

  15. Cordeiro, G.M., Cribari-Neto, F.: An Introduction to Bartlett Correction and Bias Reduction. Springer, New York (2014)

  16. Cordeiro, G.M., Cribari-Neto, F., Aubin, E.C.Q., Ferrari, S.L.P.: Bartlett corrections for one-parameter exponential family models. J. Stat. Comput. Simul. 53(3–4), 211–231 (1995)

  17. Cover, T.M.: Geometrical and statistical properties of linear threshold devices. Ph.D. thesis (1964)

  18. Cover, T.M.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 3, 326–334 (1965)

  19. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Hoboken (2012)

  20. Cribari-Neto, F., Cordeiro, G.M.: On Bartlett and Bartlett-type corrections. Econom. Rev. 15(4), 339–367 (1996)

  21. Deshpande, Y., Montanari, A.: Finding hidden cliques of size \(\sqrt{N/e}\) in nearly linear time. Found. Comput. Math. 15(4), 1069–1128 (2015)

  22. Donoho, D., Montanari, A.: High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probab. Theory Relat. Fields 3, 935–969 (2013)

  23. Donoho, D., Montanari, A.: Variance breakdown of Huber (M)-estimators: \(n/p \rightarrow m \in (1,\infty )\). Technical report (2015)

  24. El Karoui, N.: Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results (2013). ArXiv preprint arXiv:1311.2445

  25. El Karoui, N.: On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields 170, 95–175 (2017)

  26. El Karoui, N., Bean, D., Bickel, P.J., Lim, C., Yu, B.: On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. 110(36), 14557–14562 (2013)

  27. Fan, J., Jiang, J.: Nonparametric inference with generalized likelihood ratio tests. Test 16(3), 409–444 (2007)

  28. Fan, J., Lv, J.: Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inf. Theory 57(8), 5467–5484 (2011)

  29. Fan, J., Zhang, C., Zhang, J.: Generalized likelihood ratio statistics and Wilks phenomenon. Ann. Stat. 29, 153–193 (2001)

  30. Fan, Y., Demirkaya, E., Lv, J.: Nonuniformity of p-values can occur early in diverging dimensions (2017). arXiv:1705.03604

  31. Hager, W.W.: Updating the inverse of a matrix. SIAM Rev. 31(2), 221–239 (1989)

  32. Hanson, D.L., Wright, F.T.: A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Stat. 42(3), 1079–1083 (1971)

  33. He, X., Shao, Q.-M.: On parameters of increasing dimensions. J. Multivar. Anal. 73(1), 120–135 (2000)

  34. Hsu, D., Kakade, S., Zhang, T.: A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 17(52), 1–6 (2012)

  35. Huber, P.J.: Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Stat. 1, 799–821 (1973)

  36. Huber, P.J.: Robust Statistics. Springer, Berlin (2011)

  37. Janková, J., Van De Geer, S.: Confidence regions for high-dimensional generalized linear models under sparsity (2016). ArXiv preprint arXiv:1610.01353

  38. Javanmard, A., Montanari, A.: State evolution for general approximate message passing algorithms, with applications to spatial coupling. Inf. Inference 2, 115–144 (2013)

  39. Javanmard, A., Montanari, A.: De-biasing the lasso: optimal sample size for Gaussian designs (2015). ArXiv preprint arXiv:1508.02757

  40. Lawley, D.N.: A general method for approximating to the distribution of likelihood ratio criteria. Biometrika 43(3/4), 295–303 (1956)

  41. Lehmann, E.L., Romano, J.P.: Testing Statistical Hypotheses. Springer, Berlin (2006)

  42. Liang, H., Pang, D., et al.: Maximum likelihood estimation in logistic regression models with a diverging number of covariates. Electron. J. Stat. 6, 1838–1846 (2012)

  43. Mammen, E.: Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Ann. Stat. 17, 382–400 (1989)

  44. McCullagh, P., Nelder, J.A.: Generalized Linear Models. Monograph on Statistics and Applied Probability. Chapman & Hall, London (1989)

  45. Moulton, L.H., Weissfeld, L.A., Laurent, R.T.S.: Bartlett correction factors in logistic regression models. Comput. Stat. Data Anal. 15(1), 1–11 (1993)

  46. Oymak, S., Tropp, J.A.: Universality laws for randomized dimension reduction, with applications. Inf. Inference J. IMA 7, 337–446 (2015)

  47. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)

  48. Portnoy, S.: Asymptotic behavior of M-estimators of \(p\) regression parameters when \(p^2/n\) is large. I. Consistency. Ann. Stat. 12, 1298–1309 (1984)

  49. Portnoy, S.: Asymptotic behavior of M-estimators of \(p\) regression parameters when \(p^2/n\) is large; II. Normal approximation. Ann. Stat. 13, 1403–1417 (1985)

  50. Portnoy, S.: Asymptotic behavior of the empiric distribution of m-estimated residuals from a regression model with many parameters. Ann. Stat. 14, 1152–1170 (1986)

  51. Portnoy, S., et al.: Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Stat. 16(1), 356–366 (1988)

  52. Rudelson, M., Vershynin, R., et al.: Hanson-Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab. 18(82), 1–9 (2013)

  53. Sampford, M.R.: Some inequalities on Mill’s ratio and related functions. Ann. Math. Stat. 24(1), 130–132 (1953)

  54. Spokoiny, V.: Penalized maximum likelihood estimation and effective dimension (2012). ArXiv preprint arXiv:1205.0498

  55. Su, W., Bogdan, M., Candes, E.: False discoveries occur early on the Lasso path. Ann. Stat. 45, 2133–2150 (2015)

  56. Sur, P., Candès, E.J.: Additional supplementary materials for: a modern maximum-likelihood theory for high-dimensional logistic regression. https://statweb.stanford.edu/~candes/papers/proofs_LogisticAMP.pdf (2018)

  57. Sur, P., Candès, E.J.: A modern maximum-likelihood theory for high-dimensional logistic regression (2018). ArXiv preprint arXiv:1803.06964

  58. Sur, P., Chen, Y., Candès, E.: Supplemental materials for “the likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square”. http://statweb.stanford.edu/~candes/papers/supplement_LRT.pdf (2017)

  59. Tang, C.Y., Leng, C.: Penalized high-dimensional empirical likelihood. Biometrika 97, 905–919 (2010)

  60. Tao, T.: Topics in Random Matrix Theory, vol. 132. American Mathematical Society, Providence (2012)

  61. Thrampoulidis, C., Abbasi, E., Hassibi, B.: Precise error analysis of regularized m-estimators in high-dimensions (2016). ArXiv preprint arXiv:1601.06233

  62. Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R., et al.: On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat. 42(3), 1166–1202 (2014)

  63. Van de Geer, S.A., et al.: High-dimensional generalized linear models and the lasso. Ann. Stat. 36(2), 614–645 (2008)

  64. Van der Vaart, A.W.: Asymptotic Statistics, vol. 3. Cambridge University Press, Cambridge (2000)

  65. Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. In: Compressed Sensing: Theory and Applications, pp. 210–268 (2012)

  66. Wilks, S.S.: The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 9(1), 60–62 (1938)

  67. Yan, T., Li, Y., Xu, J., Yang, Y., Zhu, J.: High-dimensional Wilks phenomena in some exponential random graph models (2012). ArXiv preprint arXiv:1201.0058

Acknowledgements

E. C. was partially supported by the Office of Naval Research under grant N00014-16-1-2712, and by the Math + X Award from the Simons Foundation. P. S. was partially supported by the Ric Weiland Graduate Fellowship in the School of Humanities and Sciences, Stanford University. Y. C. is supported in part by the AFOSR YIP award FA9550-19-1-0030, by the ARO grant W911NF-18-1-0303, and by the Princeton SEAS innovation award. P. S. and Y. C. are grateful to Andrea Montanari for his help in understanding AMP and [22]. Y. C. thanks Kaizheng Wang and Cong Ma for helpful discussion about [25], and P. S. thanks Subhabrata Sen for several helpful discussions regarding this project. E. C. would like to thank Iain Johnstone for a helpful discussion as well.

Author information

Corresponding author

Correspondence to Pragya Sur.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 490 KB)

Appendices

Proofs for eigenvalue bounds

1.1 Proof of Lemma 3

Fix \(\epsilon \ge 0\) sufficiently small. For any given \(S\subseteq [n]\) obeying \(|S|=(1-\epsilon )n\) and \(0 \le t \le \sqrt{1-\epsilon } - \sqrt{p/n}\) it follows from [65, Corollary 5.35] that

$$\begin{aligned} \lambda _{\min }\left( \frac{1}{n}\sum _{i\in S}\varvec{X}_{i}\varvec{X}_{i}^{\top }\right) < \frac{1}{n}\left( \sqrt{|S|}-\sqrt{p}-t\sqrt{n}\right) ^{2}=\left( \sqrt{1-\epsilon }-\sqrt{\frac{p}{n}}-t\right) ^{2} \end{aligned}$$

holds with probability at most \(2\exp \left( -\frac{t^{2}|S|}{2}\right) =2\exp \left( -\frac{\left( 1-\epsilon \right) t^{2}n}{2}\right) \). Taking the union bound over all possible subsets S of size \((1-\epsilon )n\) gives

$$\begin{aligned}&{\mathbb {P}}\left\{ \exists S\subseteq [n]\text { with }|S|=(1-\epsilon )n\quad \text {s.t.}\quad \frac{1}{n}\lambda _{\min }\left( \sum _{i\in S}\varvec{X}_{i}\varvec{X}_{i}^{\top }\right) \right. \\&\quad <\left. \left( \sqrt{1-\epsilon }-\sqrt{\frac{p}{n}}-t\right) ^{2}\right\} \\&\quad \le ~{n \atopwithdelims ()(1-\epsilon )n}2\exp \left( -\frac{\left( 1-\epsilon \right) t^{2}n}{2}\right) \\&\quad \le ~2\exp \left( n H\left( \epsilon \right) - \frac{\left( 1-\epsilon \right) t^{2}}{2}n\right) , \end{aligned}$$

where the last line is a consequence of the inequality \({n \atopwithdelims ()(1-\epsilon )n}\le e^{n H(\epsilon )}\) [19, Example 11.1.3].
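
The counting step above can be checked numerically. The following is a minimal sketch, assuming natural logarithms in \(H\) and illustrative values of n, \(\epsilon \) and t chosen here (they do not come from the paper); it compares the exact number of subsets with \(e^{nH(\epsilon )}\) and evaluates the logarithm of the final union bound.

```python
import math

def binary_entropy(x: float) -> float:
    """H(x) = -x log x - (1 - x) log(1 - x) with natural logarithms and H(0) = H(1) = 0."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)

def log_union_bound(n: int, eps: float, t: float) -> float:
    """Logarithm of 2 * exp(n H(eps) - (1 - eps) t^2 n / 2)."""
    return math.log(2.0) + n * binary_entropy(eps) - (1.0 - eps) * t * t * n / 2.0

if __name__ == "__main__":
    n, eps = 200, 0.05                       # illustrative values only
    k = round((1.0 - eps) * n)               # subset size (1 - eps) n
    log_count = math.log(math.comb(n, k))    # exact log of the number of subsets
    print("log C(n, k) =", log_count, " n H(eps) =", n * binary_entropy(eps))
    print("log of union bound at t = 0.8:", log_union_bound(n, eps, t=0.8))
```

The bound is only informative when \((1-\epsilon )t^{2}/2 > H(\epsilon )\), which is why t is later tied to \(H(\epsilon )\).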

1.2 Proof of Lemma 4

Define

$$\begin{aligned} S_{B}\left( \varvec{\beta }\right) :=\left\{ i:\text { }|\varvec{X}_{i}^{\top }\varvec{\beta }|\le B\Vert \varvec{\beta }\Vert \right\} \end{aligned}$$

for any \(B>0\) and any \(\varvec{\beta }\). Then

$$\begin{aligned} \sum _{i=1}^{n}\rho ''\left( \varvec{X}_{i}^{\top }\varvec{\beta }\right) \varvec{X}_{i}\varvec{X}_{i}^{\top }\succeq & {} \sum _{i\in S_{B}\left( \varvec{\beta }\right) }\rho ''\left( \varvec{X}_{i}^{\top }\varvec{\beta }\right) \varvec{X}_{i}\varvec{X}_{i}^{\top }\succeq \inf _{z:|z|\le B\Vert \varvec{\beta }\Vert }\rho ''\left( z\right) \sum _{i\in S_{B}\left( \varvec{\beta }\right) }\varvec{X}_{i}\varvec{X}_{i}^{\top }. \end{aligned}$$

If one also has \(|S_{B}\left( \varvec{\beta }\right) |\ge (1-\epsilon )n\) (for \(\epsilon \ge 0\) sufficiently small), then this together with Lemma 3 implies that

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\rho ''\left( \varvec{X}_{i}^{\top }\varvec{\beta }\right) \varvec{X}_{i}\varvec{X}_{i}^{\top }\succeq \inf _{z:|z|\le B\Vert \varvec{\beta }\Vert }\rho ''\left( z\right) \left( \sqrt{1-\epsilon }-\sqrt{\frac{p}{n}}-t\right) ^{2}\varvec{I} \end{aligned}$$

with probability at least \(1-2\exp \left( -\left( \frac{\left( 1-\epsilon \right) t^{2}}{2} - H\left( \epsilon \right) \right) n\right) \).

Thus if we can ensure that with high probability, \(|S_{B}\left( \varvec{\beta }\right) |\ge (1-\epsilon )n\) holds simultaneously for all \(\varvec{\beta }\), then we are done. From Lemma 2 we see that \( \frac{1}{n}\left\| \varvec{X}^{\top }\varvec{X}\right\| \le 9 \) with probability exceeding \(1-2\exp \left( -n/2\right) \). On this event,

$$\begin{aligned} \left\| \varvec{X}\varvec{\beta }\right\| ^{2}\le 9n\Vert \varvec{\beta }\Vert ^{2}, \qquad \forall \varvec{\beta }. \end{aligned}$$
(114)

On the other hand, the definition of \(S_B(\varvec{\beta })\) gives

$$\begin{aligned} \left\| \varvec{X}\varvec{\beta }\right\| ^{2}\ge \sum _{i\notin S_{B}(\varvec{\beta })}\left| \varvec{X}_{i}^{\top }\varvec{\beta }\right| ^{2} \ge \big (n-\left| S_{B}(\varvec{\beta })\right| \big )\left( B\Vert \varvec{\beta }\Vert \right) ^{2}=n\left( 1-\frac{\left| S_{B}(\varvec{\beta })\right| }{n}\right) B^{2}\Vert \varvec{\beta }\Vert ^2.\nonumber \\ \end{aligned}$$
(115)

Taken together, (114) and (115) yield

$$\begin{aligned} \left| S_{B}(\varvec{\beta })\right| \ge \left( 1-\frac{9}{B^{2}}\right) n, \qquad \forall \varvec{\beta }\end{aligned}$$

with probability at least \(1-2\exp (-n/2)\). Therefore, with probability \(1-2\exp (-n/2)\), \( \left| S_{3/\sqrt{\epsilon }}(\varvec{\beta })\right| \ge \left( 1-\epsilon \right) n \) holds simultaneously for all \(\varvec{\beta }\). Putting the above results together and setting \(t = 2\sqrt{\frac{H(\epsilon )}{1-\epsilon }}\) give

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\rho ''\left( \varvec{X}_{i}^{\top }\varvec{\beta }\right) \varvec{X}_{i}\varvec{X}_{i}^{\top }\succeq \inf _{z:|z|\le \frac{3\Vert \varvec{\beta }\Vert }{\sqrt{\epsilon }}}\rho ''\left( z\right) \left( \sqrt{1-\epsilon }-\sqrt{\frac{p}{n}}- 2\sqrt{\frac{H(\epsilon )}{1-\epsilon }}\right) ^{2}\varvec{I} \end{aligned}$$

simultaneously for all \(\varvec{\beta }\) with probability at least \(1-2\exp \left( -nH\left( \epsilon \right) \right) -2\exp \left( -{n}/2\right) \).
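
To get a feel for the size of the resulting curvature constant, the sketch below evaluates the scalar lower bound on the smallest eigenvalue of \(\frac{1}{n}\sum _{i}\rho ''(\varvec{X}_{i}^{\top }\varvec{\beta })\varvec{X}_{i}\varvec{X}_{i}^{\top }\) for the logistic link \(\rho (z)=\log (1+e^{z})\). It is an illustration under assumptions made here: the helper names are hypothetical, \(p/n\) is replaced by a fixed ratio kappa, and the infimum of \(\rho ''\) over \(|z|\le R\) is taken at \(|z|=R\) because the logistic \(\rho ''\) is even and decreasing in \(|z|\).

```python
import math

def binary_entropy(x: float) -> float:
    """H(x) = -x log x - (1 - x) log(1 - x) with natural logarithms."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)

def logistic_rho_second(z: float) -> float:
    """rho''(z) = e^{-|z|} / (1 + e^{-|z|})^2 for rho(z) = log(1 + e^z)."""
    e = math.exp(-abs(z))
    return e / (1.0 + e) ** 2

def curvature_lower_bound(beta_norm: float, kappa: float, eps: float) -> float:
    """inf_{|z| <= 3 ||beta|| / sqrt(eps)} rho''(z) * (sqrt(1-eps) - sqrt(kappa) - t)^2
    with t = 2 sqrt(H(eps) / (1 - eps))."""
    t = 2.0 * math.sqrt(binary_entropy(eps) / (1.0 - eps))
    slack = math.sqrt(1.0 - eps) - math.sqrt(kappa) - t
    if slack < 0.0:
        # Lemma 3 requires 0 <= t <= sqrt(1 - eps) - sqrt(p / n)
        raise ValueError("eps is too large for this kappa")
    radius = 3.0 * beta_norm / math.sqrt(eps)
    return logistic_rho_second(radius) * slack ** 2

if __name__ == "__main__":
    print(curvature_lower_bound(beta_norm=1.0, kappa=0.1, eps=1e-3))
```

The constant is strictly positive but can be extremely small, since \(\rho ''\) is evaluated at the large radius \(3\Vert \varvec{\beta }\Vert /\sqrt{\epsilon }\); the proof only needs positivity.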

Proof of Lemma 5

Applying an integration by parts leads to

$$\begin{aligned} {\mathbb {E}}\left[ \Psi '(\tau Z;b)\right]= & {} {\displaystyle \int }_{-\infty }^{\infty }\Psi '(\tau z;b)\phi (z)\mathrm {d}z=\frac{1}{\tau } \Psi (\tau z;b)\phi (z)\Big |_{-\infty }^{\infty }\\&-\frac{1}{\tau }{\displaystyle \int }_{-\infty }^{\infty }\Psi (\tau z;b)\phi '(z)\mathrm {d}z\\= & {} - \frac{1}{\tau } {\displaystyle \int }_{-\infty }^{\infty }\Psi (\tau z;b)\phi '(z)\mathrm {d}z \end{aligned}$$

with \(\phi (z)=\frac{1}{\sqrt{2\pi }}\exp (-z^{2}/2)\). This reveals that

$$\begin{aligned} G'(b)= & {} -\frac{1}{\tau }{\displaystyle \int }_{-\infty }^{\infty }\frac{\partial \Psi (\tau z;b)}{\partial b}\phi '(z)\mathrm {d}z=-\frac{1}{\tau }{\displaystyle \int }_{-\infty }^{\infty }\frac{\rho '\left( \mathsf {prox}_{b\rho }(\tau z)\right) }{1+b\rho ''\left( \mathsf {prox}_{b\rho }(\tau z)\right) }\phi '(z)\mathrm {d}z\nonumber \\= & {} \frac{1}{\tau }\int _{0}^{\infty }\left( \frac{\rho '\left( \mathrm {prox}_{b\rho }(-\tau z)\right) }{1+b\rho ''\left( \mathrm {prox}_{b\rho }(-\tau z)\right) }-\frac{\rho '\left( \mathrm {prox}_{b\rho }(\tau z)\right) }{1+b\rho ''\left( \mathrm {prox}_{b\rho }(\tau z)\right) }\right) \phi '(z)\mathrm {d}z,\nonumber \\ \end{aligned}$$
(116)

where the second identity comes from [22, Proposition 6.4], and the last identity holds since \(\phi '(z)=-\phi '(-z)\).

Next, we claim that

  (a) The function \(h\left( z\right) :=\frac{\rho '\left( z\right) }{1+b\rho ''\left( z\right) }\) is increasing in z;

  (b) \(\mathrm {prox}_{b\rho }(z)\) is increasing in z.

These two claims imply that

$$\begin{aligned} \frac{\rho '\left( \mathrm {prox}_{b\rho }(-\tau z)\right) }{1+b\rho ''\left( \mathrm {prox}_{b\rho }(-\tau z)\right) }-\frac{\rho '\left( \mathrm {prox}_{b\rho }(\tau z)\right) }{1+b\rho ''\left( \mathrm {prox}_{b\rho }(\tau z)\right) }<0,\quad \forall z>0, \end{aligned}$$

which combined with the fact \(\phi '(z)<0\) for \(z>0\) reveals

$$\begin{aligned} \mathrm {sign}\left( \left( \frac{\rho '\left( \mathrm {prox}_{b\rho }(-\tau z)\right) }{1+b\rho ''\left( \mathrm {prox}_{b\rho }(-\tau z)\right) }-\frac{\rho '\left( \mathrm {prox}_{b\rho }(\tau z)\right) }{1+b\rho ''\left( \mathrm {prox}_{b\rho }(\tau z)\right) }\right) \phi '(z)\right) =1,\quad \forall z>0. \end{aligned}$$

In other words, the integrand in (116) is positive, which allows one to conclude that \(G'(b)>0\).

We then move on to justify (a) and (b). For the first, the derivative of h is given by

$$\begin{aligned} h'(z)= \frac{\rho ''(z)+b(\rho ''(z))^2 - b\rho '(z)\rho '''(z) }{\left( 1+b\rho ''(z) \right) ^2}. \end{aligned}$$

Since \(\rho '\) is log concave, this directly yields \((\rho '')^2 - \rho ' \rho ''' > 0\). As \(\rho '' > 0\) and \(b \ge 0\), the above implies \(h'(z) > 0\) for all z.

The second claim follows from \(\frac{\partial \mathrm {prox}_{b\rho }(z)}{\partial z}\ge \frac{1}{1+b\Vert \rho ''\Vert _{\infty }}>0\) (cf. [22, Equation (56)]).

It remains to analyze the behavior of G in the limits when \(b \rightarrow 0\) and \(b \rightarrow \infty \). From [22, Proposition 6.4], G(b) can also be expressed as

$$\begin{aligned} G(b) = 1 - {{\,\mathrm{{\mathbb {E}}}\,}}\left[ \frac{1}{1+b\rho ''(\mathsf {prox}_{b \rho }(\tau Z))} \right] . \end{aligned}$$

Since \(\rho ''\) is bounded and the integrand is at most 1, the dominated convergence theorem gives

$$\begin{aligned} \lim _{b \rightarrow 0}G(b) =0. \end{aligned}$$

When \(b \rightarrow \infty \), \(b\rho ''(\mathsf {prox}_{b \rho }(\tau z)) \rightarrow \infty \) for a fixed z. Again by applying the dominated convergence theorem,

$$\begin{aligned} \lim _{b \rightarrow \infty }G(b) =1. \end{aligned}$$

Since \(0<\kappa <1\), it follows that \(\lim _{b \rightarrow 0}G(b)< \kappa < \lim _{b \rightarrow \infty }G(b)\). As G is continuous and, by the first part of the proof, strictly increasing, \(G(b) = \kappa \) has a unique positive solution.
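
The existence and uniqueness argument suggests a simple numerical recipe: evaluate G and bisect. The following is a minimal sketch for the logistic link \(\rho (x)=\log (1+e^{x})\), not the procedure used in the paper. The function names, the trapezoid quadrature on \([-8,8]\) and the bisection tolerances are all assumptions made here for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function, i.e. rho'(x) for rho(x) = log(1 + e^x)."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def prox(z: float, b: float, iters: int = 60) -> float:
    """prox_{b rho}(z): the unique x with b * rho'(x) + x = z, found by bisection."""
    lo, hi = z - b, z  # since 0 < rho' < 1, the root lies in [z - b, z]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if b * sigmoid(mid) + mid > z:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def G(b: float, tau: float, half_width: float = 8.0, m: int = 400) -> float:
    """G(b) = 1 - E[1 / (1 + b rho''(prox_{b rho}(tau Z)))], Z ~ N(0, 1), by the trapezoid rule."""
    h = 2.0 * half_width / m
    total = 0.0
    for j in range(m + 1):
        z = -half_width + j * h
        s = sigmoid(prox(tau * z, b))
        weight = 0.5 if j in (0, m) else 1.0
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * phi / (1.0 + b * s * (1.0 - s))
    return 1.0 - h * total

def solve_b(kappa: float, tau: float, tol: float = 1e-8) -> float:
    """Solve G(b) = kappa for b > 0 by bracketing and bisection (G is increasing in b)."""
    lo, hi = 0.0, 1.0
    while G(hi, tau) < kappa:  # grow the bracket until it contains the root
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if G(mid, tau) < kappa:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    b = solve_b(kappa=0.3, tau=1.0)  # illustrative kappa and tau
    print("b =", b, " G(b) =", G(b, 1.0))
```

Bisection is used for both the proximal map and the outer equation because each relies only on monotonicity, which is exactly what claims (a) and (b) and the sign of \(G'\) provide.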

Remark 3

Finally, we show that the logistic and the probit effective links obey the assumptions of Lemma 5. We work with a fixed \(\tau >0\).

  • A direct computation shows that \(\rho '\) is log-concave for the logistic model. For the probit, it is well-known that the reciprocal of the hazard function (also known as Mills’ ratio) is strictly log-convex [4].

  • To check the other condition, recall that the proximal mapping operator satisfies

    $$\begin{aligned} b\rho '(\mathsf {prox}_{b \rho }(\tau z))+\mathsf {prox}_{b \rho }(\tau z) = \tau z. \end{aligned}$$
    (117)

    For a fixed z, we claim that \(\mathsf {prox}_{b \rho }(\tau z) \rightarrow -\infty \) as \(b \rightarrow \infty \). To prove this claim, suppose that it is not true. Then \(\mathsf {prox}_{b \rho }(\tau z)\) either remains bounded or diverges to \(\infty \). If it remains bounded, then \(\rho '(\mathsf {prox}_{b \rho }(\tau z))\) stays bounded away from zero, so the left-hand side of (117) diverges to \(\infty \) while the right-hand side is fixed, which is a contradiction. Similarly, if \(\mathsf {prox}_{b \rho }(\tau z)\) diverges to \(\infty \), the left-hand side of (117) diverges to \(\infty \) while the right-hand side is fixed, which is again impossible. Consequently, when \(b \rightarrow \infty \), we must have \(\mathsf {prox}_{b \rho }(\tau z) \rightarrow -\infty \) and \(b \rho '(\mathsf {prox}_{b \rho }(\tau z)) \rightarrow \infty \) in such a way that their sum equals \(\tau z\). Observe that for the logistic, \(\rho ''(x)=\rho '(x)(1-\rho '(x))\) and for the probit, \(\rho ''(x) = \rho '(x)(\rho '(x)-x)\) [53]. Hence, combining the asymptotic behavior of \(\mathsf {prox}_{b \rho }(\tau z)\) and \(b \rho '(\mathsf {prox}_{b \rho }(\tau z)) \), we obtain that \(b\rho ''(\mathsf {prox}_{b \rho }(\tau z))\) diverges to \(\infty \) in both models when \(b \rightarrow \infty \).
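
The two identities for \(\rho ''\) and the log-concavity of \(\rho '\) can be sanity-checked by finite differences. The sketch below is only a numerical illustration at a few points, not a proof. It assumes the probit effective link is \(\rho (x)=-\log \Phi (-x)\), so that \(\rho '\) is the Gaussian hazard function, which is consistent with the identity quoted from [53]; the helper names and step sizes are choices made here.

```python
import math

def logistic_rho_prime(x: float) -> float:
    """rho'(x) for the logistic link rho(x) = log(1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))

def probit_rho_prime(x: float) -> float:
    """rho'(x) = phi(x) / Phi(-x), assuming the probit link rho(x) = -log Phi(-x)."""
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return phi / (0.5 * math.erfc(x / math.sqrt(2.0)))  # 0.5 * erfc(x / sqrt 2) = Phi(-x)

def derivative(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_log_derivative(f, x: float, h: float = 1e-3) -> float:
    """Central finite-difference approximation of (log f)''(x)."""
    return (math.log(f(x + h)) - 2.0 * math.log(f(x)) + math.log(f(x - h))) / (h * h)

if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        s, lam = logistic_rho_prime(x), probit_rho_prime(x)
        print(x,
              derivative(logistic_rho_prime, x) - s * (1.0 - s),   # should be close to 0
              derivative(probit_rho_prime, x) - lam * (lam - x),   # should be close to 0
              second_log_derivative(logistic_rho_prime, x),        # negative if log-concave
              second_log_derivative(probit_rho_prime, x))          # negative if log-concave
```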

Proof of Lemma 6

1.1 Proof of Part (i)

Recall from [22, Proposition 6.4] that

$$\begin{aligned} \kappa= & {} {\mathbb {E}}\left[ \Psi '\left( \tau Z;\text { }b({\tau }) \right) \right] =1-{\mathbb {E}}\left[ \frac{1}{1+b({\tau })\rho ''\big (\mathsf {prox}_{b({\tau }) \rho }\left( \tau Z\right) \big )}\right] . \end{aligned}$$
(118)

Setting \(\tau = 0\) in (118) and denoting \(c:= \mathsf {prox}_{b(0)\rho }(0)\), we see that b(0) is given by the following relation:

$$\begin{aligned} 1- \kappa = \frac{1}{1+b(0) \rho ''(c)} \quad \implies \quad b(0) = \frac{\kappa }{\rho ''(c)(1-\kappa )} > 0 \end{aligned}$$

as \(\rho ''(c)>0\) for any given c. In addition, since \(\rho '(c) > 0\), we have

$$\begin{aligned} \mathcal {V}(0) = \frac{\Psi (0,b(0))^2}{\kappa } ~\overset{(\text {a})}{=}~ \frac{b(0)^2\rho '(c)^2}{\kappa } > 0, \end{aligned}$$

where (a) comes from (22).
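
For the logistic link these quantities can be computed explicitly, which gives a quick consistency check. The sketch below is an illustration under assumptions made here: it uses the proximal relation \(b\rho '(c)+c=0\) at the point 0, combines it with the formula for b(0) and the logistic identity \(\rho '/\rho '' = 1+e^{c}\), and solves the resulting scalar equation \(\frac{\kappa }{1-\kappa }(1+e^{c})+c=0\) by bisection. The function name and the value of \(\kappa \) are hypothetical.

```python
import math

def solve_tau_zero(kappa: float, iters: int = 80):
    """Return (c, b0, v0) for the logistic link at tau = 0."""
    r = kappa / (1.0 - kappa)
    lo, hi = -2.0 * r, 0.0  # g(lo) <= 0 < g(hi) for g(c) = r * (1 + e^c) + c
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if r * (1.0 + math.exp(mid)) + mid > 0.0:
            hi = mid
        else:
            lo = mid
    c = 0.5 * (lo + hi)
    s = 1.0 / (1.0 + math.exp(-c))  # rho'(c)
    b0 = r / (s * (1.0 - s))        # b(0) = kappa / (rho''(c) (1 - kappa))
    v0 = (b0 * s) ** 2 / kappa      # V(0) = b(0)^2 rho'(c)^2 / kappa
    return c, b0, v0

if __name__ == "__main__":
    c, b0, v0 = solve_tau_zero(kappa=0.3)  # illustrative kappa
    print("c =", c, " b(0) =", b0, " V(0) =", v0)
```

In this logistic computation the root c is negative, and \(\mathcal {V}(0)\) reduces to \(c^{2}/\kappa \) because \(b(0)\rho '(c)=-c\).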

1.2 Proof of Part (ii)

We defer the proof of this part to the supplemental materials [58].

Proof of Part (ii) of Theorem 4

As discussed in Sect. 5.2.2, it suffices to (1) construct a set \(\left\{ \mathcal {B}_{i}\mid 1\le i\le N\right\} \) that forms a cover of the cone \(\mathcal {A}\) defined in (52), and (2) upper bound \({\mathbb {P}}\{ \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \cap \mathcal {B}_{i}\ne \{ \varvec{0}\} \} \). In what follows, we elaborate on these two steps.

  • Step 1. Generate \(N=\exp \left( 2\epsilon ^{2}p\right) \) i.i.d. points \(\varvec{z}^{(i)}\sim \mathcal {N}(\varvec{0},\frac{1}{p}\varvec{I}_{p})\), \(1\le i\le N\), and construct a collection of convex cones

    $$\begin{aligned} \mathcal {C}_{i}:=\left\{ \varvec{u}\in {\mathbb {R}}^{p}\left| \left\langle \varvec{u},\frac{\varvec{z}^{(i)}}{\Vert \varvec{z}^{(i)}\Vert }\right\rangle \ge \epsilon \Vert \varvec{u}\Vert \right. \right\} ,\qquad 1\le i\le N. \end{aligned}$$

    In words, \(\mathcal {C}_{i}\) consists of all directions that have nontrivial positive correlation with \(\varvec{z}^{(i)}\). With high probability, this collection \(\left\{ \mathcal {C}_{i}\mid 1\le i\le N\right\} \) forms a cover of \({\mathbb {R}}^{p}\), a fact which is an immediate consequence of the following lemma.

Lemma 12

Consider any given constant \(0<\epsilon <1\), and let \(N=\exp \left( 2\epsilon ^{2}p\right) \). Then there exist some positive universal constants \(c_{5},C_{5}>0\) such that with probability exceeding \(1-C_{5}\exp \left( -c_{5}\epsilon ^{2}p\right) \),

$$\begin{aligned} \sum _{i=1}^{N}\varvec{1}_{\left\{ \left\langle \varvec{x},\varvec{z}^{(i)}\right\rangle \ge \epsilon \Vert \varvec{x}\Vert \Vert \varvec{z}^{(i)}\Vert \right\} }\ge 1 \end{aligned}$$

holds simultaneously for all \(\varvec{x} \in {\mathbb {R}}^p\).
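
The covering statement of Lemma 12 can be probed by simulation. The following sketch draws \(N=\exp (2\epsilon ^{2}p)\) random directions and counts how many randomly chosen test vectors are captured by at least one cone. It tests only finitely many vectors, so it illustrates rather than verifies the simultaneous statement; the dimension, \(\epsilon \), seed and helper names are arbitrary choices made here.

```python
import math
import random

def random_directions(num: int, p: int, rng: random.Random):
    """Draw `num` i.i.d. points z ~ N(0, I_p / p) and normalize each to unit length."""
    dirs = []
    for _ in range(num):
        z = [rng.gauss(0.0, 1.0 / math.sqrt(p)) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in z))
        dirs.append([v / norm for v in z])
    return dirs

def is_covered(x, dirs, eps: float) -> bool:
    """True if <x, z / ||z||> >= eps * ||x|| for at least one direction z."""
    x_norm = math.sqrt(sum(v * v for v in x))
    return any(sum(a * b for a, b in zip(x, d)) >= eps * x_norm for d in dirs)

if __name__ == "__main__":
    rng = random.Random(0)
    p, eps = 30, 0.3                               # illustrative values only
    num = math.ceil(math.exp(2.0 * eps ** 2 * p))  # N = exp(2 eps^2 p), rounded up
    dirs = random_directions(num, p, rng)
    tests = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(200)]
    covered = sum(is_covered(x, dirs, eps) for x in tests)
    print("N =", num, " fraction of test vectors covered =", covered / len(tests))
```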

With our family \(\left\{ \mathcal {C}_{i}\mid 1\le i\le N\right\} \) we can introduce

$$\begin{aligned} \mathcal {B}_{i}:=\mathcal {C}_{i}\cap \left\{ \varvec{u}\in {\mathbb {R}}^{n}\mid \sum _{j=1}^{n}\max \left\{ -u_{j},0\right\} \le \epsilon \sqrt{n}\left\langle \varvec{u},\frac{\varvec{z}^{(i)}}{\Vert \varvec{z}^{(i)}\Vert }\right\rangle \right\} ,\quad 1\le i\le N,\qquad \quad \end{aligned}$$
(119)

which in turn forms a cover of the nonconvex cone \(\mathcal {A}\) defined in (52). To justify this, note that for any \(\varvec{u}\in \mathcal {A}\), one can find \(i\in \{1,\ldots ,N\}\) obeying \(\varvec{u}\in \mathcal {C}_{i}\), or equivalently, \(\left\langle \varvec{u},\frac{\varvec{z}^{(i)}}{\Vert \varvec{z}^{(i)}\Vert }\right\rangle \ge \epsilon \Vert \varvec{u}\Vert \), with high probability. Combined with the membership to \(\mathcal {A}\) this gives

$$\begin{aligned} \sum _{j=1}^{n}\max \left\{ -u_{j},0\right\} \le \epsilon ^{2}\sqrt{n}\Vert \varvec{u}\Vert \le \epsilon \sqrt{n}\left\langle \varvec{u},\frac{\varvec{z}^{(i)}}{\Vert \varvec{z}^{(i)}\Vert }\right\rangle , \end{aligned}$$

indicating that \(\varvec{u}\) is contained within some \(\mathcal {B}_{i}\).

  • Step 2. We now move on to control \({\mathbb {P}}\left\{ \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \cap \mathcal {B}_{i}\ne \left\{ \varvec{0}\right\} \right\} \). If the statistical dimensions of the two cones obey \(\delta \left( \mathcal {B}_{i}\right) <n-\delta \left( \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \right) =n-p\), then an application of [3, Theorem I] gives

    $$\begin{aligned}&{\mathbb {P}}\left\{ \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \cap \mathcal {B}_{i}\ne \left\{ \varvec{0}\right\} \right\} \nonumber \\&\quad \le 4\exp \left\{ -\frac{1}{8}\left( \frac{n-\delta \left( \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \right) -\delta \left( \mathcal {B}_{i}\right) }{\sqrt{n}}\right) ^{2}\right\} \nonumber \\&\quad \le 4\exp \left\{ -\frac{\left( n-p-\delta (\mathcal {B}_{i})\right) ^{2}}{8n}\right\} . \end{aligned}$$
    (120)

    It then comes down to upper bounding \(\delta (\mathcal {B}_{i})\), which is the content of the following lemma.

Lemma 13

Fix \(\epsilon >0\). When n is sufficiently large, the statistical dimension of the convex cone \(\mathcal {B}_{i}\) defined in (119) obeys

$$\begin{aligned} \delta (\mathcal {B}_{i})\le & {} \left( \frac{1}{2}+2\sqrt{2}\epsilon ^{\frac{3}{4}}+10H(2\sqrt{\epsilon })\right) n, \end{aligned}$$
(121)

where \(H(x):=-x\log x-(1-x)\log (1-x)\).

Substitution into (120) gives

$$\begin{aligned}&{\mathbb {P}}\left\{ \left\{ \varvec{X}\varvec{\beta }\mid \varvec{\beta }\in {\mathbb {R}}^{p}\right\} \cap \mathcal {B}_{i}\ne \left\{ \varvec{0}\right\} \right\} \nonumber \\&\quad \le 4\exp \left\{ -\frac{\left( \left( \frac{1}{2}-2\sqrt{2}\epsilon ^{\frac{3}{4}}-10H(2\sqrt{\epsilon })\right) n-p\right) ^{2}}{8n}\right\} \nonumber \\&\quad = 4\exp \left\{ -\frac{1}{8}\left( \frac{1}{2}-2\sqrt{2}\epsilon ^{\frac{3}{4}}-10H(2\sqrt{\epsilon })-\frac{p}{n}\right) ^{2}n\right\} . \end{aligned}$$
(122)
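To get a feel for when (122) is informative, the sketch below evaluates the constant \(\frac{1}{2}-2\sqrt{2}\epsilon ^{3/4}-10H(2\sqrt{\epsilon })\) and the resulting right-hand side of (122). It assumes that the logarithm in \(H\) is natural and replaces \(p/n\) by a fixed ratio `kappa`; the function names are illustrative only.

```python
import math

def binary_entropy(x):
    """H(x) = -x log x - (1 - x) log(1 - x) for 0 < x < 1 (natural log assumed)."""
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def dimension_margin(eps):
    """The constant 1/2 - 2 sqrt(2) eps^(3/4) - 10 H(2 sqrt(eps)) appearing in (122)."""
    return 0.5 - 2 * math.sqrt(2) * eps**0.75 - 10 * binary_entropy(2 * math.sqrt(eps))

def tail_bound(eps, kappa, n):
    """Right-hand side of (122) with p/n replaced by kappa."""
    gap = dimension_margin(eps) - kappa
    if gap <= 0:
        # (120) needs delta(B_i) < n - p, so the bound says nothing in this case
        raise ValueError("margin does not exceed kappa; (122) is not applicable")
    return 4 * math.exp(-gap**2 * n / 8)
```

The margin tends to \(1/2\) as \(\epsilon \rightarrow 0\), but slowly: it is already negative at \(\epsilon =0.01\), so \(\epsilon \) has to be taken quite small before ratios \(p/n\) close to \(1/2\) are covered.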

Finally, we prove Lemmas 12 and 13 in the next subsections. These are the only remaining parts for the proof of Theorem 4.

1.1 Proof of Lemma 12

To begin with, all \(\Vert \varvec{z}^{(i)}\Vert \) concentrate around 1. Specifically, apply [34, Proposition 1] to get

$$\begin{aligned} {\mathbb {P}}\left\{ \Vert \varvec{z}^{(i)}\Vert ^{2}>1+2\sqrt{\frac{t}{p}}+\frac{2t}{p}\right\} \le e^{-t}, \end{aligned}$$

and set \(t=3\epsilon ^{2}p\) to reach

$$\begin{aligned} {\mathbb {P}}\left\{ \Vert \varvec{z}^{(i)}\Vert ^{2}>1+10\epsilon \right\} \text { }\le \text { }{\mathbb {P}}\left\{ \Vert \varvec{z}^{(i)}\Vert ^{2}>1+2\sqrt{3}\epsilon +6\epsilon ^{2}\right\} \text { }\le \text { }e^{-3\epsilon ^{2}p}. \end{aligned}$$

Taking the union bound we obtain

$$\begin{aligned} {\mathbb {P}}\left\{ \exists 1\le i\le N\text { s.t. }\Vert \varvec{z}^{(i)}\Vert ^{2}>1+10\epsilon \right\}\le & {} Ne^{-3\epsilon ^{2}p}=e^{-\epsilon ^{2}p}. \end{aligned}$$
(123)
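The chi-square tail bound used above can be checked by simulation. The sketch below assumes \(\varvec{z}\sim \mathcal {N}(\varvec{0},p^{-1}\varvec{I}_{p})\), so that \(\Vert \varvec{z}\Vert ^{2}\) has the law of \(\chi ^{2}_{p}/p\), and compares the empirical tail frequency with \(e^{-t}\); the function name and sample size are arbitrary choices.

```python
import numpy as np

def empirical_norm_tail(p, t, n_samples=20000, seed=0):
    """Empirical P{ ||z||^2 > 1 + 2 sqrt(t/p) + 2t/p } for z ~ N(0, I_p / p)."""
    rng = np.random.default_rng(seed)
    sq_norms = rng.chisquare(p, size=n_samples) / p   # ||z||^2 has the law of chi^2_p / p
    threshold = 1 + 2 * np.sqrt(t / p) + 2 * t / p
    return float(np.mean(sq_norms > threshold))
```

With `p=50` and `t=2.0`, the returned frequency can be compared with \(e^{-2}\approx 0.135\).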

Next, we note that it suffices to prove Lemma 12 for all unit vectors \(\varvec{x}\). The following lemma provides a bound on \(\left\langle \varvec{z}^{(i)},\varvec{x}\right\rangle \) for any fixed unit vector \(\varvec{x}\in {\mathbb {R}}^{p}\).

Lemma 14

Consider any fixed unit vector \(\varvec{x}\in {\mathbb {R}}^{p}\) and any given constant \(0<\epsilon <1\), and set \(N=\exp \left( 2\epsilon ^{2}p\right) \). There exist positive universal constants \(c_{5},c_{6},C_{6}>0\) such that

$$\begin{aligned}&{\mathbb {P}}\left\{ \sum _{i=1}^{N}\varvec{1}_{\left\{ \left\langle \varvec{z}^{(i)},\varvec{x}\right\rangle \ge \frac{1}{2}\epsilon \right\} }\le \exp \left( \left( 1-o\left( 1\right) \right) \frac{7}{4}\epsilon ^{2}p\right) \right\} \nonumber \\&\quad \le \exp \left\{ -2\exp \left( \left( 1-o\left( 1\right) \right) \frac{7}{4}\epsilon ^{2}p\right) \right\} . \end{aligned}$$
(124)

Recognizing that Lemma 12 is a uniform result, we need to extend Lemma 14 to all \(\varvec{x}\) simultaneously, which we achieve via the standard covering argument. Specifically, one can find a set \(\mathcal {C}:=\left\{ \varvec{x}^{(j)}\in {\mathbb {R}}^{p}\mid 1\le j\le K\right\} \) of unit vectors with cardinality \(K=\left( 1+2p^{2}\right) ^{p}\) to form a cover of the unit ball of resolution \(p^{-2}\) [65, Lemma 5.2]; that is, for any unit vector \(\varvec{x}\in {\mathbb {R}}^{p}\), there exists a \(\varvec{x}^{(j)}\in \mathcal {C}\) such that

$$\begin{aligned} \Vert \varvec{x}^{(j)}-\varvec{x}\Vert \le p^{-2}. \end{aligned}$$

Apply Lemma 14 and take the union bound to arrive at

$$\begin{aligned} \sum _{i=1}^{N}\varvec{1}_{\left\{ \left\langle \varvec{z}^{(i)},\varvec{x}^{(j)}\right\rangle \ge \frac{1}{2}\epsilon \right\} }\ge \exp \left( \left( 1-o(1)\right) \frac{7}{4}\epsilon ^{2}p\right) >1,&\qquad 1\le j\le K \end{aligned}$$
(125)

with probability exceeding \(1-K\exp \left\{ -2\exp \left( \left( 1-o(1)\right) \frac{7}{4}\epsilon ^{2}p\right) \right\} \ge 1-\exp \left\{ -2\left( 1-o\left( 1\right) \right) \exp \left( \left( 1-o(1)\right) \frac{7}{4}\epsilon ^{2}p\right) \right\} \). This guarantees that for each \(\varvec{x}^{(j)}\), one can find at least one \(\varvec{z}^{(i)}\) obeying

$$\begin{aligned} \left\langle \varvec{z}^{(i)},\varvec{x}^{(j)}\right\rangle \ge \frac{1}{2}\epsilon . \end{aligned}$$

This result together with (123) yields that, with probability exceeding \(1-C\exp \left( -c\epsilon ^2 p\right) \) for some universal constants \(C,c>0\),

$$\begin{aligned} \left\langle \varvec{z}^{(i)},\varvec{x}\right\rangle \ge \left\langle \varvec{z}^{(i)},\varvec{x}^{(j)}\right\rangle -\left\langle \varvec{z}^{(i)},\varvec{x}^{(j)}-\varvec{x}\right\rangle\ge & {} \left\langle \varvec{z}^{(i)},\varvec{x}^{(j)}\right\rangle -\Vert \varvec{z}^{(i)}\Vert \cdot \Vert \varvec{x}^{(j)}-\varvec{x}\Vert \\\ge & {} \frac{1}{2}\epsilon -\frac{1}{p^{2}}\Vert \varvec{z}^{(i)}\Vert \ge \frac{\frac{1}{2}\epsilon }{\sqrt{1+10\epsilon }}\Vert \varvec{z}^{(i)}\Vert \\&-\frac{1}{p^{2}}\Vert \varvec{z}^{(i)}\Vert \\\ge & {} \frac{1}{30}\epsilon \Vert \varvec{z}^{(i)}\Vert \end{aligned}$$

holds simultaneously for all unit vectors \(\varvec{x}\in {\mathbb {R}}^{p}\). Since \(\epsilon >0\) can be an arbitrary constant, this concludes the proof.

Proof of Lemma 14

Without loss of generality, it suffices to consider \(\varvec{x}=\varvec{e}_{1}=[1,0,\ldots ,0]^{\top }\). For any \(t>0\) and any constant \(\zeta >0\), it comes from [2, Theorem A.1.4] that

$$\begin{aligned}&{\mathbb {P}}\left\{ \frac{1}{N}\sum _{i=1}^{N}\varvec{1}_{\left\{ \left\langle \varvec{z}^{(i)},\varvec{e}_{1}\right\rangle <\zeta \right\} }>\left( 1+t\right) \Phi \left( \zeta \sqrt{p}\right) \right\} \le \exp \left( -2t^{2}\Phi ^2\left( \zeta \sqrt{p}\right) N\right) . \end{aligned}$$

Setting \(t=1-\Phi \left( \zeta \sqrt{p}\right) \) gives

$$\begin{aligned}&{\mathbb {P}}\left\{ \frac{1}{N}\sum _{i=1}^{N}\varvec{1}_{\left\{ \left\langle \varvec{z}^{(i)},\varvec{e}_{1}\right\rangle <\zeta \right\} }>\left( 2-\Phi \left( \zeta \sqrt{p}\right) \right) \Phi \left( \zeta \sqrt{p}\right) \right\} \\&\quad \le \exp \left( -2\left( 1-\Phi \left( \zeta \sqrt{p}\right) \right) ^{2}\Phi ^2\left( \zeta \sqrt{p}\right) N\right) . \end{aligned}$$

Recall that for any \(t>1\), one has \((t^{-1} -t^{-3})\phi (t) \le 1-\Phi (t)\le t^{-1} \phi (t)\) which implies that

$$\begin{aligned} 1-\Phi \left( \zeta \sqrt{p}\right) =\exp \left( -\frac{\left( 1+o\left( 1\right) \right) \zeta ^{2}p}{2}\right) . \end{aligned}$$
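The two-sided Gaussian tail bound and the resulting exponent can be verified numerically. The following sketch uses only the standard library (`math.erfc`) and hypothetical helper names; it checks the sandwich \((t^{-1}-t^{-3})\phi (t)\le 1-\Phi (t)\le t^{-1}\phi (t)\) and reports how close \(-\log (1-\Phi (t))\) is to \(t^{2}/2\).

```python
import math

def gaussian_tail(t):
    """1 - Phi(t), computed through the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def phi(t):
    """Standard normal density."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def mills_bounds_hold(t):
    """Check (1/t - 1/t^3) phi(t) <= 1 - Phi(t) <= phi(t) / t."""
    lower = (1 / t - 1 / t**3) * phi(t)
    upper = phi(t) / t
    return lower <= gaussian_tail(t) <= upper

def tail_exponent_ratio(t):
    """-log(1 - Phi(t)) divided by t^2 / 2; approaches 1 as t grows."""
    return -math.log(gaussian_tail(t)) / (t * t / 2)
```

Here \(t\) plays the role of \(\zeta \sqrt{p}\), so a ratio close to 1 for large \(t\) is exactly the \(1+o(1)\) factor in the display above.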

Taking \(\zeta =\frac{1}{2}\epsilon \), we arrive at

$$\begin{aligned} \left( 2-\Phi \left( \zeta \sqrt{p}\right) \right) \Phi \left( \zeta \sqrt{p}\right)= & {} 1-\exp \left( -\left( 1+o\left( 1\right) \right) \zeta ^{2}p\right) \\= & {} 1-\exp \left( -\left( 1+o\left( 1\right) \right) \frac{1}{4}\epsilon ^{2}p\right) ,\\ \left( 1-\Phi \left( \zeta \sqrt{p}\right) \right) ^{2}\Phi ^2\left( \zeta \sqrt{p}\right)= & {} \exp \left( -\left( 1+o\left( 1\right) \right) \zeta ^{2}p\right) \\= & {} \exp \left( -\left( 1+o\left( 1\right) \right) \frac{1}{4}\epsilon ^{2}p\right) \gg \frac{1}{N}. \end{aligned}$$

This justifies that

$$\begin{aligned}&{\mathbb {P}}\left\{ \sum _{i=1}^{N}\varvec{1}_{\left\{ \left\langle \varvec{z}^{(i)},\varvec{e}_{1}\right\rangle \ge \frac{1}{2}\epsilon \right\} }\le N\exp \left( -\left( 1+o\left( 1\right) \right) \frac{1}{4}\epsilon ^{2}p\right) \right\} \\&\quad ={\mathbb {P}}\left\{ \frac{1}{N}\sum _{i=1}^{N}\varvec{1}_{\left\{ \left\langle \varvec{z}^{(i)},\varvec{e}_{1}\right\rangle <\zeta \right\} }>\left( 2-\Phi \left( \zeta \sqrt{p}\right) \right) \Phi \left( \zeta \sqrt{p}\right) \right\} \\&\quad \le \exp \left\{ -2\exp \left( -\left( 1+o\left( 1\right) \right) \frac{1}{4}\epsilon ^{2}p\right) N\right\} \\&\quad =\exp \left\{ -2\exp \left( \left( 1-o\left( 1\right) \right) \frac{7}{4}\epsilon ^{2}p\right) \right\} \end{aligned}$$

as claimed. \(\square \)

1.2 Proof of Lemma 13

First of all, recall from the definition (19) that

$$\begin{aligned} \delta (\mathcal {B}_{i})= & {} {\mathbb {E}}\left[ \left\| \Pi {}_{\mathcal {B}_{i}}\left( \varvec{g}\right) \right\| ^{2}\right] ={\mathbb {E}}\left[ \left\| \varvec{g}\right\| ^{2}-\min _{\varvec{u}\in \mathcal {B}_{i}}\left\| \varvec{g}-\varvec{u}\right\| ^{2}\right] =n-{\mathbb {E}}\left[ \min _{\varvec{u}\in \mathcal {B}_{i}}\left\| \varvec{g}-\varvec{u}\right\| ^{2}\right] \\\le & {} n-{\mathbb {E}}\left[ \min _{\varvec{u}\in \mathcal {D}_{i}}\left\| \varvec{g}-\varvec{u}\right\| ^{2}\right] , \end{aligned}$$

where \(\varvec{g}\sim \mathcal {N}\left( \varvec{0},\varvec{I}_{n}\right) \), and \(\mathcal {D}_{i}\) is a superset of \(\mathcal {B}_{i}\) defined by

$$\begin{aligned} \mathcal {D}_{i}:=\left\{ \varvec{u}\in {\mathbb {R}}^{n}\mid \sum \limits _{j=1}^{n}\max \left\{ -u_{j},0\right\} \le \epsilon \sqrt{n}\Vert \varvec{u}\Vert \right\} . \end{aligned}$$
(126)

Recall from the triangle inequality that

$$\begin{aligned} \left\| \varvec{g}-\varvec{u}\right\|\ge & {} \Vert \varvec{u}\Vert -\Vert \varvec{g}\Vert>\Vert \varvec{g}\Vert =\Vert \varvec{g}-\varvec{0}\Vert ,\qquad \forall \varvec{u}:\text { }\Vert \varvec{u}\Vert >2\Vert \varvec{g}\Vert . \end{aligned}$$

Since \(\varvec{0}\in \mathcal {D}_{i}\), this implies that

$$\begin{aligned} \Big \Vert \arg \min _{\varvec{u}\in \mathcal {D}_{i}}\Vert \varvec{g}-\varvec{u}\Vert \Big \Vert \le 2\Vert \varvec{g}\Vert , \end{aligned}$$

revealing that

$$\begin{aligned} {\mathbb {E}}\left[ \min _{\varvec{u}\in \mathcal {D}_{i}}\left\| \varvec{g}-\varvec{u}\right\| ^{2}\right] ={\mathbb {E}}\left[ \min _{\varvec{u}\in \mathcal {D}_{i},\Vert \varvec{u}\Vert \le 2\Vert \varvec{g}\Vert }\left\| \varvec{g}-\varvec{u}\right\| ^{2}\right] . \end{aligned}$$

In what follows, it suffices to look at the set of \(\varvec{u}\)’s within \(\mathcal {D}_{i}\) obeying \(\Vert \varvec{u}\Vert \le 2\Vert \varvec{g}\Vert \), which verify

$$\begin{aligned} \sum \limits _{j=1}^{n}\max \left\{ -u_{j},0\right\} \le \epsilon \sqrt{n}\Vert \varvec{u}\Vert \le 2\epsilon \sqrt{n}\Vert \varvec{g}\Vert . \end{aligned}$$
(127)

It is seen that

$$\begin{aligned} \Vert \varvec{g}-\varvec{u}\Vert ^{2}\ge & {} \sum _{i:g_{i}<0}\left( g_{i}-u_{i}\right) ^{2}=\left\{ \sum _{i:g_{i}<0,u_{i}\ge 0}+\sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}\right. \nonumber \\&\quad \left. +\sum _{i:g_{i}<0,\text { }u_{i}\le -\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }\right\} \left( g_{i}-u_{i}\right) ^{2}\nonumber \\\ge & {} \sum _{i:g_{i}<0,u_{i}\ge 0}g_{i}^{2}+\sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}\left( g_{i}-u_{i}\right) ^{2}\nonumber \\\ge & {} \sum _{i:g_{i}<0,u_{i}\ge 0}g_{i}^{2}+\sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}\left( g_{i}^{2}-2u_{i}g_{i}\right) \nonumber \\\ge & {} \sum _{i:g_{i}<0,\text { }u_{i}>-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }g_{i}^{2}-\sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}2u_{i}g_{i}. \end{aligned}$$
(128)
  1. Regarding the first term of (128), we first recognize that

    $$\begin{aligned} \left| \left\{ i\mid u_{i}\le -\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert \right\} \right| \le \frac{\sum _{i:\text { }u_{i}<0}|u_{i}|}{\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }=\frac{\sum _{i=1}^{n}\max \left\{ -u_{i},0\right\} }{\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }\le 2\sqrt{\epsilon }n, \end{aligned}$$

    where the last inequality follows from the constraint (127). As a consequence,

    $$\begin{aligned} \sum _{i:g_{i}<0,\text { }u_{i}>-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }g_{i}^{2}\ge & {} \sum _{i:g_{i}<0}g_{i}^{2}-\sum _{i:u_{i}\le -\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert }g_{i}^{2}\\\ge & {} \sum _{i:g_{i}<0}g_{i}^{2}-\max _{S\subseteq [n]:\text { }|S|=2\sqrt{\epsilon }n}\sum _{i\in S}g_{i}^{2}. \end{aligned}$$
  2. Next, we turn to the second term of (128), which can be bounded by

    $$\begin{aligned}&\sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}u_{i}g_{i}\\&\quad \le \sqrt{\left( \sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}u_{i}^{2}\right) \left( \sum _{i:g_{i}<0,\text { }-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}g_{i}^{2}\right) }\\&\quad \le \sqrt{\left( \max _{i:-\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert<u_{i}<0}|u_{i}|\right) \left( \sum _{i:u_{i}<0}|u_{i}|\right) \cdot \Vert \varvec{g}\Vert ^{2}}\\&\quad \le \sqrt{\sqrt{\frac{\epsilon }{n}}\Vert \varvec{g}\Vert \left( \sum _{i:u_{i}<0}|u_{i}|\right) \cdot \Vert \varvec{g}\Vert ^{2}}\le \sqrt{2}\epsilon ^{\frac{3}{4}}\Vert \varvec{g}\Vert ^{2}, \end{aligned}$$

    where the last inequality follows from the constraint (127).

Putting the above results together, we have

$$\begin{aligned} \left\| \varvec{g}-\varvec{u}\right\| ^{2}\ge \sum _{i:g_{i}<0}g_{i}^{2}-\max _{S\subseteq [n]:\text { }|S|=2\sqrt{\epsilon }n}\sum _{i\in S}g_{i}^{2}-2\sqrt{2}\epsilon ^{\frac{3}{4}}\Vert \varvec{g}\Vert ^{2} \end{aligned}$$

for any \(\varvec{u}\in \mathcal {D}_{i}\) obeying \(\Vert \varvec{u}\Vert \le 2\Vert \varvec{g}\Vert \), whence

$$\begin{aligned} {\mathbb {E}}\left[ \min _{\varvec{u}\in \mathcal {D}_{i}}\left\| \varvec{g}-\varvec{u}\right\| ^{2}\right]\ge & {} {\mathbb {E}}\left[ \sum _{i:g_{i}<0}g_{i}^{2}-\max _{S\subseteq [n]:\text { }|S|=2\sqrt{\epsilon }n}\sum _{i\in S}g_{i}^{2}-2\sqrt{2}\epsilon ^{\frac{3}{4}}\Vert \varvec{g}\Vert ^{2}\right] \nonumber \\= & {} \left( \frac{1}{2}-2\sqrt{2}\epsilon ^{\frac{3}{4}}\right) n-{\mathbb {E}}\left[ \max _{S\subseteq [n]:\text { }|S|=2\sqrt{\epsilon }n}\sum _{i\in S}g_{i}^{2}\right] . \end{aligned}$$
(129)
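The pointwise lower bound on \(\Vert \varvec{g}-\varvec{u}\Vert ^{2}\) that leads to (129) can be sanity-checked on synthetic vectors. The sketch below is only an illustration under stated assumptions: it builds \(\varvec{u}\) as the positive part of \(\varvec{g}\) plus a small amount of negative mass (a hypothetical construction chosen to stay inside \(\mathcal {D}_{i}\) with \(\Vert \varvec{u}\Vert \le 2\Vert \varvec{g}\Vert \)), and it rounds \(|S|=2\sqrt{\epsilon }n\) up to an integer, which only weakens the bound.

```python
import numpy as np

def check_pointwise_bound(n=400, eps=1e-4, trials=200, seed=0):
    """Check ||g-u||^2 >= sum_{g_i<0} g_i^2 - max_S sum_S g_i^2 - 2 sqrt(2) eps^(3/4) ||g||^2."""
    rng = np.random.default_rng(seed)
    k = int(np.ceil(2 * np.sqrt(eps) * n))             # |S| = 2 sqrt(eps) n, rounded up
    for _ in range(trials):
        g = rng.standard_normal(n)
        u = np.maximum(g, 0.0)                         # start from the positive part of g
        budget = 0.9 * eps * np.sqrt(n) * np.linalg.norm(u)
        idx = np.flatnonzero(g < 0)
        w = rng.random(idx.size)
        u[idx] = -budget * w / w.sum()                 # spread a little negative mass
        in_D = np.maximum(-u, 0).sum() <= eps * np.sqrt(n) * np.linalg.norm(u)
        in_ball = np.linalg.norm(u) <= 2 * np.linalg.norm(g)
        lhs = np.sum((g - u) ** 2)
        top_k = np.sort(g**2)[-k:].sum()               # max over |S| = k of sum_{i in S} g_i^2
        rhs = np.sum(g[g < 0] ** 2) - top_k - 2 * np.sqrt(2) * eps**0.75 * np.sum(g**2)
        if not (in_D and in_ball and lhs >= rhs):
            return False
    return True
```

The maximum over subsets of a fixed size is attained by the largest entries of \(g_{i}^{2}\), which is why a sort suffices.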

Finally, it follows from [34, Proposition 1] that for any \(t>2\sqrt{\epsilon }n\),

$$\begin{aligned} {\mathbb {P}}\left\{ \sum _{i\in S}g_{i}^{2}\ge 5t\right\} \le {\mathbb {P}}\left\{ \sum _{i\in S}g_{i}^{2}\ge |S|+2\sqrt{|S|t}+2t\right\} \le e^{-t}, \end{aligned}$$

which together with the union bound gives

$$\begin{aligned}&{\mathbb {P}}\left\{ \max _{S\subseteq [n]:\text { }|S|= 2\sqrt{\epsilon }n}\sum _{i\in S}g_{i}^{2}\ge 5t\right\} \le \sum _{S\subseteq [n]:\text { }|S|=2\sqrt{\epsilon }n}{\mathbb {P}}\left\{ \sum _{i\in S}g_{i}^{2}\ge 5t\right\} \\&\quad \le \exp \left\{ H\left( 2\sqrt{\epsilon }\right) n-t\right\} . \end{aligned}$$

This gives

$$\begin{aligned} {\mathbb {E}}\left[ \max _{S\subseteq [n]:\text { }|S|=2\sqrt{\epsilon }n}\sum _{i\in S}g_{i}^{2}\right]= & {} {\displaystyle \int }_{0}^{\infty }{\mathbb {P}}\left\{ \max _{S\subseteq [n]:\text { }|S|= 2\sqrt{\epsilon }n}\sum _{i\in S}g_{i}^{2}\ge t\right\} \mathrm {d}t\\\le & {} 5H\left( 2\sqrt{\epsilon }\right) n+{\displaystyle \int }_{5H\left( 2\sqrt{\epsilon }\right) n}^{\infty }\exp \left\{ H\left( 2\sqrt{\epsilon }\right) n-\frac{1}{5}t\right\} \mathrm {d}t\\< & {} 10H\left( 2\sqrt{\epsilon }\right) n, \end{aligned}$$

for any given \(\epsilon >0\) with the proviso that n is sufficiently large. This combined with (129) yields

$$\begin{aligned} {\mathbb {E}}\left[ \min _{\varvec{u}\in \mathcal {D}_{i}}\left\| \varvec{g}-\varvec{u}\right\| ^{2}\right]\ge & {} \left( \frac{1}{2}-2\sqrt{2}\epsilon ^{\frac{3}{4}}-10H(2\sqrt{\epsilon })\right) n \end{aligned}$$
(130)

as claimed.
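The bound \({\mathbb {E}}\big [\max _{|S|=2\sqrt{\epsilon }n}\sum _{i\in S}g_{i}^{2}\big ]<10H(2\sqrt{\epsilon })n\) used in the last step can be compared with a Monte Carlo estimate. The sketch below assumes natural logarithms in \(H\) and rounds \(2\sqrt{\epsilon }n\) to the nearest integer; names and default values are hypothetical.

```python
import math
import numpy as np

def binary_entropy(x):
    """H(x) = -x log x - (1 - x) log(1 - x) (natural log assumed)."""
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def mean_top_k_energy(n, eps, n_rep=500, seed=0):
    """Monte Carlo estimate of E[max_{|S|=k} sum_{i in S} g_i^2] and the bound 10 H(2 sqrt(eps)) n."""
    rng = np.random.default_rng(seed)
    k = int(round(2 * math.sqrt(eps) * n))
    g2 = rng.standard_normal((n_rep, n)) ** 2
    top_k = np.sort(g2, axis=1)[:, -k:].sum(axis=1)    # best subset = k largest g_i^2
    bound = 10 * binary_entropy(2 * math.sqrt(eps)) * n
    return float(top_k.mean()), bound
```

In experiments of this kind the estimate typically sits well below the bound, which suggests that the constant 10 is chosen for convenience rather than tightness.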

Proof of Lemma 8

Throughout, we shall restrict ourselves to the event \(\mathcal {A}_n\) as defined in (86), on which \(\tilde{\varvec{G}}\succeq \lambda _{\mathrm {lb}}\varvec{I}\). Recalling the definitions of \(\tilde{\varvec{G}}\) and \(\varvec{w}\) from (82) and (89), we see that

$$\begin{aligned} \varvec{w}^{\top }\tilde{\varvec{G}}^{-2}\varvec{w}&= ~\frac{1}{n^2} \varvec{X}_{\cdot 1}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}\tilde{\varvec{X}}\left( \frac{1}{n} \tilde{\varvec{X}}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}\tilde{\varvec{X}}\right) ^{-2} \tilde{\varvec{X}}^{\top } \varvec{D}_{\tilde{\varvec{\beta }}}\varvec{X}_{\cdot 1}\nonumber \\&\le ~ \frac{\big \Vert \varvec{X}_{\cdot 1}^{\top } \big \Vert ^2}{n} \left\| \frac{1}{n} \varvec{D}_{\tilde{\varvec{\beta }}}\tilde{\varvec{X}}\left( \frac{1}{n} \tilde{\varvec{X}}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}\tilde{\varvec{X}}\right) ^{-2} \tilde{\varvec{X}}^{\top } \varvec{D}_{\tilde{\varvec{\beta }}}\right\| . \end{aligned}$$
(131)

If we let the singular value decomposition of \(\frac{1}{\sqrt{n}}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \tilde{\varvec{X}}\) be \(\varvec{U}\varvec{\Sigma }\varvec{V}^{\top }\), then a little algebra gives \(\varvec{\Sigma }\succeq \sqrt{\lambda _{\mathrm {lb}} }\varvec{I}\) and

$$\begin{aligned} \frac{1}{n} \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \tilde{\varvec{X}}\left( \frac{1}{n} \tilde{\varvec{X}}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}\tilde{\varvec{X}}\right) ^{-2} \tilde{\varvec{X}}^{\top } \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}&= \varvec{U}\varvec{\Sigma }^{-2} \varvec{U}^{\top } ~\preceq ~ \lambda _{\mathrm {lb}}^{-1} \varvec{I}. \ \ \end{aligned}$$

Substituting this into (131) and using the fact that \(\Vert \varvec{X}_{\cdot 1}\Vert ^2 \lesssim n\) with high probability (by Lemma 2), we obtain

$$\begin{aligned} \varvec{w}^{\top } \tilde{\varvec{G}}^{-2} \varvec{w}~\lesssim ~ \frac{1}{n\lambda _{\mathrm {lb}}} \Vert \varvec{X}_{\cdot 1}\Vert ^2 \lesssim 1 \end{aligned}$$

with probability at least \(1-\exp (-\Omega (n))\).
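The "little algebra" behind the singular value decomposition step in this proof can be confirmed numerically. In the sketch below, the matrix `X` and the weights `d` are synthetic stand-ins for \(\tilde{\varvec{X}}\) and the diagonal of \(\varvec{D}_{\tilde{\varvec{\beta }}}\) (they are not produced by any logistic fit), and the smallest eigenvalue of the Gram matrix plays the role of \(\lambda _{\mathrm {lb}}\).

```python
import numpy as np

def check_svd_identity(n=60, q=5, seed=0):
    """Compare B (B^T B)^{-2} B^T with U Sigma^{-2} U^T for B = D^{1/2} X / sqrt(n)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, q))
    d = rng.uniform(0.05, 0.25, size=n)                # synthetic values of rho''
    B = np.sqrt(d)[:, None] * X / np.sqrt(n)           # B = D^{1/2} X / sqrt(n)
    G = B.T @ B                                        # G = X^T D X / n
    M = B @ np.linalg.matrix_power(np.linalg.inv(G), 2) @ B.T
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    M_svd = (U / s**2) @ U.T                           # U Sigma^{-2} U^T
    lam_min = np.linalg.eigvalsh(G)[0]                 # smallest eigenvalue of G
    return bool(np.allclose(M, M_svd)), float(np.linalg.norm(M, 2)), float(1 / lam_min)
```

The second and third return values agree, reflecting that the spectral norm of \(\varvec{U}\varvec{\Sigma }^{-2}\varvec{U}^{\top }\) equals the reciprocal of the smallest eigenvalue of the Gram matrix.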

Proof of Lemma 9

Throughout this and the subsequent sections, we consider \(H_n\) and \(K_n\) to be two diverging sequences with the following properties:

$$\begin{aligned} H_n=o\left( n^\epsilon \right) , \ \ K_n=o\left( n^\epsilon \right) , \ \ n^2\exp \left( -c_1 H_n^2\right) = o(1), \ \ n\exp \left( -c_2K_n^2\right) =o(1),\nonumber \\ \end{aligned}$$
(132)

for any constants \(c_i > 0\), \(i=1,2\) and any \(\epsilon >0\). This lemma is an analogue of [25, Proposition 3.18]. We modify and adapt the proof ideas to establish the result in our setup. Throughout we shall restrict ourselves to the event \(\mathcal {A}_n\), on which \(\tilde{\varvec{G}}\succeq \lambda _{\mathrm {lb}}\varvec{I}\).

Due to independence between \(\varvec{X}_{\cdot 1}\) and \(\{\varvec{D}_{\tilde{\varvec{\beta }}}, \varvec{H}\}\), one can invoke the Hanson-Wright inequality [52, Theorem 1.1] to yield

$$\begin{aligned}&{\mathbb {P}}\left( \left| \frac{1}{n} \varvec{X}_{\cdot 1}^{\top } \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{X}_{\cdot 1}- \frac{1}{n} \mathrm {Tr}\left( \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \right) \right| > t ~\Bigg |~ \varvec{H}, \varvec{D}_{\tilde{\varvec{\beta }}}\right) \\&\quad \le 2 \exp \left( -c \min \left\{ \frac{t^2}{\frac{K^4}{n^2} \big \Vert \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \big \Vert ^2_{\mathrm {F}}}, \frac{t}{\frac{K^2}{n} \big \Vert \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \big \Vert } \right\} \right) \\&\quad \le 2 \exp \left( -c \min \left\{ \frac{t^2}{\frac{K^4}{n} \big \Vert \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \big \Vert ^2}, \frac{t}{\frac{K^2}{n} \big \Vert \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \big \Vert } \right\} \right) , \end{aligned}$$

where \(\Vert .\Vert _{\mathrm {F}}\) denotes the Frobenius norm. Choose \(t = C^2 \big \Vert \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \big \Vert H_n/\sqrt{n}\) with \(C>0\) a sufficiently large constant, and take \(H_n\) to be as in (132). Substitution into the above inequality and unconditioning give

$$\begin{aligned}&{\mathbb {P}}\left( \left| \frac{1}{n} \varvec{X}_{\cdot 1}^{\top } \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{X}_{\cdot 1}- \frac{1}{n} \mathrm {Tr}\left( \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \right) \right| > \frac{1}{\sqrt{n}} C^2 H_n \Vert \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \Vert \right) \nonumber \\&\qquad \le ~2\exp \left( -c \min \left\{ \frac{C^4 H_n^2}{K^4}, \frac{C^2 \sqrt{n}H_n}{K^2} \right\} \right) = C\exp \left( -c H_n^2 \right) = o(1), \end{aligned}$$
(133)

for some universal constants \(C, c>0\).
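The concentration of the quadratic form around the normalized trace can be illustrated by a small simulation. The sketch assumes a standard Gaussian vector in place of \(\varvec{X}_{\cdot 1}\) and a synthetic symmetric matrix with spectral norm at most \(1/4\) in place of \(\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\); both are assumptions made purely for illustration.

```python
import numpy as np

def quadratic_form_deviation(n, seed=0):
    """|x^T A x - Tr(A)| / n for a Gaussian x independent of a bounded symmetric A."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix
    eigs = rng.uniform(0.0, 0.25, size=n)              # eigenvalues in [0, 1/4]
    A = (Q * eigs) @ Q.T                               # A = Q diag(eigs) Q^T
    x = rng.standard_normal(n)                         # drawn independently of A
    return float(abs(x @ A @ x - np.trace(A)) / n)
```

The deviation is expected to be of order \(\Vert \varvec{A}\Vert /\sqrt{n}\), in line with the choice of \(t\) made above.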

It remains to analyze \(\mathrm {Tr}\big ( \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \big )\). Recall from the definition (92) of \(\varvec{H}\) that

$$\begin{aligned} \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} = \varvec{D}_{\tilde{\varvec{\beta }}}- \frac{1}{n} \varvec{D}_{\tilde{\varvec{\beta }}}\tilde{\varvec{X}}\tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}^{\top } \varvec{D}_{\tilde{\varvec{\beta }}}, \end{aligned}$$

and, hence,

$$\begin{aligned} \mathrm {Tr}\left( \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\right) = \sum _{i=1}^n \left( \rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }}) - \frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})^2}{n} \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}_i \right) . \end{aligned}$$
(134)

This requires us to analyze \(\tilde{\varvec{G}}^{-1}\) carefully. To this end, recall that the matrix \(\tilde{\varvec{G}}_{(i)}\) defined in (83) obeys

$$\begin{aligned} \tilde{\varvec{G}}_{(i)}= \tilde{\varvec{G}}- \frac{1}{n} \rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }}) \tilde{\varvec{X}}_i \tilde{\varvec{X}}_i^{\top }. \end{aligned}$$

Invoking the Sherman–Morrison–Woodbury formula (e.g. [31]), we have

$$\begin{aligned} \tilde{\varvec{G}}^{-1}&= \tilde{\varvec{G}}_{(i)}^{-1} -\frac{\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{G}}_{(i)}^{-1} \tilde{\varvec{X}}_i \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{(i)}^{-1}}{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i} . \end{aligned}$$
(135)

It follows that

$$\begin{aligned} \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}_i = \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1} \tilde{\varvec{X}}_i - \frac{\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1} \tilde{\varvec{X}}_i )^2}{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i}, \end{aligned}$$

which implies that

$$\begin{aligned} \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}_i = \frac{\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i}{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i}. \end{aligned}$$
(136)
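The rank-one update (135) and its consequence (136) are easy to verify numerically. The sketch below uses synthetic data: `X` stands in for \(\tilde{\varvec{X}}\) and `d` for the values \(\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})\); neither comes from an actual fit, and the function name is hypothetical.

```python
import numpy as np

def check_leave_one_out(n=40, q=4, i=0, seed=0):
    """Verify (135) and (136) for G = X^T D X / n and G_(i) = G - (d_i / n) x_i x_i^T."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, q))
    d = rng.uniform(0.05, 0.25, size=n)                # synthetic values of rho''
    G = (X * d[:, None]).T @ X / n
    x_i, c = X[i], d[i] / n
    G_i_inv = np.linalg.inv(G - c * np.outer(x_i, x_i))
    a = x_i @ G_i_inv @ x_i                            # x_i^T G_(i)^{-1} x_i
    v = G_i_inv @ x_i
    G_inv_sm = G_i_inv - c * np.outer(v, v) / (1 + c * a)   # right-hand side of (135)
    G_inv = np.linalg.inv(G)
    ok_135 = bool(np.allclose(G_inv_sm, G_inv))
    ok_136 = bool(np.isclose(x_i @ G_inv @ x_i, a / (1 + c * a)))
    return ok_135, ok_136
```

Both flags should be true for any index `i`, since the identities are purely algebraic.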

The relations (134) and (136) taken collectively reveal that

$$\begin{aligned} \frac{1}{n}\mathrm {Tr}\left( \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \right) = \frac{1}{n} \sum _{i=1}^n \frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i}. \end{aligned}$$
(137)

We shall show that the trace above is close to \(\mathrm {Tr}(\varvec{I}- \varvec{H})\) up to some factors. For this purpose we analyze the latter quantity in two different ways. To begin with, observe that

$$\begin{aligned} \mathrm {Tr}(\varvec{I}-\varvec{H})= \mathrm {Tr}\Bigg (\frac{\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\tilde{\varvec{X}}\tilde{\varvec{G}}^{-1}\tilde{\varvec{X}}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}}{n} \Bigg ) =\mathrm {Tr}(\tilde{\varvec{G}}\tilde{\varvec{G}}^{-1})=p-1. \end{aligned}$$
(138)

On the other hand, it directly follows from the definition of \(\varvec{H}\) and (136) that the \(i{\text {th}}\) diagonal entry of \(\varvec{H}\) is given by

$$\begin{aligned} H_{i,i} = \frac{1}{ 1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i} . \end{aligned}$$

Applying this relation, we can compute \(\mathrm {Tr}(\varvec{I}-\varvec{H})\) analytically as follows:

$$\begin{aligned} \mathrm {Tr}(\varvec{I}-\varvec{H})&= \sum _{i} \frac{\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i}{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i} \end{aligned}$$
(139)
$$\begin{aligned}&= \sum _{i} \frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})\tilde{\alpha }+ \frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i - \rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})\tilde{\alpha }}{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i} \nonumber \\&= \sum _{i} \rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})\tilde{\alpha }H_{i,i} + \sum _{i} \frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }}) \left( \frac{1}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i -\tilde{\alpha }\right) }{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i} , \end{aligned}$$
(140)

where \(\tilde{\alpha }:= \frac{1}{n} \mathrm {Tr}\left( \tilde{\varvec{G}}^{-1}\right) \).
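The trace identity (138), the formula for \(H_{i,i}\), and the decomposition (140) can be checked together on synthetic data. In this sketch `q` plays the role of \(p-1\), `d` holds made-up values of \(\rho ''\), and \(\varvec{H}\) is formed as \(\varvec{I}-\frac{1}{n}\varvec{D}^{1/2}\varvec{X}\varvec{G}^{-1}\varvec{X}^{\top }\varvec{D}^{1/2}\), consistent with (138); all names are illustrative.

```python
import numpy as np

def trace_identities(n=40, q=4, seed=0):
    """Check Tr(I - H) = q, the diagonal formula for H, and the split in (140)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, q))                    # q plays the role of p - 1
    d = rng.uniform(0.05, 0.25, size=n)                # synthetic values of rho''
    G = (X * d[:, None]).T @ X / n
    G_inv = np.linalg.inv(G)
    S = np.sqrt(d)[:, None] * X                        # D^{1/2} X
    H = np.eye(n) - S @ G_inv @ S.T / n
    alpha = np.trace(G_inv) / n                        # alpha-tilde
    # leave-one-out quadratic forms a_i = x_i^T G_(i)^{-1} x_i
    a = np.array([X[i] @ np.linalg.inv(G - d[i] / n * np.outer(X[i], X[i])) @ X[i]
                  for i in range(n)])
    denom = 1 + d * a / n
    first = alpha * np.sum(d * np.diag(H))             # alpha * Tr(D^{1/2} H D^{1/2})
    second = np.sum(d * (a / n - alpha) / denom)
    return {
        "trace_138": bool(np.isclose(np.trace(np.eye(n) - H), q)),
        "diag_H": bool(np.allclose(np.diag(H), 1 / denom)),
        "split_140": bool(np.isclose(first + second, q)),
    }
```

All three entries of the returned dictionary should be true, since each identity is exact for any positive weights.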

Observe that the first quantity in the right-hand side above is simply \(\tilde{\alpha }\mathrm {Tr}\big (\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\big )\). For simplicity, denote

$$\begin{aligned} \eta _i = \frac{1}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i -\tilde{\alpha }. \end{aligned}$$
(141)

Note that \(\tilde{\varvec{G}}_{(i)}\succ \varvec{0}\) on \(\mathcal {A}_n \) and that \(\rho ''>0\). Hence the denominator in the second term in (140) is greater than 1 for all i. Comparing (138) and (140), we deduce that

$$\begin{aligned} \left| \frac{p-1}{n} - \frac{1}{n}\mathrm {Tr}\left( \varvec{D}_{\tilde{\varvec{\beta }}}^{1/2} \varvec{H}\varvec{D}_{\tilde{\varvec{\beta }}}^{1/2}\right) \tilde{\alpha }\right| ~\le ~ \sup _i|\eta _i| \cdot \frac{1}{n}\sum _i|\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})| ~\lesssim ~ \sup _i|\eta _i|\nonumber \\ \end{aligned}$$
(142)

on \(\mathcal {A}_n\). It thus suffices to control \(\sup _i|\eta _i|\). The above bounds together with (87) and the proposition below complete the proof.
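The deterministic inequality (142) can also be checked directly. The sketch below reuses the same kind of synthetic setup (random `X`, made-up weights `d` standing in for \(\rho ''\), and `q` in place of \(p-1\)) and returns both sides of (142) before the final \(\lesssim \) step; it is meant as an illustration, not as part of the argument.

```python
import numpy as np

def check_bound_142(n=40, q=4, seed=0):
    """Return both sides of |q/n - Tr(D^{1/2} H D^{1/2}) alpha / n| <= sup|eta_i| * mean(rho'')."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, q))                    # q stands in for p - 1
    d = rng.uniform(0.05, 0.25, size=n)                # synthetic values of rho''
    G = (X * d[:, None]).T @ X / n
    alpha = np.trace(np.linalg.inv(G)) / n
    a = np.array([X[i] @ np.linalg.inv(G - d[i] / n * np.outer(X[i], X[i])) @ X[i]
                  for i in range(n)])
    eta = a / n - alpha                                # eta_i as in (141)
    trace_dhd = np.sum(d / (1 + d * a / n))            # Tr(D^{1/2} H D^{1/2}) via (137)
    lhs = abs(q / n - trace_dhd * alpha / n)
    rhs = np.max(np.abs(eta)) * np.mean(d)
    return float(lhs), float(rhs)
```

The inequality holds for every draw because each denominator in (140) exceeds 1.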

Proposition 1

Let \(\eta _i\) be as defined in (141). Then there exist universal constants \(C_1,C_2,C_3,C_4,c_2,c_3 > 0\) such that

$$\begin{aligned} {\mathbb {P}}\left( \sup _i |\eta _i| \le \frac{C_1 K_n^2 H_n}{\sqrt{n}} \right)&\ge 1- C_2n^2\exp \left( -c_2 H_n^2\right) -C_3n\exp \left( -c_3K_n^2\right) \nonumber \\&\quad -\, \exp \left( -C_4n\left( 1+o(1)\right) \right) =1-o(1), \end{aligned}$$

where \(K_n , H_n\) are diverging sequences as specified in (132).

Proof of Proposition 1

Fix any index i. Recall that \(\tilde{\varvec{\beta }}_{[-i]}\) is the MLE when the \(1{\text {st}}\) predictor and \(i{\text {th}}\) observation are removed. Also recall the definition of \(\tilde{\varvec{G}}_{[-i]}\) in (85). The proof essentially follows three steps. First, note that \(\tilde{\varvec{X}}_i\) and \(\tilde{\varvec{G}}_{[-i]}\) are independent. Hence, an application of the Hanson-Wright inequality [52] yields that

$$\begin{aligned}&{\mathbb {P}}\left( \left| \frac{1}{n} \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{[-i]}^{-1} \tilde{\varvec{X}}_i - \frac{1}{n} \mathrm {Tr}\left( \tilde{\varvec{G}}_{[-i]}^{-1} \right) \right| > t ~ \Bigg |~ \tilde{\varvec{G}}_{[-i]}\right) \\&\quad \le 2 \exp \left( -c \min \left\{ \frac{t^2}{\frac{K^4}{n^2} \big \Vert \tilde{\varvec{G}}_{[-i]}^{-1} \big \Vert _{\mathrm {F}}^2}, \frac{t}{\frac{K^2}{n} \big \Vert \tilde{\varvec{G}}_{[-i]}^{-1} \big \Vert } \right\} \right) \\&\quad \le 2 \exp \left( -c \min \left\{ \frac{t^2}{\frac{K^4}{n} \big \Vert \tilde{\varvec{G}}_{[-i]}^{-1} \big \Vert ^2}, \frac{t}{\frac{K^2}{n} \big \Vert \tilde{\varvec{G}}_{[-i]}^{-1} \big \Vert } \right\} \right) . \end{aligned}$$

We choose \(t = C^2 \big \Vert \tilde{\varvec{G}}_{[-i]}^{-1} \big \Vert H_n/\sqrt{n}\), where \(C>0\) is a sufficiently large constant.

Now marginalizing gives

$$\begin{aligned}&{\mathbb {P}}\left( \left| \frac{1}{n} \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{[-i]}^{-1} \tilde{\varvec{X}}_i - \frac{1}{n} \mathrm {Tr}\left( \tilde{\varvec{G}}_{[-i]}^{-1} \right) \right| > C^2 \big \Vert \tilde{\varvec{G}}_{[-i]}^{-1} \big \Vert \frac{H_n}{\sqrt{n}} \right) \\&\quad \le 2\exp \left( -c \min \left\{ \frac{C^4 H_n^2}{K^4}, \frac{C^2 \sqrt{n }H_n}{K^2} \right\} \right) \\&\quad \le 2 \exp \left( -C' H_n^2 \right) , \end{aligned}$$

where \(C' > 0 \) is a sufficiently large constant. On \(\mathcal {A}_n\), the spectral norm \(\big \Vert \tilde{\varvec{G}}_{(i)}^{-1} \big \Vert \) is bounded above by \(\lambda _{\mathrm {lb}}^{-1}\) for all i. Invoking (87) we obtain that there exist universal constants \(C_1, C_2, C_3>0 \) such that

$$\begin{aligned} {\mathbb {P}}\left( \sup _i \left| \frac{1}{n} \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{[-i]}^{-1} \tilde{\varvec{X}}_i - \frac{1}{n} \mathrm {Tr}\left( \tilde{\varvec{G}}_{[-i]}^{-1} \right) \right| > C_1 \frac{H_n}{\sqrt{n}} \right) \le C_2 n \exp \left( -C_3 H_n^2 \right) .\qquad \end{aligned}$$
(143)

The next step consists of showing that \(\mathrm {Tr}\big (\tilde{\varvec{G}}_{[-i]}^{-1}\big ) \) (resp. \(\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{[-i]}^{-1}\tilde{\varvec{X}}_i\)) and \(\mathrm {Tr}\big (\tilde{\varvec{G}}_{(i)}^{-1}\big ) \) (resp. \(\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{(i)}^{-1} \tilde{\varvec{X}}_i\)) are uniformly close across all i. This is established in the following lemma.

Lemma 15

Let \(\tilde{\varvec{G}}_{(i)}\) and \(\tilde{\varvec{G}}_{[-i]}\) be defined as in (83) and (85), respectively. Then there exist universal constants \(C_1, C_2, C_3,C_4,c_2,c_3 > 0 \) such that

$$\begin{aligned}&{\mathbb {P}}\left( \sup _i \left| \frac{1}{n} \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{(i)}^{-1} \tilde{\varvec{X}}_i - \frac{1}{n}\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{[-i]}^{-1} \tilde{\varvec{X}}_i \right| \le C_1 \frac{K_n^2H_n}{\sqrt{n}} \right) \nonumber \\&\quad \ge 1-C_2n^2\exp \left( -c_2 H_n^2\right) -C_3n\exp \left( -c_3K_n^2\right) \nonumber \\&\qquad - \exp \left( -C_4n\left( 1+o(1)\right) \right) =1-o(1) , \end{aligned}$$
(144)
$$\begin{aligned}&{\mathbb {P}}\left( \sup _i \left| \frac{1}{n} \mathrm {Tr}\big (\tilde{\varvec{G}}_{(i)}^{-1}\big ) - \frac{1}{n}\mathrm {Tr}\big ( \tilde{\varvec{G}}_{[-i]}^{-1} \big ) \right| \le C_1\frac{K_n^2H_n}{\sqrt{n}} \right) \nonumber \\&\quad \ge 1-C_2n^2\exp \left( -c_2 H_n^2\right) -C_3n\exp \left( -c_3K_n^2\right) \nonumber \\&\qquad - \exp \left( -C_4n\left( 1+o(1)\right) \right) =1-o(1) , \end{aligned}$$
(145)

where \(K_n, H_n\) are diverging sequences as defined in (132).

This together with (143) yields that

$$\begin{aligned}&{\mathbb {P}}\left( \sup _i \left| \frac{1}{n} \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{(i)}^{-1} \tilde{\varvec{X}}_i - \frac{1}{n}\mathrm {Tr}(\tilde{\varvec{G}}_{(i)}^{-1}) \right| > C_1 \frac{K_n^2 H_n}{\sqrt{n}}\right) \nonumber \\&\quad \le ~C_2n^2\exp \left( -c_2 H_n^2\right) + C_3n\exp \left( -c_3K_n^2\right) +\exp \left( -C_4n\left( 1+o(1)\right) \right) .\nonumber \\ \end{aligned}$$
(146)

The final ingredient is to establish that \(\frac{1}{n}\mathrm {Tr}\big (\tilde{\varvec{G}}_{(i)}^{-1}\big )\) and \(\frac{1}{n} \mathrm {Tr}\big (\tilde{\varvec{G}}^{-1}\big )\) are uniformly close across i.

Lemma 16

Let \(\tilde{\varvec{G}}\) and \(\tilde{\varvec{G}}_{(i)}\) be as defined in (82) and (83), respectively. Then one has

$$\begin{aligned} {\mathbb {P}}\left( \left| \mathrm {Tr}\big ( \tilde{\varvec{G}}_{(i)}^{-1} \big ) -\mathrm {Tr}\big ( \tilde{\varvec{G}}^{-1} \big )\right| \le \frac{1}{\lambda _{\mathrm {lb}}} \right) \ge 1- \exp \left( - \Omega ( n) \right) . \end{aligned}$$
(147)

This completes the proof. \(\square \)

Proof of Lemma 15

For two invertible matrices \(\varvec{A}\) and \(\varvec{B}\) of the same dimensions, the difference of their inverses can be written as

$$\begin{aligned} \varvec{A}^{-1} - \varvec{B}^{-1} = \varvec{A}^{-1}(\varvec{B}- \varvec{A}) \varvec{B}^{-1}. \end{aligned}$$

Applying this identity, we have

$$\begin{aligned} \tilde{\varvec{G}}_{(i)}^{-1}-\tilde{\varvec{G}}_{[-i]}^{-1} = \tilde{\varvec{G}}_{(i)}^{-1} \left( \tilde{\varvec{G}}_{[-i]}-\tilde{\varvec{G}}_{(i)}\right) \tilde{\varvec{G}}_{[-i]}^{-1}. \end{aligned}$$
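As a quick numerical sanity check of the inverse-difference identity, the following minimal sketch compares both sides on two random positive definite matrices. The matrices `A` and `B` and the helper `random_spd` are illustrative stand-ins and are not the matrices \(\tilde{\varvec{G}}_{(i)}\) and \(\tilde{\varvec{G}}_{[-i]}\) of the paper. It also checks the submultiplicative norm bound that the identity implies.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def random_spd(rng, d):
    """Return a random symmetric positive definite d x d matrix (illustrative helper)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

A = random_spd(rng, d)
B = random_spd(rng, d)
A_inv, B_inv = np.linalg.inv(A), np.linalg.inv(B)

# Identity: A^{-1} - B^{-1} = A^{-1} (B - A) B^{-1}
lhs = A_inv - B_inv
rhs = A_inv @ (B - A) @ B_inv
identity_error = np.max(np.abs(lhs - rhs))

def spec(M):
    """Spectral norm (largest singular value)."""
    return np.linalg.norm(M, 2)

# Consequence: ||A^{-1} - B^{-1}|| <= ||A^{-1}|| * ||B - A|| * ||B^{-1}||
norm_lhs = spec(lhs)
norm_bound = spec(A_inv) * spec(B - A) * spec(B_inv)
```

The norm bound is the form in which the identity is typically used: once both inverses have bounded spectral norm, controlling \(\Vert \varvec{B}-\varvec{A}\Vert \) controls the difference of the inverses.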

From the definition of these matrices, it follows directly that

$$\begin{aligned} \tilde{\varvec{G}}_{[-i]}-\tilde{\varvec{G}}_{(i)}= \frac{1}{n} \sum _{j: j \ne i} \left( \rho ''\big (\tilde{\varvec{X}}_j^{\top } \tilde{\varvec{\beta }}_{[-i]}\big )- \rho ''\big (\tilde{\varvec{X}}_j^{\top }\tilde{\varvec{\beta }}\big ) \right) \tilde{\varvec{X}}_j\tilde{\varvec{X}}_j^{\top } . \end{aligned}$$
(148)

As \(\rho '''\) is bounded, by the mean-value theorem, it suffices to control the differences \(\tilde{\varvec{X}}_j^{\top } \tilde{\varvec{\beta }}_{[-i]}- \tilde{\varvec{X}}_j^{\top } \tilde{\varvec{\beta }}\) uniformly across all j. This is established in the following lemma, the proof of which is deferred to Appendix H.

Lemma 17

Let \(\hat{\varvec{\beta }}\) be the full model MLE and \(\hat{\varvec{\beta }}_{[-i]}\) be the MLE when the \(i{\text {th}}\) observation is dropped. Let \(q_i\) be as described in Lemma 18 and \(K_n, H_n\) be as in (132). Then there exist universal constants \(C_1,C_2,C_3,C_4,c_2,c_3>0\) such that

$$\begin{aligned}&{\mathbb {P}}\left( \sup _{j \ne i}\left| \varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]}- \varvec{X}_j^{\top }\hat{\varvec{\beta }}\right| \le C_1 \frac{K_n^2 H_n}{\sqrt{n}} \right) \nonumber \\&\quad \ge 1-C_2n\exp \left( -c_2 H_n^2\right) -C_3\exp \left( -c_3K_n^2\right) \nonumber \\&\qquad - \exp \left( -C_4n\left( 1+o(1)\right) \right) =1-o(1), \end{aligned}$$
(149)
$$\begin{aligned}&{\mathbb {P}}\left( \sup _i |\varvec{X}_i^{\top }\hat{\varvec{\beta }}- \mathsf {prox}_{q_i \rho }(\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]})| \le C_1 \frac{K_n^2H_n}{\sqrt{n}}\right) \nonumber \\&\quad \ge 1- C_2n\exp \left( -c_2 H_n^2\right) -C_3\exp \left( -c_3K_n^2\right) \nonumber \\&\qquad - \exp \left( -C_4n\left( 1+o(1)\right) \right) =1-o(1) . \end{aligned}$$
(150)

Invoking this lemma, we see that the spectral norm of (148) is bounded above by some constant times

$$\begin{aligned} \frac{ K_n^2 H_n}{ \sqrt{n} } \Big \Vert \sum _{j: j \ne i} \tilde{\varvec{X}}_j \tilde{\varvec{X}}_j^{\top }/n \Big \Vert \end{aligned}$$

with high probability as specified in (149). From Lemma 2, the spectral norm here is bounded by some constant with probability at least \(1-c_1 \exp (-c_2n)\). These observations together with (87) and the fact that on \(\mathcal {A}_n\) the minimum eigenvalues of \(\tilde{\varvec{G}}_{(i)}\) and \(\tilde{\varvec{G}}_{[-i]}\) are bounded below by \(\lambda _{\mathrm {lb}}\) yield that

$$\begin{aligned} {\mathbb {P}}\left( \big \Vert \tilde{\varvec{G}}_{(i)}^{-1} -\tilde{\varvec{G}}_{[-i]}^{-1} \big \Vert \le C_1 \frac{K_n^2 H_n}{\sqrt{n}} \right)\ge & {} 1 - C_2n\exp \left( -c_2 H_n^2\right) -C_3\exp \left( -c_3K_n^2\right) \\&- \exp \left( -C_4n\left( 1+o(1)\right) \right) . \end{aligned}$$

This is true for any i. Hence, taking the union bound we obtain

$$\begin{aligned}&{\mathbb {P}}\left( \sup _i \big \Vert \tilde{\varvec{G}}_{(i)}^{-1} -\tilde{\varvec{G}}_{[-i]}^{-1} \big \Vert \le C_1 \frac{K_n^2 H_n}{\sqrt{n}} \right) \nonumber \\&\quad \ge 1 - C_2n^2\exp \left( -c_2 H_n^2\right) -C_3n\exp \left( -c_3K_n^2\right) - \exp \left( -C_4n\left( 1+o(1)\right) \right) .\nonumber \\ \end{aligned}$$
(151)

In order to establish the first result, note that

$$\begin{aligned} \sup _i\frac{1}{n} \left| \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{(i)}^{-1} \tilde{\varvec{X}}_i - \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{[-i]}^{-1} \tilde{\varvec{X}}_i \right| \le \sup _i \frac{\Vert \tilde{\varvec{X}}_i \Vert ^2}{n} \sup _i\Vert \tilde{\varvec{G}}_{(i)}^{-1} -\tilde{\varvec{G}}_{[-i]}^{-1} \Vert . \end{aligned}$$

To obtain the second result, note that

$$\begin{aligned} \sup _i \left| \frac{1}{n} \mathrm {Tr}(\tilde{\varvec{G}}_{(i)}^{-1}) - \frac{1}{n}\mathrm {Tr}(\tilde{\varvec{G}}_{[-i]}^{-1}) \right| \le \frac{p-1}{n} \sup _i\Vert \tilde{\varvec{G}}_{(i)}^{-1} -\tilde{\varvec{G}}_{[-i]}^{-1} \Vert . \end{aligned}$$

Therefore, combining (151) and Lemma 2 gives the desired result. \(\square \)
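The two deterministic inequalities used in this proof, namely that a quadratic form is bounded by \(\Vert \varvec{x}\Vert ^2\) times the spectral norm and that a trace is bounded by the dimension times the spectral norm, can be checked numerically. The sketch below uses arbitrary symmetric matrices `A` and `B` as stand-ins for the two inverse matrices; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6

# Two arbitrary symmetric matrices playing the role of the two inverses.
M1 = rng.standard_normal((d, d))
M2 = rng.standard_normal((d, d))
A = (M1 + M1.T) / 2
B = (M2 + M2.T) / 2
x = rng.standard_normal(d)

spec_norm = np.linalg.norm(A - B, 2)  # spectral norm of the difference

# |x^T A x - x^T B x| <= ||x||^2 * ||A - B||
quad_gap = abs(x @ A @ x - x @ B @ x)
quad_bound = (x @ x) * spec_norm

# |Tr(A) - Tr(B)| <= d * ||A - B||  (d plays the role of p - 1)
trace_gap = abs(np.trace(A) - np.trace(B))
trace_bound = d * spec_norm
```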

Proof of Lemma 16

We restrict ourselves to the event \(\mathcal {A}_n\) throughout. Recalling (135), one has

$$\begin{aligned} \mathrm {Tr}(\tilde{\varvec{G}}_{(i)}^{-1}) - \mathrm {Tr}(\tilde{\varvec{G}}^{-1})&= \frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n} \frac{\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-2} \tilde{\varvec{X}}_i}{1+ \frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i}. \end{aligned}$$

In addition, on \(\mathcal {A}_n\) we have

$$\begin{aligned} \frac{1}{\lambda _{\mathrm {lb}}} \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{(i)}^{-1} \tilde{\varvec{X}}_i - \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}_{(i)}^{-2} \tilde{\varvec{X}}_i&= \frac{1}{\lambda _{\mathrm {lb}}} \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\left( \tilde{\varvec{G}}_{(i)}- \lambda _{\mathrm {lb}} \varvec{I}\right) \tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i \ge 0. \end{aligned}$$

Combining these results and recognizing that \(\rho ''>0\), we get

$$\begin{aligned} \left| \mathrm {Tr}(\tilde{\varvec{G}}_{(i)}^{-1}) - \mathrm {Tr}(\tilde{\varvec{G}}^{-1}) \right| \le \frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\frac{\frac{1}{\lambda _{\mathrm {lb}}}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i}{1+ \frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i} \le \frac{1}{\lambda _{\mathrm {lb}} } \end{aligned}$$
(152)

as claimed. \(\square \)
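The argument above rests on a rank-one (Sherman-Morrison) update. A minimal numerical sketch follows, assuming that \(\tilde{\varvec{G}}\) is a weighted Gram matrix \(\frac{1}{n}\sum _j w_j \varvec{x}_j\varvec{x}_j^{\top }\) and that \(\tilde{\varvec{G}}_{(i)}\) drops the \(i\)th summand; the weights `w` are hypothetical stand-ins for the values \(\rho ''(\cdot )\), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 8
X = rng.standard_normal((n, d))
w = rng.uniform(0.05, 0.25, size=n)    # hypothetical stand-ins for rho''(.) in (0, 1/4]

G = (X.T * w) @ X / n                  # (1/n) sum_j w_j x_j x_j^T
i = 0
x_i, a_i = X[i], w[i] / n
G_i = G - a_i * np.outer(x_i, x_i)     # drop the i-th summand
G_i_inv = np.linalg.inv(G_i)

# Left-hand side: difference of traces of the inverses.
lhs = np.trace(G_i_inv) - np.trace(np.linalg.inv(G))

# Right-hand side predicted by the Sherman-Morrison formula.
q1 = x_i @ G_i_inv @ x_i               # x^T G_i^{-1} x
q2 = x_i @ G_i_inv @ G_i_inv @ x_i     # x^T G_i^{-2} x
rhs = a_i * q2 / (1.0 + a_i * q1)

lam_min = np.linalg.eigvalsh(G_i)[0]   # smallest eigenvalue of G_i
```

Here `lam_min` plays the role of \(\lambda _{\mathrm {lb}}\): the trace difference is nonnegative and never exceeds `1 / lam_min`, regardless of how large \(\Vert \varvec{x}_i\Vert \) is.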

Proof of Lemma 11

Again, we restrict ourselves to the event \(\mathcal {A}_n\) on which \(\tilde{\varvec{G}}\succeq \lambda _{\mathrm {lb}}\varvec{I}\). Note that

$$\begin{aligned} \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1}\varvec{w}= \frac{1}{n} \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}\varvec{X}_{\cdot 1}. \end{aligned}$$

Note that \(\{\tilde{\varvec{G}}, \tilde{\varvec{X}}\}\) and \(\varvec{X}_{\cdot 1}\) are independent. Conditional on \(\tilde{\varvec{X}}\), the left-hand side is Gaussian with mean zero and variance \(\frac{1}{n^2}\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}^2 \tilde{\varvec{X}}\tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}_i\). The variance is bounded above by

$$\begin{aligned} \sigma _{X}^2&:=\frac{1}{n^2}\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}^2 \tilde{\varvec{X}}\tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}_i \le \sup _i\big |\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})\big | \cdot \frac{1}{n^2}\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}^{\top }\varvec{D}_{\tilde{\varvec{\beta }}}\tilde{\varvec{X}}\tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}_i \nonumber \\&= \frac{1}{n} \sup _i \big | \rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})\big | \cdot \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}^{-1} \tilde{\varvec{X}}_i \lesssim \frac{1 }{n} \Vert \tilde{\varvec{X}}_i\Vert ^2 \end{aligned}$$
(153)
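The simplification in (153) uses only that \(\tilde{\varvec{G}}\) is the weighted Gram matrix built from the same design and the same diagonal weights. The sketch below checks this algebra on synthetic data, with `w` a hypothetical stand-in for the diagonal of \(\varvec{D}_{\tilde{\varvec{\beta }}}\).

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 4
X = rng.standard_normal((n, d))
w = rng.uniform(0.05, 0.25, size=n)    # hypothetical stand-ins for rho''(.)

G = (X.T * w) @ X / n                  # G = (1/n) X^T D X
x = X[0]
u = np.linalg.solve(G, x)              # u = G^{-1} x
Xu = X @ u

# (1/n^2) x^T G^{-1} X^T D^2 X G^{-1} x : the exact conditional variance
var_exact = np.sum(w**2 * Xu**2) / n**2
# (1/n^2) x^T G^{-1} X^T D X G^{-1} x : after pulling out sup |rho''|
var_mid = np.sum(w * Xu**2) / n**2
# (1/n) x^T G^{-1} x : the simplified form
var_simplified = x @ u / n
```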

In turn, Lemma 2 asserts that \(n^{-1} \Vert \tilde{\varvec{X}}_i\Vert ^2\) is bounded by a constant with high probability. As a result, applying Gaussian concentration results [60, Theorem 2.1.12] gives

$$\begin{aligned} |\tilde{\varvec{X}}_i ^{\top } \tilde{\varvec{G}}^{-1} \varvec{w}| \lesssim H_n \end{aligned}$$

with probability exceeding \(1-C\exp \left( -cH_n^2\right) \), where \( C, c >0\) are universal constants.

In addition, \(\sup _i|X_{i1}| \lesssim H_n\) holds with probability exceeding \(1-C\exp \left( -cH_n^2\right) \). Putting the above results together, applying the triangle inequality \(|X_{i1}-\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1}\varvec{w}|\le |X_{i1}|+ |\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{G}}^{-1}\varvec{w}|\), and taking the union bound, we obtain

$$\begin{aligned} {\mathbb {P}}\left( \sup _{1\le i \le n}|X_{i1} - \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}^{-1}\varvec{w}| \lesssim H_n\right) \ge 1-Cn\exp \left( -cH_n^2\right) = 1-o(1). \end{aligned}$$

Proof of Lemma 17

The goal of this section is to prove Lemma 17, which relates the full-model MLE \(\hat{\varvec{\beta }}\) and the MLE \(\hat{\varvec{\beta }}_{[-i]}\). To this end, we establish the key lemma below.

Lemma 18

Let \(\hat{\varvec{\beta }}_{[-i]}\) denote the MLE when the \(i{\text {th}}\) observation is dropped. Further let \(\varvec{G}_{[-i]}\) be as in (84), and define \(q_i\) and \(\hat{\varvec{b}}\) as follows:

$$\begin{aligned} q_i&=\frac{1}{n} \varvec{X}_i^{\top }\varvec{G}_{[-i]}^{-1} \varvec{X}_i ; \nonumber \\ \hat{\varvec{b}}&= \hat{\varvec{\beta }}_{[-i]}- \frac{1}{n} \varvec{G}_{[-i]}^{-1} \varvec{X}_i \left( \rho ' \Big ( \mathsf {prox}_{q_i \rho } \big (\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}\big ) \Big ) \right) . \end{aligned}$$
(154)

Suppose \(K_n, H_n\) are diverging sequences as in (132). Then there exist universal constants \(C_1, C_2, C_3, C_4, c_2, c_3 > 0\) such that

$$\begin{aligned}&{\mathbb {P}}\left( \Vert \hat{\varvec{\beta }}- \hat{\varvec{b}}\Vert \le C_1 \frac{K_n^2 H_n}{n} \right) \ge 1-C_2n\exp (-c_2 H_n^2)\nonumber \\&\qquad -\,C_3\exp (-c_3K_n^2)- \exp (-C_4n(1+o(1))); \end{aligned}$$
(155)
$$\begin{aligned}&{\mathbb {P}}\left( \sup _{j \ne i} \big | \varvec{X}_j^{\top } \hat{\varvec{\beta }}_{[-i]}- \varvec{X}_j^{\top } \hat{\varvec{b}}\big | \le C_1 \frac{K_n H_n}{\sqrt{n}} \right) \nonumber \\&\quad \ge 1-C_2n\exp \left( -c_2 H_n^2\right) -C_3\exp \left( -c_3K_n^2\right) \nonumber \\&\qquad -\, \exp \left( -C_4n\left( 1+o(1)\right) \right) . \end{aligned}$$
(156)

The proof ideas are inspired by the leave-one-observation-out approach of [25]. We however emphasize once more that the adaptation of these ideas to our setup is not straightforward and crucially hinges on Theorem 4, Lemma 7 and properties of the effective link function.

Proof of Lemma 18

Invoking techniques similar to those used for establishing Lemma 7, it can be shown that

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n\rho ''(\gamma _i^{*}) \varvec{X}_i \varvec{X}_i^{\top } \succeq \lambda _{\mathrm {lb}} \varvec{I} \end{aligned}$$
(157)

with probability at least \(1- \exp (-\Omega (n))\), where \(\gamma _i^{*}\) is between \(\varvec{X}_i^{\top } \hat{\varvec{b}}\) and \(\varvec{X}_i^{\top } \hat{\varvec{\beta }}\). Denote by \(\mathcal {B}_n\) the event where (157) holds. Throughout this proof, we work on the event \(\mathcal {C}_n :=\mathcal {A}_n \cap \mathcal {B}_n\), which has probability \(1- \exp \left( -\Omega (n)\right) \). As in (107) then,

$$\begin{aligned} \Vert \hat{\varvec{\beta }}- \hat{\varvec{b}}\Vert \le \frac{1}{n\lambda _{\mathrm {lb}}} \big \Vert {\nabla \ell (\hat{\varvec{b}})} \big \Vert . \end{aligned}$$
(158)

Next, we simplify (158). To this end, recall the defining relation of the proximal operator

$$\begin{aligned} b\rho '(\mathsf {prox}_{b\rho }(z)) + \mathsf {prox}_{b\rho }(z) = z, \end{aligned}$$

which together with the definitions of \(\hat{\varvec{b}}\) and \(q_i\) gives

$$\begin{aligned} \varvec{X}_i^{\top } \hat{\varvec{b}}= \mathsf {prox}_{q_i \rho } \left( \varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}\right) . \end{aligned}$$
(159)
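To make the defining relation of the proximal operator and identity (159) concrete, here is a minimal sketch under stated assumptions: the logistic choice \(\rho (t)=\log (1+e^t)\), a synthetic Gaussian design, an arbitrary vector `beta_loo` standing in for \(\hat{\varvec{\beta }}_{[-i]}\) (it is not an actual MLE), and \(\varvec{G}_{[-i]}\) taken to be the weighted Gram matrix over \(j\ne i\). The helper `prox_logistic` is a hypothetical name; it solves \(b\rho '(x)+x=z\) by bisection.

```python
import numpy as np

def sigmoid(t):
    """rho'(t) for rho(t) = log(1 + e^t), written in a numerically stable form."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def prox_logistic(z, b, iters=200):
    """Solve b * rho'(x) + x = z by bisection; the root lies in [z - b, z]."""
    z = np.asarray(z, dtype=float)
    lo, hi = z - b, z.copy()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_small = mid + b * sigmoid(mid) < z
        lo = np.where(too_small, mid, lo)
        hi = np.where(too_small, hi, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(3)
n, d = 40, 5
X = rng.standard_normal((n, d))
beta_loo = rng.standard_normal(d) / np.sqrt(d)   # stand-in for the leave-one-out MLE

i = 0
mask = np.arange(n) != i
s = sigmoid(X[mask] @ beta_loo)
G_loo = (X[mask].T * (s * (1.0 - s))) @ X[mask] / n   # assumed form of G_[-i]

x_i = X[i]
G_inv_x = np.linalg.solve(G_loo, x_i)
q_i = x_i @ G_inv_x / n
z_i = x_i @ beta_loo
p_i = float(prox_logistic(z_i, q_i))

# b_hat as in (154); then x_i^T b_hat should equal prox_{q_i rho}(x_i^T beta_loo).
b_hat = beta_loo - G_inv_x * sigmoid(p_i) / n
```

Bisection is used here only because the map \(x \mapsto x + b\rho '(x)\) is strictly increasing, which makes the bracket \([z-b, z]\) valid for every \(b \ge 0\).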

Now, let \(\ell _{[-i]}\) denote the negative log-likelihood function when the \(i{\text {th}}\) observation is dropped, and hence \(\nabla \ell _{[-i]}\big ( \hat{\varvec{\beta }}_{[-i]}\big ) = \varvec{0}\). Expressing \(\nabla \ell (\hat{\varvec{b}})\) as \(\nabla \ell (\hat{\varvec{b}})-\nabla \ell _{[-i]}\big ( \hat{\varvec{\beta }}_{[-i]}\big )\), applying the mean value theorem, and using an analysis similar to that in [25, Proposition 3.4], we obtain

$$\begin{aligned} \frac{1}{n}\nabla \ell (\hat{\varvec{b}}) = \frac{1}{n} \sum _{j: j \ne i } \left[ \rho ''(\gamma _j^{*})-\rho ''(\varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]})\right] \varvec{X}_j \varvec{X}_j^{\top } \left( \hat{\varvec{b}}- \hat{\varvec{\beta }}_{[-i]}\right) , \end{aligned}$$
(160)

where \(\gamma _j^{*}\) is between \(\varvec{X}_j^{\top }\hat{\varvec{b}}\) and \(\varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]}\). Combining (158) and (160) leads to the upper bound

$$\begin{aligned} \Vert \hat{\varvec{\beta }}- \hat{\varvec{b}}\Vert\le & {} \frac{1}{\lambda _{\mathrm {lb}}} \left\| \frac{1}{n} \sum _{j: j \ne i}\varvec{X}_j \varvec{X}_j ^{\top } \right\| \cdot \sup _{j \ne i} \Big |\rho ''(\gamma _j^{*})-\rho ''\big (\varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]}\big )\Big | \cdot \Bigg \Vert \frac{1}{n} \varvec{G}_{[-i]}^{-1} \varvec{X}_i \Bigg \Vert \nonumber \\&\cdot \left| \rho '\left( \mathsf {prox}_{q_i \rho }(\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}) \right) \right| . \end{aligned}$$
(161)

We need to control each term in the right-hand side. To start with, the first term is bounded by a universal constant with probability \(1- \exp (-\Omega (n))\) (Lemma 2). For the second term, since \(\gamma _{j}^{*}\) is between \(\varvec{X}_j^{\top }\hat{\varvec{b}}\) and \(\varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]}\) and \(\Vert \rho '''\Vert _\infty <\infty \), we get

$$\begin{aligned}&\sup _{j \ne i} \big |\rho ''(\gamma _j^{*})-\rho ''(\varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]})\big | \le \Vert \rho ''' \Vert _{\infty } \Vert \varvec{X}_j^{\top }\hat{\varvec{b}}- \varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]}\Vert \end{aligned}$$
(162)
$$\begin{aligned}&\quad \le \Vert \rho ''' \Vert _{\infty } \left| \frac{1}{n} \varvec{X}_j^{\top } \varvec{G}_{[-i]}^{-1} \varvec{X}_i \rho ' \Big ( \mathsf {prox}_{q_i \rho } \big (\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}\big ) \Big ) \right| \end{aligned}$$
(163)
$$\begin{aligned}&\quad \le \Vert \rho ''' \Vert _{\infty } \frac{1}{n}\sup _{j \ne i} \left| \varvec{X}_j^{\top }\varvec{G}_{[-i]}^{-1}\varvec{X}_i \right| \cdot \left| \rho ' \left( \mathsf {prox}_{q_i \rho }(\varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}) \right) \right| . \end{aligned}$$
(164)

Given that \(\{\varvec{X}_j, \varvec{G}_{[-i]}\}\) and \(\varvec{X}_i\) are independent for all \(j \ne i\), conditional on \(\{\varvec{X}_j,\varvec{G}_{[-i]}\}\) one has

$$\begin{aligned} \varvec{X}_j^{\top }\varvec{G}_{[-i]}^{-1}\varvec{X}_i \sim {\mathcal {N}}\left( 0, \varvec{X}_j^{\top }\varvec{G}_{[-i]}^{-2}\varvec{X}_j \right) . \end{aligned}$$

In addition, the variance satisfies

$$\begin{aligned} |\varvec{X}_j^{\top }\varvec{G}_{[-i]}^{-2} \varvec{X}_j| \le \frac{\Vert \varvec{X}_j \Vert ^2}{\lambda _{\mathrm {lb}}^2 } \lesssim n \end{aligned}$$
(165)

with probability at least \(1-\exp (-\Omega (n))\). Applying standard Gaussian concentration results [60, Theorem 2.1.12], we obtain

$$\begin{aligned} {\mathbb {P}}\left( \frac{1}{\sqrt{p}} \left| \varvec{X}_j ^{\top } \varvec{G}_{[-i]}^{-1} \varvec{X}_i \right| \ge C_1 H_n \right)&\le C_2\exp \left( - c_2 H_n^2 \right) + \exp \left( - C_3 n\left( 1+o(1)\right) \right) . \end{aligned}$$
(166)

By the union bound

$$\begin{aligned} {\mathbb {P}}\left( \frac{1}{\sqrt{p}}\sup _{j \ne i} \big | \varvec{X}_j ^{\top } \varvec{G}_{[-i]}^{-1} \varvec{X}_i \big | \le C_1 H_n \right)\ge & {} 1- nC_2\exp \left( -c_2H_n^2 \right) \nonumber \\&- \exp \left( -C_3 n \left( 1+o(1)\right) \right) . \end{aligned}$$
(167)

Consequently,

$$\begin{aligned} \sup _{j \ne i} \big |\rho ''(\gamma _j^{*})-\rho ''(\varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]})\big |&\lesssim \sup _{j \ne i} \Vert \varvec{X}_j^{\top }\hat{\varvec{b}}- \varvec{X}_j^{\top }\hat{\varvec{\beta }}_{[-i]}\Vert \nonumber \\&\lesssim \frac{1}{\sqrt{n}} H_n \left| \rho ' \left( \mathsf {prox}_{q_i \rho }(\varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}) \right) \right| . \end{aligned}$$
(168)

In addition, the third term in the right-hand side of (161) can be upper bounded as well since

$$\begin{aligned} \frac{1}{n}\Vert \varvec{G}_{[-i]}^{-1} \varvec{X}_i \Vert = \frac{1}{n} \sqrt{|\varvec{X}_i^{\top }\varvec{G}_{[-i]}^{-2} \varvec{X}_i| } \lesssim \frac{1}{\sqrt{n}} \end{aligned}$$
(169)

with high probability.

It remains to bound \(\left| \rho ' \left( \mathsf {prox}_{q_i \rho }(\varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}) \right) \right| \). To do this, we begin by considering \(\rho '(\mathsf {prox}_{c \rho }(Z))\) for any constant \(c>0\) (rather than a random variable \(q_i\)). Recall that for any constant \(c > 0\) and any \(Z \sim {\mathcal {N}}(0,\sigma ^2)\) with finite variance, the random variable \(\rho '(\mathsf {prox}_{c \rho }(Z))\) is sub-Gaussian. Conditional on \(\hat{\varvec{\beta }}_{[-i]}\), one has \(\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}\sim {\mathcal {N}}\big ( 0,\Vert \hat{\varvec{\beta }}_{[-i]}\Vert ^2 \big )\). This yields

$$\begin{aligned} {\mathbb {P}}\left( \rho ' \left( \mathsf {prox}_{c \rho }(\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}) \right) \ge C_1 K_n \right)&\le C_2{{\,\mathrm{{\mathbb {E}}}\,}}\left[ \exp \left( -\frac{C_3^2 K_n^2}{ \Vert \hat{\varvec{\beta }}_{[-i]}\Vert ^2} \right) \right] \nonumber \\&\le C_2 \exp \left( -C_3 K_n^2\right) + C_4\exp \left( -C_5n\right) \nonumber \\ \end{aligned}$$
(170)

for some constants \(C_1,C_2,C_3,C_4, C_5>0\) since \(\Vert \hat{\varvec{\beta }}_{[-i]}\Vert \) is bounded with high probability (see Theorem 4).

Note that \(\frac{\partial \mathsf {prox}_{b\rho }(z)}{\partial b}\le 0\) by [22, Proposition 6.3]. Hence, in order to move over from the above concentration result established for a fixed constant c to the random variables \(q_i\), it suffices to establish a uniform lower bound for \(q_i\) with high probability. Observe that for each i,

$$\begin{aligned} q_i \ge \frac{\Vert \varvec{X}_i \Vert ^{2}}{n} \frac{1}{\big \Vert \varvec{G}_{[-i]}\big \Vert } \ge C^* \end{aligned}$$

with probability \(1-\exp (-\Omega (n))\), where \(C^*\) is some universal constant. On this event, one has

$$\begin{aligned} \rho ' \left( \mathsf {prox}_{q_i \rho } \left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) \right) \le \rho ' \left( \mathsf {prox}_{C^{*} \rho } \left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) \right) . \end{aligned}$$

This taken collectively with (170) yields

$$\begin{aligned} {\mathbb {P}}\left( \rho '(\mathsf {prox}_{q_i \rho }(\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]})) \le C_1 K_n \right)&\ge ~ {\mathbb {P}}\left( \rho '(\mathsf {prox}_{C^* \rho }(\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]})) \le C_1 K_n \right) \end{aligned}$$
(171)
$$\begin{aligned}&\ge ~ 1-C_2 \exp \left( -C_3 K_n^2\right) - C_4 \exp \left( -C_5 n\right) . \end{aligned}$$
(172)

This controls the last term.
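The monotonicity step used to pass from a fixed constant to the random \(q_i\) can be illustrated numerically. The sketch below assumes the logistic choice \(\rho (t)=\log (1+e^t)\) and uses a hypothetical bisection helper `prox_logistic`; the values `b_small` and `b_large` are arbitrary and play the roles of \(C^{*}\) and \(q_i\).

```python
import numpy as np

def sigmoid(t):
    """rho'(t) for the logistic rho(t) = log(1 + e^t)."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def prox_logistic(z, b, iters=200):
    """Solve b * rho'(x) + x = z by bisection on [z - b, z]."""
    z = np.asarray(z, dtype=float)
    lo, hi = z - b, z.copy()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_small = mid + b * sigmoid(mid) < z
        lo = np.where(too_small, mid, lo)
        hi = np.where(too_small, hi, mid)
    return 0.5 * (lo + hi)

z = np.linspace(-8.0, 8.0, 81)
b_small, b_large = 0.5, 2.0            # arbitrary values with b_small <= b_large

prox_small = prox_logistic(z, b_small)
prox_large = prox_logistic(z, b_large)

# prox is nonincreasing in b and rho' is increasing, so rho'(prox) is nonincreasing in b.
rho_prime_small = sigmoid(prox_small)
rho_prime_large = sigmoid(prox_large)
```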

To summarize, if \(\{K_n\}\) and \(\{H_n\}\) are diverging sequences satisfying the assumptions in (132), combining (161) and the bounds for each term in the right-hand side finally gives (155). On the other hand, combining (167) and (172) yields (156). \(\square \)

With the help of Lemma 18 we are ready to prove Lemma 17. Indeed, observe that

$$\begin{aligned} \big |\varvec{X}_j^{\top }(\hat{\varvec{\beta }}_{[-i]}- \hat{\varvec{\beta }}) \big | \le \big | \varvec{X}_j^{\top } (\hat{\varvec{b}}- \hat{\varvec{\beta }}) \big | + \big | \varvec{X}_j^{\top } (\hat{\varvec{\beta }}_{[-i]}- \hat{\varvec{b}}) \big | , \end{aligned}$$

and hence by combining Lemmas 2 and 18, we establish the first claim (149). The second claim (150) follows directly from Lemmas 2 and 18 together with (159).

Proof of Theorem 7(b)

This section proves that the random sequence \(\tilde{\alpha }= \mathrm {Tr}\big (\tilde{\varvec{G}}^{-1}\big )/n\) converges in probability to the constant \(b_{*}\) defined by the system of Eqs. (25) and (26). To begin with, we claim that \(\tilde{\alpha }\) is close to a set of auxiliary random variables \(\{\tilde{q}_i\}\) defined below.

Lemma 19

Define \(\tilde{q}_i\) to be

$$\begin{aligned} \tilde{q}_i =\frac{1}{n} \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{[-i]}^{-1} \tilde{\varvec{X}}_i, \end{aligned}$$

where \(\tilde{\varvec{G}}_{[-i]}\) is defined in (85).

Then there exist universal constants \(C_1, C_2, C_3, C_4, c_2,c_3 >0\) such that

$$\begin{aligned}&{\mathbb {P}}\left( \sup _i |\tilde{q}_i - \tilde{\alpha }| \le C_1 \frac{K_n^2 H_n}{\sqrt{n}} \right) \\&\quad \ge 1-C_2n^2\exp \left( -c_2 H_n^2\right) -C_3n\exp \left( -c_3 K_n^2\right) \\&\qquad -\,\exp \left( -C_4n\left( 1+o(1)\right) \right) =1-o(1), \end{aligned}$$

where \(K_n, H_n\) are as in (132).

Proof

This result follows directly from Proposition 1 and Eq. (144). \(\square \)

A consequence is that \(\mathsf {prox}_{\tilde{q}_{i} \rho }\left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) \) becomes close to \(\mathsf {prox}_{\tilde{\alpha }\rho } \left( \varvec{X}_i ^ {\top } \hat{\varvec{\beta }}_{[-i]}\right) \).

Lemma 20

Let \(\tilde{q}_i\) and \(\tilde{\alpha }\) be as defined earlier. Then one has

$$\begin{aligned}&{\mathbb {P}}\left( \sup _i\left| \mathsf {prox}_{\tilde{q}_{i} \rho }\left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) - \mathsf {prox}_{\tilde{\alpha }\rho } \left( \varvec{X}_i ^ {\top } \hat{\varvec{\beta }}_{[-i]}\right) \right| \le C_1 \frac{K_n^3 H_n}{\sqrt{n}}\right) \nonumber \\&\quad \ge 1- C_2n^2\exp \left( -c_2 H_n^2\right) -C_3n\exp \left( -c_3 K_n^2\right) \nonumber \\&\qquad -\exp \left( -C_4n\left( 1+o(1)\right) \right) =1-o(1), \end{aligned}$$
(173)

where \(K_n, H_n\) are as in (132).

The key idea behind studying \(\mathsf {prox}_{\tilde{\alpha }\rho } \left( \varvec{X}_i ^ {\top } \hat{\varvec{\beta }}_{[-i]}\right) \) is that it is connected to a random function \(\delta _n(\cdot )\) defined below, which is closely related to Eq. (26). In fact, we will show that \(\delta _n(\tilde{\alpha })\) converges in probability to 0; the proof relies on the connection between \(\mathsf {prox}_{\tilde{\alpha }\rho } \left( \varvec{X}_i ^ {\top } \hat{\varvec{\beta }}_{[-i]}\right) \) and the auxiliary quantity \(\mathsf {prox}_{\tilde{q}_{i} \rho }\left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) \). The formal result is as follows.

Proposition 2

For any index i, let \(\hat{\varvec{\beta }}_{[-i]}\) be the MLE obtained on dropping the \(i{\text {th}}\) observation. Define \(\delta _n(x)\) to be the random function

$$\begin{aligned} \delta _n(x) := \frac{p}{n} - 1 + \frac{1}{n} \sum _{i=1}^n \frac{1}{1+x \rho ''\left( \mathsf {prox}_{x \rho }\left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) \right) } . \end{aligned}$$
(174)

Then one has \( \delta _n(\tilde{\alpha }) ~{\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}~ 0\).

Furthermore, the random function \(\delta _n(x)\) converges to a deterministic function \(\Delta (x)\) defined by

$$\begin{aligned} \Delta (x) = \kappa - 1 + {{\,\mathrm{{\mathbb {E}}}\,}}_{Z} \left[ \frac{1}{1+x \rho ''(\mathsf {prox}_{x \rho }(\tau _{*}Z))} \right] , \end{aligned}$$
(175)

where \(Z \sim {\mathcal {N}}(0,1)\), and \(\tau _{*}\) is such that \((\tau _{*},b_{*})\) is the unique solution to (25) and (26).

Proposition 3

With \(\Delta (x) \) as in (175), \(\Delta (\tilde{\alpha }) {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0.\)

In fact, one can easily verify that

$$\begin{aligned} \Delta (x) = \kappa - {\mathbb {E}}\big [\Psi '\left( \tau _{*}Z; \, x\right) \big ], \end{aligned}$$
(176)

and hence by Lemma 5, the solution to \(\Delta (x)=0\) is exactly \(b_{*}\). As a result, putting the above claims together, we show that \(\tilde{\alpha }\) converges in probability to \(b_{*}\).
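The identity (176) and the existence of a root of \(\Delta \) can be explored numerically. The following is a sketch only: it assumes the logistic \(\rho (t)=\log (1+e^t)\), assumes \(\Psi (z;b)=b\rho '(\mathsf {prox}_{b\rho }(z))\) so that \(\Psi '(z;b)=b\rho ''(x)/(1+b\rho ''(x))\) at \(x=\mathsf {prox}_{b\rho }(z)\), and uses placeholder values `kappa` and `tau` that are not the solution \((\tau _{*},b_{*})\) of (25) and (26). Expectations over \(Z\) are approximated by Gauss-Hermite quadrature.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def sigmoid(t):
    """rho'(t) for the logistic rho(t) = log(1 + e^t)."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def prox_logistic(z, b, iters=200):
    """Solve b * rho'(x) + x = z by bisection on [z - b, z]."""
    z = np.asarray(z, dtype=float)
    lo, hi = z - b, z.copy()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_small = mid + b * sigmoid(mid) < z
        lo = np.where(too_small, mid, lo)
        hi = np.where(too_small, hi, mid)
    return 0.5 * (lo + hi)

# Quadrature rule for E[f(Z)], Z ~ N(0, 1): weights normalized to sum to one.
nodes, weights = hermegauss(80)
weights = weights / np.sqrt(2.0 * np.pi)

def rho2_at_prox(x, tau):
    """rho''(prox_{x rho}(tau * Z)) evaluated at the quadrature nodes."""
    s = sigmoid(prox_logistic(tau * nodes, x))
    return s * (1.0 - s)

def Delta(x, kappa, tau):
    """Form (175): kappa - 1 + E[1 / (1 + x rho''(prox))]."""
    return kappa - 1.0 + np.sum(weights / (1.0 + x * rho2_at_prox(x, tau)))

def Delta_via_psi(x, kappa, tau):
    """Form (176): kappa - E[Psi'(tau Z; x)], under the assumed definition of Psi."""
    r = x * rho2_at_prox(x, tau)
    return kappa - np.sum(weights * r / (1.0 + r))

kappa, tau = 0.2, 1.5                  # placeholder values, not (kappa, tau_*) from the paper
xs = np.array([0.0, 0.5, 1.0, 2.0, 5.0])
gap = max(abs(Delta(x, kappa, tau) - Delta_via_psi(x, kappa, tau)) for x in xs)
delta_at_zero = Delta(0.0, kappa, tau)
delta_at_large = Delta(1e6, kappa, tau)
```

Since \(\Delta (0)=\kappa >0\) and \(\Delta \) becomes negative for large \(x\) in this example, continuity gives a root in between. Identifying that root with \(b_{*}\) requires the true \(\tau _{*}\) and Lemma 5, which this sketch does not reproduce.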

It remains to formally prove the preceding lemmas and propositions, which is the goal of the rest of this section.

Proof of Lemma 20

By [22, Proposition 6.3], one has

$$\begin{aligned} \frac{\partial \mathsf {prox}_{b\rho }(z)}{\partial b}=-\left. \frac{\rho '(x)}{1+b\rho ''(x)}\right| _{x=\mathsf {prox}_{b\rho }(z)}, \end{aligned}$$

which yields

$$\begin{aligned}&\sup _i\left| \mathsf {prox}_{\tilde{q}_i \rho }\left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) - \mathsf {prox}_{\tilde{\alpha }\rho }\left( \varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\right) \right| \nonumber \\&\qquad = \sup _i \left[ \left| \left. \frac{\rho '(x)}{1+q_{\tilde{\alpha },i}\rho ''(x)} \right| _{x=\mathsf {prox}_{q_{\tilde{\alpha },i}\rho }\big (\varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\big )} \right| \cdot |\tilde{q}_i - \tilde{\alpha }| \right] \nonumber \\&\qquad \le \sup _i \left| \rho '\left( \mathsf {prox}_{q_{\tilde{\alpha },i}\rho }(\varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}) \right) \right| \cdot \sup _i |\tilde{q}_i - \tilde{\alpha }|, \end{aligned}$$
(177)

where \(q_{\tilde{\alpha },i}\) is between \(\tilde{q}_{i}\) and \(\tilde{\alpha }\). Here, the last inequality holds since \(q_{\tilde{\alpha },i}, \rho ''\ge 0 \).
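The derivative formula from [22, Proposition 6.3] and the final inequality in (177) can be checked by finite differences. The sketch assumes the logistic \(\rho (t)=\log (1+e^t)\); `prox_logistic` is a hypothetical bisection helper, and the values of `b` and `h` are arbitrary.

```python
import numpy as np

def sigmoid(t):
    """rho'(t) for the logistic rho(t) = log(1 + e^t)."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def prox_logistic(z, b, iters=200):
    """Solve b * rho'(x) + x = z by bisection on [z - b, z]."""
    z = np.asarray(z, dtype=float)
    lo, hi = z - b, z.copy()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_small = mid + b * sigmoid(mid) < z
        lo = np.where(too_small, mid, lo)
        hi = np.where(too_small, hi, mid)
    return 0.5 * (lo + hi)

z = np.linspace(-6.0, 6.0, 25)
b, h = 0.8, 1e-5                       # arbitrary evaluation point and step size

x = prox_logistic(z, b)
s = sigmoid(x)                         # rho'(x)
rho2 = s * (1.0 - s)                   # rho''(x)

# Closed form: d/db prox_{b rho}(z) = -rho'(x) / (1 + b rho''(x)) at x = prox_{b rho}(z)
analytic = -s / (1.0 + b * rho2)
# Central finite difference in b
numeric = (prox_logistic(z, b + h) - prox_logistic(z, b - h)) / (2.0 * h)
fd_error = np.max(np.abs(analytic - numeric))
```

Because \(b\rho ''\ge 0\), the magnitude of the derivative never exceeds \(\rho '(x)\), which is exactly what the last inequality in (177) uses.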

In addition, just as in the proof of Lemma 18, one can show that \(\tilde{q}_{i}\) is bounded below by some constant \(C^{*}>0\) with probability \(1- \exp (- \Omega ( n) )\). Since \(q_{\tilde{\alpha },i} \ge \min \{\tilde{q}_i,\tilde{\alpha }\} \), on the event \(\sup _{i}|\tilde{q}_i - \tilde{\alpha }| \le C_1 K_n^2 H_n /\sqrt{n}\), which happens with high probability (Lemma 19), \(q_{\tilde{\alpha },i} \ge C_{\alpha }\) for some universal constant \(C_\alpha >0\). Hence, by an argument similar to that establishing (172), we have

$$\begin{aligned}&{\mathbb {P}}\left( \sup _i \left| \rho '\left( \mathsf {prox}_{q_{\tilde{\alpha },i}\rho } \left( \varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}\right) \right) \right| \ge C_1 K_n \right) \\&\quad \le C_2n^2 \exp \left( -c_2 H_n^2\right) +C_3n\exp \left( - c_3 K_n^2\right) +\exp \left( -C_4n\left( 1+o(1)\right) \right) . \end{aligned}$$

This together with (177) and Lemma 19 concludes the proof. \(\square \)

Proof of Proposition 2

To begin with, recall from (138) and (139) that on \(\mathcal {A}_n\),

$$\begin{aligned} \frac{p-1}{n} = \frac{1}{n} \sum _{i=1}^n \frac{\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i}{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i} = 1- \frac{1}{n} \sum _{i=1}^n \frac{1}{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i} .\nonumber \\ \end{aligned}$$
(178)
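The leave-one-out identity behind (178) is purely algebraic: writing \(a_i\) for the \(i\)th weighted quadratic form, the Sherman-Morrison formula gives \(\sum _i a_i/(1+a_i)=\mathrm {Tr}(\tilde{\varvec{G}}^{-1}\tilde{\varvec{G}})\), which is the dimension. The sketch below checks this on synthetic data, with `d` playing the role of \(p-1\), hypothetical weights `w` in place of \(\rho ''(\cdot )\), and \(\tilde{\varvec{G}}_{(i)}\) assumed to be \(\tilde{\varvec{G}}\) with its \(i\)th summand removed.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 60, 7
X = rng.standard_normal((n, d))
w = rng.uniform(0.05, 0.25, size=n)    # hypothetical stand-ins for rho''(.)

G = (X.T * w) @ X / n                  # (1/n) sum_i w_i x_i x_i^T

# a_i = (w_i / n) * x_i^T G_(i)^{-1} x_i, where G_(i) drops the i-th summand.
a = np.empty(n)
for i in range(n):
    G_i = G - (w[i] / n) * np.outer(X[i], X[i])
    a[i] = (w[i] / n) * (X[i] @ np.linalg.solve(G_i, X[i]))

ratio = d / n                          # plays the role of (p - 1) / n
mean_fraction = np.mean(a / (1.0 + a)) # (1/n) sum_i a_i / (1 + a_i)
one_minus_mean = 1.0 - np.mean(1.0 / (1.0 + a))
```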

Using the fact that \(\big | \frac{1}{1+x} - \frac{1}{1+y} \big | \le |x-y|\) for \(x, y \ge 0\), we obtain

$$\begin{aligned}&\left| \frac{1}{n} \sum _{i=1}^n \frac{1}{1+\frac{\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }})}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i} - \frac{1}{n} \sum _{i=1}^n \frac{1}{1+\rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }}) \tilde{\alpha }} \right| \\&\quad \le \frac{1}{n} \sum _{i=1}^n \rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }}) \left| \frac{1}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i - \tilde{\alpha }\right| ~\le ~ \Vert \rho '' \Vert _{\infty } \sup _i \left| \frac{1}{n}\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{G}}_{(i)}^{-1}\tilde{\varvec{X}}_i - \tilde{\alpha }\right| \\&\quad = \Vert \rho '' \Vert _{\infty } \sup _i |\eta _i| \le C_1 \frac{K_n^2 H_n}{\sqrt{n}}, \end{aligned}$$

with high probability (Proposition 1). This combined with (178) yields

$$\begin{aligned}&{\mathbb {P}}\left( \left| \frac{p-1}{n} - 1 +\frac{1}{n} \sum _{i=1}^n \frac{1}{1+ \rho ''(\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{\beta }}) \tilde{\alpha }} \right| \ge C_1 \frac{K_n^2 H_n}{\sqrt{n}} \right) \\&\quad \le C_2n^2 \exp \left( -c_2 H_n^2\right) +C_3n\exp \left( - c_3 K_n^2\right) +\exp \left( -C_4n\left( 1+o(1)\right) \right) . \end{aligned}$$

The above bound concerns \(\frac{1}{n} \sum _{i=1}^n \frac{1}{1+ \rho ''(\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{\beta }}) \tilde{\alpha }} \), and it remains to relate it to \(\frac{1}{n} \sum _{i=1}^n \frac{1}{1+ \rho ''\left( \mathsf {prox}_{\tilde{\alpha }\rho } \left( \tilde{\varvec{X}}_i^{\top } \tilde{\varvec{\beta }}_{[-i]}\right) \right) \tilde{\alpha }} \). To this end, we first get from the uniform boundedness of \(\rho '''\) and Lemma 17 that

$$\begin{aligned}&{\mathbb {P}}\left( \sup _i \left| \rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }}) - \rho ''\left( \mathsf {prox}_{\tilde{q}_i \rho }(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }}_{[-i]}) \right) \right| \ge C_1 \frac{K_n^2 H_n}{\sqrt{n}} \right) \nonumber \\&\quad \le C_2n \exp (-c_2 H_n^2)+C_3\exp (- c_3 K_n^2) +\exp (-C_4n(1+o(1))). \end{aligned}$$
(179)

Note that

$$\begin{aligned}&\left| \frac{1}{n}\sum _{i=1}^n \frac{1}{1+ \rho ''(\tilde{\varvec{X}}_i^{\top } \tilde{\varvec{\beta }}) \tilde{\alpha }} -\frac{1}{n}\sum _{i=1}^n\frac{1}{1+ \rho ''(\mathsf {prox}_{\tilde{\alpha }\rho }( \tilde{\varvec{X}}_i ^{\top } \tilde{\varvec{\beta }}_{[-i]})) \tilde{\alpha }} \right| \\&\quad \le |\tilde{\alpha }| \sup _i \left| \rho ''(\tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }}) - \rho ''\left( \mathsf {prox}_{\tilde{\alpha }\rho }( \tilde{\varvec{X}}_i ^{\top } \tilde{\varvec{\beta }}_{[-i]}) \right) \right| \\&\quad \le |\tilde{\alpha }| \sup _i \left\{ \left| \rho ''\left( \tilde{\varvec{X}}_i^{\top }\tilde{\varvec{\beta }}\right) - \rho ''\left( \mathsf {prox}_{\tilde{q}_i \rho }( \tilde{\varvec{X}}_i ^{\top } \tilde{\varvec{\beta }}_{[-i]}) \right) \right| \right. \\&\qquad \left. + \left| \rho '' \left( \mathsf {prox}_{\tilde{q}_i \rho }( \tilde{\varvec{X}}_i ^{\top } \tilde{\varvec{\beta }}_{[-i]}) \right) - \rho '' \left( \mathsf {prox}_{\tilde{\alpha }\rho }( \tilde{\varvec{X}}_i ^{\top } \tilde{\varvec{\beta }}_{[-i]}) \right) \right| \right\} . \end{aligned}$$

By the bound (179), an application of Lemma 20, and the fact that \(\tilde{\alpha }\le p/(n\lambda _{\mathrm {lb}})\) (on \(\mathcal {A}_n\)), we obtain

$$\begin{aligned}&{\mathbb {P}}\left( \left| \frac{p}{n} -1 + \frac{1}{n} \sum _{i=1}^n \frac{1}{1+ \rho '' \big ( \mathsf {prox}_{\tilde{\alpha }\rho }( \varvec{X}_i ^{\top } \hat{\varvec{\beta }}_{[-i]}) \big ) \tilde{\alpha }} \right| \ge C_1 \frac{K_n^3H_n}{\sqrt{n}} \right) \\&\quad \le C_2n^2 \exp \big (-c_2 H_n^2\big )+C_3n\exp \big (- c_3 K_n^2\big ) +\exp \big (-C_4n(1+o(1))\big ). \end{aligned}$$

This establishes that \(\delta _n(\tilde{\alpha }) {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0\). \(\square \)
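To make the quantity \(\delta _n(x)\) concrete, here is a minimal numerical sketch. It assumes the logistic case \(\rho (t) = \log (1+e^t)\) and the usual definition \(\mathsf {prox}_{b\rho }(z) = \arg \min _t \{ b\rho (t) + (t-z)^2/2 \}\); the function names `sigmoid`, `rho_pp`, `prox`, `delta_n` and the input `z_values` (standing in for the leave-one-out fitted values \(\varvec{X}_i^{\top } \hat{\varvec{\beta }}_{[-i]}\)) are hypothetical and not taken from the paper.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function; equals rho'(t) for rho(t) = log(1 + e^t).
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def rho_pp(t):
    # Second derivative rho''(t) = sigmoid(t) * (1 - sigmoid(t)).
    s = sigmoid(t)
    return s * (1.0 - s)


def prox(z, x, iters=80):
    # Solve t + x * rho'(t) = z by bisection (assumes x >= 0).
    # Since 0 < rho' < 1, the root lies in the interval [z - x, z].
    lo, hi = z - x, z
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + x * sigmoid(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def delta_n(x, z_values, p):
    # p/n - 1 + (1/n) * sum_i 1 / (1 + x * rho''(prox_{x rho}(z_i))).
    n = len(z_values)
    total = sum(1.0 / (1.0 + x * rho_pp(prox(z, x))) for z in z_values)
    return p / n - 1.0 + total / n
```

Bisection is chosen here only for robustness: the map \(t \mapsto t + x\rho '(t)\) is strictly increasing, so the root is unique and bracketed. At \(x = 0\) every summand equals one and the sketch returns \(p/n\).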

Proof of Proposition 3

Here we only provide the main steps of the proof. Note that since \(0 < \alpha \le p/(n\lambda _{\mathrm {lb}}):=B\) on \(\mathcal {A}_n\), it suffices to show that

$$\begin{aligned} \sup _{x \in [0,B]} |\delta _n(x)-\Delta (x)| ~{\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}~ 0. \end{aligned}$$

We do this by following three steps. Below, \(M>0\) is some sufficiently large constant.

  1. First we truncate the random function \(\delta _n(x)\) and define

    $$\begin{aligned} \tilde{\delta }_n(x) = \frac{p}{n} - 1+\frac{1}{n}\sum _{i=1}^n \frac{1}{1+ x \rho ''\left( \mathsf {prox}_{x\rho }\left( \varvec{X}_i^{\top }\hat{\varvec{\beta }}_{[-i]}\varvec{1}_{\{\Vert \hat{\varvec{\beta }}_{[-i]}\Vert \le M\}} \right) \right) } . \end{aligned}$$

    The first step is to show that \(\sup _{x \in [0,B]} \left| \tilde{\delta }_n(x) - \delta _n(x) \right| {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0.\) This step can be established using Theorem 4 and some straightforward analysis. We stress that this truncation does not arise in [25], and that we need to keep track of it throughout the rest of the proof.

  2. Show that \(\sup _{x \in [0,B]} \left| \tilde{\delta }_n(x) - {{\,\mathrm{{\mathbb {E}}}\,}}\big [\tilde{\delta }_n(x)\big ] \right| {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0\).

  3. Show that \(\sup _{x \in [0,B]} \left| {{\,\mathrm{{\mathbb {E}}}\,}}\big [ \tilde{\delta }_n(x) \big ] - \Delta (x) \right| {\mathop {\rightarrow }\limits ^{{\mathbb {P}}}}0\).

Steps 2 and 3 can be established by arguments similar to those in [25, Lemmas 3.24 and 3.25], with the necessary modifications for our setup. We skip the detailed arguments here and refer the reader to [58]. \(\square \)

Cite this article

Sur, P., Chen, Y. & Candès, E.J. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled Chi-square. Probab. Theory Relat. Fields 175, 487–558 (2019). https://doi.org/10.1007/s00440-018-00896-9

  • DOI: https://doi.org/10.1007/s00440-018-00896-9
