Adaptive log-density estimation

Abstract

This study examines an adaptive log-density estimation method with an \(\ell _1\)-type penalty. The proposed estimator is guaranteed to be a valid density in the sense that it is positive and integrates to one. The smoothness of the estimator is controlled in a data-adaptive way via \(\ell _1\) penalization. The advantages of the penalized log-density estimator are discussed with an emphasis on wavelet estimators. Theoretical properties of the estimator are studied when the quality of fit is measured by the Kullback–Leibler divergence (relative entropy). A nonasymptotic oracle inequality is obtained assuming a near orthogonality condition on the given dictionary. Based on the oracle inequality, selection consistency and minimax adaptivity are proved under some regularity conditions. The proposed method is implemented with a coordinate descent algorithm. Numerical illustrations based on the periodized Meyer wavelets are performed to demonstrate the finite sample performance of the proposed estimator.

References

  1. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

  2. Antoniadis, A., & Fan, J. (2001). Regularization of wavelet approximations. Journal of the American Statistical Association, 96(455), 939–967.

  3. Barron, A. R., & Cover, T. M. (1988). A bound on the financial value of information. IEEE Transactions on Information Theory, 34(5), 1097–1100.

  4. Barron, A. R., & Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4), 1034–1054.

  5. Barron, A. R., & Sheu, C.-H. (1991). Approximation of density functions by sequences of exponential families. The Annals of Statistics, 19, 1347–1369.

  6. Biau, G., & Devroye, L. (2005). Density estimation by the penalized combinatorial method. Journal of Multivariate Analysis, 94(1), 196–208.

  7. Bigot, J., & Bellegem, S. V. (2009). Log-density deconvolution by wavelet thresholding. Scandinavian Journal of Statistics, 36(4), 749–763.

  8. Bunea, F. (2004). Consistent covariate selection and post model selection inference in semiparametric regression. The Annals of Statistics, 32, 898–927.

  9. Bunea, F. (2008). Honest variable selection in linear and logistic regression models via \(\ell _1\) and \(\ell _1 + \ell _2\) penalization. Electronic Journal of Statistics, 2, 1153–1194.

  10. Bunea, F., Tsybakov, A. B., & Wegkamp, M. H. (2007a). Sparse density estimation with \(\ell _1\) penalties. International Conference on Computational Learning Theory (pp. 530–543). Berlin: Springer.

  11. Bunea, F., Tsybakov, A. B., & Wegkamp, M. H. (2007b). Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1, 169–194.

  12. Bunea, F., Tsybakov, A. B., Wegkamp, M. H., & Barbu, A. (2010). Spades and mixture models. The Annals of Statistics, 38(4), 2525–2558.

  13. Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3, 146–158.

  14. Daubechies, I. (1992). Ten Lectures on Wavelets (Vol. 61). Philadelphia: SIAM.

  15. de Montricher, G. F., Tapia, R. A., & Thompson, J. R. (1975). Nonparametric maximum likelihood estimation of probability densities by penalty function methods. The Annals of Statistics, 3, 1329–1348.

  16. Donoho, D. L., Elad, M., & Temlyakov, V. N. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1), 6–18.

  17. Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., & Picard, D. (1996). Density estimation by wavelet thresholding. The Annals of Statistics, 2, 508–539.

  18. Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.

  19. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Berlin: Springer.

  20. Good, I., & Gaskins, R. (1980). Density estimation and bump-hunting by the penalized likelihood method exemplified by scattering and meteorite data. Journal of the American Statistical Association, 75(369), 42–56.

  21. Good, I., & Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities. Biometrika, 58(2), 255–277.

  22. Härdle, W., Kerkyacharian, G., Picard, D., & Tsybakov, A. (2012). Wavelets, Approximation, and Statistical Applications (Vol. 129). Berlin: Springer Science & Business Media.

  23. Koo, J.-Y. (1996). Bivariate B-splines for tensor logspline density estimation. Computational Statistics & Data Analysis, 21(1), 31–42.

  24. Koo, J.-Y., & Chung, H.-Y. (1998). Log-density estimation in linear inverse problems. The Annals of Statistics, 26(1), 335–362.

  25. Koo, J.-Y., & Kim, W.-C. (1996). Wavelet density estimation by approximation of log-densities. Statistics & Probability Letters, 26(3), 271–278.

  26. Koo, J.-Y., & Park, B. U. (1996). B-spline deconvolution based on the EM algorithm. Journal of Statistical Computation and Simulation, 54(4), 275–288.

  27. Kooperberg, C. (2016). logspline: Logspline Density Estimation Routines. R package version 2.1.9. https://CRAN.R-project.org/package=logspline.

  28. Mächler, M. (2017). nor1mix: Normal (1-d) Mixture Models (S3 Classes and Methods). R package version 1.2-3. https://CRAN.R-project.org/package=nor1mix.

  29. Mallat, S. (1999). A Wavelet Tour of Signal Processing. Cambridge: Academic Press.

  30. Negahban, S. N., Ravikumar, P., Wainwright, M. J., & Yu, B. (2012). A unified framework for high-dimensional analysis of \(m\)-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.

  31. Postman, M., Huchra, J., & Geller, M. (1986). Probes of large-scale structure in the corona borealis region. The Astronomical Journal, 92, 1238–1247.

  32. Rockafellar, R. T. (2015). Convex Analysis. Princeton: Princeton University Press.

  33. Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85(411), 617–624.

  34. Scott, D. (2015). ash: David Scott’s ASH Routines. R package version 1.0-15. https://CRAN.R-project.org/package=ash.

  35. Silverman, B. W. (1982). On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics, 10, 795–810.

  36. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis (Vol. 26). Boca Raton: CRC Press.

  37. Stone, C. J. (1989). Uniform error bounds involving logspline models. Probability, Statistics, and Mathematics (pp. 335–355). Amsterdam: Elsevier.

  38. Stone, C. J. (1990). Large-sample inference for log-spline models. The Annals of Statistics, 18, 717–741.

  39. Stone, C. J., & Koo, C.-Y. (1986). Logspline density estimation. Contemporary Mathematics, 59, 1–15.

  40. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological), 58, 267–288.

  41. Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3), 475–494.

  42. Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation (1st ed.). Berlin: Springer Publishing Company.

  43. van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614–645.

  44. Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). New York: Springer.

  45. Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _{1}\)-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5), 2183–2202.

  46. Wasserman, L. (2006). All of Nonparametric Statistics. New York: Springer.

  47. Zhang, C.-H., & Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4), 1567–1594.

  48. Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.

Acknowledgements

The authors would like to thank the editor and referees for their valuable comments and suggestions that greatly improved this paper. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2018R1D1A1B07049972).

Author information

Correspondence to Ja-Yong Koo.

Appendices

Appendix A. Proofs of theoretical results

Appendix A.1. Proof of the oracle inequality

For notational convenience, we omit the range \(\mathcal {I}\) of the integrals. Let \(C_1, C_2, \ldots \) denote positive constants whose values may change from one occurrence to another. Define the expectation of the basis vector as

$$\begin{aligned} \bar{B} = \mathbb {E}B(X_1) = \int B f. \end{aligned}$$

Let

$$\begin{aligned} \mathcal {A}= \bigcap _{j=1}^J {\left\{ {|\hat{B}_j - \bar{B}_j|} < \frac{\lambda _j}{2} \right\} }. \end{aligned}$$

Lemma 1

Suppose \(\lambda \) is chosen as (3) or (4). Then,

$$\begin{aligned} \mathbb {P}(\mathcal {A}) \ge 1 - \delta /J^{m-1}. \end{aligned}$$

Proof

Suppose first that \(\lambda \) is chosen as (3). By Hoeffding’s inequality, we have

$$\begin{aligned} \mathbb {P}(\mathcal {A}^c)&\le \sum _{j=1}^J \mathbb {P}\left( {|\hat{B}_j - \bar{B}_j|} \ge \frac{\lambda _j}{2} \right) \le 2 \sum _{j=1}^J \exp \left( - \frac{2 N^2 ((\lambda _j /2) )^2}{N(2u_j)^2} \right) \\&\le 2J \exp \left( - N r_m^2(\delta ) \right) = 2J \exp \left( - {\frac{ \log (2J^m/\delta )}{N}} N \right) = \frac{\delta }{J^{m-1}} . \end{aligned}$$

Suppose now that \(\lambda \) is chosen as (4). Applying Bernstein’s inequality to the random variables \(B_j(X_n) - \mathbb {E}B_j(X_n)\), we have

$$\begin{aligned} \mathbb {P}(\mathcal {A}^c)&= \mathbb {P}\left( \bigcup _{j=1}^J {\left\{ {|\hat{B}_j - \bar{B}_j|} \ge \frac{\lambda _j}{2} \right\} } \right) \\&\le \sum _{j=1}^J \mathbb {P}{\left( {|\hat{B}_j - \bar{B}_j|} \ge \frac{\lambda _j}{2} \right) } \le \sum _{j=1}^J \exp \left( - \frac{N \lambda _j^2/4}{2\sigma _j^2 + 2 u_j \lambda _j /3} \right) \\&\le \sum _{j=1}^J \exp (-N r_m^2(\delta )) \le J\exp (-N r_m^2(\delta )) < \frac{\delta }{J^{m-1}}. \end{aligned}$$

\(\square \)
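As a remark on implementation, the following minimal sketch (in Python; not the code used for the numerical studies) shows how tuning parameters of the Hoeffding type could be computed. It assumes that \(r_m(\delta ) = \sqrt{\log (2J^m/\delta )/N}\), as the display above indicates, and that (3) takes the form \(\lambda _j = 4 u_j r_m(\delta )\), as suggested by the choice \(\lambda _j = 4ur_2(\delta )\) in Lemma 3; the exact constants in (3) and (4) are those given in the main text.

```python
import numpy as np

def r_m(delta, J, N, m):
    # Deviation level used in Lemma 1: exp(-N * r_m(delta)^2) = delta / (2 J^m).
    return np.sqrt(np.log(2.0 * J**m / delta) / N)

def hoeffding_lambda(u, delta, N, m=2):
    # Hoeffding-type tuning parameters, assuming (3) has the form
    # lambda_j = 4 * u_j * r_m(delta) with u_j = ||B_j||_inf (cf. Lemma 3).
    u = np.asarray(u, dtype=float)
    J = u.size
    return 4.0 * u * r_m(delta, J, N, m)

# Illustration: J = 64 uniformly bounded basis functions, N = 500 observations.
lam = hoeffding_lambda(np.ones(64), delta=0.05, N=500)
print(lam[:3])
```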

Lemma 2

For \(\theta ^1, \theta ^2 \in \mathbb {R}^J\), define a deviation

$$\begin{aligned} {\nu ( \theta ^1 \Vert \theta ^2 )} = \left[ \ell (\theta ^1) - \ell (\theta ^2) \right] - \left[ L(\theta ^1) - L(\theta ^2) \right] . \end{aligned}$$

Then, we have

$$\begin{aligned} {|{\nu ( \theta ^1 \Vert \theta ^2 )}|} < \frac{1}{2} \mathsf {p}^\lambda (\theta ^1 - \theta ^2) \end{aligned}$$

on the event \(\mathcal {A}\).

Proof

Note

$$\begin{aligned} {\nu ( \theta ^1 \Vert \theta ^2 )}&= \left[ \ell (\theta ^1) - \ell (\theta ^2) \right] - \left[ L(\theta ^1) - L(\theta ^2) \right] \\&= (\theta ^1 - \theta ^2) \cdot (\hat{B} - \bar{B}), \end{aligned}$$

which implies

$$\begin{aligned} {|{\nu ( \theta ^1 \Vert \theta ^2 )}|} < \frac{1}{2} \mathsf {p}^\lambda (\theta ^1 - \theta ^2) \end{aligned}$$

on the event \(\mathcal {A}\). \(\square \)

Proposition 5

For all \(\theta \in \mathbb {R}^J\), we have

$$\begin{aligned} {\Delta ( f \Vert \hat{f} )} + \frac{1}{2} \mathsf {p}^\lambda ({\hat{\theta }}- \theta ) \le {\Delta ( f \Vert {\,\mathsf {f}}_\theta )} + 2 \sum _{j \in S(\theta )} \lambda _j {|{\hat{\theta }}_j - \theta _j|} \end{aligned}$$

on the event \(\mathcal {A}\).

Proof

By the definition of \({\hat{\theta }}\),

$$\begin{aligned} \ell ({\hat{\theta }}) - \mathsf {p}^\lambda ({\hat{\theta }}) \ge \ell (\theta ) - \mathsf {p}^\lambda (\theta ) \end{aligned}$$
(A.1)

for all \(\theta \in \mathbb {R}^J\). Note

$$\begin{aligned} {\Delta ( f \Vert \hat{f} )} - {\Delta ( f \Vert {\,\mathsf {f}}_\theta )}&= \int f \log \frac{f}{\hat{f}} - \int f \log \frac{f}{{\,\mathsf {f}}_\theta } = - \int f \log \hat{f} + \int f \log {\,\mathsf {f}}_\theta \\&= {\nu ( {\hat{\theta }} \Vert \theta )} - \left[ \ell ({\hat{\theta }}) - \ell (\theta ) \right] \le {\nu ( {\hat{\theta }} \Vert \theta )} - \mathsf {p}^\lambda ({\hat{\theta }}) + \mathsf {p}^\lambda (\theta ), \end{aligned}$$

where (A.1) is used for the last inequality. Observe

$$\begin{aligned}&\mathsf {p}^\lambda ({\hat{\theta }}- \theta ) - \mathsf {p}^\lambda ({\hat{\theta }}) + \mathsf {p}^\lambda (\theta )\\&\quad = \left[ \sum _{j \in S(\theta )} \lambda _j {|{\hat{\theta }}_j - \theta _j|} + \sum _{j \notin S(\theta )} \lambda _j {|{\hat{\theta }}_j|} \right] - \left[ \sum _{j \in S(\theta )} \lambda _j {|{\hat{\theta }}_j|} + \sum _{j \notin S(\theta )} \lambda _j {|{\hat{\theta }}_j|} \right] \\&\qquad + \sum _{j \in S(\theta )} \lambda _j {|\theta _j|}\\&\quad = \sum _{j \in S(\theta )} \lambda _j \left[ {|{\hat{\theta }}_j - \theta _j|} - {|{\hat{\theta }}_j|} + {|\theta _j|} \right] \le 2 \sum _{j \in S(\theta )} \lambda _j {|{\hat{\theta }}_j - \theta _j|} . \end{aligned}$$

Combining the results with Lemma 2, we have

$$\begin{aligned} {\Delta ( f \Vert \hat{f} )} + \frac{1}{2} \mathsf {p}^\lambda ({\hat{\theta }}- \theta )&\le \Delta (f \Vert {\,\mathsf {f}}_\theta ) + {\nu ( {\hat{\theta }} \Vert \theta )} + \frac{1}{2} \mathsf {p}^\lambda ({\hat{\theta }}- \theta ) - \mathsf {p}^\lambda ({\hat{\theta }}) + \mathsf {p}^\lambda (\theta )\\&\le \Delta (f \Vert {\,\mathsf {f}}_\theta ) + \mathsf {p}^\lambda ({\hat{\theta }}- \theta ) - \mathsf {p}^\lambda ({\hat{\theta }}) + \mathsf {p}^\lambda (\theta )\\&\le \Delta (f \Vert {\,\mathsf {f}}_\theta ) + 2 \sum _{j \in S(\theta )} \lambda _j {|{\hat{\theta }}_j - \theta _j|}. \end{aligned}$$

\(\square \)

Proof of Proposition 2

Note

$$\begin{aligned} ||\log {\,\mathsf {f}}_\theta ||_2^2 \ge \sum _{j=1}^J \sum _{k=1}^J \theta _j \theta _k {\left\langle B_j, B_k \right\rangle } \ge \varsigma _J \sum _{j \in S(\theta )} \theta _j^2 . \end{aligned}$$

This, together with Lemma 1 of Barron and Sheu (1991), the triangle inequality and the inequality \(2xy \le x^2/\kappa + \kappa y^2\) for \(x, y \in \mathbb {R}\) and \(\kappa > 0\), implies

$$\begin{aligned} \sum _{j \in S(\theta )} \lambda _j {|{\hat{\theta }}_j - \theta _j|}&\le \sqrt{\sum _{j \in S(\theta )} \lambda _j^2} \sqrt{\sum _{j \in S(\theta )} {|{\hat{\theta }}_j - \theta _j|}^2}\\&\le \frac{\eta (\theta ) }{\sqrt{\varsigma _J}} \left( ||\log f / \hat{f} ||_2 + ||\log f / {\,\mathsf {f}}_\theta ||_2 \right) \\&\le \sqrt{C_{3}} \frac{\eta (\theta ) }{\sqrt{\varsigma _J}} \left( \sqrt{{\Delta ( f \Vert \hat{f} )}} + \sqrt{{\Delta ( f \Vert {\,\mathsf {f}}_\theta )}} \right) \\&\le C_{3} \frac{\eta ^2(\theta )}{\varsigma _J} \kappa + \frac{1}{2\kappa } {\Delta ( f \Vert \hat{f} )} + \frac{1}{2\kappa } {\Delta ( f \Vert {\,\mathsf {f}}_\theta )} \end{aligned}$$

where \(C_{3} = 2 e^{M_{1} + M_{2} + M_{3}} \). Applying Proposition 5, we obtain

$$\begin{aligned} {\Delta ( f \Vert \hat{f} )} + \frac{1}{2} \mathsf {p}^\lambda ({\hat{\theta }}- \theta )&\le {\Delta ( f \Vert {\,\mathsf {f}}_\theta )} + 2 \sum _{j \in S(\theta )} \lambda _j {|{\hat{\theta }}_j - \theta _j|}\\&\le {\Delta ( f \Vert {\,\mathsf {f}}_\theta )} + 2 \left[ \kappa C_{3} \frac{\eta ^2(\theta )}{\varsigma _J} + \frac{1}{2\kappa } {\Delta ( f \Vert \hat{f} )} + \frac{1}{2\kappa } {\Delta ( f \Vert {\,\mathsf {f}}_\theta )} \right] \end{aligned}$$

so that

$$\begin{aligned} \frac{\kappa -1}{\kappa } {\Delta ( f \Vert \hat{f} )} + \frac{1}{2} \mathsf {p}^\lambda ({\hat{\theta }}- \theta ) \le 2\kappa C_{3} \frac{\eta ^2(\theta )}{\varsigma _J} + \frac{\kappa +1}{\kappa } {\Delta ( f \Vert {\,\mathsf {f}}_\theta )}. \end{aligned}$$

Multiplying both sides by \(\kappa / (\kappa -1)\) yields the desired result.

Appendix A.2. Proof of selection consistency

Lemma 3

Let \({\left\{ B_j :j=1,\ldots , J \right\} }\) be an orthonormal basis and choose \(\lambda _j = 4ur_2(\delta )\) for \(j = 1, \ldots , J\). Suppose (O1) and (O2) hold. Then, with probability at least \(1-\delta /J\), we have

$$\begin{aligned} \sum _{j=1}^{J} {|{\hat{\theta }}_j - \theta _j^0|} \le 32 \sqrt{2} e^{M_{1} + M_{2} + M_{3}} u r_2(\delta )k^0 \end{aligned}$$

for all \(\theta \in \mathbb {R}^J\) with \({||\log {\,\mathsf {f}}_\theta ||}_\infty \le M_{3}\).

Proof

Note that, arguing as in the proof of Proposition 2, we have

$$\begin{aligned} \sum _{j \in S(\theta )} \lambda _j {|{\hat{\theta }}_j - \theta _j|} \le 32 \kappa e^{M_{1} + M_{2} + M_{3}} u^2 r_2^2(\delta ) {\left| S(\theta ) \right| } + \frac{1}{2\kappa } {\Delta ( f \Vert \hat{f} )} + \frac{1}{2\kappa } {\Delta ( f \Vert {\,\mathsf {f}}_\theta )}. \end{aligned}$$

Hence, (5) becomes

$$\begin{aligned} {\Delta ( f \Vert \hat{f} )} + \frac{\kappa }{2(\kappa -1)} \mathsf {p}^\lambda ({\hat{\theta }}- \theta ) \le \frac{\kappa +1}{\kappa -1} {\Delta ( f \Vert {\,\mathsf {f}}_\theta )} + \frac{64e^{M_{1} + M_{2} + M_{3}}\kappa ^2}{\kappa -1} {\left| S(\theta ) \right| } u^2 r_2^2(\delta ) . \end{aligned}$$

for \(\kappa >1 \) . Setting \(\theta = \theta ^0\) and \(\kappa = \sqrt{2}\) gives the desired result. \(\square \)

Lemma 4

\({\hat{\theta }}\) exists for all \(\lambda = (\lambda _j)\) with \(\lambda _j > 0\), \(j=1\), \(\ldots ,J\).

Proof

The penalized log-likelihood problem can be written as an equivalent constrained optimization problem:

$$\begin{aligned} \text{ maximize } ~ \ell (\theta ) ~ \quad \text{ subject } \text{ to } ~ {|\theta _j|} \le C_j \end{aligned}$$

for some constant \( 0< C_j< \infty \) for \(j=1,\ldots , J\), where the penalization parameter \(\lambda _j\) and the constraint level \(C_j\) are in one-to-one correspondence via Lagrangian duality. Since the function \(\theta \mapsto \theta \cdot \hat{B} - \mathsf {c}(\theta )\) is continuous on a compact set \({\left\{ \theta \in \mathbb {R}^J: {|\theta _j|} \le C_j, j=1,\ldots ,J \right\} }\), an optimal solution \({\hat{\theta }}\in \mathbb {R}^J\) exists. \(\square \)

Lemma 5

Let \({\left\{ B_j :j=1,\ldots , J \right\} }\) be an orthonormal basis. Then, \({\hat{\theta }}\) is unique for all \(\lambda = (\lambda _j)\) with \(\lambda _j > 0\), \(j=1,\ldots ,J\).

Proof

We use the properties of concave functions. Recall that the set of maxima of a concave function is convex. Then, if \(\hat{\theta }^{(1)}\) and \(\hat{\theta }^{(2)}\) are points of maxima, so is \(\epsilon \hat{\theta }^{(1)} + (1-\epsilon )\hat{\theta }^{(2)}\) for any \(0< \epsilon < 1\). Rewrite this convex combination as \(\hat{\theta }^{(2)} +\epsilon \eta \), where \(\eta = \hat{\theta }^{(1)} - \hat{\theta }^{(2)}\).

Suppose that \(\eta \ne 0\). Recall that the maximum value of any concave function is unique. Therefore, for any \(0< \epsilon < 1\), the value of \(\ell ^\lambda (\theta )\) at \(\theta = \hat{\theta }^{(2)} +\epsilon \eta \) is equal to some constant \(C_{1}\):

$$\begin{aligned} F(\epsilon )&\triangleq \sum _{j=1}^{J} (\hat{\theta }_j^{(2)} + \epsilon \eta _j) \hat{B}_j - \mathsf {c}(\hat{\theta }^{(2)} + \epsilon \eta ) - \sum _{j=1}^{J} \lambda _j {|\hat{\theta }_j^{(2)} + \epsilon \eta _j|}\\&= (\hat{\theta }^{(2)} + \epsilon \eta ) \cdot \hat{B} - \mathsf {c}(\hat{\theta }^{(2)} + \epsilon \eta ) - \sum _{j=1}^{J} \lambda _j {|\hat{\theta }_j^{(2)} + \epsilon \eta _j|} = C_{1} \end{aligned}$$

where \(\theta \cdot \hat{B}\) denotes \(\sum _{j=1}^J \theta _j \hat{B}_j\) for all \(\theta \in \mathbb {R}^J\). By taking the derivative with respect to \(\epsilon \) of \(F(\epsilon )\) when \(\hat{\theta }^{(2)}_j +\epsilon \eta _j \ne 0\) for \(j = 1, \ldots , J\), we obtain that, for all \(0< \epsilon < 1\),

$$\begin{aligned} F'(\epsilon ) = \eta \cdot \hat{B} - \sum _{j=1}^J \lambda _j \eta _j \mathsf {sign}\left( \hat{\theta }^{(2)}_j +\epsilon \eta _j \right) - \int (\eta \cdot B) {\,\mathsf {f}}_{\hat{\theta }^{(2)} +\epsilon \eta } = 0. \end{aligned}$$

This follows from the fact that

$$\begin{aligned} \frac{\partial \mathsf {c}(\hat{\theta }^{(2)} + \epsilon \eta )}{\partial \epsilon } = \frac{\int (\eta \cdot B) \exp \left( (\hat{\theta }^{(2)} + \epsilon \eta ) \cdot B \right) }{\int \exp \left( (\hat{\theta }^{(2)} + \epsilon \eta ) \cdot B \right) } = \int (\eta \cdot B) {\,\mathsf {f}}_{\hat{\theta }^{(2)} +\epsilon \eta }. \end{aligned}$$

By the continuity of \(\epsilon \mapsto \hat{\theta }^{(2)}_j +\epsilon \eta _j\), there exists an open interval I in (0, 1) on which \(\epsilon \mapsto \mathsf {sign}(\hat{\theta }^{(2)}_j +\epsilon \eta _j)\) is constant for all j. Therefore, on that interval, \(F'(\epsilon )=0\), and both \(\eta \cdot \hat{B}\) and \( \sum _{j=1}^J \lambda _j \eta _j \mathsf {sign}\left( \hat{\theta }^{(2)}_j +\epsilon \eta _j \right) \) are independent of \(\epsilon \), so that

$$\begin{aligned} G(\epsilon ) \triangleq \int (\eta \cdot B) {\,\mathsf {f}}_{\hat{\theta }^{(2)} +\epsilon \eta } = C_{2} \end{aligned}$$

for some constant \(C_{2}\). Observe that on this open interval,

$$\begin{aligned} G'(\epsilon )&= \frac{\partial }{\partial \epsilon } \int (\eta \cdot B) \exp \left( (\hat{\theta }^{(2)} + \epsilon \eta ) \cdot B - \mathsf {c}(\hat{\theta }^{(2)} + \epsilon \eta ) \right) \\&= \int \left[ (\eta \cdot B) - \int (\eta \cdot B) {\,\mathsf {f}}_{\hat{\theta }^{(2)} +\epsilon \eta } \right] ^2 {\,\mathsf {f}}_{\hat{\theta }^{(2)} +\epsilon \eta } =0, \end{aligned}$$

which implies that \(\eta \cdot B = C_{3}\). By the orthogonality of \(B_j\)’s, we have \(C_{3} =0\) and \(\eta =0\), hence \(\hat{\theta }^{(1)} = \hat{\theta }^{(2)}\). \(\square \)

Lemma 6

The subgradient optimality condition for \({\hat{\theta }}_j\) is given as

$$\begin{aligned} \frac{1}{N} \sum _{n=1}^N B_j(X_n) - {\left\langle B_j, \hat{f} \right\rangle } = \lambda _j \omega _j {\quad \text{ for }\quad }j =1,\ldots , J, \end{aligned}$$

where \(\omega _j \in \partial {|{\hat{\theta }}_j|}\) is given as

$$\begin{aligned} \omega _j = \left\{ \begin{array}{ll} 1 &{} {\quad \text{ if } \quad }{\hat{\theta }}_j >0\\ -1 &{} {\quad \text{ if } \quad }{\hat{\theta }}_j < 0\\ {[}-1,1] &{}{\quad \text{ if } \quad }{\hat{\theta }}_j =0 \end{array}\right. . \end{aligned}$$

Proof

Note

$$\begin{aligned} \frac{\partial }{\partial \theta _j} \mathsf {c}(\theta )&= \frac{\partial }{\partial \theta _j} \log \int \exp (\theta \cdot B) = \frac{\int B_j \exp (\theta \cdot B)}{\int \exp (\theta \cdot B)}\\&= \int B_j \exp (\theta \cdot B - \mathsf {c}(\theta )) = \int B_j {\,\mathsf {f}}_\theta = {\left\langle B_j, {\,\mathsf {f}}_\theta \right\rangle }. \end{aligned}$$

Since \(\ell ^\lambda (\theta )\) is concave in \(\theta \), by standard results in convex analysis (see, for example, Rockafellar 2015), the subgradient optimality condition is given as

$$\begin{aligned} 0 = \hat{B}_j - {\left\langle B_j, {\,\mathsf {f}}_{\hat{\theta }} \right\rangle } - \lambda _j\omega _j = \frac{1}{N} \sum _{n=1}^N B_j(X_n) - {\left\langle B_j, \hat{f} \right\rangle } - \lambda _j \omega _j {\quad \text{ for }\quad }j =1,\ldots , J, \end{aligned}$$

which gives the desired result. \(\square \)
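The conditions of Lemma 6 can also be checked numerically for a candidate \({\hat{\theta }}\) by comparing the empirical moments \(\hat{B}_j\) with the model moments \({\left\langle B_j, \hat{f} \right\rangle }\) computed by quadrature. The sketch below (in Python, for illustration only) does this for a cosine dictionary on \([0,1]\); the dictionary, grid and tolerance are our own choices and not part of the paper.

```python
import numpy as np

def fitted_density(theta, B, dx):
    # f_theta = exp(theta . B - c(theta)); c(theta) via a Riemann sum on a uniform grid.
    s = B @ theta
    c = np.log(np.sum(np.exp(s)) * dx)
    return np.exp(s - c)

def check_kkt(theta_hat, lam, X, basis_funcs, grid, tol=1e-3):
    # Numerical check of the subgradient conditions in Lemma 6.
    dx = grid[1] - grid[0]
    B = np.column_stack([b(grid) for b in basis_funcs])
    f_hat = fitted_density(theta_hat, B, dx)
    B_emp = np.array([b(X).mean() for b in basis_funcs])   # empirical moments \hat B_j
    B_mod = B.T @ f_hat * dx                               # model moments <B_j, f_hat>
    score = B_emp - B_mod
    zero, nonzero = theta_hat == 0, theta_hat != 0
    ok_zero = np.all(np.abs(score[zero]) <= lam[zero] + tol)
    ok_sign = np.allclose(score[nonzero],
                          lam[nonzero] * np.sign(theta_hat[nonzero]), atol=tol)
    return bool(ok_zero and ok_sign)

# Illustrative cosine dictionary on [0, 1] and uniform data.
basis = [lambda x, j=j: np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, 6)]
grid = np.linspace(0.0, 1.0, 4001)
X = np.random.default_rng(0).uniform(size=200)
print(check_kkt(np.zeros(5), np.full(5, 0.5), X, basis, grid))   # expect True here
```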

Let

$$\begin{aligned} \Theta ^0 = {\left\{ \theta \in \mathbb {R}^J :\theta _j = 0 {\quad \text{ for }\quad }j \notin I^0 \right\} } \end{aligned}$$

and

$$\begin{aligned} \hat{\theta }^0 = \mathop {\hbox {argmax}}\limits _{\theta \in \Theta ^0} \ell ^\lambda (\theta ) \end{aligned}$$
(A.2)

where \(\lambda _j = 4ur_2(\delta )\) for \(j =1,\ldots , J\). Define

$$\begin{aligned} \mathcal {B} = \bigcap _{k \notin I^0}{\left\{ {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - \int B_k {\,\mathsf {f}}_{{\hat{\theta }}^0} \right| } \le 4ur_2(\delta ) \right\} }. \end{aligned}$$

Lemma 7

On the event \(\mathcal {B}\), \({\hat{\theta }}^0 \in \mathbb {R}^J\) defined by (A.2) is a solution to the penalized likelihood problem (1).

Proof

By Lemma 6, \({\hat{\theta }}\) is a maximizer of \(\ell ^\lambda (\theta )\) if and only if, for all \(k =1,\ldots , J\),

$$\begin{aligned} \frac{1}{N}\sum _{n=1}^N B_k(X_n) - {\left\langle B_k, \hat{f} \right\rangle }&= 4ur_2(\delta ) \mathsf {sign}({\hat{\theta }}_k)&\text {if } {\hat{\theta }}_k \ne 0, \end{aligned}$$
(A.3)
$$\begin{aligned} {\left| \frac{1}{N}\sum _{n=1}^N B_k(X_n) - {\left\langle B_k, \hat{f} \right\rangle }\right| }&\le 4ur_2(\delta )&\text {if } {\hat{\theta }}_k = 0. \end{aligned}$$
(A.4)

Consider \({\hat{\theta }}^0 \in \mathbb {R}^J\) defined in (A.2). Note that the non-zero components of \({\hat{\theta }}^0\) are obtained by maximizing \(\ell ^\lambda (\theta )\) over \(\mathbb {R}^{k^0}\). Using the same argument as in Lemma 6, we have

$$\begin{aligned} \frac{1}{N}\sum _{n=1}^N B_k(X_n) - {\left\langle B_k, {\,\mathsf {f}}_{{\hat{\theta }}^0} \right\rangle }&= 4ur_2(\delta ) \mathsf {sign}({\hat{\theta }}^0_k)&\text {if } {\hat{\theta }}^0_k \ne 0, k\in I^0,\\ {\left| \frac{1}{N}\sum _{n=1}^N B_k(X_n) - {\left\langle B_k, {\,\mathsf {f}}_{{\hat{\theta }}^0} \right\rangle }\right| }&\le 4ur_2(\delta )&\text {if } {\hat{\theta }}_k^0 = 0, k\in I^0. \end{aligned}$$

On the other hand, we have

$$\begin{aligned} {\left| \frac{1}{N}\sum _{n=1}^N B_k(X_n) - {\left\langle B_k, {\,\mathsf {f}}_{{\hat{\theta }}^0} \right\rangle }\right| }&\le 4ur_2(\delta )&\text {if } k \notin I^0 \end{aligned}$$

on the event \(\mathcal {B}\) where \({\hat{\theta }}^0_k =0\) by construction for \(k \notin I^0\). Combining the results, it is seen that \({\hat{\theta }}^0\) satisfies the optimality conditions (A.3) and (A.4), hence is a maximizer of \(\ell ^\lambda (\theta )\) on the event \(\mathcal {B}\). \(\square \)

Lemma 8

Let

$$\begin{aligned} q_k(\theta ) = {\left\langle B_k, {\,\mathsf {f}}_\theta - f \right\rangle } = \int B_k ({\,\mathsf {f}}_\theta - f) {\quad \text{ for }\quad }k = 1,\ldots , J \end{aligned}$$

where \(f = {\,\mathsf {f}}_{\theta ^0}\). Then

$$\begin{aligned} q_k({\hat{\theta }}) = \sum _{j=1}^J ({\hat{\theta }}_j - \theta _j^0) H_{jk}(\bar{\theta }) \end{aligned}$$

where \(\bar{\theta }\) is between \(\theta ^0\) and \({\hat{\theta }}\).

Proof

Since \(f = {\,\mathsf {f}}_{\theta ^0}\), we have

$$\begin{aligned} q_k(\theta ^0) = 0 \end{aligned}$$

so that

$$\begin{aligned} q_k({\hat{\theta }}) =q_k(\theta ^0) + \nabla q_k(\bar{\theta })^\top ({\hat{\theta }}- \theta ^0) = \nabla q_k(\bar{\theta })^\top ({\hat{\theta }}- \theta ^0) \end{aligned}$$

where \(\bar{\theta }\) is between \(\theta ^0\) and \({\hat{\theta }}\). Note that the gradient \(\nabla q_k(\bar{\theta })\) is the J-dimensional vector with entries \(H_{jk}(\bar{\theta })\) for \(j =1,\ldots , J\). It follows that

$$\begin{aligned} q_k({\hat{\theta }})&= \nabla q_k(\bar{\theta })^\top ({\hat{\theta }}- \theta ^0) \nonumber \\&= \sum _{j=1}^J ({\hat{\theta }}_j - \theta _j^0) \left( \int B_j B_k {\,\mathsf {f}}_{\bar{\theta }} - \int B_j {\,\mathsf {f}}_{\bar{\theta }} \int B_k {\,\mathsf {f}}_{\bar{\theta }} \right) \nonumber \\&= \sum _{j=1}^J ({\hat{\theta }}_j - \theta _j^0) H_{jk}(\bar{\theta }). \end{aligned}$$
(A.5)

\(\square \)

Note that Lemmas 8 and 3 justify the definition of the set \(\mathcal {U}\) in (6).

Proof of Proposition 3

Note

$$\begin{aligned} \mathbb {P}(I^0 \nsubseteq \hat{I})&\le \mathbb {P}(\hat{\theta }_j = 0 \text { for some } j \in I^0) \le k^0 \max _{k \in I^0} \mathbb {P}(\hat{\theta }_k = 0). \end{aligned}$$

Choose \(k \in I^0\). It follows from Lemmas 6 and 8 that

$$\begin{aligned} \mathbb {P}(\hat{\theta }_k =0)&= \mathbb {P}\left( {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - {\left\langle B_k, \hat{f} \right\rangle } \right| } \le 4ur\ ;\ \hat{\theta }_k =0 \right) \nonumber \\&= \mathbb {P}\left( {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - {\left\langle B_k, f \right\rangle } - q_k({\hat{\theta }})\right| } \le 4ur\ ;\ \hat{\theta }_k =0 \right) \nonumber \\&\le \mathbb {P}\left( {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - {\left\langle B_k, f \right\rangle }\right| } \ge \frac{{|\theta _k^0|}H_{kk}(\bar{\theta })}{2} - 2ur \right) \end{aligned}$$
(A.6)
$$\begin{aligned}&\quad + \mathbb {P}\left( {\left| \sum _{j \ne k} ({\hat{\theta }}_j - \theta _j^0) H_{jk}(\bar{\theta })\right| } \ge \frac{{|\theta _k^0 |}H_{kk}(\bar{\theta })}{2} - 2ur \right) \end{aligned}$$
(A.7)

To bound (A.6), we use Hoeffding’s inequality. By Assumption (S1), we have

$$\begin{aligned}&\mathbb {P}\left( {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - {\left\langle B_k, f \right\rangle }\right| } \ge \frac{{|\theta _k^0|}H_{kk}(\bar{\theta })}{2} - 2ur \right) \\&\le \mathbb {P}\left( {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - {\left\langle B_k, f \right\rangle }\right| } \ge 2ur \right) \le \frac{\delta }{J^2} \end{aligned}$$

Now, it remains to bound the term (A.7). From Lemma 3, we have

$$\begin{aligned}&\mathbb {P}\left( {\left| \sum _{j \ne k} ({\hat{\theta }}_j - \theta _j^0) H_{jk}(\bar{\theta })\right| } \ge \frac{{|\theta _k^0 |}H_{kk}(\bar{\theta })}{2} - 2ur \right) \\&\quad \le \mathbb {P}\left( \sum _{j=1}^J {|\theta _j^0 -{\hat{\theta }}_j|} \ge 32 \sqrt{2} e^{M_{1} + M_{2} + M_{3} } u rk^0 \right) \le \frac{\delta }{J} \end{aligned}$$

under (S1) and (S2). Therefore, we have

$$\begin{aligned} \mathbb {P}(I^0 \nsubseteq \hat{I}) \le k^0 \frac{\delta }{J^2} + k^0 \frac{\delta }{J} \le \frac{\delta }{J} + \delta . \end{aligned}$$

Proof of Proposition 4

For \({\hat{\theta }}^0 \in \mathbb {R}^J\) defined as (A.2), denote \(\hat{I}^0 = S({\hat{\theta }}^0)\). Observe that \(\hat{I}^0 \subseteq I^0\) by construction. From Lemma 7, we know that \({\hat{\theta }}^0\) is a maximizer of the penalized likelihood problem (1) on the event \(\mathcal {B}\). Recall that \({\hat{\theta }}\) is also a solution to (1), so the uniqueness in Lemma 5 gives \({\hat{\theta }}= {\hat{\theta }}^0\), and hence \(\hat{I} = \hat{I}^0 \subseteq I^0\), on the event \(\mathcal {B}\). Therefore, it suffices to bound the probability \(\mathbb {P}(\mathcal {B}^c)\) to bound \(\mathbb {P}(\hat{I} \nsubseteq I^0)\). By Lemma 8, we have

$$\begin{aligned} \mathbb {P}(\hat{I} \nsubseteq I^0)&\le \mathbb {P}(\mathcal {B}^c) \nonumber \\&\le \sum _{k \notin I^0} \mathbb {P}\left( {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - {\left\langle B_k, {\,\mathsf {f}}_{{\hat{\theta }}^0} \right\rangle } \right| } \ge 4ur \right) \nonumber \\&\le \sum _{k \notin I^0}\mathbb {P}\left( {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - {\left\langle B_k, f \right\rangle } + {\left\langle B_k, f \right\rangle } - {\left\langle B_k, {\,\mathsf {f}}_{{\hat{\theta }}^0} \right\rangle } \right| } \ge 4ur \right) \nonumber \\&\le \sum _{k \notin I^0} \mathbb {P}\left( {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - {\left\langle B_k, f \right\rangle }\right| } \ge 2ur \right) \nonumber \\&\quad + \sum _{k \notin I^0} \mathbb {P}\left( {\left| {\left\langle B_k, {\,\mathsf {f}}_{{\hat{\theta }}^0} \right\rangle } - {\left\langle B_k, f \right\rangle } \right| } \ge 2ur \right) . \end{aligned}$$
(A.8)

As in the proof of Proposition 3, by Hoeffding’s inequality, we have

$$\begin{aligned} \sum _{k \notin I^0} \mathbb {P}\left( {\left| \frac{1}{N} \sum _{n=1}^N B_k(X_n) - {\left\langle B_k, f \right\rangle }\right| } \ge 2ur \right) \le \frac{\delta }{J}. \end{aligned}$$

Now it remains to bound the term (A.8). By Lemma 3 and Eq. (A.5), we have

$$\begin{aligned}&\sum _{k \notin I^0} \mathbb {P}\left( {\left| {\left\langle B_k, {\,\mathsf {f}}_{{\hat{\theta }}^0} \right\rangle } - {\left\langle B_k, f \right\rangle } \right| } \ge 2ur \right) = \sum _{k \notin I^0} \mathbb {P}\left( {|q_k({\hat{\theta }}^0)|} \ge 2ur \right) \\&\quad \le \sum _{k \notin I^0} \mathbb {P}\left( \sum _{j \in I^0} {|\theta _j^0 - {\hat{\theta }}^0_j|} \ge 32 \sqrt{2} e^{M_{1} + M_{2} + M_{3} } u rk^0 \right) \le \delta \end{aligned}$$

under Assumption (S2). Therefore, we have

$$\begin{aligned} \mathbb {P}(\hat{I} \nsubseteq I^0) \le \frac{\delta }{J} + \delta . \end{aligned}$$

Proof of Theorem 1

It follows from Propositions 3 and 4 that

$$\begin{aligned} \mathbb {P}(\hat{I} \ne I^0) \le \mathbb {P}(I^0 \nsubseteq \hat{I}) + \mathbb {P}(\hat{I} \nsubseteq I^0) \le 2\left( 1+\frac{1}{J} \right) \delta . \end{aligned}$$

Choosing \(\delta < \frac{J \varepsilon }{2(J+1)}\) yields the desired result.

Appendix A.3. Proof of adaptivity

Define the maximizer of the expected log-likelihood

$$\begin{aligned} {\theta ^*}= {\mathop {\hbox {argmax}}\limits _{\theta \in \mathbb {R}^J}} L(\theta ) \end{aligned}$$

where \(L(\theta ) = \mathbb {E}\log {\,\mathsf {f}}_\theta (X_1)\). The existence of \(\theta ^*\) is proved in Proposition 6. The information projection is the density \(f^* = {\,\mathsf {f}}_{{\theta ^*}}\) that is closest to f in the Kullback–Leibler sense; see Csiszár (1975).

Let \(A_J\) be such that

$$\begin{aligned} {||\theta \cdot B ||}_\infty \le A_J ||\theta \cdot B ||_2 {\quad \text{ for }\quad }\theta \in \mathbb {R}^J. \end{aligned}$$

Let \(\gamma _{2 J} = ||\log f - s ||_2\) and \(\gamma _{\infty J} = {||\log f - s ||}_\infty \) be \(L_2\) and \(L_\infty \) degrees of approximation of \(\log f\) by some \(s \in {\text{ span }{\left\{ B_j :j=0,\ldots , J \right\} }}\).

Proposition 6

Suppose (W1) holds and \(\varepsilon _J \triangleq 4 M_{1}^2 e^{4\gamma _{\infty J} + 1} A_J \gamma _{2J} < 1\). Then, \({\theta ^*}\) exists and satisfies the bound

$$\begin{aligned} {||\log f/f^* ||}_\infty \le 2 \gamma _{\infty J} + \varepsilon _J. \end{aligned}$$

Proof

Let \(s_J = \sum _{j=0}^J \beta _j B_j\) be the approximation of \(\log f\) satisfying the given \(L_2\) and \(L_\infty \) bounds on the error \(\log f- s_J\), and let \(\beta = (\beta _1,\ldots , \beta _J)\). Set \(\alpha ^* = \int B f\), \(\alpha = \int B f_\beta \), and \(b=e^{{||\log {\,\mathsf {f}}_\beta ||}_\infty }\). By Bessel’s inequality, Lemma 2 of Barron and Sheu (1991), and (W1), we have

$$\begin{aligned} {|\alpha ^* - \alpha |}_2^2&\le ||f - {\,\mathsf {f}}_\beta ||_2^2 \le M_{1} \int \frac{(f-{\,\mathsf {f}}_\beta )^2}{f}\\&\le M_{1} \exp (2 {||\log f - s_J ||}_\infty - 2(\beta _0 + c(\beta ) )) \int f (\log f - s_J)^2\\&\le M_{1}^2 \exp (4\gamma _{\infty J}) \gamma _{2J}^2. \end{aligned}$$

For the last inequality, we have used the fact that \({|c(\beta ) + \beta _0|}\) is not greater than \({||\log f - s_J ||}_\infty \), since \(c(\beta ) + \beta _0 = \log \int \exp (s_J - \log f )f\). From this same fact, it is seen that

$$\begin{aligned} {||\log f - \log {\,\mathsf {f}}_\beta ||}_\infty \le 2 {||\log f- s_J ||}_\infty \le 2\gamma _{\infty J}. \end{aligned}$$

If \(M_{1} \exp (2\gamma _{\infty J}) \gamma _{2J} \le 1/(4ebA_J)\), that is, if \(\varepsilon _J \le 1\), it follows from Lemma 5 of Barron and Sheu (1991) that the solution \({\theta ^*}\) to the equation \(\int B {\,\mathsf {f}}_\theta = \int B f\) exists and that \({||\log f^* / {\,\mathsf {f}}_\beta ||}_\infty \le \varepsilon _J\). By the triangle inequality, we have

$$\begin{aligned} {||\log f / f^* ||}_\infty \le {||\log f / {\,\mathsf {f}}_\beta ||}_\infty + {||\log {\,\mathsf {f}}_\beta / f^* ||}_\infty \le 2 \gamma _{\infty J} + \varepsilon _J. \end{aligned}$$

\(\square \)

For \(f^1 = {\,\mathsf {f}}_{\theta ^1}\) and \(f^2 = {\,\mathsf {f}}_{\theta ^2}\), we write

$$\begin{aligned} {\Delta ( \theta ^1 \Vert \theta ^2 )} = {\Delta ( f^1 \Vert f^2 )}. \end{aligned}$$

Proposition 7

Let \(b = e^{{||\log f^* ||}_\infty }\). Suppose (W1) holds, \(\varepsilon _J \le 1\) and the event \(\mathcal {A}\) has occurred. Then, \({\hat{\theta }}\) satisfies

$$\begin{aligned} {|{\hat{\theta }}- {\theta ^*}|}_2 \le 4 b e^\tau {| \lambda |}_2 \end{aligned}$$

and

$$\begin{aligned} {\mathopen {}\mathclose {\left||\log \hat{f}/ f^* \right|}|}_\infty \le 8 b e^\tau A_J {|\lambda |}_2 \le \tau \end{aligned}$$
(A.9)

for \(\tau \) satisfying \(8 e b A_J {|\lambda |}_2 \le \tau \le 1\).

Proof

From Proposition 6, we know that \({\theta ^*}\) exists and \({||\log f^* ||}_\infty \) is bounded. Note

$$\begin{aligned} {\Delta ( {\theta ^*} \Vert \theta )}&= \int f^* \log \frac{{\,\mathsf {f}}^*}{{\,\mathsf {f}}_\theta } = \int f \left[ {\theta ^*}\cdot B - \mathsf {c}({\theta ^*}) - \left\{ \theta \cdot B - \mathsf {c}(\theta ) \right\} \right] \\&= ({\theta ^*}- \theta ) \cdot \bar{B} - \mathsf {c}({\theta ^*}) + \mathsf {c}(\theta ) \end{aligned}$$

where we have used the fact that the information projection satisfies \(\int s f = \int s f^*\) for \(s \in {\text{ span }{\left\{ B_1, \dots , B_J \right\} }}\); see Barron and Sheu (1991). It follows that

$$\begin{aligned} \ell ^\lambda ({\theta ^*}) - \ell ^\lambda (\theta )&= {\theta ^*}\cdot \hat{B} - \mathsf {c}({\theta ^*}) - \mathsf {p}^\lambda ({\theta ^*}) - \left[ \theta \cdot \hat{B} - \mathsf {c}(\theta ) - \mathsf {p}^\lambda (\theta ) \right] \\&= {\Delta ( {\theta ^*} \Vert \theta )} - (\theta - {\theta ^*}) \cdot (\hat{B} - \bar{B}) - \mathsf {p}^\lambda ({\theta ^*}) + \mathsf {p}^\lambda (\theta ) . \end{aligned}$$

From now on, we assume that the event \(\mathcal {A}\) has occurred. Observe

$$\begin{aligned} {|\mathsf {p}^\lambda ({\theta ^*}) - \mathsf {p}^\lambda (\theta )|} \le {|\lambda |}_2 {|\theta - {\theta ^*}|}_2 \end{aligned}$$

and

$$\begin{aligned} {|(\theta - {\theta ^*}) \cdot (\hat{B} - \bar{B})|} \le \sum _{j=1}^J {|\theta _j - \theta ^*_j|} {|\hat{B}_j - \bar{B}_j|} < \sum _{j=1}^J \lambda _j {|\theta _j - \theta ^*_j|} \le {|\lambda |}_2 {|\theta - {\theta ^*}|}_2. \end{aligned}$$

It follows that

$$\begin{aligned} \ell ^\lambda ({\theta ^*}) - \ell ^\lambda (\theta )&> {\Delta ( {\theta ^*} \Vert \theta )} - 2 {|\lambda |}_2 {|\theta - {\theta ^*}|}_2\\&\ge \frac{1}{2b} e^{- 2 A_J {|\theta - {\theta ^*}|}_2} {|\theta - {\theta ^*}|}_2^2 - 2 {|\lambda |}_2 {|\theta - {\theta ^*}|}_2. \end{aligned}$$

Consider \(\theta \) on the sphere \({\left\{ {|\theta - {\theta ^*}|}_2 = d \right\} }\) for \(d = 4 e^\tau b {|\lambda |}_2\). For all \(\theta \) on this sphere

$$\begin{aligned} \ell ^\lambda ({\theta ^*}) - \ell ^\lambda (\theta ) > 8 e^\tau b {|\lambda |}_2^2 \left( e^{\tau - 8 A_J e^\tau b {|\lambda |}_2} - 1 \right) . \end{aligned}$$

The right side is nonnegative when \(8 A_J e^\tau b {|\lambda |}_2 \le \tau \le 1\). Thus, the value of \(\ell ^\lambda \) at \({\theta ^*}\) is larger than all of its values on the sphere. Since \(\ell ^\lambda \) is concave, its maximizer \({\hat{\theta }}\) lies inside the sphere, that is, \({|{\hat{\theta }}- {\theta ^*}|}_2 < d = 4 e^\tau b {|\lambda |}_2 \). Note that (A.9) follows by applying Lemma 4 of Barron and Sheu (1991). \(\square \)

We state the main result on adaptivity in terms of the wavelet bases given in (7). For the wavelet bases in (7), we have

$$\begin{aligned} u_{jk} \triangleq {||\psi _{jk} ||}_\infty \le M_{7} 2^{j/2} \end{aligned}$$

and

$$\begin{aligned} \Big ||\sum _{k \in V(j)} \psi ^2_{jk}\Big ||_\infty \le M_{8} 2^j \end{aligned}$$
(A.10)

for \(0 \le j \le j_{1}\) and \(k \in V(j)\); see Härdle et al. (2012) for example. Moreover, by Lemma 1 of Koo and Kim (1996), we have

$$\begin{aligned} A_{j'} = M_{9} 2^{j'/2} \end{aligned}$$
(A.11)

where

$$\begin{aligned} {||w ||}_\infty \le A_{j'} \mathopen {}\mathclose {\left||w \right|}|_2 {\quad \text{ for }\quad }w \in \mathcal {W}_{j'}. \end{aligned}$$

Define \(\varepsilon _{j'}= 4 M_{1}^2 e^{4\gamma _{\infty j'} + 1} A_{j'} \gamma _{2 j'} \).

We make use of the following proposition obtained from Proposition 2 to prove adaptivity.

Proposition 8

Suppose (O1), (O2) hold and choose \(\lambda \) as (4) with \(m=1\), \(\delta = N^{-2}\) and let \(j_1 \asymp \log _2 (N/\log N)^{1/2}\). We then have, with probability at least \(1-N^{-2}\),

$$\begin{aligned} {\Delta ( f \Vert \hat{f} )} \le M_{10} \left[ {\Delta ( f \Vert {\,\mathsf {f}}_\theta )} + \sum _{(j,k) \in S(\theta )} \left( \sigma _{jk}^2 \frac{\log N}{N} + 2^j \left( \frac{\log N}{N} \right) ^2 \right) \right] \end{aligned}$$

for all \(\theta \) with \({||\log {\,\mathsf {f}}_\theta ||}_\infty \le M_{3}\).

Proof

Choosing \(\lambda \) as (4) with \(m=1\), we have

$$\begin{aligned} \lambda _{jk}^2 \le C_{1} \sigma _{jk}^2 r_1^2(\delta ) + C_{2} u_{jk}^2 r_1^4(\delta ) . \end{aligned}$$

It follows that

$$\begin{aligned} \eta ^2(\theta )&\le C_{1} \sum _{(j,k) \in S(\theta )} \sigma _{jk}^2 r_1^2(\delta ) + C_{2} \sum _{(j,k) \in S(\theta )} u_{jk}^2 r_1^4(\delta )\\&\le C_{3} \sum _{(j,k) \in S(\theta )} \left[ \sigma _{jk}^2 \frac{\log N}{N} + 2^j \left( \frac{\log N}{N} \right) ^2 \right] . \end{aligned}$$

The desired result follows from Proposition 2. \(\square \)

Proof of Theorem 2

Let \(s\in \mathcal {W}_{j'}\) be the approximation of \(\log f\) satisfying the given \(L_2\) and \(L_\infty \) bounds on the error \(\log f - s\). It follows from Lemma 1 of Barron and Sheu (1991) that

$$\begin{aligned} {\Delta ( f \Vert f^* )} \le \frac{1}{2} e^{{||\log f - s ||}_\infty } \int f (\log f - s)^2 \le \frac{M_{1}}{2} \exp (\gamma _{\infty j'}) \gamma _{2 j'}^2. \end{aligned}$$

For \(j' \asymp \log _2 \left( \frac{N}{\log N} \right) ^{\frac{1}{2\alpha +1}}\) with \(\alpha > 1/2\), note that (W1) and (A.11) imply

$$\begin{aligned} \varepsilon _{j'} \le C_{1} A_{j'} \gamma _{2j'} \le C_{2} 2^{(\frac{1}{2} -\alpha )j'} \rightarrow 0. \end{aligned}$$

Moreover, we have

$$\begin{aligned} 8ebA_{j'} {|\lambda |}_2=C_{2} A_{j'} {|\lambda |}_2 \le C_{3} 2^{ j'/2} {|\lambda |}_2 \le C_{4} \left( \frac{N}{\log N} \right) ^{\frac{-2\alpha +1 }{2 \alpha +1}} \rightarrow 0 \end{aligned}$$

for \(\alpha > 1/2\). Thus, from (W1) and Propositions 6 and 7, we have

$$\begin{aligned} {||\log \hat{f} ||}_\infty \le M_{2} {\quad \text{ and } \quad }{||\log f^* ||}_\infty \le M_{3} \end{aligned}$$

for a sufficiently large N. Observe that the inequality (A.10) implies

$$\begin{aligned} \sum _{j=0}^{j'} \sum _{k \in V(j)} \sigma _{jk}^2 \le \sum _{j=0}^{j'} \sum _{k \in V(j)} \mathbb {E}\psi _{jk}^2(X_1) \le M_{8} \sum _{j=0}^{j'} 2^j \le C_{5} 2^{j'}. \end{aligned}$$

Let \(j_{1} \asymp \log _2 \left( \frac{N}{\log N} \right) ^{\frac{1}{2}}\). It follows from Proposition 8 that

$$\begin{aligned} {\Delta ( f \Vert \hat{f} )}&\le \min _{j' \le j_{1}} C_{6} \left( 2^{-2\alpha j'} + 2^{j'} \left( \frac{\log N}{N} \right) + 2^{2j'} \left( \frac{\log N}{N} \right) ^2 \right) \\&\le M_{6} \left( \frac{N}{\log N} \right) ^{-\frac{2\alpha }{2\alpha + 1}} \end{aligned}$$

by choosing \(j' \asymp \log _2 \left( \frac{N}{\log N} \right) ^{\frac{1}{2\alpha +1}}\).
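For concreteness, the resolution levels and the resulting rate can be evaluated numerically; the sample size and smoothness index in the short sketch below are illustrative values only, not those used in the simulation study.

```python
import numpy as np

N, alpha = 2**14, 1.5                                   # illustrative values
ratio = N / np.log(N)
j1 = int(np.round(0.5 * np.log2(ratio)))                # finest level  j_1 ~ log_2 (N / log N)^{1/2}
jp = int(np.round(np.log2(ratio) / (2 * alpha + 1)))    # balancing level j' ~ log_2 (N / log N)^{1/(2 alpha + 1)}
rate = ratio ** (-2 * alpha / (2 * alpha + 1))          # adaptive rate appearing in Theorem 2
print(j1, jp, rate)
```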

Appendix B. Derivation of the coordinate-wise update formula

Consider a quadratic function p defined as \(p(z) = \frac{b}{2} (z - c)^2 + d\) for \(z \in \mathbb {R}\), \(b > 0\) and \(c, d \in \mathbb {R}\). Note that \(b = p''(z)\) for all z and that c is the solution to the equation \(p'(z) = 0\). Let \(p^\tau \) be a penalized quadratic function given as

$$\begin{aligned} p^\tau (z) = p(z) + \tau {|z|} \end{aligned}$$
(B.1)

and denote \(z^\tau = \hbox {argmin}_{z \in \mathbb {R}} p^\tau (z)\).

Lemma 9

The minimizer \(z^\tau \) of \(p^\tau \) in (B.1) is given by

$$\begin{aligned} z^\tau = \mathsf {ST}\left( c, \frac{\tau }{b} \right) . \end{aligned}$$

Proof

Since \(p^\tau \) is the sum of a strictly convex function and a convex function, \(p^\tau \) is strictly convex. Hence, it suffices to show that \(z^\tau \) is a local minimum. First, suppose \( c > \tau / b\). Observe

$$\begin{aligned} \frac{d}{dz} p^\tau (z) = b(z - c) + \tau \end{aligned}$$

for \(z \ge 0\). This implies that the solution \(z^\tau = c - \tau / b\) to \(d p^\tau (z) / dz = 0\) is a local minimum for the region \(z \ge 0\). Similarly, suppose \( c < - \tau / b\). Observe

$$\begin{aligned} \frac{d}{dz} p^\tau (z) = b(z - c) - \tau \end{aligned}$$

for \(z < 0\). This implies that the solution \(z^\tau = c + \tau / b\) to \(d p^\tau (z) / dz = 0\) is a local minimum for the region \(z < 0\). Suppose now \({|c|} \le \tau / b\). If \(z < 0\), then

$$\begin{aligned} \frac{d}{dz} p^\tau (z) = b(z - c) - \tau = b \left[ z - \left( c + \frac{\tau }{b} \right) \right] \le bz < 0, \end{aligned}$$

which implies that \(p^\tau \) is strictly decreasing over the region \(- \xi _L< z < 0\) for a sufficiently small positive number \(\xi _L\). Similarly, if \(z > 0\), then

$$\begin{aligned} \frac{d}{dz} p^\tau (z) = b(z - c) + \tau = b \left[ z - \left( c - \frac{\tau }{b} \right) \right] \ge bz > 0 \end{aligned}$$

so that \(p^\tau \) is strictly increasing over the region \(0< z < \xi _R\) for a sufficiently small positive number \(\xi _R\). It follows that \(z^\tau = 0\) is a local minimum over \(-\xi _L< z < \xi _R\). \(\square \)
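Lemma 9 is easy to verify numerically: the closed form \(\mathsf {ST}(c, \tau /b)\) agrees with a brute-force minimization of \(p^\tau \). The following sketch (in Python, for illustration only) does so for arbitrary values of \(b, c, d\) and \(\tau \).

```python
import numpy as np

def soft_threshold(c, t):
    # ST(c, t) = sign(c) * max(|c| - t, 0), as in Lemma 9.
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def penalized_quadratic(z, b, c, d, tau):
    # p^tau(z) = (b/2)(z - c)^2 + d + tau|z|, cf. (B.1).
    return 0.5 * b * (z - c) ** 2 + d + tau * np.abs(z)

# Verify the closed form against a brute-force grid search.
b, c, d, tau = 2.0, 0.7, 1.0, 0.9
z_grid = np.linspace(-3.0, 3.0, 200001)
z_star = z_grid[np.argmin(penalized_quadratic(z_grid, b, c, d, tau))]
print(soft_threshold(c, tau / b), z_star)   # both approximately 0.25
```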

Proof of Proposition 1

Observe

$$\begin{aligned} r'_j(\theta _j) = g' (\tilde{\theta }_j) + g''(\tilde{\theta }_j)(\theta _j - \tilde{\theta }_j) \end{aligned}$$

and

$$\begin{aligned} r''_j(\theta _j) = g''(\tilde{\theta }_j)= -(\nabla ^2 \ell (\tilde{\theta }))_{jj} = H(\tilde{\theta })_{jj}. \end{aligned}$$

Note that \(\tilde{\theta }_j - r'_j(\tilde{\theta }_j) /H(\tilde{\theta })_{jj} \) satisfies the equation \(r'_j(\theta _j) = 0\). Since \(r_j\) is a univariate quadratic function of \(\theta _j\), we have

$$\begin{aligned} r_j(\theta _j) = \frac{H(\tilde{\theta })_{jj}}{2}\left( \theta _j - \left( \tilde{\theta }_j - \frac{r'_j (\tilde{ \theta }_j)}{H(\tilde{\theta })_{jj}} \right) \right) ^2 + a(\tilde{\theta } ) \end{aligned}$$

where \(a(\tilde{\theta })\) is independent of \(\theta _j\). Applying Lemma 9 yields the desired result.
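Combining Lemma 9 with Proposition 1 gives the coordinate-wise update \(\theta _j \leftarrow \mathsf {ST}\big (\tilde{\theta }_j - r'_j(\tilde{\theta }_j)/H(\tilde{\theta })_{jj},\, \lambda _j / H(\tilde{\theta })_{jj}\big )\). The sketch below (in Python) runs this update; it assumes that g is the negative log-likelihood in the jth coordinate, so that \(r'_j(\tilde{\theta }_j) = -(\hat{B}_j - {\left\langle B_j, {\,\mathsf {f}}_{\tilde{\theta }} \right\rangle })\) and \(H(\tilde{\theta })_{jj} = \int B_j^2 {\,\mathsf {f}}_{\tilde{\theta }} - (\int B_j {\,\mathsf {f}}_{\tilde{\theta }})^2\) as in Lemma 8. The cosine dictionary, data and penalty level are made-up choices, not the periodized Meyer wavelet implementation of the paper.

```python
import numpy as np

def soft_threshold(c, t):
    # ST(c, t) = sign(c) * max(|c| - t, 0), as in Lemma 9.
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def coordinate_descent_pass(theta, lam, B_emp, B_grid, dx):
    # One sweep of theta_j <- ST(theta_j - r'_j(theta_j)/H_jj, lambda_j/H_jj).
    theta = theta.copy()
    for j in range(theta.size):
        s = B_grid @ theta
        c = np.log(np.sum(np.exp(s)) * dx)            # normalizing constant c(theta)
        f = np.exp(s - c)                             # current density f_theta on the grid
        mom1 = np.sum(B_grid[:, j] * f) * dx          # <B_j, f_theta>
        mom2 = np.sum(B_grid[:, j] ** 2 * f) * dx     # <B_j^2, f_theta>
        grad = -(B_emp[j] - mom1)                     # r'_j(theta_j) = g'(theta_j)
        hess = mom2 - mom1 ** 2                       # H(theta)_jj
        theta[j] = soft_threshold(theta[j] - grad / hess, lam[j] / hess)
    return theta

# Illustrative setup: cosine dictionary on [0, 1], Beta(2, 2) data.
rng = np.random.default_rng(1)
X = rng.beta(2.0, 2.0, size=400)
grid = np.linspace(0.0, 1.0, 2001)
dx = grid[1] - grid[0]
funcs = [lambda x, j=j: np.sqrt(2.0) * np.cos(np.pi * j * x) for j in range(1, 9)]
B_grid = np.column_stack([b(grid) for b in funcs])
B_emp = np.array([b(X).mean() for b in funcs])
theta, lam = np.zeros(8), np.full(8, 0.05)
for _ in range(50):
    theta = coordinate_descent_pass(theta, lam, B_emp, B_grid, dx)
print(np.round(theta, 3))                             # sparse coefficient vector
```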

About this article

Cite this article

Bak, K., Koo, J. Adaptive log-density estimation. J. Korean Stat. Soc. 49, 293–323 (2020). https://doi.org/10.1007/s42952-019-00018-8

Keywords

  • \(\ell _1\) penalty
  • Log-density estimation
  • Minimax adaptivity
  • Model selection consistency
  • Oracle inequality
  • Wavelet basis