Approximations of semicontinuous functions with applications to stochastic optimization and statistical estimation


Upper semicontinuous (usc) functions arise in the analysis of maximization problems, distributionally robust optimization, and function identification, which includes many problems of nonparametric statistics. We establish that every usc function is the limit of a hypo-converging sequence of piecewise affine functions of the difference-of-max type and illustrate resulting algorithmic possibilities in the context of approximate solution of infinite-dimensional optimization problems. In an effort to quantify the ease with which classes of usc functions can be approximated by finite collections, we provide upper and lower bounds on covering numbers for bounded sets of usc functions under the Attouch-Wets distance. The result is applied in the context of stochastic optimization problems defined over spaces of usc functions. We establish confidence regions for optimal solutions based on sample average approximations and examine the accompanying rates of convergence. Examples from nonparametric statistics illustrate the results.



  1.

    We stress that \(\nu \) is an index and not the power of q.

  2.

    Recall that “open” here is according to the metric space \((S,\Vert \cdot -\cdot \Vert _\infty )\).

  3.

    For the significance of entropy integrals, we refer to [44].

  4.

    This reference states results only for finite dimensions, but since \((F,{\mathbb {d}})\) is a complete separable metric space with compact balls, the proofs of the required results carry over nearly verbatim.

  5.

    On \((F,{\mathbb {d}})\), we adopt the Borel sigma-algebra.

  6.

    For measurable \(h:\Xi \rightarrow {\overline{{\mathbb {R}}}}\), \(\int h(\xi )d{\mathbb {P}}(\xi ) = \int \max \{0,h(\xi )\}d{\mathbb {P}}(\xi ) - \int \max \{0, -h(\xi )\}d{\mathbb {P}}(\xi )\), with the convention \(\infty -\infty = \infty \).

  7.

    A random variable Y is sub-exponential if for some \(\lambda \ge 0\), \({\mathbb {E}}[\exp (\tau (Y-{\mathbb {E}}Y))] \le \exp (\tau ^2\lambda ^2/2)\) for all \(|\tau |\le 1/\lambda \). Any other assumption that ensures a Bernstein-type large-deviation result could have been substituted here.
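The sub-exponential condition in footnote 7 can be checked numerically for a simple case. The following Python sketch (illustrative, not from the paper) verifies that a Rademacher variable, taking values \(\pm 1\) with probability 1/2, satisfies the moment-generating-function bound with \(\lambda = 1\), since its MGF is \(\cosh \tau \le \exp (\tau ^2/2)\).

```python
import math

# Numerical check (illustrative): a Rademacher variable Y (+1 or -1, each with
# probability 1/2, so E[Y] = 0) is sub-exponential with lambda = 1:
# E[exp(tau * (Y - E[Y]))] = cosh(tau) <= exp(tau^2 * lambda^2 / 2) for |tau| <= 1/lambda.
lam = 1.0
for k in range(-10, 11):
    tau = k / (10 * lam)                              # tau sweeps [-1/lambda, 1/lambda]
    mgf = 0.5 * math.exp(tau) + 0.5 * math.exp(-tau)  # = cosh(tau)
    assert mgf <= math.exp(tau ** 2 * lam ** 2 / 2)
```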


  1.

    Balabdaoui, F., Wellner, J.A.: Estimation of a k-monotone density: characterizations, consistency and minimax lower bounds. Stat. Neerl. 64(1), 45–70 (2010)

  2.

    Bampou, D., Kuhn, D.: Polynomial approximations for continuous linear programs. SIAM J. Optim. 22, 628–648 (2012)

  3.

    Bartlett, P.L., Kulkarni, S.R., Posner, S.E.: Covering numbers for real-valued function classes. IEEE Trans. Inf. Theory 43(5), 1721–1724 (1997)

  4.

    Bayraksan, G., Morton, D.P.: Assessing solution quality in stochastic programs. Math. Program. 108, 495–514 (2006)

  5.

    Birman, M.S., Solomjak, M.Z.: Piecewise-polynomial approximation of functions of the classes \(W_p^\alpha \). Math. USSR Sbornik 73, 295–317 (1967)

  6.

    Bronshtein, E.M.: \(\epsilon \)-Entropy of convex sets and functions. Sib. Math. J. 17(3), 393–398 (1976)

  7.

    Brudnyi, A.: On covering numbers of sublevel sets of analytic functions. J. Approx. Theory 162(1), 72–93 (2010)

  8.

    Cui, Y., Pang, J.-S., Sen, B.: Composite difference-max programs for modern statistical estimation problems. SIAM J. Optim. 28(4), 3344–3374 (2018)

  9.

    Cule, M., Samworth, R.J., Stewart, M.: Maximum likelihood estimation of a multi-dimensional log-concave density. J. R. Stat. Soc. Ser. B 72, 545–600 (2010)

  10.

    Devolder, O., Glineur, F., Nesterov, Y.: Solving infinite-dimensional optimization problems by polynomial approximation. In: Diehl, M., Glineur, F., Jarlebring, E., Michiels, W. (eds.) Recent Advances in Optimization and its Applications in Engineering, pp. 31–40. Springer, Berlin (2010)

  11.

    Dudley, R.M.: Metric entropy of some classes of sets with differentiable boundaries. J. Approx. Theory 10(3), 227–236 (1974)

  12.

    Georghiou, A., Wiesemann, W., Kuhn, D.: Generalized decision rule approximations for stochastic programming via liftings. Math. Program. 152(1–2), 301–338 (2015)

  13.

    Groeneboom, P., Jongbloed, G., Wellner, J.A.: Estimation of a convex function: characterizations and asymptotic theory. Ann. Stat. 29, 1653–1698 (2001)

  14.

    Guntuboyina, A., Sen, B.: Covering numbers for convex functions. IEEE Trans. Inf. Theory 59(4), 1957–1965 (2013)

  15.

    Guntuboyina, A., Sen, B.: Global risk bounds and adaptation in univariate convex regression. Probab. Theory Relat. Fields 163, 379–411 (2015)

  16.

    Guo, Y., Bartlett, P.L., Shawe-Taylor, J., Williamson, R.C.: Covering numbers for support vector machines. IEEE Trans. Inf. Theory 48(1), 239–250 (2002)

  17.

    Hanasusanto, G.A., Wiesemann, W., Kuhn, D.: K-adaptability in two-stage robust binary programming. Oper. Res. 63(4), 877–891 (2015)

  18.

    Hartman, P.: On functions representable as a difference of convex functions. Pac. J. Math. 9, 707–713 (1959)

  19.

    Higle, J.L., Sen, S.: Statistical verification of optimality conditions for stochastic programs with recourse. Ann. Oper. Res. 30, 215–240 (1991)

  20.

    Higle, J.L., Sen, S.: Duality and statistical tests of optimality for two stage stochastic programs. Math. Program. 75, 257–275 (1996)

  21.

    Horst, R., Thoai, N.V.: DC programming: overview. J. Optim. Theory Appl. 103(1), 1–43 (1999)

  22.

    Kim, A.K.H., Samworth, R.J.: Global rates of convergence in log-concave density estimation. Ann. Stat. 44, 2756–2779 (2016)

  23.

    Kolmogorov, A.N., Tikhomirov, V.M.: Epsilon-entropy and epsilon-capacity of sets in functional spaces. Am. Math. Soc. Transl. Ser. 2(17), 277–364 (1961)

  24.

    Kühn, T.: Covering numbers of Gaussian reproducing kernel Hilbert spaces. J. Complex. 27(5), 489–499 (2011)

  25.

    Lamm, M., Lu, S.: Generalized conditioning based approaches to computing confidence intervals for solutions to stochastic variational inequalities. Math. Program. B 174, 99–127 (2018)

  26.

    Lu, S., Liu, Y., Yin, L., Zhang, K.: Confidence intervals and regions for the lasso by using stochastic variational inequality techniques in optimization. J. R. Stat. Soc. Ser. B 79, 589–611 (2017)

  27.

    Mak, W.K., Morton, D.P., Wood, R.K.: Monte Carlo bounding techniques for determining solution quality in stochastic programs. Oper. Res. Lett. 24, 47–56 (1999)

  28.

    Miller, M.: Binary classification using piecewise affine functions. Master’s thesis, Naval Postgraduate School, Monterey, CA, June (2019)

  29.

    Norkin, V.I., Pflug, G.C., Ruszczynski, A.: A branch and bound method for stochastic global optimization. Math. Program. 83, 425–450 (1998)

  30.

    Pontil, M.: A note on different covering numbers in learning theory. J. Complex. 19(5), 665–671 (2003)

  31.

    Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis, Grundlehren der Mathematischen Wissenschaften, vol. 317. Springer, Berlin (1998). (3rd printing 2009 edition)

  32.

    Royset, J.O.: Optimality functions in stochastic programming. Math. Program. 135(1), 293–321 (2012)

  33.

    Royset, J.O.: Approximations and solution estimates in optimization. Math. Program. 170(2), 479–506 (2018)

  34.

    Royset, J.O., Wets, R.J.-B.: From data to assessments and decisions: epi-spline technology. In: Newman, A. (ed.) INFORMS Tutorials. INFORMS, Catonsville (2014)

  35.

    Royset, J.O., Wets, R.J.-B.: Multivariate epi-splines and evolving function identification problems. Set-Valued Var. Anal. 24(4), 517–545 (2016). (Erratum: pp. 547–549)

  36.

    Royset, J.O., Wets, R.J.-B.: Variational theory for optimization under stochastic ambiguity. SIAM J. Optim. 27(2), 1118–1149 (2017)

  37.

    Royset, J.O., Wets, R.J.-B.: On univariate function identification problems. Math. Program. B 168(1–2), 449–474 (2018)

  38.

    Royset, J.O., Wets, R.J.-B.: Variational analysis of constrained M-estimators. arXiv e-prints (2018)

  39.

    Salinetti, G., Wets, R.J.-B.: On the convergence in distribution of measurable multifunctions (random sets), normal integrands, stochastic processes and stochastic infima. Math. Oper. Res. 11(3), 385–419 (1986)

  40.

    Salinetti, G., Wets, R.J.-B.: On the hypo-convergence of probability measures. In: Conti, R., De Giorgi, E., Giannessi, F. (eds.) Optimization and Related Fields, Proceedings, Erice 1984, Lecture Notes in Mathematics, vol. 1190, pp. 371–395. Springer, Berlin (1986)

  41.

    Seijo, E., Sen, B.: Nonparametric least squares estimation of a multivariate convex regression. Ann. Stat. 39, 1633–1657 (2011)

  42.

    Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on Stochastic Programming: Modeling and Theory, 2nd edn. SIAM, Philadelphia (2014)

  43.

    Shapiro, A., Homem-de-Mello, T.: A simulation-based approach to two-stage stochastic programming with recourse. Math. Program. 81, 301–325 (1998)

  44.

    van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer, Berlin (1996). (2nd printing 2000 edition)

  45.

    van de Geer, S.: Empirical Processes in M-Estimation. Cambridge University Press, Cambridge (2000)

  46.

    Wang, J., Huang, H., Luo, Z., Chen, B.: Estimation of covering number in learning theory. In: Proceedings of the Fifth International Conference on Semantics, Knowledge and Grid 2009, pp. 388–391 (2009)

  47.

    Zhang, Z., Yang, X., Oseledets, I.V., Karniadakis, G.E., Daniel, L.: Enabling high-dimensional hierarchical uncertainty quantification by ANOVA and tensor-train decomposition. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 34(1), 63–76 (2015)

  48.

    Zhou, D.-X.: The covering number in learning theory. J. Complex. 18(3), 739–767 (2002)



This work is supported in part by DARPA under Grants HR0011-14-1-0060 and HR0011-8-34187, and by the Office of Naval Research (Science of Autonomy Program) under Grant N00014-17-1-2372.

Author information



Corresponding author

Correspondence to Johannes O. Royset.




Proof of Theorem 4.4

Let \(\rho >0\) and \(F = \{f\in {\text {usc-fcns}}({\mathbb {R}}^n)~|~f(x)\ge -\rho \text{ for at least one } x\in [0,\rho ]^n\}\). We show that F cannot be covered by fewer balls than stipulated. Clearly, \(\mathop {\mathrm{dist}}\nolimits _\infty (0,\mathrm{hypo} \;f) \le \rho \) for all \(f\in F\). Thus, in view of (3), \({\mathbb {d}}(0,f) \le \rho + 1\) for all \(f\in F\), where 0 is the zero function on \({\mathbb {R}}^n\), and F is therefore bounded.

Next, let \(\varepsilon \in (0,\rho e^{-\rho }/6]\). We discretize \([0,\rho ]^n\) by defining \(x_i^k = k \rho /\nu _\varepsilon \), \(k = 1, \ldots , \nu _\varepsilon -1\) and \(i=1, \ldots , n\), where

$$\begin{aligned} \nu _\varepsilon = \left\lfloor \frac{\rho e^{-\rho }}{3\varepsilon }\right\rfloor \ge 2, \end{aligned}$$

with \(\lfloor a \rfloor \) being the largest integer not exceeding a. The discretization of \([0,\rho ]^n\) then contains the points \((x_1^{k_1}, x_2^{k_2}, \ldots , x_n^{k_n})\), with \(k_i \in \{1, 2, \ldots , \nu _\varepsilon -1\}\) and \(i=1, \ldots , n\). Clearly, the distance between any two such points in the sup-norm is at least \(\rho /\nu _\varepsilon \ge 3\varepsilon e^\rho \). We carry out a similar discretization of \([-\rho ,0]\) and define \(y^l = -l \rho / \nu _\varepsilon \), \(l=1, \ldots , \nu _\varepsilon \). The functions that are finite on the discretization points of \([0,\rho ]^n\), with values at each such point equal to \(y^l\) for some l, and have value minus infinity elsewhere are given by \(F_{\varepsilon }\), i.e.,

$$\begin{aligned} F_{\varepsilon } =&\{f\in {\text {usc-fcns}}({\mathbb {R}}^n)~|~ \text{ for } \text{ each } x=(x_1^{k_1}, \ldots , x_n^{k_n}), \\&\text{ with } k_i \in \{1, 2, \ldots , \nu _\varepsilon -1\}, f(x) = y^l\\&\text{ for } \text{ some } l=1, \ldots , \nu _\varepsilon ; f(x) = -\infty \text{ otherwise } \}. \end{aligned}$$
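The grid construction above can be sanity-checked numerically. The following Python sketch (illustrative, with arbitrary sample values \(\rho = 1\), \(\varepsilon = 0.05\); the variable names are ours) confirms that \(\nu _\varepsilon \ge 2\) for \(\varepsilon \) in the admissible range and that the grid spacing \(\rho /\nu _\varepsilon \) is at least \(3\varepsilon e^\rho \).

```python
import math

# Illustrative sanity check of the grid construction (sample values, not from the paper).
rho = 1.0
eps = 0.05                                 # admissible: 0 < eps <= rho * e^{-rho} / 6 ~ 0.0613
assert 0 < eps <= rho * math.exp(-rho) / 6

n_eps = math.floor(rho * math.exp(-rho) / (3 * eps))  # nu_eps = floor(rho e^{-rho} / (3 eps))
assert n_eps >= 2                          # as claimed for eps in the admissible range

spacing = rho / n_eps                      # sup-norm distance between neighboring grid points
assert spacing >= 3 * eps * math.exp(rho)  # spacing is at least 3 * eps * e^rho
```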

Certainly, \(F_{\varepsilon } \subset F\). We next define

$$\begin{aligned} G_\varepsilon (f) = \{g\in {\text {usc-fcns}}({\mathbb {R}}^n)~|~ \hat{\mathbb {d}}_\rho (f,g) \le \varepsilon e^\rho \}, ~ \text{ for } f\in {\text {usc-fcns}}({\mathbb {R}}^n). \end{aligned}$$

We establish that \(G_\varepsilon (f) \cap G_\varepsilon (f') = \emptyset \) for \(f,f'\in F_{\varepsilon }, f\ne f'\). Suppose for the sake of a contradiction that there is a g with \(g\in G_\varepsilon (f)\) and \(g\in G_\varepsilon (f')\) for \(f,f'\in F_\varepsilon \), \(f\ne f'\). Then, \(\hat{\mathbb {d}}_\rho (f,g) \le \varepsilon e^\rho \) and \(\hat{\mathbb {d}}_\rho (f',g) \le \varepsilon e^\rho \). However, since \(f\ne f'\), there exists a point \(x\in [0,\rho ]^n\) with \(|f(x) - f'(x)| \ge 3 \varepsilon e^\rho \). Without loss of generality, suppose that \(f(x) \ge f'(x) + 3\varepsilon e^\rho \). Since \(f(z), f'(z) = -\infty \) for all \(z\ne x\) with \(\Vert z-x\Vert _\infty < 3\varepsilon e^\rho \), we have that \(\hat{\mathbb {d}}_\rho (f,g) \le \varepsilon e^\rho \) implies that \(g(z) \ge f(x) - \varepsilon e^\rho \) for some \(z\in {\mathbb {B}}(x,\varepsilon e^\rho )\). Moreover, \(\hat{\mathbb {d}}_\rho (f',g) \le \varepsilon e^\rho \) implies that \(g(z) \le f'(x) + \varepsilon e^\rho \le f(x) - 3\varepsilon e^\rho + \varepsilon e^\rho = f(x) - 2\varepsilon e^\rho \) for all \(z\in {\mathbb {B}}(x,\varepsilon e^\rho )\). Since this is not possible for g, we have reached a contradiction. Thus, \(G_\varepsilon (f) \cap G_\varepsilon (f') = \emptyset \) for \(f,f'\in F_{\varepsilon }, f\ne f'\).

By Lemma 4.1, for any \(f\in {\text {usc-fcns}}({\mathbb {R}}^n)\),

$$\begin{aligned} {\mathbb {d}}(f,g) \ge e^{-\rho } \hat{\mathbb {d}}_\rho (f,g) > e^{-\rho } \varepsilon e^\rho = \varepsilon \text{ for } \text{ all } g\not \in G_\varepsilon (f). \end{aligned}$$

Hence, for \(f\in F_{\varepsilon }\), a \({\mathbb {d}}\)-ball with radius \(\varepsilon \) that contains f must be centered at some \(g\in G_\varepsilon (f)\). Since the sets \(G_\varepsilon (f)\), \(f\in F_{\varepsilon }\), are nonoverlapping, a cover of \(F_{\varepsilon }\) by \({\mathbb {d}}\)-balls with radius \(\varepsilon \) must involve at least as many balls as there are functions in \(F_{\varepsilon }\), namely \(\nu _\varepsilon ^{m_\varepsilon }\), where \(m_\varepsilon = (\nu _\varepsilon -1)^n\). Thus,

$$\begin{aligned} \log N(F,\varepsilon ) \ge m_\varepsilon \log \nu _\varepsilon = (\nu _\varepsilon -1)^n \log \nu _\varepsilon \ge \left( \frac{\rho e^{-\rho }}{3\varepsilon }-2\right) ^n \log \left( \frac{\rho e^{-\rho }}{3\varepsilon }-1\right) . \end{aligned}$$

Let \(c_1 = |\log (\rho e^{-\rho }/4)|\) and \({{\bar{\varepsilon }}} = \min \{\rho e^{-\rho }/12, e^{-2c_1}\}\). Continuing from (11), we then find that

$$\begin{aligned} \log N(F,\varepsilon ) \ge \left( \frac{\rho e^{-\rho }}{6}\right) ^n \left[ 1+ \frac{\log (\rho e^{-\rho }/4)}{\log \varepsilon ^{-1}} \right] \frac{1}{\varepsilon ^n}\log \frac{1}{\varepsilon }. \end{aligned}$$

Since \(\log \varepsilon ^{-1} \ge 2|\log (\rho e^{-\rho }/4)|\) for \(\varepsilon \in (0, {{\bar{\varepsilon }}}]\), we have that

$$\begin{aligned} \log N(F,\varepsilon ) \ge \left( \frac{\rho e^{-\rho }}{6}\right) ^n \frac{1}{2}\frac{1}{\varepsilon ^n}\log \frac{1}{\varepsilon }~ \text{ for } \varepsilon \in (0, {{\bar{\varepsilon }}}], \end{aligned}$$

and the conclusion is reached. \(\square \)
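The chain of inequalities concluding the proof can be verified numerically for sample values. The Python sketch below (illustrative; \(\rho = 1\) and \(n = 2\) are our arbitrary choices, and the function names are ours) checks that the lower bound \((\rho e^{-\rho }/(3\varepsilon )-2)^n \log (\rho e^{-\rho }/(3\varepsilon )-1)\) indeed dominates the final expression \((\rho e^{-\rho }/6)^n \tfrac{1}{2}\varepsilon ^{-n}\log \varepsilon ^{-1}\) for several \(\varepsilon \in (0,{\bar{\varepsilon }}]\).

```python
import math

def lhs(rho, eps, n):
    """(rho e^{-rho}/(3 eps) - 2)^n * log(rho e^{-rho}/(3 eps) - 1)."""
    a = rho * math.exp(-rho) / (3 * eps)
    return (a - 2) ** n * math.log(a - 1)

def rhs(rho, eps, n):
    """(rho e^{-rho}/6)^n * (1/2) * eps^{-n} * log(1/eps)."""
    return (rho * math.exp(-rho) / 6) ** n * 0.5 * eps ** (-n) * math.log(1 / eps)

# Sample parameters (illustrative choices).
rho, n = 1.0, 2
c1 = abs(math.log(rho * math.exp(-rho) / 4))
eps_bar = min(rho * math.exp(-rho) / 12, math.exp(-2 * c1))  # threshold from the proof

# The intermediate bound dominates the final bound for all admissible eps tested.
for eps in [eps_bar, eps_bar / 10, eps_bar / 100]:
    assert lhs(rho, eps, n) >= rhs(rho, eps, n)
```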


About this article

Royset, J.O. Approximations of semicontinuous functions with applications to stochastic optimization and statistical estimation. Math. Program. 184, 289–318 (2020).



Keywords

  • Hypo-convergence
  • Attouch-Wets distance
  • Approximation theory
  • Solution stability
  • Stochastic optimization
  • Epi-splines
  • Rate of convergence

Mathematics Subject Classification

  • 90C15 Stochastic programming
  • 62G07 Density estimation
  • 62G08 Nonparametric regression
  • 62G15 Tolerance and confidence regions