
Regression analysis: likelihood, error and entropy

Full Length Paper, Series B, Mathematical Programming

Abstract

In a regression with independent and identically distributed normal residuals, the log-likelihood function yields an empirical form of the \(\mathcal{L}^2\)-norm, whereas the normal distribution can be obtained as a solution of differential entropy maximization subject to a constraint on the \(\mathcal{L}^2\)-norm of a random variable. The \(\mathcal{L}^1\)-norm and the double exponential (Laplace) distribution are related in a similar way. These are examples of an “inter-regenerative” relationship. In fact, the \(\mathcal{L}^2\)-norm and the \(\mathcal{L}^1\)-norm are just particular cases of the general error measures introduced by Rockafellar et al. (Finance Stoch 10(1):51–74, 2006) on a space of random variables. General error measures are not necessarily symmetric with respect to ups and downs of a random variable, which is a desirable property in finance applications, where gains and losses should be treated differently. This work identifies the set of all error measures, denoted by \(\mathscr {E}\), and the set of all probability density functions (PDFs) that form “inter-regenerative” relationships (through log-likelihood and entropy maximization). It also shows that M-estimators, which arise in robust regression but, in general, are not error measures, form “inter-regenerative” relationships with all PDFs. In fact, the set of M-estimators that are error measures coincides with \(\mathscr {E}\). On the other hand, M-estimators are a particular case of L-estimators, which also arise in robust regression. The set of L-estimators that are error measures is identified; it contains \(\mathscr {E}\) and the so-called trimmed \(\mathcal{L}^p\)-norms.
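To make the first of these relationships concrete, the following minimal sketch (ours, not part of the paper) checks numerically that, for i.i.d. residuals, the normal log-likelihood is a decreasing function of the empirical \(\mathcal{L}^2\)-norm of the residuals, while the Laplace log-likelihood is a decreasing function of the empirical \(\mathcal{L}^1\)-norm, so that maximum likelihood reduces to least squares and least absolute deviations, respectively. The toy regression model below is an assumption for illustration.

```python
# Illustrative sketch (not from the paper): with i.i.d. residuals, the
# normal log-likelihood decreases in the empirical L^2-norm of the
# residuals, and the Laplace log-likelihood in the empirical L^1-norm,
# so MLE = least squares and MLE = least absolute deviations, respectively.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)  # toy linear model
r = y - (1.0 + 2.0 * x)                              # residuals at the true fit
n = len(r)

l2 = np.sqrt(np.mean(r**2))          # empirical L^2-norm of residuals
l1 = np.mean(np.abs(r))              # empirical L^1-norm of residuals

normal_ll = -0.5 * np.sum(r**2) - 0.5 * n * np.log(2 * np.pi)  # N(0,1) errors
laplace_ll = -np.sum(np.abs(r)) - n * np.log(2.0)              # Laplace(0,1) errors

assert np.isclose(normal_ll, -0.5 * n * l2**2 - 0.5 * n * np.log(2 * np.pi))
assert np.isclose(laplace_ll, -n * l1 - n * np.log(2.0))
```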


Notes

  1. The least squares method was used, although without proof, by Legendre in 1805 [28], see [17].

  2. The idea to minimize the sum of the absolute deviations of error residuals was first proposed by Boscovich in 1757 [4], see [17].

  3. Rockafellar et al. [38, 39] proposed a unifying axiomatic framework for general measures of error, deviation and risk, all of which are positively homogeneous convex functionals defined on a space of r.v.’s, see also [34, 37]. More recently, Grechuk and Zabarankin [15] analyzed the sensitivity of optimal values of positively homogeneous convex functionals in various optimization problems, including linear regression, to noise in the data.

  4. We assume that 0 ln 0 = 0.

  5. A deviation measure is a functional \(\mathcal{D}:\mathcal{L}^r(\Theta )\rightarrow [0,\infty ]\) satisfying axioms E2–E4 and such that \(\mathcal{D}(Z) = 0\) for constant Z, and \(\mathcal{D}(Z) > 0\) otherwise [38]. A deviation measure is called law-invariant if \(\mathcal{D}(X) = \mathcal{D}(Y)\) whenever r.v.’s X and Y have the same distribution [12].
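For orientation, a quick numerical illustration of this definition (ours, not the footnote's): the standard deviation is the prototypical law-invariant deviation measure, vanishing exactly on constants, positively homogeneous, subadditive, and insensitive to constant shifts.

```python
# Sanity check (ours): the standard deviation behaves as a law-invariant
# deviation measure on samples -- zero on constants, positively
# homogeneous, subadditive, and unchanged by constant shifts.
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=100_000)
Y = rng.exponential(size=100_000)

assert np.isclose(np.std(np.full(100, 3.0)), 0.0)        # D(constant) = 0
assert np.isclose(np.std(2.5 * Z), 2.5 * np.std(Z))      # positive homogeneity
assert np.std(Z + Y) <= np.std(Z) + np.std(Y) + 1e-12    # subadditivity
assert np.isclose(np.std(Z + 7.0), np.std(Z))            # shift invariance
```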

References

  1. Alfons, A., Croux, C., Gelper, S.: Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat. 7(1), 226–248 (2013)


  2. Bartolucci, F., Scaccia, L.: The use of mixtures for dealing with non-normal regression errors. Comput. Stat. Data Anal. 48(4), 821–834 (2005)


  3. Bernholt, T.: Computing the least median of squares estimator in time \(O(n^d)\). In: International Conference on Computational Science and Its Applications, pp. 697–706. Springer (2005)

  4. Boscovich, R.J.: De litteraria expeditione per pontificiam ditionem, et synopsis amplioris operis, ac habentur plura ejus ex exemplaria etiam sensorum impressa. Bononiensi Scientarum et Artum Instituto Atque Academia Commentarii 4, 353–396 (1757)


  5. Box, G.: Non-normality and tests on variances. Biometrika 40, 318–335 (1953)


  6. Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New York (2012)


  7. Edgeworth, F.: On observations relating to several quantities. Hermathena 6(13), 279–285 (1887)


  8. Efron, B.: Regression percentiles using asymmetric squared error loss. Stat. Sin. 1(1), 93–125 (1991)


  9. Föllmer, H., Schied, A.: Stochastic Finance, 3rd edn. de Gruyter, Berlin (2011)


  10. Gauss, C.F.: Theoria motus corporum coelestium in sectionibus conicis solem ambientium. sumtibus Frid. Perthes et IH Besser (1809)

  11. Grechuk, B., Molyboha, A., Zabarankin, M.: Maximum entropy principle with general deviation measures. Math. Oper. Res. 34(2), 445–467 (2009)


  12. Grechuk, B., Molyboha, A., Zabarankin, M.: Chebyshev inequalities with law-invariant deviation measures. Probab. Eng. Inf. Sci. 24(1), 145–170 (2010)


  13. Grechuk, B., Zabarankin, M.: Schur convex functionals: Fatou property and representation. Math. Finance 22(2), 411–418 (2012)


  14. Grechuk, B., Zabarankin, M.: Inverse portfolio problem with mean-deviation model. Eur. J. Oper. Res. 234(2), 481–490 (2014)


  15. Grechuk, B., Zabarankin, M.: Sensitivity analysis in applications with deviation, risk, regret, and error measures. SIAM J. Optim. 27(4), 2481–2507 (2017)


  16. Gu, Y., Zou, H.: High-dimensional generalizations of asymmetric least squares regression and their applications. Ann. Stat. 44(6), 2661–2694 (2016)


  17. Harter, L.: The method of least squares and some alternatives: Part I. International Statistical Review/Revue Internationale de Statistique, 147–174 (1974)

  18. Hosking, J., Balakrishnan, N.: A uniqueness result for L-estimators, with applications to L-moments. Stat. Methodol. 24, 69–80 (2015)


  19. Huber, P.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)


  20. Huber, P.: Robust Statistics. Wiley, New York (1981)


  21. Jaynes, E.T.: Information theory and statistical mechanics (notes by the lecturer). Stat. Phys. 3 1, 181 (1963)


  22. Jouini, E., Schachermayer, W., Touzi, N.: Law invariant risk measures have the Fatou property. Adv. Math. Econ. 9, 49–71 (2006)


  23. Koenker, R., Bassett Jr., G.: Regression quantiles. Econ. J. Econ. Soc. 46(1), 33–50 (1978)


  24. Krokhmal, P.: Higher moment coherent risk measures. Quant. Finance 7(4), 373–387 (2007)


  25. Kullback, S., Leibler, R.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)


  26. Laplace, P.S.: Traité de mécanique céleste, vol. 2. J. B. M. Duprat, Paris (1799)


  27. Lee, W.M., Hsu, Y.C., Kuan, C.M.: Robust hypothesis tests for M-estimators with possibly non-differentiable estimating functions. Econom. J. 18(1), 95–116 (2015)


  28. Legendre, A.M.: Nouvelles méthodes pour la détermination des orbites des comètes. 1. F. Didot, Paris (1805)


  29. Lisman, J., Van Zuylen, M.: Note on the generation of most probable frequency distributions. Stat. Neerl. 26(1), 19–23 (1972)


  30. Loh, P.L.: Statistical consistency and asymptotic normality for high-dimensional robust \(M\)-estimators. Ann. Stat. 45(2), 866–896 (2017)


  31. Mafusalov, A., Uryasev, S.: CVaR (superquantile) norm: stochastic case. Eur. J. Oper. Res. 249(1), 200–208 (2016)


  32. Morales-Jimenez, D., Couillet, R., McKay, M.: Large dimensional analysis of robust M-estimators of covariance with outliers. IEEE Trans. Signal Process. 63(21), 5784–5797 (2015)


  33. Mount, D., Netanyahu, N., Piatko, C., Silverman, R., Wu, A.: On the least trimmed squares estimator. Algorithmica 69(1), 148–183 (2014)


  34. Rockafellar, R.T., Royset, J.: Measures of residual risk with connections to regression, risk tracking, surrogate models, and ambiguity. SIAM J. Optim. 25(2), 1179–1208 (2015)


  35. Rockafellar, R.T., Royset, J.: Random variables, monotone relations, and convex analysis. Math. Program. 148(1–2), 297–331 (2014)


  36. Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance 26(7), 1443–1471 (2002)


  37. Rockafellar, R.T., Uryasev, S.: The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surv. Oper. Res. Manag. Sci. 18(1), 33–53 (2013)


  38. Rockafellar, R.T., Uryasev, S., Zabarankin, M.: Generalized deviations in risk analysis. Finance Stoch. 10(1), 51–74 (2006)


  39. Rockafellar, R.T., Uryasev, S., Zabarankin, M.: Risk tuning with generalized linear regression. Math. Oper. Res. 33(3), 712–729 (2008)


  40. Rousseeuw, P., Leroy, A.: Robust Regression and Outlier Detection, vol. 589. Wiley, New York (2005)


  41. Rousseeuw, P., Van Driessen, K.: Computing LTS regression for large data sets. Data Min. Knowl. Disc. 12(1), 29–45 (2006)


  42. Rousseeuw, P.: Least median of squares regression. J. Am. Stat. Assoc. 79, 871–880 (1984)


  43. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656 (1948)

  44. Xie, S., Zhou, Y., Wan, A.: A varying-coefficient expectile model for estimating value at risk. J. Bus. Econ. Stat. 32(4), 576–592 (2014)


  45. Zabarankin, M., Uryasev, S.: Statistical Decision Problems: Selected Concepts and Portfolio Safeguard Case Studies. Springer, Berlin (2014)



Acknowledgements

We are grateful to the referees for their comments and suggestions, which helped improve the quality of the paper. The first author thanks the University of Leicester for granting him academic study leave to carry out this research.


Corresponding author

Correspondence to Michael Zabarankin.

Appendix A: Proofs of Propositions 1–6


1.1 Appendix A.1: Proof of Proposition 1

Since \(\mathcal{E}(Z)\) assumes all values in \([0,+\infty )\), the range of h is \([0,+\infty )\); hence h is continuous and \(h(0)=0\). This implies that h has a strictly increasing continuous inverse function \(h^{-1}:\mathbb {R}^+\rightarrow \mathbb {R}^+\), and

$$\begin{aligned} h^{-1}(\mathcal{E}(Z))=h^{-1}[h(\mathbb {E}[\rho (Z)])]=\mathbb {E}[\rho (Z)]. \end{aligned}$$

For constant \(Z=t\geqslant 0\),

$$\begin{aligned} \rho (t)=\mathbb {E}[\rho (t)]=h^{-1}(\mathcal{E}(t))=h^{-1}(|t|\mathcal{E}(1)). \end{aligned}$$

Similarly, \(\rho (t)=h^{-1}(|t|\mathcal{E}(-1))\) for \(t\leqslant 0\). Consequently, in general,

$$\begin{aligned} \rho (t)=h^{-1}\left( a\,[t]_+ +b\,[t]_-\right) , \end{aligned}$$

where \(a=\mathcal{E}(1)>0\) and \(b=\mathcal{E}(-1)>0\). Thus,

$$\begin{aligned} \mathcal{E}(Z)=\varphi ^{-1}\left( \mathbb {E}\left[ \varphi \left( \,a\,[Z]_++b\,[Z]_-\,\right) \right] \right) , \end{aligned}$$
(28)

where \(\varphi =h^{-1}\).

Since \(\Theta =(\Omega , \mathcal{M}, \mathbb {P})\) is non-trivial, there exists an event \(A\in \mathcal{M}\) such that \(p=\mathbb {P}[A]\in (0,1)\). For any non-negative constants c and d, let Z be an r.v. assuming values \(Z(\omega )=c/a\geqslant 0\) and \(Z(\omega )=d/a\geqslant 0\) for \(\omega \in A\) and \(\omega \not \in A\), respectively. Then

$$\begin{aligned} \begin{aligned} \varphi ^{-1}\left[ p \varphi (\lambda c) + (1-p)\varphi (\lambda d)\right]&= \mathcal{E}(\lambda \,Z) = \lambda \,\mathcal{E}(Z) \\&= \lambda \varphi ^{-1}\left[ p \varphi (c) + (1-p)\varphi (d)\right] \end{aligned} \end{aligned}$$
(29)

for any \(\lambda \geqslant 0\). Replacing c and d by \(\varphi ^{-1}(c)\) and \(\varphi ^{-1}(d)\), respectively, and applying \(\varphi (\cdot )\) to the left-hand and right-hand parts of (29), we obtain

$$\begin{aligned} p \varphi (\lambda \varphi ^{-1}(c)) + (1-p)\varphi (\lambda \varphi ^{-1}(d)) = \varphi (\lambda \varphi ^{-1}(pc + (1-p)d)). \end{aligned}$$

Consequently, the function \(g(x)=\varphi (\lambda \varphi ^{-1}(x))\) satisfies

$$\begin{aligned} pg(c)+(1-p)g(d)=g(pc + (1-p)d) \quad \forall c,d\geqslant 0. \end{aligned}$$
(30)

Let

$$\begin{aligned} \mathcal{A}=\{a\in [0,1] \, : \, a g(c) + (1-a) g(d) = g(a c + (1-a)d) \,\, \forall c,d\geqslant 0 \}. \end{aligned}$$

By definition, \(0\in \mathcal{A}\) and \(1\in \mathcal{A}\). Also, (30) implies that \(p a + (1-p)b \in \mathcal{A}\) whenever \(a,b\in \mathcal{A}\), hence \(\mathcal{A}\) is a dense subset of [0, 1]. Finally, \(\mathcal{A}\) is closed due to continuity of g, so that \(\mathcal{A}=[0,1]\), and g is a linear function. Since \(g(0)=\varphi (\lambda \varphi ^{-1}(0))=0\), there exists a constant \(C(\lambda )\) such that

$$\begin{aligned} \varphi (\lambda \varphi ^{-1}(x))=g(x)=C(\lambda )x \quad \forall x, \lambda \geqslant 0. \end{aligned}$$
(31)

Setting \(x=\varphi (y)\) in (31), we obtain

$$\begin{aligned} \varphi (\lambda y)=C(\lambda )\varphi (y) \quad \forall y, \lambda \geqslant 0. \end{aligned}$$
(32)

Then setting \(y=1\) in (32), we obtain \(\varphi (\lambda )=C(\lambda )\varphi (1)\). Consequently, \(C(\lambda )=\varphi (\lambda )/\varphi (1)\), and (32) takes the form \(\varphi (\lambda y)=\varphi (\lambda )\varphi (y)/\varphi (1)\quad \forall y, \lambda \geqslant 0\). For the function

$$\begin{aligned} g(x)=\log \frac{\varphi (e^x)}{\varphi (1)}, \end{aligned}$$

this implies that

$$\begin{aligned} g(x+y)= \log \frac{\varphi (e^{x+y})}{\varphi (1)} = \log \frac{\varphi (e^{x})\varphi (e^{y})}{\varphi (1)^2}=g(x)+g(y). \end{aligned}$$

Since g is additive, continuous, and \(g(0)=0\), it is linear, i.e., \(g(x)=px\) for some constant p. Consequently, \(e^{px}=e^{g(x)}=\varphi (e^x)/\varphi (1)\). Finally, with \(e^x=y\), we obtain \(\varphi (y)=\varphi (1)y^p\), and (28) simplifies to

$$\begin{aligned} \mathcal{E}(Z)=\left( \mathbb {E}\left[ \left( a\,[Z]_+ +b\,[Z]_-\right) ^p\right] \right) ^{1/p}. \end{aligned}$$

The condition \(p\geqslant 1\) follows from sub-additivity of \(\mathcal{E}\).
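As a numerical companion to Proposition 1 (our sketch, not part of the proof): the representation just derived is the asymmetric \(\mathcal{L}^p\)-error \(\mathcal{E}(Z)=\left( \mathbb {E}\left[ (a[Z]_+ + b[Z]_-)^p\right] \right) ^{1/p}\). The empirical version below checks positive homogeneity and subadditivity for \(p\geqslant 1\), and probes how subadditivity can fail for \(p<1\); the parameter values are arbitrary.

```python
# Empirical check (ours) of the representation derived in Proposition 1:
# E(Z) = (E[(a[Z]_+ + b[Z]_-)^p])^(1/p), subadditive exactly when p >= 1.
import numpy as np

def err(z, a=1.0, b=2.0, p=1.5):
    zab = a * np.maximum(z, 0.0) + b * np.maximum(-z, 0.0)
    return np.mean(zab**p) ** (1.0 / p)

rng = np.random.default_rng(2)
X = rng.normal(size=10_000)
Y = rng.exponential(size=10_000)

assert np.isclose(err(3.0 * X), 3.0 * err(X))        # positive homogeneity
assert err(X + Y) <= err(X) + err(Y) + 1e-12         # subadditivity, p = 1.5
# For p < 1 the triangle inequality may fail:
print(err(X + Y, p=0.5) <= err(X, p=0.5) + err(Y, p=0.5))
```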

1.2 Appendix A.2: Proof of Proposition 2

Proposition 4.7 (b) in [11] implies that if \(Z^*\in \mathcal{C}^1(\Theta )\) has a log-concave PDF, then it is a solution to

$$\begin{aligned} \max _{Z\in \mathcal{C}^1(\Theta )} S(Z)\quad \text {subject to}\quad \mathbb {E}[Z]=\mu , \quad \mathcal{D}(Z)\leqslant 1, \end{aligned}$$
(33)

for \(\mu =\mathbb {E}[Z^{*}]\) and some law-invariant deviation measure \(\mathcal{D}\) (see footnote 5). Hence \(Z^*\) is a solution to (13) with \(\mathcal{X}=\{Z\in \mathcal{C}^1(\Theta )\,|\,\mathbb {E}[Z]=\mu ,\,\mathcal{D}(Z)\leqslant 1\}\).

Conversely, let \(Z^*\in \mathcal{C}^1(\Theta )\) be a solution to (13) for some convex closed law-invariant set \(\mathcal{X}\). Then it is a solution to (33) for the deviation measure

$$\begin{aligned} \mathcal{D}(Z)=\sup \limits _{\alpha \in [0,1]}\frac{\mathrm{CVaR}_\alpha ^\Delta (Z)}{\mathrm{CVaR}_\alpha ^\Delta (Z^*)} \quad \hbox { for all}\ Z\in \mathcal{L}^1(\Theta ), \end{aligned}$$
(34)

where

$$\begin{aligned} \mathrm{CVaR}_\alpha ^\Delta (Z)\equiv \mathbb {E}[Z]-\frac{1}{\alpha }\int \nolimits _{0}^{\alpha }q_Z(s)\,ds, \quad \alpha \in (0,1), \end{aligned}$$

\(\mathrm{CVaR}_{0}^\Delta (Z)=\mathbb {E}[Z]-\inf Z\) and \(\mathrm{CVaR}_{1}^\Delta (Z)=\sup Z - \mathbb {E}[Z]\), see [14]. Indeed, if an r.v. Z satisfies the constraints in (33) with \(\mathcal{D}\) given by (34), then \(\mathbb {E}[Z]=\mu =\mathbb {E}[Z^*]\), and \(\mathrm{CVaR}_\alpha ^\Delta (Z)\leqslant \mathrm{CVaR}_\alpha ^\Delta (Z^*)\) for all \(\alpha \in [0,1]\), so that Z dominates \(Z^*\) with respect to concave ordering, see Proposition 1 in [14]. Since \(Z^*\) has a PDF, the underlying probability space \(\Theta \) is, by definition, atomless, and the implication “(a) \(\Rightarrow \) (d)” of Corollary 2.61 in [9] along with Lemma 4.2 in [22] implies that \(Z \in \mathcal{X}\). Since \(Z^*\in \mathcal{C}^1(\Theta )\) is a solution to (13), this yields \(S(Z^*)\geqslant S(Z)\), and consequently, \(Z^*\) is a solution to (33). Thus, \(Z^*\) has a log-concave PDF by Proposition 4.11 in [11].
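A small empirical sketch (ours) of the building block in (34): \(\mathrm{CVaR}_\alpha ^\Delta \) can be estimated from a sample by averaging the lower \(\alpha \)-tail of the order statistics; the sample size and seed below are arbitrary.

```python
# Empirical estimate (ours) of CVaR^Delta_alpha(Z) = E[Z] - (1/alpha) *
# integral_0^alpha q_Z(s) ds, using order statistics as empirical quantiles.
import numpy as np

def cvar_delta(z, alpha):
    z = np.sort(z)                      # empirical quantile function q_Z
    if alpha <= 0.0:
        return z.mean() - z[0]          # E[Z] - inf Z
    if alpha >= 1.0:
        return z[-1] - z.mean()         # sup Z - E[Z]
    k = max(int(np.floor(alpha * len(z))), 1)
    return z.mean() - z[:k].mean()      # lower-tail average approximates the integral

rng = np.random.default_rng(3)
Z = rng.normal(size=100_000)
print(cvar_delta(Z, 0.05))              # ~2.06 for the standard normal
```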

1.3 Appendix A.3: Proof of Proposition 3

If \(Z^*\in \mathcal{C}^1(\Theta )\) has a log-concave PDF, then it is a solution to (33) for some law-invariant deviation measure \(\mathcal{D}\). On the other hand, Proposition 5.1 in [45] shows that problem (33) is equivalent to (14) with an error measure \(\mathcal{E}\) such that \(\mathcal{D}(Z)=\inf _{C\in \mathbb {R}} \mathcal{E}(Z-C)\), i.e., \(\mathcal{D}\) is the deviation measure projected from \(\mathcal{E}\). In general, for a given deviation measure \(\mathcal{D}\), such an error measure is non-unique and can be determined by

$$\begin{aligned} \mathcal{E}(Z)=\frac{1}{1+\mu }\left( \mathcal{D}(Z)+|\mathbb {E}[Z]|\right) , \end{aligned}$$
(35)

which is called inverse projection of \(\mathcal{D}\), see [39]. Thus, \(Z^*\) is a solution to (14) with (35).

Conversely, let \(Z^*\in \mathcal{C}^1(\Theta )\) be a solution to (14) for some law-invariant error measure \(\mathcal{E}\). Then positive homogeneity of \(\mathcal{E}\) and the relation \(S(kZ)=S(Z)+\ln k\), \(k>0\), imply that \(Z^*\) is also a solution to

$$\begin{aligned} \max _{Z\in \mathcal{L}^r(\Theta )} S(Z)\quad \text {subject to}\quad \mathcal{E}(Z)\leqslant 1. \end{aligned}$$

Since \(\{Z\,|\, \mathcal{E}(Z)\leqslant 1\}\) is a convex closed law-invariant set, \(Z^*\) has a log-concave PDF by Proposition 2.
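To illustrate the projection \(\mathcal{D}(Z)=\inf _{C\in \mathbb {R}} \mathcal{E}(Z-C)\) used above, here is a sketch (ours) with the asymmetric \(\mathcal{L}^1\)-error: a one-dimensional scan over C recovers the known fact that the minimizer is the \(a/(a+b)\)-quantile of Z (the Koenker–Bassett quantile observation).

```python
# Sketch (ours): projecting an error measure onto a deviation measure via
# D(Z) = inf_C E(Z - C), with the asymmetric L^1 error
# E(Z) = E[a[Z]_+ + b[Z]_-]; the minimizing C is the a/(a+b)-quantile.
import numpy as np

def asym_l1(z, a=1.0, b=2.0):
    return np.mean(a * np.maximum(z, 0.0) + b * np.maximum(-z, 0.0))

rng = np.random.default_rng(4)
Z = rng.normal(size=50_000)

grid = np.linspace(-3.0, 3.0, 2001)
c_star = grid[int(np.argmin([asym_l1(Z - c) for c in grid]))]
print(c_star, np.quantile(Z, 1.0 / 3.0))   # both ~ -0.43 for a=1, b=2
```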

1.4 Appendix A.4: Proof of Proposition 4

If \(\mathcal{E}\) and f satisfy the conditions of Proposition 4, then \(\mathcal{E}\) and \(\rho (t) = -\log (f(t))\) satisfy the conditions of Proposition 1. Consequently, \(\rho \) has the form in (12), which implies that \(f(t)=e^{-\rho (t)}\) has the form of (2b).

1.5 Appendix A.5: Proof of Proposition 5

Since h is strictly increasing, problem (8) with \(\mathcal{E}^*\) is equivalent to minimizing \(\mathbb {E}[\rho ^*(Z)]\) or to maximizing \(\mathbb {E}[\ln (f^*(Z))]\). For an r.v. Z such that \(\mathbb {P}[Z=z_i]=1/n\), \(i=1,\dots ,n\), it reduces to (6).

With \(c=h\left( - \int _{-\infty }^\infty f^*(t)\ln f^*(t)\,dt\right) \), the constraint \(\mathcal{E}^*(Z)= c\) in (19) simplifies to

$$\begin{aligned} \int _{-\infty }^\infty f(t)\ln f^*(t)\,dt = \int _{-\infty }^\infty f^*(t)\ln f^*(t)\,dt, \end{aligned}$$

which holds for \(f=f^*\), and for any \(f \ne f^*\) it implies that

$$\begin{aligned} -\int _{-\infty }^\infty f(t)\ln f(t)\,dt\leqslant & {} -\int _{-\infty }^\infty f(t)\ln f^*(t)\,dt \\= & {} -\int _{-\infty }^\infty f^*(t)\ln f^*(t)\,dt, \end{aligned}$$

where the inequality follows from the non-negativity of the relative entropy (the Kullback–Leibler divergence between f and \(f^*\)), defined as \(D_{KL}(f\,\Vert \,f^*)=\int _{-\infty }^\infty f(t)\ln \frac{f(t)}{f^*(t)}\,dt \geqslant 0\), see [25].
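The non-negativity used in this last step is easy to confirm numerically; the quadrature sketch below (ours) computes \(D_{KL}(f\,\Vert \,f^*)\) for f standard normal and \(f^*\) standard Laplace.

```python
# Numerical check (ours) of D_KL(f||f*) >= 0 with f = N(0,1) and
# f* = Laplace(0,1), via a Riemann sum on a wide grid.
import numpy as np

t = np.linspace(-20.0, 20.0, 400_001)
f = np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)   # standard normal PDF
f_star = 0.5 * np.exp(-np.abs(t))                # standard Laplace PDF

kl = np.sum(f * np.log(f / f_star)) * (t[1] - t[0])
print(kl)   # ~0.072 > 0
```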

1.6 Appendix A.6: Proof of Proposition 6

We first prove the “if” part in (a) and (b). If \(\mathcal{E}\) is a particular case of (2a), it is an error measure that can be represented in the form of (11), which is (21) with M being the Lebesgue measure on (0, 1), and the “if” part in (a) follows. If \(\mathcal{E}\) is a particular case of (25), then it can be represented in the form of (23) with \(M(c,d)=\int _c^d w(\alpha ) \, d\alpha \), \(0\leqslant c<d\leqslant 1\), \(\rho (t)=t_{a,b}^p\), and \(h(x)=x^{1/p}\). For \(Z\ne 0\), \(q_{Z_{a,b}}^p(\alpha )\) is a non-negative non-decreasing function with \(\int _0^1 q_{Z_{a,b}}^p(\alpha ) \,d\alpha > 0\), so that \(L=\lim \limits _{\alpha \rightarrow 1} q_{Z_{a,b}}^p(\alpha ) > 0\), and we claim that

$$\begin{aligned} I=\int _0^1 w(\alpha )\,q_{Z_{a,b}}^p(\alpha )\,d\alpha > 0. \end{aligned}$$
(36)

Indeed, if \(w(\alpha )\) is a delta function at 1, (36) reduces to \(I=L>0\). Otherwise \(\lim \limits _{\alpha \rightarrow 1} w(\alpha ) > 0\), hence \(w(\alpha ^*)>0\) and \(q_{Z_{a,b}}^p(\alpha ^*)>0\) for some \(\alpha ^*<1\), and \(I \geqslant \int _{\alpha ^*}^1 w(\alpha ^*)q_{Z_{a,b}}^p(\alpha ^*)\,d\alpha = (1-\alpha ^*)w(\alpha ^*)q_{Z_{a,b}}^p(\alpha ^*) > 0\).

Inequality \(I>0\) implies that \(\mathcal{E}(Z)\) is well-defined and satisfies E1. Property E2 is obvious, whereas E4 is proved for \(w(\alpha )=1\) in [38, Proposition 6], and the general case holds by a similar argument. Next, we claim that

$$\begin{aligned} \mathcal{E}(X+Y) \leqslant \left( \int _0^1 w(\alpha )\,(q_{X_{a,b}}+q_{Y_{a,b}})^p(\alpha )\,d\alpha \right) ^{1/p} \leqslant \mathcal{E}(X) + \mathcal{E}(Y) \end{aligned}$$
(37)

holds for all \(X,Y \in \mathcal{L}^r(\Theta )\). Indeed, the second inequality in (37) is a triangle inequality for the \(\mathcal{L}^p[0,1]\)-norm, and the first one states that

$$\begin{aligned} \int _0^1 w(\alpha )\,f(\alpha )\,d\alpha \leqslant \int _0^1 w(\alpha )\,g(\alpha )\,d\alpha \end{aligned}$$
(38)

for \(f(\alpha )=q_{(X+Y)_{a,b}}^p(\alpha )\) and \(g(\alpha )=(q_{X_{a,b}}(\alpha )+q_{Y_{a,b}}(\alpha ))^p\).

If \(f, g \in \mathcal{L}^r[0,1]\) are such that (38) holds for any non-negative non-decreasing \(w\in \mathcal{L}^1[0,1]\), we write \(g \succcurlyeq f\). The relation \(\succcurlyeq \) is

  (i) transitive;

  (ii) monotone, in the sense that \(f_1(\alpha ) \geqslant f_2(\alpha )\) \(\forall \alpha \in [0,1]\) implies that \(f_1 \succcurlyeq f_2\);

  (iii) \(q_{X}(\alpha ) + q_{Y}(\alpha ) \succcurlyeq q_{X+Y}(\alpha )\) for any r.v.’s \(X,Y \in \mathcal{L}^r(\Theta )\), due to sub-additivity of the functional \(\mathcal{F}(Z) = \int _0^1 w(\alpha ) \, q_Z(\alpha ) \, d\alpha \), see [13, Proposition 4.3];

  (iv) \(f_1 \succcurlyeq f_2\) is equivalent to \(\int _c^1 f_1(\alpha )\,d\alpha \geqslant \int _c^1 f_2(\alpha )\,d\alpha \) for all \(c\in (0,1)\), which, in turn, is equivalent to \(\int _0^1 u(f_1(\alpha ))\,d\alpha \geqslant \int _0^1 u(f_2(\alpha ))\,d\alpha \) for all convex increasing u, see [35, Theorem 8]; and

  (v) \(f_1 \succcurlyeq f_2\) implies that \(u(f_1) \succcurlyeq u(f_2)\) for any convex increasing function u, which follows from (iv) and the fact that the superposition of two convex increasing functions is convex increasing.

Properties (i)–(iii) imply that

$$\begin{aligned} q_{X_{a,b}}+q_{Y_{a,b}} \succcurlyeq q_{X_{a,b}+Y_{a,b}} \succcurlyeq q_{(X+Y)_{a,b}}, \end{aligned}$$

and since the function \(\xi (z)=z^p\) is convex increasing for \(z\geqslant 0\), (38) follows from (v). This finishes the proof of the “if” part in (b).
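Before turning to the “only if” part, here is an empirical sketch (ours) of the L-estimator-type error measure in (25), with quantiles approximated by order statistics; the non-decreasing weight \(w(u)=2u\) and the other parameters are illustrative choices.

```python
# Empirical sketch (ours) of the error measure in (25):
# E(Z) = ( integral_0^1 w(alpha) q_{Z_{a,b}}(alpha)^p d alpha )^(1/p),
# with a non-decreasing weight w; order statistics serve as quantiles.
import numpy as np

def l_err(z, a=1.0, b=1.5, p=2.0, w=lambda u: 2.0 * u):
    zab = a * np.maximum(z, 0.0) + b * np.maximum(-z, 0.0)
    q = np.sort(zab)                              # empirical quantiles of Z_{a,b}
    alphas = (np.arange(len(q)) + 0.5) / len(q)   # midpoint grid on (0,1)
    return np.mean(w(alphas) * q**p) ** (1.0 / p)

rng = np.random.default_rng(5)
X = rng.normal(size=20_000)
Y = rng.normal(size=20_000)
print(l_err(X + Y) <= l_err(X) + l_err(Y))   # expected True (subadditivity)
```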

Now we prove the “only if” part. Let \(\mathcal{E}\) be an error measure that can be represented in the form of either (21) or (23). Since \(\mathcal{E}(Z)\) assumes all values in \([0,+\infty )\), \(h\) is a strictly increasing continuous function with \(h(0)=0\) and has a strictly increasing continuous inverse function \(h^{-1}:\mathbb {R}^+\rightarrow \mathbb {R}^+\). Applying \(h^{-1}\) to both parts of either (21) or (23) and setting \(Z=t\), we obtain

$$\begin{aligned} h^{-1}(\mathcal{E}(t)) = \int _0^1 \rho (t) M(d\alpha ) = \rho (t) M(0,1), \quad t \in {{\mathbb {R}}}. \end{aligned}$$

Consequently, \(M(0,1)\ne 0\) and \(\rho (t) = \frac{1}{M(0,1)}h^{-1}(\mathcal{E}(t))\). If M and \(\rho \) are replaced by \(-M\) and \(-\rho \), respectively, then \(\mathcal{E}\) in (21) remains unchanged. Consequently, without loss of generality, we may assume that \(M(0,1)>0\). Positive homogeneity of \(\mathcal{E}\) implies that

$$\begin{aligned} \rho (t)=\frac{1}{M(0,1)}\varphi \left( t_{a,b}\right) , \end{aligned}$$

where \(\varphi =h^{-1}\), \(t_{a,b}\) is given by (3), \(a=\mathcal{E}(1)>0\), and \(b=\mathcal{E}(-1)>0\). In particular, both (21) and (23) imply that

$$\begin{aligned} \mathcal{E}(Z)=\varphi ^{-1}\left( \frac{1}{M(0,1)}\int _0^1 q_{\varphi \left( aZ\right) }(\alpha )\,M(d\alpha )\right) , \quad Z\geqslant 0, \end{aligned}$$
(39)

where we used \(q_{\varphi \left( aZ\right) }(\alpha )=\varphi (q_{aZ}(\alpha ))\).

If \(M(0,\alpha )=0\) for all \(\alpha <1\), (21) reduces to \(\mathcal{E}(Z)= a\,[\sup \, Z]_+ +b\,[\sup \, Z]_-\), which is not an error measure (property E1 fails), whereas (23) simplifies to \(\mathcal{E}(Z)=\sup (Z_{a,b})\), which is a particular case of (25) with w being the Dirac delta function at 1. Otherwise there exists \(\alpha \in (0,1)\) such that \(q=M(0,\alpha )/M(0,1)>0\). Since \(\Theta \) is atomless, there exists an event \(A\in \Theta \) with \(\mathbb {P}[A]=\alpha \). Let \(0 \leqslant c \leqslant d\), and let Z be an r.v. such that \(Z(\omega )=c/a\) for \(\omega \in A\) and \(Z(\omega )=d/a\) for \(\omega \not \in A\). Then (39) implies that

$$\begin{aligned} \begin{aligned} \varphi ^{-1}\left[ q \varphi (\lambda c) + (1-q)\varphi (\lambda d)\right]&= \mathcal{E}(\lambda \,Z)= \lambda \,\mathcal{E}(Z) \\&= \lambda \varphi ^{-1}\left[ q \varphi (c) + (1-q)\varphi (d)\right] \end{aligned} \end{aligned}$$
(40)

for any \(\lambda \geqslant 0\). Expression (40) coincides with (29), and the proof of Proposition 1 implies that \(\varphi \) must have the form \(\varphi (y)=\varphi (1)y^p\), \(p>0\). Consequently,

$$\begin{aligned} h(z)=\left( \frac{z}{\varphi (1)} \right) ^{1/p} = h(1) z^{1/p}, \end{aligned}$$
(41)

and

$$\begin{aligned} \rho (t)=\frac{\varphi (1)}{M(0,1)}t_{a,b}^p. \end{aligned}$$
(42)

In particular, (39) simplifies to

$$\begin{aligned} \mathcal{E}(Z)=\left( \frac{a^p}{M(0,1)}\int _0^1 q_Z(\alpha )^p\,M(d\alpha )\right) ^{1/p}, \quad Z\geqslant 0. \end{aligned}$$
(43)

Let \(0=\alpha _0\leqslant \alpha _1<\alpha _2<\alpha _3\leqslant \alpha _4=1\) be such that \(\alpha _2-\alpha _1=\alpha _3-\alpha _2\), and let

$$\begin{aligned} M_i=\frac{1}{M(0,1)}\int _{\alpha _{i-1}}^{\alpha _i} M(d\alpha ),\qquad i=1,2,3,4. \end{aligned}$$

Since \(\Theta \) is atomless, there exist events \(A,B \in \mathcal{M}\) such that \(\mathbb {P}[A]=\mathbb {P}[B]=\alpha _2\) and \(\mathbb {P}[A \cap B]=\alpha _1\). Subadditivity of \(\mathcal{E}\) implies that

$$\begin{aligned} \left[ \mathcal{E}\left( 1+\epsilon I_{\Omega \setminus A}\right) + \mathcal{E}\left( 1+\epsilon I_{\Omega \setminus B}\right) \right] ^p \geqslant \mathcal{E}\left( 2+\epsilon I_{\Omega \setminus A} + \epsilon I_{\Omega \setminus B}\right) ^p \quad \forall \epsilon >0, \end{aligned}$$

where I is an indicator function. With (43), this yields

$$\begin{aligned} 2^p\left( M_1+M_2+(1+\epsilon )^p(M_3+M_4) \right) \geqslant 2^p M_1 + (2+\epsilon )^p(M_2+M_3) + (2+2\epsilon )^p M_4, \end{aligned}$$

which simplifies to

$$\begin{aligned}{}[(2+2\epsilon )^p - (2+\epsilon )^p] M_3\geqslant [(2+\epsilon )^p-2^p] M_2. \end{aligned}$$
(44)

Dividing both parts of (44) by \(\epsilon >0\) and taking the limit \(\epsilon \rightarrow 0^+\), we obtain \(p2^{p-1}M_3\geqslant p2^{p-1}M_2\), or \(M_3\geqslant M_2\). This implies that the measure \(M(d\alpha )\) has a non-decreasing density \(w\) on (0, 1), which may include a Dirac delta function at the endpoints of the interval.

By selecting \(\alpha _1=\alpha _2-\delta \) and \(\alpha _3=\alpha _2+\delta \) and by taking \(\delta \rightarrow 0^+\), we can make \(M_3\) arbitrarily close to \(M_2\). Consequently, (44) may hold only if \((2+2\epsilon )^p - (2+\epsilon )^p\geqslant (2+\epsilon )^p-2^p\). With \(\epsilon =1\), this inequality reduces to \(4^p - 2\cdot 3^p + 2^p\geqslant 0\) and implies that \(p\geqslant 1\). If \(\mathcal{E}\) can be represented in the form of (23), the inequality \(p\geqslant 1\) along with (41) and (42) yields (25). Moreover, \(\int _0^1 w(\alpha )\,d\alpha =M(0,1)>0\). To prove (b), it remains to verify that w is non-negative.
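As an aside, the inequality \(4^p - 2\cdot 3^p + 2^p\geqslant 0\) invoked above is just midpoint convexity of \(x\mapsto x^p\) at the points 2 and 4; a short numerical check (ours):

```python
# Quick check (ours): g(p) = 4^p - 2*3^p + 2^p is >= 0 for p >= 1 and
# < 0 for 0 < p < 1 (midpoint convexity of x -> x^p at x = 2 and x = 4).
for p in [0.5, 0.9, 1.0, 1.5, 2.0]:
    print(p, 4.0**p - 2.0 * 3.0**p + 2.0**p >= -1e-12)
# -> False, False, True, True, True
```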

Let \(a\geqslant b\) in (25); the case \(a \leqslant b\) is treated similarly. Since \(\Theta \) is atomless, for every \(\alpha \in (0,1/2]\), there exist events \(A,B \in \mathcal{M}\) such that \(\mathbb {P}[A]=\mathbb {P}[B]=\alpha \) and \(\mathbb {P}[A \cap B]=0\). Subadditivity of \(\mathcal{E}\) implies that

$$\begin{aligned} \mathcal{E}\left( 1-2 I_A\right) + \mathcal{E}\left( 1-2 I_B\right) \geqslant \mathcal{E}\left( 2-2 I_{A\cup B}\right) . \end{aligned}$$

With (25), this yields

$$\begin{aligned} 2 \left( b^p M(0,\alpha ) + a^p M(\alpha , 1)\right) ^{1/p} \geqslant \left( (2a)^p M(2\alpha ,1)\right) ^{1/p}, \end{aligned}$$

which simplifies to

$$\begin{aligned} a^p M(\alpha , 2\alpha ) \geqslant - b^p M(0,\alpha ) \quad \forall \alpha \in (0,1/2]. \end{aligned}$$
(45)

Let \(\alpha ^*=\sup \{\alpha : w(\alpha )<0\}\). If \(\alpha ^*>0\), then, since \(w(\alpha )\) is non-decreasing, (45) fails for \(\alpha =\alpha ^*/2\); consequently, \(\alpha ^*=0\). Then \(\lim \limits _{\alpha \rightarrow 0} M(\alpha , 2\alpha ) \leqslant \lim \limits _{\alpha \rightarrow 0} \alpha w(2\alpha ) = 0\), so that \(\lim \limits _{\alpha \rightarrow 0} M(0, \alpha ) \geqslant 0\) by (45), which implies that w has no negative delta function at 0 either. This finishes the proof of (b).

Finally, suppose that \(\mathcal{E}\) has the form of (21). Then an analogue of (43) for negative r.v.’s is given by

$$\begin{aligned} \mathcal{E}(Z)=\left( \frac{b^p}{M(0,1)}\int _0^1 |q_Z(\alpha )|^p\,M(d\alpha )\right) ^{1/p}, \quad Z\leqslant 0. \end{aligned}$$
(46)

Since \(q_{-Z}(\alpha )=-q_{Z}(1-\alpha )\) for almost all \(\alpha \in (0,1)\), (46) can be written as

$$\begin{aligned} \mathcal{E}(Z')=\left( \frac{b^p}{M(0,1)}\int _0^1 |q_{Z'}(\alpha )|^p\,M'(d\alpha )\right) ^{1/p}, \quad Z'\geqslant 0, \end{aligned}$$

where \(Z'=-Z\) and \(M'\) is a measure such that \(M'(a,b)=M(1-b,1-a)\) for any interval (a, b). The last expression coincides with (43), and the same argument implies that \(M'(d\alpha )\) has a non-decreasing density \(w'\) on (0, 1). Since \(w'(\alpha )=w(1-\alpha )\), \(\alpha \in (0,1)\), both \(w\) and \(w'\) may be non-decreasing only if \(w\) is constant, which along with (41) and (42) yields (2a) and proves (a).


Cite this article

Grechuk, B., Zabarankin, M. Regression analysis: likelihood, error and entropy. Math. Program. 174, 145–166 (2019). https://doi.org/10.1007/s10107-018-1256-6
