Robust regression using biased objectives
Abstract
For the regression task in a nonparametric setting, designing the objective function to be minimized by the learner is a critical task. In this paper we propose a principled method for constructing and minimizing robust losses, which are resilient to errant observations even under small samples. Existing proposals typically utilize very strong estimates of the true risk, but in doing so require a priori information that is not available in practice. As we abandon direct approximation of the risk, this lets us enjoy substantial gains in stability at a tolerable price in terms of bias, all while circumventing the computational issues of existing procedures. We analyze existence and convergence conditions, provide practical computational routines, and also show empirically that the proposed method realizes superior robustness over wide data classes with no prior knowledge assumptions.
Keywords
Robust loss, Heavy-tailed noise, Risk minimization
1 Introduction
Accurate prediction of response \(y \in \mathbb {R}\) from novel pattern \(\varvec{x}\in \mathbb {R}^{d}\), based on an observed sample sequence of pattern-response pairs \((\varvec{z}_{1},\ldots ,\varvec{z}_{n}),\,\varvec{z}:=(\varvec{x},y)\), is one of the most fundamental of statistical estimation tasks. Under particular assumptions such as bounded losses or sub-Gaussian residuals, a rich theory has developed in recent decades (Kearns and Schapire 1994; Bartlett et al. 1996, 2012; Alon et al. 1997; Bartlett and Mendelson 2006; Srebro et al. 2010), with variants of empirical risk minimization (ERM) routines playing a central role. The principle underlying such procedures is the use of the sample mean to approximate the risk (expected loss), which in turn functions as a location parameter of the unknown loss distribution. When the loss is concentrated around this value, this approximation is accurate, and ERM procedures perform well with appealing optimality properties (Shalev-Shwartz et al. 2010).
Unfortunately, these assumptions are stringent, and in general, without a priori evidence to the contrary, our data cannot reasonably be expected to satisfy them. The fundamental problem manifests itself clearly in the simple setting of heavy-tailed real observations, in which the sub-optimality of the empirical mean is well-known (Catoni 2012). A simple solution when using ERM is to leverage slower-growing loss functions (e.g., \(\ell _{1}\) instead of \(\ell _{2}\)), but making this decision is inherently ad hoc and requires substantial prior information. Another option is model regularization (Tibshirani 1996; Bartlett et al. 2012; Hsu et al. 2014), potentially combined with quantile regression (Koenker and Bassett 1978; Takeuchi et al. 2006), though both methods introduce new parameters and we are faced with a difficult model selection problem (Cucker and Smale 2002), whose optimal solution is in practice often very sensitive to the empirical distribution. Put simply, in a non-parametric setting, one incurs a major risk of bias in the form of minimizing an impractical location parameter (e.g., the median under asymmetric losses), in order to ensure estimates are stable.
2 Background and contributions
In this section we review the technical literature which is closely related to our work, and then within this context establish the main contributions made in this paper.
Related work Many tasks involve minimizing a function, say \(L(\cdot )\), as a function of candidate \(h \in \mathcal {H}\), which depends on the underlying distribution and is thus unknown. One line of work explicitly looks at refining the approximate objective function used. A key theme is to down-weight errant observations automatically, and to construct a new estimate \(\widehat{L}(h) \approx L(h)\) of the risk, recoding the algorithm as \(\widehat{h}:={{\mathrm{arg\,min}}}_{h \in \mathcal {H}} \widehat{L}(h)\). The now-classic work of Rousseeuw and Yohai (1984) on S-estimators highlights important concepts in our work. They use the M-estimator of scale of the residual \(h(\varvec{x})-y\), written \(\widehat{s}(h)\), directly as objective function, setting \(\widehat{L}(h)=\widehat{s}(h)\). The idea is appealing and has (classical) robustness properties, though serious issues of stability and computational cost have been raised (Huber and Ronchetti 2009), and indeed even the fast modern routines are designed only for the rather special parametric setting where errant data can be discarded (Salibian-Barrera and Yohai 2006), which severely limits utility in our setting.

A fast minimizer of robust losses for general regression tasks, which is easily implemented, inexpensive, and requires no knowledge of higher-order moments of the data.

Analysis of conditions for existence and convergence of the core routine.

Comprehensive empirical performance testing, illustrating dominant robustness in both simulated settings and on real-world benchmark data sets.
3 Fast minimization of robust objectives
In this section, we introduce the learning task of interest and give an intuitive derivation of our proposed algorithm. More formal analysis of the convergence properties of this procedure, from both statistical and computational viewpoints, is carried out in Sect. 4.
Example 1
(Typical formulations) The pattern recognition problem has generic input space \(\mathcal {X}\) and discrete labels, namely \(\varvec{x}\in \mathcal {X}\) and \(y \in \{1,\ldots ,C\}\). Here the “zero-one” loss \(l(h;\varvec{z})=I\{h(\varvec{x}) \ne y\}\) makes for a natural penalty to classifier h. More generally, the regression task has response \(y \in \mathbb {R}\), and the classic metric for evaluating the quality of predictor \(h{:}\,\mathcal {X}\rightarrow \mathbb {R}\) is the quadratic loss \(l(h;\varvec{z})=(y-h(\varvec{x}))^{2}\).
Issues to overcome Intuitively, if our approximation, say \(\widehat{L}\), of \(L_{\mu }\), is not very accurate, then any minima of \(\widehat{L}\) will likely be useless. Thus the first item to deal with is making sure the approximation \(\widehat{L}\approx L_{\mu }\) is sharp. Perhaps the most typical approach is to set \(\widehat{L}(h)\) to the sample mean, \(\sum _{i=1}^{n}l(h;\varvec{z}_{i})/n\). In this case, the estimate is “unbiased” as \({{\mathrm{\mathbf {E}}}}\widehat{L}(h) = L_{\mu }(h)\), but unfortunately the variance can be highly undesirable (Catoni 2009, 2012). There is no need to constrain ourselves to unbiased estimators, as Fig. 2a illustrates; paying a small cost in terms of bias [allowing \({{\mathrm{\mathbf {E}}}}\widehat{L}(h) \ne L_{\mu }(h)\)] for much more stable output (large reduction in variance of \(\widehat{L}\)) is an appealing route.
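The bias/variance trade just described can be illustrated with a small simulation. The sketch below is not the estimator proposed in this paper: it uses a simple upper-trimmed mean as a stand-in for a biased but stable risk estimate, and the log-Normal loss distribution, the trimming fraction, and the function names are all illustrative assumptions.

```python
import numpy as np

def trimmed_mean(values, frac=0.1):
    """Mean of the values left after dropping the largest `frac` share."""
    v = np.sort(np.asarray(values, dtype=float))
    keep = len(v) - int(frac * len(v))
    return v[:keep].mean()

def compare_estimators(n=50, trials=2000, sigma=1.5, seed=0):
    """Bias and variance of two risk estimates over repeated samples."""
    rng = np.random.default_rng(seed)
    # Hypothetical heavy-tailed losses: one row of n losses per trial.
    losses = rng.lognormal(mean=0.0, sigma=sigma, size=(trials, n))
    true_risk = np.exp(sigma ** 2 / 2.0)  # exact mean of log-Normal(0, sigma)
    sample_means = losses.mean(axis=1)
    trimmed = np.array([trimmed_mean(row) for row in losses])
    return {
        "bias_mean": sample_means.mean() - true_risk,
        "var_mean": sample_means.var(),
        "bias_trimmed": trimmed.mean() - true_risk,
        "var_trimmed": trimmed.var(),
    }

result = compare_estimators()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the trimmed estimate is visibly biased downward, while its variance across trials is far smaller than that of the sample mean, which is the kind of exchange the text argues for.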
Example 2
(Update under linear model) In the special case of a linear model where \(h(\varvec{x}) = \varvec{w}^{T}\varvec{x}\) for some vector \(\varvec{w}\in \mathbb {R}^{d}\), inverting a \(d \times d\) matrix followed by some matrix multiplication is all that is required. Writing \(X = [\varvec{x}_{1},\ldots ,\varvec{x}_{n}]^{T}\) for the \(n \times d\) design matrix, \(\varvec{y}= (y_{1},\ldots ,y_{n})\), \(h(X)=(h(\varvec{x}_{1}),\ldots ,h(\varvec{x}_{n}))\), and \(U={{\mathrm{diag}}}(u_{1},\ldots ,u_{n})\), the solution is \((X^{T}UX)^{\dagger }X^{T}U(\varvec{y}-h(X))\), where \((\cdot )^{\dagger }\) denotes the Moore–Penrose inverse.\(\square \)
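The closed form in Example 2 has the shape \((X^{T}UX)^{\dagger }X^{T}U\varvec{v}\) for some n-vector \(\varvec{v}\). A minimal NumPy sketch of that computation follows; the function name and the generic argument `target` (standing in for \(\varvec{v}\)) are assumptions made for illustration, and how the weights \(u_{i}\) are obtained is not reproduced here.

```python
import numpy as np

def weighted_ls_solution(X, target, u):
    """Return (X^T U X)^+ X^T U target, where U = diag(u)."""
    X = np.asarray(X, dtype=float)
    target = np.asarray(target, dtype=float)
    u = np.asarray(u, dtype=float)
    XtU = X.T * u  # same as X^T @ diag(u), without building the n x n matrix
    # Moore-Penrose inverse of the d x d matrix, then matrix-vector products.
    return np.linalg.pinv(XtU @ X) @ (XtU @ target)

# Usage with hypothetical data: a noise-free target is recovered exactly.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((20, 3))
w_demo = np.array([0.5, -1.0, 2.0])
u_demo = rng.uniform(0.1, 1.0, size=20)
w_fit = weighted_ls_solution(X_demo, X_demo @ w_demo, u_demo)
```

Only a \(d \times d\) pseudo-inverse is needed, so the per-update cost stays modest when d is small relative to n.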
Example 3
Actual computation of key quantities Here we discuss precisely how we carry out the various subroutines required in Algorithm 1, namely the tasks of initialization, recentring, rescaling, and finding robust loss estimates. Initialization is the first and the easiest: \(h_{(0)}\) is initialized to the \(\ell _{2}\) empirical risk minimizer. When this value is optimal, it should be difficult to improve \(\widehat{L}\), and thus the algorithm should finish quickly; when it is highly suboptimal, this should result in a large value for \(\widehat{L}(h_{(0)};\rho ,s)\), upon which subsequent steps of the algorithm seek to improve.
The “pivot” term \(\gamma \) is computed given a set of losses \(D=\{l(h;\varvec{z}_{i})\}_{i=1}^{n}\) evaluated at some h; in particular, the losses are computed for \(h_{(t-1)}\) at iteration t of Algorithm 1. This \(\gamma (D)\) is used to centre the data; terms \(l(h;\varvec{z}_{i})\) which are inordinately far away from \(\gamma (D)\), either above or below, are treated as errant. One natural choice that requires sorting the data is the median of D. A rough but fast choice is the arithmetic mean of D, which we have used throughout our tests.
Intuitively, for h and D, we expect that as \(k \rightarrow \infty \)
\(\widehat{\theta }_{(k)} \rightarrow \widehat{L}(h;\rho ,s)\) and \(s_{(k)} \rightarrow s(D)\),
and indeed such properties can be both formally and empirically established (see Sect. 4.4).
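To make the idea of an iterative location estimate concrete, here is a minimal sketch of a standard M-estimator fixed-point update with the scale held fixed, \(\theta \leftarrow \theta + (s/n)\sum _{i}\psi ((x_{i}-\theta )/s)\). This is a generic textbook iteration, not necessarily the exact subroutine used in Algorithm 1; taking \(\psi \) to be the Gudermannian function, and the function names, tolerance, and toy data, are assumptions for illustration.

```python
import numpy as np

def gudermannian(u):
    """Gudermannian function gd(u) = 2 * arctan(tanh(u / 2)); bounded by pi/2."""
    return 2.0 * np.arctan(np.tanh(np.asarray(u, dtype=float) / 2.0))

def location_fixed_point(values, scale, init=None, tol=1e-10, max_iter=1000):
    """Iterate theta <- theta + scale * mean(psi((x - theta) / scale))."""
    x = np.asarray(values, dtype=float)
    theta = x.mean() if init is None else float(init)
    for _ in range(max_iter):
        step = scale * gudermannian((x - theta) / scale).mean()
        theta += step
        if abs(step) < tol:  # stop once the update is negligible
            break
    return theta

# Toy data with one errant value; the scale is fixed by hand here.
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
theta_hat = location_fixed_point(data, scale=1.0)
```

Because the derivative of the Gudermannian lies in (0, 1], each update moves the iterate monotonically toward the unique root of \(\sum _{i}\psi ((x_{i}-\theta )/s)=0\); on the toy data the estimate settles near the bulk of the points rather than near the sample mean.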
Example 4
Summary of fRLM algorithm To recapitulate, we have put forward a procedure for minimizing the robust loss \(\widehat{L}(h;\rho ,s)\) in h, by using a fast reweighted least squares technique that is guaranteed to improve a quantity (q above) very closely related to the actual unwieldy objective \(\widehat{L}\). Using the iterative nature of this routine, we can perform the rescaling and location estimates sequentially (rather than simultaneously), making for simple and fast updates. All together, this allows us to leverage the ability of \(\rho \) to truncate errant observations, while utilizing the fast approximate minimization program to alleviate issues with \(\widehat{L}\) being implicit, all without using moment oracles for scaling as in the analysis of Catoni (2012) and Brownlees et al. (2015), which are notable merits of our proposed approach.
This algorithm makes use of statistical quantities that are defined as the minimizer of a class of estimators. As discussed in our literature review of Sect. 2, the properties of learning algorithms that leverage these statistics have been analyzed by Brownlees et al. (2015). This does not, however, capture the properties of the resulting estimator itself: how does it behave as a function of sample size? Does it converge to a readily-interpreted parameter? We address these questions in the following section.
4 Analysis of convergence
After giving some additional notation in Sect. 4.1, we provide some fundamental existence results in Sect. 4.2, and then show that robust loss minimizers converge in a manner analogous to classical M-estimators in Sect. 4.3, using computationally congenial subroutines examined in Sect. 4.4. All proofs are relegated to Appendix B.
4.1 Preliminaries
In addition to the notation of h, l, \(\varvec{z}\), and \(L_{\mu }\) from the previous sections, we specify that \(\mu \) is a probability measure on \(\mathbb {R}^{d+1}\), equipped with some appropriate \(\sigma \)-field, say the Borel sets \(\mathcal {B}_{d+1}\). Let \(\mu _{n}\) denote the empirical measure supported on the sample, namely \(\mu _{n}(B) :=n^{-1}\sum _{i=1}^{n}I\{\varvec{z}_{i} \in B\}\), \(B \in \mathcal {B}_{d+1}\). Expectation of vectors is naturally element-wise, namely \({{\mathrm{\mathbf {E}}}}_{\mu }(\varvec{x},y) = ({{\mathrm{\mathbf {E}}}}_{\mu }x_{1},\ldots ,{{\mathrm{\mathbf {E}}}}_{\mu }x_{d}, {{\mathrm{\mathbf {E}}}}_{\mu }y)\), and we shall use \({{\mathrm{var}}}_{\mu }\varvec{z}\) to denote the \((d+1) \times (d+1)\) covariance matrix of \(\varvec{z}\), and so forth. \({{\mathrm{\mathbf {P}}}}\) will be used to denote a generic probability measure, though in almost all cases it will be over the n-sized data sample, and thus correspond to the product measure \(\mu ^{n}\). Let \([k] :=\{1,\ldots ,k\}\) for integer k. We shall frequently use \(\widehat{h}\) to denote the output of an algorithm, typically as \(\widehat{h}_{n}(\varvec{x}) :=\widehat{h}(\varvec{x};\varvec{z}_{1},\ldots ,\varvec{z}_{n})\), a process which takes the n-sized data sample and returns a function \(\widehat{h}_{n} \in \mathcal {H}\) to be used for prediction. Since the underlying distribution \(\mu \) is unknown, the risk \(L_{\mu }\) can either be estimated formally, using inequalities that provide high-probability confidence intervals for this error over the random draw of the sample, or via controlled simulations where the performance metrics are computed over many independent trials.
Example 5
As a concrete case, the classical linear regression model with quadratic risk has \(\varvec{z}=(\varvec{x},y)\) with \(h(\varvec{x})=\varvec{w}^{T}\varvec{x}\) for some \(\varvec{w}\in \mathbb {R}^{d}\), and \(l(h;\varvec{z})=(y-\varvec{w}^{T}\varvec{x})^{2}\). When the model is correctly specified, i.e., when we have \(y = \varvec{w}_{0}^{T}\varvec{x}+\epsilon \) for an unknown \(\varvec{w}_{0} \in \mathbb {R}^{d}\), and noise with \({{\mathrm{\mathbf {E}}}}_{\mu }\epsilon =0\), the loss takes on a convenient form, making additional results easy to obtain, though our general approach does not require such assumptions.
4.2 Existence of valid estimates
Generalization performance is completely captured by the distribution of \(l(h;\varvec{z})\). Unfortunately, inferring this distribution from a finite sample is exceedingly difficult, and so we estimate parameters of this distribution to gain insight into performance; the expected value \(L_{\mu }(h)\) is a case in point. In pursuit of a routine for estimating the risk, with low variance and controllable bias, the basic strategy outlined in Sect. 3 seems intuitively promising. Here we show that following the strategy outlined, one can create a procedure which is valid in a statistical sense, under very weak assumptions.
Our starting point is to introduce new parameters, distinct from the risk, which have controllable bias, and can be approximated more reliably than the expected value, using a finite sample. The following definition specifies such a parameter class.
Definition 6
Remark 7
First, we show that these new “objectives” are indeed well-defined objective functions, which is important since our algorithm seeks to minimize them.
Lemma 8
With a well-defined objective function, next we consider the existence of the minimizer of this new objective. While measurability is by no means our chief concern here, for completeness we include a technical result useful for proving the existence of a valid minimizer of the proxy objective.
Lemma 9
This gives us a formal definition of \(\widehat{\theta }(h)\) which has the desired property specified by (7). It simply remains to show that we can always minimize this objective in h.
Theorem 10
There are many potential methods for carrying out the scaling in practice. Here we verify that the simple method proposed in Sect. 3 does not disrupt the assurances above. First a definition.
Definition 11
Proposition 12
Note that \(\gamma _{\mu _{n}}(h)\) here corresponds directly to \(\gamma (D)\) in Algorithm 1, where \(D=\{l(h;\varvec{z}_{i})\}_{i=1}^{n}\). With basic facts related to existence and measurability in place, we proceed to look at some convergence properties of the estimators and computational procedures concerned in Sects. 4.3 and 4.4.
4.3 Statistical convergence
For some context, we start with a well-known consistency property of M-estimators, adapted to our setting.
Theorem 13
Lemma 14
A corollary of this general result will be particularly useful.
Corollary 15
These facts are sufficient for showing that a very natural analogue of the strong pointwise consistency of M-estimators (Theorem 13) holds in a uniform fashion for our robust objective minimizer \(\widehat{h}_{n}\).
Theorem 16
With these rather natural statistical properties understood, we shift our focus to the behaviour of the computational routines used.
4.4 Computational convergence
As regards computational convergence, since Algorithm 1 is meant to be a fast approximation to a minimizer of \(\widehat{L}(\cdot )\) on \(\mathcal {H}\), we should not expect the \(\widehat{h}\) produced after \(t \rightarrow \infty \) iterations to actually converge to the true \(\widehat{h}_{n}\) in (8). What we should expect, however, is that the subroutines (3) and (4), used to compute \(\widehat{L}_{(t)}\) and \(s(D_{(t)})\) for each t, should converge to the true values specified by (1) and (2) respectively. We show that this convergence holds.
Proposition 17
Using \(\rho \) as in Definition 6 and \(\chi \) as in Proposition 12, note that the above convergence guarantees will not be ambiguous, since the location and scale estimates are uniquely determined.
Efficiency of iterative subroutines As a complement to the formal convergence properties just examined, we conduct numerical tests in which we run (3) and (4) until they respectively compute the true \(\widehat{\theta }\) and s values up to a specified degree of precision. It is of practical importance to answer the following questions: Do the iterative routines reliably converge to the correct optimal value? How many iterations does this take on average? How does this depend on factors such as the data distribution, sample size, and our choice of \(\rho \) and \(\chi \)?
We have convergence at a high level of precision, requiring very few iterations, and this holds uniformly across the conditions observed. As such, the convergence of the routines is just as expected (Proposition 17), and the speed is encouraging. In general, convergence tends to speed up for larger n, and the relative difference in speed is very minor across distinct \(\rho \) choices, though slightly more pronounced in the case of \(\chi \), but even the slowest choice seems tolerable. Finally, location estimation is slightly slower in the Normal case than in the log-Normal case, while the opposite holds for scale estimation.
5 Numerical performance tests
1. How well does fRLM (Algorithm 1) generalize off-sample?
2. Fixing \(\rho \), can we still succeed under both light and heavy-tailed noise?
3. How does performance depend on n and d?
5.1 Experimental setup
In every experimental condition and trial, we generate n training observations of the form \(y_{i} = \varvec{w}_{0}^{T}\varvec{x}_{i}+ \epsilon _{i}, i \in [n]\). Distinct experimental conditions are specified by the setting of (n, d) and \(\mu \). Inputs \(\varvec{x}\) are assumed to follow a d-dimensional isotropic Gaussian distribution, and thus to determine \(\mu \) is to specify the distribution of noise \(\epsilon \). In particular, we look at several families of distributions, and within each family look at 15 distinct noise levels. Each noise level is simply a particular parameter setting, designed such that \({{\mathrm{sd}}}_{\mu }(\epsilon )\) monotonically increases over the range 0.3–20.0, approximately linearly over the levels.
To ensure a wide range of signal/noise ratios is spanned, for each trial, \(\varvec{w}_{0} \in \mathbb {R}^{d}\) is randomly generated as follows. Defining the sequence \(w_{k} :=\pi /4 + (-1)^{k-1}(k-1)\pi /8, k=1,2,\ldots \) and uniformly sampling \(i_{1},\ldots ,i_{d} \in [d_{0}]\) with \(d_{0}=500\), we set \(\varvec{w}_{0} = (w_{i_{1}},\ldots ,w_{i_{d}})\). As such, given our control of noise standard deviation, and noting that the signal to noise ratio in this setting is computed as \(\text {SN}_{\mu } = \Vert \varvec{w}_{0}\Vert _{2}^{2}/{{\mathrm{var}}}_{\mu }(\epsilon )\), the ratio ranges between \(0.2 \le \text {SN}_{\mu } \le 1460.6\).
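The construction of \(\varvec{w}_{0}\) can be sketched as follows, reading the sequence as \(w_{k} = \pi /4 + (-1)^{k-1}(k-1)\pi /8\). The function names are hypothetical, and sampling the indices with replacement is an assumption, since the text only says they are drawn uniformly from \([d_{0}]\).

```python
import numpy as np

def weight_sequence(d0=500):
    """Return (w_1, ..., w_d0) with w_k = pi/4 + (-1)^(k-1) * (k-1) * pi/8."""
    k = np.arange(1, d0 + 1)
    return np.pi / 4 + (-1.0) ** (k - 1) * (k - 1) * np.pi / 8

def sample_w0(d, d0=500, rng=None):
    """Draw d indices uniformly from {1, ..., d0} and pick those weights."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, d0, size=d)  # 0-based positions; with replacement (assumption)
    return weight_sequence(d0)[idx]

def signal_to_noise(w0, noise_var):
    """SN = ||w0||_2^2 / var(eps), as defined in the text."""
    return float(np.sum(np.asarray(w0) ** 2) / noise_var)

w0_demo = sample_w0(d=5, rng=np.random.default_rng(0))
```

The alternating sign and growing magnitude of the sequence mean that a random draw can produce both small and very large \(\Vert \varvec{w}_{0}\Vert _{2}\), which is what spreads the signal/noise ratio over a wide range.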
Regarding the noise distribution families, the tests described above were run for 27 different families, but as space is limited, here we provide results for four representative families: log-Normal (denoted lnorm in figures), Normal (norm), Pareto (pareto), and Weibull (weibull). Even with just these four, we capture both symmetric and asymmetric families, sub-Gaussian families, as well as heavy-tailed families both with and without finite higher-order moments.
Our chief performance indicator is prediction error, computed as follows. For each condition and each trial, an independent test set of m observations is generated identically to the corresponding n-sized training set. All competing methods use common sample sets for training and are evaluated on the same test data, for all conditions/trials. For each method, in the kth trial, some estimate \(\widehat{\varvec{w}}\) is determined. To approximate the \(\ell _{2}\)-risk, we compute root mean squared error \(e_{k}(\widehat{\varvec{w}}) :=(m^{-1}\sum _{i=1}^{m}(\widehat{\varvec{w}}^{T}\varvec{x}_{k,i}-y_{k,i})^{2})^{1/2}\), and output prediction error as the average of normalized errors \(e_{k}(\widehat{\varvec{w}}(k)) - e_{k}(\varvec{w}_{0}(k))\) taken over all trials. While n and d values vary, in all experiments the number of trials is fixed at 250, and test size \(m=1000\).
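A minimal sketch of how one training sample might be generated under the linear model with isotropic Gaussian inputs is given below. The function name, the dictionary of noise samplers, and every distribution parameter in it are hypothetical placeholders; the actual parameter settings for the 15 noise levels are not given in this section, and no centring of the noise is applied here.

```python
import numpy as np

def make_sample(n, w0, noise_sampler, rng):
    """Draw X with i.i.d. standard Normal entries and set y = X w0 + eps."""
    w0 = np.asarray(w0, dtype=float)
    X = rng.standard_normal((n, w0.size))
    eps = noise_sampler(rng, n)
    return X, X @ w0 + eps

# Hypothetical noise samplers, one per family named in the text.
NOISE = {
    "norm": lambda rng, n: rng.normal(0.0, 1.0, size=n),
    "lnorm": lambda rng, n: rng.lognormal(0.0, 1.0, size=n),
    "pareto": lambda rng, n: rng.pareto(3.0, size=n),  # NumPy draws Pareto II (Lomax)
    "weibull": lambda rng, n: rng.weibull(0.8, size=n),
}

rng_demo = np.random.default_rng(0)
samples = {name: make_sample(30, [1.0, -2.0], f, rng_demo) for name, f in NOISE.items()}
```

Passing the noise sampler as an argument keeps the input distribution and the model fixed while only the noise family changes, mirroring how the experimental conditions are described.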
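The prediction error described above, namely the test RMSE of the estimate minus the test RMSE of the true \(\varvec{w}_{0}\), averaged over trials, can be computed as in the following sketch. The function names and the layout of the inputs (one estimate, one true vector, and one test set per trial) are illustrative assumptions.

```python
import numpy as np

def rmse(w, X, y):
    """Root mean squared error of the linear predictor x -> w^T x on (X, y)."""
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

def prediction_error(estimates, truths, test_sets):
    """Average over trials of e_k(w_hat) - e_k(w0), each on that trial's test set."""
    diffs = [
        rmse(w_hat, X, y) - rmse(w0, X, y)
        for w_hat, w0, (X, y) in zip(estimates, truths, test_sets)
    ]
    return float(np.mean(diffs))

# Hypothetical usage: three noise-free trials with slightly perturbed estimates.
rng_eval = np.random.default_rng(0)
truths = [rng_eval.standard_normal(4) for _ in range(3)]
test_sets = []
for w0 in truths:
    X_test = rng_eval.standard_normal((1000, 4))
    test_sets.append((X_test, X_test @ w0))
estimates = [w0 + 0.1 for w0 in truths]
err = prediction_error(estimates, truths, test_sets)
```

Subtracting the RMSE of \(\varvec{w}_{0}\) removes the part of the error due to the noise alone, so values near zero indicate an estimate that predicts about as well as the true parameter on that test set.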
5.2 Competing methods
Benchmark routines used in these experiments are as follows. Ordinary least squares, denoted ols, and least absolute deviations, denoted lad, represent classic methods. In addition, we look at three very modern alternatives, namely routines taken directly from the referenced papers of Minsker (2015) (geomed), Brownlees et al. (2015) (bjl), and Hsu and Sabato (2016) (hs). The hs routine used here is a faithful R translation of the MATLAB code published by the authors. Our implementation of geomed uses the geometric median algorithm of Vardi and Zhang (2000, Eqn. 2.6), and all partitioning conditions as given in the original paper are satisfied. Regarding bjl, scaling is done using a sample-based estimate of the true variance bound used in their analysis, with optimization carried out using the Nelder–Mead gradient-free method implemented in the R function optim.
For our fRLM (Algorithm 1, Sect. 3), we tried several different choices of \(\rho \) and \(\chi \), including those in Appendix A, and overall trends were almost identical. Thus as a representative, we use the Gudermannian for \(\rho \) and \(\chi (u)={{\mathrm{sign}}}(u-1)\) as a particularly simple and illustrative example implementation. Estimates of location and scale were carried out by (3) and (4).
5.3 Test results: simulation
Here we assemble the results of distinct experiments which highlight different facets of the statistical procedures being evaluated.
Impact of dimension The role played by model dimension is also of interest, and can highlight weaknesses in optimization routines that do not appear when only a few parameters are being determined. Such issues are captured most effectively by keeping the d/n ratio fixed and increasing the model dimension.
Prediction error results are given in Fig. 8, at the middle noise level, for different model dimensions ranging over \(5 \le d \le 140\). The sample size is determined such that \(d/n = 1/6\) holds; this is a rather generous size, and thus where we observe deterioration in performance, we infer a lack of utility in more complex models, even when a sample of sufficient size is available. We see clearly that most procedures considered see a performance drop as model dimension grows, whereas our routine performs exactly the same, regardless of dimension size. This is a particularly appealing result illustrating the scalability of our fRLM in “bigger” tasks.
5.4 Test results: real-world data
Results are given in Fig. 9. While the data sets come from wildly varying domains (economics, manufacturing of petroleum products, human physiology and psychology), it is apparent that the results here very closely parallel those of our simulations, which again are the kind of performance that the theoretical exposition of Sects. 3 and 4 would have us expect. Strong performance is achieved with no a priori information, and with no fine-tuning whatsoever. Exactly the same routine is deployed in all problems. Of particular importance here is that we are able to beat or match the bjl routine under all settings here as well; both of these routines attempt to minimize similar robust losses (defined implicitly); however, our routine does it at a fraction of the cost, since we have no need to appeal to general-purpose non-linear optimizers, a very promising result.
6 Concluding remarks
In this work, we have introduced and explored a novel approach to the regression problem, using robust loss estimates and an efficient routine for minimizing these estimates without requiring prior knowledge of the underlying distribution. In addition to theoretical analysis of the fundamental properties of the algorithm being used, we showed through comprehensive empirical testing that the proposed technique indeed has extremely desirable robustness properties. In a wide variety of problem settings, our routine was shown to uniformly outperform well-known competitors both classical and modern, with cost requirements that are tolerable, suggesting a strong general approach for regression in the non-parametric setting.
Looking ahead, there are a number of interesting lines of work to be taken up. Extending this work to unsupervised learning problems is an immediate goal. Beyond this, a more careful look at the optimality of different algorithms from a cost/performance standpoint would assuredly be of interest. When is it more profitable (under some metric) to use “balanced” methods such as that of Minsker (2015), Brownlees et al. (2015) and Hsu and Sabato (2016) or ours, rather than committing to one of two extremes, say OLS or LAD? The former perform very well, but require extra computation. Characterizing such situations in terms of the underlying data distribution is both technically and conceptually interesting. Clear tradeoffs between formal assurances and extra computational cost could shed new light on precisely where traditional ERM algorithms and close variants fail to be economical.
Footnotes
1. All materials available at https://github.com/feedbackward/rtm_code.
2. Compiled online by J. Burkardt at http://people.sc.fsu.edu/~jburkardt/.
Notes
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments, which resulted in substantial improvements to the manuscript.
References
Abramowitz, M., & Stegun, I. A. (1964). Handbook of mathematical functions with formulas, graphs, and mathematical tables, National Bureau of Standards Applied Mathematics Series (Vol. 55). US National Bureau of Standards.
Alon, N., Ben-David, S., Cesa-Bianchi, N., & Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4), 615–631.
Ash, R. B., & Doléans-Dade, C. A. (2000). Probability and measure theory (2nd ed.). New York: Academic Press.
Audibert, J. Y., & Catoni, O. (2011). Robust linear least squares regression. Annals of Statistics, 39(5), 2766–2794.
Bartlett, P. L., Long, P. M., & Williamson, R. C. (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3), 434–452.
Bartlett, P. L., & Mendelson, S. (2006). Empirical minimization. Probability Theory and Related Fields, 135(3), 311–334.
Bartlett, P. L., Mendelson, S., & Neeman, J. (2012). \(\ell _{1}\)-regularized linear regression: Persistence and oracle inequalities. Probability Theory and Related Fields, 154(1–2), 193–224.
Breiman, L. (1968). Probability. Reading, MA: Addison-Wesley.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Brent, R. P. (1973). Algorithms for minimization without derivatives. Englewood Cliffs, NJ: Prentice-Hall.
Brownlees, C., Joly, E., & Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses. Annals of Statistics, 43(6), 2507–2536.
Catoni, O. (2009). High confidence estimates of the mean of heavy-tailed real random variables. arXiv preprint arXiv:0909.5366.
Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 48(4), 1148–1185.
Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletin (New Series) of the American Mathematical Society, 39(1), 1–49.
Dellacherie, C., & Meyer, P. A. (1978). Probabilities and potential, North-Holland Mathematics Studies (Vol. 29). Amsterdam: North-Holland.
Devroye, L., Lerasle, M., Lugosi, G., & Oliveira, R. I. (2015). Sub-Gaussian mean estimators. arXiv preprint arXiv:1509.05845.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Annals of Probability, 6(6), 899–929.
Dudley, R. M. (2014). Uniform central limit theorems (2nd ed.). Cambridge, MA: Cambridge University Press.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Geman, D., & Reynolds, G. (1992). Constrained restoration and the recovery of discontinuities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(3), 367–383.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.
Hsu, D., & Sabato, S. (2014). Heavy-tailed regression with a generalized median-of-means. In Proceedings of the 31st international conference on machine learning (ICML2014) (pp. 37–45).
Hsu, D., & Sabato, S. (2016). Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18), 1–40.
Hsu, D., Kakade, S. M., & Zhang, T. (2014). Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3), 569–600.
Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1), 73–101.
Huber, P. J. (1981). Robust statistics (1st ed.). New York: Wiley.
Huber, P. J., & Ronchetti, E. M. (2009). Robust statistics (2nd ed.). New York: Wiley.
Kearns, M. J., & Schapire, R. E. (1994). Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48, 464–497.
Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46(1), 33–50.
Lerasle, M., & Oliveira, R. I. (2011). Robust empirical mean estimators. arXiv preprint arXiv:1112.3914.
Lugosi, G., & Mendelson, S. (2016). Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757.
Minsker, S. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4), 2308–2335.
Pollard, D. (1981). Limit theorems for empirical processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57(2), 181–195.
Pollard, D. (1984). Convergence of stochastic processes. Berlin: Springer.
R Core Team. (2016). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. https://www.R-project.org/
Rousseeuw, P., & Yohai, V. (1984). Robust regression by means of S-estimators. In Robust and nonlinear time series analysis, Lecture Notes in Statistics (Vol. 26, pp. 256–272). Berlin: Springer.
Salibian-Barrera, M., & Yohai, V. J. (2006). A fast algorithm for S-regression estimates. Journal of Computational and Graphical Statistics, 15(2), 1–14.
Shalev-Shwartz, S., Shamir, O., Srebro, N., & Sridharan, K. (2010). Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11, 2635–2670.
Srebro, N., Sridharan, K., & Tewari, A. (2010). Smoothness, low noise and fast rates. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems (Vol. 23, pp. 2199–2207).
Steele, J. M. (1975). Combinatorial entropy and uniform limit laws, Ph.D thesis. Stanford University.
Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. J. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7, 1231–1264.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1), 267–288.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2), 264–280.
Vardi, Y., & Zhang, C. H. (2000). The multivariate \(L_{1}\)-median and associated data depth. Proceedings of the National Academy of Sciences, 97(4), 1423–1426.
Yu, Y., Aslan, Ö., & Schuurmans, D. (2012). A polynomial-time form of robust regression. Advances in Neural Information Processing Systems, 25, 2483–2491.