Abstract
In statistical learning the excess risk of empirical risk minimization (ERM) is controlled by \(\left( \frac{\text {COMP}_{n}(\mathcal {F})}{n}\right) ^{\alpha }\), where n is the size of the learning sample, \(\text {COMP}_{n}(\mathcal {F})\) is a complexity term associated with a given class \(\mathcal {F}\), and \(\alpha \in [\frac{1}{2}, 1]\) interpolates between slow and fast learning rates. In this paper we introduce an alternative localization approach for binary classification that leads to a novel complexity measure: fixed points of the local empirical entropy. We show that this complexity measure gives tight control over \(\text {COMP}_{n}(\mathcal {F})\) in the upper bounds under bounded noise. Our results are accompanied by a novel minimax lower bound that involves the same quantity. In particular, we practically answer the question of the optimality of ERM under bounded noise for general VC classes.
References
Alexander, K.S.: Rates of growth and sample moduli for weighted empirical processes indexed by sets. Probab. Theory Relat. Fields 75, 379–423 (1987)
Balcan, M.F., Long, P.M.: Active and passive learning of linear separators under log-concave distributions. In: 26th Conference on Learning Theory (2013)
Bartlett, P.L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)
Boucheron, S., Bousquet, O., Lugosi, G.: Theory of classification: a survey of recent advances. ESAIM: Probab. Stat. 9, 323–375 (2005)
Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Applications of Mathematics, vol. 31. Springer, New York (1996)
Giné, E., Koltchinskii, V.: Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34(3), 1143–1216 (2006)
Hanneke, S.: Theory of disagreement-based active learning. Found. Trends Mach. Learn. 7(2–3), 131–309 (2014)
Hanneke, S., Yang, L.: Minimax analysis of active learning. J. Mach. Learn. Res. 16(12), 3487–3602 (2015)
Hanneke, S.: Refined error bounds for several learning algorithms (2015). http://arXiv.org/abs/1512.07146
Haussler, D., Littlestone, N., Warmuth, M.: Predicting \(\{0, 1\}\)-functions on randomly drawn points. Inf. Comput. 115, 248–292 (1994)
Haussler, D.: Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik–Chervonenkis dimension. J. Combin. Theory Ser. A 69, 217–232 (1995)
Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)
Le Cam, L.M.: Convergence of estimates under dimensionality restrictions. Ann. Statist. 1, 38–53 (1973)
Lecué, G., Mitchell, C.: Oracle inequalities for cross-validation type procedures. Electron. J. Stat. 6, 1803–1837 (2012)
Lecué, G., Mendelson, S.: Learning subgaussian classes: upper and minimax bounds (2013). http://arXiv.org/abs/1305.4825
Liang, T., Rakhlin, A., Sridharan, K.: Learning with square loss: localization through offset Rademacher complexity. In: Proceedings of The 28th Conference on Learning Theory (2015)
Massart, P.: Concentration Inequalities and Model Selection. École d'Été de Probabilités, Saint Flour. Springer, New York (2003)
Massart, P., Nédélec, E.: Risk bounds for statistical learning. Ann. Stat. 34(5), 2326–2366 (2006)
Mendelson, S.: ‘Local’ vs. ‘global’ parameters – breaking the Gaussian complexity barrier (2015). http://arXiv.org/abs/1504.02191
Raginsky, M., Rakhlin, A.: Lower bounds for passive and active learning. In: Advances in Neural Information Processing Systems 24, NIPS (2011)
Rakhlin, A., Sridharan, K., Tsybakov, A.B.: Empirical entropy, minimax regret and minimax risk. Bernoulli (2015, forthcoming)
Talagrand, M.: Upper and Lower Bounds for Stochastic Processes. Springer, Heidelberg (2014)
Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Proc. USSR Acad. Sci. 181(4), 781–783 (1968). English translation: Soviet Math. Dokl. 9, 915–918
Vidyasagar, M.: Learning and Generalization with Applications to Neural Networks, 2nd edn. Springer, Heidelberg (2003)
Yang, Y., Barron, A.: Information-theoretic determination of minimax rates of convergence. Ann. Stat. 27, 1564–1599 (1999)
Acknowledgments
The authors would like to thank Sasha Rakhlin for his suggestion to use offset Rademacher processes to analyze binary classification under Tsybakov noise conditions and anonymous reviewers for their helpful comments. NZ was supported solely by the Russian Science Foundation grant (project 14-50-00150).
Appendix
Proof
(Theorem 1 ). Let \(\text {DIS}_{0}\) be the disagreement set of the version space of the first \(\lfloor n/2 \rfloor \) instances of the learning sample. The random error set will be denoted by \(E_{1} = \{x \in \mathcal {X}|\hat{f}(x) \ne f^{*}(x)\}\). Using symmetrization Lemmas 2 and 1 we have \( \mathbb {E}P(E_{1}) = \mathbb {E}R(\hat{f}) \le \mathbb {E}\sup \limits _{g \in \mathcal {G}_{f^*}}(Pg - (1 + c)P_{n}g) \le \frac{2\left( 1 + \frac{c}{2}\right) ^2}{c}\frac{\log \left( \mathcal {S}_{\mathcal {F}}\left( n\right) \right) }{n} \) for \(c > 0\). We fix \(c = 2\) and prove that for any distribution \(\mathbb {E}P(E_{1}) \le \frac{4\log \left( \mathcal {S}_{\mathcal {F}}\left( n\right) \right) }{n}\). Now we use \( R(\hat{f}) = P(E_{1}|\text {DIS}_{0})P(\text {DIS}_{0}). \) Let \(\xi = |\text {DIS}_{0} \cap \{X_{\lfloor n/2 \rfloor + 1}, \ldots , X_{n}\}|\). Conditionally on the first \(\lfloor n/2 \rfloor \) instances, \(\xi \) has a binomial distribution. Expectations with respect to the first and the last parts of the sample will be denoted respectively by \(\mathbb {E}\) and \(\mathbb {E}'\). Conditionally on \(\{x_{1}, \ldots , x_{\lfloor n/2 \rfloor }\}\) we introduce two events: \( A_{1}: \xi < \frac{nP(\text {DIS}_{0})}{4}\) and \( A_{2}: \xi > \frac{3nP(\text {DIS}_{0})}{4}. \) Using Chernoff bounds we have \(P(A_{1}) \le \exp \left( -\frac{nP(\text {DIS}_{0})}{16}\right) \) and \(P(A_{2}) \le \exp \left( -\frac{nP(\text {DIS}_{0})}{16}\right) \). Denote \(A = A_{1}\cup A_{2}\). Then \( \mathbb {E}'P(E_{1}|\text {DIS}_{0}) = \mathbb {E}'\left[ P(E_{1}|\text {DIS}_{0})\Big |\overline{A}\right] P(\overline{A}) + \mathbb {E}'\left[ P(E_{1}|\text {DIS}_{0})\Big |A\right] P(A). 
\) For the first term we have \( \mathbb {E}'\left[ P(E_{1}|\text {DIS}_{0})\Big |\overline{A}\right] P(\overline{A}) \le \frac{16\log \left( \mathcal {S}_{\mathcal {F}}\left( \frac{3nP(\text {DIS}_{0})}{4}\right) \right) }{nP(\text {DIS}_{0})}. \) We can directly prove for the second term that \(\mathbb {E}'\left[ P(E_{1}|\text {DIS}_{0})\Big |A\right] P(\text {DIS}_{0})P(A) \le \frac{12}{n}\). It is easy to see that for all natural k, r we have \(\left( \mathcal {S}_{\mathcal {F}}(kr)\right) ^{\frac{1}{r}} \le \mathcal {S}_{\mathcal {F}}(k)\). Finally, \( \mathbb {E}R(\hat{f}) \le \mathbb {E}\tfrac{16\log \left( \mathcal {S}_{\mathcal {F}}\left( \tfrac{3nP(\text {DIS}_{0})}{4}\right) \right) }{n} + \tfrac{12}{n} \le \tfrac{40\log \left( \mathcal {S}_{\mathcal {F}}\left( \mathbf {s}\right) \right) }{n} + \tfrac{12}{n}. \) \(\square \)
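The submultiplicativity \(\left( \mathcal {S}_{\mathcal {F}}(kr)\right) ^{\frac{1}{r}} \le \mathcal {S}_{\mathcal {F}}(k)\) used above follows from \(\mathcal {S}_{\mathcal {F}}(a + b) \le \mathcal {S}_{\mathcal {F}}(a)\mathcal {S}_{\mathcal {F}}(b)\). As an illustrative sanity check (not part of the proof), the same inequality can be verified numerically for the Sauer–Shelah polynomial bound standing in for the growth function:

```python
from math import comb, log

def sauer_bound(m: int, d: int) -> int:
    """Sauer-Shelah bound on the growth function of a class of VC dimension d."""
    return sum(comb(m, i) for i in range(d + 1))

# Check S(kr)^(1/r) <= S(k), i.e. log S(kr) <= r * log S(k),
# over a small grid of d, k, r (illustration only, not a proof).
for d in range(1, 5):
    for k in range(1, 20):
        for r in range(1, 10):
            assert log(sauer_bound(k * r, d)) <= r * log(sauer_bound(k, d))
print("submultiplicativity holds on the tested grid")
```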
Proof
(Lemma 4 ). Once again, given \(X_1,\ldots ,X_n\), let \(V = \{ (g(X_1),\ldots ,g(X_n)) : g \in \mathcal {G}\}\) denote the set of binary vectors corresponding to the values of functions in \(\mathcal {G}\). As above, for a fixed \(\gamma \) and fixed minimal \(\gamma \)-covering subset \(\mathcal {N}_{\gamma } \subseteq V\), for each \(v \in V\), p(v) will denote the closest vector to v in \(\mathcal {N}_{\gamma }\). We will denote by \(\mathbb {E}_{\xi }\) the conditional expectation over the \(\xi _i\) variables, given \(X_1,\ldots ,X_n\). We follow the decomposition proposed by Liang, Rakhlin, and Sridharan [16]:
The first term is \(\lesssim \frac{\gamma }{n}\) by the \(\gamma \)-cover property and the fact that \(|\xi _i| \lesssim 1\). Furthermore, it is easy to show that the second term is at most \(\frac{c}{4} \frac{\gamma }{n}\). Now we analyze the last term carefully. First we use the standard peeling argument. Given a set W of binary vectors, we define \(W[a, b] = \{w \in W \mid a \le \rho _{H}(w, 0) < b\}\).
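The peeling shells \(W[a, b]\) partition the nonzero vectors by their Hamming distance to 0 into dyadic bands. A minimal sketch of this bookkeeping (the class V of all length-4 binary vectors is only a toy example):

```python
from itertools import product

def shells(W, a, b):
    """W[a, b]: binary vectors whose Hamming distance to 0 (their weight) lies in [a, b)."""
    return [w for w in W if a <= sum(w) < b]

V = list(product([0, 1], repeat=4))  # all binary vectors of length 4
# Dyadic peeling: shells [2^k, 2^(k+1)) cover every nonzero vector exactly once.
peel = [shells(V, 2**k, 2**(k + 1)) for k in range(3)]
assert sum(len(s) for s in peel) == len(V) - 1  # everything except the zero vector
```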
The first term is upper bounded by \(\frac{2\log (\mathcal {M}^{\text {loc}}_{1}(V, \gamma , n,c))}{cn}\) by Lemma 1 and by noting that \(|\mathcal {N}_{\gamma }[0, 2\gamma /c]| \le \mathcal {M}_{1}(\mathcal {B}_{H}(0, (2\gamma )/c,\{X_1,\ldots ,X_n\}), (2\gamma )/2)\le \mathcal {M}^{\text {loc}}_{1}(V,\gamma ,n,c)\). Now we upper-bound the second term. We start with an arbitrary summand. For any \(\lambda > 0\), we have
Here we used that \(\left| \mathcal {M}_{\gamma }\left[ 0, 2^{k + 1}\gamma /c\right] \right| \le \left| \mathcal {M}^{\text {loc}}_{1}(\mathcal {G}, 2\gamma , n, c)\right| ^{2^{k + 1}}\) and that any minimal covering is also a packing. We fix \(\gamma = K\gamma ^{\text {loc}}_{c, c}(n)\) for some \(K > 2\). Observe that local entropy is nonincreasing and \(K\gamma ^{\text {loc}}_{c, c}(n) > 2\gamma ^{\text {loc}}_{c,c}(n) \ge \gamma ^{\text {loc}}_{c, c}(n) + 1\). Thus,
Then we have for \(\lambda = \frac{c}{8}\),
We set \(K = 2^{9}\) and have \(\sum \limits _{k = 1}^{\infty }\ln \left( \exp \left( 2^{k + 2}c\gamma ^{\text {loc}}_{c, c}(n) -2^{k - 6}Kc\gamma ^{\text {loc}}_{c, c}(n)\right) + 1\right) \le C, \) where \(C > 0\) is an absolute constant. Here we used that \(\ln (x + 1) \le x\) for \(x > 0\) and \(c\gamma ^{\text {loc}}_{c, c}(n) \gtrsim 1\). Combining with the first two terms we finish the proof. \(\square \)
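With \(K = 2^{9}\) the exponent in each summand equals \(-2^{k+2}c\gamma ^{\text {loc}}_{c,c}(n)\), so the series converges geometrically fast. A numeric sketch at the boundary case \(c\gamma ^{\text {loc}}_{c,c}(n) = 1\) (an assumption for illustration; the proof only needs \(c\gamma ^{\text {loc}}_{c,c}(n) \gtrsim 1\)):

```python
from math import exp, log

K = 2**9
cg = 1.0  # boundary case c * gamma_loc = 1; larger values only shrink the sum

# Each exponent is 2^(k+2)*cg - 2^(k-6)*K*cg = -2^(k+2)*cg < 0,
# so every term is at most exp(-2^(k+2)) and the series is tiny.
total = sum(log(exp(2**(k + 2) * cg - 2**(k - 6) * K * cg) + 1)
            for k in range(1, 60))
assert total < 1e-3
```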
Proof
(Proposition 1). The first part of the proof closely follows the proof of Theorem 17 in [8], with slight modifications, to arrive at an upper bound on \(\mathcal {M}^{\text {loc}}_{1}(\mathcal {F}, \gamma , n, h)\). The suprema in the definition of local empirical entropy are achieved at some set \(\{x_{1}, \ldots , x_{n}\}\), some function \(f \in \mathcal {F}\), and some \(\varepsilon \in [\gamma ,n]\). Letting \(r = \varepsilon /n\), denote by \(\mathcal {M}_r\) the maximal (rn / 2)-packing (under \(\rho _H\)) of \(\mathcal {B}_{H}(f,rn/h,\{x_1,\ldots ,x_n\})\), so that \(|\mathcal {M}_r| = \mathcal {M}^{\text {loc}}_{1}(\mathcal {F},\gamma ,n,h)\). Also introduce a uniform probability measure \(P_X\) on \(\{x_1,\ldots ,x_n\}\) and fix \(m = \left\lceil \frac{4}{r}\log (|\mathcal {M}_r|)\right\rceil \). Let \(X_{1}, \ldots , X_{m}\) be m independent \(P_X\)-distributed random variables, and let A denote the event that, for all \(g,g' \in \mathcal {M}_{r}\) with \(g \ne g'\), there exists an \(i \in \{1, \ldots , m\}\) such that \(g(X_{i}) \ne g'(X_{i})\). For a given pair of distinct functions \(g,g' \in \mathcal {M}_r\), they disagree on some \(X_i\) with probability \( 1 - (1 - P_X(g(X) \ne g'(X)))^m > 1 - \exp (-rm/2) \ge 1 - \frac{1}{|\mathcal {M}_r|^2}. \) A union bound over all unordered pairs \(g, g' \in \mathcal {M}_r\) gives \(\mathbb {P}(A) > \frac{1}{2}\). On the event A, functions in \(\mathcal {M}_{r}\) realize distinct classifications of \(X_{1}, \ldots , X_{m}\). For any \(X_i \notin \text {DIS}(\mathcal {B}_{H}(f, rn/h, \{x_{1}, \ldots , x_{n}\}))\), all classifiers in \(\mathcal {M}_{r}\) agree. Thus, \(|\mathcal {M}_r|\) is bounded by the number of classifications of \(\{X_{1}, \ldots , X_{m}\} \cap \text {DIS}(\mathcal {B}_{H}(f, rn/h))\) realized by classifiers in \(\mathcal {F}\). 
By the Chernoff bound, on an event B with \(\mathbb {P}(B) \ge \frac{1}{2}\) we have \( |\{X_{1}, \ldots , X_{m}\} \cap \text {DIS}(\mathcal {B}_{H}(f, rn/h))| \le 1 + 2eP_X(\text {DIS}(\mathcal {B}_{H}(f, rn/h)))m. \) Using the definition of \(\tau (\cdot )\) (Definition 2) we have \( 1 + 2eP_X(\text {DIS}(\mathcal {B}_{H}(f, rn/h)))m \le 1 + 2e \tau \left( \frac{r}{h}\right) \frac{r}{h} m \le 11e\tau \left( \frac{r}{h}\right) \frac{\log (|\mathcal {M}_{r}|)}{h}. \) With probability at least \(\frac{1}{2}\), \( |\{X_{1}, \ldots , X_{m}\} \cap \text {DIS}(\mathcal {B}_{H}(f, rn/h))| \le 11e\tau \left( \frac{r}{h}\right) \frac{\log (|\mathcal {M}_{r}|)}{h}. \) Using the union bound, we have that with positive probability there exists a sequence of at most \(11e\tau \left( \frac{r}{h}\right) \frac{\log (|\mathcal {M}_r|)}{h}\) elements, such that all functions in \(\mathcal {M}_r\) classify this sequence distinctly. By the VC lemma [23], we therefore have that \( |\mathcal {M}_{r}| \le \left( \frac{11e^2\tau \left( \frac{r}{h}\right) \frac{\log (|\mathcal {M}_{r}|)}{d}}{d}\right) ^{d}. \) Using Corollary 4.1 from [24] we have \( \log (|\mathcal {M}_{r}|) \le 2d\log \left( 11e^2\tau \left( \frac{r}{h}\right) \frac{1}{h}\right) . \) Using \(\tau \left( \frac{r}{h}\right) \le \mathbf {s} \wedge \frac{h}{r} \le \mathbf {s} \wedge \frac{nh}{\gamma }\) (Theorem 10 in [8]) we finally have \( \log (\mathcal {M}^{\text {loc}}_{1}(\mathcal {F}, \gamma , n, h)) \le 2d\log \left( 11e^2\left( \frac{n}{\gamma } \wedge \frac{\mathbf {s}}{h}\right) \right) . \) Observe that \( h\gamma ^{\text {loc}}_{h, h}(n) \le 2d\log \left( 11e^2\left( \frac{n}{\gamma ^{\text {loc}}_{h, h}(n)} \wedge \frac{\mathbf {s}}{h}\right) \right) . \) We have \(\gamma ^{\text {loc}}_{h, h}(n) \le \frac{2d\log \left( 11e^2\frac{\mathbf {s}}{h}\right) }{h}\). 
If \(\gamma = \frac{2d\log \left( 11e^2\frac{nh}{d}\right) }{h}\), then \(h\gamma = 2d\log \left( 11e^2\frac{nh}{d}\right) \), but \(2d\log \left( 11e^2\frac{n}{\gamma }\right) \le 2d\log \left( 11e^2\frac{nh}{d}\right) \) if \(h > \frac{d}{11en}\). Finally, we have \( \gamma ^{\text {loc}}_{h, h}(n) \le \frac{2d\log \left( 11e^2\left( \frac{nh}{d} \wedge \frac{\mathbf {s}}{h}\right) \right) }{h}. \) Now we prove the lower bound. From (2) established above, we know that \(\frac{\gamma ^{\text {loc}}_{h, h}(n)}{n}\) is, up to an absolute constant, a distribution-free upper bound for \(\mathbb {E}(R(\hat{f}) - R(f^{*}))\), holding for all ERM learners \(\hat{f}\). Then a lower bound on \(\sup \limits _{P \in \mathcal {P}(h,\mathcal {F})} \mathbb {E}(R(\hat{f}) - R(f^{*}))\) holding for any ERM learner is also a lower bound for \(\frac{\gamma ^{\text {loc}}_{h, h}(n)}{n}\). In particular, it is known [9, 18] that for any learning procedure \(\tilde{f}\), if \(h \ge \sqrt{\frac{d}{n}}\), then \(\sup \limits _{P \in \mathcal {P}(h,\mathcal {F})} \mathbb {E}(R(\tilde{f}) - R(f^{*})) \gtrsim \frac{d + (1-h)\log (n h^2 \wedge \mathbf {s})}{nh}\), while if \(h < \sqrt{\frac{d}{n}}\) then \(\sup \limits _{P \in \mathcal {P}(h,\mathcal {F})} \mathbb {E}(R(\tilde{f}) - R(f^{*})) \gtrsim \sqrt{\frac{d}{n}}\). Furthermore, in the particular case of ERM, [9] proves that any upper bound on \(\sup \limits _{P \in \mathcal {P}(1,\mathcal {F})} \mathbb {E}(R(\hat{f}) - R(f^*))\) holding for all ERM learners \(\hat{f}\) must have size, up to an absolute constant, at least \(\frac{\log (n \wedge \mathbf {s})}{n}\). Together, these lower bounds imply \(\gamma ^{\text {loc}}_{h, h}(n) \gtrsim \frac{d + \log (n h^2 \wedge \mathbf {s})}{h} \wedge \sqrt{d n}\). \(\square \)
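The object bounded in Proposition 1, a maximal \((rn/2)\)-packing of a Hamming ball of radius \(rn/h\), can be computed greedily for a toy class. The sketch below uses threshold classifiers on a line (VC dimension 1, an assumption chosen purely for illustration), whose projections on n ordered points are the vectors \(0^{i}1^{n-i}\); the resulting packing stays of constant size as n grows, consistent with the \(2d\log (\cdot )\) bound:

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def greedy_packing(vectors, radius):
    """Greedy maximal packing: keep only vectors pairwise more than `radius` apart."""
    pack = []
    for v in vectors:
        if all(hamming(v, p) > radius for p in pack):
            pack.append(v)
    return pack

# Projections of threshold classifiers on n ordered points: 0^i 1^(n-i).
n = 32
V = [tuple(0 if j < i else 1 for j in range(n)) for i in range(n + 1)]
center = V[0]                                       # plays the role of f
r, h = 0.25, 0.5                                    # illustrative choices of r and h
ball = [v for v in V if hamming(v, center) <= r * n / h]   # Hamming ball of radius rn/h
pack = greedy_packing(ball, r * n / 2)              # (rn/2)-separated subset
print(len(pack))
```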
© 2016 Springer International Publishing Switzerland
Zhivotovskiy, N., Hanneke, S. (2016). Localization of VC Classes: Beyond Local Rademacher Complexities. In: Ortner, R., Simon, H., Zilles, S. (eds) Algorithmic Learning Theory. ALT 2016. Lecture Notes in Computer Science(), vol 9925. Springer, Cham. https://doi.org/10.1007/978-3-319-46379-7_2