
Localization of VC Classes: Beyond Local Rademacher Complexities

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9925)

Abstract

In statistical learning, the excess risk of empirical risk minimization (ERM) is controlled by \(\left( \frac{\text {COMP}_{n}(\mathcal {F})}{n}\right) ^{\alpha }\), where n is the size of the learning sample, \(\text {COMP}_{n}(\mathcal {F})\) is a complexity term associated with a given class \(\mathcal {F}\), and \(\alpha \in [\frac{1}{2}, 1]\) interpolates between slow and fast learning rates. In this paper we introduce an alternative localization approach for binary classification that leads to a novel complexity measure: fixed points of the local empirical entropy. We show that this complexity measure gives tight control over \(\text {COMP}_{n}(\mathcal {F})\) in the upper bounds under bounded noise. Our results are accompanied by a novel minimax lower bound that involves the same quantity. In particular, we essentially settle the question of the optimality of ERM under bounded noise for general VC classes.

References

  1. Alexander, K.S.: Rates of growth and sample moduli for weighted empirical processes indexed by sets. Probab. Theory Relat. Fields 75, 379–423 (1987)

  2. Balcan, M.F., Long, P.M.: Active and passive learning of linear separators under log-concave distributions. In: 26th Conference on Learning Theory (2013)

  3. Bartlett, P.L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)

  4. Boucheron, S., Bousquet, O., Lugosi, G.: Theory of classification: a survey of recent advances. ESAIM: Probab. Stat. 9, 323–375 (2005)

  5. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Applications of Mathematics, vol. 31. Springer, New York (1996)

  6. Giné, E., Koltchinskii, V.: Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34(3), 1143–1216 (2006)

  7. Hanneke, S.: Theory of disagreement-based active learning. Found. Trends Mach. Learn. 7(2–3), 131–309 (2014)

  8. Hanneke, S., Yang, L.: Minimax analysis of active learning. J. Mach. Learn. Res. 16(12), 3487–3602 (2015)

  9. Hanneke, S.: Refined error bounds for several learning algorithms (2015). http://arXiv.org/abs/1512.07146

  10. Haussler, D., Littlestone, N., Warmuth, M.: Predicting \(\{0, 1\}\)-functions on randomly drawn points. Inf. Comput. 115, 248–292 (1994)

  11. Haussler, D.: Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik–Chervonenkis dimension. J. Combin. Theory Ser. A 69, 217–232 (1995)

  12. Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)

  13. Le Cam, L.M.: Convergence of estimates under dimensionality restrictions. Ann. Stat. 1, 38–53 (1973)

  14. Lecué, G., Mitchell, C.: Oracle inequalities for cross-validation type procedures. Electron. J. Stat. 6, 1803–1837 (2012)

  15. Lecué, G., Mendelson, S.: Learning subgaussian classes: upper and minimax bounds (2013). http://arXiv.org/abs/1305.4825

  16. Liang, T., Rakhlin, A., Sridharan, K.: Learning with square loss: localization through offset Rademacher complexity. In: Proceedings of the 28th Conference on Learning Theory (2015)

  17. Massart, P.: Concentration Inequalities and Model Selection. École d'Été de Probabilités de Saint-Flour. Springer, New York (2003)

  18. Massart, P., Nédélec, E.: Risk bounds for statistical learning. Ann. Stat. 34(5), 2326–2366 (2006)

  19. Mendelson, S.: ‘Local’ vs. ‘global’ parameters – breaking the Gaussian complexity barrier (2015). http://arXiv.org/abs/1504.02191

  20. Raginsky, M., Rakhlin, A.: Lower bounds for passive and active learning. In: Advances in Neural Information Processing Systems 24, NIPS (2011)

  21. Rakhlin, A., Sridharan, K., Tsybakov, A.B.: Empirical entropy, minimax regret and minimax risk. Bernoulli (2015, forthcoming)

  22. Talagrand, M.: Upper and Lower Bounds for Stochastic Processes. Springer, Heidelberg (2014)

  23. Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Proc. USSR Acad. Sci. 181(4), 781–783 (1968). English translation: Soviet Math. Dokl. 9, 915–918

  24. Vidyasagar, M.: Learning and Generalization with Applications to Neural Networks, 2nd edn. Springer, Heidelberg (2003)

  25. Yang, Y., Barron, A.: Information-theoretic determination of minimax rates of convergence. Ann. Stat. 27, 1564–1599 (1999)

Acknowledgments

The authors would like to thank Sasha Rakhlin for his suggestion to use offset Rademacher processes to analyze binary classification under Tsybakov noise conditions and anonymous reviewers for their helpful comments. NZ was supported solely by the Russian Science Foundation grant (project 14-50-00150).

Author information

Correspondence to Nikita Zhivotovskiy.

Appendix

Proof

(Theorem 1). Let \(\text {DIS}_{0}\) be the disagreement set of the version space of the first \(\lfloor n/2 \rfloor \) instances of the learning sample. The random error set will be denoted by \(E_{1} = \{x \in \mathcal {X}|\hat{f}(x) \ne f^{*}(x)\}\). Using symmetrization Lemmas 2 and 1, we have \( \mathbb {E}P(E_{1}) = \mathbb {E}R(\hat{f}) \le \mathbb {E}\sup \limits _{g \in \mathcal {G}_{f^*}}(Pg - (1 + c)P_{n}g) \le \frac{2\left( 1 + \frac{c}{2}\right) ^2}{c}\frac{\log \left( \mathcal {S}_{\mathcal {F}}\left( n\right) \right) }{n} \) for \(c > 0\). We fix \(c = 2\) and obtain that for any distribution \(\mathbb {E}P(E_{1}) \le \frac{4\log \left( \mathcal {S}_{\mathcal {F}}\left( n\right) \right) }{n}\). Now we use \( R(\hat{f}) = P(E_{1}|\text {DIS}_{0})P(\text {DIS}_{0}). \) Let \(\xi = |\text {DIS}_{0} \cap \{X_{\lfloor n/2 \rfloor + 1}, \ldots , X_{n}\}|\). Conditionally on the first \(\lfloor n/2 \rfloor \) instances, \(\xi \) has a binomial distribution. Expectations with respect to the first and the last parts of the sample will be denoted respectively by \(\mathbb {E}\) and \(\mathbb {E}'\). Conditionally on \(\{x_{1}, \ldots , x_{\lfloor n/2 \rfloor }\}\) we introduce two events: \( A_{1}: \xi < \frac{nP(\text {DIS}_{0})}{4}\) and \( A_{2}: \xi > \frac{3nP(\text {DIS}_{0})}{4}. \) Using Chernoff bounds we have \(P(A_{1}) \le \exp \left( -\frac{nP(\text {DIS}_{0})}{16}\right) \) and \(P(A_{2}) \le \exp \left( -\frac{nP(\text {DIS}_{0})}{16}\right) \). Denote \(A = A_{1}\cup A_{2}\). Then \( \mathbb {E}'P(E_{1}|\text {DIS}_{0}) = \mathbb {E}'\left[ P(E_{1}|\text {DIS}_{0})\Big |\overline{A}\right] P(\overline{A}) + \mathbb {E}'\left[ P(E_{1}|\text {DIS}_{0})\Big |A\right] P(A). \) For the first term we have \( \mathbb {E}'\left[ P(E_{1}|\text {DIS}_{0})\Big |\overline{A}\right] P(\overline{A}) \le \frac{16\log \left( \mathcal {S}_{\mathcal {F}}\left( \frac{3nP(\text {DIS}_{0})}{4}\right) \right) }{nP(\text {DIS}_{0})}. \) For the second term we can directly prove that \(\mathbb {E}'\left[ P(E_{1}|\text {DIS}_{0})\Big |A\right] P(\text {DIS}_{0})P(A) \le \frac{12}{n}\). It is easy to see that for all natural k, r we have \(\left( \mathcal {S}_{\mathcal {F}}(kr)\right) ^{\frac{1}{r}} \le \mathcal {S}_{\mathcal {F}}(k)\), since the shatter function is submultiplicative: \(\mathcal {S}_{\mathcal {F}}(a + b) \le \mathcal {S}_{\mathcal {F}}(a)\mathcal {S}_{\mathcal {F}}(b)\). Finally, \( \mathbb {E}R(\hat{f}) \le \mathbb {E}\tfrac{16\log \left( \mathcal {S}_{\mathcal {F}}\left( \tfrac{3nP(\text {DIS}_{0})}{4}\right) \right) }{n} + \tfrac{12}{n} \le \tfrac{40\log \left( \mathcal {S}_{\mathcal {F}}\left( \mathbf {s}\right) \right) }{n} + \tfrac{12}{n}. \)    \(\square \)
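
As a quick numerical illustration of the realizable-case bound \(\mathbb {E}P(E_{1}) \le \frac{4\log \left( \mathcal {S}_{\mathcal {F}}\left( n\right) \right) }{n}\) used above, the following sketch (our own construction, not from the paper) takes the class of one-dimensional thresholds, for which the growth function is \(\mathcal {S}_{\mathcal {F}}(n) = n + 1\), and estimates the expected ERM error by Monte Carlo.

```python
import math
import random

def erm_threshold(xs, ys):
    """Return a threshold consistent with the sample when ys[i] = 1[xs[i] >= theta]."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    return min(pos) if pos else 1.0  # no positive examples: predict 0 everywhere on [0, 1)

def estimate_risk(n, theta=0.5, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [1 if x >= theta else 0 for x in xs]
        t_hat = erm_threshold(xs, ys)
        total += t_hat - theta  # this ERM classifier errs exactly on [theta, t_hat)
    return total / trials

for n in (50, 200, 1000):
    bound = 4 * math.log(n + 1) / n  # 4 * log(S_F(n)) / n with S_F(n) = n + 1 for thresholds
    print(n, round(estimate_risk(n), 5), "<=", round(bound, 5))
```

In this toy case the estimated error is smaller than the bound by roughly a logarithmic factor, reflecting that for such a simple class the \(\log \mathcal {S}_{\mathcal {F}}(n)\) term is not needed.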

Proof

(Lemma 4). Once again, given \(X_1,\ldots ,X_n\), let \(V = \{ (g(X_1),\ldots ,g(X_n)) : g \in \mathcal {G}\}\) denote the set of binary vectors corresponding to the values of functions in \(\mathcal {G}\). As above, for a fixed \(\gamma \) and a fixed minimal \(\gamma \)-covering subset \(\mathcal {N}_{\gamma } \subseteq V\), p(v) will denote the vector in \(\mathcal {N}_{\gamma }\) closest to \(v \in V\). We will denote by \(\mathbb {E}_{\xi }\) the conditional expectation over the \(\xi _i\) variables, given \(X_1,\ldots ,X_n\). We follow the decomposition proposed by Liang, Rakhlin, and Sridharan [16]:

$$\begin{aligned}&\frac{1}{n}\mathbb {E}_{\xi }\max \limits _{v \in V}\left( \sum \limits _{i = 1}^{n}\xi _{i}v_{i}- cv_{i}\right) \le \frac{1}{n}\mathbb {E}_{\xi }\max \limits _{v \in V}\left( \sum \limits _{i = 1}^{n}\xi _{i}\!\left( v_{i} \!-\! p(v)_{i}\right) \right) \\&+ \frac{1}{n}\mathbb {E}_{\xi }\max \limits _{v \in V}\left( \sum \limits _{i = 1}^{n}\frac{c}{4}p(v)_{i} \!-\! cv_{i}\right) + \frac{1}{n}\mathbb {E}_{\xi }\max \limits _{v \in V}\left( \sum \limits _{i = 1}^{n}\xi _{i}p(v)_{i} \!-\! \frac{c}{4}p(v)_{i}\right) . \end{aligned}$$

The first term is \(\lesssim \frac{\gamma }{n}\) by the \(\gamma \)-cover property and the fact that \(|\xi _i| \lesssim 1\). Furthermore, it is easy to show that the second term is at most \(\frac{c}{4} \frac{\gamma }{n}\). Now we analyze the last term carefully. First, we use the standard peeling argument. Given a set W of binary vectors, we define \(W[a, b] = \{w \in W|a \le \rho _{H}(w, 0) < b\}\).
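
The following toy sketch (our own illustration; the helper name peel is hypothetical and not from the paper) shows how a finite set of binary vectors splits into the small shell and the dyadic Hamming-weight shells \(W[2^{k}\gamma /c, 2^{k+1}\gamma /c]\) just defined; the peeling bound below treats each shell separately.

```python
from itertools import product

def peel(vectors, gamma, c, max_k=20):
    """Group binary vectors into the small shell [0, 2*gamma/c) and the dyadic
    Hamming-weight shells [2^k*gamma/c, 2^(k+1)*gamma/c), k = 1, 2, ..."""
    shells = {k: [] for k in range(max_k + 1)}
    for w in vectors:
        weight = sum(w)  # rho_H(w, 0): Hamming distance to the zero vector
        if weight < 2 * gamma / c:
            shells[0].append(w)
            continue
        for k in range(1, max_k + 1):
            if 2 ** k * gamma / c <= weight < 2 ** (k + 1) * gamma / c:
                shells[k].append(w)
                break
    return shells

# Every vector lands in exactly one shell, so a maximum over V splits into a
# maximum over the small shell plus a sum of (clipped) maxima over the shells.
V = list(product([0, 1], repeat=6))
shells = peel(V, gamma=2.0, c=1.0)
assert sum(len(s) for s in shells.values()) == len(V)
```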

$$\begin{aligned}&\mathbb {E}_{\xi }\max \limits _{v \in V}\left( \sum \limits _{i = 1}^{n}\xi _{i}p(v)_{i}- \frac{c}{4}p(v)_{i}\right) =\mathbb {E}_{\xi }\max \limits _{v \in \mathcal {N}_{\gamma }}\left( \sum \limits _{i = 1}^{n}\xi _{i}v_{i}- \frac{c}{4}v_{i}\right) \\&\le \mathbb {E}_{\xi }\max \limits _{v \in \mathcal {N}_{\gamma }[0, 2\gamma /c]}\left( \sum \limits _{i = 1}^{n}\xi _{i}v_{i}- \frac{c}{4}v_{i}\right) + \sum \limits _{k = 1}^{\infty }\mathbb {E}_{\xi }\max \limits _{v \in \mathcal {N}_{\gamma }\left[ 2^{k}\gamma /c, 2^{k + 1}\gamma /c\right] }\left( \sum \limits _{i = 1}^{n}\xi _{i}v_{i}- \frac{c}{4}v_{i}\right) _{+}. \end{aligned}$$

The first term is upper bounded by \(\frac{2\log (\mathcal {M}^{\text {loc}}_{1}(V, \gamma , n,c))}{cn}\) by Lemma 1 and by noting that \(|\mathcal {N}_{\gamma }[0, 2\gamma /c]| \!\!\le \!\! \mathcal {M}_{1}(\mathcal {B}_{H}(0, (2\gamma )/c,\{X_1,\ldots ,X_n\}), (2\gamma )/2)\!\le \! \mathcal {M}^{\text {loc}}_{1}(V,\gamma ,n,c)\). Now we upper-bound the second term. We start with an arbitrary summand. For any \(\lambda > 0\), we have

$$\begin{aligned}&\mathbb {E}_{\xi }\max \limits _{v \in \{0\} \cup \mathcal {N}_{\gamma }\left[ 2^{k}\gamma /c, 2^{k + 1}\gamma /c\right] }\left( \sum \limits _{i = 1}^{n}\xi _{i}v_{i}- \frac{c}{4}v_{i}\right) \\&\le \frac{1}{\lambda }\ln \left( \sum \limits _{v \in \mathcal {N}_{\gamma }\left[ 2^{k}\gamma /c, 2^{k + 1}\gamma /c\right] }\mathbb {E}_{\xi }\exp \left\{ \sum \limits _{i = 1}^{n}\lambda \xi _{i}v_{i} - \frac{\lambda c}{4} v_{i}\right\} + 1\right) \\&\le \frac{1}{\lambda }\ln \left( \left| \mathcal {N}_{\gamma }\left[ 2^{k}\gamma /c, 2^{k + 1}\gamma /c\right] \right| \exp \left\{ 2^{k-2}\gamma (4\lambda ^2 - \lambda c)/c\right\} + 1\right) \\&\le \frac{1}{\lambda }\ln \left( \left( \mathcal {M}^{\text {loc}}_{1}(\mathcal {G}, 2\gamma , n, c)\right) ^{2^{k + 1}}\exp \left\{ 2^{k-2}\gamma (4\lambda ^2 - \lambda c)/c\right\} + 1\right) . \end{aligned}$$

Here we used that \(\left| \mathcal {N}_{\gamma }\left[ 0, 2^{k + 1}\gamma /c\right] \right| \le \left| \mathcal {M}^{\text {loc}}_{1}(\mathcal {G}, 2\gamma , n, c)\right| ^{2^{k + 1}}\) and that any minimal covering is also a packing. We fix \(\gamma = K\gamma ^{\text {loc}}_{c, c}(n)\) for some \(K > 2\). Observe that the local entropy is nonincreasing and \(K\gamma ^{\text {loc}}_{c, c}(n) > 2\gamma ^{\text {loc}}_{c,c}(n) \ge \gamma ^{\text {loc}}_{c, c}(n) + 1\). Thus,

$$\begin{aligned}&\ln \left( \exp \left\{ 2^{k + 1}\log \left( \mathcal {M}^{\text {loc}}_{1}(V, 2K\gamma ^{\text {loc}}_{c, c}(n), n, c)\right) + 2^{k-2}K\gamma ^{\text {loc}}_{c, c}(n)(4\lambda ^2 - \lambda c)/c\right\} + 1\right) \\&\le \ln \left( \exp \left\{ 2^{k + 1}c(\gamma ^{\text {loc}}_{c, c}(n) + 1) + 2^{k-2}K\gamma ^{\text {loc}}_{c, c}(n)(4\lambda ^2 - \lambda c)/c\right\} + 1\right) . \end{aligned}$$

Then we have for \(\lambda = \frac{c}{8}\),

$$\begin{aligned}&\sum \limits _{k = 1}^{\infty }\frac{8}{c}\ln \left( \exp \left( 2^{k + 1}\log \left( \mathcal {M}^{\text {loc}}_{1}(\mathcal {G}, 2K\gamma ^{\text {loc}}_{c, c}(n), n)\right) \right) \exp \left( -2^{k - 6}Kc\gamma ^{\text {loc}}_{c, c}(n)\right) + 1\right) \\&\le \sum \limits _{k = 1}^{\infty }\frac{8}{c}\ln \left( \exp \left( 2^{k + 2}c\gamma ^{\text {loc}}_{c, c}(n) -2^{k - 6}Kc\gamma ^{\text {loc}}_{c, c}(n)\right) + 1\right) . \end{aligned}$$

We set \(K = 2^{9}\) and have \(\sum \limits _{k = 1}^{\infty }\ln \left( \exp \left( 2^{k + 2}c\gamma ^{\text {loc}}_{c, c}(n) -2^{k - 6}Kc\gamma ^{\text {loc}}_{c, c}(n)\right) + 1\right) \le C, \) where \(C > 0\) is an absolute constant. Indeed, with \(K = 2^{9}\) the exponent equals \(-2^{k + 2}c\gamma ^{\text {loc}}_{c, c}(n)\), so each summand is at most \(\exp \left( -2^{k + 2}c\gamma ^{\text {loc}}_{c, c}(n)\right) \); here we used that \(\ln (x + 1) \le x\) for \(x > 0\) and \(c\gamma ^{\text {loc}}_{c, c}(n) \gtrsim 1\). Combining with the first two terms, we finish the proof.    \(\square \)
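
As a sanity check of this last estimate (our own check; the helper name peeled_series is hypothetical), the sketch below evaluates the series with \(K = 2^{9}\) for a few values of \(c\gamma ^{\text {loc}}_{c, c}(n) \ge 1\) and confirms it stays below a small absolute constant.

```python
import math

def peeled_series(c_gamma, K=2 ** 9, terms=60):
    """Sum of ln(exp(2^(k+2)*c_gamma - 2^(k-6)*K*c_gamma) + 1) over k = 1..terms."""
    total = 0.0
    for k in range(1, terms + 1):
        exponent = (2 ** (k + 2) - 2 ** (k - 6) * K) * c_gamma  # equals -2^(k+2)*c_gamma for K = 2^9
        total += math.log(math.exp(exponent) + 1.0)
    return total

for c_gamma in (1.0, 2.0, 10.0):
    print(c_gamma, peeled_series(c_gamma))  # each value is below 4e-4
```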

Proof

(Proposition 1). The first part of the proof closely follows the proof of Theorem 17 in [8], with slight modifications, to arrive at an upper bound on \(\mathcal {M}^{\text {loc}}_{1}(\mathcal {F}, \gamma , n, h)\). The suprema in the definition of local empirical entropy are achieved at some set \(\{x_{1}, \ldots , x_{n}\}\), some function \(f \in \mathcal {F}\), and some \(\varepsilon \in [\gamma ,n]\). Letting \(r = \varepsilon /n\), denote by \(\mathcal {M}_r\) the maximal (rn / 2)-packing (under \(\rho _H\)) of \(\mathcal {B}_{H}(f,rn/h,\{x_1,\ldots ,x_n\})\), so that \(|\mathcal {M}_r| = \mathcal {M}^{\text {loc}}_{1}(\mathcal {F},\gamma ,n,h)\). Also introduce a uniform probability measure \(P_X\) on \(\{x_1,\ldots ,x_n\}\) and fix \(m = \left\lceil \frac{4}{r}\log (|\mathcal {M}_r|)\right\rceil \). Let \(X_{1}, \ldots , X_{m}\) be m independent \(P_X\)-distributed random variables, and let A denote the event that, for all \(g,g' \in \mathcal {M}_{r}\) with \(g \ne g'\), there exists an \(i \in \{1, \ldots , m\}\) such that \(g(X_{i}) \ne g'(X_{i})\). For a given pair of distinct functions \(g,g' \in \mathcal {M}_r\), they disagree on some \(X_i\) with probability \( 1 - (1 - P_X(g(X) \ne g'(X)))^m > 1 - \exp (-rm/2) \ge 1 - \frac{1}{|\mathcal {M}_r|^2}. \) A union bound over all unordered pairs \(g, g' \in \mathcal {M}_r\) then gives \(\mathbb {P}(A) > \frac{1}{2}\). On the event A, functions in \(\mathcal {M}_{r}\) realize distinct classifications of \(X_{1}, \ldots , X_{m}\). For any \(X_i \notin \text {DIS}(\mathcal {B}_{H}(f, rn/h, \{x_{1}, \ldots , x_{n}\}))\), all classifiers in \(\mathcal {M}_{r}\) agree. Thus, \(|\mathcal {M}_r|\) is bounded by the number of classifications of \(\{X_{1}, \ldots , X_{m}\} \cap \text {DIS}(\mathcal {B}_{H}(f, rn/h))\) realized by classifiers in \(\mathcal {F}\). By the Chernoff bound, on an event B with \(\mathbb {P}(B) \ge \frac{1}{2}\) we have \( |\{X_{1}, \ldots , X_{m}\} \cap \text {DIS}(\mathcal {B}_{H}(f, rn/h))| \le 1 + 2eP_X(\text {DIS}(\mathcal {B}_{H}(f, rn/h)))m. \) Using the definition of \(\tau (\cdot )\) (Definition 2) we have \( 1 + 2eP_X(\text {DIS}(\mathcal {B}_{H}(f, rn/h)))m \le 1 + 2e \tau \left( \frac{r}{h}\right) \frac{r}{h} m \le 11e\tau \left( \frac{r}{h}\right) \frac{\log (|\mathcal {M}_{r}|)}{h}. \) Hence, with probability at least \(\frac{1}{2}\), \( |\{X_{1}, \ldots , X_{m}\} \cap \text {DIS}(\mathcal {B}_{H}(f, rn/h))| \le 11e\tau \left( \frac{r}{h}\right) \frac{\log (|\mathcal {M}_{r}|)}{h}. \) Using the union bound, we have that with positive probability there exists a sequence of at most \(11e\tau \left( \frac{r}{h}\right) \frac{\log (|\mathcal {M}_r|)}{h}\) elements such that all functions in \(\mathcal {M}_r\) classify this sequence distinctly. By the VC lemma [23], we therefore have \( |\mathcal {M}_{r}| \le \left( \frac{11e^2\tau \left( \frac{r}{h}\right) \frac{\log (|\mathcal {M}_{r}|)}{h}}{d}\right) ^{d}. \) Using Corollary 4.1 from [24] we have \( \log (|\mathcal {M}_{r}|) \le 2d\log \left( 11e^2\tau \left( \frac{r}{h}\right) \frac{1}{h}\right) . \) Using \(\tau \left( \frac{r}{h}\right) \le \mathbf {s} \wedge \frac{h}{r} \le \mathbf {s} \wedge \frac{nh}{\gamma }\) (Theorem 10 in [8]) we finally have \( \log (\mathcal {M}^{\text {loc}}_{1}(\mathcal {F}, \gamma , n, h)) \le 2d\log \left( 11e^2\left( \frac{n}{\gamma } \wedge \frac{\mathbf {s}}{h}\right) \right) . \) Observe that \( h\gamma ^{\text {loc}}_{h, h}(n) \le 2d\log \left( 11e^2\left( \frac{n}{\gamma ^{\text {loc}}_{h, h}(n)} \wedge \frac{\mathbf {s}}{h}\right) \right) . \) We have \(\gamma ^{\text {loc}}_{h, h}(n) \le \frac{2d\log \left( 11e^2\frac{\mathbf {s}}{h}\right) }{h}\). If \(\gamma = \frac{2d\log \left( 11e^2\frac{nh}{d}\right) }{h}\), then \(h\gamma = 2d\log \left( 11e^2\frac{nh}{d}\right) \), but \(2d\log \left( 11e^2\frac{n}{\gamma }\right) \le 2d\log \left( 11e^2\frac{nh}{d}\right) \) if \(h > \frac{d}{11en}\). Finally, we have \( \gamma ^{\text {loc}}_{h, h}(n) \le \frac{2d\log \left( 11e^2\left( \frac{nh}{d} \wedge \frac{\mathbf {s}}{h}\right) \right) }{h}. \) Now we prove the lower bound. From (2) established above, we know that \(\frac{\gamma ^{\text {loc}}_{h, h}(n)}{n}\) is, up to an absolute constant, a distribution-free upper bound for \(\mathbb {E}(R(\hat{f}) - R(f^{*}))\), holding for all ERM learners \(\hat{f}\). Then a lower bound on \(\sup \limits _{P \in \mathcal {P}(h,\mathcal {F})} \mathbb {E}(R(\hat{f}) - R(f^{*}))\) holding for any ERM learner is also a lower bound for \(\frac{\gamma ^{\text {loc}}_{h, h}(n)}{n}\). In particular, it is known [9, 18] that for any learning procedure \(\tilde{f}\), if \(h \ge \sqrt{\frac{d}{n}}\), then \(\sup \limits _{P \in \mathcal {P}(h,\mathcal {F})} \mathbb {E}(R(\tilde{f}) - R(f^{*})) \gtrsim \frac{d + (1-h)\log (n h^2 \wedge \mathbf {s})}{nh}\), while if \(h < \sqrt{\frac{d}{n}}\) then \(\sup \limits _{P \in \mathcal {P}(h,\mathcal {F})} \mathbb {E}(R(\tilde{f}) - R(f^{*})) \gtrsim \sqrt{\frac{d}{n}}\). Furthermore, in the particular case of ERM, [9] proves that any upper bound on \(\sup \limits _{P \in \mathcal {P}(1,\mathcal {F})} \mathbb {E}(R(\hat{f}) - R(f^*))\) holding for all ERM learners \(\hat{f}\) must be, up to an absolute constant, at least \(\frac{\log (n \wedge \mathbf {s})}{n}\). Together, these lower bounds imply \(\gamma ^{\text {loc}}_{h, h}(n) \gtrsim \frac{d + \log (n h^2 \wedge \mathbf {s})}{h} \wedge \sqrt{d n}\).    \(\square \)
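
To give some intuition for the fixed-point quantity \(\gamma ^{\text {loc}}_{h, h}(n)\) in the VC case, the following sketch (our own illustration; entropy_bound and fixed_point are hypothetical helper names, and the parameter values are arbitrary) solves the relation \(h\gamma \le 2d\log \left( 11e^2\left( \frac{n}{\gamma } \wedge \frac{\mathbf {s}}{h}\right) \right) \) numerically by bisection and checks that the solution lies below the closed-form bound derived above.

```python
import math

def entropy_bound(gamma, n, d, s, h):
    """Right-hand side 2 d log(11 e^2 (n/gamma ∧ s/h)) / h of the fixed-point relation."""
    return 2 * d * math.log(11 * math.e ** 2 * min(n / gamma, s / h)) / h

def fixed_point(n, d, s, h, iters=100):
    """Largest gamma with gamma <= entropy_bound(gamma), found by bisection."""
    lo, hi = 1e-9, float(n)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid < entropy_bound(mid, n, d, s, h):
            lo = mid  # still feasible, the fixed point is larger
        else:
            hi = mid
    return hi

n, d, s, h = 10_000, 5, 500, 0.3
gamma_star = fixed_point(n, d, s, h)
closed_form = 2 * d * math.log(11 * math.e ** 2 * min(n * h / d, s / h)) / h
print(gamma_star, "<=", closed_form)  # the fixed point is below the closed-form bound
```

The bisection relies on the map \(\gamma \mapsto \gamma - 2d\log \left( 11e^2\left( \frac{n}{\gamma } \wedge \frac{\mathbf {s}}{h}\right) \right) /h\) being increasing, so the fixed point is unique.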

Copyright information

© 2016 Springer International Publishing Switzerland

Cite this paper

Zhivotovskiy, N., Hanneke, S. (2016). Localization of VC Classes: Beyond Local Rademacher Complexities. In: Ortner, R., Simon, H., Zilles, S. (eds) Algorithmic Learning Theory. ALT 2016. Lecture Notes in Computer Science, vol. 9925. Springer, Cham. https://doi.org/10.1007/978-3-319-46379-7_2

  • DOI: https://doi.org/10.1007/978-3-319-46379-7_2

  • Print ISBN: 978-3-319-46378-0

  • Online ISBN: 978-3-319-46379-7