Teaching and Compressing for Low VC-Dimension

Abstract

In this work we study the quantitative relation between VC-dimension and two other basic parameters related to learning and teaching, namely the quality of sample compression schemes and of teaching sets for classes of low VC-dimension. Let C be a binary concept class of size m and VC-dimension d. Prior to this work, the best known upper bounds for both parameters were log(m), while the best known lower bounds are linear in d. We present significantly better upper bounds on both, as follows. Set k = O(d·2^d·log log |C|).

We show that there always exists a concept c in C with a teaching set (i.e. a list of c-labeled examples uniquely identifying c in C) of size k. This problem was studied by Kuhlmann (On teaching and learning intersection-closed concept classes. In: EuroCOLT, pp 168–182, 1999). Our construction implies that the recursive teaching (RT) dimension of C is at most k as well. The RT-dimension was suggested by Zilles et al. (J Mach Learn Res 12:349–384, 2011) and Doliwa et al. (Recursive teaching dimension, learning complexity, and maximum classes. In: ALT, pp 209–223, 2010). The same notion (under the name partial-ID width) was independently studied by Wigderson and Yehudayoff (Population recovery and partial identification. In: FOCS, pp 390–399, 2012). An upper bound on this parameter that depends only on d is known just for the very simple case d = 1, and is open even for d = 2. We also make small progress towards this seemingly modest goal.
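As a concrete illustration of the definition (not taken from this chapter), the following Python sketch finds a smallest teaching set for every concept in a tiny toy class by brute force; the class used below is an arbitrary example. Note that every singleton has a teaching set of size 1, while the all-zeros concept needs all three points, matching the statement that only some concept is guaranteed to have a small teaching set.

```python
from itertools import combinations

def is_teaching_set(C, c, S):
    """S is a teaching set for c in C if the c-labeled examples
    {(x, c[x]) : x in S} distinguish c from every other concept in C."""
    return all(any(h[x] != c[x] for x in S) for h in C if h != c)

def smallest_teaching_set(C, c):
    """Brute force over subsets of the domain, in order of increasing size."""
    domain = sorted(c)
    for k in range(len(domain) + 1):
        for S in combinations(domain, k):
            if is_teaching_set(C, c, S):
                return set(S)

# Toy class over the domain {0, 1, 2}: the three singletons plus the empty set.
domain = [0, 1, 2]
C = [{x: int(x == i) for x in domain} for i in domain] + [{x: 0 for x in domain}]
for c in C:
    print(c, smallest_teaching_set(C, c))
```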

We further construct sample compression schemes of size k for C, with additional information of k log(k) bits. Roughly speaking, given any list of C-labelled examples of arbitrary length, we can retain only k labeled examples in a way that allows recovering the labels of all other examples in the list, using k log(k) additional information bits. This problem was first suggested by Littlestone and Warmuth (Relating data compression and learnability. Unpublished, 1986).
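The following sketch (again only an illustration, not the scheme constructed in this chapter) shows the compress/reconstruct interface of a sample compression scheme in the classical toy case of one-dimensional thresholds, where a single labeled example and no side information suffice.

```python
def compress(sample):
    """Sample compression for one-dimensional thresholds c_t(x) = [x >= t].
    Keep a single labeled example: the smallest positively labeled point,
    or (if all labels are 0) the largest point. Size 1, no side information."""
    positives = [x for x, y in sample if y == 1]
    if positives:
        return [(min(positives), 1)]
    return [(max(x for x, _ in sample), 0)]

def reconstruct(kept):
    """Return a hypothesis consistent with every example of the original sample."""
    (x, y), = kept
    threshold = x if y == 1 else x + 1  # any value in the consistent interval works
    return lambda z: int(z >= threshold)

# Usage: labels produced by the threshold t = 3.5.
sample = [(x, int(x >= 3.5)) for x in [0, 1, 2, 5, 7]]
h = reconstruct(compress(sample))
assert all(h(x) == y for x, y in sample)
```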

Notes

1. A preliminary version of this work, combined with the paper “Sample compression schemes for VC classes” by the first and the last authors, was published in the proceedings of FOCS’15.
2. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under grant agreement number 257575, and from the Israel Science Foundation (grant number 339/10).
3. Horev fellow – supported by the Taub Foundation. Research is also supported by ISF and BSF.
4. In this text, O(f) means at most αf + β for constants α, β > 0.
5. In metric spaces such a set is called an ε-net; however, in learning theory and combinatorial geometry the term ε-net has a different meaning, so we use ε-approximating instead.
6. That is, C satisfies the Sauer–Shelah–Perles lemma with equality.
7. An algorithm that outputs a hypothesis in C that is consistent with the input examples.
8. That is, c_I(x) = 1 iff x ∈ I.
9. That is, it is defined over a subset of {0, 1, …, T} and it is injective on its domain.
10. We shall assume w.l.o.g. that there is some well-known order on X.
11. The choice of r(x) also depends on C and ε, but to simplify the notation we do not mention this explicitly.
12. Recall that f is a partial function.
13. The function s can be thought of as the inverse of r. Since r is not necessarily invertible, we use a notation different from r^{-1}.
14. For ((Z, z), (f, T)) not in the image of κ, we set ρ((Z, z), (f, T)) to be some arbitrary concept.
15. A similar statement holds in general.

References

1. N. Alon, S. Moran, A. Yehudayoff, Sign rank, VC dimension and spectral gaps. Electronic Colloquium on Computational Complexity (ECCC), vol. 21, no. 135 (2014)
2. D. Angluin, M. Krikis, Learning from different teachers. Mach. Learn. 51(2), 137–163 (2003)
3. M. Anthony, G. Brightwell, D.A. Cohen, J. Shawe-Taylor, On exact specification by examples, in COLT, 1992, pp. 311–318
4. P. Assouad, Densité et dimension. Ann. Inst. Fourier 3, 232–282 (1983)
5. F. Balbach, Models for algorithmic teaching. PhD thesis, University of Lübeck, 2007
6. S. Ben-David, A. Litman, Combinatorial variability of Vapnik–Chervonenkis classes with applications to sample compression schemes. Discret. Appl. Math. 86(1), 3–25 (1998)
7. A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Occam’s razor. Inf. Process. Lett. 24(6), 377–380 (1987)
8. A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Learnability and the Vapnik–Chervonenkis dimension. J. Assoc. Comput. Mach. 36(4), 929–965 (1989)
9. X. Chen, Y. Cheng, B. Tang, A note on teaching for VC classes. Electronic Colloquium on Computational Complexity (ECCC), vol. 23, no. 65 (2016)
10. A. Chernikov, P. Simon, Externally definable sets and dependent pairs. Isr. J. Math. 194(1), 409–425 (2013)
11. N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge University Press, Cambridge, 2000)
12. T. Doliwa, H.-U. Simon, S. Zilles, Recursive teaching dimension, learning complexity, and maximum classes, in ALT, 2010, pp. 209–223
13. P. Domingos, The role of Occam’s razor in knowledge discovery. Data Min. Knowl. Discov. 3(4), 409–425 (1999)
14. R.M. Dudley, Central limit theorems for empirical measures. Ann. Probab. 6, 899–929 (1978)
15. Z. Dvir, A. Rao, A. Wigderson, A. Yehudayoff, Restriction access, in Innovations in Theoretical Computer Science, Cambridge, 8–10 Jan 2012, pp. 19–33
16. S. Floyd, Space-bounded learning and the Vapnik–Chervonenkis dimension, in COLT, 1989, pp. 349–364
17. S. Floyd, M.K. Warmuth, Sample compression, learnability, and the Vapnik–Chervonenkis dimension. Mach. Learn. 21(3), 269–304 (1995)
18. Y. Freund, Boosting a weak learning algorithm by majority. Inf. Comput. 121(2), 256–285 (1995)
19. S.A. Goldman, M. Kearns, On the complexity of teaching. J. Comput. Syst. Sci. 50(1), 20–31 (1995)
20. S.A. Goldman, H.D. Mathias, Teaching a smarter learner. J. Comput. Syst. Sci. 52(2), 255–267 (1996)
21. S.A. Goldman, R.L. Rivest, R.E. Schapire, Learning binary relations and total orders. SIAM J. Comput. 22(5), 1006–1034 (1993)
22. S. Hanneke, Teaching dimension and the complexity of active learning, in COLT, 2007, pp. 66–81
23. D. Haussler, Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik–Chervonenkis dimension. J. Comb. Theory Ser. A 69(2), 217–232 (1995)
24. D. Haussler, E. Welzl, ε-nets and simplex range queries. Discret. Comput. Geom. 2, 127–151 (1987)
25. D.P. Helmbold, R.H. Sloan, M.K. Warmuth, Learning integer lattices. SIAM J. Comput. 21(2), 240–266 (1992)
26. D.P. Helmbold, M.K. Warmuth, On weak learning. J. Comput. Syst. Sci. 50(3), 551–573 (1995)
27. J.C. Jackson, A. Tomkins, A computational model of teaching, in COLT, 1992, pp. 319–326
28. M. Kearns, U.V. Vazirani, An Introduction to Computational Learning Theory (MIT Press, Cambridge, 1994)
29. H. Kobayashi, A. Shinohara, Complexity of teaching by a restricted number of examples, in COLT, 2009
30. C. Kuhlmann, On teaching and learning intersection-closed concept classes, in EuroCOLT, 1999, pp. 168–182
31. D. Kuzmin, M.K. Warmuth, Unlabeled compression schemes for maximum classes. J. Mach. Learn. Res. 8, 2047–2081 (2007)
32. R. Livni, P. Simon, Honest compressions and their application to compression schemes, in COLT, 2013, pp. 77–92
33. M. Marchand, J. Shawe-Taylor, The set covering machine. J. Mach. Learn. Res. 3, 723–746 (2002)
34. S. Moran, A. Yehudayoff, Sample compression for VC classes. Electronic Colloquium on Computational Complexity (ECCC), vol. 22, no. 40 (2015)
35. J. von Neumann, Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100, 295–320 (1928)
36. J.R. Quinlan, R.L. Rivest, Inferring decision trees using the minimum description length principle. Inf. Comput. 80(3), 227–248 (1989)
37. B.I.P. Rubinstein, P.L. Bartlett, J.H. Rubinstein, Shifting: one-inclusion mistake bounds and sample compression. J. Comput. Syst. Sci. 75(1), 37–59 (2009)
38. B.I.P. Rubinstein, J.H. Rubinstein, A geometric approach to sample compression. J. Mach. Learn. Res. 13, 1221–1261 (2012)
39. R. Samei, P. Semukhin, B. Yang, S. Zilles, Algebraic methods proving Sauer’s bound for teaching complexity. Theor. Comput. Sci. 558, 35–50 (2014)
40. R. Samei, P. Semukhin, B. Yang, S. Zilles, Sample compression for multi-label concept classes, in COLT, vol. 35, 2014, pp. 371–393
41. N. Sauer, On the density of families of sets. J. Comb. Theory Ser. A 13, 145–147 (1972)
42. A. Shinohara, S. Miyano, Teachability in computational learning, in ALT, 1990, pp. 247–255
43. L.G. Valiant, A theory of the learnable. Commun. ACM 27, 1134–1142 (1984)
44. V.N. Vapnik, A.Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971)
45. M.K. Warmuth, Compressing to VC dimension many points, in COLT/Kernel, 2003, pp. 743–744
46. A. Wigderson, A. Yehudayoff, Population recovery and partial identification, in FOCS, 2012, pp. 390–399
47. S. Zilles, S. Lange, R. Holte, M. Zinkevich, Models of cooperative teaching and learning. J. Mach. Learn. Res. 12, 349–384 (2011)

Acknowledgements

We thank Noga Alon and Gillat Kol for helpful discussions in various stages of this work.


Appendix: Double Sampling

Here we provide our version of the double sampling argument from [8] that upper bounds the sample complexity of PAC learning for classes of constant VC-dimension. We use the following simple general lemma.

Lemma A.1

Let \((\Omega,\mathcal{F},\mu )\) and \((\Omega ',\mathcal{F}',\mu ')\) be countable (see footnote 15) probability spaces. Let

$$\displaystyle{F_{1},F_{2},F_{3},\ldots \in \mathcal{F},\qquad F_{1}^{\prime},F_{2}^{\prime},F_{3}^{\prime},\ldots \in \mathcal{F}^{\prime}}$$

be such that μ′(F_i′) ≥ 1∕2 for all i. Then

$$\displaystyle{\mu \times \mu '\left (\bigcup _{i}F_{i} \times F_{i}^{{\prime}}\right ) \geq \frac{1} {2}\mu \left (\bigcup _{i}F_{i}\right ),}$$

where μ ×μ′ is the product measure.

Proof

Let \(F =\bigcup _{i}F_{i}\). For every ω ∈ F, let \(F'(\omega ) =\bigcup _{i:\,\omega \in F_{i}}F_{i}^{\prime}\). As there exists i such that ω ∈ F_i, it holds that F_i′ ⊆ F′(ω), and hence μ′(F′(ω)) ≥ μ′(F_i′) ≥ 1∕2. Thus,

$$\displaystyle{ \mu \times \mu '\left (\bigcup _{i}F_{i} \times F_{i}^{{\prime}}\right ) =\sum _{\omega \in F}\mu (\{\omega \}) \cdot \mu '(F'(\omega )) \geq \sum _{\omega \in F}\mu (\{\omega \})/2 =\mu (F)/2. }$$
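As a sanity check (not part of the original text), the following Python sketch verifies the inequality of Lemma A.1 numerically on small random finite probability spaces.

```python
import itertools, random

def check_lemma(trials=1000):
    random.seed(0)
    for _ in range(trials):
        n, n2, k = 4, 4, 3                      # |Omega|, |Omega'|, number of index pairs
        mu  = [random.random() for _ in range(n)];  s = sum(mu);  mu  = [p / s for p in mu]
        mu2 = [random.random() for _ in range(n2)]; s = sum(mu2); mu2 = [p / s for p in mu2]
        F = [set(random.sample(range(n), random.randint(1, n))) for _ in range(k)]
        # Draw each primed set until it has mu'-measure at least 1/2, as the lemma requires.
        F2 = []
        for _ in range(k):
            while True:
                G = set(random.sample(range(n2), random.randint(1, n2)))
                if sum(mu2[w] for w in G) >= 0.5:
                    F2.append(G); break
        lhs = sum(mu[w] * mu2[w2]
                  for w, w2 in itertools.product(range(n), range(n2))
                  if any(w in F[i] and w2 in F2[i] for i in range(k)))
        rhs = 0.5 * sum(mu[w] for w in range(n) if any(w in F[i] for i in range(k)))
        assert lhs >= rhs - 1e-12
    print("Lemma A.1 verified on", trials, "random instances")

check_lemma()
```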

We now give a proof of Theorem 1.2. To ease the reading we repeat the statement of the theorem.

Theorem

Let X be a set and C ⊆ {0, 1}^X be a concept class of VC-dimension d. Let μ be a distribution over X. Let ε, δ > 0 and let m be an integer satisfying 2(2m + 1)^d(1 − ε∕4)^m < δ. Let c ∈ C and let Y = (x_1, …, x_m) be a multiset of m independent samples from μ. Then, the probability that there is c′ ∈ C such that c|_Y = c′|_Y but μ({x : c(x) ≠ c′(x)}) > ε is at most δ.
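Before turning to the proof, a quick numerical illustration (not from the chapter): the following sketch computes the smallest m satisfying the theorem's condition 2(2m + 1)^d(1 − ε∕4)^m < δ for a few parameter choices, giving a sense of the resulting sample complexity.

```python
import math

def min_sample_size(d, eps, delta):
    """Smallest m with 2 * (2m+1)^d * (1 - eps/4)^m < delta."""
    m = 1
    while 2 * (2 * m + 1) ** d * (1 - eps / 4) ** m >= delta:
        m += 1
    return m

for d, eps, delta in [(1, 0.1, 0.05), (2, 0.1, 0.05), (5, 0.05, 0.01)]:
    print(f"d={d}, eps={eps}, delta={delta}: m = {min_sample_size(d, eps, delta)}")
```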

Proof of Theorem 1.2

Let Y′ = (x_1′, …, x_m′) be another m independent samples from μ, chosen independently of Y. Let

$$\displaystyle{H =\{ h \in C: \mathsf{dist}_{\mu }(h,c)>\epsilon \}.}$$

For hC, define the event

$$\displaystyle{F_{h} =\{ Y: c\vert _{Y } = h\vert _{Y }\},}$$

and let \(F =\bigcup _{h\in H}F_{h}\). Our goal is thus to upper bound \(\Pr (F)\). For that, we also define the independent event

$$\displaystyle{F_{h}^{{\prime}} =\{ Y ': \mathsf{dist}_{ Y '}(h,c)>\epsilon /2\}.}$$

We first claim that \(\Pr (F_{h}^{\prime}) \geq 1/2\) for all h ∈ H. This follows from Chernoff’s bound, but even Chebyshev’s inequality suffices: for every i ∈ [m], let V_i be the indicator variable of the event h(x_i′) ≠ c(x_i′) (i.e., V_i = 1 if and only if h(x_i′) ≠ c(x_i′)). The event F_h′ is equivalent to V := (1∕m)∑_i V_i > ε∕2. Since h ∈ H, we have \(p:= \mathbb{E}[V ]>\epsilon\). Since the elements of Y′ are chosen independently, it follows that Var(V) = p(1 − p)∕m. Thus, the probability of the complement of F_h′ satisfies

$$\displaystyle{ \Pr ((F_{h}^{{\prime}})^{c}) \leq \Pr (\vert V - p\vert \geq p -\epsilon /2) \leq \frac{p(1 - p)} {(p -\epsilon /2)^{2}m} <\frac{4} {\epsilon m} \leq 1/2. }$$

We now give an upper bound on \(\Pr (F)\). Since \(\Pr (F_{h}^{\prime}) \geq 1/2\) for every h ∈ H, Lemma A.1 (applied to the distributions of Y and Y′) gives

$$\displaystyle{\Pr (F) =\Pr \left (\bigcup _{h\in H}F_{h}\right ) \leq 2\Pr \left (\bigcup _{h\in H}F_{h}\times F_{h}^{\prime}\right ).}$$

Let S = Y ∪ Y′, where the union is as multisets. Conditioned on the value of S, the multiset Y is a uniform subset of half of the elements of S. Thus, taking a union bound over the finitely many restrictions h′ ∈ H|_S,

$$\displaystyle{\Pr \left (\bigcup _{h\in H}F_{h}\times F_{h}^{\prime}\right ) \leq \mathop{\mathbb{E}}_{S}\left [\sum _{h'\in H\vert _{S}}\Pr \left (h'\vert _{Y } = c\vert _{Y }\ \text{and}\ \mathsf{dist}_{Y '}(h',c)>\epsilon /2\ \middle\vert \ S\right )\right ].}$$

Notice that if dist_{Y′}(h′, c) > ε∕2 then dist_S(h′, c) > ε∕4; hence, conditioned on S, the probability that we choose Y such that h′|_Y = c|_Y is at most (1 − ε∕4)^m. Using Theorem 1.1 we get

$$\displaystyle{ \Pr (F) \leq 2\mathop{ \mathbb{E}} _{S}\left [\sum _{h'\in H\vert _{S}}(1 -\epsilon /4)^{m}\right ] \leq 2(2m + 1)^{d}(1 -\epsilon /4)^{m}. }$$
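To close, a small Monte Carlo illustration (a toy check under assumed parameters, not part of the chapter): for threshold concepts over a uniform finite domain (VC-dimension 1), we estimate the probability of the bad event in the theorem and compare it with the bound 2(2m + 1)^d(1 − ε∕4)^m.

```python
import random

def bad_event_probability(n, m, eps, t, trials):
    """Estimate the probability that some threshold c_s (with c_s(x) = [x >= s])
    agrees with c_t on an i.i.d. sample Y of size m from the uniform distribution
    on {0, ..., n-1}, yet differs from c_t on more than an eps-fraction of the domain."""
    random.seed(0)
    bad = 0
    for _ in range(trials):
        Y = [random.randrange(n) for _ in range(m)]
        neg = [x for x in Y if x < t]          # sample points labeled 0 by c_t
        pos = [x for x in Y if x >= t]         # sample points labeled 1 by c_t
        lo = (max(neg) + 1) if neg else 0      # thresholds consistent with Y: lo <= s <= hi
        hi = min(pos) if pos else n
        # dist(c_s, c_t) = |s - t| / n, so the bad event is a consistent eps-far s.
        if lo < t - eps * n or hi > t + eps * n:
            bad += 1
    return bad / trials

d, n, m, eps, t = 1, 1000, 400, 0.1, 500
print("empirical bad-event probability:", bad_event_probability(n, m, eps, t, trials=2000))
print("bound from the theorem         :", 2 * (2 * m + 1) ** d * (1 - eps / 4) ** m)
```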

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Moran, S., Shpilka, A., Wigderson, A., Yehudayoff, A. (2017). Teaching and Compressing for Low VC-Dimension. In: Loebl, M., Nešetřil, J., Thomas, R. (eds) A Journey Through Discrete Mathematics. Springer, Cham. https://doi.org/10.1007/978-3-319-44479-6_26
