Abstract
In this work we study the quantitative relation between VC-dimension and two other basic parameters related to learning and teaching: the quality of sample compression schemes and of teaching sets for classes of low VC-dimension. Let C be a binary concept class of size m and VC-dimension d. Prior to this work, the best known upper bound for both parameters was log(m), while the best known lower bounds are linear in d. We present significantly better upper bounds on both, as follows. Set \(k = O(d \cdot 2^d \log\log |C|)\).
We show that there always exists a concept c in C with a teaching set (i.e., a list of c-labeled examples uniquely identifying c in C) of size k. This problem was studied by Kuhlmann (On teaching and learning intersection-closed concept classes. In: EuroCOLT, pp 168–182, 1999). Our construction implies that the recursive teaching (RT) dimension of C is at most k as well. The RT-dimension was suggested by Zilles et al. (J Mach Learn Res 12:349–384, 2011) and Doliwa et al. (Recursive teaching dimension, learning complexity, and maximum classes. In: ALT, pp 209–223, 2010). The same notion (under the name partial-ID width) was independently studied by Wigderson and Yehudayoff (Population recovery and partial identification. In: FOCS, pp 390–399, 2012). An upper bound on this parameter that depends only on d is known just for the very simple case d = 1, and the question is open even for d = 2. We also make small progress towards this seemingly modest goal.
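For concreteness, here is a small Python sketch (ours, not part of the chapter) that finds a minimum-size teaching set by brute force; the threshold class at the bottom is a hypothetical example of a class of VC-dimension 1.

```python
from itertools import combinations

def teaching_set(c, concept_class, domain):
    """Return a smallest list of c-labeled examples identifying c in the class.

    Brute force over example sets of increasing size: a set S teaches c if
    every other concept in the class disagrees with c somewhere on S.
    """
    others = [h for h in concept_class if h != c]
    for size in range(len(domain) + 1):
        for S in combinations(domain, size):
            if all(any(h[x] != c[x] for x in S) for h in others):
                return [(x, c[x]) for x in S]

# Hypothetical example: thresholds on {0,...,4}, where concept t labels
# x with 1 iff x >= t.
domain = list(range(5))
thresholds = [tuple(int(x >= t) for x in domain) for t in range(6)]
print(teaching_set(thresholds[2], thresholds, domain))  # [(1, 0), (2, 1)]
```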
We further construct sample compression schemes of size k for C, with additional information of k log(k) bits. Roughly speaking, given any list of C-labeled examples of arbitrary length, we can retain only k labeled examples in a way that allows us to recover the labels of all other examples in the list, using k log(k) additional information bits. This problem was first suggested by Littlestone and Warmuth (Relating data compression and learnability. Unpublished, 1986).
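As a toy illustration of the definition (the classical size-2 scheme for intervals on the line, not the scheme constructed in this work): keeping the leftmost and rightmost positive examples suffices to recover all other labels in the sample, with no additional information bits.

```python
def compress(sample):
    """Keep at most two labeled examples from a sample realized by an interval."""
    positives = sorted(x for x, y in sample if y == 1)
    if not positives:
        return []  # no positives: the all-zeros hypothesis is consistent
    return [(positives[0], 1), (positives[-1], 1)]

def reconstruct(kept):
    """Return a hypothesis consistent with the whole original sample."""
    if not kept:
        return lambda x: 0
    lo, hi = kept[0][0], kept[-1][0]
    return lambda x: int(lo <= x <= hi)

# Usage: a sample labeled by the hidden interval [2, 5].
sample = [(0, 0), (1, 0), (3, 1), (4, 1), (6, 0)]
h = reconstruct(compress(sample))
assert all(h(x) == y for x, y in sample)
```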
Notes
- 1.
A preliminary version of this work, combined with the paper “Sample compression schemes for VC classes” by the first and the last authors, was published in the proceeding of FOCS’15.
- 2.
The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under grant agreement number 257575, and from the Israel Science Foundation (grant number 339/10).
- 3.
Horev fellow – supported by the Taub foundation. Research is also supported by ISF and BSF.
- 4.
In this text, O(f) means at most αf + β for some constants α, β > 0.
- 5.
In metric spaces such a set is called an ε-net; however, in learning theory and combinatorial geometry the term ε-net has a different meaning, so we use ε-approximating instead.
- 6.
That is, C satisfies the Sauer–Shelah–Perles lemma with equality.
- 7.
An algorithm that outputs a hypothesis in C that is consistent with the input examples.
- 8.
That is, \(c_I(x) = 1\) iff x ∈ I.
- 9.
That is, it is defined over a subset of {0, 1, …, T} and it is injective on its domain.
- 10.
We shall assume w.l.o.g. that there is some well-known order on X.
- 11.
The choice of r(x) also depends on C and ε, but to simplify the notation we do not mention this explicitly.
- 12.
Remember that f is a partial function.
- 13.
The function s can be thought of as the inverse of r. Since r is not necessarily invertible, we use a different notation than \(r^{-1}\).
- 14.
For ((Z, z), (f, T)) not in the image of κ we set ρ((Z, z), (f, T)) to be some arbitrary concept.
- 15.
A similar statement holds in general.
References
N. Alon, S. Moran, A. Yehudayoff, Sign rank, VC dimension and spectral gaps. Electronic Colloquium on Computational Complexity (ECCC) vol. 21, no. 135 (2014)
D. Angluin, M. Krikis, Learning from different teachers. Mach. Learn. 51(2), 137–163 (2003)
M. Anthony, G. Brightwell, D.A. Cohen, J. Shawe-Taylor, On exact specification by examples, in COLT, 1992, pp. 311–318
P. Assouad, Densité et dimension. Ann. Inst. Fourier 3, 232–282 (1983)
F. Balbach, Models for algorithmic teaching. PhD thesis, University of Lübeck, 2007
S. Ben-David, A. Litman, Combinatorial variability of Vapnik–Chervonenkis classes with applications to sample compression schemes. Discret. Appl. Math. 86(1), 3–25 (1998)
A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Occam’s razor. Inf. Process. Lett. 24(6), 377–380 (1987)
A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Learnability and the Vapnik–Chervonenkis dimension. J. Assoc. Comput. Mach. 36(4), 929–965 (1989)
X. Chen, Y. Cheng, B. Tang, A note on teaching for VC classes. Electronic Colloquium on Computational Complexity (ECCC), vol. 23, no. 65 (2016)
A. Chernikov, P. Simon, Externally definable sets and dependent pairs. Isr. J. Math. 194(1), 409–425 (2013)
N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge University Press, Cambridge, 2000)
T. Doliwa, H.-U. Simon, S. Zilles, Recursive teaching dimension, learning complexity, and maximum classes, in ALT, 2010, pp. 209–223
P. Domingos, The role of Occam’s razor in knowledge discovery. Data Min. Knowl. Discov. 3(4), 409–425 (1999)
R.M. Dudley, Central limit theorems for empirical measures. Ann. Probab. 6, 899–929 (1978)
Z. Dvir, A. Rao, A. Wigderson, A. Yehudayoff, Restriction access, in Innovations in Theoretical Computer Science, Cambridge, 8–10, Jan 2012, pp. 19–33
S. Floyd, Space-bounded learning and the Vapnik–Chervonenkis dimension, in COLT, 1989, pp. 349–364
S. Floyd, M.K. Warmuth, Sample compression, learnability, and the Vapnik–Chervonenkis dimension. Mach. Learn. 21(3), 269–304 (1995)
Y. Freund, Boosting a weak learning algorithm by majority. Inf. Comput. 121(2), 256–285 (1995)
S.A. Goldman, M. Kearns, On the complexity of teaching. J. Comput. Syst. Sci. 50(1), 20–31 (1995)
S.A. Goldman, H.D. Mathias, Teaching a smarter learner. J. Comput. Syst. Sci. 52(2), 255–267 (1996)
S.A. Goldman, R.L. Rivest, R.E. Schapire, Learning binary relations and total orders. SIAM J. Comput. 22(5), 1006–1034 (1993)
S. Hanneke, Teaching dimension and the complexity of active learning, in COLT, 2007, pp. 66–81
D. Haussler, Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik–Chervonenkis dimension. J. Comb. Theory Ser. A 69(2), 217–232 (1995)
D. Haussler, E. Welzl, ε-nets and simplex range queries. Discret. Comput. Geom. 2, 127–151 (1987)
D.P. Helmbold, R.H. Sloan, M.K. Warmuth, Learning integer lattices. SIAM J. Comput. 21(2), 240–266 (1992)
D.P. Helmbold, M.K. Warmuth, On weak learning. J. Comput. Syst. Sci. 50(3), 551–573 (1995)
J.C. Jackson, A. Tomkins, A computational model of teaching, in COLT, 1992, pp. 319–326
M. Kearns, U.V. Vazirani, An Introduction to Computational Learning Theory (MIT Press, Cambridge, 1994)
H. Kobayashi, A. Shinohara, Complexity of teaching by a restricted number of examples, in COLT, 2009
C. Kuhlmann, On teaching and learning intersection-closed concept classes, in EuroCOLT, 1999, pp. 168–182
D. Kuzmin, M.K. Warmuth, Unlabeled compression schemes for maximum classes. J. Mach. Learn. Res. 8, 2047–2081 (2007)
R. Livni, P. Simon, Honest compressions and their application to compression schemes, in COLT, 2013, pp. 77–92
M. Marchand, J. Shawe-Taylor, The set covering machine. J. Mach. Learn. Res. 3, 723–746 (2002)
S. Moran, A. Yehudayoff, Sample compression for VC classes. Electronic Colloquium on Computational Complexity (ECCC), vol. 22, no. 40 (2015)
J. von Neumann, Zur Theorie der Gesellschaftsspiele. Math. Ann. 100, 295–320 (1928)
J.R. Quinlan, R.L. Rivest, Inferring decision trees using the minimum description length principle. Inf. Comput. 80(3), 227–248 (1989)
B.I.P. Rubinstein, P.L. Bartlett, J.H. Rubinstein, Shifting: one-inclusion mistake bounds and sample compression. J. Comput. Syst. Sci. 75(1), 37–59 (2009)
B.I.P. Rubinstein, J.H. Rubinstein, A geometric approach to sample compression. J. Mach. Learn. Res. 13, 1221–1261 (2012)
R. Samei, P. Semukhin, B. Yang, S. Zilles, Algebraic methods proving Sauer’s bound for teaching complexity. Theor. Comput. Sci. 558, 35–50 (2014)
R. Samei, P. Semukhin, B. Yang, S. Zilles, Sample compression for multi-label concept classes, in COLT, vol. 35, 2014, pp. 371–393
N. Sauer, On the density of families of sets. J. Comb. Theory Ser. A 13, 145–147 (1972)
A. Shinohara, S. Miyano, Teachability in computational learning, in ALT, 1990, pp. 247–255
L.G. Valiant, A theory of the learnable. Commun. ACM 27, 1134–1142 (1984)
V.N. Vapnik, A.Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971)
M.K. Warmuth, Compressing to VC dimension many points, in COLT/Kernel, 2003, pp. 743–744
A. Wigderson, A. Yehudayoff, Population recovery and partial identification, in FOCS, 2012, pp. 390–399
S. Zilles, S. Lange, R. Holte, M. Zinkevich, Models of cooperative teaching and learning. J. Mach. Learn. Res. 12, 349–384 (2011)
Acknowledgements
We thank Noga Alon and Gillat Kol for helpful discussions in various stages of this work.
Appendix: Double Sampling
Here we provide our version of the double sampling argument from [8], which upper bounds the sample complexity of PAC learning for classes of constant VC-dimension. We use the following simple general lemma.
Lemma A.1
Let \((\Omega,\mathcal{F},\mu)\) and \((\Omega',\mathcal{F}',\mu')\) be countable (footnote 15) probability spaces. Let
\[F_1, F_2, \ldots \in \mathcal{F} \quad \text{and} \quad F'_1, F'_2, \ldots \in \mathcal{F}'\]
be so that \(\mu'(F'_i) \geq 1/2\) for all i. Then
\[(\mu \times \mu')\Big(\bigcup_i F_i \times F'_i\Big) \geq \frac{1}{2}\,\mu\Big(\bigcup_i F_i\Big),\]
where \(\mu \times \mu'\) is the product measure.
Proof
Let \(F = \bigcup_i F_i\). For every ω ∈ F, let \(F'(\omega) = \bigcup_{i:\,\omega \in F_i} F'_i\). As there exists i such that ω ∈ F_i, it holds that \(F'_i \subseteq F'(\omega)\) and hence \(\mu'(F'(\omega)) \geq 1/2\). Thus,
\[(\mu \times \mu')\Big(\bigcup_i F_i \times F'_i\Big) \geq (\mu \times \mu')\Big(\bigcup_{\omega \in F} \{\omega\} \times F'(\omega)\Big) = \sum_{\omega \in F} \mu(\{\omega\})\,\mu'(F'(\omega)) \geq \frac{1}{2} \sum_{\omega \in F} \mu(\{\omega\}) = \frac{1}{2}\,\mu(F).\]
□
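As a quick numerical sanity check of Lemma A.1 (our addition, on an arbitrary random instance), one can compute both sides exactly on small finite spaces:

```python
import random

random.seed(0)
n = 6  # sizes of the finite spaces Omega and Omega'

def rand_dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

mu, mu2 = rand_dist(n), rand_dist(n)

# Build pairs (F_i, F'_i) with mu'(F'_i) >= 1/2, as the lemma requires.
pairs = []
for _ in range(4):
    F = {i for i in range(n) if random.random() < 0.5}
    while True:
        F2 = {j for j in range(n) if random.random() < 0.5}
        if sum(mu2[j] for j in F2) >= 0.5:
            break
    pairs.append((F, F2))

# Left side: product measure of the union of the rectangles F_i x F'_i.
lhs = sum(mu[i] * mu2[j]
          for i in range(n) for j in range(n)
          if any(i in F and j in F2 for F, F2 in pairs))
# Right side: half the measure of the union of the F_i.
rhs = 0.5 * sum(mu[i] for i in range(n) if any(i in F for F, _ in pairs))
assert lhs >= rhs  # Lemma A.1
print(f"lhs = {lhs:.4f} >= rhs = {rhs:.4f}")
```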
We now give a proof of Theorem 1.2. To ease the reading we repeat the statement of the theorem.
Theorem
Let X be a set and \(C \subseteq \{0,1\}^X\) be a concept class of VC-dimension d. Let μ be a distribution over X. Let ε, δ > 0 and let m be an integer satisfying \(2(2m+1)^d (1-\epsilon/4)^m < \delta\). Let c ∈ C and let \(Y = (x_1, \ldots, x_m)\) be a multiset of m independent samples from μ. Then, the probability that there is c′ ∈ C so that \(c|_Y = c'|_Y\) but \(\mu(\{x : c(x) \neq c'(x)\}) > \epsilon\) is at most δ.
Proof of Theorem 1.2
Let \(Y' = (x'_1, \ldots, x'_m)\) be another m independent samples from μ, chosen independently of Y. Let
\[H = \{\,h \in C : \mu(\{x : h(x) \neq c(x)\}) > \epsilon\,\}.\]
For h ∈ C, define the event
\[F_h = \{\,h|_Y = c|_Y\,\},\]
that is, the event (over the choice of Y) that h agrees with c on all of Y, and let \(F = \bigcup_{h \in H} F_h\). Our goal is thus to upper bound \(\Pr(F)\). For that, we also define the independent event
\[F'_h = \{\,\mathrm{dist}_{Y'}(h, c) > \epsilon/2\,\}, \quad \text{where } \mathrm{dist}_{Y'}(h, c) = \frac{|\{i \in [m] : h(x'_i) \neq c(x'_i)\}|}{m}.\]
We first claim that \(\Pr(F'_h) \geq 1/2\) for all h ∈ H. This follows from Chernoff's bound, but even Chebyshev's inequality suffices: for every i ∈ [m], let \(V_i\) be the indicator variable of the event \(h(x'_i) \neq c(x'_i)\) (i.e., \(V_i = 1\) if and only if \(h(x'_i) \neq c(x'_i)\)). The event \(F'_h\) is equivalent to \(V = \sum_i V_i/m > \epsilon/2\). Since h ∈ H, we have \(p := \mathbb{E}[V] > \epsilon\). Since the elements of Y′ are chosen independently, it follows that \(\mathrm{Var}(V) = p(1-p)/m\). Thus, the probability of the complement of \(F'_h\) satisfies
\[\Pr(\neg F'_h) = \Pr(V \leq \epsilon/2) \leq \Pr(|V - p| \geq p/2) \leq \frac{\mathrm{Var}(V)}{(p/2)^2} = \frac{4(1-p)}{pm} \leq \frac{4}{\epsilon m} \leq \frac{1}{2},\]
where the last inequality holds since we may assume \(m \geq 8/\epsilon\).
We now give an upper bound on \(\Pr(F)\). We note that the events \(F_h\) depend only on the choice of Y, the events \(F'_h\) depend only on the choice of Y′, and Y, Y′ are independent; hence Lemma A.1 gives
\[\Pr(F) \leq 2\,\Pr\Big(\bigcup_{h \in H} (F_h \cap F'_h)\Big).\]
Let S = Y ∪ Y′, where the union is as multisets. Conditioned on the value of S, the multiset Y is a uniform subset of half of the elements of S. Thus,
\[\Pr(F) \leq 2\,\mathbb{E}_S\Big[\Pr\big(\exists\, h' \in C|_S :\ h'|_Y = c|_Y \text{ and } \mathrm{dist}_{Y'}(h', c) > \epsilon/2 \;\big|\; S\big)\Big].\]
Notice that if \(\mathrm{dist}_{Y'}(h', c) > \epsilon/2\) then \(\mathrm{dist}_S(h', c) > \epsilon/4\); hence, for any fixed such h′, the probability that we choose Y so that \(h'|_Y = c|_Y\) is at most \((1-\epsilon/4)^m\). Using Theorem 1.1 to bound \(|C|_S| \leq (2m+1)^d\), a union bound over \(C|_S\) gives
\[\Pr(F) \leq 2(2m+1)^d (1-\epsilon/4)^m < \delta.\]
□
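To see the quantitative bound in action (our own illustration, not from the chapter), the following sketch estimates, for the class of thresholds on [0, 1] (VC-dimension d = 1) under the uniform distribution, the probability that some hypothesis consistent with the sample is ε-far from the target, and compares it with the bound \(2(2m+1)^d(1-\epsilon/4)^m\):

```python
import random

random.seed(1)
m, eps, trials, target = 400, 0.1, 20_000, 0.5

bad = 0
for _ in range(trials):
    xs = [random.random() for _ in range(m)]
    # Thresholds h_t(x) = 1 iff x >= t; labels come from t = target.
    # h_t is consistent with the sample iff max_neg < t <= min_pos.
    max_neg = max((x for x in xs if x < target), default=0.0)
    min_pos = min((x for x in xs if x >= target), default=1.0)
    # Some consistent threshold is eps-far from the target iff the sample
    # misses [target - eps, target) or misses [target, target + eps].
    if max_neg < target - eps or min_pos > target + eps:
        bad += 1

bound = 2 * (2 * m + 1) ** 1 * (1 - eps / 4) ** m  # d = 1
print(f"empirical: {bad / trials:.5f}   theorem's bound: {bound:.5f}")
```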
Copyright information
© 2017 Springer International Publishing AG
Cite this chapter
Moran, S., Shpilka, A., Wigderson, A., Yehudayoff, A. (2017). Teaching and Compressing for Low VC-Dimension. In: Loebl, M., Nešetřil, J., Thomas, R. (eds) A Journey Through Discrete Mathematics. Springer, Cham. https://doi.org/10.1007/978-3-319-44479-6_26