Abstract
In this work we study the quantitative relation between VC-dimension and two other basic parameters related to learning and teaching: the quality of sample compression schemes and of teaching sets for classes of low VC-dimension. Let C be a binary concept class of size m and VC-dimension d. Prior to this work, the best known upper bound for both parameters was log(m), while the best known lower bounds are linear in d. We present significantly better upper bounds on both, as follows. Set \(k = O(d \cdot 2^d \log\log |C|)\).
We show that there always exists a concept c in C with a teaching set (i.e., a list of c-labeled examples uniquely identifying c in C) of size k. This problem was studied by Kuhlmann (On teaching and learning intersection-closed concept classes. In: EuroCOLT, pp 168–182, 1999). Our construction implies that the recursive teaching (RT) dimension of C is at most k as well. The RT-dimension was suggested by Zilles et al. (J Mach Learn Res 12:349–384, 2011) and Doliwa et al. (Recursive teaching dimension, learning complexity, and maximum classes. In: ALT, pp 209–223, 2010). The same notion (under the name partial-ID width) was independently studied by Wigderson and Yehudayoff (Population recovery and partial identification. In: FOCS, pp 390–399, 2012). An upper bound on this parameter that depends only on d is known just for the very simple case d = 1, and the question is open even for d = 2. We also make small progress towards this seemingly modest goal.
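For concreteness, here is a small Python sketch (ours, not part of the chapter) that finds a minimum-size teaching set by brute force; the threshold class at the bottom is a hypothetical example of a class of VC-dimension 1.

```python
from itertools import combinations

def teaching_set(c, concept_class, domain):
    """Return a smallest list of c-labeled examples identifying c in the class.

    Brute force over example sets of increasing size: a set S teaches c if
    every other concept in the class disagrees with c somewhere on S.
    """
    others = [h for h in concept_class if h != c]
    for size in range(len(domain) + 1):
        for S in combinations(domain, size):
            if all(any(h[x] != c[x] for x in S) for h in others):
                return [(x, c[x]) for x in S]

# Hypothetical example: thresholds on {0,...,4}, where concept t labels
# x with 1 iff x >= t.
domain = list(range(5))
thresholds = [tuple(int(x >= t) for x in domain) for t in range(6)]
print(teaching_set(thresholds[2], thresholds, domain))  # [(1, 0), (2, 1)]
```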
We further construct sample compression schemes of size k for C, with additional information of k log(k) bits. Roughly speaking, given any list of C-labeled examples of arbitrary length, we can retain only k labeled examples in a way that allows us to recover the labels of all other examples in the list, using k log(k) additional information bits. This problem was first suggested by Littlestone and Warmuth (Relating data compression and learnability. Unpublished, 1986).
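As a toy illustration of the definition (the classical size-2 scheme for intervals on the line, not the scheme constructed in this work): keeping the leftmost and rightmost positive examples suffices to recover all other labels in the sample, with no additional information bits.

```python
def compress(sample):
    """Keep at most two labeled examples from a sample realized by an interval."""
    positives = sorted(x for x, y in sample if y == 1)
    if not positives:
        return []  # no positives: the all-zeros hypothesis is consistent
    return [(positives[0], 1), (positives[-1], 1)]

def reconstruct(kept):
    """Return a hypothesis consistent with the whole original sample."""
    if not kept:
        return lambda x: 0
    lo, hi = kept[0][0], kept[-1][0]
    return lambda x: int(lo <= x <= hi)

# Usage: a sample labeled by the hidden interval [2, 5].
sample = [(0, 0), (1, 0), (3, 1), (4, 1), (6, 0)]
h = reconstruct(compress(sample))
assert all(h(x) == y for x, y in sample)
```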
Notes
- 1.
A preliminary version of this work, combined with the paper “Sample compression schemes for VC classes” by the first and the last authors, was published in the proceeding of FOCS’15.
- 2.
The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under grant agreement number 257575, and from the Israel Science Foundation (grant number 339/10).
- 3.
Horev fellow – supported by the Taub foundation. Research is also supported by ISF and BSF.
- 4.
In this text, O(f) means at most αf + β for some constants α, β > 0.
- 5.
In metric spaces such a set is called an ε-net; however, in learning theory and combinatorial geometry the term ε-net has a different meaning, so we use ε-approximating instead.
- 6.
That is, C satisfies the Sauer–Shelah–Perles lemma with equality.
- 7.
An algorithm that outputs a hypothesis in C that is consistent with the input examples.
- 8.
That is, \(c_I(x) = 1\) iff x ∈ I.
- 9.
That is, it is defined over a subset of {0, 1, …, T} and it is injective on its domain.
- 10.
We shall assume w.l.o.g. that there is some well-known order on X.
- 11.
The choice of r(x) also depends on C and ε, but to simplify the notation we do not mention this explicitly.
- 12.
Remember that f is a partial function.
- 13.
The function s can be thought of as the inverse of r. Since r is not necessarily invertible, we use a different notation than \(r^{-1}\).
- 14.
For ((Z, z), (f, T)) not in the image of κ we set ρ((Z, z), (f, T)) to be some arbitrary concept.
- 15.
A similar statement holds in general.
References
N. Alon, S. Moran, A. Yehudayoff, Sign rank, VC dimension and spectral gaps. Electronic Colloquium on Computational Complexity (ECCC) vol. 21, no. 135 (2014)
D. Angluin, M. Krikis, Learning from different teachers. Mach. Learn. 51(2), 137–163 (2003)
M. Anthony, G. Brightwell, D.A. Cohen, J. Shawe-Taylor, On exact specification by examples, in COLT, 1992, pp. 311–318
P. Assouad, Densité et dimension. Ann. Inst. Fourier 3, 232–282 (1983)
F. Balbach, Models for algorithmic teaching. PhD thesis, University of Lübeck, 2007
S. Ben-David, A. Litman, Combinatorial variability of Vapnik–Chervonenkis classes with applications to sample compression schemes. Discret. Appl. Math. 86(1), 3–25 (1998)
A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Occam’s razor. Inf. Process. Lett. 24(6), 377–380 (1987)
A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Learnability and the Vapnik–Chervonenkis dimension. J. Assoc. Comput. Mach. 36(4), 929–965 (1989)
X. Chen, Y. Cheng, B. Tang, A note on teaching for VC classes. Electronic Colloquium on Computational Complexity (ECCC), vol. 23, no. 65 (2016)
A. Chernikov, P. Simon, Externally definable sets and dependent pairs. Isr. J. Math. 194(1), 409–425 (2013)
N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge University Press, Cambridge, 2000)
T. Doliwa, H.-U. Simon, S. Zilles, Recursive teaching dimension, learning complexity, and maximum classes, in ALT, 2010, pp. 209–223
P. Domingos, The role of Occam’s razor in knowledge discovery. Data Min. Knowl. Discov. 3(4), 409–425 (1999)
R.M. Dudley, Central limit theorems for empirical measures. Ann. Probab. 6, 899–929 (1978)
Z. Dvir, A. Rao, A. Wigderson, A. Yehudayoff, Restriction access, in Innovations in Theoretical Computer Science, Cambridge, 8–10, Jan 2012, pp. 19–33
S. Floyd, Space-bounded learning and the Vapnik–Chervonenkis dimension, in COLT, 1989, pp. 349–364
S. Floyd, M.K. Warmuth, Sample compression, learnability, and the Vapnik–Chervonenkis dimension. Mach. Learn. 21(3), 269–304 (1995)
Y. Freund, Boosting a weak learning algorithm by majority. Inf. Comput. 121(2), 256–285 (1995)
S.A. Goldman, M. Kearns, On the complexity of teaching. J. Comput. Syst. Sci. 50(1), 20–31 (1995)
S.A. Goldman, H.D. Mathias, Teaching a smarter learner. J. Comput. Syst. Sci. 52(2), 255–267 (1996)
S.A. Goldman, R.L. Rivest, R.E. Schapire, Learning binary relations and total orders. SIAM J. Comput. 22(5), 1006–1034 (1993)
S. Hanneke, Teaching dimension and the complexity of active learning, in COLT, 2007, pp. 66–81
D. Haussler, Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik–Chervonenkis dimension. J. Comb. Theory Ser. A 69(2), 217–232 (1995)
D. Haussler, E. Welzl, ε-nets and simplex range queries. Discret. Comput. Geom. 2, 127–151 (1987)
D.P. Helmbold, R.H. Sloan, M.K. Warmuth, Learning integer lattices. SIAM J. Comput. 21(2), 240–266 (1992)
D.P. Helmbold, M.K. Warmuth, On weak learning. J. Comput. Syst. Sci. 50(3), 551–573 (1995)
J.C. Jackson, A. Tomkins, A computational model of teaching, in COLT, 1992, pp. 319–326
M. Kearns, U.V. Vazirani, An Introduction to Computational Learning Theory (MIT Press, Cambridge, 1994)
H. Kobayashi, A. Shinohara, Complexity of teaching by a restricted number of examples, in COLT, 2009
C. Kuhlmann, On teaching and learning intersection-closed concept classes, in EuroCOLT, 1999, pp. 168–182
D. Kuzmin, M.K. Warmuth, Unlabeled compression schemes for maximum classes. J. Mach. Learn. Res. 8, 2047–2081 (2007)
R. Livni, P. Simon, Honest compressions and their application to compression schemes, in COLT, 2013, pp. 77–92
M. Marchand, J. Shawe-Taylor, The set covering machine. J. Mach. Learn. Res. 3, 723–746 (2002)
S. Moran, A. Yehudayoff, Sample compression for VC classes. Electronic Colloquium on Computational Complexity (ECCC), vol. 22, no. 40 (2015)
J. von Neumann, Zur Theorie der Gesellschaftsspiele. Math. Ann. 100, 295–320 (1928)
J.R. Quinlan, R.L. Rivest, Inferring decision trees using the minimum description length principle. Inf. Comput. 80(3), 227–248 (1989)
B.I.P. Rubinstein, P.L. Bartlett, J.H. Rubinstein, Shifting: one-inclusion mistake bounds and sample compression. J. Comput. Syst. Sci. 75(1), 37–59 (2009)
B.I.P. Rubinstein, J.H. Rubinstein, A geometric approach to sample compression. J. Mach. Learn. Res. 13, 1221–1261 (2012)
R. Samei, P. Semukhin, B. Yang, S. Zilles, Algebraic methods proving Sauer’s bound for teaching complexity. Theor. Comput. Sci. 558, 35–50 (2014)
R. Samei, P. Semukhin, B. Yang, S. Zilles, Sample compression for multi-label concept classes, in COLT, vol. 35, 2014, pp. 371–393
N. Sauer, On the density of families of sets. J. Comb. Theory Ser. A 13, 145–147 (1972)
A. Shinohara, S. Miyano, Teachability in computational learning, in ALT, 1990, pp. 247–255
L.G. Valiant, A theory of the learnable. Commun. ACM 27, 1134–1142 (1984)
V.N. Vapnik, A.Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–280 (1971)
M.K. Warmuth, Compressing to VC dimension many points, in COLT/Kernel, 2003, pp. 743–744
A. Wigderson, A. Yehudayoff, Population recovery and partial identification, in FOCS, 2012, pp. 390–399
S. Zilles, S. Lange, R. Holte, M. Zinkevich, Models of cooperative teaching and learning. J. Mach. Learn. Res. 12, 349–384 (2011)
Acknowledgements
We thank Noga Alon and Gillat Kol for helpful discussions in various stages of this work.
Appendix: Double Sampling
Here we provide our version of the double sampling argument from [8], which upper bounds the sample complexity of PAC learning for classes of constant VC-dimension. We use the following simple general lemma.
Lemma A.1
Let \((\Omega,\mathcal{F},\mu)\) and \((\Omega',\mathcal{F}',\mu')\) be countable (footnote 15) probability spaces. Let
\[F_1, F_2, \ldots \in \mathcal{F} \quad \text{and} \quad F'_1, F'_2, \ldots \in \mathcal{F}'\]
be so that \(\mu'(F'_i) \geq 1/2\) for all i. Then
\[(\mu \times \mu')\Big(\bigcup_i F_i \times F'_i\Big) \geq \frac{1}{2}\,\mu\Big(\bigcup_i F_i\Big),\]
where \(\mu \times \mu'\) is the product measure.
Proof
Let \(F = \bigcup_i F_i\). For every ω ∈ F, let \(F'(\omega) = \bigcup_{i:\,\omega \in F_i} F'_i\). As there exists i such that ω ∈ F_i, it holds that \(F'_i \subseteq F'(\omega)\) and hence \(\mu'(F'(\omega)) \geq 1/2\). Thus,
\[(\mu \times \mu')\Big(\bigcup_i F_i \times F'_i\Big) \geq (\mu \times \mu')\Big(\bigcup_{\omega \in F} \{\omega\} \times F'(\omega)\Big) = \sum_{\omega \in F} \mu(\{\omega\})\,\mu'(F'(\omega)) \geq \frac{1}{2} \sum_{\omega \in F} \mu(\{\omega\}) = \frac{1}{2}\,\mu(F).\]
□
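As a quick numerical sanity check of Lemma A.1 (our addition, on an arbitrary random instance), one can compute both sides exactly on small finite spaces:

```python
import random

random.seed(0)
n = 6  # sizes of the finite spaces Omega and Omega'

def rand_dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

mu, mu2 = rand_dist(n), rand_dist(n)

# Build pairs (F_i, F'_i) with mu'(F'_i) >= 1/2, as the lemma requires.
pairs = []
for _ in range(4):
    F = {i for i in range(n) if random.random() < 0.5}
    while True:
        F2 = {j for j in range(n) if random.random() < 0.5}
        if sum(mu2[j] for j in F2) >= 0.5:
            break
    pairs.append((F, F2))

# Left side: product measure of the union of the rectangles F_i x F'_i.
lhs = sum(mu[i] * mu2[j]
          for i in range(n) for j in range(n)
          if any(i in F and j in F2 for F, F2 in pairs))
# Right side: half the measure of the union of the F_i.
rhs = 0.5 * sum(mu[i] for i in range(n) if any(i in F for F, _ in pairs))
assert lhs >= rhs  # Lemma A.1
print(f"lhs = {lhs:.4f} >= rhs = {rhs:.4f}")
```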
We now give a proof of Theorem 1.2. To ease the reading we repeat the statement of the theorem.
Theorem
Let X be a set and \(C \subseteq \{0,1\}^X\) be a concept class of VC-dimension d. Let μ be a distribution over X. Let ε, δ > 0 and let m be an integer satisfying \(2(2m+1)^d (1-\epsilon/4)^m < \delta\). Let c ∈ C and let \(Y = (x_1, \ldots, x_m)\) be a multiset of m independent samples from μ. Then, the probability that there is c′ ∈ C so that \(c|_Y = c'|_Y\) but \(\mu(\{x : c(x) \neq c'(x)\}) > \epsilon\) is at most δ.
Proof of Theorem 1.2
Let \(Y' = (x'_1, \ldots, x'_m)\) be another m independent samples from μ, chosen independently of Y. Let
\[H = \{\,h \in C : \mu(\{x : h(x) \neq c(x)\}) > \epsilon\,\}.\]
For h ∈ C, define the event
\[F_h = \{\,h|_Y = c|_Y\,\},\]
that is, the event (over the choice of Y) that h agrees with c on all of Y, and let \(F = \bigcup_{h \in H} F_h\). Our goal is thus to upper bound \(\Pr(F)\). For that, we also define the independent event
\[F'_h = \{\,\mathrm{dist}_{Y'}(h, c) > \epsilon/2\,\}, \quad \text{where } \mathrm{dist}_{Y'}(h, c) = \frac{|\{i \in [m] : h(x'_i) \neq c(x'_i)\}|}{m}.\]
We first claim that \(\Pr(F'_h) \geq 1/2\) for all h ∈ H. This follows from Chernoff's bound, but even Chebyshev's inequality suffices: for every i ∈ [m], let \(V_i\) be the indicator variable of the event \(h(x'_i) \neq c(x'_i)\) (i.e., \(V_i = 1\) if and only if \(h(x'_i) \neq c(x'_i)\)). The event \(F'_h\) is equivalent to \(V = \sum_i V_i/m > \epsilon/2\). Since h ∈ H, we have \(p := \mathbb{E}[V] > \epsilon\). Since the elements of Y′ are chosen independently, it follows that \(\mathrm{Var}(V) = p(1-p)/m\). Thus, the probability of the complement of \(F'_h\) satisfies
\[\Pr(\neg F'_h) = \Pr(V \leq \epsilon/2) \leq \Pr(|V - p| \geq p/2) \leq \frac{\mathrm{Var}(V)}{(p/2)^2} = \frac{4(1-p)}{pm} \leq \frac{4}{\epsilon m} \leq \frac{1}{2},\]
where the last inequality holds since we may assume \(m \geq 8/\epsilon\).
We now give an upper bound on \(\Pr(F)\). We note that the events \(F_h\) depend only on the choice of Y, the events \(F'_h\) depend only on the choice of Y′, and Y, Y′ are independent; hence Lemma A.1 gives
\[\Pr(F) \leq 2\,\Pr\Big(\bigcup_{h \in H} (F_h \cap F'_h)\Big).\]
Let S = Y ∪ Y′, where the union is as multisets. Conditioned on the value of S, the multiset Y is a uniform subset of half of the elements of S. Thus,
\[\Pr(F) \leq 2\,\mathbb{E}_S\Big[\Pr\big(\exists\, h' \in C|_S :\ h'|_Y = c|_Y \text{ and } \mathrm{dist}_{Y'}(h', c) > \epsilon/2 \;\big|\; S\big)\Big].\]
Notice that if \(\mathrm{dist}_{Y'}(h', c) > \epsilon/2\) then \(\mathrm{dist}_S(h', c) > \epsilon/4\); hence, for any fixed such h′, the probability that we choose Y so that \(h'|_Y = c|_Y\) is at most \((1-\epsilon/4)^m\). Using Theorem 1.1 to bound \(|C|_S| \leq (2m+1)^d\), a union bound over \(C|_S\) gives
\[\Pr(F) \leq 2(2m+1)^d (1-\epsilon/4)^m < \delta.\]
□
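To see the quantitative bound in action (our own illustration, not from the chapter), the following sketch estimates, for the class of thresholds on [0, 1] (VC-dimension d = 1) under the uniform distribution, the probability that some hypothesis consistent with the sample is ε-far from the target, and compares it with the bound \(2(2m+1)^d(1-\epsilon/4)^m\):

```python
import random

random.seed(1)
m, eps, trials, target = 400, 0.1, 20_000, 0.5

bad = 0
for _ in range(trials):
    xs = [random.random() for _ in range(m)]
    # Thresholds h_t(x) = 1 iff x >= t; labels come from t = target.
    # h_t is consistent with the sample iff max_neg < t <= min_pos.
    max_neg = max((x for x in xs if x < target), default=0.0)
    min_pos = min((x for x in xs if x >= target), default=1.0)
    # Some consistent threshold is eps-far from the target iff the sample
    # misses [target - eps, target) or misses [target, target + eps].
    if max_neg < target - eps or min_pos > target + eps:
        bad += 1

bound = 2 * (2 * m + 1) ** 1 * (1 - eps / 4) ** m  # d = 1
print(f"empirical: {bad / trials:.5f}   theorem's bound: {bound:.5f}")
```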
Copyright information
© 2017 Springer International Publishing AG
Cite this chapter
Moran, S., Shpilka, A., Wigderson, A., Yehudayoff, A. (2017). Teaching and Compressing for Low VC-Dimension. In: Loebl, M., Nešetřil, J., Thomas, R. (eds) A Journey Through Discrete Mathematics. Springer, Cham. https://doi.org/10.1007/978-3-319-44479-6_26