Abstract
The version space compression set size n is the size of the smallest subset of a training set that induces the same (agnostic) version space with respect to a hypothesis class.
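To make the definition concrete, the following self-contained sketch (an illustration of ours, not the chapter's algorithm) computes the compression set size by brute force for one-dimensional threshold classifiers \(h_t(x) = \mathrm{sign}(x - t)\); the function names and the finite grid of candidate thresholds are assumptions made for the example.

from itertools import combinations

# Version space of threshold classifiers h_t(x) = +1 if x > t else -1,
# restricted to a finite grid of thresholds so version spaces are finite sets.
def version_space(sample, thresholds):
    return {t for t in thresholds
            if all((1 if x > t else -1) == y for x, y in sample)}

# Size of the smallest subsample inducing the same version space as the
# full sample, i.e., the version space compression set size.
def compression_set_size(sample, thresholds):
    target = version_space(sample, thresholds)
    for n in range(len(sample) + 1):
        for subsample in combinations(sample, n):
            if version_space(subsample, thresholds) == target:
                return n

sample = [(0.1, -1), (0.4, -1), (0.6, +1), (0.9, +1)]
grid = [i / 20 for i in range(21)]
print(compression_set_size(sample, grid))  # prints 2

Here the minimal inducing subsample is the innermost pair of oppositely labeled points, (0.4, -1) and (0.6, +1): they alone pin the version space down to the thresholds in [0.4, 0.6).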
Notes
1. More formally, \(f^*\) is a classifier such that \(R(f^{*}) = \inf _{f \in \mathcal F} R(f)\) and \(\inf _{f \in \mathcal F} P( (x,y) : f(x) \ne f^{*}(x)) = 0\). Its existence is guaranteed by topological arguments (see [9]).
2.
3. This definition makes sense for positive coverage.
References
1. Bentley, J.L., Kung, H.T., Schkolnick, M., Thompson, C.D.: On the average number of maxima in a set of vectors and applications. J. ACM 25(4), 536–543 (1978)
2. Cohn, D., Atlas, L., Ladner, R.: Improving generalization with active learning. Mach. Learn. 15(2), 201–221 (1994)
3. El-Yaniv, R., Wiener, Y.: On the foundations of noise-free selective classification. J. Mach. Learn. Res. 11, 1605–1641 (2010)
4. El-Yaniv, R., Wiener, Y.: Agnostic selective classification. In: Shawe-Taylor, J., et al. (eds.) Advances in Neural Information Processing Systems (NIPS), vol. 24, pp. 1665–1673 (2011)
5. El-Yaniv, R., Wiener, Y.: Active learning via perfect selective classification. J. Mach. Learn. Res. 13(1), 255–279 (2012)
6. Hanneke, S.: A bound on the label complexity of agnostic active learning. In: Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pp. 353–360. ACM, New York (2007)
7. Hanneke, S.: Teaching dimension and the complexity of active learning. In: Bshouty, N.H., Gentile, C. (eds.) Proceedings of the 20th Annual Conference on Learning Theory (COLT). Lecture Notes in Artificial Intelligence, vol. 4539, pp. 66–81. Springer, Berlin (2007)
8. Hanneke, S.: Theoretical foundations of active learning. Ph.D. thesis, Carnegie Mellon University (2009)
9. Hanneke, S.: Activized learning: transforming passive to active with improved label complexity. J. Mach. Learn. Res. 13(1), 1469–1587 (2012)
10. Hanneke, S.: A statistical theory of active learning. Unpublished manuscript (2013)
11. Herbrich, R.: Learning Kernel Classifiers. The MIT Press, Cambridge (2002)
12. Littlestone, N., Warmuth, M.: Relating data compression and learnability. Technical report, University of California, Santa Cruz (1986)
13. Mitchell, T.: Version spaces: a candidate elimination approach to rule learning. In: Proceedings of the 5th International Joint Conference on Artificial Intelligence (IJCAI 1977), pp. 305–310. Morgan Kaufmann, San Francisco (1977)
14. Preparata, F.P., Shamos, M.I.: Computational Geometry: An Introduction. Springer, Berlin (1985)
15. Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16(2), 264–280 (1971) (This volume, Chap. 3)
16. Wiener, Y.: Theoretical foundations of selective prediction. Ph.D. thesis, Technion—Israel Institute of Technology (2013)
17. Wiener, Y., El-Yaniv, R.: Pointwise tracking the optimal regression function. In: Bartlett, P., et al. (eds.) Advances in Neural Information Processing Systems (NIPS), vol. 25, pp. 2051–2059 (2012)
18. Wiener, Y., El-Yaniv, R.: Agnostic pointwise-competitive selective classification. J. Artif. Intell. Res. 52, 179–201 (2015)
19. Wiener, Y., Hanneke, S., El-Yaniv, R.: A compression technique for analyzing disagreement-based active learning. arXiv preprint arXiv:1404.1504 (2014)
20. Yogananda, A.P., Murty, M.N., Gopal, L.: A fast linear separability test by projection of positive points on subspaces. In: Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pp. 713–720. ACM, New York (2007)
Acknowledgments
We are grateful to Steve Hanneke for helpful and insightful discussions. Also, we warmly thank the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) for their generous support.
Appendix
Proof
(of Lemma 23.1 [19]) Let \(I_n\) be the set of all sets of \(n\) distinct indices \(\{i_{1},\ldots ,i_{n}\}\) from \(\{1,\ldots ,m\}\). We denote by \(\mathbf {i}\) and \(\mathbf {j}\) elements of \(I_n\). Clearly, \(|I_n| = \binom{m}{n}\). Given a labeled sample \(S_m\) and \(\mathbf {i}\in I_n\), denote by \(S_m^{\mathbf {i}}\) the projection of \(S_m\) onto the indices in \(\mathbf {i}\), and by \(S_m^{-\mathbf {i}}\) the projection of \(S_m\) onto \(\{1,\ldots ,m\} \setminus \mathbf {i}\).
Define \(\omega (\mathbf {i}, m)\) to be the event \(S_{m} \cap \phi _{n}(S_m^{\mathbf {i}}) = \emptyset \), and \(\omega (\mathbf {i}, m-n)\) the event \(S_{m}^{-\mathbf {i}} \cap \phi _{n}(S_m^{\mathbf {i}}) = \emptyset \). Since \(S_m^{-\mathbf {i}} \subseteq S_m\), we have \(\omega (\mathbf {i}, m) \subseteq \omega (\mathbf {i}, m-n)\). We thus have, by the union bound,
\[
P\Big( \bigcup _{\mathbf {j}\in I_n} \omega (\mathbf {j}, m) \Big) \;\le \; \sum _{\mathbf {j}\in I_n} P\big( \omega (\mathbf {j}, m-n) \big) \;\overset{(23.3)}{\le }\; \binom{m}{n} (1-\epsilon )^{m-n} \;\overset{(23.4)}{\le }\; \left( \frac{em}{n}\right) ^{n} e^{-\epsilon (m-n)},
\]
where inequality (23.3) holds due to the permutation invariance of \(\phi \) and because each of the \(m-n\) examples in \(S_m^{-\mathbf {j}}\) is drawn i.i.d. and hits the set \(\phi _n(S_m^{\mathbf {j}})\) with probability greater than \(\epsilon \), so with probability at most \(1-\epsilon \) it is not contained in that set. Inequality (23.4) follows from the standard inequalities \((1-\epsilon )^{m-n} \le e^{-\epsilon (m-n)}\) and \(\binom{m}{n} \le (\frac{em}{n})^{n}\) (see Theorems A.101 and A.105 in [11]). The proof is completed by taking \(\epsilon \) equal to the right-hand side of (23.1) (or 1 if this is greater than 1). \(\square \)
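As a quick numerical sanity check of the two standard bounds just invoked (a snippet of ours, not part of the proof), one can verify \(\binom{m}{n} (1-\epsilon )^{m-n} \le (\frac{em}{n})^{n} e^{-\epsilon (m-n)}\) directly for a few parameter choices:

import math

# Verify C(m, n) * (1 - eps)^(m - n) <= (e * m / n)^n * exp(-eps * (m - n))
# for a few (m, n, eps) triples; each factor is bounded separately in (23.4).
for m, n, eps in [(100, 3, 0.05), (1000, 10, 0.01), (50, 5, 0.2)]:
    lhs = math.comb(m, n) * (1 - eps) ** (m - n)
    rhs = (math.e * m / n) ** n * math.exp(-eps * (m - n))
    assert lhs <= rhs
    print(f"m={m}, n={n}, eps={eps}: {lhs:.3g} <= {rhs:.3g}")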
The proof of Theorem 23.2 below relies on the following Lemma 23.2 from [19].
Lemma 23.2
([19]) In the realizable case, for any \(r_{0} \in (0,1)\),
Proof
The claim is that, for any \(r \in (0,1)\),
The result then follows by taking the supremum of both sides over \(r \in (r_{0},1)\).
Fix \(r \in (0,1)\), let \(m = \lceil 1 / r \rceil \), and for \(i \in \{1,\ldots ,m\}\), define \(S_{m \setminus i} = S_{m} \setminus \{(x_{i},y_{i})\}\). Also define \(D_{m \setminus i} = \mathrm{DIS}( \mathrm{VS}_{\mathcal F,S_{m \setminus i}} \cap \mathrm{B}(f^{*},r) )\) and \(\varDelta _{m \setminus i} = \mathbb {P}(x_{i} \in D_{m \setminus i} | S_{m \setminus i}) = P( D_{m \setminus i} \times \mathcal Y)\). If \(\varDelta \mathrm{B}(f^{*},r) m \le 512\), (23.5) clearly holds. Otherwise, suppose \(\varDelta \mathrm{B}(f^{*},r) m > 512\). If \(x_{i} \in \mathrm{DIS}( \mathrm{VS}_{\mathcal F,S_{m \setminus i}} )\), then we must have \((x_{i},y_{i}) \in \hat{\mathcal {C}}_{S_{m}}\). So
Therefore,
Since we are considering the case \(\varDelta \mathrm{B}(f^{*},r) m > 512\), a Chernoff bound implies
Furthermore, Markov’s inequality implies
Since the \(x_{i}\) values are exchangeable,
Hanneke [9] proves that this is at least
In particular, when \(\varDelta \mathrm{B}(f^{*},r) m > 512\), we must have \(r < 1/511 < 1/2\); since \(\lceil 1/r \rceil - 1 \le 1/r\) and \(r \mapsto (1-r)^{1/r}\) is decreasing on \((0,1/2]\), this implies \((1-r)^{\lceil 1/r \rceil - 1} \ge (1-r)^{1/r} \ge (1/2)^{2} = 1/4\), so that we have
Altogether, we have established that
Thus, since \(\hat{n}(S_{m}) \le \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) \) with probability at least \(19/20\), we must have that
\(\square \)
Proof
(of Theorem 23.2 [19]) Assuming that
holds, there exists a constant \(\delta _{1} \in (0,1/20)\) for which
Because \(\mathcal {B}_{\hat{n}}\!\left( m,\delta \right) \) is non-increasing with \(\delta \), \(\mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) \le \mathcal {B}_{\hat{n}}\!\left( m,\delta _{1}\right) \), and thus \(\mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) = O\left( \mathrm{polylog}(m) \right) \). Therefore,
and using Lemma 23.2 we have
\(\square \)
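Both proofs above manipulate disagreement regions of version spaces. For intuition, using the standard definition \(\mathrm{DIS}(V) = \{x : \exists f, g \in V \text{ with } f(x) \ne g(x)\}\) and the same toy threshold class as in the sketch after the abstract (again an illustration of ours, not the chapter's construction), the disagreement region is just the gap spanned by the innermost oppositely labeled examples:

# Disagreement region DIS(V) = {x : two classifiers in V label x differently},
# for the version space of threshold classifiers over a finite grid.
def version_space(sample, thresholds):
    return {t for t in thresholds
            if all((1 if x > t else -1) == y for x, y in sample)}

def disagreement_region(sample, thresholds, query_points):
    vs = version_space(sample, thresholds)
    return [x for x in query_points
            if len({1 if x > t else -1 for t in vs}) > 1]

sample = [(0.1, -1), (0.4, -1), (0.6, +1), (0.9, +1)]
grid = [i / 20 for i in range(21)]
queries = [i / 100 for i in range(101)]
print(disagreement_region(sample, grid, queries))  # the points in (0.40, 0.55]

Quantities such as \(\varDelta _{m \setminus i}\) in the proof of Lemma 23.2 are exactly the probability mass of such regions.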