On the Version Space Compression Set Size and Its Applications


Abstract

The version space compression set size n is the size of the smallest subset of a training set that induces the same (agnostic) version space, with respect to a given hypothesis class, as the full training set.
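
To make this definition concrete, here is a minimal brute-force sketch. It is not part of the original chapter: the toy threshold class, the grid, and all function names are our own illustration. It computes the compression set size for a small finite hypothesis class in the realizable case.

```python
from itertools import combinations

def version_space(hypotheses, sample):
    """Indices of the hypotheses consistent with every example in `sample`."""
    return frozenset(
        i for i, h in enumerate(hypotheses)
        if all(h(x) == y for x, y in sample)
    )

def compression_set_size(hypotheses, sample):
    """Size of the smallest subset of `sample` that induces the same version
    space as the full sample (brute force, exponential; illustration only)."""
    target = version_space(hypotheses, sample)
    for n in range(len(sample) + 1):
        for subset in combinations(sample, n):
            if version_space(hypotheses, subset) == target:
                return n
    return len(sample)

# Toy class: thresholds h_t(x) = 1{x >= t} on a small grid.
thresholds = [t / 10 for t in range(11)]
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]

# Realizable sample labeled by the (unknown) threshold 0.35.
sample = [(x / 10, int(x / 10 >= 0.35)) for x in range(10)]
print(compression_set_size(hypotheses, sample))  # -> 2
```

For thresholds, the two examples adjacent to the decision boundary already pin down the version space, so the compression set size is 2 no matter how large the sample is.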


Notes

  1. More formally, \(f^*\) is a classifier such that \(R(f^{*}) = \inf _{f \in \mathcal F} R(f)\) and \(\inf _{f \in \mathcal F} P( (x,y) : f(x) \ne f^{*}(x)) = 0\). Its existence is guaranteed by topological arguments (see [9]).

  2. We assume that \(f^*\) is unique with respect to P. This is always true in a realizable setting. For the agnostic case it requires an additional smoothness assumption, e.g., a low noise condition, or, more generally, a Bernstein type condition on the excess loss class [4, 18].

  3. This definition makes sense for positive coverage.

References

  1. Bentley, J.L., Kung, H.T., Schkolnick, M., Thompson, C.D.: On the average number of maxima in a set of vectors and applications. J. ACM 25(4), 536–543 (1978)

  2. Cohn, D., Atlas, L., Ladner, R.: Improving generalization with active learning. Mach. Learn. 15(2), 201–221 (1994)

  3. El-Yaniv, R., Wiener, Y.: On the foundations of noise-free selective classification. J. Mach. Learn. Res. 11, 1605–1641 (2010)

  4. El-Yaniv, R., Wiener, Y.: Agnostic selective classification. In: Shawe-Taylor, J. et al. (eds.) Advances in Neural Information Processing Systems (NIPS), vol. 24, pp. 1665–1673 (2011)

  5. El-Yaniv, R., Wiener, Y.: Active learning via perfect selective classification. J. Mach. Learn. Res. 13(1), 255–279 (2012)

  6. Hanneke, S.: A bound on the label complexity of agnostic active learning. In: ICML, pp. 353–360. ACM, New York (2007)

  7. Hanneke, S.: Teaching dimension and the complexity of active learning. In: Bshouty, N.H., Gentile, C. (eds.) Proceedings of the 20th Annual Conference on Learning Theory (COLT). Lecture Notes in Artificial Intelligence, vol. 4539, pp. 66–81. Springer, Berlin (2007)

  8. Hanneke, S.: Theoretical foundations of active learning. Ph.D. thesis, Carnegie Mellon University (2009)

  9. Hanneke, S.: Activized learning: transforming passive to active with improved label complexity. J. Mach. Learn. Res. 13(1), 1469–1587 (2012)

  10. Hanneke, S.: A statistical theory of active learning. Unpublished (2013)

  11. Herbrich, R.: Learning Kernel Classifiers. The MIT Press, Cambridge (2002)

  12. Littlestone, N., Warmuth, M.: Relating Data Compression and Learnability. Technical report, University of California, Santa Cruz (1986)

  13. Mitchell, T.: Version spaces: a candidate elimination approach to rule learning. In: IJCAI’77: Proceedings of the 5th international joint conference on Artificial intelligence, pp. 305–310. Morgan Kaufmann, San Francisco (1977)

  14. Preparata, F.P., Shamos, M.I.: Computational Geometry: An Introduction. Springer, Berlin (1985)

  15. Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16(2), 264–280 (1971) (This volume, Chap. 3)

  16. Wiener, Y.: Theoretical foundations of selective prediction. Ph.D. thesis, Technion—Israel Institute of Technology (2013)

  17. Wiener, Y., El-Yaniv, R.: Pointwise tracking the optimal regression function. In: Bartlett, P. et al. (eds.) Advances in Neural Information Processing Systems (NIPS), vol. 25, pp. 2051–2059 (2012)

  18. Wiener, Y., El-Yaniv, R.: Agnostic pointwise-competitive selective classification. J. Artif. Intell. Res. 52, 179–201 (2015)

  19. Wiener, Y., Hanneke, S., El-Yaniv, R.: A compression technique for analyzing disagreement-based active learning. Technical report, arXiv preprint arXiv:1404.1504 (2014)

  20. Yogananda, A.P., Murty, M.N., Gopal, L.: A fast linear separability test by projection of positive points on subspaces. In: Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), pp. 713–720. ACM, New York (2007)

Acknowledgments

We are grateful to Steve Hanneke for helpful and insightful discussions. Also, we warmly thank the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) for their generous support.

Author information

Corresponding author

Correspondence to Ran El-Yaniv.

Appendix

Proof

(of Lemma 23.1 [19]) Let \(I_n\) be the set of all sets of \(n\) distinct indices \(\{i_{1},\ldots ,i_{n}\}\) from \(\{1,\ldots , m\}\). We denote by \(\mathbf {i}\) and \(\mathbf {j}\) elements of \(I_n\). Clearly, \(|I_n| = \left( {\begin{array}{c}m\\ n\end{array}}\right) \). Given a labeled sample \(S_m\) and \(\mathbf {i}\in I_n\), denote by \(S_m^{\mathbf {i}}\) the projection of \(S_m\) over the indices in \(\mathbf {i}\), and by \(S_m^{-\mathbf {i}}\) the projection of \(S_m\) over \(\{1,\ldots ,m\} \setminus \mathbf {i}\).

Define \(\omega (\mathbf {i}, m)\) to be the event \(S_{m} \cap \phi _{n}(S_m^{\mathbf {i}}) = \emptyset \), and \(\omega (\mathbf {i}, m-n)\) to be the event \(S_{m}^{-\mathbf {i}} \cap \phi _{n}(S_m^{\mathbf {i}}) = \emptyset \). We thus have

$$\begin{aligned} P_{S_m}&\left\{ P \{ \phi _n( S_m^{\mathbf {i}} ) \} > \epsilon \ \wedge \ \omega (\mathbf {i}, m) \right\} \nonumber \\&\le \sum _{\mathbf {j}} P_{S_m} \left\{ P \{ \phi _n( S_m^{\mathbf {j}} )\} > \epsilon \ \wedge \ \omega (\mathbf {j}, m) \right\} \nonumber \\&\le \sum _{\mathbf {j}} \mathbb {E}_{S_m^{\mathbf {j}}} \left\{ P_{S_m^{-\mathbf {j}}} \left\{ P \{ \phi _n( S_m^{\mathbf {j}} ) \} > \epsilon \ \wedge \ \omega (\mathbf {j}, m-n) \ | \ S_m^{\mathbf {j}} \right\} \right\} \nonumber \\&\le \left( {\begin{array}{c}m\\ n\end{array}}\right) (1-\epsilon )^{m-n} \end{aligned}$$
(23.3)
$$\begin{aligned}&\le \left( \frac{em}{n}\right) ^n e^{-\epsilon (m-n)} , \end{aligned}$$
(23.4)

where inequality (23.3) holds due to the permutation invariance of \(\phi \) and because, conditioned on \(S_m^{\mathbf {j}}\), each of the \(m-n\) examples in \(S_m^{-\mathbf {j}}\) is drawn i.i.d. and hits the set \(\phi _n(S_m^{\mathbf {j}})\) with probability greater than \(\epsilon \), hence avoids it with probability at most \(1-\epsilon \). Inequality (23.4) follows from the standard inequalities \((1-\epsilon )^{m-n} \le e^{-\epsilon (m-n)}\) and \(\left( {\begin{array}{c}m\\ n\end{array}}\right) \le (\frac{em}{n})^{n}\) (see Theorems A.101 and A.105 in [11]). The proof is completed by taking \(\epsilon \) equal to the right-hand side of (23.1) (or 1 if this is greater than 1). \(\square \)
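
As a quick numerical sanity check of the two standard inequalities used to pass from (23.3) to (23.4) (our own sketch; the parameter values below are arbitrary):

```python
import math

def bound_23_3(m, n, eps):
    """C(m, n) * (1 - eps)^(m - n), the right-hand side of (23.3)."""
    return math.comb(m, n) * (1 - eps) ** (m - n)

def bound_23_4(m, n, eps):
    """(e*m/n)^n * exp(-eps*(m - n)), the right-hand side of (23.4)."""
    return (math.e * m / n) ** n * math.exp(-eps * (m - n))

for m, n, eps in [(200, 3, 0.2), (1000, 5, 0.05), (1000, 20, 0.1)]:
    tight, loose = bound_23_3(m, n, eps), bound_23_4(m, n, eps)
    assert tight <= loose  # (23.4) is always the weaker (larger) bound
    print(f"m={m:4d} n={n:2d} eps={eps:.2f}  (23.3)={tight:.3e}  (23.4)={loose:.3e}")
```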

The proof of Theorem 23.2 below relies on the following Lemma 23.2 from [19].

Lemma 23.2

([19]) In the realizable case, for any \(r_{0} \in (0,1)\),

$$\begin{aligned} \theta (r_{0}) \le \max \left\{ \max _{r \in (r_{0},1)} 16 \mathcal {B}_{\hat{n}}\!\left( \left\lceil \frac{1}{r}\right\rceil ,\frac{1}{20}\right) , 512\right\} . \end{aligned}$$

Proof

The claim is that, for any \(r \in (0,1)\),

$$\begin{aligned} \frac{\varDelta \mathrm{B}(f^{*}, r)}{r} \le \max \left\{ 16 \mathcal {B}_{\hat{n}}\!\left( \left\lceil \frac{1}{r} \right\rceil ,\frac{1}{20}\right) , 512\right\} . \end{aligned}$$
(23.5)

The result then follows by taking the supremum of both sides over \(r \in (r_{0},1)\).

Fix \(r \in (0,1)\), let \(m = \lceil 1 / r \rceil \), and for \(i \in \{1,\ldots ,m\}\), define \(S_{m \setminus i} = S_{m} \setminus \{(x_{i},y_{i})\}\). Also define \(D_{m \setminus i} = \mathrm{DIS}( \mathrm{VS}_{\mathcal F,S_{m \setminus i}} \cap \mathrm{B}(f^{*},r) )\) and \(\varDelta _{m \setminus i} = \mathbb {P}(x_{i} \in D_{m \setminus i} | S_{m \setminus i}) = P( D_{m \setminus i} \times \mathcal Y)\). If \(\varDelta \mathrm{B}(f^{*},r) m \le 512\), then, since \(m \ge 1/r\), we have \(\varDelta \mathrm{B}(f^{*},r)/r \le 512\) and (23.5) clearly holds. So suppose \(\varDelta \mathrm{B}(f^{*},r) m > 512\). If \(x_{i} \in \mathrm{DIS}( \mathrm{VS}_{\mathcal F,S_{m \setminus i}} )\), then we must have \((x_{i},y_{i}) \in \hat{\mathcal {C}}_{S_{m}}\). So

$$\begin{aligned} \hat{n}(S_{m}) \ge \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{VS}_{\mathcal F,S_{m \setminus i}})}(x_{i}). \end{aligned}$$

Therefore,

$$\begin{aligned}&\mathbb {P}\left\{ \hat{n}(S_{m}) \le (1/16) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&\le \mathbb {P}\left\{ \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{VS}_{\mathcal F,S_{m \setminus i}})}(x_{i}) \le (1/16) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&\le \mathbb {P}\left\{ \sum _{i=1}^{m} \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \le (1/16) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&= \mathbb {P}\left\{ \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) - \mathbbm {1}_{D_{m \setminus i}}(x_{i})\right. \\&\qquad \left. {}\ge \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) - (1/16) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&= \mathbb {P} \left\{ \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) - \mathbbm {1}_{D_{m \setminus i}}(x_{i})\right. \\&\qquad {}\ge \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) - \frac{1}{16} \varDelta \mathrm{B}(f^{*}, r) m ,\\&\qquad \left. \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) < \frac{7}{8} \varDelta \mathrm{B}(f^{*}, r)m\right\} \\&+ \mathbb {P} \left\{ \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) - \mathbbm {1}_{D_{m \setminus i}}(x_{i})\right. \\&\qquad {}\ge \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) - \frac{1}{16} \varDelta \mathrm{B}(f^{*}, r) m,\\&\qquad \left. \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) \ge \frac{7}{8} \varDelta \mathrm{B}(f^{*}, r) m\right\} \\&\le \mathbb {P}\left\{ \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) < (7/8) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&\qquad {} + \mathbb {P} \left\{ \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) - \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \ge (13/16) \varDelta \mathrm{B}(f^{*},r) m \right\} . \end{aligned}$$

Since we are considering the case \(\varDelta \mathrm{B}(f^{*},r) m > 512\), a Chernoff bound implies

$$\begin{aligned} \mathbb {P}&\left( \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) < (7/8) \varDelta \mathrm{B}(f^{*},r) m \right) \\&\qquad \qquad \qquad \qquad \qquad \qquad \le \exp \left\{ - \varDelta \mathrm{B}(f^{*},r) m / 128 \right\} < e^{-4}. \end{aligned}$$
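
This is the multiplicative Chernoff lower-tail bound \(\mathbb {P}\{X < (1-\delta )\mu \} \le \exp (-\delta ^{2}\mu /2)\) applied with \(\delta = 1/8\) and \(\mu = \varDelta \mathrm{B}(f^{*},r)\, m\); since \(\mu > 512\), the bound is indeed below \(e^{-4}\). A small exact-binomial check (our own sketch; the values of \(m\) and the disagreement mass are made up):

```python
import math

def binom_cdf(m, p, k):
    """Exact P[Binomial(m, p) <= k]."""
    return sum(math.comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1))

m, delta = 600, 0.9          # made-up values with mu = delta * m = 540 > 512
mu = delta * m
lower_tail = binom_cdf(m, delta, math.ceil(7 / 8 * mu) - 1)  # P[X < (7/8) mu]
chernoff = math.exp(-mu / 128)
print(lower_tail, chernoff, math.exp(-4))
assert lower_tail <= chernoff < math.exp(-4)
```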

Furthermore, Markov’s inequality implies

$$\begin{aligned} \mathbb {P}&\left( \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) - \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \ge (13/16) \varDelta \mathrm{B}(f^{*},r) m \right) \\&\qquad \qquad \qquad \qquad \qquad \le \frac{ m \varDelta \mathrm{B}(f^{*},r) - \mathbb {E}\left[ \sum _{i=1}^{m} \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \right] }{(13/16) m \varDelta \mathrm{B}(f^{*},r)}. \end{aligned}$$

Since the \(x_{i}\) values are exchangeable,

$$\begin{aligned} \mathbb {E}&\left[ \sum _{i=1}^{m} \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \right] = \sum _{i=1}^{m} \mathbb {E}\left[ \mathbb {E}\left[ \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \Big | S_{m \setminus i} \right] \right] \\&\qquad \qquad \qquad \qquad \qquad \qquad = \sum _{i=1}^{m} \mathbb {E}\left[ \varDelta _{m \setminus i} \right] = m \mathbb {E}\left[ \varDelta _{m \setminus m}\right] . \end{aligned}$$

It is shown in [9] that this quantity is at least

$$\begin{aligned} m (1-r)^{m-1} \varDelta \mathrm{B}(f^{*},r). \end{aligned}$$

In particular, when \(\varDelta \mathrm{B}(f^{*},r) m > 512\), we must have \(r < 1/511 < 1/2\), which implies \((1-r)^{\lceil 1/r \rceil - 1} \ge 1/4\), so that we have

$$\begin{aligned} \mathbb {E}\left[ \sum _{i=1}^{m} \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \right] \ge (1/4) m \varDelta \mathrm{B}(f^{*},r). \end{aligned}$$
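
The elementary fact used here, \((1-r)^{\lceil 1/r \rceil - 1} \ge 1/4\) for \(r \le 1/2\), is easy to confirm numerically; the throwaway check below (ours, not the chapter's) also shows the expression approaching \(1/e \approx 0.37\) as \(r \rightarrow 0\):

```python
import math

# (1 - r)^(ceil(1/r) - 1) >= 1/4 for r <= 1/2; the value tends to 1/e as r -> 0.
for r in [0.5, 0.1, 0.01, 1 / 511, 1e-4, 1e-6]:
    value = (1 - r) ** (math.ceil(1 / r) - 1)
    assert value >= 0.25
    print(f"r={r:.2e}  (1-r)^(ceil(1/r)-1)={value:.4f}")
```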

Altogether, we have established that

$$\begin{aligned} \mathbb {P}&\left( \hat{n}(S_{m}) \le (1/16) \varDelta \mathrm{B}(f^{*},r) m \right) < \frac{ m \varDelta \mathrm{B}(f^{*},r) - (1/4) m \varDelta \mathrm{B}(f^{*},r) }{(13/16) m \varDelta \mathrm{B}(f^{*},r)} + e^{-4} \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad = \frac{12}{13} + e^{-4} < 19/20. \end{aligned}$$
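
The closing arithmetic, \((1 - 1/4)/(13/16) = 12/13\) and \(12/13 + e^{-4} \approx 0.941 < 19/20\), can be verified in a couple of lines:

```python
import math

ratio = (1 - 1 / 4) / (13 / 16)          # = 12/13
assert math.isclose(ratio, 12 / 13)
assert 12 / 13 + math.exp(-4) < 19 / 20  # ~0.9414 < 0.95
```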

Thus, since \(\hat{n}(S_{m}) \le \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) \) with probability at least 19/20, we must have that

$$\begin{aligned} \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) > (1/16) \varDelta \mathrm{B}(f^{*},r) m \ge (1/16) \frac{ \varDelta \mathrm{B}(f^{*},r) }{r}. \end{aligned}$$

\(\square \)

Proof

(of Theorem 23.2, [19]) Assuming that

$$\mathcal {B}_{\hat{n}}\!\left( m,\delta \right) = O\left( \mathrm{polylog}( m ) \log \left( \frac{1}{\delta }\right) \right) $$

holds, there exists a constant \(\delta _{1} \in (0,1/20)\) for which

$$\mathcal {B}_{\hat{n}}\!\left( m,\delta _{1}\right) = O\left( \mathrm{polylog}( m ) \right) .$$

Because \(\mathcal {B}_{\hat{n}}\!\left( m,\delta \right) \) is non-increasing with \(\delta \), \(\mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) \le \mathcal {B}_{\hat{n}}\!\left( m,\delta _{1}\right) \), and thus \(\mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) = O\left( \mathrm{polylog}(m) \right) \). Therefore,

$$\begin{aligned} \max _{m \le 1/r_{0}} \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) = O\left( \max _{m \le 1/r_{0}} \mathrm{polylog}( m ) \right) = O\left( \mathrm{polylog}\left( \frac{1}{r_{0}}\right) \right) , \end{aligned}$$

and using Lemma 23.2 we have

$$\begin{aligned} \theta (r_{0})&\le \max \left\{ \max _{m \le \lceil 1/r_{0} \rceil } 16 \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) , 512\right\} \\&\le 528 + 16 \max _{m \le 1/r_{0}} \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) = O\left( \mathrm{polylog}\left( \frac{1}{r_{0}} \right) \right) . \end{aligned}$$

\(\square \)

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

El-Yaniv, R., Wiener, Y. (2015). On the Version Space Compression Set Size and Its Applications. In: Vovk, V., Papadopoulos, H., Gammerman, A. (eds) Measures of Complexity. Springer, Cham. https://doi.org/10.1007/978-3-319-21852-6_23

  • DOI: https://doi.org/10.1007/978-3-319-21852-6_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21851-9

  • Online ISBN: 978-3-319-21852-6
