On the Version Space Compression Set Size and Its Applications


Abstract

The version space compression set size n is the size of the smallest subset of a training set that induces the same (agnostic) version space, with respect to a given hypothesis class, as the full training set.
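
To make this definition concrete, here is a minimal brute-force sketch. It is not part of the original chapter: the toy threshold class, the grid, and all function names are our own illustration. It computes the compression set size for a small finite hypothesis class in the realizable case.

```python
from itertools import combinations

def version_space(hypotheses, sample):
    """Indices of the hypotheses consistent with every example in `sample`."""
    return frozenset(
        i for i, h in enumerate(hypotheses)
        if all(h(x) == y for x, y in sample)
    )

def compression_set_size(hypotheses, sample):
    """Size of the smallest subset of `sample` that induces the same version
    space as the full sample (brute force, exponential; illustration only)."""
    target = version_space(hypotheses, sample)
    for n in range(len(sample) + 1):
        for subset in combinations(sample, n):
            if version_space(hypotheses, subset) == target:
                return n
    return len(sample)

# Toy class: thresholds h_t(x) = 1{x >= t} on a small grid.
thresholds = [t / 10 for t in range(11)]
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]

# Realizable sample labeled by the (unknown) threshold 0.35.
sample = [(x / 10, int(x / 10 >= 0.35)) for x in range(10)]
print(compression_set_size(hypotheses, sample))  # -> 2
```

For thresholds, the two examples adjacent to the decision boundary already pin down the version space, so the compression set size is 2 no matter how large the sample is.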


Notes

  1. More formally, \(f^*\) is a classifier such that \(R(f^{*}) = \inf _{f \in \mathcal F} R(f)\) and \(\inf _{f \in \mathcal F} P( (x,y) : f(x) \ne f^{*}(x)) = 0\). Its existence is guaranteed by topological arguments (see [9]).

  2. We assume that \(f^*\) is unique with respect to P. This is always true in a realizable setting. For the agnostic case it requires an additional smoothness assumption, e.g., a low noise condition, or, more generally, a Bernstein type condition on the excess loss class [4, 18].

  3. This definition makes sense for positive coverage.

References

  1. Bentley, J.L., Kung, H.T., Schkolnick, M., Thompson, C.D.: On the average number of maxima in a set of vectors and applications. J. ACM 25(4), 536–543 (1978)

  2. Cohn, D., Atlas, L., Ladner, R.: Improving generalization with active learning. Mach. Learn. 15(2), 201–221 (1994)

  3. El-Yaniv, R., Wiener, Y.: On the foundations of noise-free selective classification. J. Mach. Learn. Res. 11, 1605–1641 (2010)

  4. El-Yaniv, R., Wiener, Y.: Agnostic selective classification. In: Shawe-Taylor, J. et al. (eds.) Advances in Neural Information Processing Systems (NIPS), vol. 24, pp. 1665–1673 (2011)

  5. El-Yaniv, R., Wiener, Y.: Active learning via perfect selective classification. J. Mach. Learn. Res. 13(1), 255–279 (2012)

  6. Hanneke, S.: A bound on the label complexity of agnostic active learning. In: ICML, pp. 353–360. ACM, New York (2007)

  7. Hanneke, S.: Teaching dimension and the complexity of active learning. In: Bshouty, N.H., Gentile, C. (eds.) Proceedings of the 20th Annual Conference on Learning Theory (COLT). Lecture Notes in Artificial Intelligence, vol. 4539, pp. 66–81. Springer, Berlin (2007)

  8. Hanneke, S.: Theoretical foundations of active learning. Ph.D. thesis, Carnegie Mellon University (2009)

  9. Hanneke, S.: Activized learning: transforming passive to active with improved label complexity. J. Mach. Learn. Res. 13(1), 1469–1587 (2012)

  10. Hanneke, S.: A statistical theory of active learning. Unpublished (2013)

  11. Herbrich, R.: Learning Kernel Classifiers. The MIT Press, Cambridge (2002)

  12. Littlestone, N., Warmuth, M.: Relating Data Compression and Learnability. Technical report, University of California, Santa Cruz (1986)

  13. Mitchell, T.: Version spaces: a candidate elimination approach to rule learning. In: IJCAI’77: Proceedings of the 5th international joint conference on Artificial intelligence, pp. 305–310. Morgan Kaufmann, San Francisco (1977)

  14. Preparata, F.P., Shamos, M.I.: Computational Geometry: An Introduction. Springer, Berlin (1985)

  15. Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16(2), 264–280 (1971) (This volume, Chap. 3)

  16. Wiener, Y.: Theoretical foundations of selective prediction. Ph.D. thesis, Technion—Israel Institute of Technology (2013)

  17. Wiener, Y., El-Yaniv, R.: Pointwise tracking the optimal regression function. In: Bartlett, P. et al. (eds.) Advances in Neural Information Processing Systems (NIPS), vol. 25, pp. 2051–2059 (2012)

  18. Wiener, Y., El-Yaniv, R.: Agnostic pointwise-competitive selective classification. J. Artif. Intell. Res. 52, 179–201 (2015)

  19. Wiener, Y., Hanneke, S., El-Yaniv, R.: A compression technique for analyzing disagreement-based active learning. Technical report, arXiv preprint arXiv:1404.1504 (2014)

  20. Yogananda, A.P., Murty, M.N., Gopal, L.: A fast linear separability test by projection of positive points on subspaces. In: Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), pp. 713–720. ACM, New York (2007)

Acknowledgments

We are grateful to Steve Hanneke for helpful and insightful discussions. Also, we warmly thank the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) for their generous support.

Author information

Corresponding author

Correspondence to Ran El-Yaniv.

Appendix

Proof

(of Lemma 23.1 [19]) Let \(I_n\) be the set of all sets of \(n\) distinct indices \(\{i_{1},\ldots ,i_{n}\}\) from \(\{1,\ldots , m\}\). We denote by \(\mathbf {i}\) and \(\mathbf {j}\) elements of \(I_n\). Clearly, \(|I_n| = \left( {\begin{array}{c}m\\ n\end{array}}\right) \). Given a labeled sample \(S_m\) and \(\mathbf {i}\in I_n\), denote by \(S_m^{\mathbf {i}}\) the projection of \(S_m\) over the indices in \(\mathbf {i}\), and by \(S_m^{-\mathbf {i}}\) the projection of \(S_m\) over \(\{1,\ldots ,m\} \setminus \mathbf {i}\).

Define \(\omega (\mathbf {i}, m)\) to be the event \(S_{m} \cap \phi _{n}(S_m^{\mathbf {i}}) = \emptyset \), and \(\omega (\mathbf {i}, m-n)\) to be the event \(S_{m}^{-\mathbf {i}} \cap \phi _{n}(S_m^{\mathbf {i}}) = \emptyset \). We thus have

$$\begin{aligned} P_{S_m}&\left\{ P \{ \phi _n( S_m^{\mathbf {i}} ) \} > \epsilon \ \wedge \ \omega (\mathbf {i}, m) \right\} \nonumber \\&\le \sum _{\mathbf {j}} P_{S_m} \left\{ P \{ \phi _n( S_m^{\mathbf {j}} )\} > \epsilon \ \wedge \ \omega (\mathbf {j}, m) \right\} \nonumber \\&\le \sum _{\mathbf {j}} \mathbb {E}_{S_m^{\mathbf {j}}} \left\{ P_{S_m^{-\mathbf {j}}} \left\{ P \{ \phi _n( S_m^{\mathbf {j}} ) \} > \epsilon \ \wedge \ \omega (\mathbf {j}, m-n) \ | \ S_m^{\mathbf {j}} \right\} \right\} \nonumber \\&\le \left( {\begin{array}{c}m\\ n\end{array}}\right) (1-\epsilon )^{m-n} \end{aligned}$$
(23.3)
$$\begin{aligned}&\le \left( \frac{em}{n}\right) ^n e^{-\epsilon (m-n)} , \end{aligned}$$
(23.4)

where inequality (23.3) holds due to the permutation invariance of \(\phi \) and because, conditioned on \(S_m^{\mathbf {j}}\), each of the \(m-n\) examples in \(S_m^{-\mathbf {j}}\) is drawn i.i.d. and hits the set \(\phi _n(S_m^{\mathbf {j}})\) with probability greater than \(\epsilon \), hence avoids it with probability at most \(1-\epsilon \). Inequality (23.4) follows from the standard inequalities \((1-\epsilon )^{m-n} \le e^{-\epsilon (m-n)}\) and \(\left( {\begin{array}{c}m\\ n\end{array}}\right) \le (\frac{em}{n})^{n}\) (see Theorems A.101 and A.105 in [11]). The proof is completed by taking \(\epsilon \) equal to the right-hand side of (23.1) (or 1 if this is greater than 1). \(\square \)
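
As a quick numerical sanity check of the two standard inequalities used to pass from (23.3) to (23.4) (our own sketch; the parameter values below are arbitrary):

```python
import math

def bound_23_3(m, n, eps):
    """C(m, n) * (1 - eps)^(m - n), the right-hand side of (23.3)."""
    return math.comb(m, n) * (1 - eps) ** (m - n)

def bound_23_4(m, n, eps):
    """(e*m/n)^n * exp(-eps*(m - n)), the right-hand side of (23.4)."""
    return (math.e * m / n) ** n * math.exp(-eps * (m - n))

for m, n, eps in [(200, 3, 0.2), (1000, 5, 0.05), (1000, 20, 0.1)]:
    tight, loose = bound_23_3(m, n, eps), bound_23_4(m, n, eps)
    assert tight <= loose  # (23.4) is always the weaker (larger) bound
    print(f"m={m:4d} n={n:2d} eps={eps:.2f}  (23.3)={tight:.3e}  (23.4)={loose:.3e}")
```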

The proof of Theorem 23.2 below relies on the following Lemma 23.2 from [19].

Lemma 23.2

([19]) In the realizable case, for any \(r_{0} \in (0,1)\),

$$\begin{aligned} \theta (r_{0}) \le \max \left\{ \max _{r \in (r_{0},1)} 16 \mathcal {B}_{\hat{n}}\!\left( \left\lceil \frac{1}{r}\right\rceil ,\frac{1}{20}\right) , 512\right\} . \end{aligned}$$

Proof

The claim is that, for any \(r \in (0,1)\),

$$\begin{aligned} \frac{\varDelta \mathrm{B}(f^{*}, r)}{r} \le \max \left\{ 16 \mathcal {B}_{\hat{n}}\!\left( \left\lceil \frac{1}{r} \right\rceil ,\frac{1}{20}\right) , 512\right\} . \end{aligned}$$
(23.5)

The result then follows by taking the supremum of both sides over \(r \in (r_{0},1)\).

Fix \(r \in (0,1)\), let \(m = \lceil 1 / r \rceil \), and for \(i \in \{1,\ldots ,m\}\), define \(S_{m \setminus i} = S_{m} \setminus \{(x_{i},y_{i})\}\). Also define \(D_{m \setminus i} = \mathrm{DIS}( \mathrm{VS}_{\mathcal F,S_{m \setminus i}} \cap \mathrm{B}(f^{*},r) )\) and \(\varDelta _{m \setminus i} = \mathbb {P}(x_{i} \in D_{m \setminus i} | S_{m \setminus i}) = P( D_{m \setminus i} \times \mathcal Y)\). If \(\varDelta \mathrm{B}(f^{*},r) m \le 512\), then, since \(m \ge 1/r\), we have \(\varDelta \mathrm{B}(f^{*},r)/r \le 512\) and (23.5) clearly holds. So suppose \(\varDelta \mathrm{B}(f^{*},r) m > 512\). If \(x_{i} \in \mathrm{DIS}( \mathrm{VS}_{\mathcal F,S_{m \setminus i}} )\), then we must have \((x_{i},y_{i}) \in \hat{\mathcal {C}}_{S_{m}}\). So

$$\begin{aligned} \hat{n}(S_{m}) \ge \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{VS}_{\mathcal F,S_{m \setminus i}})}(x_{i}). \end{aligned}$$

Therefore,

$$\begin{aligned}&\mathbb {P}\left\{ \hat{n}(S_{m}) \le (1/16) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&\le \mathbb {P}\left\{ \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{VS}_{\mathcal F,S_{m \setminus i}})}(x_{i}) \le (1/16) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&\le \mathbb {P}\left\{ \sum _{i=1}^{m} \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \le (1/16) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&= \mathbb {P}\left\{ \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) - \mathbbm {1}_{D_{m \setminus i}}(x_{i})\right. \\&\qquad \left. {}\ge \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) - (1/16) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&= \mathbb {P} \left\{ \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) - \mathbbm {1}_{D_{m \setminus i}}(x_{i})\right. \\&\qquad {}\ge \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) - \frac{1}{16} \varDelta \mathrm{B}(f^{*}, r) m ,\\&\qquad \left. \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) < \frac{7}{8} \varDelta \mathrm{B}(f^{*}, r)m\right\} \\&+ \mathbb {P} \left\{ \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) - \mathbbm {1}_{D_{m \setminus i}}(x_{i})\right. \\&\qquad {}\ge \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) - \frac{1}{16} \varDelta \mathrm{B}(f^{*}, r) m,\\&\qquad \left. \sum _{i=1}^m \mathbbm {1}_{\mathrm{DIS}( \mathrm{B}( f^{*}, r))}(x_i) \ge \frac{7}{8} \varDelta \mathrm{B}(f^{*}, r) m\right\} \\&\le \mathbb {P}\left\{ \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) < (7/8) \varDelta \mathrm{B}(f^{*},r) m \right\} \\&\qquad {} + \mathbb {P} \left\{ \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) - \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \ge (13/16) \varDelta \mathrm{B}(f^{*},r) m \right\} . \end{aligned}$$

Since we are considering the case \(\varDelta \mathrm{B}(f^{*},r) m > 512\), a Chernoff bound implies

$$\begin{aligned} \mathbb {P}&\left( \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) < (7/8) \varDelta \mathrm{B}(f^{*},r) m \right) \\&\qquad \qquad \qquad \qquad \qquad \qquad \le \exp \left\{ - \varDelta \mathrm{B}(f^{*},r) m / 128 \right\} < e^{-4}. \end{aligned}$$
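
This is the multiplicative Chernoff lower-tail bound \(\mathbb {P}\{X < (1-\delta )\mu \} \le \exp (-\delta ^{2}\mu /2)\) applied with \(\delta = 1/8\) and \(\mu = \varDelta \mathrm{B}(f^{*},r)\, m\); since \(\mu > 512\), the bound is indeed below \(e^{-4}\). A small exact-binomial check (our own sketch; the values of \(m\) and the disagreement mass are made up):

```python
import math

def binom_cdf(m, p, k):
    """Exact P[Binomial(m, p) <= k]."""
    return sum(math.comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1))

m, delta = 600, 0.9          # made-up values with mu = delta * m = 540 > 512
mu = delta * m
lower_tail = binom_cdf(m, delta, math.ceil(7 / 8 * mu) - 1)  # P[X < (7/8) mu]
chernoff = math.exp(-mu / 128)
print(lower_tail, chernoff, math.exp(-4))
assert lower_tail <= chernoff < math.exp(-4)
```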

Furthermore, Markov’s inequality implies

$$\begin{aligned} \mathbb {P}&\left( \sum _{i=1}^{m} \mathbbm {1}_{\mathrm{DIS}(\mathrm{B}(f^{*},r))}(x_{i}) - \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \ge (13/16) \varDelta \mathrm{B}(f^{*},r) m \right) \\&\qquad \qquad \qquad \qquad \qquad \le \frac{ m \varDelta \mathrm{B}(f^{*},r) - \mathbb {E}\left[ \sum _{i=1}^{m} \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \right] }{(13/16) m \varDelta \mathrm{B}(f^{*},r)}. \end{aligned}$$

Since the \(x_{i}\) values are exchangeable,

$$\begin{aligned} \mathbb {E}&\left[ \sum _{i=1}^{m} \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \right] = \sum _{i=1}^{m} \mathbb {E}\left[ \mathbb {E}\left[ \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \Big | S_{m \setminus i} \right] \right] \\&\qquad \qquad \qquad \qquad \qquad \qquad = \sum _{i=1}^{m} \mathbb {E}\left[ \varDelta _{m \setminus i} \right] = m \mathbb {E}\left[ \varDelta _{m \setminus m}\right] . \end{aligned}$$

It is shown in [9] that this quantity is at least

$$\begin{aligned} m (1-r)^{m-1} \varDelta \mathrm{B}(f^{*},r). \end{aligned}$$

In particular, when \(\varDelta \mathrm{B}(f^{*},r) m > 512\), we must have \(r < 1/511 < 1/2\), which implies \((1-r)^{\lceil 1/r \rceil - 1} \ge 1/4\), so that we have

$$\begin{aligned} \mathbb {E}\left[ \sum _{i=1}^{m} \mathbbm {1}_{D_{m \setminus i}}(x_{i}) \right] \ge (1/4) m \varDelta \mathrm{B}(f^{*},r). \end{aligned}$$
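
The elementary fact used here, \((1-r)^{\lceil 1/r \rceil - 1} \ge 1/4\) for \(r \le 1/2\), is easy to confirm numerically; the throwaway check below (ours, not the chapter's) also shows the expression approaching \(1/e \approx 0.37\) as \(r \rightarrow 0\):

```python
import math

# (1 - r)^(ceil(1/r) - 1) >= 1/4 for r <= 1/2; the value tends to 1/e as r -> 0.
for r in [0.5, 0.1, 0.01, 1 / 511, 1e-4, 1e-6]:
    value = (1 - r) ** (math.ceil(1 / r) - 1)
    assert value >= 0.25
    print(f"r={r:.2e}  (1-r)^(ceil(1/r)-1)={value:.4f}")
```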

Altogether, we have established that

$$\begin{aligned} \mathbb {P}&\left( \hat{n}(S_{m}) \le (1/16) \varDelta \mathrm{B}(f^{*},r) m \right) < \frac{ m \varDelta \mathrm{B}(f^{*},r) - (1/4) m \varDelta \mathrm{B}(f^{*},r) }{(13/16) m \varDelta \mathrm{B}(f^{*},r)} + e^{-4} \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad = \frac{12}{13} + e^{-4} < 19/20. \end{aligned}$$
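
The closing arithmetic, \((1 - 1/4)/(13/16) = 12/13\) and \(12/13 + e^{-4} \approx 0.941 < 19/20\), can be verified in a couple of lines:

```python
import math

ratio = (1 - 1 / 4) / (13 / 16)          # = 12/13
assert math.isclose(ratio, 12 / 13)
assert 12 / 13 + math.exp(-4) < 19 / 20  # ~0.9414 < 0.95
```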

Thus, since \(\hat{n}(S_{m}) \le \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) \) with probability at least 19/20, we must have that

$$\begin{aligned} \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) > (1/16) \varDelta \mathrm{B}(f^{*},r) m \ge (1/16) \frac{ \varDelta \mathrm{B}(f^{*},r) }{r}. \end{aligned}$$

\(\square \)

Proof

(of Theorem 23.2, [19]) Assuming that

$$\mathcal {B}_{\hat{n}}\!\left( m,\delta \right) = O\left( \mathrm{polylog}( m ) \log \left( \frac{1}{\delta }\right) \right) $$

holds, there exists a constant \(\delta _{1} \in (0,1/20)\) for which

$$\mathcal {B}_{\hat{n}}\!\left( m,\delta _{1}\right) = O\left( \mathrm{polylog}( m ) \right) .$$

Because \(\mathcal {B}_{\hat{n}}\!\left( m,\delta \right) \) is non-increasing with \(\delta \), \(\mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) \le \mathcal {B}_{\hat{n}}\!\left( m,\delta _{1}\right) \), and thus \(\mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) = O\left( \mathrm{polylog}(m) \right) \). Therefore,

$$\begin{aligned} \max _{m \le 1/r_{0}} \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) = O\left( \max _{m \le 1/r_{0}} \mathrm{polylog}( m ) \right) = O\left( \mathrm{polylog}\left( \frac{1}{r_{0}}\right) \right) , \end{aligned}$$

and using Lemma 23.2 we have

$$\begin{aligned} \theta (r_{0})&\le \max \left\{ \max _{m \le \lceil 1/r_{0} \rceil } 16 \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) , 512\right\} \\&\le 528 + 16 \max _{m \le 1/r_{0}} \mathcal {B}_{\hat{n}}\!\left( m,\frac{1}{20}\right) = O\left( \mathrm{polylog}\left( \frac{1}{r_{0}} \right) \right) . \end{aligned}$$

\(\square \)

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

El-Yaniv, R., Wiener, Y. (2015). On the Version Space Compression Set Size and Its Applications. In: Vovk, V., Papadopoulos, H., Gammerman, A. (eds) Measures of Complexity. Springer, Cham. https://doi.org/10.1007/978-3-319-21852-6_23

  • DOI: https://doi.org/10.1007/978-3-319-21852-6_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21851-9

  • Online ISBN: 978-3-319-21852-6
