Elements of Computational Learning Theory

  • Ke-Lin Du
  • M. N. S. Swamy


PAC learning theory is the foundation of computational learning theory. VC-dimension, Rademacher complexity, and the empirical risk-minimization principle are three key concepts for deriving generalization error bounds for a trained machine. The fundamental theorem of learning theory relates PAC learnability, VC-dimension, and the empirical risk-minimization principle. Another basic result in computational learning theory is the no-free-lunch theorem. These topics are addressed in this chapter.
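As a small illustration of one of these concepts (a sketch, not code from the chapter): the empirical Rademacher complexity of a finite hypothesis class, R̂(H) = E_σ[sup_{h∈H} (1/n) Σ_i σ_i h(x_i)], can be estimated by Monte Carlo over random sign vectors σ. The threshold-classifier class and all names below are hypothetical, chosen only for the example.

```python
import random

def empirical_rademacher(hypotheses, sample, n_trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_hat(H) = E_sigma[ sup_{h in H} (1/n) * sum_i sigma_i * h(x_i) ]
    for a finite class of {-1, +1}-valued hypotheses."""
    rng = random.Random(seed)
    n = len(sample)
    # Precompute each hypothesis's predictions on the fixed sample.
    preds = [[h(x) for x in sample] for h in hypotheses]
    total = 0.0
    for _ in range(n_trials):
        # Draw i.i.d. Rademacher signs, then take the sup over the class.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * p for s, p in zip(sigma, row)) / n
                     for row in preds)
    return total / n_trials

# Hypothetical class: 11 threshold classifiers h_t(x) = sign(x - t) on [0, 1].
thresholds = [i / 10 for i in range(11)]
H = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]
sample = [i / 19 for i in range(20)]  # 20 evenly spaced points

print(round(empirical_rademacher(H, sample), 3))
```

For a finite class, Massart's lemma bounds this quantity by sqrt(2 ln|H| / n), so the printed estimate should fall well below sqrt(2 ln 11 / 20) ≈ 0.49; a tighter bound of this kind is one ingredient in the generalization error bounds discussed in the chapter.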



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
  2. Xonlink Inc., Hangzhou, China