
Suboptimality of Penalized Empirical Risk Minimization in Classification

  • Conference paper
Learning Theory (COLT 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4539)


Abstract

Let \(\cal F\) be a set of M classification procedures with values in \([-1,1]\). Given a loss function, we want to construct a procedure that mimics, at the best possible rate, the best procedure in \(\cal F\). This fastest rate is called the optimal rate of aggregation. Considering a continuous scale of loss functions with various types of convexity, we prove that the optimal rate of aggregation is either \(((\log M)/n)^{1/2}\) or \((\log M)/n\). We prove that, if all M classifiers are binary, the (penalized) Empirical Risk Minimization procedures are suboptimal (even under the margin/low-noise condition) when the loss function is somewhat more than convex, whereas, in that case, aggregation procedures with exponential weights achieve the optimal rate of aggregation.
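
The two procedures contrasted in the abstract are easy to state concretely. Below is a minimal sketch (Python; all names, the hinge-type loss, and the temperature parameter are illustrative assumptions, not choices taken from the paper) comparing selector-type empirical risk minimization, which returns the single classifier in \(\cal F\) with the smallest empirical risk, against aggregation with exponential weights, which returns a convex combination of all M classifiers, weighted by \(\exp(-n \cdot \text{risk}_j / T)\).

```python
import numpy as np

def erm_select(preds, y, loss):
    """Selector-type ERM: return the single classifier from the
    dictionary with the smallest empirical loss."""
    # preds has shape (M, n): preds[j, i] = f_j(x_i), values in [-1, 1].
    risks = np.array([loss(p, y).mean() for p in preds])
    return preds[np.argmin(risks)]

def exp_weights(preds, y, loss, temperature):
    """Exponential-weights aggregation: a convex combination of all
    M classifiers, with classifier j weighted by exp(-n * risk_j / T)."""
    n = y.shape[0]
    risks = np.array([loss(p, y).mean() for p in preds])
    logw = -n * risks / temperature
    logw -= logw.max()            # subtract the max for numerical stability
    w = np.exp(logw)
    w /= w.sum()
    return w @ preds              # pointwise convex combination, still in [-1, 1]

# Toy usage with a hinge-type loss phi(y * f(x)) = max(0, 1 - y * f(x)).
hinge = lambda f, y: np.maximum(0.0, 1.0 - y * f)
rng = np.random.default_rng(0)
n, M = 500, 32
y = rng.choice([-1.0, 1.0], size=n)
preds = np.clip(y + rng.normal(0.0, 1.0, size=(M, n)), -1.0, 1.0)

f_erm = erm_select(preds, y, hinge)
f_agg = exp_weights(preds, y, hinge, temperature=8.0)
print("ERM risk:       ", hinge(f_erm, y).mean())
print("aggregated risk:", hinge(f_agg, y).mean())
```

Heuristically, under a strictly convex loss a convex combination can have strictly smaller risk than every single element of \(\cal F\) (by Jensen's inequality), which is why a procedure restricted to selecting one element, as ERM is, can be outperformed by an aggregate.
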



Author information

Author: G. Lecué

Editor information

Editors: Nader H. Bshouty, Claudio Gentile


Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Lecué, G. (2007). Suboptimality of Penalized Empirical Risk Minimization in Classification. In: Bshouty, N.H., Gentile, C. (eds) Learning Theory. COLT 2007. Lecture Notes in Computer Science (LNAI), vol 4539. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72927-3_12

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-72927-3_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72925-9

  • Online ISBN: 978-3-540-72927-3

  • eBook Packages: Computer Science (R0)
