Machine Learning, Volume 66, Issue 2–3, pp 297–319

Feature space perspectives for learning the kernel

  • Charles A. Micchelli
  • Massimiliano Pontil


Abstract

In this paper, we continue our study of learning an optimal kernel in a prescribed convex set of kernels (Micchelli & Pontil, 2005). We present a reformulation of this problem within a feature space environment. This leads us to study regularization in the dual space of all continuous functions on a compact domain, with values in a Hilbert space, equipped with a mixed norm. In a special case we also relate this problem to \({\cal L}^p\) regularization.
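To make the setting concrete: a common instance of learning a kernel in a prescribed convex set is to take the set of convex combinations of finitely many base kernels and minimize a regularized objective over the combination weights. The sketch below is illustrative only (it is not the authors' algorithm): it minimizes the kernel ridge regression dual objective \(J(\lambda) = y^{\top}(K(\lambda) + \mu I)^{-1} y\), which is convex in \(\lambda\), by projected gradient descent over the simplex. All names (`learn_kernel_weights`, `simplex_projection`) and the step-size/iteration defaults are hypothetical choices for the example.

```python
# Illustrative sketch: learning convex combination weights of prescribed
# base kernels for kernel ridge regression. NOT the paper's algorithm;
# a minimal projected-gradient scheme under simple assumptions.
import numpy as np

def simplex_projection(v):
    # Euclidean projection of v onto the probability simplex
    # {lam : lam_i >= 0, sum_i lam_i = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def learn_kernel_weights(Ks, y, mu=0.1, steps=200, lr=0.05):
    """Minimize J(lam) = y^T (K(lam) + mu I)^{-1} y over the simplex,
    where K(lam) = sum_i lam_i K_i; J is convex in lam."""
    m, n = len(Ks), len(y)
    lam = np.full(m, 1.0 / m)            # start at the barycenter
    for _ in range(steps):
        K = sum(l * Ki for l, Ki in zip(lam, Ks))
        alpha = np.linalg.solve(K + mu * np.eye(n), y)
        # dJ/dlam_i = -alpha^T K_i alpha (from d(K+muI)^{-1}/dlam_i)
        grad = np.array([-alpha @ Ki @ alpha for Ki in Ks])
        lam = simplex_projection(lam - lr * grad)
    return lam
```

In this formulation the optimizer tends to concentrate weight on few base kernels, which is one way the sparsity phenomenon mentioned in the keywords manifests itself.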


Keywords: Banach space regularization · Convex optimization · Learning the kernel · Kernel methods · Sparsity


References

  1. Argyriou, A., Micchelli, C. A., & Pontil, M. (2005). Learning convex combinations of continuously parameterized basic kernels. In Proc. 18th Annual Conference on Learning Theory (COLT'05), Bertinoro, Italy.
  2. Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404.
  3. Bach, F. R., Lanckriet, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. of the Int. Conf. on Machine Learning (ICML'04).
  4. Borwein, J. M., & Lewis, A. S. (2000). Convex analysis and nonlinear optimization: Theory and examples. CMS (Canadian Mathematical Society), Springer-Verlag, New York.
  5. Bousquet, O., & Herrmann, D. J. L. (2003). On the complexity of learning the kernel matrix. Advances in Neural Information Processing Systems, 15.
  6. Chen, S. S., Donoho, D. L., & Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1), 33–61.
  7. Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. S. (2002). On kernel-target alignment. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, vol. 14.
  8. Fung, G. M., & Mangasarian, O. L. (2004). A feature selection Newton method for support vector machine classification. Comput. Optim. Appl., 28(2), 185–202.
  9. Gunn, S. R., & Kandola, J. S. (2002). Structural modelling with sparse kernels. Machine Learning, 48(1), 137–163.
  10. Herbster, M. (2004). Relative loss bounds and polynomial-time predictions for the K-LMS-NET algorithm. In Proc. of the 15th Int. Conference on Algorithmic Learning Theory.
  11. Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semi-definite programming. J. of Machine Learning Research, 5, 27–72.
  12. Lee, Y., Kim, Y., Lee, S., & Koo, J.-Y. (2004). Structured multicategory support vector machine with ANOVA decomposition. Technical Report No. 743, Department of Statistics, The Ohio State University.
  13. Lin, Y., & Zhang, H. H. (2003). Component selection and smoothing in smoothing spline analysis of variance models (COSSO). Institute of Statistics Mimeo Series 2556, NCSU.
  14. Micchelli, C. A. (1992). Curves from variational principles. Mathematical Modeling and Numerical Analysis, 26, 77–93.
  15. Micchelli, C. A., & Pinkus, A. (1994). Variational problems arising from balancing different error criteria. Rendiconti di Matematica, Serie VII, 14, 37–86.
  16. Micchelli, C. A., & Pontil, M. (2004). A function representation for learning in Banach spaces. In Proc. of the 17th Annual Conference on Learning Theory (COLT'04), Banff, Alberta.
  17. Micchelli, C. A., & Pontil, M. (2005). On learning vector-valued functions. Neural Computation, 17, 177–204.
  18. Micchelli, C. A., & Pontil, M. (2005). Learning the kernel function via regularization. J. of Machine Learning Research, 6, 1099–1125.
  19. Micchelli, C. A., Pontil, M., Wu, Q., & Zhou, D. X. (2005). Error bounds for learning the kernel. Research Note 05/09, Dept. of Computer Science, University College London.
  20. Ong, C. S., Smola, A. J., & Williamson, R. C. (2003). Hyperkernels. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in Neural Information Processing Systems, vol. 15, MIT Press, Cambridge, MA.
  21. Royden, H. L. (1964). Real analysis, 2nd edition. Macmillan Publishing Company, New York.
  22. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58, 267–288.
  23. Wahba, G. (1990). Spline models for observational data. Series in Applied Mathematics, vol. 59, SIAM, Philadelphia.
  24. Wu, Q., Ying, Y., & Zhou, D. X. Multi-kernel regularization classifiers. J. of Complexity (to appear).

Copyright information

© Springer Science + Business Media, LLC 2007

Authors and Affiliations

  1. Department of Mathematics and Statistics, State University of New York, The University at Albany, Albany, USA
  2. Department of Computer Science, University College London, London, England, UK