Machine Learning, Volume 79, Issue 1–2, pp 73–103

Composite kernel learning

  • Marie Szafranski
  • Yves Grandvalet
  • Alain Rakotomamonjy
Article

Abstract

The Support Vector Machine is an acknowledged powerful tool for building classifiers, but it lacks flexibility, in the sense that the kernel is chosen prior to learning. Multiple Kernel Learning enables the kernel to be learned from an ensemble of basis kernels, whose combination is optimized in the learning process. Here, we propose Composite Kernel Learning to address the situation where distinct components give rise to a group structure among kernels. Our formulation of the learning problem encompasses several setups, putting more or less emphasis on the group structure. We characterize the convexity of the learning problem and provide a general wrapper algorithm for computing solutions. Finally, we illustrate the behavior of our method on multi-channel data, where groups correspond to channels.
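The abstract describes a kernel built as a weighted combination of basis kernels, with a group structure induced by the channels of the data. The minimal Python sketch below is not the authors' implementation; it only illustrates what a composite kernel looks like once the weights are known, using hypothetical helpers (rbf_kernel, composite_kernel) and hand-set uniform weights, with scikit-learn's precomputed-kernel SVC as the classifier. In Composite Kernel Learning the weights would instead be optimized under a mixed-norm penalty by the wrapper algorithm mentioned in the abstract.

    import numpy as np
    from sklearn.svm import SVC

    def rbf_kernel(X1, X2, gamma):
        # Gaussian basis kernel between the rows of X1 and X2.
        sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                    + np.sum(X2 ** 2, axis=1)[None, :]
                    - 2.0 * X1 @ X2.T)
        return np.exp(-gamma * sq_dists)

    def composite_kernel(X1, X2, groups, gammas, weights):
        # Weighted sum of basis kernels; each group acts on the columns
        # (one channel) listed in `groups`, with one weight per basis kernel.
        K = np.zeros((X1.shape[0], X2.shape[0]))
        for g, cols in enumerate(groups):
            for gamma in gammas:
                K += weights[(g, gamma)] * rbf_kernel(X1[:, cols], X2[:, cols], gamma)
        return K

    # Toy usage: a 4-dimensional signal split into two 2-dimensional "channels",
    # three bandwidths per channel, and uniform (hand-set) kernel weights.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 4))
    y = np.where(X[:, 0] + X[:, 2] > 0, 1, -1)
    groups = [np.array([0, 1]), np.array([2, 3])]
    gammas = [0.1, 1.0, 10.0]
    weights = {(g, gamma): 1.0 / 6.0 for g in range(len(groups)) for gamma in gammas}

    K_train = composite_kernel(X, X, groups, gammas, weights)
    clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y)
    print(clf.score(K_train, y))

Setting a group's weights to zero removes the corresponding channel altogether, which is how the group structure yields channel selection in the multi-channel experiments described in the abstract.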

Keywords

Supervised learning · Support vector machine · Kernel learning · Structured kernels · Feature selection and sparsity


Copyright information

© The Author(s) 2009

Authors and Affiliations

  • Marie Szafranski (1, 2)
  • Yves Grandvalet (3)
  • Alain Rakotomamonjy (4)
  1. CNRS FRE 3190—IBISC, Université d’Évry Val d’Essonne, Évry Cedex, France
  2. CNRS UMR 6166—LIF, Universités d’Aix-Marseille, Marseille, France
  3. CNRS UMR 6599—Heudiasyc, Université de Technologie de Compiègne, Compiègne Cedex, France
  4. EA 4108—LITIS, Université de Rouen, Saint-Étienne-du-Rouvray Cedex, France
