Machine Learning, Volume 63, Issue 1, pp 69–101

Model-based transductive learning of the kernel matrix

  • Zhihua Zhang
  • James T. Kwok
  • Dit-Yan Yeung

Abstract

This paper addresses the problem of transductive learning of the kernel matrix from a probabilistic perspective. We place a Wishart process prior on the kernel matrix and construct a hierarchical generative model for kernel matrix learning. Specifically, we consider the target kernel matrix as a random matrix following the Wishart distribution with a positive definite parameter matrix and a degree-of-freedom parameter. The parameter matrix, in turn, has the inverted Wishart distribution (with a positive definite hyperparameter matrix) as its conjugate prior, and the degree of freedom is equal to the dimensionality of the feature space induced by the target kernel. Casting kernel matrix learning as a missing data problem, we devise an expectation-maximization (EM) algorithm to infer the missing data, the parameter matrix and the feature dimensionality in a maximum a posteriori (MAP) manner. Using different settings for the target kernel and hyperparameter matrices, our model can be applied to different types of learning problems. In particular, we consider its application in a semi-supervised learning setting and present two classification methods. Classification experiments on several benchmark data sets yield encouraging results. In addition, we devise an EM algorithm for kernel matrix completion.
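The hierarchical structure described in the abstract can be sketched in a few lines of code. The snippet below only samples from the generative model as stated above and is not the authors' algorithm: the degree-of-freedom values, the identity hyperparameter matrix Psi, and the Sigma/p scaling (chosen so that the sampled kernel matrix has mean Sigma) are illustrative assumptions, and the EM-based MAP inference itself is not reproduced here.

```python
# Minimal sketch of the hierarchical model described in the abstract:
# the kernel matrix K follows a Wishart distribution whose positive definite
# parameter matrix Sigma itself has an inverted Wishart (conjugate) prior.
# All concrete values below are illustrative assumptions.
import numpy as np
from scipy.stats import wishart, invwishart

n = 5    # number of data points (size of the kernel matrix)
p = 10   # assumed dimensionality of the induced feature space (Wishart degrees of freedom)

Psi = np.eye(n)  # hyperparameter matrix of the inverted Wishart prior (identity for illustration)
nu = n + 2       # illustrative degrees of freedom for the inverted Wishart prior

# Draw the parameter matrix Sigma ~ IW(nu, Psi), then the kernel matrix
# K | Sigma ~ W(p, Sigma / p), so that E[K | Sigma] = Sigma.
Sigma = invwishart(df=nu, scale=Psi).rvs(random_state=0)
K = wishart(df=p, scale=Sigma / p).rvs(random_state=1)

print("sampled kernel matrix K:\n", np.round(K, 3))
print("symmetric:", np.allclose(K, K.T))
print("positive definite:", np.all(np.linalg.eigvalsh(K) > 0))
```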

Keywords

Kernel learning, Wishart process, Bayesian inference, Transductive learning, EM algorithm, MAP estimation, Semi-supervised learning

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Department of Computer Science, Hong Kong University of Science and Technology, Kowloon, Hong Kong