Information-Theoretic Transfer Learning Framework for Bayesian Optimisation

  • Anil Ramachandran
  • Sunil Gupta
  • Santu Rana
  • Svetha Venkatesh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11052)


Transfer learning in Bayesian optimisation is a popular way to alleviate the “cold start” issue. However, most existing transfer learning algorithms measure similarity over the whole function space rather than using a measure better aligned with Bayesian optimisation, namely one based on the location of the optima. This makes these algorithms fragile to noisy perturbations and even to simple scaling of function values. In this paper, we propose a robust transfer learning approach that transfers knowledge of the optima using a consistent probabilistic framework. From the finite samples for both source and target, a distribution on the optima is computed, and the divergences between these distributions are then used to compute similarities. Based on the similarities, a mixture distribution is constructed, which is then used to build a new information-theoretic acquisition function in a manner similar to Predictive Entropy Search (PES). The proposed approach also offers desirable “no bias” transfer learning in the limit. Experiments on both synthetic functions and a set of hyperparameter tuning tests demonstrate the effectiveness of our approach compared to existing transfer learning methods. Code related to this paper is available at: and Data related to this paper is available at:
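The similarity step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that samples of the optimum location are already available for the target and for each source (for example from posterior draws), and it swaps in a 1-nearest-neighbour estimator of the Kullback-Leibler divergence, an exponential mapping from divergence to similarity, and a `temperature` parameter, all of which are illustrative choices. The function names `knn_kl_divergence`, `mixture_weights` and `sample_mixture` are hypothetical.

```python
import numpy as np


def knn_kl_divergence(p_samples, q_samples):
    """1-nearest-neighbour estimate of KL(P || Q) from samples of shape (n, d) and (m, d)."""
    p = np.asarray(p_samples, dtype=float)
    q = np.asarray(q_samples, dtype=float)
    n, d = p.shape
    m = q.shape[0]
    tiny = 1e-12  # guards against log(0) when samples coincide

    # rho[i]: distance from p[i] to its nearest neighbour among the other p samples.
    pp = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    np.fill_diagonal(pp, np.inf)
    rho = pp.min(axis=1)

    # nu[i]: distance from p[i] to its nearest neighbour among the q samples.
    nu = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1).min(axis=1)

    return d * np.mean(np.log((nu + tiny) / (rho + tiny))) + np.log(m / (n - 1))


def mixture_weights(divergences, temperature=1.0):
    """Map target-to-source divergences to mixture weights; index 0 is the target itself."""
    # Assumption: similarity = exp(-divergence / temperature); the target has divergence 0.
    div = np.concatenate(([0.0], np.maximum(divergences, 0.0)))
    sim = np.exp(-div / temperature)
    return sim / sim.sum()


def sample_mixture(components, weights, size, rng):
    """Draw optimum locations from the weighted mixture of empirical optimum samples."""
    which = rng.choice(len(components), size=size, p=weights)
    return np.stack([components[k][rng.integers(len(components[k]))] for k in which])


rng = np.random.default_rng(0)
target = rng.normal(0.0, 0.1, size=(200, 2))    # optimum samples for the target
source_a = rng.normal(0.0, 0.1, size=(200, 2))  # source whose optimum matches the target
source_b = rng.normal(3.0, 0.1, size=(200, 2))  # source whose optimum lies elsewhere

divs = np.array([knn_kl_divergence(target, s) for s in (source_a, source_b)])
weights = mixture_weights(divs)
draws = sample_mixture([target, source_a, source_b], weights, size=500, rng=rng)
print(divs, weights, draws.shape)
```

In this toy setting the source whose optimum coincides with the target receives a weight comparable to the target's own, while the source whose optimum lies elsewhere receives a negligible weight. Because the comparison uses only optimum locations, rescaling a source's function values would not change its weight.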



This research was partially funded by the Australian Government through the Australian Research Council (ARC) and the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning. Professor Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Anil Ramachandran (1)
  • Sunil Gupta (1)
  • Santu Rana (1)
  • Svetha Venkatesh (1)
  1. Centre for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Geelong, Australia
