
Accelerating adaptive online learning by matrix approximation

  • Yuanyu Wan
  • Lijun Zhang
Regular Paper

Abstract

Adaptive subgradient methods are able to leverage the second-order information of functions to improve the regret, and have become popular for online learning and optimization. According to the amount of information used, these methods can be divided into a diagonal-matrix version (ADA-DIAG) and a full-matrix version (ADA-FULL). In practice, ADA-DIAG is adopted far more often than ADA-FULL: although ADA-FULL attains smaller regret when gradients are correlated, it is computationally intractable in high dimensions. In this paper, we propose to employ matrix approximation techniques to accelerate ADA-FULL, and develop two methods based on random projections. Compared with ADA-FULL, at each iteration our methods reduce the space complexity from \(O(d^2)\) to \(O(\tau d)\) and the time complexity from \(O(d^3)\) to \(O(\tau^2 d)\), where d is the dimensionality of the data and \(\tau \ll d\) is the number of random projections. Experimental results on online convex optimization and on training convolutional neural networks show that our methods are comparable to ADA-FULL and outperform other state-of-the-art algorithms, including ADA-DIAG.
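The complexity reduction can be made concrete with a small numerical sketch. The Python snippet below is illustrative only, not the paper's actual algorithms: it contrasts the coordinate-wise ADA-DIAG accumulator with one simple way of approximating the full second-moment matrix \(\sum_t g_t g_t^\top\) by a streaming Gaussian random projection. The sketch update \(S \leftarrow S + r g^\top\) and both function names are our own assumptions, chosen so that the per-step cost matches the \(O(\tau d)\) space and \(O(\tau^2 d)\) time stated in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def ada_diag_step(x, g, h, lr=0.1, eps=1e-8):
    """ADA-DIAG: coordinate-wise accumulator, O(d) space and time."""
    h = h + g * g
    return x - lr * g / (np.sqrt(h) + eps), h

def sketched_ada_full_step(x, g, S, lr=0.1, delta=1e-3):
    """Random-projection surrogate for the ADA-FULL preconditioner (a sketch,
    not the paper's method).

    S is a tau x d sketch with E[S^T S] = sum_t g_t g_t^T, so S^T S is a
    rank-tau approximation of the full second-moment matrix, stored in
    O(tau d) memory and updated/inverted in O(tau^2 d) time.
    """
    tau = S.shape[0]
    r = rng.standard_normal(tau) / np.sqrt(tau)  # fresh random projection
    S = S + np.outer(r, g)                       # streaming sketch update

    # Thin SVD of a tau x d matrix costs O(tau^2 d), versus O(d^3) for the
    # exact inverse square root of the full d x d matrix.
    _, sig, Vt = np.linalg.svd(S, full_matrices=False)

    # Apply (S^T S + delta I)^{-1/2} to g via S^T S = V diag(sig^2) V^T:
    # (V diag(sig^2) V^T + delta I)^{-1/2}
    #   = V diag((sig^2 + delta)^{-1/2} - delta^{-1/2}) V^T + delta^{-1/2} I
    coef = 1.0 / np.sqrt(sig ** 2 + delta) - 1.0 / np.sqrt(delta)
    step = Vt.T @ (coef * (Vt @ g)) + g / np.sqrt(delta)
    return x - lr * step, S

# Toy usage: d-dimensional online updates with a tau-dimensional sketch.
d, tau = 1000, 20
x, S = np.zeros(d), np.zeros((tau, d))
for _ in range(100):
    g = rng.standard_normal(d)  # stand-in for a subgradient of the loss
    x, S = sketched_ada_full_step(x, g, S)
```

Note that the expensive object, the d x d matrix, is never formed: only the tau x d sketch is stored and factored, which is where the O(d^2) to O(tau d) and O(d^3) to O(tau^2 d) savings come from.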

Keywords

Online learning · Adaptive methods · Matrix approximation · Random projection

Acknowledgements

This work was partially supported by the National Key R&D Program of China (2018YFB1004300), the NSFC-NRF Joint Research Project (61861146001), and YESS (2017QNRC001).


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
