Journal of Global Optimization, Volume 47, Issue 3, pp 369–401

Machine learning problems from optimization perspective



Both optimization and learning play important roles in a system for intelligent tasks. On the one hand, we introduce three types of optimization tasks studied in the machine learning literature, corresponding to the three levels of inverse problems in an intelligent system. We also discuss three major roles of convexity in machine learning: either leading directly to a convex program, or approximately transforming a difficult problem into a tractable one with the help of local convexity and convex duality. No doubt, a good optimization algorithm plays an essential role in a learning process, and new developments in the optimization literature may drive advances in machine learning. On the other hand, we argue that the key task of learning is not simply optimization, as is sometimes misunderstood in the optimization literature. We introduce the key challenges of learning and the current status of efforts to address them. Furthermore, learning versus optimization is also examined from a unified perspective under the name of Bayesian Ying-Yang learning, with combinatorial optimization made more effective with the help of learning.


Three levels of inverse problems · Parameter learning · Model selection · Local convexity · Convex duality · Learning versus optimization · Convex programming · Bayesian Ying-Yang learning · Automatic model selection · Learning based combinatorial optimization





Copyright information

© Springer Science+Business Media, LLC. 2008

Authors and Affiliations

  1. Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, Hong Kong, China
