A Trend on Regularization and Model Selection in Statistical Learning: A Bayesian Ying Yang Learning Perspective

Part of the Studies in Computational Intelligence book series (SCI, volume 63)


This chapter summarizes advances in regularization and model selection in statistical learning and discusses a trend from a Bayesian Ying-Yang learning perspective. After briefly introducing the Bayesian Ying-Yang system and best harmony learning, it addresses not only their advantages of automatic model selection and of integrating regularization with model selection, but also their differences from, and relations to, several typical existing learning methods. Taking Gaussian mixture, local subspaces, and local factor analysis as example tasks, detailed model selection criteria are given, and a general learning procedure is provided that unifies the adaptive algorithms featuring automatic model selection for these tasks. Finally, a trend in model selection studies (i.e., automatic model selection during parametric learning) is further elaborated, together with several theoretical issues arising with a large sample size and a number of challenges arising with a small sample size.
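The abstract contrasts automatic model selection during parametric learning with conventional two-stage criteria such as AIC and BIC. As background, the sketch below illustrates the conventional two-stage approach for choosing the number k of Gaussian mixture components: fit a model for each candidate k, then pick the k minimizing a criterion. The log-likelihood values are hypothetical placeholders for illustration, not results from the chapter.

```python
import math

def aic(log_lik: float, k: int) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln L (smaller is better)."""
    return 2 * k - 2 * log_lik

def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian information criterion: BIC = k ln n - 2 ln L (smaller is better)."""
    return k * math.log(n) - 2 * log_lik

def gmm_free_params(n_components: int, dim: int) -> int:
    """Free parameters of a Gaussian mixture with full covariances:
    (k - 1) mixing weights + k*d mean entries + k*d(d+1)/2 covariance entries."""
    return (n_components - 1) + n_components * dim \
        + n_components * dim * (dim + 1) // 2

# Hypothetical maximized log-likelihoods for k = 1..3 components fitted
# (e.g., by EM) to n = 500 two-dimensional samples.
log_liks = {1: -2000.0, 2: -1700.0, 3: -1690.0}
scores = {k: bic(ll, gmm_free_params(k, dim=2), n=500)
          for k, ll in log_liks.items()}
best_k = min(scores, key=scores.get)  # candidate with the smallest BIC
```

Note the contrast with the chapter's theme: this two-stage procedure requires fitting every candidate k, whereas BYY harmony learning aims to determine k automatically within a single parametric-learning run.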

Key words

Statistical learning · Model selection · Regularization · Bayesian Ying-Yang system · Best harmony learning · Best matching · Best fitting · AIC · BIC · Automatic model selection · Gaussian mixture · Local factor analysis · Theoretical issues · Challenges





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Lei Xu
  1. Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, P.R. China
