Mining Dependence Structures from Statistical Learning Perspective

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2412)


Mining various dependence structures from data is important to many data mining applications. In this paper, several major dependence-structure mining tasks are surveyed from a statistical learning perspective, together with a number of major results on unsupervised learning models that range from a single-object world to a multi-object world. Moreover, efforts towards a key challenge of learning are discussed in three typical streams, based on generalization error bounds, the Ockham principle, and BYY harmony learning, respectively.
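To make the unsupervised-model theme concrete, here is a minimal sketch (not taken from the paper) of expectation-maximization for a two-component one-dimensional Gaussian mixture, a canonical unsupervised model of the kind such surveys cover; all function names and initialization choices below are illustrative assumptions.

```python
import math
import random

def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # Crude initialisation: anchor the two means at the data extremes.
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            p = [pi[k] * gauss_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate mixing weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return pi, mu, var

# Synthetic data: two well-separated Gaussian clusters.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(5.0, 1.0) for _ in range(200)])
pi, mu, var = em_gmm(data)
```

On this synthetic sample, the estimated means settle near the true cluster centers 0 and 5, with roughly equal mixing weights; model selection for the number of components is exactly the kind of challenge the three streams in the abstract address.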







Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Lei Xu, Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, P.R. China
