Abstract
Mining various dependence structures from data is important to many data mining applications. In this paper, several major dependence-structure mining tasks are surveyed from a statistical learning perspective, together with a number of major results on unsupervised learning models that range from a single-object world to a multi-object world. Moreover, efforts towards a key challenge to learning are discussed in three typical streams, based respectively on generalization error bounds, the Ockham principle, and BYY harmony learning.
The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong SAR (Project No. CUHK4383/99E).
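The three streams named in the abstract all address how to choose a model's scale or complexity. As a rough illustration of the Ockham-principle stream only (complexity-penalizing criteria in the spirit of AIC/BIC/MDL), the sketch below fits Gaussian mixtures of increasing order and keeps the order that minimizes BIC. It is a minimal sketch on assumed synthetic data, using scikit-learn's GaussianMixture for the EM fitting; it is not the paper's BYY harmony method.

```python
# A minimal sketch of the Ockham-principle stream: penalize model
# complexity via BIC when choosing the number of mixture components.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Assumed synthetic data: three well-separated 2-D Gaussian clusters.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(4.0, 0.0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(2.0, 3.0), scale=0.5, size=(200, 2)),
])

bic = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bic[k] = gm.bic(X)  # BIC = -2 * log-likelihood + (free params) * log n

best_k = min(bic, key=bic.get)
print("BIC per k:", {k: round(v, 1) for k, v in bic.items()})
print("selected number of components:", best_k)  # expect 3
```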
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
Cite this paper
Xu, L. (2002). Mining Dependence Structures from Statistical Learning Perspective. In: Yin, H., Allinson, N., Freeman, R., Keane, J., Hubbard, S. (eds) Intelligent Data Engineering and Automated Learning — IDEAL 2002. IDEAL 2002. Lecture Notes in Computer Science, vol 2412. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45675-9_47
DOI: https://doi.org/10.1007/3-540-45675-9_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44025-3
Online ISBN: 978-3-540-45675-9