This chapter presents an intuitive understanding of statistical learning from an information-geometric point of view. We discuss a wide class of information divergences, each of which quantifies the departure between two probability density functions. In general, minimizing an information divergence over the empirical data available yields a statistical estimation method. We discuss how an information divergence is associated with a Riemannian metric and a pair of conjugate affine connections on a family of probability density functions. The most familiar example is the Kullback-Leibler divergence, which leads to the maximum likelihood method and is associated with the information metric together with the exponential and mixture connections. For the class of statistical methods obtained by minimizing a divergence, we discuss statistical properties, with a focus on robustness. As applications to statistical learning, we discuss minimum divergence methods for principal component analysis, independent component analysis, and statistical pattern recognition.
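To make the Kullback-Leibler case concrete, the following is the standard reasoning step (a textbook derivation, not text from the chapter). For a parametric family $\{p_\theta\}$ and an underlying density $p$,

$$
D_{\mathrm{KL}}(p \,\|\, p_\theta) = \int p(x) \log p(x)\, dx - \int p(x) \log p_\theta(x)\, dx,
$$

where the first term does not depend on $\theta$. Replacing the expectation in the second term by the empirical mean over observations $x_1, \dots, x_n$ gives

$$
\hat{\theta} = \operatorname*{arg\,min}_{\theta} \widehat{D}_{\mathrm{KL}}(p \,\|\, p_\theta) = \operatorname*{arg\,max}_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i),
$$

which is exactly the maximum likelihood estimator.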
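To illustrate the robustness theme, here is a minimal Python sketch (our illustration, not code from the chapter) of one concrete minimum divergence method: estimating a normal location parameter by minimizing the density power divergence of Basu et al. (1998). The function name dpd_loss, the known unit scale, the tuning parameter beta = 0.5, and the 5% contamination scenario are all arbitrary choices for the demonstration.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def dpd_loss(mu, x, beta, sigma=1.0):
    """Empirical density power divergence loss for an N(mu, sigma^2) model.

    loss(theta) = int f_theta^(1+beta) dx - (1 + 1/beta) * mean(f_theta(x_i)^beta);
    as beta -> 0 this recovers the negative log-likelihood (the KL case).
    """
    f = norm.pdf(x, loc=mu, scale=sigma)
    # The integral term has a closed form for the normal density.
    integral = (2.0 * np.pi * sigma**2) ** (-beta / 2.0) / np.sqrt(1.0 + beta)
    return integral - (1.0 + 1.0 / beta) * np.mean(f**beta)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95),   # clean observations
                    rng.normal(8.0, 1.0, 5)])   # 5% gross outliers

mle = x.mean()  # maximum likelihood = minimum empirical KL divergence
dpd = minimize(dpd_loss, x0=np.median(x), args=(x, 0.5)).x[0]
print(f"MLE (non-robust): {mle:.3f}   minimum divergence (beta=0.5): {dpd:.3f}")

The sample mean is dragged toward the outliers, while the minimum divergence estimate stays near the true location 0: each observation enters the estimating equation with weight proportional to f_theta(x)^beta, which vanishes for points far from the bulk of the model density.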
Copyright information
© 2009 Springer Science+Business Media, LLC
Cite this chapter
Eguchi, S. (2009). Information Divergence Geometry and the Application to Statistical Machine Learning. In: Emmert-Streib, F., Dehmer, M. (eds) Information Theory and Statistical Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84816-7_13
DOI: https://doi.org/10.1007/978-0-387-84816-7_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-84815-0
Online ISBN: 978-0-387-84816-7