
Information Divergence Geometry and the Application to Statistical Machine Learning

Chapter in Information Theory and Statistical Learning

This chapter presents an intuitive understanding of statistical learning from an information-geometric point of view. We discuss a wide class of information divergence indices that quantitatively express the departure between any two probability density functions. In general, an information divergence leads to a statistical method by minimization over the empirical data available. We discuss how an information divergence is associated with a Riemannian metric and a pair of conjugate linear connections on a family of probability density functions. The most familiar example is the Kullback–Leibler divergence, which leads to the maximum likelihood method and is associated with the information metric and the pair of exponential and mixture connections. For the class of statistical methods obtained by minimizing a divergence, we discuss statistical properties with a focus on robustness. As applications to statistical learning, we discuss minimum divergence methods for principal component analysis, independent component analysis, and statistical pattern recognition.
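The abstract's central point, that minimizing an empirical information divergence yields an estimator and that the choice of divergence governs robustness, can be illustrated with a small sketch. The example below is not taken from the chapter; it assumes a Gaussian location model with known scale, synthetic data containing a few gross outliers, and the density power divergence of Basu et al. with an illustrative tuning value beta = 0.5. Minimizing the empirical Kullback–Leibler divergence reproduces maximum likelihood (essentially the sample mean), which is pulled toward the outliers, while the minimum power divergence estimate is not.

```python
# Minimal sketch (not from the chapter): minimum divergence estimation of a
# Gaussian location parameter.  The data, beta, and fixed sigma are
# illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95),   # inliers from N(0, 1)
                    rng.normal(8.0, 1.0, 5)])   # 5% gross outliers
sigma, beta = 1.0, 0.5

def neg_log_likelihood(mu):
    # Minimizing the empirical Kullback-Leibler divergence <=> maximum likelihood.
    return -np.mean(norm.logpdf(x, loc=mu, scale=sigma))

def power_divergence_loss(mu):
    # Empirical density power divergence (terms free of mu dropped):
    #   int f_mu^(1+beta) dx  -  (1 + 1/beta) * mean(f_mu(x_i)^beta)
    f = norm.pdf(x, loc=mu, scale=sigma)
    int_f_1pb = (2 * np.pi * sigma**2) ** (-beta / 2) / np.sqrt(1 + beta)
    return int_f_1pb - (1 + 1 / beta) * np.mean(f ** beta)

mu_mle = minimize_scalar(neg_log_likelihood, bounds=(-5, 10), method="bounded").x
mu_dpd = minimize_scalar(power_divergence_loss, bounds=(-5, 10), method="bounded").x
print(f"MLE (empirical KL):        {mu_mle: .3f}")  # pulled toward the outliers
print(f"min power divergence:      {mu_dpd: .3f}")  # stays near the true value 0
```

As beta tends to zero the power divergence objective reduces to the negative log-likelihood, so the sketch also shows how the Kullback–Leibler case sits at one end of the divergence family while larger beta trades efficiency for robustness.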



Author information

Correspondence to Shinto Eguchi.


Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Eguchi, S. (2009). Information Divergence Geometry and the Application to Statistical Machine Learning. In: Emmert-Streib, F., Dehmer, M. (eds) Information Theory and Statistical Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84816-7_13


  • DOI: https://doi.org/10.1007/978-0-387-84816-7_13

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-84815-0

  • Online ISBN: 978-0-387-84816-7

  • eBook Packages: Computer Science, Computer Science (R0)
