Stochastic Learning

  • Léon Bottou
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3176)


This contribution presents an overview of the theoretical and practical aspects of the broad family of learning algorithms based on Stochastic Gradient Descent, including Perceptrons, Adalines, K-Means, LVQ, Multi-Layer Networks, and Graph Transformer Networks.


Loss Function; Gradient Descent; Stochastic Approximation; Fisher Information Matrix; Neural Information Processing System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Léon Bottou (1)
  1. NEC Labs of America, Princeton, USA
