Paladyn. Journal of Behavioral Robotics, Volume 3, Issue 3, pp 128–135

Adaptive exploration through covariance matrix adaptation enables developmental motor learning

  • Freek Stulp
  • Pierre-Yves Oudeyer
Research Article


The “Policy Improvement with Path Integrals” (PI²) [25] and “Covariance Matrix Adaptation Evolution Strategy” (CMA-ES) [8] algorithms are considered state of the art in direct reinforcement learning and stochastic optimization, respectively. We have recently shown that incorporating covariance matrix adaptation into PI², which yields the PI²-CMA algorithm, enables adaptive exploration by continually and autonomously reconsidering the exploration/exploitation trade-off. In this article, we provide an overview of our recent work on covariance matrix adaptation for direct reinforcement learning [22–24], highlight its relevance to developmental robotics, and conduct further experiments to analyze the results. We investigate two complementary phenomena from developmental robotics. First, we demonstrate PI²-CMA's ability to adapt to slowly or abruptly changing tasks through its continual, adaptive exploration; this is an important component of life-long skill learning in dynamic environments. Second, we show on a reaching task how PI²-CMA successively releases degrees of freedom from proximal to more distal limbs as learning progresses. A similar effect is observed in human development, where it is known as ‘proximodistal maturation’.
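The covariance matrix adaptation at the heart of this approach can be sketched as a reward-weighted update of both the policy-parameter mean and the exploration covariance: perturbations with low cost receive high weight, and the same weights shape the next exploration distribution. The Python sketch below is illustrative only; the function name and the simple exponentiated-cost weighting are our own simplifications, and the full PI²-CMA algorithm in [24] includes further details omitted here.

```python
import numpy as np

def reward_weighted_update(theta, Sigma, cost_fn, n_samples=10, h=10.0, rng=None):
    """One reward-weighted update of the policy mean and exploration covariance.

    Samples parameter perturbations from N(theta, Sigma), weights them by
    exponentiated normalized cost, and updates mean and covariance from the
    same weighted samples (the core idea behind PI2-CMA, simplified).
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.multivariate_normal(theta, Sigma, size=n_samples)
    costs = np.array([cost_fn(s) for s in samples])

    # Softmax-style weighting: the lowest-cost sample gets the highest weight.
    c_min, c_max = costs.min(), costs.max()
    weights = np.exp(-h * (costs - c_min) / (c_max - c_min + 1e-10))
    weights /= weights.sum()

    # Weighted mean update (exploitation).
    theta_new = weights @ samples

    # Weighted covariance update (adaptive exploration): a nonnegatively
    # weighted sum of outer products, hence symmetric positive semidefinite.
    diff = samples - theta
    Sigma_new = (weights[:, None, None]
                 * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0)
    return theta_new, Sigma_new
```

Because the covariance is re-estimated from the weighted samples at every update, exploration magnitude grows when progress is possible and shrinks as the policy converges, which is the adaptive exploration/exploitation trade-off discussed above.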


Keywords: reinforcement learning · covariance matrix adaptation · developmental robotics · adaptive exploration · proximodistal maturation




  [1] L. Arnold, A. Auger, N. Hansen, and Y. Ollivier. Information-geometric optimization algorithms: A unifying picture via invariance principles. Technical report, INRIA Saclay, 2011.
  [2] A. Baranes and P.-Y. Oudeyer. The interaction of maturational constraints and intrinsic motivations in active motor development. In IEEE International Conference on Development and Learning, 2011.
  [3] N. E. Berthier, R. K. Clifton, D. D. McCall, and D. J. Robin. Proximodistal structure of early reaching in human infants. Experimental Brain Research, 1999.
  [4] L. Berthouze and M. Lungarella. Motor skill acquisition under environmental perturbations: On the necessity of alternate freezing and freeing degrees of freedom. Adaptive Behavior, 12(1):47–63, 2004.
  [5] J. C. Bongard. Morphological change in machines accelerates the evolution of robust behavior. Proceedings of the National Academy of Sciences (PNAS), January 2010.
  [6] R. I. Brafman and M. Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, March 2003.
  [7] T. Glasmachers, T. Schaul, S. Yi, D. Wierstra, and J. Schmidhuber. Exponential natural evolution strategies. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pages 393–400. ACM, 2010.
  [8] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
  [9] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2002.
  [10] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2–3):209–232, 2002.
  [11] J. Konczak, M. Borutta, T. Helge, and J. Dichgans. The development of goal-directed reaching in infants: hand trajectory formation and joint torque control. Experimental Brain Research, 1995.
  [12] A. Miyamae, Y. Nagata, I. Ono, and S. Kobayashi. Natural policy gradient methods with parameter-based exploration for control tasks. Advances in Neural Information Processing Systems, 2:437–441, 2010.
  [13] Y. Nagai, M. Asada, and K. Hosoda. Learning for joint attention helped by functional development. Advanced Robotics, 20(10), 2006.
  [14] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7–9):1180–1190, 2008.
  [15] R. Ros and N. Hansen. A simple modification in CMA-ES achieving linear time and space complexity. In Proceedings of Parallel Problem Solving from Nature, pages 296–305, 2008.
  [16] T. Rückstieß, F. Sehnke, T. Schaul, D. Wierstra, Y. Sun, and J. Schmidhuber. Exploring parameter space in reinforcement learning. Paladyn. Journal of Behavioral Robotics, 1:14–24, 2010.
  [17] A. Saltelli, K. Chan, and E. M. Scott. Sensitivity Analysis. Wiley, Chichester, 2000.
  [18] S. Schaal. The SL simulation and real-time control software package. Technical report, University of Southern California, 2007.
  [19] M. Schlesinger, D. Parisi, and J. Langer. Learning to reach by constraining the movement search space. Developmental Science, 3:67–80, 2000.
  [20] F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.
  [21] A. Stout, G. D. Konidaris, and A. G. Barto. Intrinsically motivated reinforcement learning: A promising framework for developmental robot learning. In AAAI, 2005.
  [22] F. Stulp. Adaptive exploration for continual reinforcement learning. In International Conference on Intelligent Robots and Systems (IROS), 2012.
  [23] F. Stulp and P.-Y. Oudeyer. Emergent proximo-distal maturation through adaptive exploration. In International Conference on Development and Learning (ICDL), 2012. Paper of Excellence Award.
  [24] F. Stulp and O. Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
  [25] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137–3181, 2010.
  [26] S. B. Thrun. Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie Mellon University, 1992.
  [27] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

Copyright information

© Versita Warsaw and Springer-Verlag Wien 2012

Authors and Affiliations

  1. Robotics and Computer Vision, ENSTA-ParisTech, Paris, France
  2. FLOWERS Team, INRIA Bordeaux Sud-Ouest, Talence, France
