Abstract
The “Policy Improvement with Path Integrals” (PI2) [25] and “Covariance Matrix Adaptation Evolution Strategy” (CMA-ES) [8] algorithms are considered state-of-the-art in direct reinforcement learning and stochastic optimization, respectively. We have recently shown that incorporating covariance matrix adaptation into PI2, which yields the PI2-CMA algorithm, enables adaptive exploration by continually and autonomously reconsidering the exploration/exploitation trade-off. In this article, we provide an overview of our recent work on covariance matrix adaptation for direct reinforcement learning [22–24], highlight its relevance to developmental robotics, and conduct further experiments to analyze the results. We investigate two complementary phenomena from developmental robotics. First, we demonstrate PI2-CMA’s ability to adapt to slowly or abruptly changing tasks through its continual, adaptive exploration; this is an important component of life-long skill learning in dynamic environments. Second, we show on a reaching task how PI2-CMA progressively releases degrees of freedom from proximal to more distal limbs as learning progresses. A similar effect is observed in human development, where it is known as ‘proximodistal maturation’.
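The core mechanism the abstract refers to is reward-weighted averaging: sampled policy parameters are weighted by an exponentiated cost, and both the mean and the covariance of the exploration distribution are updated from those weights, so exploration magnitude adapts automatically. The following is a minimal black-box sketch of one such generation, not the authors' exact implementation (which operates on dynamical movement primitive parameters with per-time-step weighting); the function name, the sample count `K`, and the temperature `h` are illustrative assumptions.

```python
import numpy as np

def pi2cma_update(theta_mean, Sigma, cost_fn, K=20, h=10.0, rng=None):
    """One generation of a PI2-CMA-style update (simplified, black-box form)."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample K exploratory parameter vectors from the current Gaussian.
    samples = rng.multivariate_normal(theta_mean, Sigma, size=K)
    costs = np.array([cost_fn(th) for th in samples])
    # Path-integral-style weights: exponentiate min-max normalized costs
    # so the lowest-cost samples dominate, then normalize to probabilities.
    c = (costs - costs.min()) / (costs.max() - costs.min() + 1e-12)
    P = np.exp(-h * c)
    P /= P.sum()
    # Reward-weighted averaging updates the mean AND the covariance,
    # which is what lets exploration grow or shrink autonomously.
    new_mean = P @ samples
    diffs = samples - theta_mean
    new_Sigma = sum(P[k] * np.outer(diffs[k], diffs[k]) for k in range(K))
    return new_mean, new_Sigma
```

Because the covariance is re-estimated from the weighted samples each generation, it shrinks along directions where the task is solved and stays large along directions still being explored, which is the basis of the proximodistal freeing of degrees of freedom discussed in the article.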
References
L. Arnold, A. Auger, N. Hansen, and Y. Ollivier. Information-geometric optimization algorithms: A unifying picture via invariance principles. Technical report, INRIA Saclay, 2011.
A. Baranes and P-Y. Oudeyer. The interaction of maturational constraints and intrinsic motivations in active motor development. In IEEE International Conference on Development and Learning, 2011.
N. E. Berthier, R. K. Clifton, D. D. McCall, and D. J. Robin. Proximodistal structure of early reaching in human infants. Experimental Brain Research, 1999.
L. Berthouze and M. Lungarella. Motor skill acquisition under environmental perturbations: On the necessity of alternate freezing and freeing degrees of freedom. Adaptive Behavior, 12(1): 47–63, 2004.
Josh C. Bongard. Morphological change in machines accelerates the evolution of robust behavior. Proceedings of the National Academy of Sciences of the United States of America (PNAS), January 2010.
Ronen I. Brafman and Moshe Tennenholtz. R-max — a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, March 2003. ISSN 1532-4435.
T. Glasmachers, T. Schaul, S. Yi, D. Wierstra, and J. Schmidhuber. Exponential natural evolution strategies. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pages 393–400. ACM, 2010.
N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
A. J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2002.
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49(2–3):209–232, 2002. ISSN 0885-6125.
J. Konczak, M. Borutta, H. Topka, and J. Dichgans. The development of goal-directed reaching in infants: hand trajectory formation and joint torque control. Experimental Brain Research, 1995.
A. Miyamae, Y. Nagata, I. Ono, and S. Kobayashi. Natural policy gradient methods with parameter-based exploration for control tasks. Advances in Neural Information Processing Systems, 2:437–441, 2010.
Y. Nagai, M. Asada, and K. Hosoda. Learning for joint attention helped by functional development. Advanced Robotics, 20(10), 2006.
Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7–9):1180–1190, 2008.
R. Ros and N. Hansen. A simple modification in CMA-ES achieving linear time and space complexity. In Proceedings of Parallel Problem Solving from Nature (PPSN), pages 296–305, 2008.
Thomas Rückstiess, Frank Sehnke, Tom Schaul, Daan Wierstra, Yi Sun, and Jürgen Schmidhuber. Exploring parameter space in reinforcement learning. Paladyn. Journal of Behavioral Robotics, 1:14–24, 2010. ISSN 2080-9778.
A. Saltelli, K. Chan, and E. M. Scott. Sensitivity analysis. Chichester: Wiley, 2000.
Stefan Schaal. The SL simulation and real-time control software package. Technical report, University of Southern California, 2007.
Matthew Schlesinger, Domenico Parisi, and Jonas Langer. Learning to reach by constraining the movement search space. Developmental Science, 3:67–80, 2000.
F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.
Andrew Stout, George D. Konidaris, and Andrew G. Barto. Intrinsically motivated reinforcement learning: A promising framework for developmental robot learning. In AAAI, 2005.
Freek Stulp. Adaptive exploration for continual reinforcement learning. In International Conference on Intelligent Robots and Systems (IROS), 2012.
Freek Stulp and Pierre-Yves Oudeyer. Emergent proximo-distal maturation through adaptive exploration. In International Conference on Development and Learning (ICDL), 2012. Paper of Excellence Award.
Freek Stulp and Olivier Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137–3181, 2010.
Sebastian B. Thrun. Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie-Mellon University, 1992.
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
Cite this article
Stulp, F., Oudeyer, PY. Adaptive exploration through covariance matrix adaptation enables developmental motor learning. Paladyn 3, 128–135 (2012). https://doi.org/10.2478/s13230-013-0108-6