Preference-Based Policy Learning

  • Riad Akrour
  • Marc Schoenauer
  • Michele Sebag
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6911)


Many machine learning approaches in robotics, based on reinforcement learning, inverse optimal control or direct policy learning, critically rely on robot simulators. This paper investigates simulator-free direct policy learning, called Preference-based Policy Learning (PPL). PPL iterates a four-step process: the robot demonstrates a candidate policy; the expert ranks this policy against previous ones according to her preferences; these preferences are used to learn a policy return estimate; and the robot uses the policy return estimate to build new candidate policies. The process is iterated until the desired behavior is obtained. PPL requires that a good representation of the policy search space be available, enabling one to learn accurate policy return estimates and limiting the human ranking effort needed to yield a good policy. Furthermore, this representation cannot use informed features (e.g., how far the robot is from any target) due to the simulator-free setting. As a second contribution, this paper proposes a representation based on the agnostic exploitation of the robotic log.
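The four-step loop above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's implementation): the expert's preference is simulated by a hidden target, the policy return estimate is a linear ranking model fitted with perceptron-style updates on the preference pairs, and new candidates are generated by perturbing the incumbent and keeping the one the surrogate scores highest.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 5  # dimension of the (hypothetical) policy parameter space


def demonstrate(theta):
    """Stand-in for a robot demonstration: returns a behavioral
    descriptor of the policy (here, simply its parameters)."""
    return theta


def expert_prefers(desc_a, desc_b):
    """Stand-in for the expert's ranking: prefers behaviors closer to a
    hidden target the expert has in mind (unknown to the learner)."""
    target = np.ones(DIM)
    return np.linalg.norm(desc_a - target) < np.linalg.norm(desc_b - target)


def fit_return_estimate(prefs):
    """Learn a linear policy return estimate w such that w.a > w.b for
    every preferred pair (a, b), via perceptron updates on violations."""
    w = np.zeros(DIM)
    for _ in range(100):
        for a, b in prefs:
            if w @ a <= w @ b:  # ranking constraint violated
                w += a - b
    return w


# PPL main loop: demonstrate, rank, learn the estimate, build candidates.
theta0 = rng.normal(size=DIM)
best = theta0.copy()
prefs = []
for it in range(30):
    w = fit_return_estimate(prefs) if prefs else np.zeros(DIM)
    # generate candidates; keep the one the surrogate scores highest
    pool = best + 0.3 * rng.normal(size=(20, DIM))
    candidate = pool[np.argmax(pool @ w)]
    a, b = demonstrate(candidate), demonstrate(best)
    if expert_prefers(a, b):
        prefs.append((a, b))
        best = candidate
    else:
        prefs.append((b, a))
```

Because the incumbent is only replaced when the expert prefers the new demonstration, the behavior can never regress under this (simulated) expert; the surrogate merely biases which candidates are worth demonstrating, which is what keeps the ranking effort bounded.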

The convergence of PPL is analytically studied and its experimental validation on two problems, involving a single robot in a maze and two interacting robots, is presented.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Riad Akrour (1)
  • Marc Schoenauer (1)
  • Michele Sebag (1)

  1. TAO, CNRS – INRIA – Université Paris-Sud, France