Abstract
Conventional approaches to reinforcement learning assume the availability of a numerical feedback signal, but in many domains such a signal is difficult to define or not available at all. The recently proposed framework of preference-based reinforcement learning relaxes this condition by replacing the quantitative reward signal with qualitative preferences over trajectories. In this paper, we show how to estimate preferences over actions from preferences over trajectories. These action preferences can then be used to learn a preferred policy. The performance of this new approach is evaluated by comparing it with SARSA on three common reinforcement learning benchmark problems, namely mountain car, inverted pendulum, and acrobot. The results show comparable convergence rates, achieved with considerably less tuning effort.
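To make the reduction from trajectory preferences to action preferences concrete, the following is a minimal Python sketch, not the authors' actual algorithm (their estimation procedure is more involved). It rests on simplifying, hypothetical assumptions: trajectories are lists of hashable (state, action) pairs, and an action is counted as preferred in a state whenever the two compared trajectories visit that state but choose different actions.

```python
from collections import defaultdict
import random

# A trajectory is a list of (state, action) pairs. A trajectory preference is a
# pair (better, worse). This reduction is a simplified illustration: when both
# trajectories visit the same state but act differently, the action of the
# preferred trajectory is tallied as the preferred one in that state.

def action_preferences(trajectory_prefs):
    """Tally pairwise action preferences per state from trajectory preferences."""
    wins = defaultdict(lambda: defaultdict(int))  # wins[state][(good, bad)] -> count
    for better, worse in trajectory_prefs:
        # If a state is visited more than once, the last action is kept (sketch-level simplification).
        better_actions = dict(better)
        for state, bad_action in worse:
            good_action = better_actions.get(state)
            if good_action is not None and good_action != bad_action:
                wins[state][(good_action, bad_action)] += 1
    return wins

def greedy_policy(wins, actions):
    """Per state, pick the action with the most pairwise wins (ties broken randomly)."""
    policy = {}
    for state, counts in wins.items():
        score = defaultdict(int)
        for (good, _bad), n in counts.items():
            score[good] += n
        best = max(score.values())
        policy[state] = random.choice([a for a in actions if score[a] == best])
    return policy

# Toy usage: two trajectories through states s0 and s1; the first is preferred.
better = [("s0", "left"), ("s1", "up")]
worse = [("s0", "right"), ("s1", "up")]
prefs = action_preferences([(better, worse)])
print(greedy_policy(prefs, actions=["left", "right", "up"]))  # -> {'s0': 'left'}
```

Note that this sketch only produces a preference-maximizing greedy policy from raw win counts; a probabilistic treatment of the pairwise preferences, as used in the paper, would weight the evidence rather than simply counting it.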
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wirth, C., Fürnkranz, J. (2013). A Policy Iteration Algorithm for Learning from Preference-Based Feedback. In: Tucker, A., Höppner, F., Siebes, A., Swift, S. (eds) Advances in Intelligent Data Analysis XII. IDA 2013. Lecture Notes in Computer Science, vol 8207. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41398-8_37
DOI: https://doi.org/10.1007/978-3-642-41398-8_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41397-1
Online ISBN: 978-3-642-41398-8
eBook Packages: Computer Science