Preference-Based Reinforcement Learning Using Dyad Ranking
Preference-based reinforcement learning has recently been introduced as a generalization of conventional reinforcement learning. Instead of numerical rewards, which are often difficult to specify, the former assumes weaker feedback in the form of qualitative preferences between states or trajectories. A specific realization of preference-based reinforcement learning is approximate policy iteration using label ranking. We propose an extension of this method, in which label ranking is replaced by so-called dyad ranking. The main advantage of this extension is the ability of dyad ranking to learn from feature descriptions of actions, which are often available in reinforcement learning. Several simulation studies are conducted to confirm the usefulness of the approach.
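To make the idea of ranking dyads concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a bilinear Plackett-Luce model, one model from the dyad ranking literature; the abstract does not state which model is used. A dyad is a pair of a state feature vector and an action feature vector, and each dyad receives a latent utility x^T W y. The function names, the weight matrix W, and the toy feature vectors are all hypothetical.

```python
import math

def bilinear_score(state, action, W):
    """Latent utility x^T W y of a dyad (state features x, action features y)."""
    return sum(state[i] * W[i][j] * action[j]
               for i in range(len(state))
               for j in range(len(action)))

def rank_actions(state, actions, W):
    """Return action indices ordered from most to least preferred in `state`."""
    scores = [bilinear_score(state, a, W) for a in actions]
    return sorted(range(len(actions)), key=lambda k: -scores[k])

def pl_log_likelihood(state, actions, ranking, W):
    """Plackett-Luce log-likelihood of an observed ranking over the dyads."""
    skills = [math.exp(bilinear_score(state, a, W)) for a in actions]
    ll = 0.0
    for pos, k in enumerate(ranking):
        # Probability of picking item k first among the items not yet ranked.
        remaining = sum(skills[m] for m in ranking[pos:])
        ll += math.log(skills[k] / remaining)
    return ll

# Toy example (hypothetical values): one state, three actions described by features.
W = [[1.0, 0.0], [0.0, 1.0]]
state = [1.0, 0.0]
actions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
predicted = rank_actions(state, actions, W)
```

Because the score depends on action features rather than on an action's identity, such a model can in principle rank actions that were never observed during training, which label ranking over a fixed set of labels cannot do. In a policy iteration loop, W would be fitted by maximizing `pl_log_likelihood` over preferences obtained from rollouts; that fitting step is omitted here.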
This work was supported by the German Research Foundation (DFG) within the Collaborative Research Center “On-The-Fly Computing” (SFB 901). We are grateful to Javad Rahnama for his help with the case study on image pipeline configuration.