An exploration strategy for non-stationary opponents

Abstract

The success or failure of any learning algorithm is partially due to the exploration strategy it employs. However, most exploration strategies assume that the environment is stationary and non-strategic. In this work we shed light on how to design exploration strategies in non-stationary and adversarial environments. Our proposed adversarial drift exploration (DE) is able to efficiently explore the state space while keeping track of regions of the environment that have changed. The proposed exploration strategy is general enough to be applied in single-agent non-stationary environments as well as in multiagent settings where the opponent changes its strategy over time. We use a two-agent strategic interaction setting to test this new type of exploration, in which the opponent switches between different behavioral patterns to emulate a non-deterministic, stochastic and adversarial environment. The agent’s objective is to learn a model of the opponent’s strategy in order to act optimally. Our contribution is twofold. First, we present DE as a strategy for switch detection. Second, we propose a new algorithm, called R-max#, for learning and planning against non-stationary opponents. To handle such opponents, R-max# reasons and acts in terms of two objectives: (1) to maximize utilities in the short term while learning, and (2) to explore for possible changes in the opponent’s behavior. We provide theoretical results showing that R-max# is guaranteed to detect the opponent’s switch and to learn a new model, with finite sample complexity. R-max# makes efficient use of exploration experiences, resulting in rapid adaptation and efficient drift exploration that cope with the non-stationary nature of the opponent. We show experimentally that using DE outperforms state-of-the-art algorithms that were explicitly designed for modeling opponents (in terms of average rewards) in two complementary domains.
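
The drift exploration and switch detection summarized above can be illustrated with a minimal sketch (this is not the authors' R-max# implementation): maintain a learned model of the opponent's action frequencies per state, compare it against a sliding window of recent observations, and mark a region as unknown again when the two diverge, so that an optimistic planner re-explores it. The class name, window size and divergence threshold below are illustrative assumptions, written in Python.

import numpy as np

# Sketch of a drift-exploration-style switch detector (illustrative only).
class DriftExplorer:
    def __init__(self, n_states, n_opp_actions, window=30, threshold=0.4):
        # Learned model of opponent action counts per state (Laplace prior).
        self.counts = np.ones((n_states, n_opp_actions))
        # Sliding window of recent opponent actions per state.
        self.recent = [[] for _ in range(n_states)]
        self.window = window
        self.threshold = threshold

    def observe(self, state, opp_action):
        """Record one opponent action; return True if a switch is suspected."""
        self.recent[state].append(opp_action)
        if len(self.recent[state]) > self.window:
            self.recent[state].pop(0)
        self.counts[state, opp_action] += 1
        return self.switch_suspected(state)

    def switch_suspected(self, state):
        """Flag a switch when the L1 distance between the learned model and
        the recent empirical distribution exceeds the threshold."""
        if len(self.recent[state]) < self.window:
            return False
        model = self.counts[state] / self.counts[state].sum()
        recent = np.bincount(self.recent[state],
                             minlength=len(self.counts[state])) / self.window
        return float(np.abs(model - recent).sum()) > self.threshold

    def reset_state(self, state):
        """On a suspected switch, reset the model for that region so the
        planner treats it as unknown again (R-max-style optimism)."""
        self.counts[state] = 1.0
        self.recent[state] = []

In an R-max#-style loop, a suspected switch for a state would trigger reset_state for that region and re-planning with the optimistic model, mirroring the detect-then-relearn behavior described in the abstract.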


Notes

  1. Godfather [33] offers the opponent a situation where it can obtain a high reward. If the opponent does not accept the offer, Godfather forces the opponent to obtain a low reward.

  2. A related behavior called observationally equivalent models has been reported by Doshi et al. [17].

  3. To ensure drift exploration, a constant \(\epsilon \)-greedy exploration with \(\epsilon =0.2\) was used, with no decay in the learning rate (\(\alpha \)).

  4. The optimal policies are always-cooperate, Pavlov, and always-defect against the opponents TFT, Pavlov and Bully, respectively (a small illustrative simulation follows these notes).
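
The best responses listed in Note 4 can be checked with a small iterated prisoner's dilemma simulation, assuming the standard payoff values R = 3, S = 0, T = 5, P = 1 and textbook definitions of TFT, Pavlov and Bully; these values and strategy definitions are illustrative assumptions and may differ from the exact setup used in the experiments.

# Illustrative simulation for Note 4 (assumed payoffs and strategy definitions).
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def tft(my_hist, opp_hist):
    # Tit-for-tat: cooperate first, then copy the opponent's last action.
    return 'C' if not opp_hist else opp_hist[-1]

def pavlov(my_hist, opp_hist):
    # Win-stay, lose-shift: repeat the last action after a payoff of R or T,
    # switch otherwise.
    if not my_hist:
        return 'C'
    won = PAYOFF[(my_hist[-1], opp_hist[-1])] >= 3
    return my_hist[-1] if won else ('D' if my_hist[-1] == 'C' else 'C')

def bully(my_hist, opp_hist):
    # Bully: always defect.
    return 'D'

def average_reward(me, opponent, rounds=200):
    my_hist, opp_hist, total = [], [], 0
    for _ in range(rounds):
        a, b = me(my_hist, opp_hist), opponent(opp_hist, my_hist)
        total += PAYOFF[(a, b)]
        my_hist.append(a)
        opp_hist.append(b)
    return total / rounds

# Best responses named in Note 4 and the average rewards they obtain:
print(average_reward(lambda m, o: 'C', tft))    # 3.0, mutual cooperation vs. TFT
print(average_reward(pavlov, pavlov))           # 3.0, mutual cooperation vs. Pavlov
print(average_reward(lambda m, o: 'D', bully))  # 1.0, mutual defection vs. Bully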

References

  1. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed Bandit problem. Machine Learning, 47(2/3), 235–256.

  2. Axelrod, R., & Hamilton, W. D. (1981). The evolution of cooperation. Science, 211(27), 1390–1396.

  3. Babes, M., Munoz de Cote, E., & Littman, M. L. (2008). Social reward shaping in the prisoner’s dilemma. In Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems, (pp. 1389–1392). Estoril: International Foundation for Autonomous Agents and Multiagent Systems.

  4. Banerjee, B., & Peng, J. (2005). Efficient learning of multi-step best response. In Proceedings of the 4th International Conference on Autonomous Agents and Multiagent Systems, (pp. 60–66). Utrecht: ACM.

  5. Bard, N., Johanson, M., Burch, N., & Bowling, M. (2013). Online implicit agent modelling. In Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems, (pp. 255–262). Saint Paul, MN: International Foundation for Autonomous Agents and Multiagent Systems.

  6. Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2), 215–250.

  7. Brafman, R. I., & Tennenholtz, M. (2003). R-MAX a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3, 213–231.

  8. Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man and Cybernetics, Part C Applications and Reviews, 38(2), 156–172.

  9. Carmel, D., & Markovitch, S. (1999). Exploration strategies for model-based learning in multi-agent systems. Autonomous Agents and Multi-Agent Systems, 2(2), 141–172.

  10. Cawley, G. C., & Talbot, N. L. C. (2003). Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers. Pattern Recognition, 36(11), 2585–2592.

  11. Chakraborty, D., Agmon, N., & Stone, P. (2013). Targeted opponent modeling of memory-bounded agents. In Proceedings of the Adaptive Learning Agents Workshop (ALA). Saint Paul, MN, USA.

  12. Chakraborty, D., & Stone, P. (2008). Online multiagent learning against memory bounded adversaries. In Machine Learning and Knowledge Discovery in Databases (pp. 211–226). Berlin: Springer.

  13. Chakraborty, D., & Stone, P. (2013). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-Agent Systems, 28(2), 182–213.

  14. Choi, S. P. M., Yeung, D. Y., & Zhang, N. L. (1999). An environment model for nonstationary reinforcement learning. In Advances in Neural Information Processing Systems, (pp. 987–993). Denver, CO, USA.

  15. Da Silva, B. C., Basso, E. W., Bazzan, A. L., & Engel, P. M. (2006). Dealing with non-stationary environments using context detection. In Proceedings of the 23rd International Conference on Machine Learning, (pp. 217–224). Pittsburgh, PA, USA.

  16. Dietterich, T. G. (2000). Ensemble methods in machine learning. In Multiple Classifier Systems, (pp. 1–15). Berlin: Springer.

  17. Doshi, P., & Gmytrasiewicz, P. J. (2006). On the difficulty of achieving equilibrium in interactive POMDPs. In Twenty-first National Conference on Artificial Intelligence, (pp. 1131–1136). Boston, MA, USA.

  18. Elidrisi, M., Johnson, N., & Gini, M. (2012). Fast learning against adaptive adversarial opponents. In Proceedings of the Adaptive Learning Agents Workshop (ALA), Valencia, Spain.

  19. Elidrisi, M., Johnson, N., Gini, M., & Crandall, J. W. (2014). Fast adaptive learning in repeated stochastic games by game abstraction. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems, (pp. 1141–1148). Paris, France.

  20. Fulda, N., & Ventura, D. (2007). Predicting and preventing coordination problems in cooperative Q-learning systems. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, (pp. 780–785). Hyderabad, India.

  21. Garivier, A., & Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory, (pp. 174–188). Berlin: Springer.

  22. Geibel, P. (2001). Reinforcement learning with bounded risk. In Proceedings of the Eighteenth International Conference on Machine Learning, (pp. 162–169). Williamstown, MA: Morgan Kaufmann Publishers Inc.

  23. Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, 41(2), 148–177.

  24. Hans, A., Schneegaß, D., Schäfer, A. M., & Udluft, S. (2008). Safe exploration for reinforcement learning. In European Symposium on Artificial Neural Networks, (pp. 143–148). Bruges, Belgium.

  25. Hernandez-Leal, P., Munoz de Cote, E., & Sucar, L. E. (2013). Modeling non-stationary opponents. In Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems, (pp. 1135–1136). International Foundation for Autonomous Agents and Multiagent Systems, Saint Paul, MN, USA.

  26. Hernandez-Leal, P., Munoz de Cote, E., & Sucar, L. E. (2014). A framework for learning and planning against switching strategies in repeated games. Connection Science, 26(2), 103–122.

  27. Hernandez-Leal, P., Munoz de Cote, E., & Sucar, L. E. (2014). Exploration strategies to detect strategy switches. In Proceedings of the Adaptive Learning Agents Workshop (ALA). Paris, France.

  28. Hernandez-Leal, P., Taylor, M. E., Rosman, B., Sucar, L. E., & Munoz de Cote, E. (2016). Identifying and tracking switching, non-stationary opponents: a Bayesian approach. In Multiagent Interaction without Prior Coordination Workshop at AAAI. Phoenix, AZ, USA.

  29. HolmesParker, C., Taylor, M. E., Agogino, A., & Tumer, K. (2014). CLEANing the reward: counterfactual actions to remove exploratory action noise in multiagent learning. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems, (pp. 1353–1354). International Foundation for Autonomous Agents and Multiagent Systems, Paris, France.

  30. Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London.

  31. Lazaric, A., Munoz de Cote, E., & Gatti, N. (2007). Reinforcement learning in extensive form games with incomplete information: The bargaining case study. In Proceedings of the 6th International Conference on Autonomous Agents and Multiagent Systems. Honolulu, HI: ACM.

  32. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, (pp. 157–163). New Brunswick, NJ.

  33. Littman, M. L., & Stone, P. (2001). Implicit Negotiation in Repeated Games. In ATAL ’01: Revised Papers from the 8th International Workshop on Intelligent Agents VIII.

  34. Lopes, M., Lang, T., Toussaint, M., & Oudeyer, P. Y. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems, (pp. 206–214). Lake Tahoe, NV.

  35. MacAlpine, P., Urieli, D., Barrett, S., Kalyanakrishnan, S., Barrera, F., Lopez-Mobilia, A., Ştiurcă, N., Vu, V., & Stone, P. (2012). UT Austin Villa 2011: a champion agent in the RoboCup 3D Soccer simulation competition. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, (pp. 129–136). International Foundation for Autonomous Agents and Multiagent Systems, Valencia, Spain.

  36. Marinescu, A., Dusparic, I., Taylor, A., Cahill, V., & Clarke, S. (2015). P-MARL: Prediction-based multi-agent reinforcement learning for non-stationary environments. In Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems.

  37. Mohan, Y., & Ponnambalam, S. G. (2011). Exploration strategies for learning in multi-agent foraging. In Swarm, Evolutionary, and Memetic Computing 2011, (pp. 17–26). Springer.

  38. Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28, 1–16.

  39. Mota, P., Melo, F., & Coheur, L. (2015). Modeling students self-studies behaviors. In Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems, (pp. 1521–1528). Istanbul, Turkey.

  40. Munoz de Cote, E., Chapman, A. C., Sykulski, A. M., & Jennings, N. R. (2010). Automated planning in repeated adversarial games. In Uncertainty in Artificial Intelligence, (pp. 376–383). Catalina Island, CA.

  41. Munoz de Cote, E., & Jennings, N. R. (2010). Planning against fictitious players in repeated normal form games. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, (pp. 1073–1080). International Foundation for Autonomous Agents and Multiagent Systems, Toronto, Canada.

  42. Ng, A. Y., Harada, D., & Russell, S. J. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, (pp. 278–287). Bled, Slovenia.

  43. Puterman, M. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.

  44. Rejeb, L., Guessoum, Z., & M’Hallah, R. (2005). An adaptive approach for the exploration–exploitation dilemma for learning agents. In Proceedings of the 4th international Central and Eastern European conference on Multi-Agent Systems and Applications, (pp. 316–325). Springer, Budapest, Hungary.

  45. Stahl, I. (1972). Bargaining theory. Stockholm: Stockholm School of Economics.

  46. Stone, P., & Veloso, M. (2000). Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3), 345–383.

  47. Suematsu, N., & Hayashi, A. (2002). A multiagent reinforcement learning algorithm using extended optimal response. In Proceedings of the 1st International Conference on Autonomous Agents and Multiagent Systems, (pp. 370–377). Bologna, Italy: ACM.

  48. Taylor, M. E., & Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10, 1633–1685.

  49. Vrancx, P., Gurzi, P., Rodriguez, A., Steenhaut, K., & Nowe, A. (2015). A reinforcement learning approach for interdomain routing with link prices. ACM Transactions on Autonomous and Adaptive Systems, 10(1), 1–26.

  50. Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.

  51. Weinberg, M., & Rosenschein, J. S. (2004). Best-response multiagent learning in non-stationary environments. In Proceedings of the 3rd International Conference on Autonomous Agents and Multiagent Systems, (pp. 506–513). New York: IEEE Computer Society.

  52. Zinkevich, M. A., Bowling, M., & Wunder, M. (2011). The lemonade stand game competition: Solving unsolvable games. SIGecom Exchanges, 10(1), 35–38.

Acknowledgments

The first author was supported by a scholarship grant 329007 from the National Council of Science and Technology of Mexico (CONACYT). This research has taken place in part at the Intelligent Robot Learning (IRL) Lab, Washington State University. IRL research is supported in part by grants AFRL FA8750-14-1-0069, AFRL FA8750-14-1-0070, NSF IIS-1149917, NSF IIS-1319412, USDA 2014-67021-22174, and a Google Research Award.

Author information

Corresponding author

Correspondence to Pablo Hernandez-Leal.

Additional information

Most of this work was performed while the first author was a graduate student at INAOE.

This paper extends the paper “Exploration strategies to detect strategy switches” presented at the Adaptive Learning Agents workshop [27].

Cite this article

Hernandez-Leal, P., Zhan, Y., Taylor, M.E. et al. An exploration strategy for non-stationary opponents. Auton Agent Multi-Agent Syst 31, 971–1002 (2017). https://doi.org/10.1007/s10458-016-9347-3
