
The Explore–Exploit Dilemma in Nonstationary Decision Making under Uncertainty

  • Chapter
Handling Uncertainty and Networked Structure in Robot Control

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 42)

Abstract

It is often assumed that autonomous systems operate in environments that can be described by a stationary (time-invariant) model. However, real-world environments are often nonstationary (time-varying): the underlying phenomena change over time, so stationary approximations of the environment may quickly lose relevance. Here, two approaches are presented and applied in the context of reinforcement learning in nonstationary environments. In Sect. 2.2, the first approach leverages reinforcement learning in the presence of a changing reward model. In particular, a functional termed the Fog-of-War is used to drive exploration, which results in the timely discovery of new models in nonstationary environments. In Sect. 2.3, the Fog-of-War functional is adapted in real time to reflect the heterogeneous information content of a real-world environment; this is critically important for applying the approach of Sect. 2.2 in real-world environments.
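The abstract does not define the Fog-of-War functional itself, but the underlying idea, exploration driven by uncertainty that accumulates over regions the agent has not observed recently, can be illustrated with a minimal sketch. The class name, the exponential-decay form, and all parameters below are illustrative assumptions, not the authors' formulation.

```python
# Minimal sketch (assumed formulation, for illustration only): a
# "Fog-of-War"-style exploration bonus in which the agent's confidence in
# each state's reward model decays with the time since that state was last
# observed, so stale states become attractive to revisit in a nonstationary
# environment.

import numpy as np

class FogOfWarExplorer:
    def __init__(self, n_states, decay_rate=0.1, bonus_scale=1.0):
        self.decay_rate = decay_rate                    # how quickly "fog" accumulates
        self.bonus_scale = bonus_scale                  # weight of the exploration bonus
        self.last_visit = np.full(n_states, -np.inf)    # time each state was last observed
        self.reward_estimate = np.zeros(n_states)       # running reward-model estimate

    def fog(self, state, t):
        """Uncertainty ("fog") grows toward 1 with time since the last observation."""
        elapsed = t - self.last_visit[state]
        return 1.0 - np.exp(-self.decay_rate * elapsed)

    def exploration_value(self, state, t):
        """Reward estimate plus a fog-driven bonus; long-unvisited states score higher."""
        return self.reward_estimate[state] + self.bonus_scale * self.fog(state, t)

    def update(self, state, reward, t, lr=0.5):
        """Observe a reward, refresh the reward model, and clear the fog at this state."""
        self.reward_estimate[state] += lr * (reward - self.reward_estimate[state])
        self.last_visit[state] = t

# Usage: at each step, act on the state with the highest fog-adjusted value.
explorer = FogOfWarExplorer(n_states=10)
t = 0
state = int(np.argmax([explorer.exploration_value(s, t) for s in range(10)]))
explorer.update(state, reward=0.3, t=t)
```

The design choice captured here is that, unlike a fixed exploration bonus, the bonus regrows after each visit, so the agent periodically rechecks parts of the environment whose reward model may have changed.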



Acknowledgments

This work is sponsored by the Department of Energy Award Number DE-FE0012173, the Air Force Office of Scientific Research Award Number FA9550-14-1-0399, and the Air Force Office of Scientific Research Young Investigator Program Number FA9550-15-1-0146.

Author information


Corresponding author

Correspondence to Allan Axelrod.



Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Axelrod, A., Chowdhary, G. (2015). The Explore–Exploit Dilemma in Nonstationary Decision Making under Uncertainty. In: Busoniu, L., Tamás, L. (eds) Handling Uncertainty and Networked Structure in Robot Control. Studies in Systems, Decision and Control, vol 42. Springer, Cham. https://doi.org/10.1007/978-3-319-26327-4_2

  • DOI: https://doi.org/10.1007/978-3-319-26327-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26325-0

  • Online ISBN: 978-3-319-26327-4

  • eBook Packages: Engineering, Engineering (R0)
