
The Explore–Exploit Dilemma in Nonstationary Decision Making under Uncertainty

  • Chapter
Handling Uncertainty and Networked Structure in Robot Control

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 42)

Abstract

It is often assumed that autonomous systems operate in environments that can be described by a stationary (time-invariant) model. However, real-world environments are often nonstationary (time-varying): the underlying phenomena change over time, so stationary approximations of the environment may quickly lose relevance. Here, two approaches are presented and applied in the context of reinforcement learning in nonstationary environments. In Sect. 2.2, the first approach leverages reinforcement learning in the presence of a changing reward model. In particular, a functional termed the Fog-of-War is used to drive exploration, which results in the timely discovery of new models in nonstationary environments. In Sect. 2.3, the Fog-of-War functional is adapted in real time to reflect the heterogeneous information content of a real-world environment; this is critically important for applying the approach of Sect. 2.2 in real-world environments.
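The abstract does not define the Fog-of-War functional itself, but the underlying idea, exploration driven by uncertainty that accumulates over regions the agent has not observed recently, can be illustrated with a minimal sketch. The class name, the exponential-decay form, and all parameters below are illustrative assumptions, not the authors' formulation.

```python
# Minimal sketch (assumed formulation, for illustration only): a
# "Fog-of-War"-style exploration bonus in which the agent's confidence in
# each state's reward model decays with the time since that state was last
# observed, so stale states become attractive to revisit in a nonstationary
# environment.

import numpy as np

class FogOfWarExplorer:
    def __init__(self, n_states, decay_rate=0.1, bonus_scale=1.0):
        self.decay_rate = decay_rate                    # how quickly "fog" accumulates
        self.bonus_scale = bonus_scale                  # weight of the exploration bonus
        self.last_visit = np.full(n_states, -np.inf)    # time each state was last observed
        self.reward_estimate = np.zeros(n_states)       # running reward-model estimate

    def fog(self, state, t):
        """Uncertainty ("fog") grows toward 1 with time since the last observation."""
        elapsed = t - self.last_visit[state]
        return 1.0 - np.exp(-self.decay_rate * elapsed)

    def exploration_value(self, state, t):
        """Reward estimate plus a fog-driven bonus; long-unvisited states score higher."""
        return self.reward_estimate[state] + self.bonus_scale * self.fog(state, t)

    def update(self, state, reward, t, lr=0.5):
        """Observe a reward, refresh the reward model, and clear the fog at this state."""
        self.reward_estimate[state] += lr * (reward - self.reward_estimate[state])
        self.last_visit[state] = t

# Usage: at each step, act on the state with the highest fog-adjusted value.
explorer = FogOfWarExplorer(n_states=10)
t = 0
state = int(np.argmax([explorer.exploration_value(s, t) for s in range(10)]))
explorer.update(state, reward=0.3, t=t)
```

The design choice captured here is that, unlike a fixed exploration bonus, the bonus regrows after each visit, so the agent periodically rechecks parts of the environment whose reward model may have changed.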



Acknowledgments

This work is sponsored by the Department of Energy Award Number DE-FE0012173, the Air Force Office of Scientific Research Award Number FA9550-14-1-0399, and the Air Force Office of Scientific Research Young Investigator Program Number FA9550-15-1-0146.

Author information


Corresponding author

Correspondence to Allan Axelrod.



Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Axelrod, A., Chowdhary, G. (2015). The Explore–Exploit Dilemma in Nonstationary Decision Making under Uncertainty. In: Busoniu, L., Tamás, L. (eds) Handling Uncertainty and Networked Structure in Robot Control. Studies in Systems, Decision and Control, vol 42. Springer, Cham. https://doi.org/10.1007/978-3-319-26327-4_2

  • DOI: https://doi.org/10.1007/978-3-319-26327-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26325-0

  • Online ISBN: 978-3-319-26327-4

  • eBook Packages: Engineering, Engineering (R0)
