Abstract
Deep reinforcement learning has over the past few years shown great potential for learning near-optimal control in complex simulated environments with little visual information. Rainbow (Q-learning) and PPO (policy optimisation) have shown outstanding performance on a variety of tasks, including the Atari 2600, MuJoCo, and Roboschool test suites. Although these algorithms are fundamentally different, both suffer from high variance, low sample efficiency, and hyperparameter sensitivity that, in practice, make them unsuitable for mission-critical operations in industry.
Model-based reinforcement learning, on the other hand, focuses on learning the transition dynamics between states in an environment. If the environment dynamics are adequately learned, a model-based approach is perhaps the most sample-efficient way for an agent to learn to act optimally in an environment. These traits make model-based reinforcement learning ideal for real-world environments where sampling is slow and for mission-critical operations. In the warehouse industry, there is increasing motivation to minimise time and maximise production. The literature suggests that in many of these environments, autonomous agents follow handcrafted policies and act suboptimally in a significant portion of the state space.
In this paper, we present the Dreaming Variational Autoencoder v2 (DVAE-2), a model-based reinforcement learning algorithm that increases sample efficiency, thereby enabling algorithms with low sample efficiency to function better in real-world environments. We introduce the Deep Warehouse environment for industry-near testing of autonomous agents in logistic warehouses. We show that the DVAE-2 algorithm improves sample efficiency in the Deep Warehouse environment compared to model-free methods.
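The DVAE-2 architecture itself is not detailed in this abstract. To illustrate the general principle the abstract appeals to — fit a model of the transition dynamics from real samples, then train the agent on additional imagined rollouts from that model — here is a minimal Dyna-style sketch on a toy chain environment. The environment, the tabular model, and all parameters below are illustrative assumptions, not the authors' implementation:

```python
import random
from collections import defaultdict

random.seed(0)

# Toy chain environment: states 0..N-1, actions -1/+1, reward 1 on reaching the right end.
N = 8

def step(s, a):
    s2 = max(0, min(N - 1, s + a))
    return s2, (1.0 if s2 == N - 1 else 0.0), s2 == N - 1

# Optimistic initialisation so the greedy policy keeps trying untried actions.
Q = defaultdict(lambda: 1.0)
model = {}                            # learned deterministic model: (s, a) -> (r, s2, done)
alpha, gamma, eps, n_plan = 0.5, 0.95, 0.1, 20
acts = [-1, 1]

def update(s, a, r, s2, done):
    target = r if done else r + gamma * max(Q[(s2, b)] for b in acts)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

for episode in range(30):
    s, done = 0, False
    while not done:
        a = random.choice(acts) if random.random() < eps else max(acts, key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        update(s, a, r, s2, done)     # learn from the real sample
        model[(s, a)] = (r, s2, done) # fit the (here: tabular) dynamics model
        for _ in range(n_plan):       # "dream": replay imagined transitions from the model
            (ps, pa), (pr, ps2, pd) = random.choice(list(model.items()))
            update(ps, pa, pr, ps2, pd)
        s = s2
```

The agent learns from each real transition once, then performs `n_plan` extra updates on transitions replayed from the learned model; this reuse of the model in place of real samples is the mechanism by which model-based methods reduce the number of environment interactions needed.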
Notes
- 1. \(\mathcal {S}\) and \(\mathcal {A}\) are defined for discrete or continuous spaces. \(r: \mathcal {S} \times \mathcal {A} \rightarrow \mathbb {R}\), where \(r\) is commonly referred to as \(\mathcal {R}(s, s')\) in the literature.
- 2. In this setting, the lowest score denotes the technique with the least accumulated error.
- 3. We use the mean squared error (MSE) loss in our implementation.
- 4. The deep warehouse environment is open-source and freely available at https://github.com/cair/deep-warehouse.
- 5. We consider large experiments to be those in which the agents require significant sampling to converge.
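Note 3 above states that the implementation uses the mean squared error loss. For reference, a generic definition (this sketch is not the authors' code):

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of numbers."""
    assert len(pred) == len(target), "sequences must have the same length"
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
```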
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Andersen, PA., Goodwin, M., Granmo, OC. (2019). Towards Model-Based Reinforcement Learning for Industry-Near Environments. In: Bramer, M., Petridis, M. (eds) Artificial Intelligence XXXVI. SGAI 2019. Lecture Notes in Computer Science(), vol 11927. Springer, Cham. https://doi.org/10.1007/978-3-030-34885-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34884-7
Online ISBN: 978-3-030-34885-4