Crossmodal attentive skill learner: learning in Atari and beyond with audio–video inputs

  • Dong-Ki Kim
  • Shayegan Omidshafiei
  • Jason Pazis
  • Jonathan P. How


This paper introduces the Crossmodal Attentive Skill Learner (CASL), integrated with the recently introduced Asynchronous Advantage Option-Critic architecture [Harb et al. in When waiting is not an option: learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017] to enable hierarchical reinforcement learning across multiple sensory inputs. Agents trained using our approach learn to attend to their various sensory modalities (e.g., audio, video) at the appropriate moments, thereby executing actions based on multiple sensory streams without reliance on supervisory data. We demonstrate empirically that the sensory attention mechanism anticipates and identifies useful latent features, while filtering irrelevant sensor modalities during execution. Further, we provide concrete examples in which the approach not only improves performance in a single task, but also accelerates transfer to new tasks. We modify the Arcade Learning Environment [Bellemare et al. in J Artif Intell Res 47:253–279, 2013] to support audio queries (ALE-audio code is available online), and conduct evaluations of crossmodal learning in the Atari 2600 games H.E.R.O. and Amidar. Finally, building on the recent work of Babaeizadeh et al. [in: International conference on learning representations (ICLR), 2017], we open-source a fast hybrid CPU–GPU implementation of CASL (CASL code is available online).
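
The idea of attending to sensory modalities can be illustrated with a generic soft-attention sketch. This is not the CASL architecture itself, which the abstract does not specify. It assumes each modality has already been encoded into a feature vector of the same length, and the names `attend` and `score_weights` are hypothetical. A score is computed per modality, a softmax turns the scores into weights, and the weighted features are summed into one fused vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, score_weights):
    """Fuse per-modality feature vectors with soft attention.

    features:      {modality: feature vector}, all vectors the same length
    score_weights: {modality: scoring vector of that length}; hypothetical
                   stand-ins for learned parameters, fixed by hand here
    Returns (attention weight per modality, fused feature vector).
    """
    names = sorted(features)
    # One scalar relevance score per modality (dot product with its scoring vector).
    scores = [
        sum(w * x for w, x in zip(score_weights[n], features[n])) for n in names
    ]
    weights = softmax(scores)
    dim = len(features[names[0]])
    # Weighted sum of the modality features, element by element.
    fused = [
        sum(weights[i] * features[n][k] for i, n in enumerate(names))
        for k in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy inputs: a near-silent audio feature and a strongly activated video feature.
features = {"audio": [0.0, 0.1], "video": [2.0, 1.0]}
score_weights = {"audio": [1.0, 1.0], "video": [1.0, 1.0]}
weights, fused = attend(features, score_weights)
```

With these toy inputs the video modality receives most of the attention weight, so the fused vector stays close to the video features. In a trained agent the scoring parameters would be learned from reward rather than set by hand.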


Keywords: Hierarchical learning; Reinforcement learning; Multimodal learning



References

  1. Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., & Abbeel, P. (2018). Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations (ICLR).
  2. Alvis, C. D., Ott, L., & Ramos, F. (2017). Online learning for scene segmentation with laser-constrained CRFs. International Conference on Robotics and Automation (ICRA), 4639–4643.
  3. Andreas, J., Klein, D., & Levine, S. (2016). Modular multitask reinforcement learning with policy sketches. arXiv preprint arXiv:1611.01796.
  4. Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.
  5. Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., & Kautz, J. (2017). Reinforcement learning through asynchronous advantage actor-critic on a GPU. In International Conference on Learning Representations (ICLR).
  6. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  7. Beal, M. J., Attias, H., & Jojic, N. (2002). Audio-video sensor fusion with probabilistic graphical models. European Conference on Computer Vision (ECCV), 736–750.
  8. Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
  9. Bengio, S. (2002). An asynchronous hidden Markov model for audio-visual speech recognition. Advances in Neural Information Processing Systems (NIPS), 1237–1244.
  10. Cadena, C., & Košecká, J. (2014). Semantic segmentation with heterogeneous sensor coverages. International Conference on Robotics and Automation (ICRA), 2639–2645.
  11. Caglayan, O., Barrault, L., & Bougares, F. (2016). Multimodal attention for neural machine translation. arXiv preprint arXiv:1609.03976.
  12. Carrasco, M. (2011). Visual attention: The past 25 years. Vision Research, 51(13), 1484–1525.
  13. Chambers, A. D., Scherer, S., Yoder, L., Jain, S., Nuske, S. T., & Singh, S. (2014). Robust multi-sensor fusion for micro aerial vehicle navigation in GPS-degraded/denied environments. In American Control Conference (ACC).
  14. Da Silva, B., Konidaris, G., & Barto, A. (2012). Learning parameterized skills. arXiv preprint arXiv:1206.6398.
  15. Eitel, A., Springenberg, J. T., Spinello, L., Riedmiller, M., & Burgard, W. (2015). Multimodal deep learning for robust RGB-D object recognition. In International Conference on Intelligent Robots and Systems (IROS).
  16. Harb, J., Bacon, P.-L., Klissarov, M., & Precup, D. (2017). When waiting is not an option: Learning options with a deliberation cost. arXiv preprint arXiv:1709.04571.
  17. Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527.
  18. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. CoRR, arXiv:1512.03385.
  19. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
  20. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1), 99–134.
  21. Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2014). Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.
  22. Konidaris, G., & Barto, A. G. (2009). Skill discovery in continuous reinforcement learning domains using skill chaining. Advances in Neural Information Processing Systems (NIPS), 1015–1023.
  23. Kulkarni, T. D., Narasimhan, K., Saeedi, A., & Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in Neural Information Processing Systems (NIPS), 3675–3683.
  24. Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML), 282–289.
  25. Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron, 93(2), 451–463.
  26. Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  27. Lynen, S., Achtelik, M. W., Weiss, S., Chli, M., & Siegwart, R. (2013). A robust and modular multi-sensor fusion approach applied to MAV navigation. International Conference on Intelligent Robots and Systems (IROS), 3923–3929.
  28. Machado, M. C., Bellemare, M. G., & Bowling, M. (2017). A Laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956.
  29. Mackintosh, N. J. (1975). A theory of attention: Variations in the associability of stimuli with reinforcement. Psychological Review, 82(4), 276.
  30. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., et al. (2016). Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning (ICML), 1928–1937.
  31. Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. Advances in Neural Information Processing Systems (NIPS), 2204–2212.
  32. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
  33. Nachum, O., Gu, S. S., Lee, H., & Levine, S. (2018). Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems (NIPS), 3306–3317.
  34. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. International Conference on Machine Learning (ICML), 689–696.
  35. Niv, Y., Daniel, R., Geana, A., Gershman, S. J., Leong, Y. C., Radulescu, A., et al. (2015). Reinforcement learning in multidimensional environments relies on attention mechanisms. Journal of Neuroscience, 35(21), 8145–8157.
  36. Nobili, S., Camurri, M., Barasuol, V., Focchi, M., Caldwell, D. G., Semini, C., & Fallon, M. (2017). Heterogeneous sensor fusion for accurate state estimation of dynamic legged robots. In Robotics: Science and Systems (RSS).
  37. Pearce, J. M., & Hall, G. (1980). A model for Pavlovian learning: Variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychological Review, 87(6), 532.
  38. Pearce, J. M., & Mackintosh, N. J. (2010). Two theories of attention: A review and a possible integration. Attention and Associative Learning: From Brain to Behaviour, pp. 11–39.
  39. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359.
  40. Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.
  41. Srivastava, N., & Salakhutdinov, R. (2014). Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15, 2949–2980.
  42. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction.
  43. Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.
  44. Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161.
  45. Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., & Hinton, G. (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems (NIPS), 2773–2781.
  46. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical attention networks for document classification. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 1480–1489.
  47. Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., & Fei-Fei, L. (2015). Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision, 1–15.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Authors and Affiliations

  1. Massachusetts Institute of Technology, Cambridge, USA
  2. Amazon Alexa, Cambridge, USA
