Deep Execution Monitor for Robot Assistive Tasks

  • Lorenzo Mauro
  • Edoardo Alati
  • Marta Sanzari
  • Valsamis Ntouskos
  • Gianluca Massimiani
  • Fiora Pirri
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11134)


We consider a novel approach to high-level task execution for a robot performing assistive tasks. We explore the problem of learning to predict the next subtask by introducing a deep model both for sequencing goals and for visually evaluating the state of a task. We show that deep learning for monitoring the execution of robot tasks effectively supports the interconnection between task-level planning and robot operations. These solutions can also cope with the natural non-determinism of the execution monitor. We show that a deep execution monitor improves robot performance, and we measure the improvement on a set of robot helping tasks performed at a warehouse.



The research has been supported by the H2020 Project Second Hands under grant agreement No. 643950. We thank in particular our partners: the team at Ocado, Graham Deacon, Duncan Russel, Giuseppe Cotugno and Dario Turchi; the team at KIT led by Tamim Asfour; the team at UCL with Lourdes Agapito, Martin Runz and Denis Tome; and the group at EPFL led by Aude Billiard.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Alcor Lab, Diag, University of Rome La Sapienza, Rome, Italy
