Deep Learning for Assistive Computer Vision

  • Marco Leo (corresponding author)
  • Antonino Furnari
  • Gerard G. Medioni
  • Mohan Trivedi
  • Giovanni M. Farinella
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11134)

Abstract

This paper reviews the main advances in assistive computer vision recently fostered by deep learning. To this end, we first discuss how the application of deep learning in computer vision has contributed to the development of assistive technologies, and then analyze the recent advances in assistive technologies achieved in five main areas, namely object classification and localization, scene understanding, human pose estimation and tracking, action/event recognition, and anticipation. The paper concludes with a discussion and insights for future directions.

Keywords

Assistive technologies · Computer vision · Deep learning

Acknowledgments

This research has been supported by Piano della Ricerca 2016–2018, Linea di Intervento 2 of DMI, University of Catania.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Marco Leo (1) (corresponding author)
  • Antonino Furnari (2)
  • Gerard G. Medioni (3)
  • Mohan Trivedi (4)
  • Giovanni M. Farinella (2)

  1. Institute of Applied Sciences and Intelligent Systems, National Research Council of Italy, Lecce, Italy
  2. University of Catania, Catania, Italy
  3. University of Southern California, Los Angeles, USA
  4. University of California, Oakland, USA
