
Leveraging Uncertainty to Rethink Loss Functions and Evaluation Measures for Egocentric Action Anticipation

  • Antonino Furnari
  • Sebastiano Battiato
  • Giovanni Maria Farinella
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11133)

Abstract

Current action anticipation approaches often neglect the intrinsic uncertainty of future predictions when loss functions or evaluation measures are designed. The uncertainty of future observations is especially relevant in the context of egocentric visual data, which is naturally exposed to a great deal of variability. Considering the problem of egocentric action anticipation, we investigate how loss functions and evaluation measures can be designed to explicitly take into account the natural multi-modality of future events. In particular, we discuss suitable measures to evaluate egocentric action anticipation and study how loss functions can be defined to incorporate the uncertainty arising from the prediction of future events. Experiments performed on the EPIC-KITCHENS dataset show that the proposed loss function improves the results of both egocentric action anticipation and recognition methods.
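To make the two ideas in the abstract concrete, the sketch below illustrates, in PyTorch, (i) a top-k accuracy measure that accepts a prediction if the ground-truth action appears among the k highest-scoring classes, and (ii) a cross-entropy-style loss computed against a set of plausible future actions rather than a single label, so that the multi-modality of the future is not penalized. This is a minimal illustration, not the authors' released implementation; the function names, the number of classes, and the way candidate labels are obtained are assumptions.

```python
# Illustrative sketch only: an uncertainty-tolerant evaluation measure and loss
# for action anticipation. Names, shapes, and class counts are assumptions.

import torch
import torch.nn.functional as F


def top_k_accuracy(scores: torch.Tensor, labels: torch.Tensor, k: int = 5) -> float:
    """Count a sample as correct if the true class is among the k top-scored classes."""
    topk = scores.topk(k, dim=1).indices               # (batch, k) predicted class indices
    hits = (topk == labels.unsqueeze(1)).any(dim=1)    # (batch,) true class within top-k?
    return hits.float().mean().item()


def multi_candidate_loss(scores: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
    """Cross entropy against a set of acceptable future actions instead of one label.

    candidates: (batch, m) indices of m actions treated as plausible continuations;
    the loss rewards probability mass placed on any of them, reflecting the
    uncertainty of anticipating a single future.
    """
    log_probs = F.log_softmax(scores, dim=1)            # (batch, num_classes)
    cand_log_probs = log_probs.gather(1, candidates)    # (batch, m)
    # negative log of the total probability assigned to the candidate set
    return -torch.logsumexp(cand_log_probs, dim=1).mean()


if __name__ == "__main__":
    num_classes = 2513                                  # illustrative number of action classes
    scores = torch.randn(8, num_classes)                # model outputs for a batch of 8 clips
    labels = torch.randint(0, num_classes, (8,))
    print("top-5 accuracy:", top_k_accuracy(scores, labels, k=5))
    candidates = torch.randint(0, num_classes, (8, 3))  # 3 plausible future actions per clip
    print("loss:", multi_candidate_loss(scores, candidates).item())
```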

Keywords

Egocentric vision · Action anticipation · Loss functions · First person vision

Notes

Acknowledgment

This research has been supported by Piano della Ricerca 2016-2018 linea di Intervento 2 of DMI of the University of Catania.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Department of Mathematics and Computer Science, University of Catania, Catania, Italy
