
Scenes-Objects-Actions: A Multi-task, Multi-label Video Dataset

  • Jamie Ray
  • Heng Wang (email author)
  • Du Tran
  • Yufei Wang
  • Matt Feiszli
  • Lorenzo Torresani
  • Manohar Paluri
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11218)

Abstract

This paper introduces a large-scale, multi-label and multi-task video dataset named Scenes-Objects-Actions (SOA). Most prior video datasets are based on a predefined taxonomy, which is used to define the keyword queries issued to search engines. The videos retrieved by the search engines are then verified for correctness by human annotators. Datasets collected in this manner tend to yield high classification accuracy because search engines typically rank "easy" videos first. The SOA dataset adopts a different approach: we rely on uniform sampling to obtain a better representation of videos on the Web. Trained annotators are asked to provide free-form text labels describing each video along three different aspects: scene, object and action. These raw labels are then merged, split and renamed to generate a taxonomy for SOA, and all annotations are re-verified against this taxonomy. The final dataset includes 562K videos with 3.64M annotations spanning 49 categories for scenes, 356 for objects and 148 for actions, and naturally captures the long-tail distribution of visual concepts in the real world. By evaluating existing popular video models on SOA, we show that datasets collected in this way are quite challenging. We provide an in-depth analysis of the performance of different models on SOA and highlight potential new directions in video classification. We compare SOA with existing datasets and discuss various factors that impact the performance of transfer learning. A key feature of SOA is that it enables the empirical study of the correlation among scene, object and action recognition in video. We present results of this study and further analyze the potential of using the information learned from one task to improve the others. We also demonstrate different ways of scaling up SOA to learn better features. We believe that the challenges presented by SOA offer the opportunity for further advancement in video analysis as we progress from single-label classification towards a more comprehensive understanding of video data.
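
To make the multi-task, multi-label setting above concrete, the sketch below shows one plausible way to represent an SOA-style annotation record and to score predictions with per-task mean average precision, a common metric for multi-label video classification. This is an illustrative sketch only: the record layout, the label names and the choice of mAP are assumptions for exposition, not the authors' released data format or evaluation code.

    # Illustrative sketch only -- not the authors' data format or evaluation code.
    # Assumes a multi-label setting where each video carries independent label
    # sets for the three tasks (scenes, objects, actions).
    import numpy as np

    # Hypothetical annotation record for one video.
    record = {
        "video_id": "example_0001",                       # hypothetical id
        "scene": ["kitchen"],
        "object": ["person", "knife", "cutting board"],
        "action": ["chopping"],
    }

    def average_precision(targets, scores):
        """AP for one label: targets is a 0/1 vector, scores are confidences."""
        order = np.argsort(-scores)              # rank videos by descending score
        hits = targets[order]
        if hits.sum() == 0:
            return 0.0
        precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1.0)
        return float((precision_at_k * hits).sum() / hits.sum())

    def mean_average_precision(targets, scores):
        """mAP over labels; targets and scores have shape (num_videos, num_labels)."""
        return float(np.mean([average_precision(targets[:, j], scores[:, j])
                              for j in range(targets.shape[1])]))

    # Toy example: 4 videos scored against 3 labels of a single task.
    y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
    y_score = np.random.rand(4, 3)
    print("toy per-task mAP:", mean_average_precision(y_true, y_score))

In a multi-label setting like this, per-task mAP (rather than top-1 accuracy) naturally accommodates videos that carry several, possibly overlapping labels within the same task.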

Keywords

Video dataset · Multi-task · Scene · Object · Action


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Jamie Ray (1)
  • Heng Wang (1), email author
  • Du Tran (1)
  • Yufei Wang (1)
  • Matt Feiszli (1)
  • Lorenzo Torresani (1, 2)
  • Manohar Paluri (1)
  1. Facebook AI, Menlo Park, USA
  2. Dartmouth College, Hanover, USA
