Cross and Learn: Cross-Modal Self-supervision

Sayed, Nawid; Brattoli, Biagio; Ommer, Björn

doi:10.1007/978-3-030-12939-2_17

Conference paper
First Online: 14 February 2019

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11269)

Abstract

In this paper, we present a self-supervised method for representation learning that utilizes two different modalities. Based on the observation that cross-modal information carries high semantic meaning, we propose a method to effectively exploit this signal. For our approach, we use video data, since it is available on a large scale and provides easily accessible modalities in the form of RGB and optical flow. We demonstrate state-of-the-art performance on highly contested action recognition datasets in the context of self-supervised learning. We also show that our feature representation transfers to other tasks, and we conduct extensive ablation studies to validate our core contributions.

Acknowledgments

We are grateful to the NVIDIA Corporation for supporting our research; the experiments in this paper were performed on a donated Titan X (Pascal) GPU.

Author information

Authors and Affiliations

Heidelberg University, HCI/IWR, Heidelberg, Germany
Nawid Sayed, Biagio Brattoli & Björn Ommer

Corresponding authors

Correspondence to Nawid Sayed, Biagio Brattoli or Björn Ommer.

Editor information

Editors and Affiliations

University of Freiburg, Freiburg im Breisgau, Baden-Württemberg, Germany
Thomas Brox
University of Stuttgart, Stuttgart, Baden-Württemberg, Germany
Andrés Bruhn
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Mario Fritz

About this paper

Cite this paper

Sayed, N., Brattoli, B., Ommer, B. (2019). Cross and Learn: Cross-Modal Self-supervision. In: Brox, T., Bruhn, A., Fritz, M. (eds) Pattern Recognition. GCPR 2018. Lecture Notes in Computer Science, vol 11269. Springer, Cham. https://doi.org/10.1007/978-3-030-12939-2_17

DOI: https://doi.org/10.1007/978-3-030-12939-2_17
Published: 14 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12938-5
Online ISBN: 978-3-030-12939-2
eBook Packages: Computer Science, Computer Science (R0)
