An Efficient 3D-NAS Method for Video-Based Gesture Recognition

Guo, Ziheng; Chen, Yang; Huang, Wei; Zhang, Junhao

doi:10.1007/978-3-030-30508-6_26

An Efficient 3D-NAS Method for Video-Based Gesture Recognition

Ziheng Guo¹²,
Yang Chen¹²,
Wei Huang¹³ &
…
Junhao Zhang¹³

Conference paper
First Online: 09 September 2019

2675 Accesses

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11729))

Abstract

3D convolutional neural network (3DCNN) is a powerful and effective model utilizing spatial-temporal features, especially for gesture recognition. Unfortunately, so many parameters are modified in 3DCNN that lots of researchers choose 2DCNN or hybrid models, but these models are designed manually. In this paper, we propose a framework to automatically construct a model based on 3DCNN by network architecture search (NAS) [1]. In our method called 3DNAS, a 3D teacher network is trained from scratch as a pre-trained model to accelerate the convergence of the child networks. Then series of child networks with various architectures are generated randomly and each is trained under the direction of converted teacher model. Finally, the controller predicts a network architecture according to the rewards of all the child networks. We evaluate our method on a video-based gesture recognition dataset 20BN-Jester dataset v1 [2] and the result shows our approach is superiority against prior methods both in efficiency and accuracy.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning (2016). arXiv preprint arXiv:1611.01578
TwentyBN: jester dataset: a hand gesture dataset (2017). https://www.twentybn.com/datasets/jester
Klaser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: 19th British Machine Vision Conference, BMVC 2008, vol. 275, pp. 1–10. British Machine Vision Association (2008)
Google Scholar
Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: Proceedings of the 15th ACM International Conference on Multimedia, pp. 357–360. ACM (2007). https://doi.org/10.1145/1291233.1291311
Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3551–3558 (2013). https://doi.org/10.1109/iccv.2013.441
Tang, P., Wang, X., Shi, B., et al.: Deep fishernet for object classification (2016). arXiv preprint arXiv:1608.00182. https://doi.org/10.1109/tnnls.2018.2874657
Article Google Scholar
Arandjelovic, R., Gronat, P., Torii, A., et al.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307 (2016). https://doi.org/10.1109/tpami.2017.2711011
Article Google Scholar
Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
Chapter Google Scholar
Zhou, B., Andonian, A., Oliva, A., Torralba, A.: Temporal relational reasoning in videos. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 831–846. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_49
Chapter Google Scholar
Yang, K., Li, R., Qiao, P., et al.: Temporal pyramid relation network for video-based gesture recognition. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 3104–3108. IEEE (2018)
Google Scholar
Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imagenet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6546–6555 (2018)
Google Scholar
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017). https://doi.org/10.1109/cvpr.2017.502
Tran, D., Bourdev, L., Fergus, R., et al.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015). https://doi.org/10.1109/iccv.2015.510
He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016). https://doi.org/10.1109/cvpr.2016.90
Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541 (2017). https://doi.org/10.1109/iccv.2017.590
Zolfaghari, M., Singh, K., Brox, T.: ECO: efficient convolutional network for online video understanding. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 713–730. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_43
Chapter Google Scholar
Chen, Y., Kalantidis, Y., Li, J., Yan, S., Feng, J.: Multi-fiber networks for video recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 364–380. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_22
Chapter Google Scholar
Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_19
Chapter Google Scholar
Pham, H., Guan, M.Y., Zoph, B., et al.: Efficient neural architecture search via parameter sharing (2018). arXiv preprint arXiv:1802.03268
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014). https://doi.org/10.1002/14651858.cd001941.pub3
Diba, A., Sharma, V., Van Gool, L.: Deep temporal linear encoding networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2329–2338 (2017). https://doi.org/10.1109/cvpr.2017.168
Li, Y., Miao, Q., Tian, K., et al.: Large-scale gesture recognition with a fusion of RGB-D data based on the C3D model. In: 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE (2016). https://doi.org/10.1109/icpr.2016.7899602

Download references

Author information

Authors and Affiliations

Southeast University, Nanjing, 210096, China
Ziheng Guo & Yang Chen
2012 Lab, Shanghai Huawei Technologies Co., Ltd., Shanghai, 200120, China
Wei Huang & Junhao Zhang

Authors

Ziheng Guo
View author publications
You can also search for this author in PubMed Google Scholar
Yang Chen
View author publications
You can also search for this author in PubMed Google Scholar
Wei Huang
View author publications
You can also search for this author in PubMed Google Scholar
Junhao Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yang Chen .

Editor information

Editors and Affiliations

Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany
Igor V. Tetko
Institute of Computer Science, Czech Academy of Sciences, Praha 8, Czech Republic
Věra Kůrková
Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany
Pavel Karpov
Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany
Fabian Theis

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Guo, Z., Chen, Y., Huang, W., Zhang, J. (2019). An Efficient 3D-NAS Method for Video-Based Gesture Recognition. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Image Processing. ICANN 2019. Lecture Notes in Computer Science(), vol 11729. Springer, Cham. https://doi.org/10.1007/978-3-030-30508-6_26

Download citation

DOI: https://doi.org/10.1007/978-3-030-30508-6_26
Published: 09 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30507-9
Online ISBN: 978-3-030-30508-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics