Approach for Video Classification with Multi-label on YouTube-8M Dataset
Video traffic is increasing at a considerable rate due to the spread of personal media and advances in media technology. Accordingly, there is a growing need for techniques that automatically classify videos. This paper applies the NetVLAD and NetFV models together with the Huber loss function to the multi-label video classification problem and uses the YouTube-8M dataset to validate the approach. We ran a range of experiments on the dataset, optimized the hyperparameters, and ultimately obtained a GAP score of 0.8668.
Keywords: Video classification, Large-scale video, Multi-label
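The abstract names two quantities without defining them: the Huber loss used for training and the GAP (Global Average Precision) score used for evaluation. The following is a minimal sketch of both in plain Python, not the authors' implementation. It assumes the standard Huber definition with a threshold `delta` (the value 1.0 is an assumption, not taken from the paper) and the commonly used YouTube-8M form of GAP, in which the top `top_k` predictions per video (20 by default) are pooled and ranked by confidence. The function names are hypothetical.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss over paired labels and predictions.

    Quadratic for errors up to `delta`, linear beyond it.
    `delta=1.0` is an illustrative assumption.
    """
    total = 0.0
    for target, pred in zip(y_true, y_pred):
        err = abs(target - pred)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)


def global_average_precision(scores, labels, top_k=20):
    """GAP over a batch of videos.

    scores: per-video lists of class confidences.
    labels: per-video lists of 0/1 ground-truth labels (same shape).
    Keeps the top_k predictions of each video, pools them, ranks the
    pool by confidence, and averages precision at each correct hit
    over the total number of ground-truth positives.
    """
    pooled = []
    total_positives = 0
    for video_scores, video_labels in zip(scores, labels):
        total_positives += sum(video_labels)
        ranked = sorted(zip(video_scores, video_labels), key=lambda pair: -pair[0])
        pooled.extend(ranked[:top_k])
    if total_positives == 0:
        return 0.0

    pooled.sort(key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(pooled, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_positives
```

For example, `global_average_precision([[0.9, 0.1], [0.2, 0.8]], [[1, 0], [0, 1]])` scores a perfectly ranked batch, while swapping a label lowers the score because a wrong prediction then outranks a correct one.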
This work was partly supported by Institute for Information & communications Technology Promotion (IITP) grants funded by the Korea government (MSIT): No. 2017-0-01772, Development of QA system for video story understanding to pass Video Turing Test; No. 2017-0-01781, Data Collection and Automatic Tuning System Development for the Video Understanding; and No. 2017-0-00271, Development of Archive Solution and Content Management Platform.
- 1. Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307 (2016)
- 2. Campos Camunez, V., Jou, B., Giró Nieto, X., Torres Viñals, J., Chang, S.F.: Skip RNN: learning to skip state updates in recurrent neural networks. In: Proceedings of the Sixth International Conference on Learning Representations, Monday April 30–Thursday 3 May 2018, Vancouver Convention Center, Vancouver, pp. 1–17 (2018)
- 3. Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches (2014). arXiv preprint: arXiv:1409.1259
- 6. Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3304–3311. IEEE (2010)
- 7. Kim, Y.: Convolutional neural networks for sentence classification (2014). arXiv preprint: arXiv:1408.5882
- 8. Miech, A., Laptev, I., Sivic, J.: Learnable pooling with Context Gating for video classification. ArXiv e-prints (2017)
- 9. Na, S., Yu, Y., Lee, S., Kim, J., Kim, G.: Encoding Video and Label Priors for Multi-label Video Classification on YouTube-8M dataset. ArXiv e-prints (2017)
- 10. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2007)