Large-Scale Video Classification with Feature Space Augmentation Coupled with Learned Label Relations and Ensembling
This paper presents Axon AI's solution to the 2nd YouTube-8M Video Understanding Challenge, achieving a final global average precision (GAP) of 88.733% on the private test set (ranked 3rd among 394 teams, not considering the model size constraint), and 87.287% using a model that meets the size requirement. Two sets of 7 individual models belonging to 3 different families were trained separately. The inference results of these models on the training data were then aggregated and used to train a compact model that meets the model size requirement. To further improve performance, we explored and employed data over/sub-sampling in feature space, an additional regularization term during training that exploits label relationships, and learned weights for ensembling the individual models.
Keywords: Video classification, YouTube-8M dataset
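The abstract mentions learned weights for ensembling individual models but does not say how those weights were fit. The following is a minimal sketch of one plausible approach, not the authors' method: it assumes the weights are parameterized through a softmax (so they stay positive and sum to 1) and are fit by plain gradient descent on binary cross-entropy. The function name `learn_ensemble_weights`, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def learn_ensemble_weights(preds, labels, steps=500, lr=1.0, eps=1e-7):
    """Fit convex ensemble weights (an illustrative assumption, see lead-in).

    preds:  array of shape (K, N, C), per-model class probabilities
    labels: array of shape (N, C), multi-label targets in {0, 1}
    Returns an array of K non-negative weights that sum to 1.
    """
    num_models = preds.shape[0]
    theta = np.zeros(num_models)  # unconstrained; weights = softmax(theta)
    for _ in range(steps):
        w = softmax(theta)
        # Weighted average of the model predictions, shape (N, C).
        p = np.clip(np.tensordot(w, preds, axes=1), eps, 1.0 - eps)
        # Derivative of mean binary cross-entropy w.r.t. the blended p.
        dp = (p - labels) / (p * (1.0 - p)) / labels.size
        # Gradient w.r.t. each weight w_k, shape (K,).
        dw = np.tensordot(preds, dp, axes=([1, 2], [0, 1]))
        # Chain rule through the softmax parameterization.
        dtheta = w * (dw - np.dot(w, dw))
        theta -= lr * dtheta
    return softmax(theta)


if __name__ == "__main__":
    # Synthetic stand-in data: one informative model and one random model.
    rng = np.random.default_rng(0)
    labels = (rng.random((200, 5)) < 0.3).astype(float)
    good = np.clip(labels * 0.9 + 0.05, 0.01, 0.99)
    noisy = rng.random(labels.shape)
    weights = learn_ensemble_weights(np.stack([good, noisy]), labels)
    print("learned weights:", weights)
```

In this toy setup the informative model should receive the larger weight. In practice such weights would be fit on held-out predictions rather than on the data the individual models were trained on, to avoid rewarding overfitted models.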
The authors would like to thank the YouTube-8M Challenge organizers for hosting this exciting competition and for providing the excellent starter code, and the Axon team for supporting this project.