Large-Scale Video Classification with Feature Space Augmentation Coupled with Learned Label Relations and Ensembling

  • Choongyeun Cho
  • Benjamin Antin
  • Sanchit Arora
  • Shwan Ashrafi
  • Peilin Duan
  • Dang The Huynh
  • Lee James
  • Hang Tuan Nguyen
  • Mojtaba Solgi
  • Cuong Van Than
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)

Abstract

This paper presents Axon AI’s solution to the 2nd YouTube-8M Video Understanding Challenge, which achieved a final global average precision (GAP) of 88.733% on the private test set (ranked 3rd among 394 teams when the model-size constraint is not considered) and 87.287% with a model that meets the size requirement. Two sets of seven individual models, belonging to three different families, were trained separately. The inference results of these models on the training data were then aggregated and used to train a compact model that satisfies the model-size requirement. To further improve performance, we explored and employed data over- and sub-sampling in feature space, an additional regularization term during training that exploits label relationships, and learned weights for ensembling the individual models.
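
As a rough illustration of the learned ensembling weights mentioned above, the sketch below fits softmax-normalized mixing weights for K models by gradient descent on a held-out split. It is a minimal NumPy-only sketch under stated assumptions: the function name, the binary cross-entropy surrogate loss, and all array shapes are hypothetical choices for illustration, not the authors' actual implementation.

```python
# Minimal sketch (assumptions: NumPy only; per-model predictions are class
# probabilities in [0, 1]; a validation split with multi-hot labels exists;
# binary cross-entropy is used as a surrogate for the GAP metric).
import numpy as np

def learn_ensemble_weights(preds, labels, steps=500, lr=0.5):
    """Learn softmax-normalized mixing weights for K models.

    preds  : (K, N, C) array of per-model class probabilities
    labels : (N, C) multi-hot ground-truth labels from a held-out split
    """
    K = preds.shape[0]
    logits = np.zeros(K)                          # unconstrained weight logits
    eps = 1e-7
    for _ in range(steps):
        w = np.exp(logits) / np.exp(logits).sum()      # softmax -> weights on simplex
        p = np.tensordot(w, preds, axes=1)             # (N, C) blended probabilities
        p = np.clip(p, eps, 1 - eps)
        # gradient of mean binary cross-entropy w.r.t. the blended probabilities
        dL_dp = (p - labels) / (p * (1 - p)) / labels.size
        dL_dw = np.einsum('nc,knc->k', dL_dp, preds)   # chain rule to the weights
        # backpropagate through the softmax
        dL_dlogits = w * (dL_dw - np.dot(w, dL_dw))
        logits -= lr * dL_dlogits
    return np.exp(logits) / np.exp(logits).sum()

# Usage: weights = learn_ensemble_weights(val_preds, val_labels)
#        test_blend = np.tensordot(weights, test_preds, axes=1)
```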

Keywords

Video classification · YouTube-8M dataset

Notes

Acknowledgement

The authors would like to thank the YouTube-8M Challenge organizers for hosting this exciting competition and for providing the excellent starter code, and the Axon team for supporting this project.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Choongyeun Cho (1)
  • Benjamin Antin (1, 2)
  • Sanchit Arora (3)
  • Shwan Ashrafi (1)
  • Peilin Duan (1)
  • Dang The Huynh (1)
  • Lee James (1)
  • Hang Tuan Nguyen (1)
  • Mojtaba Solgi (1)
  • Cuong Van Than (1)

  1. Axon, Seattle, USA
  2. Stanford University, Stanford, USA
  3. Upenn/Possible Finance, Seattle, USA