How Clever Is the FiLM Model, and How Clever Can It Be?

  • Alexander Kuhnle
  • Huiyuan Xie
  • Ann Copestake
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)

Abstract

The FiLM model achieves close-to-perfect performance on the diagnostic CLEVR dataset and is distinguished from other such models by having a comparatively simple and easily transferable architecture. In this paper, we investigate in more detail the ability of FiLM to learn various linguistic constructions. Our results indicate that (a) FiLM is not able to learn relational statements straight away except for very simple instances, (b) training on a broader set of instances as well as pretraining on simpler instance types can help alleviate these learning difficulties, and (c) mixing is less robust than pretraining and is very sensitive to the compositional structure of the dataset. Overall, our results suggest that the approach of big all-encompassing datasets and the paradigm of “the effectiveness of data” may have fundamental limitations.
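To illustrate the two training regimes contrasted in the abstract, the following minimal Python sketch shows the difference between a "mixing" schedule (simple and complex instances drawn jointly from the start) and a "pretraining" schedule (simple instances first, then the broader mixture). This is not the authors' code; train_step, the batch size of 32, and the toy instance lists are hypothetical placeholders standing in for the actual FiLM training setup.

```python
# Hypothetical sketch (not the authors' code): contrasts the "mixing" and
# "pretraining" regimes referred to in the abstract. All names are placeholders.

import random


def train_step(model_state, batch):
    """Placeholder for one optimisation step on a batch of (image, caption) instances."""
    model_state["steps"] += 1
    return model_state


def run_mixing(simple_data, complex_data, iterations):
    """Mixing regime: every batch is drawn from the union of instance types."""
    state = {"steps": 0}
    pool = simple_data + complex_data
    for _ in range(iterations):
        batch = random.sample(pool, k=min(32, len(pool)))
        state = train_step(state, batch)
    return state


def run_pretraining(simple_data, complex_data, pretrain_iters, finetune_iters):
    """Pretraining regime: simple instances first, then the broader mixture."""
    state = {"steps": 0}
    for _ in range(pretrain_iters):
        batch = random.sample(simple_data, k=min(32, len(simple_data)))
        state = train_step(state, batch)
    pool = simple_data + complex_data
    for _ in range(finetune_iters):
        batch = random.sample(pool, k=min(32, len(pool)))
        state = train_step(state, batch)
    return state


if __name__ == "__main__":
    # Toy stand-ins for simple (non-relational) and complex (relational) instances.
    simple = [("img", "there is a red square")] * 100
    complex_ = [("img", "a red square is to the left of a blue circle")] * 100
    print(run_mixing(simple, complex_, iterations=10))
    print(run_pretraining(simple, complex_, pretrain_iters=5, finetune_iters=5))
```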

Keywords

VQA · Synthetic data · Evaluation · Deep learning

Notes

Acknowledgments

We thank the anonymous reviewers for their constructive feedback. AK is grateful for being supported by a Qualcomm Research Studentship and an EPSRC Doctoral Training Studentship.

References

  1. Agrawal, A., Batra, D., Parikh, D.: Analyzing the behavior of visual question answering models. In: Su, J., Duh, K., Carreras, X. (eds.) Proceedings of the Conference on Empirical Methods in Natural Language Processing. EMNLP 2016, pp. 1955–1960. Association for Computational Linguistics, Stroudsburg (2016)
  2. Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. ICCV 2015. IEEE Computer Society, Washington, DC (2015)
  3. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML 2009, pp. 41–48. ACM, New York (2009)
  4. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Màrquez, L., Callison-Burch, C., Su, J. (eds.) Proceedings of the Conference on Empirical Methods in Natural Language Processing. EMNLP 2015, pp. 632–642. Association for Computational Linguistics, Stroudsburg (2015)
  5. Elman, J.L.: Learning and development in neural networks: the importance of starting small. Cognition 48(1), 71–99 (1993)
  6. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2017, pp. 6325–6334. IEEE Computer Society, Washington, DC (2017)
  7. Halevy, A., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intell. Syst. 24(2), 8–12 (2009)
  8. Hu, R., Andreas, J., Rohrbach, M., Darrell, T., Saenko, K.: Learning to reason: end-to-end module networks for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. ICCV 2017. IEEE Computer Society, Washington, DC (2017)
  9. Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. In: Proceedings of the International Conference on Learning Representations. ICLR 2018 (2018)
  10. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2017. IEEE Computer Society, Washington, DC (2017)
  11. Johnson, J., et al.: Inferring and executing programs for visual reasoning. In: Proceedings of the IEEE International Conference on Computer Vision. ICCV 2017. IEEE Computer Society, Washington, DC (2017)
  12. Kuhnle, A., Copestake, A.: ShapeWorld - a new test methodology for multimodal language understanding. arXiv e-prints 1704.04517 (2017)
  13. Kuhnle, A., Copestake, A.: Deep learning evaluation using deep linguistic processing. In: Walker, M., Ji, H., Stent, A. (eds.) Proceedings of the Workshop on Generalization in the Age of Deep Learning. NAACL 2018, pp. 17–23. Association for Computational Linguistics, Stroudsburg (2018)
  14. Mascharka, D., Tran, P., Soklaski, R., Majumdar, A.: Transparency by design: closing the gap between performance and interpretability in visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2018. IEEE Computer Society, Washington, DC (2018)
  15. Mudrakarta, P.K., Taly, A., Sundararajan, M., Dhamdhere, K.: Did the model understand the question? arXiv e-prints 1805.05492 (2018)
  16. Perez, E., Strub, F., de Vries, H., Dumoulin, V., Courville, A.C.: FiLM: visual reasoning with a general conditioning layer. In: AAAI 2018. AAAI Press, Palo Alto (2018)
  17. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Su, J., Duh, K., Carreras, X. (eds.) Proceedings of the Conference on Empirical Methods in Natural Language Processing. EMNLP 2016, pp. 2383–2392. Association for Computational Linguistics, Stroudsburg (2016)
  18. Santoro, A., et al.: A simple neural network module for relational reasoning. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pp. 4974–4983. Curran Associates Inc., Red Hook (2017)
  19. Suarez, J., Johnson, J., Li, F.: DDRprog: a CLEVR differentiable dynamic reasoning programmer. arXiv e-prints 1803.11361 (2018)
  20. Suhr, A., Lewis, M., Yeh, J., Artzi, Y.: A corpus of natural language for visual reasoning. In: Barzilay, R., Kan, M.Y. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. ACL 2017. Association for Computational Linguistics, Stroudsburg (2017)
  21. Weston, J., Bordes, A., Chopra, S., Mikolov, T.: Towards AI-complete question answering: a set of prerequisite toy tasks. arXiv e-prints 1502.05698 (2015)
  22. Yang, G.R., Ganichev, I., Wang, X.J., Shlens, J., Sussillo, D.: A dataset and architecture for visual reasoning with a working memory. arXiv e-prints 1803.06092 (2018)
  23. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked attention networks for image question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2016. IEEE Computer Society, Washington, DC (2016)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
