Modality-Specific Learning Rate Control for Multimodal Classification

  • Naotsuna Fujimori
  • Rei Endo
  • Yoshihiko Kawai
  • Takahiro Mochizuki
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12047)


Multimodal machine learning is an approach to performing tasks whose inputs contain multiple representations of a single subject. Many recent studies address multimodal machine learning within the framework of deep neural networks. Conventionally, a common learning rate has been used for the networks of all modalities. When the convergence rate and generalization performance differ among modalities, however, this leads to overfitting in some modality models and thus to a decrease in overall accuracy. In this paper, we propose a method that solves this problem by constructing a model within the framework of multitask learning: the model simultaneously learns modality-specific classifiers as well as a multimodal classifier, so that overfitting can be detected in each modality and early stopping applied to each modality separately. We evaluated the proposed method on several datasets and demonstrated that it improves classification accuracy.
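The core mechanism described above — monitoring each modality-specific classifier for overfitting and stopping its training separately — can be sketched as a small bookkeeping class. This is a minimal illustrative sketch, not the authors' implementation; the class name, `patience` parameter, and modality names are all assumptions introduced here for illustration.

```python
# Hypothetical sketch of per-modality early stopping: each modality-specific
# classifier's validation loss is tracked separately, and a modality whose
# loss has not improved for `patience` epochs is frozen (its parameters stop
# being updated) while the rest of the model continues training.

class ModalityEarlyStopper:
    """Track validation loss per modality and flag overfitting modalities."""

    def __init__(self, modalities, patience=3):
        self.patience = patience
        self.best = {m: float("inf") for m in modalities}
        self.bad_epochs = {m: 0 for m in modalities}
        self.frozen = {m: False for m in modalities}

    def update(self, val_losses):
        """val_losses: dict mapping modality name -> validation loss.

        Returns the current frozen-status dict; the training loop would
        skip gradient updates for any modality marked frozen.
        """
        for m, loss in val_losses.items():
            if self.frozen[m]:
                continue  # this modality has already been stopped
            if loss < self.best[m]:
                self.best[m] = loss
                self.bad_epochs[m] = 0
            else:
                self.bad_epochs[m] += 1
                if self.bad_epochs[m] >= self.patience:
                    self.frozen[m] = True  # stop training this modality
        return self.frozen
```

For example, with `patience=2`, a text branch whose validation loss rises for two consecutive epochs would be frozen while an image branch that keeps improving continues to train — capturing the paper's idea that modalities with different convergence rates should not share a single stopping point.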


Keywords: Deep neural network · Multimodal machine learning · Training algorithm · Learning rate control



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Naotsuna Fujimori (1, corresponding author)
  • Rei Endo (1)
  • Yoshihiko Kawai (1)
  • Takahiro Mochizuki (1)

  1. Science & Technology Research Laboratories, Japan Broadcasting Corporation (NHK), Setagaya-ku, Japan
