Cross-modality earth mover’s distance-driven convolutional neural network for different-modality data

  • Zheng Zuo
  • Liang Liu (corresponding author)
  • Jiayong Liu
  • Cheng Huang
Original Article


Cross-modality matching is the problem of measuring the similarity or dissimilarity between a pair of data points from different modalities, such as an image and a text. Deep neural networks are widely used to represent data points of different modalities because they extract effective features. However, existing works compare the deep features of multiple modalities with simple distance metrics, which do not fit the nature of cross-modality matching: they force the features of different modalities to have the same dimension and do not allow cross-feature matching. To solve this problem, we propose to use convolutional neural network (CNN) models with a softmax activation layer to represent a pair of different-modality data points as two histograms (not necessarily of the same dimension) and to compare them using the earth mover’s distance (EMD). The EMD can freely match the features extracted by the two CNN models of different modalities. Moreover, we develop a joint learning framework that learns the CNN parameters specifically for the EMD-driven comparison, supervised by the relevance/irrelevance labels of the data pairs of different modalities. Experiments on applications such as image–text retrieval and malware detection show its advantage over existing cross-modality matching methods.
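The comparison step described above can be illustrated with a minimal sketch. It assumes the two CNN branches have already produced raw score vectors (the numbers below are made up), turns each into a histogram with softmax, and computes the EMD as a transportation problem solved by a generic successive-shortest-path min-cost flow. The ground-distance matrix `ground`, the function names, and the solver choice are illustrative assumptions; the abstract does not specify them, and this sketch does not cover the joint learning of CNN parameters.

```python
import math

def softmax(scores):
    """Turn raw scores into a histogram that sums to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def emd(p, q, ground, eps=1e-12):
    """EMD between histograms p (m bins) and q (n bins), m and n may differ.

    ground[i][j] is an assumed cost of moving mass from bin i of p to bin j of q.
    Solved as min-cost flow: source -> p bins -> q bins -> sink.
    """
    m, n = len(p), len(q)
    src, snk, size = m + n, m + n + 1, m + n + 2
    edges = []                      # [target, residual capacity, cost]; reverse of k is k ^ 1
    adj = [[] for _ in range(size)]

    def add_edge(u, v, cap, cost):
        adj[u].append(len(edges)); edges.append([v, cap, cost])
        adj[v].append(len(edges)); edges.append([u, 0.0, -cost])

    for i in range(m):
        add_edge(src, i, p[i], 0.0)
        for j in range(n):
            add_edge(i, m + j, math.inf, ground[i][j])
    for j in range(n):
        add_edge(m + j, snk, q[j], 0.0)

    total_cost = 0.0
    while True:
        # Bellman-Ford: cheapest residual path from source to sink.
        dist = [math.inf] * size
        prev_edge = [-1] * size
        dist[src] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if dist[u] == math.inf:
                    continue
                for k in adj[u]:
                    v, cap, cost = edges[k]
                    if cap > eps and dist[u] + cost < dist[v] - 1e-15:
                        dist[v] = dist[u] + cost
                        prev_edge[v] = k
                        changed = True
            if not changed:
                break
        if dist[snk] == math.inf:
            break                   # no mass left to move
        # Find the bottleneck along the path, then push that much mass.
        push, v = math.inf, snk
        while v != src:
            k = prev_edge[v]
            push = min(push, edges[k][1])
            v = edges[k ^ 1][0]
        v = snk
        while v != src:
            k = prev_edge[v]
            edges[k][1] -= push
            edges[k ^ 1][1] += push
            v = edges[k ^ 1][0]
        total_cost += push * dist[snk]
    return total_cost

# Hypothetical branch outputs: 3 image features vs. 2 text features.
image_hist = softmax([2.0, 1.0, 0.1])
text_hist = softmax([0.5, 1.5])
ground = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]   # assumed 3 x 2 cost matrix
print(emd(image_hist, text_hist, ground))
```

Because the flow may route mass from any bin of one histogram to any bin of the other, the two histograms need not share a dimension, which is the property the abstract relies on. For larger histograms a dedicated linear-programming or optimal-transport solver would be a more practical choice than this small pure-Python routine.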


Deep learning · Convolutional neural network · Earth mover’s distance · Malware detection



This work was partly supported by the National Key Technology R&D Program of China (Grant No. 2017YFB0802900).

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflicts of interest.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  • Zheng Zuo (1)
  • Liang Liu (2, corresponding author)
  • Jiayong Liu (2)
  • Cheng Huang (2)
  1. College of Electronics and Information Engineering, Sichuan University, Chengdu, China
  2. College of Cybersecurity, Sichuan University, Chengdu, China
