Cross-modality earth mover’s distance-driven convolutional neural network for different-modality data

Zuo, Zheng; Liu, Liang; Liu, Jiayong; Huang, Cheng

doi:10.1007/s00521-019-04471-8

Original Article
Published: 13 September 2019

Volume 32, pages 9581–9592, (2020)

Neural Computing and Applications

Zheng Zuo¹,
Liang Liu²,
Jiayong Liu² &
…
Cheng Huang²

Abstract

Cross-modality matching is the problem of measuring the similarity or dissimilarity of a pair of data points from different modalities, such as an image and a text. Deep neural networks are widely used to represent data points of different modalities because they extract effective features. However, existing works compare the deep features of multiple modalities with simple distance metrics, which do not fit the nature of cross-modality matching: they force the features of different modalities to have the same dimension and do not allow cross-feature matching. To solve this problem, we propose to use convolutional neural network (CNN) models with a soft-max activation layer to represent a pair of different-modality data points as two histograms (not necessarily of the same dimension) and to compare their dissimilarity using the earth mover’s distance (EMD). The EMD can freely match the features extracted by the two CNN models of different modalities. Moreover, we develop a joint learning framework to learn the CNN parameters specifically for the EMD-driven comparison, supervised by the relevance/irrelevance labels of the data pairs of different modalities. Experiments on applications such as image–text retrieval and malware detection show its advantage over existing cross-modality matching methods.
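
The central idea of the abstract, comparing two soft-max histograms of different lengths with the EMD, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes NumPy and SciPy are available, solves the standard transportation linear program with `scipy.optimize.linprog`, and uses random logits and a random ground-distance matrix as hypothetical placeholders for the CNN outputs and the cross-feature costs, neither of which is specified in the abstract.

```python
import numpy as np
from scipy.optimize import linprog


def softmax(z):
    """Map a real-valued feature vector to a histogram (non-negative, sums to 1)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def emd(p, q, cost):
    """Earth mover's distance between histograms p (length m) and q (length n).

    cost[i, j] is the ground distance for moving mass from bin i of p to
    bin j of q. Returns the minimal transport cost and the optimal flow.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    cost = np.asarray(cost, dtype=float)
    m, n = cost.shape
    # The flow matrix is flattened row-major, so entry (i, j) is variable i*n + j.
    # Row constraints: total flow leaving bin i of p equals p[i].
    row = np.kron(np.eye(m), np.ones((1, n)))
    # Column constraints: total flow entering bin j of q equals q[j].
    col = np.kron(np.ones((1, m)), np.eye(n))
    res = linprog(
        cost.ravel(),
        A_eq=np.vstack([row, col]),
        b_eq=np.concatenate([p, q]),
        bounds=(0, None),  # flows are non-negative
    )
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(m, n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for the outputs of two modality-specific CNNs.
    image_hist = softmax(rng.normal(size=4))  # 4-bin histogram
    text_hist = softmax(rng.normal(size=6))   # 6-bin histogram
    # Hypothetical ground distances between image bins and text bins.
    ground = rng.uniform(size=(4, 6))
    dist, flow = emd(image_hist, text_hist, ground)
    print(f"EMD = {dist:.4f}")
```

Because the flow matrix is m-by-n, the two histograms need not have the same number of bins, and any bin of one modality can send mass to any bin of the other. This is the cross-feature matching that a plain Euclidean distance between equal-length feature vectors cannot express.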

Acknowledgements

This work was partly supported by the National Key Technology R&D Program of China (Grant No. 2017YFB0802900).

Author information

Authors and Affiliations

College of Electronics and Information Engineering, Sichuan University, Chengdu, 610064, China
Zheng Zuo
College of Cybersecurity, Sichuan University, Chengdu, 610064, China
Liang Liu, Jiayong Liu & Cheng Huang

Corresponding author

Correspondence to Liang Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zuo, Z., Liu, L., Liu, J. et al. Cross-modality earth mover’s distance-driven convolutional neural network for different-modality data. Neural Comput & Applic 32, 9581–9592 (2020). https://doi.org/10.1007/s00521-019-04471-8

Received: 03 October 2018
Accepted: 09 May 2019
Published: 13 September 2019
Issue Date: July 2020
DOI: https://doi.org/10.1007/s00521-019-04471-8
