Cross-Modal Hamming Hashing

  • Yue Cao
  • Bin Liu
  • Mingsheng Long
  • Jianmin Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


Cross-modal hashing enables similarity retrieval across different content modalities, such as searching for relevant images in response to text queries. It offers advantages in both computational efficiency and retrieval quality for multimedia retrieval. Hamming space retrieval enables efficient constant-time search that returns the data items within a given Hamming radius of each query through hash lookups instead of a linear scan. However, Hamming space retrieval is ineffective in existing cross-modal hashing methods because of their weak capability of concentrating relevant items within a small Hamming ball; worse still, the Hamming distances between hash codes from different modalities are inevitably large owing to the large heterogeneity across modalities. This work presents Cross-Modal Hamming Hashing (CMHH), a novel deep cross-modal hashing approach that generates compact and highly concentrated hash codes to enable efficient and effective Hamming space retrieval. The main idea is to penalize significantly the similar cross-modal pairs whose Hamming distance is larger than the Hamming radius threshold, by designing a pairwise focal loss based on the exponential distribution. Extensive experiments demonstrate that CMHH generates highly concentrated hash codes and achieves state-of-the-art cross-modal retrieval performance in both the hash lookup and the linear scan scenarios on three benchmark datasets: NUS-WIDE, MIRFlickr-25K, and IAPR TC-12.
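The two mechanisms named in the abstract, Hamming-ball retrieval by hash lookup and a pairwise loss that grows as similar pairs drift apart in Hamming space, can be illustrated with a small standard-library sketch. The helper names (`build_table`, `hamming_ball_lookup`, `exp_focal_pair_loss`) and the parameters `beta` and `gamma` are hypothetical. The loss is an assumption for illustration only: it models the probability that a pair is similar as exp(-beta * distance) and applies a focal-style weight (1 - p_t) ** gamma. It is not the exact CMHH objective, whose probability function, weighting, and continuous relaxation of the Hamming distance used during training are not reproduced here.

```python
import math
from collections import defaultdict
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as Python ints."""
    return bin(a ^ b).count("1")


def build_table(codes):
    """Hypothetical helper: map each binary code to the ids hashed to it."""
    table = defaultdict(list)
    for item_id, code in enumerate(codes):
        table[code].append(item_id)
    return table


def hamming_ball_lookup(table, query: int, n_bits: int, radius: int):
    """Return ids of all items whose code lies within `radius` of `query`.

    Probes every code reachable by flipping at most `radius` bits, so the
    number of probes depends on n_bits and radius, not on database size.
    """
    hits = []
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            probe = query
            for p in positions:
                probe ^= 1 << p  # flip bit p
            hits.extend(table.get(probe, []))
    return sorted(hits)


def exp_focal_pair_loss(dist: float, similar: bool,
                        beta: float = 0.5, gamma: float = 2.0) -> float:
    """Illustrative focal-style pairwise loss on a Hamming distance.

    ASSUMPTION (not the paper's exact formula): p = exp(-beta * dist) is the
    probability that the pair is similar; the focal weight (1 - p_t) ** gamma
    down-weights easy pairs so that distant similar pairs dominate the loss.
    """
    p = math.exp(-beta * dist)
    p_t = p if similar else 1.0 - p
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

For example, with 4-bit codes `[0b0000, 0b0001, 0b0011, 0b1111]`, a lookup around `0b0000` with radius 1 returns items 0 and 1, and widening the radius to 2 adds item 2. The lookup also shows why concentration matters: the number of probed buckets grows combinatorially with the radius (for 32-bit codes and radius 2 it is 1 + 32 + 496 = 529 buckets per query), so relevant items must fall within a small radius for hash lookups to stay cheap.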


Keywords: Deep hashing, Cross-modal hashing, Hamming space retrieval



This work is supported by the National Key R&D Program of China (2016YFB1000701) and the National Natural Science Foundation of China (61772299, 61672313, 71690231).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yue Cao (1, 2)
  • Bin Liu (1, 2)
  • Mingsheng Long (1, 2), corresponding author
  • Jianmin Wang (1, 2)
  1. School of Software, Tsinghua University, Beijing, China
  2. National Engineering Laboratory for Big Data Software, Beijing National Research Center for Information Science and Technology, Beijing, China
