
Multimedia Tools and Applications, Volume 78, Issue 24, pp 35607–35631

An image retrieval method based on semantic matching with multiple positional representations

  • Chunye Li
  • Zhiping Zhou
  • Wei Zhang
Article

Abstract

Text-based image retrieval requires either manual annotation or automatic labeling by machines. Manual annotation is time-consuming, and a simple text description can hardly express the full content of an image. Moreover, existing deep models rely on the representation of a single sentence, so they cannot adequately capture contextualized local information during matching. In response to these problems, this paper presents a new retrieval approach based on image captioning. First, descriptive sentences are generated for the images with an image caption model. Then, for sentence matching, we propose a semantic matching model with multiple positional representations. Sentences are matched using two interrelated Bi-LSTMs together with an attention mechanism, and the final matching score is produced by aggregating the interactions between these different positional sentence representations. The sentence matching model then matches the retrieval sentence against the image description sentences in the image library. In our experiments, both the proposed image caption model and the sentence matching model achieve higher accuracy than competitive models, and our method completes the image retrieval task.
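
As a rough illustration of the matching idea, the sketch below (not the authors' implementation) encodes both sentences with a Bi-LSTM, keeps the hidden state at every position, scores all position pairs with a bilinear interaction, and aggregates the strongest interactions into a single matching score via k-max pooling. It uses PyTorch; the class name, hyperparameters, and the use of a single shared encoder (rather than the two interrelated Bi-LSTMs with attention described above) are simplifying assumptions.

```python
# Minimal sketch of sentence matching with multiple positional representations.
# Hypothetical names and hyperparameters; a shared Bi-LSTM stands in for the
# two interrelated Bi-LSTMs plus attention used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalMatcher(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, top_k=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Bi-LSTM: every position gets a 2*hidden_dim positional representation.
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        d = 2 * hidden_dim
        self.bilinear = nn.Bilinear(d, d, 1)   # interaction of one position pair
        self.top_k = top_k
        self.scorer = nn.Sequential(nn.Linear(top_k, top_k), nn.ReLU(),
                                    nn.Linear(top_k, 1))

    def forward(self, sent_a, sent_b):
        # sent_a: (batch, len_a), sent_b: (batch, len_b) word-id tensors
        h_a, _ = self.encoder(self.embed(sent_a))   # (batch, len_a, d)
        h_b, _ = self.encoder(self.embed(sent_b))   # (batch, len_b, d)
        b, la, d = h_a.shape
        lb = h_b.size(1)
        # Pairwise interactions between every positional representation of A and B.
        a_exp = h_a.unsqueeze(2).expand(b, la, lb, d)
        b_exp = h_b.unsqueeze(1).expand(b, la, lb, d)
        inter = self.bilinear(a_exp.reshape(-1, d), b_exp.reshape(-1, d))
        inter = inter.view(b, la * lb)
        # k-max pooling: keep the strongest interactions, then aggregate to a score.
        top = torch.topk(inter, k=min(self.top_k, inter.size(1)), dim=1).values
        if top.size(1) < self.top_k:                # pad if sentences are very short
            top = F.pad(top, (0, self.top_k - top.size(1)))
        return self.scorer(top).squeeze(-1)         # (batch,) matching scores

# Usage: score a query sentence against candidate caption sentences (toy word ids).
model = PositionalMatcher(vocab_size=1000)
query = torch.randint(1, 1000, (2, 12))
captions = torch.randint(1, 1000, (2, 15))
print(model(query, captions))
```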

Keywords

Image retrieval · Image caption · Sentence matching · Multiple positional representations

Notes

Acknowledgements

This work is supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province of the People’s Republic of China under Grant SJCX19_0797.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Internet of Things Engineering, Jiangnan University, Wuxi, People’s Republic of China
  2. Engineering Research Center of Internet of Things Technology Applications, Ministry of Education, Jiangnan University, Wuxi, People’s Republic of China