Single Shot Scene Text Retrieval

  • Lluís Gómez
  • Andrés Mafla
  • Marçal Rusiñol
  • Dimosthenis Karatzas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11218)

Abstract

Textual information found in scene images provides high-level semantic information about the image and its context, and it can be leveraged for better scene understanding. In this paper we address the problem of scene text retrieval: given a text query, the system must return all images containing the queried text. The novelty of the proposed model lies in the use of a single-shot CNN architecture that simultaneously predicts bounding boxes and a compact text representation of the words within them. In this way, the text-based image retrieval task can be cast as a simple nearest-neighbor search of the query text representation over the outputs of the CNN for the entire image database. Our experiments demonstrate that the proposed architecture outperforms the previous state of the art while offering a significant increase in processing speed.
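The retrieval pipeline described above can be sketched in a few lines. Detected words and text queries are both mapped into a PHOC (Pyramidal Histogram Of Characters) space, so retrieval reduces to a nearest-neighbor search over the embeddings the CNN predicts for the image database. The sketch below is illustrative, not the authors' exact configuration: the alphabet, the pyramid levels, the 50% region-assignment rule, and cosine-similarity ranking are simplified assumptions (the original PHOC of Almazán et al. additionally includes bigram levels).

```python
import numpy as np

# Illustrative alphabet and pyramid levels; the exact configuration used in
# the paper may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
LEVELS = (1, 2, 3, 4, 5)

def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Pyramidal Histogram Of Characters: a binary vector marking which
    characters appear in which horizontal region of the word, per level."""
    word = word.lower()
    n = len(word)
    vec = np.zeros(len(alphabet) * sum(levels), dtype=np.float32)
    offset = 0
    for level in levels:
        for i, ch in enumerate(word):
            if ch not in alphabet:
                continue
            # normalized horizontal extent of the i-th character
            c0, c1 = i / n, (i + 1) / n
            for region in range(level):
                r0, r1 = region / level, (region + 1) / level
                overlap = max(0.0, min(c1, r1) - max(c0, r0))
                # assign the character to any region it occupies by >= 50%
                if overlap / (c1 - c0) >= 0.5:
                    idx = offset + region * len(alphabet) + alphabet.index(ch)
                    vec[idx] = 1.0
        offset += level * len(alphabet)
    return vec

def retrieve(query, word_embeddings, image_ids, k=5):
    """Rank database images by cosine similarity between the query PHOC and
    the per-word embeddings predicted by the detection network."""
    q = phoc(query)
    norms = np.linalg.norm(word_embeddings, axis=1) * np.linalg.norm(q) + 1e-8
    sims = word_embeddings @ q / norms
    order = np.argsort(-sims)[:k]
    return [(image_ids[i], float(sims[i])) for i in order]
```

In the actual system the `word_embeddings` matrix would hold the CNN's predicted text representations for every detected word in the database, and the exhaustive similarity scan above would be replaced by an approximate nearest-neighbor index at scale.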

Keywords

Image retrieval · Scene text · Word spotting · Convolutional neural networks · Region proposal networks · PHOC

Notes

Acknowledgement

This work has been partially supported by the Spanish research project TIN2014-52072-P, the CERCA Programme/Generalitat de Catalunya, the H2020 Marie Skłodowska-Curie actions of the European Union, grant agreement No. 712949 (TECNIOspring PLUS), the Agency for Business Competitiveness of the Government of Catalonia (ACCIO), CEFIPRA Project 5302-1, and the project “aBSINTHE - AYUDAS FUNDACIÓN BBVA A EQUIPOS DE INVESTIGACION CIENTIFICA 2017”. We gratefully acknowledge the support of the NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Lluís Gómez (1)
  • Andrés Mafla (1)
  • Marçal Rusiñol (1)
  • Dimosthenis Karatzas (1)

  1. Computer Vision Center, Universitat Autònoma de Barcelona, Bellaterra (Barcelona), Spain