
Multimodal Object Recognition Using Deep Learning Representations Extracted from Images and Smartphone Sensors

  • Javier Ortega Bastida
  • Antonio-Javier Gallego (Email author)
  • Antonio Pertusa
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11401)

Abstract

In this work, we present a multimodal approach to object recognition from photographs taken with smartphones. The proposed method extracts neural codes from the input image using a Convolutional Neural Network (CNN) and combines them with a series of metadata gathered from the smartphone sensors at the moment the picture was taken. These metadata complement the visual contents and can provide additional cues for determining the target class. We add feature selection and metadata pre-processing, encoding textual features such as the kind of place where the picture was taken with Doc2Vec in order to preserve their semantics. The deep representations extracted from images and metadata are combined with early fusion and classified using different machine learning methods (k-Nearest Neighbors, Random Forests, and Support Vector Machines). Results show that metadata pre-processing is beneficial, that SVM outperforms kNN when using neural codes on the visual information, and that the combination of neural codes and metadata only improves the results slightly when the images are classified into very general categories.
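The early-fusion step described above can be illustrated with a minimal sketch. The arrays below are placeholders, not the paper's implementation: `cnn_codes` stands for the neural codes extracted from a pre-trained CNN, `doc2vec_codes` for the Doc2Vec embeddings of the textual metadata, and the dimensionalities and random data are assumptions chosen for illustration. Early fusion then amounts to concatenating the two representations into one feature vector before training a standard classifier such as an SVM.

```python
# Minimal sketch of early fusion of visual and metadata representations.
# Assumed placeholders: in the real pipeline, cnn_codes would come from a
# CNN feature extractor and doc2vec_codes from a trained Doc2Vec model.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder features: 100 samples, 1024-d neural codes, 50-d metadata codes.
cnn_codes = rng.normal(size=(100, 1024))    # visual representation (from a CNN)
doc2vec_codes = rng.normal(size=(100, 50))  # metadata representation (from Doc2Vec)
labels = rng.integers(0, 5, size=100)       # object classes

# Early fusion: concatenate both representations into a single feature vector.
fused = np.concatenate([cnn_codes, doc2vec_codes], axis=1)

# Classify the fused vectors with an SVM; a kNN or Random Forest classifier
# could be swapped in here in exactly the same way.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(fused, labels)
print(clf.score(fused, labels))
```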

Keywords

Multimodality · Object recognition · Metadata · Learning representations

Notes

Acknowledgment

This work was supported by the Pattern Recognition and Artificial Intelligence Group (PRAIg) from the University of Alicante, Spain.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Javier Ortega Bastida (1)
  • Antonio-Javier Gallego (1), Email author
  • Antonio Pertusa (1)
  1. Department of Software and Computing Systems, University of Alicante, San Vicente del Raspeig, Spain
