Scene Recognition in User Preference Prediction Based on Classification of Deep Embeddings and Object Detection

  • Andrey V. Savchenko
  • Alexandr G. Rassadin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11555)


In this paper, we consider the general scene recognition problem for analyzing user preferences based on the photos on their mobile phone. Special attention is paid to out-of-class detections and to efficient processing with MobileNet-based architectures. We propose a three-stage procedure. First, a pre-trained convolutional neural network (CNN) is used to extract embeddings of the input image at one of its last layers, which are then used to train a classifier, e.g., a support vector machine or a random forest. Second, we fine-tune the pre-trained network on the given training set and compute the predictions (scores) at the output of the resulting CNN. Finally, we perform object detection in the input image, and the resulting sparse vector of detected objects is classified. The decision is made by computing a weighted sum of the class posterior probabilities estimated by all three classifiers. Experimental results on a subset of the ImageNet dataset demonstrate that the proposed approach is up to 5% more accurate than conventional fine-tuned models.
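The late-fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the posterior vectors and fusion weights are hypothetical placeholders (in the paper's setting, the three vectors would come from the embedding-based classifier, the fine-tuned CNN, and the detected-object classifier, and the weights would be tuned on validation data).

```python
import numpy as np

# Hypothetical class posteriors for a 4-class scene problem, one vector per
# classifier: (1) SVM/random forest on deep embeddings, (2) fine-tuned CNN
# scores, (3) classifier over the sparse vector of detected objects.
p_embeddings = np.array([0.10, 0.60, 0.20, 0.10])
p_finetuned = np.array([0.05, 0.55, 0.30, 0.10])
p_objects = np.array([0.20, 0.40, 0.30, 0.10])

# Fusion weights (assumed values; in practice chosen on a validation set).
weights = np.array([0.4, 0.4, 0.2])


def fuse_posteriors(posteriors, weights):
    """Weighted sum of class posterior probabilities.

    Returns the fused distribution and the index of the predicted class.
    """
    fused = np.average(posteriors, axis=0, weights=weights)
    return fused, int(np.argmax(fused))


fused, label = fuse_posteriors(
    np.stack([p_embeddings, p_finetuned, p_objects]), weights)
```

With these placeholder numbers the fused distribution still sums to one (each input is a proper distribution and the weights sum to one), and the predicted label is the class with the highest weighted-average posterior.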


Keywords: Image recognition · Scene recognition · Convolutional neural network (CNN) · Object detection · Ensemble of classifiers · Classifier fusion



The article was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE University) in 2019 (grant No. 19-04-004) and by the Russian Academic Excellence Project “5–100”.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Laboratory of Algorithms and Technologies for Network Analysis, National Research University Higher School of Economics, Nizhny Novgorod, Russia
