Egocentric visual scene description based on human-object interaction and deep spatial relations among objects

  • Gulraiz Khan
  • Muhammad Usman Ghani
  • Aiman Siddiqi
  • Zahoor-ur-Rehman
  • Sanghyun Seo
  • Sung Wook Baik
  • Irfan Mehmood


Visual scene interpretation has been a major area of research in recent years. Recognition of human-object interaction is a fundamental step towards understanding visual scenes. Videos can be described via a variety of human-object interaction scenarios: both human and object are static (static-static), one is static while the other is dynamic (static-dynamic), or both are dynamic (dynamic-dynamic). This paper presents a unified framework for explaining these interactions between humans and a variety of objects, using deep learning as the pivot methodology. Human-object interaction is extracted through native machine learning techniques, while spatial relations are captured by training a convolutional neural network. We also address the recognition of human posture in detail to provide an egocentric visual description. After visual features are extracted, sequential minimal optimization is employed to train our model. The extracted interaction, spatial relations, and posture information are fed into a natural language generation module, along with the interacting object label, to generate a scene description. The proposed framework is evaluated on two state-of-the-art datasets, MSCOCO and the MSR3D Daily Activity dataset, achieving accuracies of 78% and 91.16%, respectively.
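The three interaction scenarios and the final language-generation step can be illustrated with a minimal sketch. This is not the authors' implementation: the `Track` type, the displacement threshold, and the sentence template are all hypothetical, chosen only to show how a static/dynamic label per entity and a few recognised attributes could be combined into a sentence.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Track:
    """Hypothetical container: one tracked entity and its centre per frame."""
    label: str
    centers: list  # list of (x, y) tuples, one per frame


def is_dynamic(track, threshold=5.0):
    """Treat an entity as dynamic if its centre moved more than `threshold`
    pixels between the first and last frame (an assumed, simplistic rule)."""
    (x0, y0), (x1, y1) = track.centers[0], track.centers[-1]
    return hypot(x1 - x0, y1 - y0) > threshold


def interaction_scenario(human, obj, threshold=5.0):
    """Map a human/object pair onto one of the three scenarios in the abstract."""
    h = is_dynamic(human, threshold)
    o = is_dynamic(obj, threshold)
    if h and o:
        return "dynamic-dynamic"
    if not h and not o:
        return "static-static"
    return "static-dynamic"


def describe(posture, interaction, relation, object_label):
    """Fill a fixed sentence template (an assumed stand-in for the
    natural language generation module)."""
    return (f"A person is {posture} and {interaction} the {object_label}, "
            f"which is {relation} the person.")


if __name__ == "__main__":
    person = Track("person", [(100, 200), (101, 200)])
    cup = Track("cup", [(150, 210), (190, 180)])
    print(interaction_scenario(person, cup))
    print(describe("sitting", "holding", "in front of", "cup"))
```

In the framework described above, the posture, interaction, and spatial-relation labels would come from the trained classifiers rather than being supplied by hand as they are here.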


Scene description; Classification; Surveillance; Human-object interaction; Spatial relations; Deep neural network



This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIP) (No. 2016R1A2B4011712) and by IGNITE, National Technology Fund, Pakistan, for the project entitled “Automatic Surveillance System for Video Sequences”.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Gulraiz Khan (1)
  • Muhammad Usman Ghani (1, 2)
  • Aiman Siddiqi (1)
  • Zahoor-ur-Rehman (3)
  • Sanghyun Seo (4)
  • Sung Wook Baik (5)
  • Irfan Mehmood (5), corresponding author

  1. Al-Khwarizmi Institute of Computer Science, UET, Lahore, Pakistan
  2. Department of Computer Science and Engineering, University of Engineering and Technology Lahore, Lahore, Pakistan
  3. Department of Computer Science, COMSATS University Islamabad, Attock Campus, Pakistan
  4. Department of Media Software, Sungkyul University, Anyang-si, South Korea
  5. Department of Software, Sejong University, Seoul, South Korea
