Scene Reconstruction for Storytelling in 360° Videos

  • Gonçalo Pinheiro (email author)
  • Nelson Alves
  • Luis Magalhães
  • Luís Agrellos
  • Miguel Guevara
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 273)

Abstract

In immersive and interactive content such as 360-degree videos, the user controls the camera, which poses a challenge to the content producer because the user may look wherever they want. This paper presents the concept and first steps towards the development of a framework that provides a workflow for storytelling in 360-degree videos. With the proposed framework, it will be possible to attach a sound to a source and, by taking advantage of binaural audio, help redirect the user's attention to where the content producer wants it. To present this kind of audio, the scene must be mapped/reconstructed in order to understand how the objects it contains interfere with the propagation of sound waves. The proposed system is capable of reconstructing the scene from a stereoscopic, still or motion 360-degree video provided in an equirectangular projection. The system also incorporates a module that detects and tracks people, mapping their motion from the real world to the 3D world. In this document we describe the technical decisions and implementations of the system. To the best of our knowledge, this system is the only one that has shown the capability to reconstruct scenes from a large variety of 360-degree footage and that allows binaural audio to be created from that reconstruction.
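As a rough illustration of what working with an equirectangular projection involves, the sketch below maps a pixel to a viewing direction on the unit sphere and then, given a depth value, to a 3D point. It is not taken from the paper: the function names, the axis convention (y up, z forward) and the assumption that a per-pixel depth is already available (for example from stereo matching) are all illustrative choices.

```python
import math

def equirect_pixel_to_direction(u, v, width, height):
    """Map pixel (u, v) of an equirectangular image to a unit direction.

    Assumed convention (illustrative, not from the paper): longitude grows
    left to right over [-pi, pi), latitude goes from +pi/2 at the top row
    to -pi/2 at the bottom row, y points up and z points forward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)

def direction_to_point(direction, depth):
    """Scale a unit direction by a depth (distance from the camera centre)."""
    return tuple(depth * c for c in direction)

# Hypothetical usage: a pixel at the image centre, assumed to be 2.5 m away.
d = equirect_pixel_to_direction(179.5, 89.5, 360, 180)
p = direction_to_point(d, 2.5)
```

Under this convention the image centre looks straight ahead along z, so a point reconstructed there lies on the forward axis at the assumed depth.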

Keywords

360 videos, Storytelling, Scene reconstruction, Binaural sound, Computer vision, Computer graphics, 3D reconstruction, People detection, People tracking

Notes

Acknowledgments

This article is a result of the project CHIC - Cooperative Holistic view on Internet and Content (project no. 24498), supported by the European Regional Development Fund (ERDF), through the Competitiveness and Internationalization Operational Program (COMPETE 2020) under the PORTUGAL 2020 Partnership Agreement.

Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2019

Authors and Affiliations

  • Gonçalo Pinheiro (1), email author
  • Nelson Alves (1)
  • Luis Magalhães (2)
  • Luís Agrellos (3)
  • Miguel Guevara (1)

  1. Centro de Computação Gráfica, Guimarães, Portugal
  2. University of Minho, Guimarães, Portugal
  3. GMK, Porto, Portugal