PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation

Shen, Zhijie; Lin, Chunyu; Liao, Kang; Nie, Lang; Zheng, Zishuo; Zhao, Yao

doi:10.1007/978-3-031-19769-7_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13661)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Existing panoramic depth estimation methods based on convolutional neural networks (CNNs) focus on removing panoramic distortions, but they fail to perceive panoramic structures efficiently because of the fixed receptive field in CNNs. This paper proposes the panorama transformer (named PanoFormer) to estimate depth in panorama images, using tangent patches from the spherical domain, learnable token flows, and panorama-specific metrics. In particular, we divide patches on the spherical tangent domain into tokens to reduce the negative effect of panoramic distortions. Since geometric structures are essential for depth estimation, the self-attention module is redesigned with an additional learnable token flow. In addition, considering the characteristics of the spherical domain, we present two panorama-specific metrics to comprehensively evaluate the performance of panoramic depth estimation models. Extensive experiments demonstrate that our approach significantly outperforms the state-of-the-art (SOTA) methods. Furthermore, the proposed method can be effectively extended to semantic panorama segmentation, a similar pixel-to-pixel task.
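
The abstract gives no implementation details, so the following is only an illustrative sketch of the general idea behind sampling a patch on the spherical tangent domain. It assumes an equirectangular input image and uses the standard inverse gnomonic projection; the function names `inverse_gnomonic` and `tangent_patch_pixels`, the 3x3 patch size, and the choice of step are hypothetical and are not taken from the paper.

```python
import math


def inverse_gnomonic(x, y, lat0, lon0):
    """Map tangent-plane offsets (x, y) at tangent point (lat0, lon0)
    back to (lat, lon) on the unit sphere (standard inverse gnomonic)."""
    rho = math.hypot(x, y)
    if rho == 0.0:
        return lat0, lon0
    c = math.atan(rho)
    s = math.cos(c) * math.sin(lat0) + y * math.sin(c) * math.cos(lat0) / rho
    lat = math.asin(max(-1.0, min(1.0, s)))  # clamp against rounding error
    lon = lon0 + math.atan2(
        x * math.sin(c),
        rho * math.cos(lat0) * math.cos(c) - y * math.sin(lat0) * math.sin(c),
    )
    return lat, lon


def tangent_patch_pixels(row, col, height, width, size=3):
    """Return (v, u) equirectangular pixel coordinates for a size x size
    patch laid out on the plane tangent to the sphere at pixel (row, col).
    Assumption: one step on the plane spans one pixel of angle at the equator."""
    lat0 = (0.5 - (row + 0.5) / height) * math.pi
    lon0 = ((col + 0.5) / width - 0.5) * 2.0 * math.pi
    step = math.tan(math.pi / height)
    half = size // 2
    coords = []
    for j in range(-half, half + 1):        # j > 0 moves down in the image
        for i in range(-half, half + 1):    # i > 0 moves right in the image
            lat, lon = inverse_gnomonic(i * step, -j * step, lat0, lon0)
            v = (0.5 - lat / math.pi) * height - 0.5
            u = ((lon / (2.0 * math.pi) + 0.5) * width - 0.5) % width
            coords.append((v, u))
    return coords


if __name__ == "__main__":
    # Compare the horizontal spacing of samples near the equator and near a pole.
    for r in (128, 5):
        patch = tangent_patch_pixels(r, 256, 256, 512)
        print(r, patch[5][1] - patch[4][1])
```

In this sketch the sampling positions are about one pixel apart near the equator and spread over many pixels near the poles, which is the stretching that equirectangular images introduce and that a fixed square window ignores.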

Acknowledgement

This work was supported by the National Key R&D Program of China (No. 2021ZD0112100) and the National Natural Science Foundation of China (Nos. 62172032, U1936212, 62120106009).

Author information

Authors and Affiliations

Institute of Information Science, Beijing Jiaotong University, Beijing, China
Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng & Yao Zhao
Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China
Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng & Yao Zhao
Max Planck Institute for Informatics, Saarbrücken, Germany
Kang Liao

Corresponding author

Correspondence to Chunyu Lin.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

About this paper

Cite this paper

Shen, Z., Lin, C., Liao, K., Nie, L., Zheng, Z., Zhao, Y. (2022). PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_12

DOI: https://doi.org/10.1007/978-3-031-19769-7_12
Published: 23 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19768-0
Online ISBN: 978-3-031-19769-7
eBook Packages: Computer Science, Computer Science (R0)
