Joint Semantic Segmentation and 3D Reconstruction from Monocular Video

Kundu, Abhijit; Li, Yin; Dellaert, Frank; Li, Fuxin; Rehg, James M.

doi:10.1007/978-3-319-10599-4_45

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 8694)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We present an approach for joint inference of 3D scene structure and semantic labeling for monocular video. Starting with a monocular image stream, our framework produces a 3D volumetric semantic + occupancy map, which is much more useful than a series of 2D semantic label images or a sparse point cloud produced by traditional semantic segmentation and Structure from Motion (SfM) pipelines, respectively. We derive a Conditional Random Field (CRF) model defined in 3D space that jointly infers the semantic category and occupancy for each voxel. Such joint inference in the 3D CRF paves the way for more informed priors and constraints, which are otherwise not possible if the two problems are solved separately in their traditional frameworks. We make use of class-specific semantic cues that constrain the 3D structure in areas where multiview constraints are weak. Our model comprises higher-order factors, which help when the depth is unobservable. We also make use of class-specific semantic cues either to reduce the degree of such higher-order factors or to approximately model them with unaries where possible. We demonstrate improved 3D structure and temporally consistent semantic segmentation for difficult, large-scale, forward-moving monocular image sequences.
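
The idea of inferring occupancy and semantic category together for each voxel can be illustrated with a toy sketch. The code below is not the authors' model: it assumes a hypothetical three-label set in which "free" stands for an unoccupied voxel, uses only unary costs and a pairwise Potts smoothness term (the higher-order factors mentioned in the abstract are omitted), and swaps in iterated conditional modes (ICM) as a simple stand-in for the actual inference procedure. All names (LABELS, neighbors, energy, icm) are illustrative.

```python
from itertools import product

# Hypothetical joint label set: "free" means the voxel is unoccupied,
# any other label means it is occupied and belongs to that class.
LABELS = ("free", "road", "building")


def neighbors(v, shape):
    """Yield the 6-connected neighbors of voxel v that lie inside the grid."""
    for axis in range(3):
        for step in (-1, 1):
            n = list(v)
            n[axis] += step
            if 0 <= n[axis] < shape[axis]:
                yield tuple(n)


def energy(labeling, unary, shape, smooth=1.0):
    """Unary costs plus a Potts penalty for each disagreeing neighbor pair."""
    total = 0.0
    for v in product(*(range(s) for s in shape)):
        total += unary[v][labeling[v]]
        for n in neighbors(v, shape):
            # n > v ensures each unordered pair is counted once.
            if n > v and labeling[n] != labeling[v]:
                total += smooth
    return total


def icm(unary, shape, smooth=1.0, sweeps=10):
    """Greedy coordinate descent on the energy (assumed stand-in inference)."""
    voxels = list(product(*(range(s) for s in shape)))
    # Start from the label with the lowest unary cost at each voxel.
    labeling = {v: min(range(len(LABELS)), key=lambda l: unary[v][l])
                for v in voxels}
    for _ in range(sweeps):
        changed = False
        for v in voxels:
            def local_cost(l):
                disagree = sum(labeling[n] != l for n in neighbors(v, shape))
                return unary[v][l] + smooth * disagree
            best = min(range(len(LABELS)), key=local_cost)
            if best != labeling[v]:
                labeling[v] = best
                changed = True
        if not changed:
            break
    return labeling


if __name__ == "__main__":
    # Three voxels in a row; the middle one has weak, ambiguous evidence.
    shape = (3, 1, 1)
    unary = {
        (0, 0, 0): [5.0, 0.0, 5.0],
        (1, 0, 0): [1.0, 1.2, 5.0],
        (2, 0, 0): [5.0, 0.0, 5.0],
    }
    result = icm(unary, shape)
    print([LABELS[result[(x, 0, 0)]] for x in range(3)])
```

In this made-up example the middle voxel on its own slightly prefers "free", but the smoothness term pulls it to agree with its confidently labeled neighbors. This mirrors, in a much simplified way, how semantic context can constrain structure where direct evidence is weak.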

Author information

Authors and Affiliations

Georgia Institute of Technology, Atlanta, USA
Abhijit Kundu, Yin Li, Frank Dellaert, Fuxin Li & James M. Rehg

Editor information

Editors and Affiliations

Department of Computer Science, University of Toronto, 6 King’s College Road, M5H 3S5, Toronto, ON, Canada
David Fleet
Faculty of Electrical Engineering, Department of Cybernetics, Czech Technical University in Prague, Technicka 2, 166 27, Prague 6, Czech Republic
Tomas Pajdla
Max-Planck-Institut für Informatik, Campus E1 4, 66123, Saarbrücken, Germany
Bernt Schiele
ESAT - PSI, iMinds, KU Leuven, Kasteelpark Arenberg 10, Bus 2441, 3001, Leuven, Belgium
Tinne Tuytelaars

Cite this paper

Kundu, A., Li, Y., Dellaert, F., Li, F., Rehg, J.M. (2014). Joint Semantic Segmentation and 3D Reconstruction from Monocular Video. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8694. Springer, Cham. https://doi.org/10.1007/978-3-319-10599-4_45

DOI: https://doi.org/10.1007/978-3-319-10599-4_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10598-7
Online ISBN: 978-3-319-10599-4
eBook Packages: Computer Science, Computer Science (R0)
