Multichannel Semantic Segmentation with Unsupervised Domain Adaptation

  • Kohei WatanabeEmail author
  • Kuniaki Saito
  • Yoshitaka Ushiku
  • Tatsuya Harada
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11133)


Most contemporary robots have depth sensors, and research on semantic segmentation with RGBD images has shown that depth images boost the accuracy of segmentation. Since it is time-consuming to annotate images with semantic labels per pixel, it would be ideal if we could avoid this laborious work by utilizing an existing dataset or a synthetic dataset which we can generate on our own. Robot motions are often tested in a synthetic environment, where multichannel (e.g., RGB + depth + instance boundary) images plus their pixel-level semantic labels are available. However, models trained simply on synthetic images tend to demonstrate poor performance on real images. In order to address this, we propose two approaches that can efficiently exploit multichannel inputs combined with an unsupervised domain adaptation (UDA) algorithm. One is a fusion-based approach that uses depth images as inputs. The other is a multitask learning approach that uses depth images as outputs. We demonstrated that the segmentation results were improved by using a multitask learning approach with a post-process and created a benchmark for this task.


Semantic segmentation Domain adaptation RGB-depth Multi-task learning 



The work was partially funded by the ImPACT Program of the Council for Science, Technology, and Innovation (Cabinet Office, Government of Japan).


  1. 1.
    Armeni, I., Sax, S., Zamir, A.R., Savarese, S.: Joint 2D–3D-semantic data for indoor scene understanding. arXiv:1702.01105 (2017)
  2. 2.
    Armeni, I., et al.: 3D semantic parsing of large-scale indoor spaces. In: CVPR (2016)Google Scholar
  3. 3.
    Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised pixel-level domain adaptation with generative adversarial networks. In: CVPR (2017)Google Scholar
  4. 4.
    Chen, Y.H., Chen, W.Y., Chen, Y.T., Tsai, B.C., Wang, Y.C.F., Sun, M.: No more discrimination: cross city adaptation of road scene segmenters. In: ICCV (2017)Google Scholar
  5. 5.
    Cheng, Y., Cai, R., Li, Z., Zhao, X., Huang, K.: Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation. In: CVPR (2017)Google Scholar
  6. 6.
    Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)Google Scholar
  7. 7.
    Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: CVPR (2017)Google Scholar
  8. 8.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)Google Scholar
  9. 9.
    Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: ICML (2014)Google Scholar
  10. 10.
    Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In: CVPR (2013)Google Scholar
  11. 11.
    Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). Scholar
  12. 12.
    Ha, Q., Watanabe, K., Karasawa, T., Ushiku, Y., Harada, T.: MFNet: towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In: IROS (2017)Google Scholar
  13. 13.
    Hazirbas, C., Ma, L., Domokos, C., Cremers, D.: FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10111, pp. 213–228. Springer, Cham (2017). Scholar
  14. 14.
    Hoffman, J., Gupta, S., Leong, J., Guadarrama, S., Darrell, T.: Cross-modal adaptation for RGB-D detection. In: ICRA (2016)Google Scholar
  15. 15.
    Hoffman, J., Wang, D., Yu, F., Darrell, T.: FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649 (2016)
  16. 16.
    Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: NIPS (2017)Google Scholar
  17. 17.
    Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: CVPR (2018)Google Scholar
  18. 18.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)Google Scholar
  19. 19.
    Kuga, R., Kanezaki, A., Samejima, M., Sugano, Y., Matsushita, Y.: Multi-task learning using multi-modal encoder-decoder networks with shared skip connections. In: ICCV Workshop (2017)Google Scholar
  20. 20.
    Lin, D., Chen, G., Cohen-Or, D., Heng, P.A., Huang, H.: Cascaded feature network for semantic segmentation of RGB-D images. In: ICCV (2017)Google Scholar
  21. 21.
    Lin, G., Milan, A., Shen, C., Reid, I.: RefineNet: multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In: CVPR (2017)Google Scholar
  22. 22.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)Google Scholar
  23. 23.
    Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: ICML (2015)Google Scholar
  24. 24.
    McCormac, J., Handa, A., Leutenegger, S., Davison, A.J.: SceneNet RGB-D: can 5m synthetic images beat generic ImageNet pre-training on indoor segmentation? In: ICCV (2017)Google Scholar
  25. 25.
    Morerio, P., Cavazza, J., Murino, V.: Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In: ICLR (2018)Google Scholar
  26. 26.
    Park, S.J., Hong, K.S., Lee, S.: RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In: ICCV (2017)Google Scholar
  27. 27.
    Qi, X., Liao, R., Jia, J., Fidler, S., Urtasun, R.: 3D graph neural networks for RGBD semantic segmentation. In: ICCV (2017)Google Scholar
  28. 28.
    Qiu, W., et al.: UnrealCV: virtual worlds for computer vision. In: ACMMM Open Source Software Competition (2017)Google Scholar
  29. 29.
    Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: ground truth from computer games. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 102–118. Springer, Cham (2016). Scholar
  30. 30.
    Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: CVPR (2018)Google Scholar
  31. 31.
    Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: NYU depth dataset V2. In: ECCV (2012)Google Scholar
  32. 32.
    Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: a RGB-D scene understanding benchmark suite. In: CVPR (2015)Google Scholar
  33. 33.
    Song, X., Herranz, L., Jiang, S.: Depth CNNs for RGB-D scene recognition: learning from scratch better than transferring from RGB-CNNs. In: AAAI (2017)Google Scholar
  34. 34.
    Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR (2017)Google Scholar
  35. 35.
    Xie, S., Tu, Z.: Holistically-nested edge detection. In: ICCV (2015)Google Scholar
  36. 36.
    Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: CVPR (2017)Google Scholar
  37. 37.
    Zhang, Y., David, P., Gong, B.: Curriculum domain adaptation for semantic segmentation of urban scenes. In: ICCV (2017)Google Scholar
  38. 38.
    Zhang, Y., et al.: Physically-based rendering for indoor scene understanding using convolutional neural networks. In: CVPR (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Kohei Watanabe
    • 1
    Email author
  • Kuniaki Saito
    • 1
  • Yoshitaka Ushiku
    • 1
  • Tatsuya Harada
    • 1
    • 2
  1. 1.The University of TokyoTokyoJapan
  2. 2.RIKENTokyoJapan

Personalised recommendations