Synthesizing Training Images for Semantic Segmentation

  • Yunhui Zhang
  • Zizhao Wu
  • Zhiping Zhou
  • Yigang Wang
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 875)


Semantic segmentation is one of the key problems in computer vision. Recently, Convolutional Neural Networks (CNNs) have achieved strong performance on the semantic segmentation task. However, CNNs require a large number of annotated training images, which are hard to obtain because annotation demands massive human labour. In this paper, we propose to use 3D models to automatically generate synthetic images with pixel-level annotations. By randomly sampling rendering parameters and adding random background patterns, we generate synthetic images with high diversity in object appearance and background clutter. We then combine the synthetic images with publicly available real-world images to augment the training samples for semantic segmentation. Experimental results demonstrate that CNNs trained with our synthetic images achieve improved performance on the semantic segmentation task on the PASCAL VOC 2012 dataset.
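The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 3D rendering step is left out entirely, and the code assumes a renderer has already produced a foreground image and an alpha (coverage) mask. The names `RenderParams`, `sample_render_params`, `random_background`, `composite`, and `build_training_set`, the parameter ranges, and the 0.5 alpha threshold are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class RenderParams:
    """Hypothetical rendering parameters; the paper's actual set may differ."""
    azimuth_deg: float
    elevation_deg: float
    distance: float
    light_intensity: float


def sample_render_params(rng):
    # Ranges are illustrative assumptions, not values taken from the paper.
    return RenderParams(
        azimuth_deg=rng.uniform(0.0, 360.0),
        elevation_deg=rng.uniform(-10.0, 60.0),
        distance=rng.uniform(1.5, 4.0),
        light_intensity=rng.uniform(0.5, 1.5),
    )


def random_background(rng, height, width):
    # Stand-in for a random background pattern: independent random RGB pixels.
    return [
        [tuple(rng.randint(0, 255) for _ in range(3)) for _ in range(width)]
        for _ in range(height)
    ]


def composite(foreground, alpha, background, class_id):
    """Blend a rendered object over a background and derive pixel labels.

    Pixels whose alpha exceeds 0.5 receive class_id; all others receive 0
    (background). The annotation therefore comes for free from the render.
    """
    image, labels = [], []
    for fg_row, a_row, bg_row in zip(foreground, alpha, background):
        img_row, lab_row = [], []
        for fg, a, bg in zip(fg_row, a_row, bg_row):
            img_row.append(
                tuple(round(a * f + (1.0 - a) * b) for f, b in zip(fg, bg))
            )
            lab_row.append(class_id if a > 0.5 else 0)
        image.append(img_row)
        labels.append(lab_row)
    return image, labels


def build_training_set(real, synthetic, rng):
    # Augment the real samples with synthetic ones, then shuffle the union.
    samples = list(real) + list(synthetic)
    rng.shuffle(samples)
    return samples
```

In this sketch each training sample would be an (image, labels) pair. A real pipeline would pass the sampled `RenderParams` to a 3D renderer to obtain `foreground` and `alpha`, and would more likely draw backgrounds from an image collection than from random noise.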


Semantic segmentation, Synthesizing training images, CNN, Augmentation, Generate synthetic images



This work was partially supported by the National Natural Science Foundation of China (No. 61602139) and Zhejiang Province science and technology planning project (2018C01030).



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Yunhui Zhang (1, email author)
  • Zizhao Wu (1)
  • Zhiping Zhou (2)
  • Yigang Wang (1)
  1. Digite Media Interactive Simulation Lab, Hangzhou Dianzi University, Hangzhou, China
  2. School of Computer Science, Hangzhou Dianzi University, Hangzhou, China
