Semantic Understanding of Scenes Through the ADE20K Dataset

  • Bolei Zhou
  • Hang Zhao
  • Xavier Puig
  • Tete Xiao
  • Sanja Fidler
  • Adela Barriuso
  • Antonio Torralba

Abstract

Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite the community's efforts in data collection, there are still few image datasets that cover a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present ADE20K, a densely annotated dataset spanning annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, the dataset contains 25k images of complex everyday scenes with a variety of objects in their natural spatial context; on average, each image contains 19.5 instances and 10.5 object classes. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and release open-source re-implementations of state-of-the-art models. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. Finally, we show that networks trained on ADE20K are able to segment a wide variety of scenes and objects.
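
As a concrete illustration of the synchronized batch normalization result mentioned in the abstract, the following minimal PyTorch sketch shows how per-GPU BatchNorm layers can be converted so that normalization statistics are aggregated across all GPUs under distributed training. This is our own illustration, not the authors' released code; the FCN-ResNet50 backbone and the 150-class output are assumptions made only for the example.

    # Minimal sketch (assumptions noted above): enable synchronized batch
    # normalization so BN statistics are computed over the effective global
    # batch rather than each GPU's small per-device batch.
    import torch
    import torchvision

    # Any segmentation model with BatchNorm layers would do; FCN-ResNet50
    # with 150 output classes is used purely for illustration.
    model = torchvision.models.segmentation.fcn_resnet50(num_classes=150)

    # Replace every nn.BatchNorm layer with nn.SyncBatchNorm. Under
    # DistributedDataParallel, mean and variance are then synchronized
    # across processes, which matters when each GPU only holds a few
    # high-resolution crops.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # In an actual distributed run the converted model would be wrapped, e.g.:
    # model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[rank])

Because segmentation models are typically trained with only a handful of high-resolution crops per GPU, synchronizing the statistics restores a reasonably large effective batch size for normalization, which is the effect the paper evaluates.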

Keywords

Scene understanding · Semantic segmentation · Instance segmentation · Image dataset · Deep neural networks

Acknowledgements

This work was partially supported by Samsung and NSF Grant No. 1524817 to AT, and by CUHK Direct Grant for Research 2018/2019 No. 4055098 to BZ. SF acknowledges support from NSERC.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Information Engineering, The Chinese University of Hong Kong, Shatin, China
  2. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, USA
  3. School of Electronic Engineering and Computer Science, Peking University, Beijing, China
  4. Department of Computer Science, University of Toronto, Toronto, Canada