Improved Semantic Stixels via Multimodal Sensor Fusion

  • Florian Piewak
  • Peter Pinggera
  • Markus Enzweiler
  • David Pfeiffer
  • Marius Zöllner
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11269)


This paper presents a compact and accurate representation of 3D scenes observed by a LiDAR sensor and a monocular camera. The proposed method is based on the well-established Stixel model originally developed for stereo vision applications. We extend this Stixel concept to incorporate data from multiple sensor modalities. The resulting mid-level fusion scheme takes full advantage of the geometric accuracy of LiDAR measurements as well as the high resolution and semantic detail of RGB images. The obtained environment model provides a geometrically and semantically consistent representation of the 3D scene while significantly reducing the amount of data and minimizing information loss. Since the different sensor modalities serve as input to a joint optimization problem, the solution is obtained with only minor computational overhead. We demonstrate the effectiveness of the proposed multimodal Stixel algorithm on a manually annotated ground truth dataset. Our results indicate that, compared to using a single modality on its own, the proposed mid-level fusion of LiDAR and camera data significantly improves both the geometric and semantic accuracy of the Stixel model while reducing the computational overhead as well as the amount of generated data.
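The joint optimization behind the Stixel model can be illustrated with a simplified single-column sketch: each image column is partitioned into vertical segments by dynamic programming, where every candidate segment pays a geometric cost (deviation of its depth measurements, e.g. projected LiDAR, from a constant segment depth), a semantic cost (negative log-likelihood of one class label for the whole segment, e.g. from a camera-based CNN), and a fixed model-complexity penalty. The code below is an illustrative sketch under these assumptions, not the authors' implementation; the cost terms, the `seg_penalty` value, and the O(H²) formulation are all simplifications for clarity.

```python
import numpy as np

def stixel_column_dp(depths, sem_probs, seg_penalty=2.0):
    """Partition one image column into Stixel-like segments via DP.

    depths:    (H,) per-pixel depth estimates (e.g. projected LiDAR).
    sem_probs: (H, C) per-pixel class probabilities (e.g. from a CNN).
    Returns a list of (start_row, end_row, depth, class_id) segments.
    """
    H, _ = sem_probs.shape

    def segment_cost(i, j):
        # Cost of a single segment spanning rows i..j (inclusive).
        d = depths[i:j + 1]
        # Geometric term: absolute deviation from the segment's median depth.
        geo = np.abs(d - np.median(d)).sum()
        # Semantic term: best single class label for the whole segment.
        sem = -np.log(sem_probs[i:j + 1] + 1e-9).sum(axis=0)
        cls = int(np.argmin(sem))
        return geo + sem[cls] + seg_penalty, float(np.median(d)), cls

    best = np.full(H + 1, np.inf)  # best[j]: min cost of covering rows 0..j-1
    best[0] = 0.0
    back = [None] * (H + 1)
    for j in range(1, H + 1):
        for i in range(j):
            c, d, cls = segment_cost(i, j - 1)
            if best[i] + c < best[j]:
                best[j] = best[i] + c
                back[j] = (i, d, cls)

    # Backtrack the optimal segmentation.
    segs, j = [], H
    while j > 0:
        i, d, cls = back[j]
        segs.append((i, j - 1, d, cls))
        j = i
    return segs[::-1]
```

The key point of the mid-level fusion is visible in `segment_cost`: the LiDAR-driven geometric term and the camera-driven semantic term enter the same per-segment cost, so a single dynamic program jointly decides segment boundaries, depths, and class labels instead of fusing two independently computed results.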



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Daimler AG, R&D, Stuttgart, Germany
  2. Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
  3. Forschungszentrum Informatik (FZI), Karlsruhe, Germany