What Makes Good Synthetic Training Data for Learning Disparity and Optical Flow Estimation?

  • Nikolaus Mayer
  • Eddy Ilg
  • Philipp Fischer
  • Caner Hazirbas
  • Daniel Cremers
  • Alexey Dosovitskiy
  • Thomas Brox

Abstract

The finding that very large networks can be trained efficiently and reliably has led to a paradigm shift in computer vision from engineered solutions to learning formulations. As a result, the research challenge shifts from devising algorithms to creating suitable and abundant training data for supervised learning. How can such training data be created efficiently? The dominant data acquisition method in visual recognition is based on web data and manual annotation. Yet, for many computer vision problems, such as stereo or optical flow estimation, this approach is not feasible because humans cannot manually enter a pixel-accurate flow field. In this paper, we promote the use of synthetically generated data for the purpose of training deep networks on such tasks. We suggest multiple ways to generate such data and evaluate the influence of dataset properties on the performance and generalization of the resulting networks. We also demonstrate the benefit of learning schedules that use different types of data at selected stages of the training process.
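To make the central idea concrete, the following is a minimal sketch of how a synthetic training pair with exact ground truth can be produced: a textured foreground patch is composited onto a background at two positions, so the displacement of every pixel is known by construction. This is an illustrative toy in the spirit of the simple 2D datasets discussed in the paper, not the authors' actual rendering pipeline; all function names and parameters below are assumptions made for the example.

```python
# Toy synthetic optical-flow pair: a moving foreground layer over a static
# background, with the dense ground-truth flow known by construction.
import numpy as np

def make_pair(background, sprite, rng, max_shift=20):
    """Composite `sprite` onto `background` at two positions and return
    (frame1, frame2, flow). `flow` has shape (H, W, 2) and is zero for the
    static background pixels."""
    H, W = background.shape[:2]
    h, w = sprite.shape[:2]
    # Random initial placement and a random translation between the frames.
    y0 = int(rng.integers(0, H - h - max_shift))
    x0 = int(rng.integers(0, W - w - max_shift))
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    dy = int(np.clip(dy, -y0, H - h - y0))
    dx = int(np.clip(dx, -x0, W - w - x0))

    frame1 = background.copy()
    frame2 = background.copy()
    frame1[y0:y0 + h, x0:x0 + w] = sprite
    frame2[y0 + dy:y0 + dy + h, x0 + dx:x0 + dx + w] = sprite

    # Ground truth: sprite pixels move by (dx, dy); everything else is static.
    flow = np.zeros((H, W, 2), dtype=np.float32)
    flow[y0:y0 + h, x0:x0 + w, 0] = dx
    flow[y0:y0 + h, x0:x0 + w, 1] = dy
    return frame1, frame2, flow

rng = np.random.default_rng(0)
bg = rng.random((384, 512, 3)).astype(np.float32)  # stand-in for a photo background
fg = rng.random((64, 64, 3)).astype(np.float32)    # stand-in for a rendered object
f1, f2, gt_flow = make_pair(bg, fg, rng)
```

Real pipelines render 3D models with lighting, textures, and camera motion; the point of the sketch is only that synthetic generation yields pixel-accurate ground truth for free, which is exactly what manual annotation cannot provide for flow and disparity.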

Keywords

Deep learning · Data generation · Synthetic ground truth · FlowNet · DispNet

Supplementary material

Supplementary material 1: 11263_2018_1082_MOESM1_ESM.pdf (PDF, 5088 KB)

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Nikolaus Mayer (1)
  • Eddy Ilg (1)
  • Philipp Fischer (1)
  • Caner Hazirbas (2)
  • Daniel Cremers (2)
  • Alexey Dosovitskiy (1)
  • Thomas Brox (1)

  1. University of Freiburg, Freiburg im Breisgau, Germany
  2. Technical University of Munich, Munich, Germany