Recurrent Temporal Deep Field for Semantic Video Labeling

  • Peng Lei
  • Sinisa Todorovic
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)


This paper specifies a new deep architecture, called Recurrent Temporal Deep Field (RTDF), for semantic video labeling. RTDF is a conditional random field (CRF) that combines a deconvolution neural network (DeconvNet) and a recurrent temporal restricted Boltzmann machine (RTRBM). DeconvNet is grounded onto the pixels of a new frame and estimates the unary potential of the CRF. RTRBM estimates a high-order potential of the CRF by capturing long-term spatiotemporal dependencies among the pixel labels that RTDF has already predicted in previous frames. We derive a mean-field inference algorithm that jointly predicts all latent variables in both the RTRBM and the CRF. We also conduct end-to-end joint training of the parameters of DeconvNet, RTRBM, and the CRF. Joint learning and inference integrate the three components into a unified deep model, RTDF. Our evaluation on the benchmark YouTube Face Database (YFDB) and Cambridge-driving Labeled Video Database (CamVid) demonstrates that RTDF outperforms the state of the art both qualitatively and quantitatively.
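To make the interplay of the components concrete, the following is a minimal NumPy sketch of a mean-field update for a single frame. It is not the inference algorithm derived in the paper. It is a simplified illustration that assumes the unary scores (here the array `unary`) come from some per-pixel network, that the high-order term is an RBM-style coupling between labels and hidden units, that history enters only through a hidden-unit bias computed from a previous hidden state, and that pairwise CRF terms are omitted. All names (`mean_field_frame`, `W`, `W_rec`, `r_prev`, `b_hidden`) and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_frame(unary, W, b_hidden, W_rec, r_prev, n_iters=10):
    """Alternate mean-field updates of label and hidden-unit beliefs.

    unary    : (P, L) per-pixel label scores (assumed network output)
    W        : (P, L, H) label-to-hidden couplings (assumed RBM weights)
    b_hidden : (H,) static hidden bias
    W_rec    : (H, H) recurrent weights acting on the previous hidden state
    r_prev   : (H,) hidden-state summary of earlier frames (assumption)
    n_iters  : number of alternating updates, must be at least 1
    """
    # History shifts the hidden bias; this is the only temporal term here.
    dyn_bias = b_hidden + W_rec @ r_prev
    q_y = softmax(unary, axis=1)  # initialize label beliefs from unary scores
    for _ in range(n_iters):
        # Hidden beliefs given the current label beliefs.
        q_h = sigmoid(dyn_bias + np.einsum("plh,pl->h", W, q_y))
        # Label beliefs given unary scores plus the high-order message.
        q_y = softmax(unary + np.einsum("plh,h->pl", W, q_h), axis=1)
    return q_y, q_h

# Toy usage with random parameters (6 pixels, 3 labels, 4 hidden units).
rng = np.random.default_rng(0)
P, L, H = 6, 3, 4
unary = rng.normal(size=(P, L))
W = 0.1 * rng.normal(size=(P, L, H))
b_hidden = np.zeros(H)
W_rec = 0.1 * rng.normal(size=(H, H))
r_prev = rng.uniform(size=H)

q_y, q_h = mean_field_frame(unary, W, b_hidden, W_rec, r_prev)
labels = q_y.argmax(axis=1)  # per-pixel label estimate
```

In this sketch, setting `W` to zero removes the high-order term, and the label beliefs reduce to a per-pixel softmax of the unary scores, which is one way to see what the coupling contributes.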


Keywords: Video labeling · Recurrent Temporal Deep Field · Recurrent Temporal Restricted Boltzmann Machine · Deconvolution · CRF



This work was supported in part by grant NSF RI 1302700. The authors would like to thank Sheng Chen for useful discussion and acknowledge Dimitris Trigkakis for helping with the datasets.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, USA
