Learning to Draw Sight Lines

  • Hao Zhao
  • Ming Lu
  • Anbang Yao
  • Yurong Chen
  • Li Zhang

Abstract

In this paper, we address the task of gaze following. Given a scene (e.g. a girl playing soccer on a field) and a human subject's head position, the task is to infer where she is looking (e.g. at the soccer ball). An existing method adopts a saliency model conditioned on the head position; however, this methodology is intrinsically troubled by dataset bias issues, which we reveal in detail. To resolve these issues, we argue that the right methodology is to simulate how human beings follow gazes. Specifically, we propose the hypothesis that a human follows a gaze by searching for salient objects along the subject's sight line direction. To embody this hypothesis algorithmically, we propose a two-stage method dubbed learning to draw sight lines. In the first stage, a fully convolutional network is trained to directly regress the existence strength of sight lines. This may seem counterintuitive at first glance, since these so-called sight lines do not actually exist in the form of learnable image gradients. However, using the large-scale GazeFollow dataset, we demonstrate that this highly abstract concept can be grounded into neural network activations. We conduct an extensive study on the design of this sight line grounding network and show that the best model we examined already outperforms the state of the art by a large margin using a naive greedy inference strategy. We attribute these improvements to modern architecture design philosophies. However, no matter how strong the sight line grounding network is, greedy inference cannot handle a set of failure cases caused by dataset bias. We identify these cases and demonstrate that the grounded sight lines, a unique ingredient of our method, are the key to overcoming them. Specifically, an algorithm termed RASP is introduced as a second stage. RASP has five intriguing features: (1) it explicitly embodies the aforementioned hypothesis; (2) it involves no hyper-parameters, which guarantees its robustness; (3) if needed, it can be implemented as an integrated layer for end-to-end inference; (4) it improves the performance of every sight line grounding network we inspected; (5) further analyses confirm that RASP works by alleviating the identified dataset biases. Strong results are achieved on the GazeFollow benchmark: combining RASP with the best sight line grounding network brings mean distance, minimum distance, and mean angle difference 45.85%, 42.60%, and 49.23% closer to human performance than the state of the art. We also contribute a video gaze following benchmark called GazeShift, on which we further demonstrate the importance of RASP in video applications. Code and models will be released to encourage further research on the important task of gaze following. Along with our implementation, we contribute a well-engineered toolbox for joint subject tracking and gaze following.
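
To make the two-stage pipeline concrete, the sketch below illustrates the idea in Python. It is a hypothetical illustration only, not the authors' released implementation: ground_sight_lines and saliency_map are assumed stand-ins for the learned sight line grounding network and a saliency model, and the simple ray search merely mimics the spirit of RASP (search for salient content along the inferred sight line) rather than its exact formulation.

```python
import numpy as np

def follow_gaze(image, head_xy, ground_sight_lines, saliency_map, n_steps=200):
    """Hypothetical two-stage gaze-following sketch.

    ground_sight_lines(image, head_xy) -> HxW map of sight line strength
    saliency_map(image)                -> HxW saliency map
    Both callables are stand-ins for learned networks; only the inference
    logic contrasted in the abstract (greedy vs. line-guided search) is shown.
    """
    H, W = image.shape[:2]
    line_map = ground_sight_lines(image, head_xy)    # stage 1: grounded sight line

    # Naive greedy inference: take the strongest sight line response directly.
    greedy_xy = np.unravel_index(np.argmax(line_map), line_map.shape)

    # Stage 2 (RASP-like idea): estimate the dominant sight line direction from
    # the grounded map, then walk along that ray from the head position and keep
    # the most salient point on it.
    ys, xs = np.nonzero(line_map > 0.5 * line_map.max())
    if len(xs) == 0:
        return greedy_xy
    direction = np.array([ys.mean() - head_xy[1], xs.mean() - head_xy[0]])
    direction /= np.linalg.norm(direction) + 1e-8

    sal = saliency_map(image)
    best_xy, best_score = greedy_xy, -np.inf
    for t in np.linspace(0.0, max(H, W), n_steps):
        y = int(round(head_xy[1] + t * direction[0]))
        x = int(round(head_xy[0] + t * direction[1]))
        if not (0 <= y < H and 0 <= x < W):
            break
        score = sal[y, x] * line_map[y, x]           # salient AND on the sight line
        if score > best_score:
            best_score, best_xy = score, (y, x)
    return best_xy                                   # predicted gaze point (row, col)
```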

Keywords

Scene understanding · Gaze following · Dataset biases · Convolutional neural network (CNN)

Notes

Acknowledgements

We thank the anonymous reviewers for their suggestions on the literature review and experimental design. This work was jointly supported by the National Natural Science Foundation of China (Grant Nos. 61132007, 61172125 and U1533132).

Supplementary material

Supplementary material 1: 11263_2019_1263_MOESM1_ESM.txt (txt, 0 KB)
Supplementary material 2: 11263_2019_1263_MOESM2_ESM.txt (txt, 1 KB)

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Tsinghua University, Beijing, China
  2. Intel Labs China, Beijing, China
