Exploring Transfer Learning Approaches for Head Pose Classification from Multi-view Surveillance Images


Head pose classification from surveillance images acquired with distant, large field-of-view cameras is difficult because faces are captured at low resolution and appear blurred. Domain adaptation approaches are useful for transferring knowledge from the training (source) data to the test (target) data when the two have different attributes, minimizing target data labeling effort in the process. This paper examines the use of transfer learning for efficient multi-view head pose classification with minimal target training data under three challenging situations: (i) where the range of head poses in the source and target images is different, (ii) where source images capture a stationary person while target images capture a moving person whose facial appearance varies under motion due to changing perspective and scale, and (iii) a combination of (i) and (ii). On the whole, the presented methods represent novel transfer learning solutions employed in the context of multi-view head pose classification. Through extensive experimental validation, we demonstrate that the proposed solutions considerably outperform the state of the art. Finally, this work presents the DPOSE dataset, compiled for benchmarking head pose classification performance with moving persons and for aiding behavioral understanding applications.


  1.

    Head pose estimation involves determination of the pan (out-of-plane horizontal head rotation), tilt (out-of-plane vertical rotation) and roll (in-plane head rotation). In this work, we are mainly concerned with estimating pan and tilt.

  2.

    Available at http://tev.fbk.eu/DATABASES/DPOSE.html

  3.

    27,824 4-view images correspond to static targets rotating in place at the room center, while 25,660 images capture freely moving targets.

  4.

    These values account for the tracker’s variance and for the horizontal and vertical offsets of the head from the body centroid due to head pan, tilt and roll.

  5.

    This warping can also be applied in the case where the number of cameras/views for the source and target are different.

  6.

    as seen from Table 1, which presents accuracies achieved with source-only \(Cov (d=12)\) features

  7.

    In our implementation, we consider the room center as the reference position.

  8.

    \({\varvec{\varSigma }}\) is chosen to be positive semi-definite and to have a trace equal to 1, as proposed in Kulis et al. (2011).

  9.


  10.

    The NN classifier assigns the class label of the nearest target training example to the test image.
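The nearest-neighbour rule described in footnote 10 can be illustrated with a minimal sketch. The feature vectors, pose labels and the use of Euclidean distance below are assumptions made purely for illustration; the footnote does not state which features or distance measure the NN classifier uses.

```python
import math


def nn_classify(test_feature, train_features, train_labels):
    """Return the label of the training example nearest to test_feature.

    Euclidean distance is an assumption of this sketch; any other
    distance function could be substituted in the key below.
    """
    if not train_features or len(train_features) != len(train_labels):
        raise ValueError("need one label per training example, and at least one example")
    nearest = min(
        range(len(train_features)),
        key=lambda i: math.dist(test_feature, train_features[i]),
    )
    return train_labels[nearest]


# Hypothetical target training set: 2-D features with made-up pose labels.
target_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target_labels = ["frontal", "left", "right"]

predicted = nn_classify((0.9, 0.1), target_train, target_labels)
print(predicted)
```

In this toy example the test point lies closest to the second training example, so it receives that example's label.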


  1. Benfold, B., & Reid, I. (2011). Unsupervised learning of a scene-specific coarse gaze estimator. In International Conference on Computer Vision (pp. 2344–2351).

  2. Chen, C., & Odobez, J.-M. (2012). We are not contortionists: Coupled adaptive learning for head and body orientation estimation in surveillance video. In Computer Vision and Pattern Recognition (pp. 1544–1551).

  3. Dai, W., Yang, Q., Xue, G. R., & Yu, Y. (2007). Boosting for transfer learning. In International Conference on Machine Learning (pp. 193–200).

  4. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (pp. 886–893).

  5. Daume, H. (2007). Frustratingly easy domain adaptation. In Proceedings of Association for Computational Linguistics (pp. 256–263).

  6. Doshi, A., & Trivedi, M. M. (2012). Head and eye gaze dynamics during visual attention shifts in complex environments. Journal of Vision, 12(2), 1–16.

  7. Duan, L., Tsang, I. W., & Xu, D. (2012). Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 465–479.

  8. Duan, L., Tsang, I. W., Xu, D., & Chua, T.-S. (2009). Domain adaptation from multiple sources via auxiliary classifiers. In International Conference on Machine Learning (pp. 289–296).

  9. Farhadi, A., & Tabrizi, M. K. (2008). Learning to recognize activities from the wrong view point. In European Conference on Computer Vision (pp. 154–166).

  10. Ferencz, A., Learned-Miller, E. G., & Malik, J. (2008). Learning to locate informative features for visual identification. International Journal of Computer Vision, 77(1–3), 3–24.

  11. HOSDB. (2006). Imagery library for intelligent detection systems (i-lids). In IEEE Crime and Security.

  12. Jiang, J., & Zhai, C. (2007). Instance weighting for domain adaptation in NLP. In Association for Computational Linguistics (pp. 264–271).

  13. Katzenmaier, M., Stiefelhagen, R., & Schultz, T. (2004). Identifying the addressee in human-human-robot interactions based on head pose and speech. In International Conference on Multimodal Interfaces (pp. 144–151).

  14. Kulis, B., Saenko, K., & Darrell, T. (2011). What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (pp. 1785–1792).

  15. Lanz, O. (2006). Approximate Bayesian multibody tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1436–1449.

  16. Lanz, O., & Brunelli, R. (2008). Joint Bayesian tracking of head location and pose from low-resolution video. In R. Stiefelhagen, R. Bowers, & J. G. Fiscus (Eds.), Multimodal technologies for perception of humans, Lecture Notes in Computer Science (Vol. 4625, pp. 287–296). Heidelberg: Springer.

  17. Lepri, B., Subramanian, R., Kalimeri, K., Staiano, J., Pianesi, F., & Sebe, N. (2012). Connecting meeting behavior with extraversion–A systematic study. IEEE Transactions on Affective Computing, 3(4), 443–455.

  18. Lim, J. J., Salakhutdinov, R., & Torralba, A. (2011). Transfer learning by borrowing examples for multiclass object detection. In Advances in Neural Information Processing Systems (pp. 118–126).

  19. Muñoz-Salinas, R., Yeguas-Bolivar, E., Saffiotti, A., & Carnicer, R. M. (2012). Multi-camera head pose estimation. Machine Vision and Applications, 23(3), 479–490.

  20. Murphy-Chutorian, E., & Trivedi, M. M. (2009). Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 607–626.

  21. Orozco, J., Gong, S., & Xiang, T. (2009). Head pose classification in crowded scenes. In British Machine Vision Conference (pp. 1–11).

  22. Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.

  23. Pardoe, D., & Stone, P. (2010). Boosting for regression transfer. In International Conference on Machine Learning (pp. 863–870).

  24. Rajagopal, A., Subramanian, R., Vieriu, R. L., Ricci, E., Lanz, O., Sebe, N., & Ramakrishnan, K. (2012). An adaptation framework for head pose estimation in dynamic multi-view scenarios. In Asian Conference on Computer Vision (pp. 652–666).

  25. Ricci, E., & Odobez, J.-M. (2009). Learning large margin likelihoods for realtime head pose tracking. In International Conference on Image Processing (pp. 2593–2596).

  26. Smith, K., Ba, S. O., Odobez, J.-M., & Gatica-Perez, D. (2008). Tracking the visual focus of attention for a varying number of wandering people. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7), 1212–1229.

  27. Stiefelhagen, R., Bowers, R., & Fiscus, J. G. (2007). Multimodal Technologies for Perception of Humans. In International evaluation workshops CLEAR 2007 and RT 2007, Baltimore, MD, May 8–11, 2007, Revised Selected Papers (Vol. 4625). Heidelberg: Springer.

  28. Subramanian, R., Staiano, J., Kalimeri, K., Sebe, N., & Pianesi, F. (2010). Putting the pieces together: Multimodal analysis of social attention in meetings. In ACM Int’l Conference on Multimedia (pp. 659–662).

  29. Subramanian, R., Yan, Y., Staiano, J., Lanz, O., & Sebe, N. (2013). On the relationship between head pose, social attention and personality prediction for unstructured and dynamic group interactions. In ACM Int’l Conference on Multimodal Interfaces.

  30. Tosato, D., Farenzena, M., Spera, M., Murino, V., & Cristani, M. (2010). Multi-class classification on Riemannian manifolds for video surveillance. In European Conference on Computer Vision (pp. 378–391).

  31. Voit, M., & Stiefelhagen, R. (2009). A system for probabilistic joint 3D head tracking and pose estimation in low-resolution, multi-view environments. In Computer Vision Systems (pp. 415–424).

  32. Wang, X., Han, T. X., & Yan, S. (2009). An HOG-LBP human detector with partial occlusion handling. In International Conference on Computer Vision (pp. 32–39).

  33. Williams, C., Bonilla, E. V., & Chai, K. M. (2007). Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems (pp. 153–160).

  34. Yan, Y., Subramanian, R., Lanz, O., & Sebe, N. (2012). Active transfer learning for multi-view head-pose classification. In Int’l Conference on Pattern Recognition (pp. 1168–1171).

  35. Yan, Y., Ricci, E., Subramanian, R., Lanz, O., & Sebe, N. (2013). No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion. In Int’l Conference on Computer Vision.

  36. Yang, J., Yan, R., & Hauptmann, A. G. (2007). Cross-domain video concept detection using adaptive SVMs. In ACM Int’l Conference on Multimedia (pp. 188–197).

  37. Yang, W., Wang, Y., & Mori, G. (2009). Human action recognition from a single clip per action. In Int’l Workshop on Machine learning for Vision-Based Motion Analysis.

  38. Yang, W., Wang, Y., & Mori, G. (2010). Efficient human action detection using a transferable distance function. In Asian Conference on Computer Vision (pp. 417–426).

  39. Zabulis, X., Sarmis, T., & Argyros, A. A. (2009). 3D head pose estimation from multiple distant views. In British Machine Vision Conference (pp. 1–12).

  40. Zhang, Y., & Yeung, D.-Y. (2010). A convex formulation for learning task relationships in multi-task learning. In Uncertainty in Artificial Intelligence (pp. 733–742).

  41. Zheng, J., Jiang, Z., Phillips, J., & Chellappa, R. (2012). Cross-view action recognition via a transferable dictionary pair. In British Machine Vision Conference (pp. 1–11).


The authors gratefully acknowledge partial support from Singapore’s Agency for Science, Technology and Research (A*STAR) under the Human Sixth Sense Programme (HSSP) grant, EIT ICT Labs SSP 12205 Activity TIK—The Interaction Toolkit, tasks T1320A-T1321A and the FP7 EU project DALI.

Author information



Corresponding author

Correspondence to Ramanathan Subramanian.

Electronic supplementary material

Supplementary material 1 (mp4 5174 KB)

About this article

Cite this article

Kolar Rajagopal, A., Subramanian, R., Ricci, E. et al. Exploring Transfer Learning Approaches for Head Pose Classification from Multi-view Surveillance Images. Int J Comput Vis 109, 146–167 (2014). https://doi.org/10.1007/s11263-013-0692-2


  • Transfer learning
  • Multi-view head pose classification
  • Varying acquisition conditions
  • Moving persons