Incorporating Audio Signals into Constructing a Visual Saliency Map

  • Jiro Nakajima
  • Akihiro Sugimoto
  • Kazuhiko Kawamoto
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8333)


The saliency map has been proposed to identify regions that draw human visual attention. Differences between features and their surroundings are computed hierarchically for an image or an image sequence at multiple resolutions, and they are fused in a fully bottom-up manner to obtain a saliency map. A video usually contains sound, and auditory stimuli as well as visual stimuli attract human attention. Nevertheless, most conventional methods discard auditory information and compute a saliency map from image information alone. This paper presents a method for constructing a visual saliency map by integrating image features with auditory features. We assume a single moving sound source in a video and introduce a sound source feature. Our method detects the sound source feature using the correlation between audio signals and sound source motion, and computes its importance in each frame of the video using an auditory saliency map. The importance is used to fuse the sound source feature with image features to construct a visual saliency map. Experiments with human subjects demonstrate that a saliency map produced by our proposed method reflects human visual attention more accurately than one produced by a conventional method.
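The pipeline described in the abstract (correlate audio with motion to locate the sound source, then fuse that feature with image saliency using a per-frame importance) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' formulation: the function names are hypothetical, the sound source feature is approximated by a per-pixel Pearson correlation between an audio energy signal and a motion energy signal, and the fusion is a simple linear blend weighted by an importance value in [0, 1] that is taken as given rather than derived from an auditory saliency map.

```python
import numpy as np


def normalize(m):
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=float)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)


def audio_motion_correlation(audio_energy, motion_energy):
    """Per-pixel Pearson correlation between audio and motion over time.

    audio_energy:  shape (T,), one value per frame (assumed input).
    motion_energy: shape (T, H, W), e.g. optical-flow magnitude (assumed input).
    Returns an (H, W) map; negative correlations are clipped to zero.
    """
    a = np.asarray(audio_energy, dtype=float)
    m = np.asarray(motion_energy, dtype=float)
    a = a - a.mean()
    m = m - m.mean(axis=0)
    num = np.tensordot(a, m, axes=(0, 0))              # (H, W)
    den = np.sqrt((a ** 2).sum() * (m ** 2).sum(axis=0))
    corr = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.clip(corr, 0.0, None)


def fuse(image_saliency, source_feature, importance):
    """Linear blend of image saliency and the sound source feature.

    importance is a hypothetical per-frame weight in [0, 1]; the paper
    obtains its importance from an auditory saliency map instead.
    """
    w = float(np.clip(importance, 0.0, 1.0))
    blended = (1.0 - w) * normalize(image_saliency) + w * normalize(source_feature)
    return normalize(blended)
```

In this sketch, an importance of 0 reduces the result to the image-only saliency map, and an importance of 1 makes the map follow the sound source feature alone, which mirrors the idea that the audio contribution should vary from frame to frame.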


Keywords: gaze, visual attention, visual saliency, auditory saliency, audio signal, video, sound source feature



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Jiro Nakajima (1)
  • Akihiro Sugimoto (2)
  • Kazuhiko Kawamoto (1)
  1. Chiba University, Chiba, Japan
  2. National Institute of Informatics, Tokyo, Japan
