Multimedia Tools and Applications, Volume 77, Issue 3, pp 3369–3385

Gaussian filter for TDOA based sound source localization in multimedia surveillance

  • Mengyao Zhu
  • Huan Yao
  • Xiukun Wu
  • Zhihua Lu
  • Xiaoqiang Zhu
  • Qinghua Huang


Although multimedia surveillance systems are becoming increasingly ubiquitous in our living environment, automated surveillance systems based on video cameras often lack robustness and reliability in real applications. To overcome this drawback, a considerable amount of research has taken audio sensors into account. For example, Sound Source Localization (SSL) may indicate potential security risks and could point the camera in the direction of the sound. In this paper, reliable sound source localization based on Time-Difference-Of-Arrival (TDOA) is explored. The novel aspect of our approach is a TDOA-based Gaussian filter that improves the accuracy and stability of sound source localization. An advantage of the proposed algorithm is that it can be integrated with various TDOA-based methods and with all kinds of microphone arrays. Experimental comparison shows significant improvement over the state-of-the-art TDOA-based algorithm.
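
As a rough illustration of what a TDOA measurement is, the sketch below estimates the delay between two microphone signals with the generalized cross-correlation method using phase transform weighting (GCC-PHAT). This is a common baseline and is not the Gaussian filter proposed in the paper; the function name `gcc_phat_tdoa`, its parameters, and the synthetic signals are assumptions made for this example only.

```python
import numpy as np


def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref`, in seconds (GCC-PHAT).

    Hypothetical helper for illustration; a positive result means `sig`
    arrives later than `ref`.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(sig) + len(ref)
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, whitened by its magnitude (PHAT weighting).
    cross = sig_f * np.conj(ref_f)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs


# Synthetic usage: the second channel is the first one delayed by 12 samples.
fs = 16000
rng = np.random.default_rng(0)
mic_a = rng.standard_normal(4096)
mic_b = np.concatenate((np.zeros(12), mic_a))[:4096]
tau = gcc_phat_tdoa(mic_b, mic_a, fs)
print("estimated delay in samples:", round(tau * fs))
```

With more than two microphones, one such delay per microphone pair would feed a localization step; any smoothing or filtering of those delays is where a method like the one in this paper would come in.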


Microphone array · Sound source location · TDOA · Gaussian filter · Surveillance



This work was supported by the Key Support Projects of the Shanghai Science and Technology Committee (16010500100), the National Natural Science Foundation of China (61402277, 61571279), and the Innovation Program of the Shanghai Municipal Education Commission (15ZZ044).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. School of Communication and Information Engineering, Shanghai University, Shanghai, China
  2. College of Information Science and Engineering, Ningbo University, Ningbo, China
