
Spectral similarity metrics for sound source formation based on the common variation cue

Multimedia Tools and Applications

Abstract

Scene analysis is a useful way of gathering information about the structure of an audio stream. For content extraction, it also provides prior knowledge that can make standard classification approaches more robust. We believe that temporality is important for performing such scene analysis. In this paper, we therefore study a new way of modeling how the frequency and amplitude parameters of spectral components evolve over time. We evaluate its benefits by measuring its ability to automatically group the components that belong to the same sound source. The evaluation shows that the proposed metric achieves good performance and takes better account of micro-modulations.
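The abstract describes the approach only at a high level. As a rough illustration of the common variation cue, and not the metric proposed in the paper (whose details are not reproduced here), the following Python sketch assumes each partial is given as frequency and amplitude tracks sampled on the same frames. It compares the partials' relative variations with an RMS distance and groups them by single-linkage thresholding. All function names, the `amp_weight` and `threshold` parameters, and the synthetic vibrato data are hypothetical.

```python
import math

def relative_variation(track):
    """Deviation of a parameter track from its mean, divided by that mean.

    Harmonics k * f0(t) of one source share the same relative variation,
    which is what makes this a common-variation feature."""
    mean = sum(track) / len(track)
    if mean == 0:
        return [0.0] * len(track)
    return [(x - mean) / mean for x in track]

def rms_distance(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def dissimilarity(p, q, amp_weight=1.0):
    """Hypothetical dissimilarity between partials p and q.

    Each partial is a (freqs, amps) pair sampled on the same frames (assumed
    already aligned to their common lifetime). amp_weight is an illustrative
    parameter, not one taken from the paper."""
    d_freq = rms_distance(relative_variation(p[0]), relative_variation(q[0]))
    d_amp = rms_distance(relative_variation(p[1]), relative_variation(q[1]))
    return math.hypot(d_freq, amp_weight * d_amp)

def group_partials(partials, threshold):
    """Single-linkage grouping: partials closer than threshold share a group."""
    parent = list(range(len(partials)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(partials)):
        for j in range(i + 1, len(partials)):
            if dissimilarity(partials[i], partials[j]) < threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(partials)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Synthetic example: 1 s of frames at a 10 ms hop (illustrative values).
FRAMES, HOP = 100, 0.01

def vibrato_track(f0, rate_hz, depth, phase):
    """Frequency track of a fundamental with sinusoidal vibrato."""
    return [f0 * (1 + depth * math.sin(2 * math.pi * rate_hz * n * HOP + phase))
            for n in range(FRAMES)]

src_a = vibrato_track(220.0, 5.0, 0.01, 0.0)  # source A: 5 Hz vibrato
src_b = vibrato_track(330.0, 6.0, 0.01, 1.0)  # source B: 6 Hz vibrato
partials = [([k * f for f in src_a], [1.0 / k] * FRAMES) for k in (1, 2, 3)]
partials += [([k * f for f in src_b], [0.5 / k] * FRAMES) for k in (1, 2)]
print(group_partials(partials, threshold=1e-3))  # → [[0, 1, 2], [3, 4]]
```

Dividing by the mean makes harmonics of one source, whose vibrato depth scales with the harmonic number in Hz, look identical, while partials modulated at a different rate stay about 0.01 apart. A real system would also need to handle partials with different lifetimes and noisy parameter estimates, which this sketch ignores.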





Acknowledgements

This work was initiated while the authors were at LaBRI (UMR CNRS 5800, University of Bordeaux 1) and was partly funded by the OSEO project Quaero, within task 6.4 “Music Search by Similarity”, and by the French GIP ANR DESAM under contract ANR-06-JCJC-0027-01.

Author information


Correspondence to Mathieu Lagrange.


About this article

Cite this article

Lagrange, M., Raspaud, M. Spectral similarity metrics for sound source formation based on the common variation cue. Multimed Tools Appl 48, 185–205 (2010). https://doi.org/10.1007/s11042-009-0382-9


  • DOI: https://doi.org/10.1007/s11042-009-0382-9
