Speech/Music Discrimination in Audio Podcast Using Structural Segmentation and Timbre Recognition

  • Mathieu Barthet
  • Steven Hargreaves
  • Mark Sandler
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6684)


We propose two speech/music discrimination methods using timbre models and measure their performance on a 3-hour database of radio podcasts from the BBC. In the first method, the machine-estimated classifications obtained with an automatic timbre recognition (ATR) model are post-processed using median filtering. The classification system (LSF/K-means) was trained using two different taxonomic levels: a high-level one (speech, music) and a lower-level one (male and female speech, classical, jazz, rock & pop). The second method combines automatic structural segmentation and timbre recognition (ASS/ATR). The ASS evaluates the similarity between feature distributions (MFCC, RMS) using HMM and soft K-means algorithms. Both methods were evaluated at the semantic level (relative correct overlap, RCO) and at the temporal level (boundary retrieval F-measure). The ASS/ATR method obtained the best results (average RCO of 94.5% and boundary F-measure of 50.1%). These results compared favourably with those obtained by an SVM-based technique, which provides a good benchmark of the state of the art.
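The median-filtering post-processing step of the first method can be illustrated with a minimal sketch. It is not the authors' implementation: the 0/1 label encoding, the window length, the edge handling, and the function names `median_filter_labels` and `labels_to_segments` are all assumptions made for illustration, since the abstract does not specify them. The idea is that isolated frame-level misclassifications are replaced by the majority label of their neighbourhood, after which consecutive identical labels are merged into segments.

```python
from statistics import median

# Hypothetical label encoding (not taken from the paper).
SPEECH, MUSIC = 0, 1


def median_filter_labels(labels, window=3):
    """Smooth a sequence of 0/1 frame labels with a sliding median.

    `window` must be a positive odd integer. Edges are handled by
    repeating the first and last labels (an assumed convention).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not labels:
        return []
    half = window // 2
    padded = [labels[0]] * half + list(labels) + [labels[-1]] * half
    # With an odd window the median of 0/1 values is the majority label.
    return [int(median(padded[i:i + window])) for i in range(len(labels))]


def labels_to_segments(labels):
    """Merge runs of identical labels into (start, end_exclusive, label)."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments


# Toy frame-level decisions with three isolated errors (frames 3, 9 and
# the short flip around frame 4-5 boundary); values are made up.
raw = [0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
smoothed = median_filter_labels(raw, window=3)
segments = labels_to_segments(smoothed)
```

A longer window removes longer spurious runs but also blurs the true speech/music boundaries, so the window length trades classification stability against boundary precision.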


Keywords: Speech/Music Discrimination; Audio Podcast; Timbre Recognition; Structural Segmentation; Line Spectral Frequencies; K-means clustering; Mel-Frequency Cepstral Coefficients; Hidden Markov Models





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mathieu Barthet (1)
  • Steven Hargreaves (1)
  • Mark Sandler (1)
  1. Centre for Digital Music, Queen Mary University of London, London, United Kingdom
