Speech/Music Discrimination in Audio Podcast Using Structural Segmentation and Timbre Recognition

  • Conference paper
Exploring Music Contents (CMMR 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6684)

Abstract

We propose two speech/music discrimination methods using timbre models and measure their performance on a 3-hour database of radio podcasts from the BBC. In the first method, the machine-estimated classifications obtained with an automatic timbre recognition (ATR) model are post-processed using median filtering. The classification system (LSF/K-means) was trained using two different taxonomic levels: a high-level one (speech, music) and a lower-level one (male and female speech, classical, jazz, rock & pop). The second method combines automatic structural segmentation and timbre recognition (ASS/ATR). The ASS evaluates the similarity between feature distributions (MFCC, RMS) using HMM and soft K-means algorithms. Both methods were evaluated at the semantic level (relative correct overlap, RCO) and the temporal level (boundary retrieval F-measure). The ASS/ATR method obtained the best results (average RCO of 94.5% and boundary F-measure of 50.1%). These results compared favourably with those obtained by an SVM-based technique, which provides a good benchmark of the state of the art.
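The median-filtering step of the first method can be illustrated with a minimal sketch. It is not the authors' implementation: the window length, the integer coding of the classes (0 for speech, 1 for music), the edge handling, and the function names `median_filter_labels` and `frame_overlap` are all assumptions made for illustration. The overlap score below is only a simple frame-level stand-in for a correct-overlap measure; the abstract does not give the paper's exact RCO definition.

```python
SPEECH, MUSIC = 0, 1  # assumed integer coding of the high-level classes


def median_filter_labels(labels, window=5):
    """Smooth a sequence of integer class labels with a sliding median.

    Edges are handled by repeating the first and last labels (an assumption).
    For two classes, the median is equivalent to a majority vote.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not labels:
        return []
    half = window // 2
    padded = [labels[0]] * half + list(labels) + [labels[-1]] * half
    # The middle element of each sorted window is its median.
    return [sorted(padded[i:i + window])[half] for i in range(len(labels))]


def frame_overlap(predicted, reference):
    """Fraction of frames whose predicted label equals the reference label."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy data: six speech frames followed by six music frames,
# with two isolated frame-level misclassifications in the raw output.
reference = [SPEECH] * 6 + [MUSIC] * 6
raw = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1]
smoothed = median_filter_labels(raw, window=3)

print(frame_overlap(raw, reference))       # overlap before smoothing
print(frame_overlap(smoothed, reference))  # overlap after smoothing
```

In this toy case the filter removes both isolated errors. On real data a longer window suppresses more spurious label changes but can also blur or shift genuine speech/music boundaries, which is one reason a boundary-level measure is reported alongside the overlap score.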


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Barthet, M., Hargreaves, S., Sandler, M. (2011). Speech/Music Discrimination in Audio Podcast Using Structural Segmentation and Timbre Recognition. In: Ystad, S., Aramaki, M., Kronland-Martinet, R., Jensen, K. (eds) Exploring Music Contents. CMMR 2010. Lecture Notes in Computer Science, vol 6684. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23126-1_10

  • DOI: https://doi.org/10.1007/978-3-642-23126-1_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23125-4

  • Online ISBN: 978-3-642-23126-1

  • eBook Packages: Computer Science, Computer Science (R0)
