Speech/Music Discrimination in Audio Podcast Using Structural Segmentation and Timbre Recognition

Barthet, Mathieu; Hargreaves, Steven; Sandler, Mark

doi:10.1007/978-3-642-23126-1_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6684)

Included in the following conference series:

International Symposium on Computer Music Modeling and Retrieval

Abstract

We propose two speech/music discrimination methods using timbre models and measure their performance on a 3-hour database of radio podcasts from the BBC. In the first method, the machine-estimated classifications obtained with an automatic timbre recognition (ATR) model are post-processed using median filtering. The classification system (LSF/K-means) was trained using two different taxonomic levels: a high-level one (speech, music) and a lower-level one (male and female speech, classical, jazz, rock & pop). The second method combines automatic structural segmentation and timbre recognition (ASS/ATR). The ASS evaluates the similarity between feature distributions (MFCC, RMS) using HMM and soft K-means algorithms. Both methods were evaluated at the semantic level (relative correct overlap, RCO) and at the temporal level (boundary retrieval F-measure). The ASS/ATR method obtained the best results (average RCO of 94.5% and boundary F-measure of 50.1%). These results compared favourably with those obtained by an SVM-based technique, which provides a good benchmark of the state of the art.
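
As an illustration of the post-processing and semantic evaluation steps mentioned in the abstract, the following minimal sketch smooths a frame-wise speech/music label sequence with a median filter and then scores it against a reference annotation. It is not the authors' implementation. The numeric label coding, the function names, the window width, the edge padding and the frame-level reading of RCO (fraction of frames whose label matches the reference) are all assumptions made for this example, since the abstract does not give these details.

```python
from statistics import median

# Hypothetical numeric coding of the two high-level classes.
SPEECH, MUSIC = 0, 1


def median_filter(labels, width=5):
    """Smooth a binary frame-label sequence with a sliding median.

    Edges are padded by repeating the first and last label so that every
    window has exactly `width` frames. `width` must be odd.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    if not labels:
        return []
    half = width // 2
    padded = [labels[0]] * half + list(labels) + [labels[-1]] * half
    # For an odd-length window of 0/1 labels the median is the majority label.
    return [median(padded[i:i + width]) for i in range(len(labels))]


def relative_correct_overlap(estimated, reference):
    """Fraction of frames whose estimated label matches the reference.

    Assumes both sequences use the same uniform frame rate, so the share of
    matching frames equals the share of correctly labelled duration.
    """
    if len(estimated) != len(reference):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(e == r for e, r in zip(estimated, reference))
    return matches / len(reference)


# Toy data: five speech frames followed by five music frames,
# with two isolated misclassifications in the raw classifier output.
reference = [SPEECH] * 5 + [MUSIC] * 5
raw = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
smoothed = median_filter(raw, width=3)

print(relative_correct_overlap(raw, reference))       # 0.8
print(relative_correct_overlap(smoothed, reference))  # 1.0
```

In this toy case the filter removes both isolated errors. On real data a wider window suppresses more spurious label changes but can also erase short genuine segments, so the width would need to be tuned.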

Author information

Authors and Affiliations

Centre for Digital Music, Queen Mary University of London, Mile End Road, London, E1 4NS, United Kingdom
Mathieu Barthet, Steven Hargreaves & Mark Sandler

Editor information

Editors and Affiliations

CNRS-LMA, 31 Chemin Joseph Aiguier, 13402, Marseille Cedex 20, France
Sølvi Ystad
CNRS-INCM, 31 Chemin Joseph Aiguier, 13402, Marseille Cedex 20, France
Mitsuko Aramaki
CNRS-LMA, 31 Chemin Joseph Aiguier, 13402, Marseille Cedex 20, France
Richard Kronland-Martinet
Aalborg University Esbjerg, Niels Bohr Vej 8, 6700, Esbjerg, Denmark
Kristoffer Jensen

About this paper

Cite this paper

Barthet, M., Hargreaves, S., Sandler, M. (2011). Speech/Music Discrimination in Audio Podcast Using Structural Segmentation and Timbre Recognition. In: Ystad, S., Aramaki, M., Kronland-Martinet, R., Jensen, K. (eds) Exploring Music Contents. CMMR 2010. Lecture Notes in Computer Science, vol 6684. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23126-1_10

DOI: https://doi.org/10.1007/978-3-642-23126-1_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23125-4
Online ISBN: 978-3-642-23126-1
eBook Packages: Computer Science, Computer Science (R0)
