Machine Listening of Music

  • Juan Pablo Bello


The analysis and recognition of sounds in complex auditory scenes is a fundamental step towards context-awareness in machines, and thus an enabling technology for applications across multiple domains including robotics, human-computer interaction, surveillance and bioacoustics. In the realm of music, endowing computers with listening and analytical skills can aid the organization and study of large music collections, the creation of music recommendation services and personalized radio streams, the automation of tasks in the recording studio or the development of interactive music systems for performance and composition.

In this chapter, we survey common techniques for the automatic recognition of timbral, rhythmic and tonal information from recorded music, and for characterizing the similarities that exist between musical pieces. We explore the assumptions behind these methods and their inherent limitations, and conclude by discussing how current trends in machine learning and signal processing research can shape future developments in the field of machine listening.


Keywords: Discrete Fourier Transform; Novelty Detection; Spectral Envelope; Magnitude Spectrum; Pitch Class



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Music and Audio Research Laboratory (MARL), New York University, New York, USA
