Content-Based Retrieval for Digital Audio and Music

  • Changsheng Xu
  • David Dagan Feng
  • Qi Tian
Part of the Signals and Communication Technology book series (SCT)


In this chapter, we summarize research achievements in content-based audio and music retrieval. The chapter covers audio feature extraction, generic audio classification and retrieval, music content analysis, and content-based music retrieval, providing an overview of current research in the area. In addition, two typical systems for content-based audio and music retrieval are discussed in detail. Finally, based on the current technology used in content-based audio/music retrieval and the demands of real-world applications, promising future directions are identified.







Copyright information

© Springer-Verlag Berlin Heidelberg 2003
