Content-Based Retrieval for Digital Audio and Music

  • Changsheng Xu
  • David Dagan Feng
  • Qi Tian
Part of the Signals and Communication Technology book series (SCT)


In this chapter, we summarize research achievements in content-based audio and music retrieval. The chapter covers audio feature extraction, generic audio classification and retrieval, music content analysis, and content-based music retrieval, giving an overview of current research in the area. In addition, two typical systems for content-based audio and music retrieval are discussed in detail. Finally, based on the technology currently used in content-based audio/music retrieval and the demands of real-world applications, promising future directions are identified.


Keywords: Audio Signal; Digital Audio; Music Information Retrieval; Pitch Tracking; Audio Classification





Copyright information

© Springer-Verlag Berlin Heidelberg 2003
