
Indexing and Retrieval of Audio: A Survey


Abstract

With more and more audio being captured and stored, there is a growing need for automatic audio indexing and retrieval techniques that can retrieve relevant audio pieces quickly on demand. This paper provides a comprehensive survey of audio indexing and retrieval techniques. We first describe the main audio characteristics and features, and discuss techniques for classifying audio into speech and music based on these features. Indexing and retrieval of speech and music are then described separately. Finally, the significance of audio in multimedia indexing and retrieval is discussed.
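The low-level features the abstract refers to include time-domain measures such as zero-crossing rate and short-time energy, which are widely used for speech/music discrimination. The sketch below is a minimal illustration of computing these two features with NumPy; it is not taken from the paper, and the frame sizes, the variance-based decision rule, and the 0.05 threshold are hypothetical placeholders chosen only to make the idea concrete.

    # Minimal illustrative sketch (not from the survey): two classic frame-level
    # features often used for speech/music discrimination. Frame sizes, the
    # variance-based rule, and the 0.05 threshold are hypothetical placeholders.
    import numpy as np

    def frame_signal(x, frame_len=1024, hop=512):
        """Split a mono signal into overlapping frames (assumes len(x) >= frame_len)."""
        n_frames = 1 + (len(x) - frame_len) // hop
        return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

    def zero_crossing_rate(frames):
        """Fraction of adjacent-sample sign changes in each frame."""
        signs = np.sign(frames)
        return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

    def short_time_energy(frames):
        """Mean squared amplitude of each frame."""
        return np.mean(frames ** 2, axis=1)

    def classify_speech_music(x):
        """Toy heuristic: speech alternates voiced and unvoiced segments, so its
        zero-crossing rate typically varies more across frames than music does."""
        frames = frame_signal(np.asarray(x, dtype=np.float64))
        zcr = zero_crossing_rate(frames)
        energy = short_time_energy(frames)  # often combined with ZCR in practice
        return "speech" if np.std(zcr) > 0.05 else "music"

Real systems of the kind surveyed typically combine many such features and train a statistical classifier rather than applying a single fixed threshold.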




Cite this article

Lu, G. Indexing and Retrieval of Audio: A Survey. Multimedia Tools and Applications 15, 269–290 (2001). https://doi.org/10.1023/A:1012491016871
