Identifying perceptually congruent structures for audio retrieval

  • Kathy Melih
  • Ruben Gonzalez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1483)


Relatively low-cost access to large amounts of multimedia data, such as over the WWW, has increased the demand for multimedia data management. Audio data, however, has received relatively little research attention, mainly because it poses unique problems. Specifically, the unstructured nature of current audio representations considerably complicates content-based retrieval and, especially, browsing. This paper attempts to address this oversight by developing a representation based on the inherent, perceptually congruent structure of audio data. A survey of the pertinent issues is presented, including some of the limitations of current unstructured audio representations and of the existing retrieval systems based on them. The benefits of a structured representation are discussed, as are the perceptual issues relevant to identifying the underlying structure of an audio data stream. Finally, the structured representation is described and its possible applications to retrieval and browsing are outlined.


Keywords: Audio Signal; Tone Burst; Noise Burst; Audio Data; Stream Segregation





Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Kathy Melih (1)
  • Ruben Gonzalez (1)
  1. School of Information Technology, Griffith University, Australia
