Audio Content Discovery: An Unsupervised Approach

Multimedia Content Analysis

Part of the book series: Signals and Communication Technology ((SCT))

Abstract

Automatically extracting semantic content from audio streams can be helpful in many multimedia applications. Motivated by the known limitations of traditional supervised approaches to content extraction, which are hard to generalize and require suitable training data, we propose in this chapter a completely unsupervised approach to content discovery in composite audio signals. The approach adopts ideas from text analysis to find the fundamental and representative audio segments (analogous to words and keywords) and to employ them for parsing a general audio document into meaningful "paragraphs" and clusters of "paragraphs". We first employ spectral clustering to discover natural semantic sound clusters (e.g., speech, music, noise, applause, speech mixed with music). These clusters are referred to as audio elements and are analogous to words in text analysis. From the obtained set of audio elements, we select the key audio elements, which are the most prominent in characterizing the content of the input audio data. The obtained (key) audio elements are then used to detect potential boundaries of semantic audio "paragraphs", denoted as auditory scenes. Finally, the auditory scenes are clustered in terms of the audio elements appearing in them, by investigating the relations between audio elements and auditory scenes with an information-theoretic co-clustering scheme. Evaluations of the proposed approach on 5 hours of diverse audio data indicate that promising results can be achieved, both for audio element discovery and for auditory scene segmentation and clustering.
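
As a rough illustration of the last step, the sketch below computes the quantity that information-theoretic co-clustering is commonly formulated to minimize: the loss in mutual information between two variables (here, audio elements and auditory scenes) when both are replaced by their cluster labels. It is a minimal sketch under stated assumptions, not the chapter's implementation. The function names, the toy occurrence counts, and the candidate cluster assignments are all hypothetical, and the search over assignments that an actual co-clustering algorithm performs is omitted.

```python
from math import log2


def normalize(counts):
    """Turn a table of occurrence counts into a joint probability table."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]


def mutual_information(joint):
    """Mutual information (in bits) of a joint probability table."""
    row_sums = [sum(row) for row in joint]
    col_sums = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * log2(p / (row_sums[i] * col_sums[j]))
    return mi


def aggregate(joint, row_labels, col_labels):
    """Sum joint probabilities over row clusters and column clusters."""
    n_rows = max(row_labels) + 1
    n_cols = max(col_labels) + 1
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            out[row_labels[i]][col_labels[j]] += p
    return out


def information_loss(counts, row_labels, col_labels):
    """Mutual information lost by replacing rows and columns with cluster labels."""
    joint = normalize(counts)
    clustered = aggregate(joint, row_labels, col_labels)
    return mutual_information(joint) - mutual_information(clustered)


# Hypothetical counts: rows are audio elements, columns are auditory scenes.
counts = [
    [8, 8, 0, 0],  # e.g. speech
    [2, 2, 0, 0],  # e.g. applause
    [0, 0, 6, 6],  # e.g. music
    [0, 0, 4, 4],  # e.g. noise
]

# Two candidate co-clusterings of the same table (illustrative only).
good_loss = information_loss(counts, [0, 0, 1, 1], [0, 0, 1, 1])
bad_loss = information_loss(counts, [0, 1, 0, 1], [0, 1, 0, 1])
print(f"loss when blocks are respected: {good_loss:.3f} bits")
print(f"loss when blocks are mixed:     {bad_loss:.3f} bits")
```

In this toy table, the grouping that follows the block structure loses approximately no information, while the grouping that mixes the two blocks loses approximately one bit, which is all the mutual information the table contains. A co-clustering procedure would search for the row and column assignments with the smallest such loss.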

Author information

Corresponding author

Correspondence to Lie Lu.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Lu, L., Hanjalic, A. (2009). Audio Content Discovery: An Unsupervised Approach. In: Divakaran, A. (eds) Multimedia Content Analysis. Signals and Communication Technology. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-76569-3_4

  • DOI: https://doi.org/10.1007/978-0-387-76569-3_4

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-76567-9

  • Online ISBN: 978-0-387-76569-3

  • eBook Packages: Engineering, Engineering (R0)