Abstract
Automatically extracting semantic content from audio streams can be helpful in many multimedia applications. Motivated by the known limitations of traditional supervised approaches to content extraction, which are hard to generalize and require suitable training data, we propose in this chapter a completely unsupervised approach to content discovery in composite audio signals. The approach adopts ideas from text analysis to find the fundamental and representative audio segments (analogous to words and keywords) and employs them to parse a general audio document into meaningful "paragraphs" and clusters of "paragraphs". We first employ spectral clustering to discover natural semantic sound clusters (e.g. speech, music, noise, applause, speech mixed with music). These clusters are referred to as audio elements and are analogous to words in text analysis. From the obtained set of audio elements, we then select the key audio elements, which are the most prominent in characterizing the content of the input audio data. The obtained (key) audio elements are used to detect potential boundaries of semantic audio "paragraphs", denoted as auditory scenes. Finally, the auditory scenes are clustered in terms of the audio elements appearing in them, by investigating the relations between audio elements and auditory scenes with an information-theoretic co-clustering scheme. Evaluations of the proposed approach on 5 hours of diverse audio data indicate that promising results can be achieved for both audio element discovery and auditory scene segmentation/clustering.
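The text-analysis analogy behind key audio element selection can be illustrated with a small sketch. The abstract does not state the selection criteria used in the chapter, so the code below substitutes a plain TF-IDF-style weighting from text retrieval as a stand-in: an audio element scores high when it is frequent within one audio document but occurs in few documents overall. The function name `key_element_scores`, the label strings, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def key_element_scores(documents):
    """Score audio element labels with a TF-IDF-style weight.

    ASSUMPTION: this is a generic text-retrieval heuristic used for
    illustration, not the selection method of the chapter.

    documents: list of lists; each inner list holds the audio element
    labels (e.g. "speech", "music") found in one audio document.
    Returns {label: score}; a higher score means more characteristic.
    """
    n_docs = len(documents)

    # Document frequency: number of documents in which each label occurs.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = {}
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        for label, count in counts.items():
            tf = count / total                        # frequency within this document
            idf = math.log(n_docs / doc_freq[label])  # rarity across documents
            # Keep the best per-document weight for each label.
            scores[label] = max(scores.get(label, 0.0), tf * idf)
    return scores


# Toy example: three audio documents, already labeled with audio elements.
documents = [
    ["speech", "speech", "applause", "speech"],
    ["speech", "music", "music", "speech"],
    ["speech", "noise", "speech", "speech"],
]
scores = key_element_scores(documents)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # the most characteristic element in this toy data
```

In this toy data "speech" occurs in every document and therefore receives a weight of zero, while "music", which dominates a single document, ranks first. That mirrors how very common words carry little weight as keywords in text.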
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Lu, L., Hanjalic, A. (2009). Audio Content Discovery: An Unsupervised Approach. In: Divakaran, A. (eds) Multimedia Content Analysis. Signals and Communication Technology. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-76569-3_4
DOI: https://doi.org/10.1007/978-0-387-76569-3_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-76567-9
Online ISBN: 978-0-387-76569-3
eBook Packages: Engineering, Engineering (R0)