Abstract
This research applies different clustering techniques to group transcribed text documents obtained from audio streams. Because audio transcripts typically contain many errors, it is essential to reduce the negative impact of errors introduced at the speech recognition stage. In an attempt to overcome some of these errors, different stemming techniques are applied to the transcribed text. The goal of this research is to achieve automatic topic clustering of transcribed speech documents and to investigate how stemming techniques, combined with a Chi-square similarity measure, affect the accuracy of the selected clustering algorithms. The evaluation, using the F-measure, showed that root-based stemming combined with spectral clustering achieved the highest accuracy.
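As an illustration of how an F-measure can score a topic clustering, the sketch below computes the class-weighted, best-match F-measure that is common in the document clustering literature. The abstract does not say which F-measure variant the authors used, so this formulation is an assumption, and the function name `clustering_f_measure` and its arguments are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_topics, cluster_ids):
    """Class-weighted best-match F-measure for a flat clustering.

    Assumed variant (not confirmed by the abstract): for each reference
    topic, take the best F1 over all clusters, then average those values
    weighted by topic size.
    """
    if len(true_topics) != len(cluster_ids):
        raise ValueError("inputs must have the same length")
    n = len(true_topics)
    if n == 0:
        raise ValueError("inputs must not be empty")

    class_sizes = Counter(true_topics)             # documents per reference topic
    cluster_sizes = Counter(cluster_ids)           # documents per cluster
    overlap = Counter(zip(true_topics, cluster_ids))  # documents per (topic, cluster)

    total = 0.0
    for topic, n_i in class_sizes.items():
        best = 0.0
        for cluster, n_j in cluster_sizes.items():
            n_ij = overlap[(topic, cluster)]
            if n_ij == 0:
                continue  # no shared documents, F1 is zero
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total
```

Calling `clustering_f_measure(["sports", "sports", "politics", "politics"], [0, 0, 1, 1])` returns 1.0, since every cluster matches exactly one topic. Scores fall below 1.0 as clusters mix topics or split a topic across clusters.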
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Jafar, A.A., Fakhr, M.W., Farouk, M.H. (2015). Clustering-Based Topic Identification of Transcribed Arabic Broadcast News. In: Elleithy, K., Sobh, T. (eds) New Trends in Networking, Computing, E-learning, Systems Sciences, and Engineering. Lecture Notes in Electrical Engineering, vol 312. Springer, Cham. https://doi.org/10.1007/978-3-319-06764-3_32
DOI: https://doi.org/10.1007/978-3-319-06764-3_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06763-6
Online ISBN: 978-3-319-06764-3
eBook Packages: Engineering, Engineering (R0)