Clustering-Based Topic Identification of Transcribed Arabic Broadcast News

Jafar, Ahmed Abdelaziz; Fakhr, Mohamed Waleed; Farouk, Mohamed Hesham

doi:10.1007/978-3-319-06764-3_32

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 312)

Abstract

In this research, different clustering techniques are applied to group transcribed text documents obtained from audio streams. Because automatic transcripts typically contain many errors, it is essential to reduce the negative impact of errors introduced at the speech recognition stage. In an attempt to overcome some of these errors, different stemming techniques are applied to the transcribed text. The goal of this research is to achieve automatic topic clustering of transcribed speech documents and to investigate the impact of applying stemming techniques, in combination with a Chi-square similarity measure, on the accuracy of the selected clustering algorithms. The evaluation, using the F-measure, showed that root-based stemming combined with spectral clustering achieved the highest accuracy.
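The abstract does not say which F-measure variant was used to score the clusterings. As an illustration only, the sketch below implements one common definition from the document-clustering literature: for each true topic, take the best F-score over all clusters, then average those scores weighted by topic size. The function name `clustering_f_measure` and the toy labels are hypothetical and are not taken from the chapter.

```python
from collections import Counter


def clustering_f_measure(true_topics, cluster_ids):
    """Class-weighted best-match F-measure for a flat clustering.

    Assumption: this is one widely used variant, not necessarily the
    exact formulation used in the chapter.
    """
    if len(true_topics) != len(cluster_ids):
        raise ValueError("label lists must have the same length")
    n = len(true_topics)
    if n == 0:
        raise ValueError("need at least one document")

    topic_sizes = Counter(true_topics)
    cluster_sizes = Counter(cluster_ids)
    # (topic, cluster) -> number of documents sharing both labels
    overlap = Counter(zip(true_topics, cluster_ids))

    total = 0.0
    for topic, n_topic in topic_sizes.items():
        best = 0.0
        for cluster, n_cluster in cluster_sizes.items():
            n_both = overlap[(topic, cluster)]
            if n_both == 0:
                continue
            precision = n_both / n_cluster
            recall = n_both / n_topic
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight each topic's best score by its share of the documents.
        total += (n_topic / n) * best
    return total


# Toy example: four transcripts, two topics, clustered perfectly.
topics = ["politics", "politics", "sports", "sports"]
clusters = [0, 0, 1, 1]
print(clustering_f_measure(topics, clusters))  # prints 1.0
```

Under this definition a clustering that recovers every topic exactly scores 1.0, and merging all documents into a single cluster is penalized through lower precision.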

Author information

Authors and Affiliations

Department of Computer Science, College of Computing and Information Technology, Arab Academy for Science and Technology (AAST), Cairo, Egypt
Ahmed Abdelaziz Jafar & Mohamed Waleed Fakhr
Faculty of Engineering, Department of Engineering Math & Physics, Cairo University, Giza, 12613, Egypt
Mohamed Hesham Farouk

Corresponding author

Correspondence to Ahmed Abdelaziz Jafar.

Editor information

Editors and Affiliations

Computer Science and Engineering, University of Bridgeport Associate Dean for Graduate Programs, Bridgeport, Connecticut, USA
Khaled Elleithy
Engineering and Computer Science, University of Bridgeport Dean of the School of Engineering, Bridgeport, Connecticut, USA
Tarek Sobh

About this paper

Cite this paper

Jafar, A.A., Fakhr, M.W., Farouk, M.H. (2015). Clustering-Based Topic Identification of Transcribed Arabic Broadcast News. In: Elleithy, K., Sobh, T. (eds) New Trends in Networking, Computing, E-learning, Systems Sciences, and Engineering. Lecture Notes in Electrical Engineering, vol 312. Springer, Cham. https://doi.org/10.1007/978-3-319-06764-3_32

DOI: https://doi.org/10.1007/978-3-319-06764-3_32
Published: 08 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06763-6
Online ISBN: 978-3-319-06764-3
eBook Packages: Engineering, Engineering (R0)
