Clustering-Based Topic Identification of Transcribed Arabic Broadcast News

New Trends in Networking, Computing, E-learning, Systems Sciences, and Engineering

Abstract

In this research, different clustering techniques are applied to group transcribed text documents obtained from audio streams. Because automatic transcripts typically contain many recognition errors, it is essential to reduce the negative impact of errors introduced at the speech recognition stage. In an attempt to overcome some of these errors, different stemming techniques are applied to the transcribed text. The goal of this research is to achieve automatic topic clustering of transcribed speech documents and to investigate the impact of applying stemming techniques, in combination with a Chi-square similarity measure, on the accuracy of the selected clustering algorithms. The evaluation, using F-measure, showed that root-based stemming combined with spectral clustering achieved the highest accuracy.
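The abstract reports results in terms of F-measure but this preview does not show how it is computed. As a rough illustration only, the sketch below implements one common clustering F-measure: for each reference topic, take the best F1 score over all clusters, then average those scores weighted by topic size. The function name `clustering_f_measure` and the toy labels are assumptions made for this example, and the paper's exact definition may differ.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Size-weighted best-match F-measure for a clustering (illustrative sketch).

    For each reference topic, the best F1 over all clusters is taken;
    the result is the average of those scores weighted by topic size.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    n_docs = len(true_labels)
    if n_docs == 0:
        raise ValueError("at least one document is required")

    topic_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # Number of documents shared by each (topic, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for topic, n_topic in topic_sizes.items():
        best_f1 = 0.0
        for cluster, n_cluster in cluster_sizes.items():
            n_both = overlap[(topic, cluster)]
            if n_both == 0:
                continue
            precision = n_both / n_cluster
            recall = n_both / n_topic
            f1 = 2 * precision * recall / (precision + recall)
            best_f1 = max(best_f1, f1)
        total += (n_topic / n_docs) * best_f1
    return total


# Hypothetical toy data: four documents, two reference topics, two clusters.
topics = ["politics", "politics", "sports", "sports"]
clusters = [0, 0, 0, 1]
score = clustering_f_measure(topics, clusters)
```

A score of 1.0 means every topic is matched exactly by some cluster; merged or split topics lower the score.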

Author information

Corresponding author

Correspondence to Ahmed Abdelaziz Jafar.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Jafar, A.A., Fakhr, M.W., Farouk, M.H. (2015). Clustering-Based Topic Identification of Transcribed Arabic Broadcast News. In: Elleithy, K., Sobh, T. (eds) New Trends in Networking, Computing, E-learning, Systems Sciences, and Engineering. Lecture Notes in Electrical Engineering, vol 312. Springer, Cham. https://doi.org/10.1007/978-3-319-06764-3_32

  • DOI: https://doi.org/10.1007/978-3-319-06764-3_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06763-6

  • Online ISBN: 978-3-319-06764-3

  • eBook Packages: Engineering (R0)
