Agglomerative Hierarchical Clustering Techniques for Arabic Documents

  • Conference paper
Advances in Computational Science, Engineering and Information Technology

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 225)

Abstract

Arabic document clustering is an important task for obtaining good results with search engines, Information Retrieval (IR) systems, and text mining applications, especially given the rapid growth in the number of online documents in Arabic. Document clustering is the process of partitioning a collection of texts into subgroups whose members are similar in content. Clustering algorithms are mainly divided into two categories: hierarchical algorithms and partitional algorithms. In this paper, we study the most popular hierarchical approach, the agglomerative hierarchical algorithm, using seven linkage techniques with a wide variety of distance functions and similarity measures, such as the Euclidean distance, cosine similarity, the Jaccard coefficient, and the Pearson correlation coefficient, in order to test their effectiveness for clustering Arabic documents and to recommend the best of the techniques tested. Furthermore, we study the effect of stemming the test dataset before clustering it with the same clustering technique and the similarity/distance measures cited above. The results show that, on the one hand, the Ward function outperformed the other linkage techniques; on the other hand, stemming did not yield good results, although it makes the document representation smaller and the clustering faster.
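
To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of the four measures and of a naive agglomerative merge loop. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy term-frequency vectors, and the choice of single, complete, and average linkage are all hypothetical, the Jaccard coefficient is taken in its extended (Tanimoto) form for weighted vectors, and Ward linkage and Arabic stemming are not covered.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two term vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_coefficient(a, b):
    """Extended Jaccard (Tanimoto) coefficient for weighted vectors (assumed form)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two term vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    dev_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (dev_a * dev_b) if dev_a and dev_b else 0.0


def cosine_distance(a, b):
    """Turn the cosine similarity into a distance so that smaller means closer."""
    return 1.0 - cosine_similarity(a, b)


def agglomerative(vectors, distance, linkage="average", k=2):
    """Naive agglomerative clustering: merge the two closest clusters until k remain.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def cluster_distance(c1, c2):
        pairwise = [distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unsupported linkage: {linkage}")

    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen linkage.
        _, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical term-frequency vectors for four short documents.
docs = [[2, 1, 0, 0], [3, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 3]]
result = agglomerative(docs, cosine_distance, linkage="average", k=2)
print(result)
```

The merge loop recomputes every pairwise cluster distance at each step, which keeps the sketch short but is far slower than what a production library would do on a real document collection.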

Corresponding author

Correspondence to Hanane Froud.

Copyright information

© 2013 Springer International Publishing Switzerland

About this paper

Cite this paper

Froud, H., Lachkar, A. (2013). Agglomerative Hierarchical Clustering Techniques for Arabic Documents. In: Nagamalai, D., Kumar, A., Annamalai, A. (eds) Advances in Computational Science, Engineering and Information Technology. Advances in Intelligent Systems and Computing, vol 225. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00951-3_25

  • DOI: https://doi.org/10.1007/978-3-319-00951-3_25

  • Publisher Name: Springer, Heidelberg

  • Print ISBN: 978-3-319-00950-6

  • Online ISBN: 978-3-319-00951-3

  • eBook Packages: Engineering, Engineering (R0)
