Agglomerative Hierarchical Clustering Techniques for Arabic Documents

Froud, Hanane; Lachkar, Abdelmonaime

doi:10.1007/978-3-319-00951-3_25


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 225)


Abstract

Clustering Arabic documents is an important task for obtaining good results with search engines, Information Retrieval (IR) systems, and text mining applications, especially given the rapid growth in the number of online documents written in Arabic. Document clustering is the process of partitioning a collection of texts into subgroups whose members are similar in content. Clustering algorithms fall mainly into two categories: hierarchical algorithms and partitional algorithms. In this paper, we study the most popular hierarchical approach, the agglomerative hierarchical algorithm, using seven linkage techniques with a variety of distance functions and similarity measures, such as the Euclidean distance, cosine similarity, the Jaccard coefficient, and the Pearson correlation coefficient. We test their effectiveness on Arabic document clustering and recommend the best of the techniques tested. We also study the effect of stemming by clustering the stemmed test dataset with the same clustering technique and the same similarity/distance measures. The results show that, on the one hand, the Ward function outperformed the other linkage techniques; on the other hand, stemming does not yield good results, although it makes the document representation smaller and the clustering faster.
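To make the ingredients named in the abstract concrete, the following is a minimal sketch of the four measures and a naive agglomerative loop. It is not the authors' implementation. Several things are assumptions: the toy term-frequency vectors in `docs`, the function names, the extended (Tanimoto) form of the Jaccard coefficient for real-valued vectors, and the choice to show only single, complete, and average linkage. The abstract does not list the seven linkage techniques, and Ward, which it reports as the best performer, is not implemented here.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two term vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Extended (Tanimoto) Jaccard coefficient; this form is an assumption."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)


def pearson(a, b):
    """Pearson correlation coefficient between two non-constant vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def agglomerative(vectors, k, dist=euclidean, linkage="average"):
    """Naive agglomerative clustering: start with one cluster per document
    and repeatedly merge the two closest clusters until k clusters remain.
    Returns a list of clusters, each a list of document indices."""
    clusters = [[i] for i in range(len(vectors))]

    def cluster_distance(c1, c2):
        pairwise = [dist(vectors[i], vectors[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unsupported linkage: {linkage}")

    while len(clusters) > k:
        # Pick the pair (i, j), i < j, with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical term-frequency vectors for four short documents.
docs = [[2, 1, 0, 0], [3, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 3]]


def cosine_distance(a, b):
    """Turn the cosine similarity into a distance for clustering."""
    return 1.0 - cosine_similarity(a, b)


by_euclidean = agglomerative(docs, k=2, dist=euclidean, linkage="average")
by_cosine = agglomerative(docs, k=2, dist=cosine_distance, linkage="complete")
```

On this toy input both calls group documents 0 and 1 together and documents 2 and 3 together. A similarity measure has to be converted to a distance before it can drive the merge step; subtracting it from 1, as `cosine_distance` does, is one common choice and is assumed here rather than taken from the paper.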



Author information

Authors and Affiliations

L.S.I.S, E.N.S.A, University Sidi Mohamed Ben Abdellah (USMBA), Fez, Morocco
Hanane Froud & Abdelmonaime Lachkar


Corresponding author

Correspondence to Hanane Froud.

Editor information

Editors and Affiliations

Dept of Computer Engineering, KTO Karatay University, 130 Karatay, Konya, 42020, Turkey
Dhinaharan Nagamalai
School of Computing and Informatics, University of Louisiana at Lafayette, 214 Oliver Hall, Lafayette, 70504-4330, Louisiana, USA
Ashok Kumar
Dept. of Electrical &, Prairie View A&M University, Prairie View, 77446-0519, Texas, USA
Annamalai Annamalai


About this paper

Cite this paper

Froud, H., Lachkar, A. (2013). Agglomerative Hierarchical Clustering Techniques for Arabic Documents. In: Nagamalai, D., Kumar, A., Annamalai, A. (eds) Advances in Computational Science, Engineering and Information Technology. Advances in Intelligent Systems and Computing, vol 225. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00951-3_25


DOI: https://doi.org/10.1007/978-3-319-00951-3_25
Publisher Name: Springer, Heidelberg
Print ISBN: 978-3-319-00950-6
Online ISBN: 978-3-319-00951-3
eBook Packages: Engineering, Engineering (R0)
