Abstract
Arabic document clustering is an important task for obtaining good results with search engines, Information Retrieval (IR) systems, and text mining applications, especially given the rapid growth in the number of online documents written in Arabic. Document clustering is the process of partitioning a collection of texts into subgroups whose members are similar in content. Clustering algorithms fall mainly into two categories: hierarchical algorithms and partitional algorithms. In this paper, we study the most popular hierarchical approach, the agglomerative hierarchical algorithm, using seven linkage techniques with a wide variety of distance functions and similarity measures, such as the Euclidean distance, cosine similarity, the Jaccard coefficient, and the Pearson correlation coefficient, in order to test their effectiveness for clustering Arabic documents and to recommend the best of the techniques tested. We also study the effect of stemming the test dataset before clustering it with the same clustering technique and the same similarity/distance measures. The results show that, on the one hand, the Ward function outperformed the other linkage techniques; on the other hand, stemming does not yield good results, but it makes the document representation smaller and the clustering faster.
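The measures and the agglomerative procedure named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy term-frequency vectors, and the choice of the extended (Tanimoto) form of the Jaccard coefficient are assumptions made for illustration, and only three simple linkage criteria (single, complete, average) are shown rather than the seven studied in the paper. Ward linkage is omitted for brevity.

```python
import math
from itertools import combinations


def euclidean_distance(a, b):
    """Straight-line distance between two term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_coefficient(a, b):
    """Extended (Tanimoto) Jaccard for weighted vectors; an assumed variant."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def pearson_correlation(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return cov / spread if spread else 0.0


def cosine_distance(a, b):
    """Turn cosine similarity into a dissimilarity for clustering."""
    return 1.0 - cosine_similarity(a, b)


def agglomerative(vectors, distance, linkage="average", k=2):
    """Naive bottom-up clustering: merge the two closest clusters until k remain.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def cluster_distance(c1, c2):
        pairwise = [distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        if linkage == "single":      # closest pair of members
            return min(pairwise)
        if linkage == "complete":    # farthest pair of members
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average over all pairs

    while len(clusters) > k:
        # combinations() yields i < j, so deleting index j leaves i valid.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy term-frequency vectors standing in for four documents (hypothetical data).
docs = [[2, 1, 0, 0], [3, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 3]]
print(agglomerative(docs, cosine_distance, linkage="average", k=2))
```

Passing a different `distance` function or `linkage` name to `agglomerative` is how one would compare measure and linkage combinations on the same document vectors. This naive version recomputes every cluster distance at each merge, which is acceptable only for small collections.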
Copyright information
© 2013 Springer International Publishing Switzerland
About this paper
Cite this paper
Froud, H., Lachkar, A. (2013). Agglomerative Hierarchical Clustering Techniques for Arabic Documents. In: Nagamalai, D., Kumar, A., Annamalai, A. (eds) Advances in Computational Science, Engineering and Information Technology. Advances in Intelligent Systems and Computing, vol 225. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-00951-3_25
DOI: https://doi.org/10.1007/978-3-319-00951-3_25
Publisher Name: Springer, Heidelberg
Print ISBN: 978-3-319-00950-6
Online ISBN: 978-3-319-00951-3
eBook Packages: Engineering, Engineering (R0)