Abstract
Document categorization is gaining importance due to the large volume of electronic information, which requires automatic organization and pattern identification. Because of the morphological complexity of the language, automatic categorization of Amharic documents is a difficult task. This paper presents a system that categorizes Amharic documents based on the frequency of itemsets obtained after analyzing the morphology of the language. We selected seven categories into which a given document is to be classified. Categorization is achieved by employing an extended version of the a priori algorithm, which has traditionally been used for knowledge mining in the form of association rules. The system is tested on a corpus of Amharic news documents, and experimental results are reported.
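The general idea of itemset-based categorization can be illustrated with a minimal sketch. It is not the authors' extended algorithm, whose details the abstract does not give. It assumes that each document has already been reduced to a set of stemmed terms, that frequent itemsets are mined separately for each category with a plain a priori pass, and that a new document is assigned to the category whose itemsets it matches most strongly. The function names `frequent_itemsets` and `categorize`, the scoring rule, the support threshold, and the English placeholder tokens are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Plain a priori mining over a list of documents (each a set of terms)."""
    n = len(docs)
    # Level 1: count single terms and keep those meeting the support threshold.
    counts = defaultdict(int)
    for doc in docs:
        for term in doc:
            counts[frozenset([term])] += 1
    current = {s for s, c in counts.items() if c / n >= min_support}
    result = set(current)
    size = 1
    while current and size < max_size:
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        # Count support of the surviving candidates.
        current = {
            c for c in candidates
            if sum(1 for doc in docs if c <= doc) / n >= min_support
        }
        result |= current
    return result


def categorize(doc, itemsets_by_category):
    """Assumed scoring rule: sum the sizes of the itemsets contained in doc."""
    scores = {
        cat: sum(len(s) for s in itemsets if s <= doc)
        for cat, itemsets in itemsets_by_category.items()
    }
    return max(scores, key=scores.get)


# Toy training data; the tokens stand in for stemmed Amharic terms.
training = {
    "sports": [{"team", "goal", "match"}, {"team", "match", "coach"},
               {"goal", "match", "team"}],
    "economy": [{"bank", "market", "price"}, {"market", "price", "trade"},
                {"bank", "price", "market"}],
}
model = {cat: frequent_itemsets(docs, min_support=0.6)
         for cat, docs in training.items()}
print(categorize({"team", "match", "price"}, model))
```

Weighting larger itemsets more heavily is one possible design choice; it rewards co-occurring terms over isolated ones, which is the usual motivation for using itemsets rather than single-term frequencies.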
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Hailu, A., Assabie, Y. (2016). Itemsets-Based Amharic Document Categorization Using an Extended A Priori Algorithm. In: Vetulani, Z., Uszkoreit, H., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2013. Lecture Notes in Computer Science, vol 9561. Springer, Cham. https://doi.org/10.1007/978-3-319-43808-5_24
DOI: https://doi.org/10.1007/978-3-319-43808-5_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43807-8
Online ISBN: 978-3-319-43808-5
eBook Packages: Computer Science, Computer Science (R0)