Abstract
Clustering is a class of machine learning techniques whose objective is to automatically partition data into groups called clusters. Clustering of Punjabi documents has numerous applications in natural language processing, yet little work has been done so far for native languages such as Punjabi. This study presents the results of common document clustering techniques, such as agglomerative and K-means clustering, applied with different feature extraction methods, and compares their performance using intrinsic and extrinsic measures. The pre-trained Punjabi word vector model recently released by Facebook is also evaluated as one of the feature extraction methods. The aim is to determine which combination of clustering algorithm and feature extraction technique gives the best results. The study also uses a supervised approach to evaluate the results of an unsupervised learning method such as clustering.
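To make the idea of an extrinsic (label-based) evaluation concrete, here is a minimal sketch of cluster purity, one common extrinsic measure. The abstract does not say which extrinsic measures the authors used, so purity is an assumption chosen for illustration; the function name `purity`, the category labels, and the two cluster assignments are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Group the gold labels by the cluster each document was assigned to.
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_by_cluster[cluster].append(label)

    # Each cluster contributes the count of its most frequent gold label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical gold categories for six documents (illustrative only).
gold = ["sports", "sports", "sports", "politics", "politics", "politics"]

# Hypothetical outputs of two clustering runs on the same six documents.
run_a = [0, 0, 0, 1, 1, 1]
run_b = [0, 0, 1, 1, 1, 0]

purity_a = purity(run_a, gold)
purity_b = purity(run_b, gold)
```

A higher purity means the clusters line up more closely with the known categories, which is how labelled data can be used to rank combinations of clustering algorithm and feature extraction method. Purity alone favours solutions with many small clusters, so in practice it is usually read alongside other measures.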
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Singh, I., Singh, V.P., Aggarwal, N. (2020). Comparative Study on Punjabi Document Clustering. In: Bansal, J., Gupta, M., Sharma, H., Agarwal, B. (eds) Communication and Intelligent Systems. ICCIS 2019. Lecture Notes in Networks and Systems, vol 120. Springer, Singapore. https://doi.org/10.1007/978-981-15-3325-9_38
DOI: https://doi.org/10.1007/978-981-15-3325-9_38
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3324-2
Online ISBN: 978-981-15-3325-9
eBook Packages: Intelligent Technologies and Robotics (R0)