Abstract
Improving the training speed of a classifier without degrading its predictive capability is an important concern in text classification. Feature selection plays a key role in this context: it selects a subset of the most informative words (terms) from the set of all words. The correlative association of words with the classes increases the uncertainty about which class a word represents. The representative words of a class are either positive or negative in nature. The standard feature selection methods, viz. Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and Chi Square (CHI), do not consider the positive and negative nature of words, which affects the performance of the classifiers. To address this issue, this paper presents a novel feature selection method named Correlative Association Score (CAS). It combines the strength, mutual information, and strong association of words to determine their positive and negative nature for a class. CAS selects a few (k) informative words from the set of all (m) words. These informative words generate a set of N-grams of length 1–3. Finally, the standard Apriori algorithm ensembles the power of CAS and CHI to select the top b informative N-grams, where b is a number set by empirical evaluation. Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) classifiers evaluate the performance of the selected N-grams. Four standard text data sets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23, are used for the experimental analysis. Two standard performance measures, Macro_F1 and Micro_F1, show a significant improvement in the results using the proposed CAS method.
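The evaluation pipeline the abstract describes (extract 1–3-grams, keep only the top-scoring features, then classify with MNB and LSVM and report Macro_F1/Micro_F1) can be sketched with scikit-learn. Note this is a minimal illustration, not the paper's method: the CAS score itself is not specified here, so the standard CHI (chi-square) criterion stands in for it, and the toy corpus, labels, and the value of k are assumptions for the sake of a runnable example.

```python
# Sketch of the baseline evaluation pipeline from the abstract.
# CHI (chi2) is used in place of CAS, which is not specified here.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Hypothetical toy corpus standing in for Webkb / 20Newsgroup / Ohsumed.
train_docs = [
    "the team won the football match",
    "a great goal in the football game",
    "stocks fell sharply in the market",
    "investors sold shares as the market dropped",
]
train_y = ["sport", "sport", "finance", "finance"]
test_docs = ["the football team scored a goal",
             "the market and shares fell"]
test_y = ["sport", "finance"]

# Generate N-grams of length 1-3, as in the paper's N-gram step.
vec = CountVectorizer(ngram_range=(1, 3))
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# Keep the k top-scoring features (k is an assumed value here; the
# paper sets such cut-offs by empirical evaluation).
selector = SelectKBest(chi2, k=20)
X_train_sel = selector.fit_transform(X_train, train_y)
X_test_sel = selector.transform(X_test)

# Evaluate the selected features with MNB and LSVM classifiers.
for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X_train_sel, train_y)
    pred = clf.predict(X_test_sel)
    macro = f1_score(test_y, pred, average="macro")
    micro = f1_score(test_y, pred, average="micro")
    print(f"{type(clf).__name__}: Macro_F1={macro:.2f} Micro_F1={micro:.2f}")
```

On the real benchmark corpora the same pipeline applies unchanged; only the corpus loading and the values of k and b would differ.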
Cite this article
Agnihotri, D., Verma, K. & Tripathi, P. An automatic classification of text documents based on correlative association of words. J Intell Inf Syst 50, 549–572 (2018). https://doi.org/10.1007/s10844-017-0482-3