Abstract
Improving the training speed of a classifier without degrading its predictive capability is an important concern in text classification. Feature selection plays a key role in this context: it selects a subset of the most informative words (terms) from the set of all words. The correlative association of words with the classes increases the uncertainty about which class a word represents. The representative words of a class are either positive or negative in nature. The standard feature selection methods, viz. Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and Chi Square (CHI), do not consider the positive and negative nature of words, which affects the performance of the classifiers. To address this issue, this paper presents a novel feature selection method named Correlative Association Score (CAS). It combines the strength, mutual information, and strong association of words to determine their positive and negative nature for a class. CAS selects a few (k) informative words from the set of all (m) words. These informative words generate a set of N-grams of length 1–3. Finally, the standard Apriori algorithm ensembles the power of CAS and CHI to select the top b informative N-grams, where b is a number set by empirical evaluation. Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) classifiers evaluate the performance of the selected N-grams. Four standard text data sets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23, are used for the experimental analysis. Two standard performance measures, Macro_F1 and Micro_F1, show a significant improvement in the results using the proposed CAS method.
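The evaluation pipeline the abstract describes (extract 1–3-grams, keep only the top-scoring features, then classify with MNB and LSVM and report Macro_F1/Micro_F1) can be sketched with scikit-learn. Note this is a minimal illustration, not the paper's method: the CAS score itself is not specified here, so the standard CHI (chi-square) criterion stands in for it, and the toy corpus, labels, and the value of k are assumptions for the sake of a runnable example.

```python
# Sketch of the baseline evaluation pipeline from the abstract.
# CHI (chi2) is used in place of CAS, which is not specified here.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Hypothetical toy corpus standing in for Webkb / 20Newsgroup / Ohsumed.
train_docs = [
    "the team won the football match",
    "a great goal in the football game",
    "stocks fell sharply in the market",
    "investors sold shares as the market dropped",
]
train_y = ["sport", "sport", "finance", "finance"]
test_docs = ["the football team scored a goal",
             "the market and shares fell"]
test_y = ["sport", "finance"]

# Generate N-grams of length 1-3, as in the paper's N-gram step.
vec = CountVectorizer(ngram_range=(1, 3))
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# Keep the k top-scoring features (k is an assumed value here; the
# paper sets such cut-offs by empirical evaluation).
selector = SelectKBest(chi2, k=20)
X_train_sel = selector.fit_transform(X_train, train_y)
X_test_sel = selector.transform(X_test)

# Evaluate the selected features with MNB and LSVM classifiers.
for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X_train_sel, train_y)
    pred = clf.predict(X_test_sel)
    macro = f1_score(test_y, pred, average="macro")
    micro = f1_score(test_y, pred, average="micro")
    print(f"{type(clf).__name__}: Macro_F1={macro:.2f} Micro_F1={micro:.2f}")
```

On the real benchmark corpora the same pipeline applies unchanged; only the corpus loading and the values of k and b would differ.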
Cite this article
Agnihotri, D., Verma, K. & Tripathi, P. An automatic classification of text documents based on correlative association of words. J Intell Inf Syst 50, 549–572 (2018). https://doi.org/10.1007/s10844-017-0482-3