An automatic classification of text documents based on correlative association of words

Journal of Intelligent Information Systems

Abstract

Speeding up classifier training without degrading predictive capability is an important concern in text classification. Feature selection plays a key role in this context. It selects a subset of the most informative words (terms) from the set of all words. The correlative association of words with multiple classes increases the uncertainty about which class a word represents. The representative words of a class are either positive or negative in nature. Standard feature selection methods, viz. Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and Chi Square (CHI), do not consider the positive or negative nature of words, which affects classifier performance. To address this issue, this paper presents a novel feature selection method named Correlative Association Score (CAS). It combines the strength, mutual information, and strong association of words to determine their positive or negative nature for a class. CAS selects a small number (k) of informative words from the set of all m words. These informative words are used to generate a set of N-grams of length 1 to 3. Finally, the standard Apriori algorithm combines the power of CAS and CHI to select the top b most informative N-grams, where b is set by empirical evaluation. The selected N-grams are evaluated with Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) classifiers. Four standard text data sets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23, are used for experimental analysis. Two standard performance measures, Macro_F1 and Micro_F1, show a significant improvement in the results with the proposed CAS method.
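
To make the N-gram selection step more concrete, the following is a minimal Python sketch of one part of the pipeline only: generating N-grams of length 1 to 3 and keeping the top b by Chi Square (CHI) score. It does not implement CAS or the Apriori step, since the abstract does not give their formulas. The function names, the whitespace tokenizer, the choice of taking the maximum CHI score over classes, and the toy documents are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return the set of word N-grams of length 1..max_n in a token list."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def top_b_ngrams(docs, labels, b, max_n=3):
    """Rank N-grams by their maximum chi-square score over all classes.

    Using the maximum over classes is an assumed global scoring policy,
    not necessarily the one used in the paper.
    """
    doc_grams = [ngrams(doc.lower().split(), max_n) for doc in docs]
    class_sizes = Counter(labels)
    df = Counter()           # number of documents containing each N-gram
    df_in_class = Counter()  # same count, split by (N-gram, class)
    for grams, label in zip(doc_grams, labels):
        for gram in grams:
            df[gram] += 1
            df_in_class[gram, label] += 1

    total = len(docs)
    scores = {}
    for gram in df:
        best = 0.0
        for cls, size in class_sizes.items():
            in_cls = df_in_class[gram, cls]
            out_cls = df[gram] - in_cls
            best = max(best, chi_square(in_cls, out_cls,
                                        size - in_cls,
                                        total - size - out_cls))
        scores[gram] = best

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:b]


# Made-up toy corpus, for illustration only.
toy_docs = [
    "viral infection fever",
    "viral flu fever",
    "bacterial infection antibiotic",
    "bacterial culture antibiotic",
]
toy_labels = ["viral", "viral", "bacterial", "bacterial"]

print(top_b_ngrams(toy_docs, toy_labels, b=4))
```

In this toy corpus the word "infection" appears equally often in both classes, so its CHI score is zero and it is not selected, while words that occur in only one class rank highest. CHI alone does not say whether a word indicates membership or non-membership of a class, which is the gap the abstract says CAS is meant to address.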

Corresponding author

Correspondence to Deepak Agnihotri.

Cite this article

Agnihotri, D., Verma, K. & Tripathi, P. An automatic classification of text documents based on correlative association of words. J Intell Inf Syst 50, 549–572 (2018). https://doi.org/10.1007/s10844-017-0482-3

  • DOI: https://doi.org/10.1007/s10844-017-0482-3
