Text Classification Using “Anti”-Bayesian Quantile Statistics-Based Classifiers

Transactions on Computational Collective Intelligence XXV

Part of the book series: Lecture Notes in Computer Science (TCCI, volume 9990)

Abstract

The problem of Text Classification (TC) has been studied for decades. It is particularly interesting because the features are derived from syntactic or semantic indicators, while the classification itself is based on statistical Pattern Recognition (PR) strategies. Thus, all the recorded TC schemes follow the same fundamental paradigm: once the statistical features have been inferred from the syntactic/semantic indicators, the classifiers themselves are well-established ones, such as the Bayesian, the Naïve Bayesian and the SVM, or those that are neural or fuzzy. In this paper, we demonstrate that, by virtue of the skewed distributions of the features, one can advantageously work with the information latent in certain “non-central” quantiles (i.e., those distant from the mean) of the distributions. We show that such classifiers exist and are attainable, and that they can be designed and implemented within the recently-introduced paradigm of Quantile Statistics (QS)-based classifiers. These classifiers, referred to as Classification by Moments of Quantile Statistics (CMQS), are essentially “Anti”-Bayesian in their modus operandi. (The foundational properties of CMQS, for generic and some straightforward distributions, were initially described in [17]. Their properties for uni-dimensional distributions of the exponential family are included in [9], and for multi-dimensional distributions in [18]. The authors of [17], [9] and [18] had initially proposed their results as being based on the Order-Statistics of the distributions. This was later corrected in [19], where they showed that their results were, rather, based on their Quantile Statistics.) To achieve our goal, we demonstrate the power and potential of CMQS to describe the very high-dimensional TC-related vector spaces in terms of a limited number of “outlier-based” statistics. Thereafter, the PR task in classification invokes the CMQS classifier for the underlying multi-class problem by using a linear number of pair-wise CMQS-based classifiers. Through rigorous testing on the standard 20-Newsgroups corpus, we show that CMQS-based TC attains an accuracy comparable to that of the best-reported classifiers. We also discuss the potential of fusing the results of a CMQS-based methodology with those obtained from a more traditional scheme.
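
To make the idea of classifying by “non-central” quantiles concrete, the following is a minimal sketch for two classes and a single feature. It is an illustration under stated assumptions, not the implementation evaluated in this chapter: the function names `quantile`, `fit_quantile_pair` and `classify`, and the choice of the 1/3 and 2/3 quantiles, are assumptions made here. Each class is summarized by the quantile that lies on the side facing the other class, and a test value is assigned to the class whose quantile point is nearer. The high-dimensional and multi-class cases are not covered.

```python
from statistics import median


def quantile(values, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of a non-empty sample."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def fit_quantile_pair(sample_a, sample_b, q=1 / 3):
    """Return one quantile point per class for a two-class, 1-D problem.

    Assumption (illustrative only): the class lying to the left is
    represented by its (1 - q) quantile and the class lying to the right
    by its q quantile, i.e. each class by a point away from its own centre.
    """
    if median(sample_a) <= median(sample_b):
        return quantile(sample_a, 1 - q), quantile(sample_b, q)
    return quantile(sample_a, q), quantile(sample_b, 1 - q)


def classify(x, point_a, point_b):
    """Assign x to the class ('a' or 'b') whose quantile point is nearer."""
    return "a" if abs(x - point_a) <= abs(x - point_b) else "b"


if __name__ == "__main__":
    # Two well-separated synthetic samples, used only to exercise the sketch.
    class_a = [i / 100 for i in range(101)]      # values in [0, 1]
    class_b = [2 + i / 100 for i in range(101)]  # values in [2, 3]
    point_a, point_b = fit_quantile_pair(class_a, class_b)
    print(classify(0.5, point_a, point_b), classify(2.5, point_a, point_b))
```

Note that no mean or variance is estimated anywhere in the sketch; the decision depends only on two quantile points, which is the sense in which such a rule departs from a comparison against class centres.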

The authors are grateful for the partial support provided by NSERC, the Natural Sciences and Engineering Research Council of Canada. A preliminary version of this paper was presented at ICCCI’15, the 2015 International Conference on Computational Collective Intelligence Technologies and Applications, in Madrid, Spain, in September 2015. The paper was a Plenary/Keynote Talk at the conference. The first author is also an Adjunct Professor with the University of Agder in Grimstad, Norway.

Notes

  1. SMART is an abbreviation for Salton’s Magic Automatic Retriever of Text.

  2. The formal definitions for the TF and the TFIDF are given in Sect. 4.3.

  3. The static TFIDF weighting scheme presented above becomes inefficient when documents arrive continuously, as in systems used for online detection. For such settings, the literature also reports the use of the Adaptive TFIDF. The Adaptive IDF can be used efficiently for document retrieval once a sufficient number of “past” documents have been processed: the initial IDF values are calculated using a retrospective corpus of documents, and these IDF values are then updated incrementally. The literature also reports other metrics of comparison, such as the Jaccard similarity, but since these are not the primary concern of this paper, we will not elaborate on them here.

  4. “Anti”-Bayesian methods have also been used to design novel Prototype Reduction Schemes (PRS) [21] and novel Border Identification (BI) algorithms [20]. The use of such “Anti”-Bayesian PRS and BI techniques in TC is extremely promising and is still unreported.

  5. As mentioned earlier, the authors of [17], [9] and [18] (cited in their chronological order) had initially proposed their results as being based on the Order-Statistics of the distributions. This was later corrected in [19], where they showed that their results were, rather, based on their Quantile Statistics.

  6. All of the theoretical results of [17], [9] and [18] were confirmed with rigorous experimental testing. The results of [18] were also proven on real-life data sets.

  7. In all cases, they worked with the assumption that the a priori distributions were identical.

  8. The documents used in this test were very short, which explains why the histograms are heavily skewed in favour of lower word frequencies.

  9. Given that these extreme points give better results in the next experiment, when we classify using the TFIDF criteria (instead of merely the TF criteria), we hypothesize that this poor behavior is probably due to noise from non-significant words that is somehow amplified in the extreme CMQS points. This issue, however, is still unresolved.
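
The incremental update described in Note 3 can be sketched as follows. This is a minimal illustration, not the scheme from the cited literature: the class name `AdaptiveIdf`, its methods, and the IDF form log(N / df) are assumptions made here, and the chapter's own definitions of the TF and the TFIDF are those of Sect. 4.3. Document counts are seeded from a retrospective corpus and then updated one document at a time, so the IDF values never have to be recomputed from scratch.

```python
import math
from collections import Counter


class AdaptiveIdf:
    """Incrementally maintained IDF statistics (illustrative sketch only)."""

    def __init__(self, retrospective_docs):
        # Seed the statistics from a retrospective corpus of token lists.
        self.n_docs = 0
        self.doc_freq = Counter()
        for tokens in retrospective_docs:
            self.update(tokens)

    def update(self, tokens):
        """Fold one newly arrived document into the running counts."""
        self.n_docs += 1
        self.doc_freq.update(set(tokens))  # count each term once per document

    def idf(self, term):
        """Assumed IDF form log(N / df); unseen terms get weight 0.0."""
        df = self.doc_freq.get(term, 0)
        return math.log(self.n_docs / df) if df else 0.0

    def tfidf(self, tokens):
        """Raw term frequency times the current IDF, per term."""
        tf = Counter(tokens)
        return {term: count * self.idf(term) for term, count in tf.items()}


if __name__ == "__main__":
    model = AdaptiveIdf([["quantile", "text"], ["text", "classifier"]])
    model.update(["quantile", "outlier"])  # a document arriving later
    print(model.tfidf(["quantile", "outlier", "outlier"]))
```

Each call to `update` costs time proportional to the length of the new document, which is what makes the adaptive variant suitable for continuously arriving documents.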

References

  1. Alahmadi, A., Joorabchi, A., Mahdi, A.E.: A new text representation scheme combining bag-of-words and bag-of-concepts approaches for automatic text classification. In: Proceedings of the 7th IEEE GCC Conference and Exhibition, Doha, Qatar, pp. 108–113, November 2014

  2. Debole, F., Sebastiani, F.: Supervised term weighting for automated text categorization. In: Proceedings of the 18th ACM Symposium on Applied Computing, Melbourne, USA, pp. 784–788, March 2003

  3. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. A Wiley Interscience Publication, New York (2006)

  4. Dumoulin, J.: Smoothing of n-gram language models of human chats. In: Proceedings of the Joint 6th International Conference on Soft Computing and Intelligent Systems (SCIS) and 13th International Symposium on Advanced Intelligent Systems (ISIS), Kobe, Japan, pp. 1–4, November 2012

  5. Lu, L., Liu, Y.-S.: Research of English text classification methods based on semantic meaning. In: Proceedings of the ITI 3rd International Conference on Information and Communications Technology, Cairo, Egypt, pp. 689–700, December 2005

  6. Madsen, R.E., Sigurdsson, S., Hansen, L.K., Larsen, J.: Pruning the vocabulary for better context recognition. In: Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, UK, vol. 2, pp. 483–488, August 2004

  7. Menon, R., Keerthi, S.S., Loh, H.T., Brombacher, A.C.: On the effectiveness of latent semantic analysis for the categorization of call centre records. In: Proceedings of the IEEE International Engineering Management Conference, Singapore, vol. 2, pp. 545–550 (2004)

  8. Ning, Y., Zhu, T., Wang, Y.: Affective-word based Chinese text sentiment classification. In: Proceedings of the 5th International Conference on Pervasive Computing and Applications (ICPCA), Maribor, Slovenia, pp. 111–115, December 2010

  9. Oommen, B.J., Thomas, A.: Optimal order statistics-based “Anti-Bayesian” parametric pattern classification for the exponential family. Pattern Recogn. 47, 40–55 (2014)

  10. Ouamour, S., Sayoud, H.: Authorship attribution of ancient texts written by ten Arabic travelers using character N-Grams. In: Proceedings of the 2013 International Conference on Computer, Information and Telecommunication Systems (CITS), Piraeus-Athens, Greece, pp. 1–5, May 2013

  11. Qiang, G.: An effective algorithm for improving the performance of Naïve Bayes for text classification. In: Proceedings of the Second International Conference on Computer Research and Development, Kuala Lumpur, Malaysia, pp. 699–701, May 2010

  12. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York (1983)

  13. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Comm. ACM 18, 613–620 (1975)

  14. Salton, G., Yang, C.S., Yu, C.: A theory of term importance in automatic text analysis. Technical report, Ithaca, NY, USA (1974)

  15. Salton, G., Yang, C.S., Yu, C.: Term weighting approaches in automatic text retrieval. Technical report, Ithaca, NY, USA (1987)

  16. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)

  17. Thomas, A., Oommen, B.J.: The fundamental theory of optimal “Anti-Bayesian” parametric pattern classification using order statistics criteria. Pattern Recogn. 46, 376–388 (2013)

  18. Thomas, A., Oommen, B.J.: Order statistics-based parametric classification for multi-dimensional distributions. Pattern Recogn. 46, 3472–3482 (2013)

  19. Thomas, A., Oommen, B.J.: Corrigendum to three papers that deal with “Anti”-Bayesian pattern recognition. Pattern Recogn. 47, 2301–2302 (2014)

  20. Thomas, A., Oommen, B.J.: A novel border identification algorithm based on an “Anti-Bayesian” paradigm. In: Proceedings of CAIP’13, the 2013 International Conference on Computer Analysis of Images and Patterns, York, UK, pp. 196–203, August 2013

  21. Thomas, A., Oommen, B.J.: Ultimate order statistics-based prototype reduction schemes. In: Proceedings of AI 2013, The 2013 Australasian Joint Conference on Artificial Intelligence, Dunedin, New Zealand, pp. 421–433, December 2013

  22. Wu, G., Liu, K.: Research on text classification algorithm by combining statistical and ontology methods. In: Proceedings of the International Conference on Computational Intelligence and Software Engineering, Wuhan, China, pp. 1–4, December 2009

Author information

Corresponding author

Correspondence to B. John Oommen.

Copyright information

© 2016 Springer-Verlag GmbH Germany

About this chapter

Cite this chapter

Oommen, B.J., Khoury, R., Schmidt, A. (2016). Text Classification Using “Anti”-Bayesian Quantile Statistics-Based Classifiers. In: Nguyen, N., Kowalczyk, R., Orłowski, C., Ziółkowski, A. (eds) Transactions on Computational Collective Intelligence XXV. Lecture Notes in Computer Science, vol 9990. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53580-6_7

  • DOI: https://doi.org/10.1007/978-3-662-53580-6_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-53579-0

  • Online ISBN: 978-3-662-53580-6

  • eBook Packages: Computer Science, Computer Science (R0)
