Text Classification Using “Anti”-Bayesian Quantile Statistics-Based Classifiers

Oommen, B. John; Khoury, Richard; Schmidt, Aron

doi:10.1007/978-3-662-53580-6_7

Part of the book series: Lecture Notes in Computer Science (TCCI, volume 9990)

Abstract

The problem of Text Classification (TC) has been studied for decades. It is particularly interesting because the features are derived from syntactic or semantic indicators, while the classification itself is based on statistical Pattern Recognition (PR) strategies. Thus, all the recorded TC schemes work under the fundamental paradigm that once the statistical features are inferred from the syntactic/semantic indicators, the classifiers themselves are well-established ones such as the Bayesian, the Naïve Bayesian, and the SVM, as well as those that are neural or fuzzy. In this paper, we demonstrate that, by virtue of the skewed distributions of the features, one can advantageously work with information latent in certain “non-central” quantiles (i.e., those distant from the mean) of the distributions. We show that such classifiers exist and are attainable, and that their design and implementation rest on the recently introduced paradigm of Quantile Statistics (QS)-based classifiers. (The foundational properties of CMQS, for generic and some straightforward distributions, were initially described in [17]. Their properties for uni-dimensional distributions of the exponential family are given in [9], and for multi-dimensional distributions in [18]. The authors of [17], [9] and [18] had initially proposed their results as being based on the Order-Statistics of the distributions. This was later corrected in [19], where they showed that their results were, rather, based on their Quantile Statistics.) These classifiers, referred to as Classification by Moments of Quantile Statistics (CMQS), are essentially “Anti”-Bayesian in their modus operandi. To achieve our goal, we demonstrate the power and potential of CMQS to describe the very high-dimensional TC-related vector spaces in terms of a limited number of “outlier-based” statistics. Thereafter, the PR task in classification invokes the CMQS classifier for the underlying multi-class problem by using a linear number of pair-wise CMQS-based classifiers. Through rigorous testing on the standard 20-Newsgroups corpus, we show that CMQS-based TC attains an accuracy comparable to that of the best-reported classifiers. We also point to the potential of fusing the results of a CMQS-based methodology with those obtained from a more traditional scheme.
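To make the idea of classifying with “non-central” quantiles concrete, the following is a minimal one-dimensional, two-class sketch. It is an illustration only and is not taken from the chapter or from [17]: the function names (`quantile`, `fit_quantile_points`, `classify`), the quantile level `q`, and the nearest-point decision rule are all assumptions made for this example. Each class is summarised by a single quantile point that lies on the side facing the other class, and a test value is assigned to the class whose point is nearer.

```python
from typing import Sequence, Tuple


def quantile(values: Sequence[float], q: float) -> float:
    """Empirical q-quantile (0 <= q <= 1) using linear interpolation."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def fit_quantile_points(
    class_a: Sequence[float], class_b: Sequence[float], q: float
) -> Tuple[float, float]:
    """Return one non-central quantile point per class (hypothetical helper).

    The class with the smaller median is represented by its upper (1 - q)
    quantile and the other class by its lower q quantile, so both points
    sit away from the class centres, towards the region between the classes.
    """
    med_a, med_b = quantile(class_a, 0.5), quantile(class_b, 0.5)
    if med_a <= med_b:
        return quantile(class_a, 1 - q), quantile(class_b, q)
    return quantile(class_a, q), quantile(class_b, 1 - q)


def classify(x: float, point_a: float, point_b: float) -> str:
    """Assign x to the class whose quantile point is nearer (ties go to A)."""
    return "A" if abs(x - point_a) <= abs(x - point_b) else "B"


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5]
    b = [7, 8, 9, 10, 11]
    point_a, point_b = fit_quantile_points(a, b, q=0.25)
    print(point_a, point_b)
    print(classify(5.5, point_a, point_b), classify(6.5, point_a, point_b))
```

A multi-class version would combine several such pair-wise decisions; how the chapter does this is not specified in the abstract beyond the use of a linear number of pair-wise classifiers, so it is left out of the sketch.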

The authors are grateful for the partial support provided by NSERC, the Natural Sciences and Engineering Research Council of Canada. A preliminary version of this paper was presented at ICCCI’15, the 2015 International Conference on Computational Collective Intelligence Technologies and Applications, in Madrid, Spain, in September 2015. The paper was a Plenary/Keynote Talk at the conference. The first author is also an Adjunct Professor with the University of Agder in Grimstad, Norway.

Notes

1. SMART is an abbreviation for Salton’s Magic Automatic Retriever of Text.
2. The formal definitions of the TF and the TFIDF are given in Sect. 4.3.
3. The static TFIDF weighting scheme presented above becomes inefficient when documents arrive continuously, as in systems used for online detection, so the literature also reports the use of the Adaptive TFIDF. The Adaptive IDF can be used efficiently for document retrieval after a sufficient number of “past” documents have been processed. The initial IDF values are calculated using a retrospective corpus of documents, and these IDF values are then updated incrementally. The literature also reports other metrics of comparison, such as the Jaccard similarity, but since these are not the primary concern of this paper, we will not elaborate on them here.
4. “Anti”-Bayesian methods have also been used to design novel Prototype Reduction Schemes (PRS) [21] and novel Border Identification (BI) algorithms [20]. The use of such “Anti”-Bayesian PRS and BI techniques in TC is extremely promising and is still unreported.
5. As mentioned earlier, the authors of [17], [9] and [18] (cited in chronological order) had initially proposed their results as being based on the Order-Statistics of the distributions. This was later corrected in [19], where they showed that their results were, rather, based on their Quantile Statistics.
6. All of the theoretical results of [17], [9] and [18] were confirmed with rigorous experimental testing. The results of [18] were also verified on real-life data sets.
7. In all the cases, they worked with the assumption that the a priori distributions were identical.
8. The documents used in this test were very short, which explains why the histograms are heavily skewed in favour of lower word frequencies.
9. Given that these extreme points give better results in the next experiment, when we classify using the TFIDF criteria (instead of merely the TF criteria), we hypothesize that this poor behavior is probably due to noise from non-significant words that is somehow amplified in the extreme CMQS points. This issue is, however, still unresolved.
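The incremental IDF update outlined in Note 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme used in the chapter: the class name `IncrementalIdf` and its methods are hypothetical, the IDF is taken in the common textbook form log(N / df) (the chapter's own definitions are in Sect. 4.3, which is not reproduced here), and terms that have never been seen are given a weight of zero.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


class IncrementalIdf:
    """IDF statistics seeded from a retrospective corpus, then updated online."""

    def __init__(self, retrospective_corpus: Iterable[List[str]] = ()) -> None:
        self.num_docs = 0
        self.doc_freq: Counter = Counter()
        # Initial IDF values come from the "past" documents.
        for tokens in retrospective_corpus:
            self.update(tokens)

    def update(self, tokens: List[str]) -> None:
        """Fold one newly arrived document into the running statistics."""
        self.num_docs += 1
        # Each distinct term counts once per document.
        self.doc_freq.update(set(tokens))

    def idf(self, term: str) -> float:
        """Assumed form: log(N / df); unseen terms get 0.0 (an assumption)."""
        df = self.doc_freq.get(term, 0)
        if df == 0:
            return 0.0
        return math.log(self.num_docs / df)

    def tfidf(self, tokens: List[str]) -> Dict[str, float]:
        """Raw term count times the current IDF, per distinct term."""
        counts = Counter(tokens)
        return {term: count * self.idf(term) for term, count in counts.items()}


if __name__ == "__main__":
    model = IncrementalIdf([["a", "b"], ["a", "c"]])
    print(model.idf("b"))      # IDF from the retrospective corpus only
    model.update(["b", "d"])   # a new document arrives
    print(model.idf("b"))      # IDF after the incremental update
```

Because only the document count and the per-term document frequencies are stored, each arriving document is absorbed without recomputing the weights of the documents seen earlier.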

References

1. Alahmadi, A., Joorabchi, A., Mahdi, A.E.: A new text representation scheme combining bag-of-words and bag-of-concepts approaches for automatic text classification. In: Proceedings of the 7th IEEE GCC Conference and Exhibition, Doha, Qatar, pp. 108–113, November 2014
2. Debole, F., Sebastiani, F.: Supervised term weighting for automated text categorization. In: Proceedings of the 18th ACM Symposium on Applied Computing, Melbourne, USA, pp. 784–788, March 2003
3. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. A Wiley Interscience Publication, New York (2006)
4. Dumoulin, J.: Smoothing of n-gram language models of human chats. In: Proceedings of the Joint 6th International Conference on Soft Computing and Intelligent Systems (SCIS) and 13th International Symposium on Advanced Intelligent Systems (ISIS), Kobe, Japan, pp. 1–4, November 2012
5. Lu, L., Liu, Y.-S.: Research of English text classification methods based on semantic meaning. In: Proceedings of the ITI 3rd International Conference on Information and Communications Technology, Cairo, Egypt, pp. 689–700, December 2005
6. Madsen, R.E., Sigurdsson, S., Hansen, L.K., Larsen, J.: Pruning the vocabulary for better context recognition. In: Proceedings of the 17th International Conference on Pattern Recognition, Cambridge, UK, vol. 2, pp. 483–488, August 2004
7. Menon, R., Keerthi, S.S., Loh, H.T., Brombacher, A.C.: On the effectiveness of latent semantic analysis for the categorization of call centre records. In: Proceedings of the IEEE International Engineering Management Conference, Singapore, vol. 2, pp. 545–550 (2004)
8. Ning, Y., Zhu, T., Wang, Y.: Affective-word based Chinese text sentiment classification. In: Proceedings of the 5th International Conference on Pervasive Computing and Applications (ICPCA), Maribor, Slovenia, pp. 111–115, December 2010
9. Oommen, B.J., Thomas, A.: Optimal order statistics-based “Anti-Bayesian” parametric pattern classification for the exponential family. Pattern Recogn. 47, 40–55 (2014)
10. Ouamour, S., Sayoud, H.: Authorship attribution of ancient texts written by ten Arabic travelers using character N-Grams. In: Proceedings of the 2013 International Conference on Computer, Information and Telecommunication Systems (CITS), Piraeus-Athens, Greece, pp. 1–5, May 2013
11. Qiang, G.: An effective algorithm for improving the performance of Naïve Bayes for text classification. In: Proceedings of the Second International Conference on Computer Research and Development, Kuala Lumpur, Malaysia, pp. 699–701, May 2010
12. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York (1983)
13. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Comm. ACM 18, 613–620 (1975)
14. Salton, G., Yang, C.S., Yu, C.: A theory of term importance in automatic text analysis. Technical report, Ithaca, NY, USA (1974)
15. Salton, G., Yang, C.S., Yu, C.: Term weighting approaches in automatic text retrieval. Technical report, Ithaca, NY, USA (1987)
16. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)
17. Thomas, A., Oommen, B.J.: The fundamental theory of optimal “Anti-Bayesian” parametric pattern classification using order statistics criteria. Pattern Recogn. 46, 376–388 (2013)
18. Thomas, A., Oommen, B.J.: Order statistics-based parametric classification for multi-dimensional distributions. Pattern Recogn. 46, 3472–3482 (2013)
19. Thomas, A., Oommen, B.J.: Corrigendum to three papers that deal with “Anti”-Bayesian pattern recognition. Pattern Recogn. 47, 2301–2302 (2014)
20. Thomas, A., Oommen, B.J.: A novel border identification algorithm based on an “Anti-Bayesian” paradigm. In: Proceedings of CAIP’13, the 2013 International Conference on Computer Analysis of Images and Patterns, York, UK, pp. 196–203, August 2013
21. Thomas, A., Oommen, B.J.: Ultimate order statistics-based prototype reduction schemes. In: Proceedings of AI 2013, The 2013 Australasian Joint Conference on Artificial Intelligence, Dunedin, New Zealand, pp. 421–433, December 2013
22. Wu, G., Liu, K.: Research on text classification algorithm by combining statistical and ontology methods. In: Proceedings of the International Conference on Computational Intelligence and Software Engineering, Wuhan, China, pp. 1–4, December 2009

Author information

Authors and Affiliations

School of Computer Science, Carleton University, Ottawa, K1S 5B6, Canada
B. John Oommen
Department of Computer Science and Software Engineering, Laval University, Quebec City, G1V 0A6, Canada
Richard Khoury
Department of Software Engineering, Lakehead University, Thunder Bay, P7B 5E1, Canada
Aron Schmidt

Corresponding author

Correspondence to B. John Oommen.

Editor information

Editors and Affiliations

Department of Information Systems, Wrocław University of Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Swinburne University of Technology, Hawthorn, Victoria, Australia
Ryszard Kowalczyk
Gdansk School of Banking (WSB Gdańsk), Gdańsk, Poland
Cezary Orłowski
Gdansk School of Banking (WSB Gdańsk), Gdańsk, Poland
Artur Ziółkowski

About this chapter

Cite this chapter

Oommen, B.J., Khoury, R., Schmidt, A. (2016). Text Classification Using “Anti”-Bayesian Quantile Statistics-Based Classifiers. In: Nguyen, N., Kowalczyk, R., Orłowski, C., Ziółkowski, A. (eds) Transactions on Computational Collective Intelligence XXV. Lecture Notes in Computer Science, vol 9990. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53580-6_7

DOI: https://doi.org/10.1007/978-3-662-53580-6_7
Published: 28 September 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-53579-0
Online ISBN: 978-3-662-53580-6
eBook Packages: Computer Science, Computer Science (R0)
