An Improvement for Naive Bayes Text Classification Applied to Online Imbalanced Crowdsourced Corpuses

Swezey, Robin M. E.; Shiramatsu, Shun; Ozono, Tadachika; Shintani, Toramatsu

doi:10.1007/978-3-642-30732-4_19

Conference paper

Part of the book series: Studies in Computational Intelligence (SCI, volume 431)

Abstract

To take advantage of public corpuses such as Wikipedia for classification by hierarchically structured topics with a large number of classes, we propose an improvement to Naive Bayes-based text classification algorithms, which we call Semantically-Aware Hierarchical Balancing (SAHB). SAHB addresses two issues that arise in this use case and its real-world applications: the large number of topic labels to classify against, and the lack of balance in the hierarchy of the corpuses. This meta-algorithm achieves better accuracy, and log-time complexity, compared with straightforward Naive Bayes text classification methods or specific document weighting techniques, while taking equivalent time to train. This makes it more efficient and scalable for processing and classifying big data.
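The abstract does not spell out how SAHB works, so the following is only a minimal sketch of the general idea it builds on: generic top-down hierarchical Naive Bayes classification, not the authors' algorithm. It assumes multinomial Naive Bayes with add-one smoothing and a toy topic tree; the class names `NaiveBayes` and `HierarchicalNB`, the `children` mapping, and the sample topics are all hypothetical. The balancing step that gives SAHB its name is not reproduced here.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training docs
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def train(self, tokens, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class HierarchicalNB:
    """One NaiveBayes model per internal node; classification descends from the root."""

    def __init__(self, children):
        # children: hypothetical dict mapping a topic to its list of subtopics
        self.children = children
        self.parent = {c: p for p, cs in children.items() for c in cs}
        self.models = {node: NaiveBayes() for node in children}

    def train(self, tokens, leaf):
        # A document labelled with a leaf topic becomes a training example
        # for the classifier at each ancestor on the path to the root.
        node = leaf
        while node in self.parent:
            parent = self.parent[node]
            self.models[parent].train(tokens, node)
            node = parent

    def classify(self, tokens, root="root"):
        node, path = root, []
        while node in self.children:
            node = self.models[node].classify(tokens)
            if node is None:  # untrained node: stop descending
                break
            path.append(node)
        return path


# Toy topic tree and corpus (illustrative data only).
tree = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["soccer", "tennis"],
}
clf = HierarchicalNB(tree)
clf.train("quantum particle energy".split(), "physics")
clf.train("cell gene protein".split(), "biology")
clf.train("goal striker penalty".split(), "soccer")
clf.train("racket serve volley".split(), "tennis")

print(clf.classify("gene protein cell".split()))  # ['science', 'biology']
```

In a balanced tree with branching factor b and L leaf topics, such a descent scores roughly b · log_b(L) candidate labels instead of all L, which is one plausible reading of the log-time claim. In an unbalanced hierarchy that bound no longer holds, which is the kind of imbalance the abstract says SAHB targets.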

Author information

Authors and Affiliations

Graduate School of Engineering, Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, Aichi, 466-8555, Japan
Robin M. E. Swezey, Shun Shiramatsu, Tadachika Ozono & Toramatsu Shintani

Corresponding author

Correspondence to Robin M. E. Swezey.

Editor information

Editors and Affiliations

Department of Computer Science, University of Massachusetts Boston, 100 Morrissey Blvd., Boston, 02125-3393, USA
Wei Ding
School of Software, Dalian University of Technology, Dalian, 116621, People's Republic of China
He Jiang
Department of Computer Science, Texas State University San Marcos, University Drive 601, San Marcos, 78666-4616, USA
Moonis Ali
School of Software, Dalian University of Technology, Dalian, 116621, People's Republic of China
Mingchu Li

About this paper

Cite this paper

Swezey, R.M.E., Shiramatsu, S., Ozono, T., Shintani, T. (2012). An Improvement for Naive Bayes Text Classification Applied to Online Imbalanced Crowdsourced Corpuses. In: Ding, W., Jiang, H., Ali, M., Li, M. (eds) Modern Advances in Intelligent Systems and Tools. Studies in Computational Intelligence, vol 431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30732-4_19

DOI: https://doi.org/10.1007/978-3-642-30732-4_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30731-7
Online ISBN: 978-3-642-30732-4
eBook Packages: Engineering, Engineering (R0)
