Abstract
To take advantage of public corpuses such as Wikipedia for classifying text into hierarchically structured topics with a large number of classes, we propose an improvement to Naive Bayes text classification algorithms, which we call Semantically-Aware Hierarchical Balancing (SAHB). SAHB addresses two issues that arise in this use case and its real-world applications: the large number of topic labels to classify against, and the lack of balance in the hierarchy of the corpuses. This meta-algorithm achieves better accuracy, with log-time complexity, than straightforward Naive Bayes text classification methods or specific document weighting techniques, while taking equivalent time to train, which makes it more efficient and scalable enough to process and classify big data.
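The log-time claim can be illustrated with a minimal sketch of generic top-down hierarchical Naive Bayes classification. This is not the SAHB algorithm itself, whose balancing step the abstract does not detail; the class names `NaiveBayesNode` and `TopicTree`, the Laplace smoothing, and the toy topics are all assumptions made for illustration. The point is only that a document is compared against the children of one node per level, so in a reasonably balanced tree the number of comparisons grows with the depth of the hierarchy rather than with the total number of leaf topics.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesNode:
    """Multinomial Naive Bayes over the children of one node in a topic tree."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # child label -> word counts
        self.doc_counts = Counter()              # child label -> number of documents
        self.vocab = set()

    def train(self, label, tokens):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            # Laplace (add-one) smoothing; an assumption of this sketch.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TopicTree:
    """One NaiveBayesNode per internal node, keyed by the path leading to it."""

    def __init__(self):
        self.nodes = defaultdict(NaiveBayesNode)

    def train(self, path, tokens):
        # Each document trains the classifier at every ancestor of its leaf topic.
        for depth in range(len(path)):
            self.nodes[tuple(path[:depth])].train(path[depth], tokens)

    def classify(self, tokens):
        # Descend one level at a time; only siblings are compared at each step.
        path = ()
        while path in self.nodes:
            path += (self.nodes[path].classify(tokens),)
        return path


# Hypothetical toy hierarchy, for illustration only.
tree = TopicTree()
tree.train(("Science", "Physics"), "quantum particle energy".split())
tree.train(("Science", "Biology"), "cell gene protein".split())
tree.train(("Arts", "Music"), "guitar melody chord".split())
tree.train(("Arts", "Painting"), "canvas brush color".split())

print(tree.classify("gene protein".split()))
```

In this toy tree, classifying a document takes two comparisons at the root and two at the chosen child, instead of four comparisons against every leaf topic. The saving grows with the size of the hierarchy, but only if the tree is not badly unbalanced, which is the problem the abstract says SAHB targets.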
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Swezey, R.M.E., Shiramatsu, S., Ozono, T., Shintani, T. (2012). An Improvement for Naive Bayes Text Classification Applied to Online Imbalanced Crowdsourced Corpuses. In: Ding, W., Jiang, H., Ali, M., Li, M. (eds) Modern Advances in Intelligent Systems and Tools. Studies in Computational Intelligence, vol 431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30732-4_19
DOI: https://doi.org/10.1007/978-3-642-30732-4_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30731-7
Online ISBN: 978-3-642-30732-4
eBook Packages: Engineering, Engineering (R0)