An Improvement for Naive Bayes Text Classification Applied to Online Imbalanced Crowdsourced Corpuses

  • Conference paper

Part of the book series: Studies in Computational Intelligence (SCI, volume 431)

Abstract

To take advantage of public corpuses such as Wikipedia for classification into hierarchically structured topics with a large number of classes, we propose an improvement to Naive Bayes text classification algorithms, which we call Semantically-Aware Hierarchical Balancing (SAHB). SAHB addresses two issues that arise in this use case, which has real-world applications: the large number of topic labels to classify against, and the lack of balance in the hierarchy of the corpuses. This meta-algorithm achieves better accuracy than straightforward Naive Bayes text classification methods or specific document weighting techniques and runs with log-time complexity, while taking equivalent time to train, which makes it more efficient and scalable for processing and classifying big data.
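As a rough illustration of why a topic hierarchy can keep classification cost low, the sketch below applies a small multinomial Naive Bayes classifier at each node of a topic tree and descends one branch at a time. It is not the SAHB algorithm itself, whose balancing step is not described in this abstract; it only shows generic top-down hierarchical classification. The class names `NaiveBayes` and `HierarchicalNB`, the toy documents, and the topic paths are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per label
        self.word_counts = defaultdict(Counter)   # token counts per label
        self.vocab = set()

    def train(self, tokens, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class HierarchicalNB:
    """One NaiveBayes per internal node, keyed by the path leading to it."""

    def __init__(self):
        self.nodes = defaultdict(NaiveBayes)

    def train(self, tokens, path):
        # Each ancestor node learns which of its children the document belongs to.
        for depth, child in enumerate(path):
            self.nodes[tuple(path[:depth])].train(tokens, child)

    def predict(self, tokens):
        # Follow a single branch from the root; sibling subtrees are never scored.
        path = ()
        while path in self.nodes:
            path += (self.nodes[path].predict(tokens),)
        return path


# Hypothetical toy corpus: (tokens, topic path from root to leaf).
training = [
    (["quantum", "particle", "energy"], ("science", "physics")),
    (["cell", "gene", "protein"], ("science", "biology")),
    (["goal", "match", "league"], ("sports", "football")),
    (["serve", "match", "racket"], ("sports", "tennis")),
]

clf = HierarchicalNB()
for tokens, path in training:
    clf.train(tokens, path)

print(clf.predict(["gene", "cell"]))
print(clf.predict(["racket", "serve"]))
```

In this sketch a prediction scores only the children of the nodes along one root-to-leaf path. For a roughly balanced tree with a bounded branching factor, that number grows with the depth of the tree rather than with the total number of leaf topics. An unbalanced hierarchy loses that property, which is the kind of situation the abstract says SAHB is designed to address.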

Corresponding author

Correspondence to Robin M. E. Swezey.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Swezey, R.M.E., Shiramatsu, S., Ozono, T., Shintani, T. (2012). An Improvement for Naive Bayes Text Classification Applied to Online Imbalanced Crowdsourced Corpuses. In: Ding, W., Jiang, H., Ali, M., Li, M. (eds) Modern Advances in Intelligent Systems and Tools. Studies in Computational Intelligence, vol 431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30732-4_19

  • DOI: https://doi.org/10.1007/978-3-642-30732-4_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-30731-7

  • Online ISBN: 978-3-642-30732-4

  • eBook Packages: Engineering, Engineering (R0)
