
BoWT: A Hybrid Text Representation Model for Improving Text Categorization Based on AdaBoost.MH

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10053)

Abstract

Text representation is a fundamental task in text categorization systems. Bag-of-Words (BoW) is a typical model that represents texts as vectors of single words. Although it is a simple representation model, BoW has been criticized for disregarding the relationships between words. Alternatively, the Latent Dirichlet Allocation (LDA) topic model has been proposed to represent texts as a Bag-of-Topics (BoT). In LDA, the words in the corpus are statistically grouped into a small number of themes called “latent topics”, which capture the semantic relationships between the words. Thus, representing documents using BoT dramatically reduces training time and also improves classification performance. However, BoT has been shown to be ineffective for imbalanced datasets. Accordingly, this paper presents BoWT, a hybrid text representation model that combines BoW and BoT. In BoWT, the highly weighted BoW features are merged with the BoT features to produce a new feature space. BoWT is evaluated for multi-label text categorization using the well-known boosting algorithm AdaBoost.MH. Experimental results on four benchmarks demonstrate that the BoWT representation model notably outperforms both BoW and BoT and dramatically improves the classification performance of AdaBoost.MH for text categorization.
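The merge step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which weighting scheme ranks the BoW features or how many are kept, so the sketch assumes the weights are already computed (for example tf-idf), ranks features by their total weight over the corpus, and takes the document-topic proportions from an LDA model as given. The function names `top_weighted_features` and `merge_bowt`, the parameter `k`, and the toy matrices are all hypothetical.

```python
from typing import List


def top_weighted_features(bow: List[List[float]], k: int) -> List[int]:
    """Return the column indices of the k BoW features with the largest
    total weight across all documents (assumed ranking criterion)."""
    n_features = len(bow[0])
    totals = [sum(row[j] for row in bow) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: totals[j], reverse=True)
    return sorted(ranked[:k])  # restore original column order


def merge_bowt(bow: List[List[float]], bot: List[List[float]], k: int) -> List[List[float]]:
    """Concatenate the k highest-weighted BoW columns with all BoT columns,
    giving one hybrid feature vector per document."""
    keep = top_weighted_features(bow, k)
    return [[row[j] for j in keep] + list(topics) for row, topics in zip(bow, bot)]


# Toy data: 3 documents, 4 weighted word features, 2 topic proportions.
bow = [
    [0.9, 0.0, 0.1, 0.4],
    [0.8, 0.1, 0.0, 0.5],
    [0.0, 0.2, 0.1, 0.6],
]
bot = [
    [0.7, 0.3],
    [0.6, 0.4],
    [0.1, 0.9],
]

bowt = merge_bowt(bow, bot, k=2)
for vector in bowt:
    print(vector)
```

Each row of `bowt` holds the two retained word weights followed by the two topic proportions, so the hybrid space has `k` plus number-of-topics dimensions. Such vectors could then be passed to any multi-label learner; the paper uses AdaBoost.MH for that stage.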



Author information


Corresponding author

Correspondence to Bassam Al-Salemi.


Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Al-Salemi, B., Juzaiddin Ab Aziz, M., Noah, S.A.M. (2016). BoWT: A Hybrid Text Representation Model for Improving Text Categorization Based on AdaBoost.MH. In: Sombattheera, C., Stolzenburg, F., Lin, F., Nayak, A. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2016. Lecture Notes in Computer Science, vol 10053. Springer, Cham. https://doi.org/10.1007/978-3-319-49397-8_1


  • DOI: https://doi.org/10.1007/978-3-319-49397-8_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49396-1

  • Online ISBN: 978-3-319-49397-8

  • eBook Packages: Computer Science, Computer Science (R0)
