Improved Term Weighting Factors for Keyword Extraction in Hierarchical Category Structure and Thai Text Classification

  • Conference paper
Advances in Intelligent Informatics, Smart Technology and Natural Language Processing (iSAI-NLP 2017)

Abstract

Keyword extraction from complex hierarchical categories is a challenge in text mining, since classification methods commonly used for flat categories yield low accuracy. This paper presents a method to improve keyword extraction from hierarchical categories by using terms that occur in hierarchically related categories as additional factors in term weighting. The method is an enhancement of the basic TF-IDF calculation; thus, it can readily be used for keyword extraction and classification. By taking the term frequency and inverse document frequency of categories hierarchically related to a focused category, we can determine how important terms are within their family categories. In this work, the hierarchy relations used in the calculation are sub-categories, super-categories, and sibling categories. Experimental results show that the proposed method achieved about 40% higher accuracy than a baseline in a classification task.
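The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each category is represented by a list of already segmented tokens, that the hierarchy is a child-to-parent map, and that the family factor is simply added to the basic TF-IDF score with a weight `alpha`. The function names, the `alpha` parameter, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, target, docs):
    """Basic TF-IDF of `term` in docs[target]; each entry of `docs` is one document."""
    tokens = docs[target]
    tf = Counter(tokens)[term] / len(tokens) if tokens else 0.0
    df = sum(1 for other in docs.values() if term in other)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf


def related(category, parent):
    """Return (sub-, super-, sibling-categories) from a child -> parent map."""
    subs = [c for c, p in parent.items() if p == category]
    sup = parent.get(category)
    supers = [sup] if sup is not None else []
    sibs = [c for c, p in parent.items()
            if sup is not None and p == sup and c != category]
    return subs, supers, sibs


def hierarchical_weight(term, category, docs, parent, alpha=0.5):
    """Assumed combination: basic TF-IDF plus alpha times TF-IDF within the family."""
    base = tf_idf(term, category, docs)
    subs, supers, sibs = related(category, parent)
    # The "family" corpus holds the focused category and its related categories.
    family = {c: docs[c] for c in [category, *subs, *supers, *sibs] if c in docs}
    family_score = tf_idf(term, category, family) if len(family) > 1 else 0.0
    return base + alpha * family_score


# Hypothetical toy hierarchy: "tax" and "trade" are children of "economy".
parent = {"economy": None, "tax": "economy", "trade": "economy", "health": None}
docs = {
    "economy": ["budget", "tax", "growth"],
    "tax": ["tax", "vat", "income", "tax"],
    "trade": ["export", "import", "tax"],
    "health": ["hospital", "doctor", "budget"],
}

for term in ("vat", "tax"):
    print(term, hierarchical_weight(term, "tax", docs, parent))
```

In this toy data the term "tax" appears in every member of the family of the "tax" category, so its family factor is zero, while "vat" appears only in the focused category and is ranked higher as a keyword. For Thai text, the token lists would come from a word segmenter, which is outside the scope of this sketch.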

Acknowledgement

The author would like to thank the National Reform Council for providing comment data from the Thai Reform website.

Author information

Corresponding author

Correspondence to Boonthida Chiraratanasopha.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Chiraratanasopha, B., Theeramunkong, T., Boonbrahm, S. (2019). Improved Term Weighting Factors for Keyword Extraction in Hierarchical Category Structure and Thai Text Classification. In: Theeramunkong, T., et al. Advances in Intelligent Informatics, Smart Technology and Natural Language Processing. iSAI-NLP 2017. Advances in Intelligent Systems and Computing, vol 807. Springer, Cham. https://doi.org/10.1007/978-3-319-94703-7_6
