Compact Representation of Documents Using Terms and Termsets

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10934)

Abstract

This study addresses the computation of compact document vectors that use both terms and termsets for binary text categorization. Termsets are generally concatenated with all terms, which leads to large document vectors. The study instead considers selecting a subset of terms and termsets that represents documents compactly yet effectively. Two methods are studied for this purpose. The first evaluates combinations of terms and termsets in different proportions. The second uses normalized ranking scores of terms and termsets for subset selection. Experiments on two widely used datasets show that termsets can effectively complement terms even when only a small number of features are used to represent documents.
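
The two selection strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example scores, the "a+b" naming of termsets, and the choice of max-normalization are all assumptions made for illustration, since the abstract does not specify the scoring function or the normalization used.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring feature names (ties broken by name)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def select_by_proportion(term_scores: Dict[str, float],
                         termset_scores: Dict[str, float],
                         n_features: int,
                         term_fraction: float) -> List[str]:
    """Fill a fixed share of the feature budget with terms, the rest with termsets.

    Terms and termsets are ranked separately, each by its own scores.
    """
    n_terms = round(n_features * term_fraction)
    n_termsets = n_features - n_terms
    return top_k(term_scores, n_terms) + top_k(termset_scores, n_termsets)


def max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Divide every score by the largest one (assumed normalization)."""
    if not scores:
        return {}
    peak = max(scores.values())
    if peak <= 0:
        return dict(scores)
    return {name: s / peak for name, s in scores.items()}


def select_by_normalized_score(term_scores: Dict[str, float],
                               termset_scores: Dict[str, float],
                               n_features: int) -> List[str]:
    """Normalize both score lists, pool them, and keep the best n_features.

    Assumes term names and termset names do not collide.
    """
    pooled = {**max_normalize(term_scores), **max_normalize(termset_scores)}
    return top_k(pooled, n_features)


# Hypothetical ranking scores on two different scales.
term_scores = {"oil": 8.0, "price": 6.0, "barrel": 2.0}
termset_scores = {"oil+price": 0.9, "crude+barrel": 0.6, "opec+quota": 0.3}

by_proportion = select_by_proportion(term_scores, termset_scores, 4, 0.5)
by_score = select_by_normalized_score(term_scores, termset_scores, 4)
```

In this toy example both strategies keep the same four features, but the second one interleaves terms and termsets according to their normalized scores rather than fixing the share of each type in advance.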

Author information

Corresponding author

Correspondence to Hakan Altınçay.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Badawi, D., Altınçay, H. (2018). Compact Representation of Documents Using Terms and Termsets. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2018. Lecture Notes in Computer Science, vol 10934. Springer, Cham. https://doi.org/10.1007/978-3-319-96136-1_7

  • DOI: https://doi.org/10.1007/978-3-319-96136-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-96135-4

  • Online ISBN: 978-3-319-96136-1

  • eBook Packages: Computer Science, Computer Science (R0)
