Compact Representation of Documents Using Terms and Termsets
This study addresses the computation of compact document vectors that use both terms and termsets for binary text categorization. Termsets are generally concatenated with all terms, which leads to large document vectors. Here, the selection of a subset of terms and termsets is considered, with the aim of a compact yet effective document representation. Two methods are studied for this purpose. The first evaluates combinations of terms and termsets in different proportions. The second employs normalized ranking scores of terms and termsets for subset selection. Experiments on two widely used datasets show that termsets can effectively complement terms even when only a small number of features is used to represent documents.
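The two selection strategies can be illustrated with a minimal sketch. It is not the implementation used in the study: the function names, the `&`-joined termset labels, the toy scores, and the choice of min-max normalization are all assumptions made for illustration, since the abstract does not specify the ranking function or the normalization.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring features (ties broken by name)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def select_by_proportion(term_scores: Dict[str, float],
                         termset_scores: Dict[str, float],
                         n_features: int,
                         termset_fraction: float) -> List[str]:
    """Strategy 1 (sketch): fix the share of termsets in the final vector."""
    n_termsets = round(n_features * termset_fraction)
    n_terms = n_features - n_termsets
    return top_k(term_scores, n_terms) + top_k(termset_scores, n_termsets)


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; min-max is an assumed normalization."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {name: ((s - lo) / span if span > 0 else 1.0)
            for name, s in scores.items()}


def select_by_normalized_score(term_scores: Dict[str, float],
                               termset_scores: Dict[str, float],
                               n_features: int) -> List[str]:
    """Strategy 2 (sketch): normalize each ranking, pool, keep the best."""
    pooled = {**min_max(term_scores), **min_max(termset_scores)}
    return top_k(pooled, n_features)


# Hypothetical ranking scores on two different scales.
term_scores = {"stock": 8.0, "price": 6.0, "market": 4.0, "the": 0.0}
termset_scores = {"stock&price": 3.0, "market&price": 2.0, "stock&the": 1.0}

by_proportion = select_by_proportion(term_scores, termset_scores, 4, 0.5)
by_normalized = select_by_normalized_score(term_scores, termset_scores, 3)
print(by_proportion)
print(by_normalized)
```

Normalizing each ranking separately before pooling keeps features scored on the larger raw scale from crowding out the others, whereas the proportion-based variant fixes the mix in advance and ranks each group on its own.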