Effectively Classifying Short Texts via Improved Lexical Category and Semantic Features
Classifying short texts is challenging because they are severely sparse and high-dimensional. In this paper, we propose a novel approach to classifying short texts based on both lexical and semantic features. First, a term dictionary is constructed by selecting lexical features, namely the words most representative of each category. Then, the optimal topic distribution is extracted from a background knowledge repository via Latent Dirichlet Allocation. A new feature representation for each short text is then constructed from these lexical and semantic features. Experimental results show that our method significantly improves the quality of short text classification.
Keywords: Short text classification; Latent Dirichlet allocation; Lexical features; Semantic features; Optimal topic distribution
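The pipeline outlined in the abstract (select category-representative words, obtain a topic distribution, combine the two into one feature vector) can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the paper: the scoring rule used to rank words per category, the `top_k` cutoff, the function names, and the toy data are all hypothetical. The topic distribution is passed in as hand-written numbers standing in for the output of an LDA model trained on a background corpus, so no LDA inference is performed here.

```python
from collections import Counter, defaultdict


def build_term_dictionary(docs, labels, top_k=2):
    """Pick, for each category, the top_k words most concentrated in it.

    Scoring rule (an assumption, not the paper's criterion):
    score(w, c) = count(w in c)^2 / count(w in all categories).
    """
    per_category = defaultdict(Counter)
    total = Counter()
    for tokens, label in zip(docs, labels):
        per_category[label].update(tokens)
        total.update(tokens)

    selected = set()
    for counts in per_category.values():
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(counts, key=lambda w: (-(counts[w] ** 2) / total[w], w))
        selected.update(ranked[:top_k])
    return sorted(selected)


def combine_features(tokens, term_dictionary, topic_distribution):
    """Concatenate lexical counts over the dictionary with topic proportions."""
    counts = Counter(tokens)
    lexical = [float(counts[w]) for w in term_dictionary]
    return lexical + list(topic_distribution)


# Toy labelled short texts (hypothetical data).
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "price"],
    ["market", "stock", "bank"],
]
labels = ["sports", "sports", "finance", "finance"]

term_dictionary = build_term_dictionary(docs, labels, top_k=2)

# Placeholder topic proportions; in practice these would be inferred by LDA.
assumed_topics = [0.9, 0.1]
features = combine_features(["goal", "team", "goal"], term_dictionary, assumed_topics)

print(term_dictionary)
print(features)
```

The resulting vector could be passed to any standard classifier. Simple concatenation is only one way to merge the two feature types, chosen here for brevity.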
This work is supported by the National Natural Science Foundation of China (No. 61363058), the Youth Science and Technology Support Program of Gansu Province (145RJZA232, 145RJYA259), the 2016 undergraduate innovation capacity enhancement program, and the 2016 annual public record open space Fund Project 1505JTCA007.