Abstract
Classifying short text is challenging because of its severe sparseness and high dimensionality. In this paper, we propose a novel approach that classifies short texts using both lexical and semantic features. First, a term dictionary is constructed by selecting, as lexical features, the words most representative of each category. Next, the optimal topic distribution is extracted from a background knowledge repository via Latent Dirichlet Allocation. A new feature representation for short text is then constructed. Experimental results show that our method significantly improves the quality of short text classification.
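The following is a minimal sketch of how lexical and topic features might be combined into one vector, not the authors' implementation. The abstract does not give the selection criterion or the inference procedure, so everything here is an assumption. The function names `build_term_dictionary`, `topic_distribution`, and `combined_features` are hypothetical. The document-frequency score is a simple stand-in for the paper's feature selection. The `word_topic` table stands in for a topic model trained on a background corpus, and averaging its rows is a crude substitute for real Latent Dirichlet Allocation inference.

```python
from collections import Counter, defaultdict


def build_term_dictionary(docs, labels, top_k=1):
    """Pick, per category, the top_k words whose in-category document
    frequency most exceeds their out-of-category document frequency.
    (Assumed scoring rule; the paper's actual criterion may differ.)"""
    n_in = Counter(labels)
    df_in = defaultdict(Counter)  # category -> word -> docs containing it
    df_all = Counter()            # word -> docs containing it (any category)
    for tokens, label in zip(docs, labels):
        for w in set(tokens):
            df_in[label][w] += 1
            df_all[w] += 1
    total = len(docs)
    dictionary = set()
    for cat, counts in df_in.items():
        n_out = total - n_in[cat]

        def score(w):
            inside = counts[w] / n_in[cat]
            outside = (df_all[w] - counts[w]) / n_out if n_out else 0.0
            return inside - outside

        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(counts, key=lambda w: (-score(w), w))
        dictionary.update(ranked[:top_k])
    return sorted(dictionary)


def topic_distribution(tokens, word_topic, n_topics):
    """Crude stand-in for LDA inference: sum the topic weights of the
    known words and normalise; fall back to uniform if none are known."""
    acc = [0.0] * n_topics
    for w in tokens:
        for k, p in enumerate(word_topic.get(w, ())):
            acc[k] += p
    s = sum(acc)
    if s == 0:
        return [1.0 / n_topics] * n_topics
    return [a / s for a in acc]


def combined_features(tokens, dictionary, word_topic, n_topics):
    """Concatenate binary lexical indicators with the topic distribution."""
    present = set(tokens)
    lexical = [1.0 if w in present else 0.0 for w in dictionary]
    return lexical + topic_distribution(tokens, word_topic, n_topics)


# Toy data, invented purely for illustration.
docs = [["cheap", "flight", "deal"], ["flight", "delay", "airport"],
        ["goal", "match", "win"], ["match", "score", "goal"]]
labels = ["travel", "travel", "sport", "sport"]
word_topic = {"flight": [0.9, 0.1], "goal": [0.2, 0.8], "airport": [0.8, 0.2]}

dictionary = build_term_dictionary(docs, labels, top_k=1)
features = combined_features(["flight", "airport"], dictionary, word_topic, 2)
print(dictionary, features)
```

The resulting vector could be passed to any standard classifier. The point of the concatenation is that the topic part can still carry signal when a short text contains none of the dictionary words.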
Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. 61363058), the Youth Science and Technology Support Program of Gansu Province (145RJZA232, 145RJYA259), the 2016 Undergraduate Innovation Capacity Enhancement Program, and the 2016 annual public record open space Fund Project 1505JTCA007.
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ma, H., Zhou, R., Liu, F., Lu, X. (2016). Effectively Classifying Short Texts via Improved Lexical Category and Semantic Features. In: Huang, DS., Bevilacqua, V., Premaratne, P. (eds) Intelligent Computing Theories and Application. ICIC 2016. Lecture Notes in Computer Science, vol 9771. Springer, Cham. https://doi.org/10.1007/978-3-319-42291-6_16
DOI: https://doi.org/10.1007/978-3-319-42291-6_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42290-9
Online ISBN: 978-3-319-42291-6
eBook Packages: Computer Science (R0)