Abstract
To address the problem of effectively classifying and managing massive text collections, a new classification method for large-scale Web text is proposed. The core idea builds on a characteristic of the current network environment: long texts are few but carry a high proportion of valuable content, whereas short texts are numerous but carry a low proportion. First, a feature selection method based on complex networks is proposed. The number of features it yields is more stable, and feature selection on large text collections becomes more accurate. Second, a text classification method based on density statistical merging is proposed, which approaches classification from the perspective of data sampling. During classification, the method uses not only the density information of the text feature set but also the differences between individual text features, obtained through the statistical merging criterion. As a result, the algorithm is more robust to noise and classifies large text sets more effectively.
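The abstract does not say how the complex network is built or how its nodes are scored, so the following is only an illustrative sketch of the general idea of network-based feature selection, not the authors' algorithm. It assumes a word co-occurrence network (words are nodes, and two words are linked when they appear within a small window of each other) and ranks candidate features by weighted degree. The function names, the `window` parameter, and the scoring rule are all hypothetical choices made for this example.

```python
from collections import Counter


def build_cooccurrence_graph(docs, window=2):
    """Build a weighted word co-occurrence network from tokenized documents.

    Assumption for this sketch: two distinct words are linked when they occur
    within `window` positions of each other, and the edge weight counts how
    often that happens. Returns {word: Counter({neighbor: weight})}.
    """
    graph = {}
    for tokens in docs:
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                if other == word:
                    continue  # ignore self-loops
                graph.setdefault(word, Counter())[other] += 1
                graph.setdefault(other, Counter())[word] += 1
    return graph


def select_features(docs, k=3, window=2):
    """Return the k words with the highest weighted degree in the network.

    Weighted degree is an assumed stand-in for whatever node importance
    measure the paper uses. Ties are broken alphabetically so the result
    is deterministic.
    """
    graph = build_cooccurrence_graph(docs, window)
    strength = {word: sum(nbrs.values()) for word, nbrs in graph.items()}
    ranked = sorted(strength, key=lambda w: (-strength[w], w))
    return ranked[:k]


if __name__ == "__main__":
    sample_docs = [
        ["text", "classification", "method"],
        ["text", "feature", "selection"],
        ["text", "classification", "feature"],
    ]
    print(select_features(sample_docs, k=2, window=1))
```

Ranking by a network property rather than by raw term frequency is one way such a method could favor words that are structurally central in the corpus. Whether this matches the stability property claimed in the abstract cannot be determined from the abstract alone.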
Acknowledgments
This work was partially supported by the Education Department of Jilin Province science and technology research project “13th Five-Year” Kyrgyzstan UNESCO Zi [2016] No. 159th.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, R., Wang, S. (2019). Large Scale Text Categorization Based on Density Statistics Merging. In: Xhafa, F., Patnaik, S., Tavana, M. (eds) Advances in Intelligent, Interactive Systems and Applications. IISA 2018. Advances in Intelligent Systems and Computing, vol 885. Springer, Cham. https://doi.org/10.1007/978-3-030-02804-6_43
DOI: https://doi.org/10.1007/978-3-030-02804-6_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02803-9
Online ISBN: 978-3-030-02804-6
eBook Packages: Intelligent Technologies and Robotics (R0)