Abstract
In recent years, automatic document classification has seen extensive research and remarkable progress, and it has gradually become a research focus in information retrieval and data mining. These studies have achieved some success, but they remain severely limited in handling the abundant features of documents. The problem is worse in big data environments, where the number of documents is huge. To address these challenges, we propose a fast method called Simhash-based document classification (SHDC). In this method, we first compress the large feature set of each document into a fixed number of dimensions to reduce computation, and then generate the features of each category from the documents belonging to it. Finally, we parallelize the most computationally expensive step, the feature extraction of documents and categories, using Apache Spark. We give a theoretical analysis of SHDC and conduct a series of experiments on a real-world dataset to evaluate its feasibility and performance. The results show that SHDC outperforms other methods in classification precision and time efficiency. SHDC also scales well as the number of computation nodes increases.
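As a rough illustration of the idea summarized in the abstract, the following Python sketch compresses a document's token weights into a fixed-width Simhash fingerprint, builds a category fingerprint from the documents of that category, and assigns a new document to the category at the smallest Hamming distance. It is a minimal sketch under stated assumptions, not the paper's implementation: the 64-bit width, the MD5-based token hash, raw term frequency as the weight, the summing of per-document vectors for a category, and all function names are hypothetical choices made here. The Spark parallelization step is not shown.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width (assumed; the abstract only says "a certain dimension")

def token_hash(token: str) -> int:
    """Deterministic 64-bit hash of a token (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def signed_vector(weights: dict) -> list:
    """Add +w for each set bit and -w for each clear bit of every token's hash."""
    vec = [0.0] * BITS
    for token, w in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            vec[i] += w if (h >> i) & 1 else -w
    return vec

def fingerprint(vec: list) -> int:
    """Collapse the signed vector to bits: positive component -> 1, else 0."""
    fp = 0
    for i, v in enumerate(vec):
        if v > 0:
            fp |= 1 << i
    return fp

def term_weights(text: str) -> Counter:
    """Term frequency as the feature weight (assumption; TF-IDF would also fit)."""
    return Counter(text.lower().split())

def simhash(text: str) -> int:
    return fingerprint(signed_vector(term_weights(text)))

def category_fingerprint(docs: list) -> int:
    """Sum the signed vectors of a category's documents, then take the sign bits."""
    total = [0.0] * BITS
    for doc in docs:
        vec = signed_vector(term_weights(doc))
        total = [a + b for a, b in zip(total, vec)]
    return fingerprint(total)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def classify(text: str, categories: dict) -> str:
    """Return the category whose fingerprint is closest in Hamming distance."""
    fp = simhash(text)
    return min(categories, key=lambda name: hamming(fp, categories[name]))

if __name__ == "__main__":
    cats = {
        "sports": category_fingerprint(["the team won the match", "a great goal in the match"]),
        "finance": category_fingerprint(["the stock market fell", "bank shares and the market"]),
    }
    print(classify("the team scored a goal", cats))
```

Because every document and category is reduced to one fixed-width integer, comparing a document against a category costs a single XOR and a bit count, regardless of vocabulary size.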
Acknowledgments
This work is supported by the National High Technology Research and Development Program (863 Program) of China under grant No. 2013AA013503, the National Science Foundation of China under grants No. 61472080 and No. 61272532, the Consulting Project of Chinese Academy of Engineering under grant 2015-XY-04, and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Gu, L., Yang, P., Dong, Y. (2015). SHDC: A Fast Documents Classification Method Based on Simhash. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_14
DOI: https://doi.org/10.1007/978-3-319-27122-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27121-7
Online ISBN: 978-3-319-27122-4
eBook Packages: Computer Science, Computer Science (R0)