SHDC: A Fast Documents Classification Method Based on Simhash

Gu, Liang; Yang, Peng; Dong, Yongqiang

doi:10.1007/978-3-319-27122-4_14

Conference paper
First Online: 16 December 2015

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9529)

Abstract

In recent years, automatic document classification has seen extensive research and remarkable progress, and it has gradually become a research focus in information retrieval and data mining. Existing approaches have achieved some success but remain severely limited when documents carry a large number of features. The problem is worse in big data environments, where the number of documents is huge. To address these challenges, we propose a fast method called Simhash-based document classification (SHDC). The method first compresses the large feature set of each document into a fixed number of dimensions to reduce computation, and then generates the features of each category from the documents belonging to it. Finally, we parallelize the most computationally expensive step, the feature extraction of documents and categories, using Apache Spark. We give a theoretical analysis of SHDC and conduct a series of experiments on a real-world dataset to evaluate its feasibility and performance. The results show that SHDC outperforms other methods in classification precision and time efficiency. SHDC also scales well as the number of computation nodes increases.
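The pipeline outlined in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not specify the feature weighting, the fingerprint width, or how a category's features are built, so the sketch assumes raw term frequencies as weights, a 64-bit Charikar-style Simhash, a category fingerprint computed from the summed term weights of its documents, and nearest-fingerprint classification by Hamming distance. All function names are hypothetical, and the Spark parallelization step is omitted.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width; an assumed value, not taken from the paper


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(weights) -> int:
    """Compress weighted features into a BITS-wide Simhash fingerprint."""
    acc = [0.0] * BITS
    for token, w in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            # Each feature votes +w or -w on every bit position.
            acc[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i, v in enumerate(acc):
        if v > 0:
            fp |= 1 << i
    return fp


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def doc_weights(text: str) -> Counter:
    """Assumed weighting: raw term frequency of whitespace-separated tokens."""
    return Counter(text.lower().split())


def category_fingerprint(docs) -> int:
    """Assumed category feature: Simhash of the summed weights of its documents."""
    total = Counter()
    for d in docs:
        total.update(doc_weights(d))
    return simhash(total)


def classify(text: str, categories) -> str:
    """Return the category whose fingerprint is closest in Hamming distance."""
    fp = simhash(doc_weights(text))
    return min(categories, key=lambda c: hamming(fp, categories[c]))
```

With fingerprints built this way, classifying a document costs one Simhash computation plus one Hamming distance per category, independent of the size of the original feature space, which is the kind of saving the abstract attributes to feature compression.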

Acknowledgments

This work is supported by the National High Technology Research and Development Program (863 Program) of China under grant No. 2013AA013503, the National Science Foundation of China under grants No. 61472080 and No. 61272532, the Consulting Project of Chinese Academy of Engineering under grant 2015-XY-04, and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Southeast University, Nanjing, 211189, China
Liang Gu, Peng Yang & Yongqiang Dong
Ministry of Education, Key Laboratory of Computer Network and Information Integration (Southeast University), Nanjing, 211189, China
Liang Gu, Peng Yang & Yongqiang Dong

Corresponding author

Correspondence to Liang Gu.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

About this paper

Cite this paper

Gu, L., Yang, P., Dong, Y. (2015). SHDC: A Fast Documents Classification Method Based on Simhash. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_14

DOI: https://doi.org/10.1007/978-3-319-27122-4_14
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27121-7
Online ISBN: 978-3-319-27122-4
eBook Packages: Computer Science, Computer Science (R0)
