SHDC: A Fast Documents Classification Method Based on Simhash

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9529)

Abstract

In recent years, automatic document classification has been studied extensively and has made remarkable progress, gradually becoming a research focus in information retrieval and data mining. These studies have achieved some success, but they remain severely limited in handling the abundant features of documents. The problem is worse in big data environments, where the number of documents is huge. To address these challenges, we propose a fast method called Simhash-based document classification (SHDC). In this method, we first compress the large feature set of each document into a fixed number of dimensions to reduce computation, and then generate the features of each category from the documents that belong to it. Finally, we parallelize the most computationally expensive step, the feature extraction of documents and categories, using Apache Spark. To show the performance of SHDC, we give a theoretical analysis and conduct a series of experiments on a real-world dataset to evaluate its feasibility and performance. The results show that SHDC outperforms other methods in classification precision and time efficiency. SHDC also scales well as the number of computation nodes increases.
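
The compression step described in the abstract can be illustrated with the generic simhash construction. The following Python sketch is not the authors' implementation: the 64-bit fingerprint width, the whitespace tokenizer, raw term counts as feature weights, MD5 as the token hash, and nearest-fingerprint classification by Hamming distance are all assumptions made for illustration, and every function name is hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width; an assumption, the abstract does not state it


def token_hash(token):
    """Stable 64-bit hash of a token (truncated MD5; any uniform hash would do)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def weighted_sums(weights):
    """Per-bit signed sums: add a token's weight where its hash bit is 1,
    subtract it where the bit is 0."""
    sums = [0.0] * BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            sums[i] += weight if (h >> i) & 1 else -weight
    return sums


def fingerprint(sums):
    """Collapse signed sums into a bit string: bit i is 1 iff sums[i] > 0."""
    fp = 0
    for i, s in enumerate(sums):
        if s > 0:
            fp |= 1 << i
    return fp


def simhash(text):
    """Fingerprint of one document, weighting tokens by raw term count."""
    return fingerprint(weighted_sums(Counter(text.lower().split())))


def category_fingerprint(docs):
    """Fingerprint of a category, built here by pooling the term counts of its
    documents (one possible aggregation, assumed for this sketch)."""
    pooled = Counter()
    for doc in docs:
        pooled.update(doc.lower().split())
    return fingerprint(weighted_sums(pooled))


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def classify(text, categories):
    """Return the category whose fingerprint is closest in Hamming distance."""
    fp = simhash(text)
    return min(categories, key=lambda name: hamming(fp, categories[name]))


if __name__ == "__main__":
    # Toy training data, invented for this example only.
    cats = {
        "sports": category_fingerprint(["football match goal team"]),
        "finance": category_fingerprint(["stock market bank interest"]),
    }
    print(classify("football match goal team", cats))
```

Because the per-bit sums are additive across tokens and documents, they can be computed on separate partitions and then summed, which is the property that makes this step a natural candidate for a map-and-reduce style of parallelization.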

Acknowledgments

This work is supported by the National High Technology Research and Development Program (863 Program) of China under grant No. 2013AA013503, the National Science Foundation of China under grants No. 61472080 and No. 61272532, the Consulting Project of the Chinese Academy of Engineering under grant 2015-XY-04, and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Corresponding author

Correspondence to Liang Gu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Gu, L., Yang, P., Dong, Y. (2015). SHDC: A Fast Documents Classification Method Based on Simhash. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_14

  • DOI: https://doi.org/10.1007/978-3-319-27122-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27121-7

  • Online ISBN: 978-3-319-27122-4

  • eBook Packages: Computer Science, Computer Science (R0)
