Hadoop Based Parallel Deduplication Method for Web Documents

  • Junjie Song
  • Jin Liu
  • Yuhui ZhengEmail author
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 474)


This paper proposes a method of deleting duplicate web pages through tf-idf and splay tree. According to the keywords which are extracted by TextRank, those pages which may be duplicate copies will be sent to a group. Then these pages will be judged by the method above. We use three Map-Reduce tasks to ensure the method of calculating tf-idf and deleting duplicate web pages. The experiment result shows that the algorithm can remove duplicate web pages efficiently and accurately.


Webpage deduplication Vertical search engine 


  1. 1.
    Lopresti, D.P.: Models and algorithms for duplicate document detection. In: Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR 1999, pp. 297–300. IEEE (1999)Google Scholar
  2. 2.
    Jianyong, W., Zhengmao, X., Ming, L., et al.: Research and evaluation of near-replicas of Web pages detection algorithms. Chin. J. Electron. (2000)Google Scholar
  3. 3.
    Liu, S., Zhang, Y., Xia, Y., et al.: Duplicate web page elimination based on HTML and extraction of long sentence. Microcomput. Appl. (2009)Google Scholar
  4. 4.
    Salton, G., McGill, M.J.: Introduction to modern information retrieval (1986)Google Scholar
  5. 5.
    Salton, G., Fox, E.A., Wu, H.: Extended Boolean information retrieval. Commun. ACM 26(11), 1022–1036 (1983)MathSciNetCrossRefGoogle Scholar
  6. 6.
    Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)CrossRefGoogle Scholar
  7. 7.
    Wan, J., Yu, W., Xu, X.: Design and implement of distributed document clustering based on MapReduce. In: Proceedings of the Second Symposium International Computer Science and Computational Technology (ISCSCT), Huangshan, PR China, pp. 278–280 (2009)Google Scholar
  8. 8.
    Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. Association for Computational Linguistics (2004)Google Scholar
  9. 9.
    Page, L., Brin, S., Motwani, R., et al.: The PageRank citation ranking: bringing order to the web. Stanford InfoLab (1999)Google Scholar
  10. 10.
    Sleator, D.D., Tarjan, R.E.: Self-adjusting binary search trees. J. ACM (JACM) 32(3), 652–686 (1985)MathSciNetCrossRefGoogle Scholar
  11. 11.
    Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRefGoogle Scholar
  12. 12.
    Broder, A.Z., Glassman, S.C., Manasse, M.S., et al.: Syntactic clustering of the web. Comput. Netw. ISDN Syst. 29(8–13), 1157–1166 (1997)CrossRefGoogle Scholar

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. 1.College of InformationShanghai Martime UniversityShanghaiChina
  2. 2.School of Computer and SoftwareNanjing University of Information Science and TechnologyNanjingChina

Personalised recommendations