Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach

  • Conference paper
Computational Intelligence in Data Mining

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 990))

Abstract

Detecting near-duplicate pages, especially on the basis of their semantic content, is an important concern in information retrieval. It is needed to avoid redundancy in the results returned for a query and to facilitate ranking the documents in order of their semantic similarity. Although much work has been done on near-duplicate page detection based on content similarity (as is evident in the existing literature), semantic similarity remains a relatively unexplored area. This paper proposes a novel technique to detect whether two documents in a corpus have near-duplicate semantic content, and introduces a heuristic method to rank the documents by their semantic similarity scores. The technique computes a semantic-based similarity between two documents and applies an averaging mechanism to associate a similarity score with each document in the corpus. Empirical results on DUC datasets demonstrate the effectiveness of the proposed approach.
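The abstract does not spell out the similarity measure itself, so the following is only a minimal sketch of the surrounding mechanics: flagging near-duplicate pairs against a threshold and averaging pairwise scores to rank documents. Everything specific here is an assumption made for illustration, not the paper's method: the stand-in similarity (one minus the Hellinger distance between topic distributions), the function names, the toy distributions, and the threshold value are all hypothetical.

```python
import math
from itertools import combinations


def hellinger_similarity(p, q):
    """Stand-in similarity (an assumption): 1 - Hellinger distance
    between two discrete probability distributions of equal length."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return 1.0 - math.sqrt(squared) / math.sqrt(2)


def average_scores(docs, sim):
    """Give each document the mean of its similarities to all other documents."""
    names = list(docs)
    if len(names) < 2:
        return {name: 0.0 for name in names}
    totals = {name: 0.0 for name in names}
    for a, b in combinations(names, 2):
        score = sim(docs[a], docs[b])
        totals[a] += score
        totals[b] += score
    return {name: totals[name] / (len(names) - 1) for name in names}


def near_duplicates(docs, sim, threshold):
    """Return the document pairs whose similarity reaches the threshold."""
    return [(a, b) for a, b in combinations(docs, 2)
            if sim(docs[a], docs[b]) >= threshold]


# Hypothetical topic distributions for three documents (toy data).
corpus = {
    "d1": [0.7, 0.2, 0.1],
    "d2": [0.7, 0.2, 0.1],
    "d3": [0.1, 0.1, 0.8],
}
THRESHOLD = 0.9  # illustrative value only, not taken from the paper

scores = average_scores(corpus, hellinger_similarity)
ranking = sorted(scores, key=scores.get, reverse=True)
pairs = near_duplicates(corpus, hellinger_similarity, THRESHOLD)
```

In this toy corpus `d1` and `d2` have identical distributions, so they form the only pair above the threshold, while `d3` receives the lowest averaged score and ranks last. Any other pairwise similarity function with the same two-argument shape could be passed in place of the stand-in.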

Notes

  1. https://tartarus.org/martin/PorterStemmer/.

  2. https://algorithmia.com/algorithms/StanfordNLP/Lemmatizer.

  3. Decided experimentally for the DUC-2001 dataset, on which the better result is obtained; the thresholds decided for different DUC datasets are different.

  4. https://www.nltk.org/.

  5. www.encyclopediaofmath.org/index.php?title=Hellinger_distance&oldid=16453.

  6. http://www.duc.nist.gov.

Author information

Corresponding author

Correspondence to Rajendra Kumar Roul.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Roul, R.K., Sahoo, J.K. (2020). Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach. In: Behera, H., Nayak, J., Naik, B., Pelusi, D. (eds) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, vol 990. Springer, Singapore. https://doi.org/10.1007/978-981-13-8676-3_46
