Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach

Roul, Rajendra Kumar; Sahoo, Jajati Keshari

doi:10.1007/978-981-13-8676-3_46

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 990)

Abstract

Detecting near-duplicate pages, especially on the basis of their semantic content, is a relevant concern in information retrieval. It is needed to avoid redundancy in the search results returned for a query and to facilitate ranking the documents in order of their semantic similarity. Although much work has been done on near-duplicate page detection based on content similarity, as is evident in the existing literature, semantic similarity remains relatively unexplored. In this paper, a novel technique is proposed to detect whether two documents belonging to a corpus have near-duplicate semantic content, and a heuristic method is introduced to rank the documents by their semantic similarity scores. This is achieved by computing a semantic-based similarity between two documents and applying an averaging mechanism to associate a similarity score with each document in the corpus. Empirical results on the DUC datasets demonstrate the effectiveness of the proposed approach.
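
The abstract does not spell out the similarity measure or the averaging step, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each document is already represented as a discrete probability distribution (for example, over topics), uses one minus the Hellinger distance as a stand-in similarity, and averages each document's similarity to all other documents to obtain a per-document score. The function names and the threshold value are hypothetical.

```python
import math


def hellinger_similarity(p, q):
    """Similarity in [0, 1]: one minus the Hellinger distance of p and q.

    Assumption: p and q are discrete probability distributions of equal length.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    distance = math.sqrt(squared / 2.0)
    # Clamp to guard against tiny floating-point overshoot.
    return min(1.0, max(0.0, 1.0 - distance))


def near_duplicate_pairs(docs, threshold):
    """Return index pairs (i, j), i < j, whose similarity reaches the threshold."""
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if hellinger_similarity(docs[i], docs[j]) >= threshold:
                pairs.append((i, j))
    return pairs


def average_scores(docs):
    """Score each document by its mean similarity to every other document."""
    scores = []
    for i, doc in enumerate(docs):
        others = [hellinger_similarity(doc, other)
                  for j, other in enumerate(docs) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


def rank_documents(docs):
    """Return document indices ordered from highest to lowest average score."""
    scores = average_scores(docs)
    return sorted(range(len(docs)), key=lambda i: -scores[i])


# Hypothetical toy corpus: documents 0 and 1 are identical, document 2 is disjoint.
docs = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
pairs = near_duplicate_pairs(docs, threshold=0.9)  # 0.9 is an illustrative value
ranking = rank_documents(docs)
```

In this toy corpus the two identical documents form the only pair above the threshold, and the disjoint document ranks last. A real threshold would have to be tuned per dataset.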

Notes

1. https://tartarus.org/martin/PorterStemmer/
2. https://algorithmia.com/algorithms/StanfordNLP/Lemmatizer
3. Decided experimentally for the DUC-2001 dataset, on which the better result is obtained; the thresholds decided for the different DUC datasets differ.
4. https://www.nltk.org/
5. www.encyclopediaofmath.org/index.php?title=Hellinger_distance&oldid=16453
6. http://www.duc.nist.gov

Author information

Authors and Affiliations

Department of Computer Science, Thapar Institute of Engineering and Technology, Patiala, 147004, Punjab, India
Rajendra Kumar Roul
Department of Mathematics, BITS-Pilani, K. K. Birla Goa Campus, Zuarinagar, 403726, Goa, India
Jajati Keshari Sahoo

Corresponding author

Correspondence to Rajendra Kumar Roul.

Editor information

Editors and Affiliations

Department of Information Technology, Veer Surendra Sai University of Technology, Burla, Sambalpur, Odisha, India
Himansu Sekhar Behera
Department of Computer Science and Engineering, Sri Sivani College of Engineering, Srikakulam, Andhra Pradesh, India
Janmenjoy Nayak
Department of Computer Application, Veer Surendra Sai University of Technology, Burla, Sambalpur, Odisha, India
Bighnaraj Naik
Faculty of Communication Sciences, University of Teramo, Teramo, Italy
Danilo Pelusi

About this paper

Cite this paper

Roul, R.K., Sahoo, J.K. (2020). Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach. In: Behera, H., Nayak, J., Naik, B., Pelusi, D. (eds) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, vol 990. Springer, Singapore. https://doi.org/10.1007/978-981-13-8676-3_46

DOI: https://doi.org/10.1007/978-981-13-8676-3_46
Published: 18 August 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-8675-6
Online ISBN: 978-981-13-8676-3
eBook Packages: Intelligent Technologies and Robotics (R0)
