Abstract
Detecting near-duplicate pages, especially on the basis of their semantic content, is a relevant concern in information retrieval. It is needed both to avoid redundancy in the results returned for a query and to facilitate ranking documents in order of their semantic similarity. Although much work has been done on near-duplicate page detection based on content similarity (as is evident in the existing literature), semantic similarity remains a relatively unexplored area. In this paper, a novel technique is proposed to detect whether two documents in a corpus have near-duplicate semantic content, and a heuristic method is introduced to rank the documents by their semantic similarity scores. This is achieved by applying the proposed technique for computing semantic-based similarity between two documents and then using an averaging mechanism to associate a similarity score with each document in the corpus. Empirical results on the DUC datasets demonstrate the effectiveness of the proposed approach.
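The pipeline outlined in the abstract (pairwise similarity, a threshold test for near-duplicates, and an averaged per-document score used for ranking) can be illustrated with a minimal sketch. The abstract does not specify the semantic similarity measure, the threshold value, or the exact averaging rule, so everything below is an assumption: `stand_in_similarity` is a plain bag-of-words cosine used only as a placeholder for the paper's semantic measure, the averaging is taken to be the mean similarity of a document to all other documents, and the function names are hypothetical.

```python
import math
from collections import Counter


def stand_in_similarity(doc_a: str, doc_b: str) -> float:
    """Placeholder measure: cosine similarity over word counts.

    This is NOT the paper's semantic measure; it only stands in for it.
    """
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicate_pairs(corpus, threshold, sim=stand_in_similarity):
    """Return index pairs (i, j), i < j, whose similarity reaches the threshold."""
    n = len(corpus)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if sim(corpus[i], corpus[j]) >= threshold
    ]


def average_scores(corpus, sim=stand_in_similarity):
    """Assumed averaging rule: mean similarity of each document to all others."""
    n = len(corpus)
    scores = []
    for i in range(n):
        others = [sim(corpus[i], corpus[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


def rank_documents(corpus, sim=stand_in_similarity):
    """Rank document indices by averaged similarity score, highest first."""
    scores = average_scores(corpus, sim)
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

Any pairwise measure with the same signature can be passed as `sim`, so a semantic measure could replace the placeholder without changing the thresholding or ranking steps. The threshold is left as a required argument because a suitable value would have to be chosen experimentally for each dataset.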
Notes
- 3. decided experimentally for the DUC-2001 dataset, on which the better result is obtained; the thresholds decided for different DUC datasets differ.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Roul, R.K., Sahoo, J.K. (2020). Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach. In: Behera, H., Nayak, J., Naik, B., Pelusi, D. (eds) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, vol 990. Springer, Singapore. https://doi.org/10.1007/978-981-13-8676-3_46
DOI: https://doi.org/10.1007/978-981-13-8676-3_46
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-8675-6
Online ISBN: 978-981-13-8676-3
eBook Packages: Intelligent Technologies and Robotics (R0)