Abstract
In Web search engines, digital libraries, and other online information services, duplicates and near-duplicates can cause severe problems if left unaddressed. Typical problems include wasted storage space, longer indexing time, and redundant results presented to users. In this paper, we propose a method for detecting near-duplicate documents. The method uses two sentence-level features: the number of terms in a sentence and the terms at particular positions. A suffix tree is used to match sentence blocks efficiently. Experiments comparing our method with two other representative methods show that it is effective and efficient. It has the potential to be used in practice.
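To make the idea of sentence-level features concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say which term positions are used, so the choice of first and last term here is an assumption, as are the function names `sentence_signatures`, `longest_common_block`, and `block_overlap`. The block matching is done with a simple quadratic dynamic program as a stand-in for the suffix tree named in the abstract; a suffix tree would find the same common blocks more efficiently on large collections.

```python
import re
from typing import List, Tuple

# Hypothetical signature: (number of terms, first term, last term).
# The positions used in the paper are not given in the abstract.
Signature = Tuple[int, str, str]


def sentence_signatures(text: str) -> List[Signature]:
    """Map each sentence to a compact signature built from two features:
    its term count and the terms at fixed positions (assumed: first, last)."""
    signatures = []
    for sentence in re.split(r"[.!?]+", text):
        terms = re.findall(r"\w+", sentence.lower())
        if terms:  # skip empty fragments left by the split
            signatures.append((len(terms), terms[0], terms[-1]))
    return signatures


def longest_common_block(a: List[Signature], b: List[Signature]) -> int:
    """Length of the longest run of consecutive sentences whose signatures
    match in both documents. Plain O(len(a) * len(b)) dynamic programming,
    used here instead of a suffix tree for brevity."""
    best = 0
    prev = [0] * (len(b) + 1)
    for sig_a in a:
        curr = [0] * (len(b) + 1)
        for j, sig_b in enumerate(b, start=1):
            if sig_a == sig_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def block_overlap(doc_a: str, doc_b: str) -> float:
    """Fraction of the shorter document covered by the longest shared
    sentence block (an illustrative score, not the paper's measure)."""
    a = sentence_signatures(doc_a)
    b = sentence_signatures(doc_b)
    if not a or not b:
        return 0.0
    return longest_common_block(a, b) / min(len(a), len(b))
```

Two documents could then be flagged as near-duplicates when `block_overlap` exceeds a chosen threshold; the threshold value is application-specific and is not taken from the paper.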
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Feng, J., Wu, S. (2015). Detecting Near-Duplicate Documents Using Sentence Level Features. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9262. Springer, Cham. https://doi.org/10.1007/978-3-319-22852-5_17
DOI: https://doi.org/10.1007/978-3-319-22852-5_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22851-8
Online ISBN: 978-3-319-22852-5
eBook Packages: Computer Science, Computer Science (R0)