A Scalable System for Identifying Co-derivative Documents

Bernstein, Yaniv; Zobel, Justin

doi:10.1007/978-3-540-30213-1_6

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3246)

Abstract

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype system that makes use of spex. Our experiments with several document collections demonstrate the effectiveness of the approach.
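
To make the idea of fingerprinting concrete, the following is a minimal sketch of generic chunk-hash matching as the abstract describes it: each document is split into chunks, the chunks are hashed, and documents are matched on the hash values they share. It is not the spex algorithm or the deco system, whose details are not given here. The function names, the word-level chunks of three words, and the use of truncated SHA-256 are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, size=3):
    """Yield every run of `size` consecutive words (a word-level chunk)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Hash a chunk to a short fixed-length value (truncation is an assumption)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]


def shared_chunk_counts(docs, size=3):
    """Map each document pair to the number of distinct chunk hashes they share."""
    postings = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for chunk in chunks(text, size):
            postings[chunk_hash(chunk)].add(doc_id)

    counts = defaultdict(int)
    for doc_ids in postings.values():
        if len(doc_ids) > 1:  # only chunks seen in several documents are evidence
            for pair in combinations(sorted(doc_ids), 2):
                counts[pair] += 1
    return dict(counts)


# Hypothetical three-document collection.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "yesterday the quick brown fox jumps over a fence",
    "c": "an entirely unrelated sentence about compression",
}
result = shared_chunk_counts(docs)
print(result)
```

In this toy collection only documents "a" and "b" share chunks, so they form the only candidate co-derivative pair. A chunk that appears in a single document contributes nothing to any pair, which is why isolating the duplicated chunks is the useful part of the work.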

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Yaniv Bernstein & Justin Zobel

Editor information

Editors and Affiliations

Georgia Institute of Technology and Università di Padova
Alberto Apostolico
Department of Information Engineering, University of Padova
Massimo Melucci

Cite this paper

Bernstein, Y., Zobel, J. (2004). A Scalable System for Identifying Co-derivative Documents. In: Apostolico, A., Melucci, M. (eds) String Processing and Information Retrieval. SPIRE 2004. Lecture Notes in Computer Science, vol 3246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30213-1_6

DOI: https://doi.org/10.1007/978-3-540-30213-1_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23210-0
Online ISBN: 978-3-540-30213-1
eBook Packages: Springer Book Archive
