
A Scalable System for Identifying Co-derivative Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3246)

Abstract

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype system that makes use of spex. Our experiments with several document collections demonstrate the effectiveness of the approach.
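The general idea of fingerprinting described in the abstract, matching documents on the hash values of their chunks, can be illustrated with a minimal sketch. This is not spex or deco, whose details are not given on this page; it is a plain baseline that hashes every overlapping word chunk. The chunk size of three words, the use of SHA-256, and the function names are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def chunks(text, size=3):
    """Yield every overlapping window of `size` words (a 'chunk')."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def fingerprint(text, size=3):
    """Return the set of hash values of all chunks in `text`."""
    return {
        hashlib.sha256(c.encode("utf-8")).hexdigest()
        for c in chunks(text, size)
    }


def shared_chunk_pairs(docs, size=3):
    """Map each pair of document ids to the number of chunk hashes they share."""
    # Inverted index: chunk hash -> ids of the documents containing it.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for h in fingerprint(text, size):
            postings[h].add(doc_id)

    # Count, for every pair of documents, how many chunk hashes co-occur.
    pair_counts = defaultdict(int)
    for ids in postings.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pair_counts[(a, b)] += 1
    return dict(pair_counts)


# Hypothetical toy collection, for illustration only.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a quick brown fox jumps over a sleeping cat",
    "c": "completely unrelated text about string processing",
}
print(shared_chunk_pairs(docs))
```

In this toy collection, documents "a" and "b" share three chunks ("quick brown fox", "brown fox jumps", "fox jumps over"), while "c" shares none. Note that this baseline stores a hash for every chunk, including those that occur only once and so cannot indicate co-derivation; the abstract presents spex as a way of extracting only the duplicated chunks.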

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bernstein, Y., Zobel, J. (2004). A Scalable System for Identifying Co-derivative Documents. In: Apostolico, A., Melucci, M. (eds) String Processing and Information Retrieval. SPIRE 2004. Lecture Notes in Computer Science, vol 3246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30213-1_6

  • DOI: https://doi.org/10.1007/978-3-540-30213-1_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23210-0

  • Online ISBN: 978-3-540-30213-1

  • eBook Packages: Springer Book Archive
