The Case of the Duplicate Documents: Measurement, Search, and Science

  • Conference paper
Frontiers of WWW Research and Development - APWeb 2006 (APWeb 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3841)

Abstract

Many of the documents in large text collections are duplicates and versions of each other. In recent research, we developed new methods for finding such duplicates; however, as there was no directly comparable prior work, we had no measure of whether we had succeeded. Worse, the concept of “duplicate” not only proved difficult to define, but on reflection was not logically defensible. Our investigation highlighted a paradox of computer science research: objective measurement of outcomes involves a subjective choice of preferred measure; and attempts to define measures can easily founder in circular reasoning. Also, some measures are abstractions that simplify complex real-world phenomena, so success by a measure may not be meaningful outside the context of the research. These are not merely academic concerns, but are significant problems in the design of research projects. In this paper, the case of the duplicate documents is used to explore whether and when it is reasonable to claim that research is successful.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zobel, J., Bernstein, Y. (2006). The Case of the Duplicate Documents: Measurement, Search, and Science. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_4

  • DOI: https://doi.org/10.1007/11610113_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31142-3

  • Online ISBN: 978-3-540-32437-9

  • eBook Packages: Computer Science, Computer Science (R0)
