A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3007)

Abstract

Duplicated Web pages greatly hurt the perceived relevance of a search engine. Existing methods for detecting duplicated Web pages fall into two categories: offline and online. Offline methods aim to detect all duplicates in a large set of Web pages, but none of the reported methods can process more than 30 million Web pages, which is about 1% of the pages indexed by today's commercial search engines. In contrast, online methods focus on removing duplicated pages from the search results at run time. Although the number of pages to be processed is smaller, these methods could significantly increase the response time of search engines. Our experiments on real query logs show that popular and unpopular queries differ significantly in query number and duplicate distribution. Based on this observation, we propose a hybrid query-dependent duplicate detection method that combines the advantages of both offline and online methods. This hybrid method provides a solution for duplicate detection that is both effective and scalable.
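
The abstract does not spell out how the hybrid method routes queries, so the following Python sketch is only an illustration under stated assumptions: queries that occur frequently in the query log get their duplicate-free result lists precomputed offline, and all other queries are de-duplicated online at query time. Word-shingle Jaccard similarity is used as a stand-in duplicate test and is not necessarily the measure used in the paper. All names (`HybridDeduper`, `popular_min_count`, `toy_search`) and the threshold values are hypothetical.

```python
from collections import Counter


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (stand-in fingerprint)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(results, threshold=0.8):
    """Keep the highest-ranked page of each near-duplicate group.

    `results` is a ranked list of (doc_id, text) pairs; `threshold` is an
    assumed similarity cut-off, not a value taken from the paper.
    """
    kept_ids, kept_shingles = [], []
    for doc_id, text in results:
        s = shingles(text)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept_ids.append(doc_id)
            kept_shingles.append(s)
    return kept_ids


class HybridDeduper:
    """Hypothetical router: offline for popular queries, online otherwise."""

    def __init__(self, query_log, search, popular_min_count=2):
        self.search = search
        counts = Counter(query_log)
        # Offline step (assumption): precompute duplicate-free result
        # lists for queries that appear often in the log.
        self.offline = {
            q: dedupe(search(q))
            for q, c in counts.items()
            if c >= popular_min_count
        }

    def results(self, query):
        if query in self.offline:
            return self.offline[query]      # no run-time detection cost
        return dedupe(self.search(query))   # online fallback


# Toy corpus and search function, purely for illustration.
corpus = {
    "a": "cheap flights to paris book your cheap flights today",
    "b": "cheap flights to paris book your cheap flights today",
    "c": "history of the eiffel tower in paris france",
}


def toy_search(query):
    return [(d, t) for d, t in corpus.items() if query in t]


deduper = HybridDeduper(["paris", "paris", "eiffel"], toy_search)
```

In this toy setup, `deduper.results("paris")` is served from the precomputed table and drops page `b` as a duplicate of `a`, while `deduper.results("eiffel")` is de-duplicated at query time. The intended trade-off is that detection cost is moved offline for the queries that account for most traffic, and the online path only handles the long tail.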

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ye, S., Song, R., Wen, JR., Ma, WY. (2004). A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_6

  • DOI: https://doi.org/10.1007/978-3-540-24655-8_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21371-0

  • Online ISBN: 978-3-540-24655-8

  • eBook Packages: Springer Book Archive
