Skip to main content
Log in

Improved Approximate Detection of Duplicates for Data Streams Over Sliding Windows

  • Regular Paper
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

Detecting duplicates in data streams is an important problem that has a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios, and, on the other hand, the elements in data streams are always time sensitive. These make it particular significant approximately detecting duplicates among newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, Decaying Bloom Filter (DBF), as an extension of the Counting Bloom Filter, that effectively removes stale elements as new elements continuously arrive over sliding windows. On the DBF basis we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors, but not false negative errors as in many previous results. We analyze the time complexity and detection accuracy, and give a tight upper bound of false positive rate. For a given space G bits and sliding window size W, our algorithm has an amortized time complexity of \( O + (\sqrt{G/W}) \). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to the previous results.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Babcock B, Babu S, Datar M, Windom J. Model and issues in data stream systems. In Proc. Principles of Database Systems (PODS), Wisconsin, USA, June 2002, pp.1–16.

  2. Golab L, Ozsu M T. Issues in data stream management. SIGMOD Record, June 2003, 32(2): 5–14.

    Article  Google Scholar 

  3. Metwally A, Agrawal D, Abbadi A E. Duplicate detection in click streams. In Proc. 14th Int. Conf. World Wide Web, Chiba, Japan, May 2005, pp.12–21.

  4. Deng F, Rafiei D. Approximately detecting duplicates for streaming data using stable Bloom filters. In Proc. ACM SIGMOD Conf., New York, USA, June 2006, pp.25–36.

  5. Anupam V, Mayer A, Nissim K, Pinkas B, Reiter M. On the security of pay-per-click and other web advertising schemes. In Proc. 8th Int. Conf. World Wide Web, Toronto, Canada, May 1999, pp.1091–1100.

  6. Heydon A, Najork M Mercator: A scalable, extensible web crawler. In Proc. 8th Int. Conf. World Wide Web, Toronto, Canada, May 1999, pp.219–229.

  7. Broder A Z, Najork M, Wiener J L. Efficient URL caching for world wide web crawling. In Proc. 12th Int. Conf. World Wide Web, Budapest, Hungary, May 2003, pp.680–689.

  8. Cisco network accounting services white paper. Cisco System Inc, 2002, pp.1–20, http://www.cisco.com/warp/public/cc/pd/iosw/prodlit/nwactwp.pdf.

  9. Garcia-Molina H, Ullman J D, Widom J. Database System Implementation. Upper Saddle River: Prentice Hall, Inc., NJ, USA, 1999.

    Google Scholar 

  10. Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In Proc. ACM SIGKDD Conf., Washington DC, USA, August 2003, pp.39–48.

  11. Weis M, Naumann F. Dogmatix tracks down duplicates in XML. In Proc. ACM SIGMOD Conf., Baltimore, Maryland, USA, June 2005, pp.431–442.

  12. Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating fuzzy duplicates in data warehouses. In Proc. 28th Int. Conf. Very Large Data Bases, Hong Kong, China, 2002, pp.586–597.

  13. Chaudhuri S, Ganti V, Motwani R. Robust identification of fuzzy duplicates. In Proc. 21st Int. Conf. Data Engineering (ICDE), Tokyo, Japan, 2005, pp.865–876.

  14. Bloom B H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, July 1970, 13(7): 422–426.

    Article  MATH  Google Scholar 

  15. Fan L, Cao P, Almeida J, Broder A Z. Summary cache: A scalable wide-area Web cache sharing protocol. IEEE/ACM Trans. Networking, 2000, 8(3): 281–293.

    Article  Google Scholar 

  16. Mitzenmacher M. Compressed Bloom filters. In Proc. 20th ACM Symposium on Principles of Distributed Computing Netw. Rhode Island, 2001, pp.144–150.

  17. Cohen S, Matias Y. Spectral Bloom filters. In Proc. ACM SIGMOD Conf., California, USA, June 2003, pp.241–252.

  18. Whang K, Zenden B T V, Taylor H M. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 1990, 15(2): 208–299.

    Article  Google Scholar 

  19. Cheng K, Xiang L, Iwaihara M. Time-decaying Bloom filters for data streams with skewed distributions. In Proc. 15th Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA), Tokyo, Japan, 2005, pp.63–69.

  20. Cohen E, Strauss M. Maintaining time-decaying stream aggregates. In Proc. Principles of Database Systems (PODS), San Diego, California, USA, June 2003, pp.223–233.

  21. Cormode G, Tirthapura S, Xu B. Time-decaying sketches for sensor data aggregation. In Proc. Principles of Distributed Computing (PODC), Portland, Oregon, USA, May 2007, pp.215–224.

  22. Arasu A, Manku G. Approximate counts and quanlites over sliding windows. In Proc. Principles of Database Systems (PODS), Paris, France, June 2004, pp.286–296.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hong Shen.

Additional information

This work is supported by the “Hundred Talents Program” of CAS and the National Natural Science Foundation of China under Grant No. 60772034.

Electronic supplementary material

Below is the link to the electronic supplementary material.

(PDF 105 kb).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Shen, H., Zhang, Y. Improved Approximate Detection of Duplicates for Data Streams Over Sliding Windows. J. Comput. Sci. Technol. 23, 973–987 (2008). https://doi.org/10.1007/s11390-008-9192-1

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-008-9192-1

Keywords

Navigation