FID-sketch: an accurate sketch to store frequencies in data streams

  • Tong Yang
  • Haowei Zhang
  • Hao Wang
  • Muhammad Shahzad
  • Xue Liu
  • Qin Xin
  • Xiaoming Li
Article
Part of the following topical collections:
  1. Special Issue on Web and Big Data

Abstract

Sketches are widely used in real-world applications to estimate the frequencies of data items. Because the volume of Internet data is growing at an unprecedented rate while on-chip memory sizes grow much more slowly, existing sketches are increasingly unable to keep the accuracy of their frequency estimates at an acceptable level. In this paper, we design a new sketch, called FID-sketch, that achieves significantly higher accuracy with a much smaller on-chip memory footprint than existing sketches. The key intuition behind FID-sketch is that, unlike prior sketches, it first estimates the frequency currently stored in the sketch for an item before inserting that item, and then increments as few counters as possible instead of incrementing a pre-determined, fixed number of counters. We carried out extensive experiments to evaluate FID-sketch and compare its performance with existing sketches on multi-core CPU and GPU platforms. Our experimental results show that FID-sketch significantly outperforms the state of the art, with 36.7 times lower relative error. We have released the source code of FID-sketch and of the other sketches that we implemented on GitHub [21].
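
The abstract does not spell out the FID-sketch update rule, so the following is only a minimal illustration of the general idea of estimating first and then touching as few counters as possible. It swaps in the well-known conservative-update rule on top of a Count-Min sketch [11, 14] as a stand-in; it is not the FID-sketch algorithm itself. The class names, the `width` and `depth` parameters, and the salted BLAKE2b hashing are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Baseline: every insert increments one counter in each of `depth` rows."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _positions(self, item):
        # One counter position per row, derived from a row-salted digest
        # (an illustrative hashing choice, not taken from the paper).
        positions = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            positions.append(int.from_bytes(digest, "little") % self.width)
        return positions

    def query(self, item):
        # The smallest mapped counter is the frequency estimate.
        return min(self.rows[r][p] for r, p in enumerate(self._positions(item)))

    def insert(self, item):
        for r, p in enumerate(self._positions(item)):
            self.rows[r][p] += 1


class EstimateFirstSketch(CountMinSketch):
    """Conservative update: estimate first, then raise only the minimal counters."""

    def insert(self, item):
        positions = self._positions(item)
        estimate = min(self.rows[r][p] for r, p in enumerate(positions))
        for r, p in enumerate(positions):
            # Counters already above the estimate are left untouched.
            if self.rows[r][p] == estimate:
                self.rows[r][p] += 1
```

Feeding the same stream to both classes and comparing `query` results shows the effect: the estimate-first variant never reports less than the true count and never more than the baseline, because it increments a subset of the counters the baseline increments.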

Keywords

Sketch Data streams Accuracy Speed Measurement 

Notes

Acknowledgments

This work is partially supported by the Primary Research & Development Plan of China (2016YFB1000304), the National Basic Research Program of China (2014CB340405), NSFC (61672061), the Open Project Funding of CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, and the National Science Foundation (CNS 1616317, CNS 1616273).

References

  1. Aguilar-Saborit, J., Trancoso, P., Muntes-Mulero, V., Larriba-Pey, J.-L.: Dynamic count filters. ACM SIGMOD Record, pp. 26–32 (2006)
  2. Barman, D., Satapathy, P., Ciardo, G.: Detecting attacks in routers using sketches. In: Proceedings of the High Performance Switching and Routing (2007)
  3. Bu, T., Cao, J., Chen, A., Lee, P.P.: A fast and compact method for unveiling significant patterns in high speed networks. In: Proceedings of the IEEE INFOCOM, pp. 1893–1901 (2007)
  4. Callegari, C., Cyprus, N.: Statistical approaches for network anomaly detection. In: Proceedings of the ICIMP (2009)
  5. Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB 10(2-3), 199–223 (2000)
  6. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Automata, Languages and Programming (2002)
  7. Chen, A., Jin, Y., Cao, J., Li, L.E.: Tracking long duration flows in network traffic. In: Proceedings of the IEEE INFOCOM (2010)
  8. Cisco visual networking index: Forecast and methodology, 2015–2020. CISCO White paper
  9. Cohen, S., Matias, Y.: Spectral bloom filters. In: Proceedings of the ACM SIGMOD, pp. 241–252 (2003)
  10. Cormode, G., Garofalakis, M.: Sketching streams through the net: Distributed approximate query tracking. In: Proceedings of the VLDB (2005)
  11. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithm. 55(1), 58–75 (2005)
  12. Cormode, G., Johnson, T., et al.: Holistic udafs at streaming speeds. In: Proceedings of the SIGMOD (2004)
  13. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB 1(2), 1530–1541 (2008)
  14. Estan, C., Varghese, G.: New directions in traffic measurement and accounting. Proc. ACM SIGCOMM 32(4), 323–338 (2002)
  15. Fan, L., Cao, P., Almeida, J., Broder, A.Z.: Summary cache: A scalable wide-area web cache sharing protocol. In: Proceedings of the ACM SIGCOMM (1998)
  16. Kollios, G., Byers, J.W., Considine, J., Hadjieleftheriou, M., Li, F.: Robust aggregation in sensor networks. IEEE Data Eng. Bull. 28(1), 26–32 (2005)
  17. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: Parallel analysis with Sawzall. Dyn. Grids Worldw. Comput. 13(4), 277–298 (2005)
  18. Pitel, G., Fouquier, G.: Count-min-log sketch: Approximately counting with approximate counters. arXiv:1502.04885 (2015)
  19. Potti, N., Patel, J.M.: Daq: a new paradigm for approximate query processing. In: Proceedings of the VLDB (2015)
  20. Powers, D.M.: Applications and explanations of Zipf’s law. In: Proceedings of the EMNLP-CoNLL. Association for Computational Linguistics (1998)
  21. Source code of FID sketches with CUDA implementation. https://github.com/papers2016/FID-sketch.git
  22. Yang, T., Xie, G., Li, Y., et al.: Guarantee IP lookup performance with FIB explosion. In: Proceedings of the SIGCOMM (2014)
  23. Yang, T., Liu, A.X., Shahzad, M., Zhong, Y., Fu, Q., Li, Z., Xie, G., Li, X.: A Shifting Bloom Filter Framework for Set Queries. In: Proceedings of the VLDB (2016)
  24. Yang, T., Liu, A.X., Shahzad, M., Yang, D., Fu, Q., Xie, G., Li, X.: A Shifting Framework for Set Queries. In: Proceedings of the IEEE/ACM Transactions on Networking (ToN) (2017)
  25. Yang, T., Zhou, Y., Jin, H., Chen, S., Li, X.: Pyramid Sketch: a Sketch Framework for Frequency Estimation of Data Streams. In: Proceedings of the VLDB (2017)
  26. Zhang, Y., Singh, S., Sen, S., Duffield, N., Lund, C.: Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications. In: Proceedings of the ACM IMC (2004)
  27. Zhao, Q.G., Ogihara, M., Wang, H., Xu, J.J.: Finding global icebergs over distributed data sets. In: Proceedings of the ACM PODS. ACM (2006)
  28. Zhou, Y., Yang, T., Jiang, J., Cui, B., Yu, M., Li, X., Uhlig, S.: Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing. In: Proceedings of the SIGMOD (2018)

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Tong Yang (1)
  • Haowei Zhang (1)
  • Hao Wang (1)
  • Muhammad Shahzad (2)
  • Xue Liu (3)
  • Qin Xin (4)
  • Xiaoming Li (1)

  1. Peking University, Haidian Qu, China
  2. North Carolina State University, Raleigh, USA
  3. Institute of Acoustics, Chinese Academy of Science, Beijing, China
  4. Tsinghua University, Haidian Qu, China
