
Computing

pp 1–27

Optimizing the confidence bound of count-min sketches to estimate the streaming big data query results more precisely

  • Ruixin Guo
  • Erkang Xue
  • Feng Zhang (corresponding author)
  • Gansen Zhao
  • Guangzhi Qu

Abstract

A count-min sketch is a probabilistic data structure that serves as a frequency table of events when processing a stream of big data. It uses hash functions to map events to frequencies. Querying a count-min sketch for a target event returns an estimated frequency that is never less than the actual frequency. The estimation error, i.e., the difference between the estimated and the actual frequency, can be measured by a pre-defined confidence bound. However, the originally defined bound is too loose, because the Markov inequality used to derive it gives a weak estimate. In this paper, we define a tighter bound based on the binomial distribution and the central limit theorem. We show that the reliability of the bound is related to the deviation of the data, which can be measured by the data’s coefficient of standard deviation. Extensive experiments support the effectiveness and efficiency of the new bound.
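
For readers unfamiliar with the structure, the following is a minimal Python sketch of a standard count-min sketch with the classic Markov-based sizing (width ⌈e/ε⌉, depth ⌈ln(1/δ)⌉), under which the overestimate is at most εN with probability at least 1 − δ. It illustrates the loose original bound only; it is not the tighter bound proposed in this paper, whose formula the abstract does not state. The class name `CountMinSketch`, its method names, and the use of salted BLAKE2b as a stand-in for a pairwise-independent hash family are all assumptions made for illustration.

```python
import hashlib
import math


class CountMinSketch:
    """Illustrative count-min sketch with the classic (epsilon, delta) sizing."""

    def __init__(self, epsilon: float, delta: float):
        # Classic sizing: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # N: sum of all counts inserted so far

    def _index(self, row: int, item: str) -> int:
        # Assumption: one salted BLAKE2b hash per row approximates
        # a family of independent hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item: str, count: int = 1) -> None:
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count
        self.total += count

    def query(self, item: str) -> int:
        # Collisions only ever add to a counter, so the minimum over
        # rows is an estimate that never falls below the true frequency.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )

    def markov_error_bound(self) -> float:
        # Classic bound: estimate - actual <= (e / width) * N,
        # holding with probability at least 1 - exp(-depth).
        return math.e / self.width * self.total


sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for event in ["a"] * 100 + ["b"] * 50 + ["c"] * 5:
    sketch.update(event)
estimate = sketch.query("a")  # never below the true count of 100
```

Taking the minimum across rows is what makes the estimate one-sided: every counter an event maps to contains its true count plus whatever other events collided with it.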

Keywords

Count-min sketch · Confidence bound · Probabilistic data structure · Streaming big data · Optimizing

Mathematics Subject Classification

68P05 (Data structures) 

Notes

Acknowledgements

The study is partially supported by the National Natural Science Foundation of China under Grant Nos. U1711266 and U1711267, and by the Fundamental Research Funds for National University under Grant No. 1610491B22, China University of Geosciences (Wuhan).

Copyright information

© Springer-Verlag GmbH Austria, part of Springer Nature 2019

Authors and Affiliations

  1. School of Computer Science, China University of Geosciences, Wuhan, China
  2. Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
  3. School of Computer Science, South China Normal University, Guangzhou, China
  4. Department of Engineering and Computer Science, Oakland University, Rochester, USA
