Abstract
When many objects are counted simultaneously in large data streams, as in network traffic monitoring or in Webgraph and molecular sequence analyses, memory becomes a limiting factor. Robert Morris [Communications of the ACM, 21:840–842, 1978] proposed a probabilistic technique for approximate counting that is extremely economical. The basic idea is to increment a counter containing the value \(X\) with probability \(2^{-X}\). As a result, the counter contains an approximation of \(\lg n\) after \(n\) probabilistic updates, stored in \(\lg\lg n\) bits. Here we revisit the original idea of Morris. We introduce a binary floating-point counter that combines a \(d\)-bit significand with a binary exponent, stored together in \(d+\lg\lg n\) bits. The counter yields a simple formula for an unbiased estimate of \(n\) with a standard deviation of about \(0.6\cdot n\,2^{-d/2}\).
We analyze the floating-point counter’s performance in a general framework that applies to any probabilistic counter. In that framework, we provide practical formulas to construct unbiased estimates, and to assess the asymptotic accuracy of any counter.
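The counters described above can be illustrated with a minimal Python sketch. The Morris update (increment with probability \(2^{-X}\)) comes from the abstract, and \(2^X-1\) is the standard unbiased estimate for that counter. The floating-point part is an assumption: the abstract does not spell out the update rule, so the sketch takes the low \(d\) bits of the counter as the significand \(u\) and the remaining high bits as the exponent \(t\), increments with probability \(2^{-t}\), and estimates \(n\) as \((2^d+u)\,2^t-2^d\). All function names are hypothetical.

```python
import random


def morris_increment(x, rng=random):
    """Morris update: increment the counter value x with probability 2**-x."""
    return x + 1 if rng.random() < 2.0 ** -x else x


def morris_estimate(x):
    """Standard unbiased estimate of the number of updates for a Morris counter."""
    return 2 ** x - 1


def fp_increment(x, d, rng=random):
    """Assumed floating-point update: the exponent t is the counter without
    its d low bits, and the counter is incremented with probability 2**-t."""
    t = x >> d
    return x + 1 if rng.random() < 2.0 ** -t else x


def fp_estimate(x, d):
    """Assumed estimate (2**d + u) * 2**t - 2**d, where u is the d-bit
    significand and t is the exponent. Each counter step adds 2**t to the
    estimate, the inverse of the increment probability, which is what makes
    the estimate unbiased under this update rule."""
    m = 1 << d
    t, u = x >> d, x & (m - 1)
    return (m + u) * (1 << t) - m


def simulate(n, d, seed=None):
    """Apply n probabilistic updates to a counter starting at 0; return the counter."""
    rng = random.Random(seed)
    x = 0
    for _ in range(n):
        x = fp_increment(x, d, rng)
    return x


if __name__ == "__main__":
    n, d = 100_000, 8
    x = simulate(n, d)
    print("counter:", x, "estimate:", fp_estimate(x, d), "true n:", n)
```

In this sketch the first \(2^d\) updates are counted exactly, because the exponent is still 0 and the increment probability is 1. With \(d=0\) the scheme reduces to the Morris counter above.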
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Csűrös, M. (2010). Approximate Counting with a Floating-Point Counter. In: Thai, M.T., Sahni, S. (eds) Computing and Combinatorics. COCOON 2010. Lecture Notes in Computer Science, vol 6196. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14031-0_39
DOI: https://doi.org/10.1007/978-3-642-14031-0_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14030-3
Online ISBN: 978-3-642-14031-0
eBook Packages: Computer Science, Computer Science (R0)