Efficient Exact Algorithm for Count Distinct Problem
This paper describes and analyses optimization approaches that make possible the exact calculation of millions of hierarchical count distinct measures over hundreds of billions of data rows. The described approach evolved over several years, in parallel with the growing number of tasks at a fast-growing internet company, and was finally implemented as the PEAPM (Pipelined Exact Accumulation for Paralleled Measures) algorithm. The current version of the algorithm outputs exact values (not estimates), runs in a single thread, completes in minutes on general commodity hardware, and requires an amount of RAM equal to twice the size of the required measures.
Keywords: Big Data, MPP, Database, Analytics, Cardinality estimation, Distinct elements problem, Clickstream analysis, Performance
- 1. Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Discrete Mathematics and Theoretical Computer Science Proceedings, Nancy, France, AH, pp. 127–146. CiteSeerX 10.1.1.76.4286
- 3. Banerjee, A., Ghosh, J.: Clickstream clustering using weighted longest common subsequences. In: Proceedings of Web Mining Workshop at the 1st SIAM Conference on Data Mining (2001)
- 6. Naspers takes full control of Russian classifieds site Avito in 1.16B dollars deal. https://techcrunch.com/2019/01/28/naspers-avito-1-16-billion/
- 7. Benchmarks of modern analytical databases for typical click stream analysis scenarios. https://clickhouse.yandex/benchmark.html
- 8. C++ Vertica extension implementing described algorithm. https://github.com/phil-88/vertica-udf