Streaming data and method to get a good approximation of COUNT(DISTINCT) problem

Daria Kovtun

So, I am sure you know LogLog / HyperLogLog algorithms to get a good approximation of some interesting business metrics “on-the-fly” (or Streaming Data).

From technical side, there are a good technical realization to get well good approximation like database engines below

  • Apache Druid
  • Yandex Clickhouse

Daria decided to use a “collaboration” between algorithms and data structures. She decided to take a class of structures Bloom Filters (in her case , CBF ~ Countable Bloom Filter) from one side and Machine Learning algorithm from other side. Please just take a look on results of prediction based on her algorithm below.

The High Level of algorithm also I am publishing here

So, it means end user instead of execution “SELECT COUNT(DISTINCT)…” just gets metric from special table structure directly without any internal calculations on database engine level.

Leave a Reply