So, I am sure you know LogLog / HyperLogLog algorithms to get a good approximation of some interesting business metrics “on-the-fly” (or Streaming Data).
From technical side, there are a good technical realization to get well good approximation like database engines below
- Apache Druid
- Yandex Clickhouse
Daria decided to use a “collaboration” between algorithms and data structures. She decided to take a class of structures Bloom Filters (in her case , CBF ~ Countable Bloom Filter) from one side and Machine Learning algorithm from other side. Please just take a look on results of prediction based on her algorithm below.
The High Level of algorithm also I am publishing here
So, it means end user instead of execution “SELECT COUNT(DISTINCT)…” just gets metric from special table structure directly without any internal calculations on database engine level.