Hi there!
Sometimes, we need to know how many UNIQUE rows exist in a table. In a Relational Database World there is a simple but highcostly action for DB engine level (for example DISTINCT or GROUP BY with subquery). The simple business case for that action is to get amount of unique users who visited your website during period of time.
SELECT count(t1.user_id) AS "Unique users" FROM (SELECT DISTINCT(user_id) AS user_id FROM visits WHERE visit_date BETWEEN date1 AND date2) AS t1;
SELECT count(DISTINCT(user_id)) AS "Unique users" FROM visits WHERE visit_date BETWEEN date1 AND date2;
SELECT count(t1.user_id) AS "Unique users" FROM (SELECT user_id FROM visits WHERE visit_date BETWEEN date1 AND date2 GROUP BY user_id) AS t1;
In the modern era of Big Data, Streaming Data, IoT we often face the speed of our request to the database and honestly we do not need to get the exact amount of unique users at the period of time. Just approximate number is enough for our needs. To help us with our task we can use a really awesome probabilistic structure.
Let me present an example of FlajoletMartin algorithm (aka “FM” or “LogLog counting of large cardinalities”)
Inputs
A static or stream bag (~ multiset) of integers.
A hash function.
Steps of FMalgorithm
Step #1
To calculate a hash for each element from a bag


1 

3 

5 

7 

5 

2 

7 

Step #2
To write a binary representation of a hash value



1  4  100 
3  0  000 
5  1  001 
7  2  010 
5  1  001 
2  2  010 
7  2  010 
Step #3
To calculate the count of trailing zeros in each binary representation of a hash value



1  100  2 
3  000  0 
5  001  0 
7  010  1 
5  001  0 
2  010  1 
7  010  1 
Step #4
Let’s estimate unique elements by formula
By fact, distinct values is 5. So, almost equal ðŸ™‚
More formalistic way
The original algorithm looks below like that (presented like a pseudocode)
for (i:=0 to L1) do BITMAP[i]:=0; for (all x in X) do begin index:=g(h(x)); if BITMAP[index]=0 then BITMAP[index]:=1; end R:=the largest index in BITMAP whose value equals to 1
Let’s return to our example.
Initiaize BITMAP vector by 0s for length L bits before our calculations.
Let’s write y as binary representation of hash value for each x
Let’s calculate all g(y) for each y.
Substitute corresponding indexes in BITMAP vector based on g values
Choose largest index in BITMAP array with 1
Estimate unique elements fy formula
Estimate the cardinality of X
The 2 main points of FlajoletMartin algorithm are presented below.
 FlajoletMartin algorithm approximates the number of unique objects in a stream or a database in one pass.
 if our “storage” (database or stream) contains n elements with m unique elements, so this algorithm runs in O(n) time and needs O(log(m)) memory.
Source articles and books
 “Probabilistic counting algorithms for data base applications” Flajolet, Philippe; Martin, G. Nigel
 “Mining of Massive Datasets” Anand Rajaraman, Jeffrey David Ullman