Flajolet-Martin algorithm

Hi there!

Sometimes, we need to know how many UNIQUE rows exist in a table. In a Relational Database World there is a simple but high-costly action for DB engine level (for example DISTINCT or GROUP BY with subquery). The simple business case for that action is to get amount of unique users who visited your web-site during period of time.

SELECT count(t1.user_id) AS "Unique users"
  FROM (SELECT DISTINCT(user_id) AS user_id
          FROM visits
         WHERE visit_date BETWEEN date1 AND date2) AS t1;
SELECT count(DISTINCT(user_id)) AS "Unique users"
  FROM visits
 WHERE visit_date BETWEEN date1 AND date2;
SELECT count(t1.user_id) AS "Unique users"
  FROM (SELECT user_id
          FROM visits
         WHERE visit_date BETWEEN date1 AND date2
      GROUP BY user_id) AS t1;

In the modern era of Big Data, Streaming Data, IoT we often face the speed of our request to the database and honestly we do not need to get the exact amount of unique users at the period of time. Just approximate number is enough for our needs. To help us with our task we can use a really awesome probabilistic structure.

Let me present an example of Flajolet-Martin algorithm (aka “FM” or “LogLog counting of large cardinalities”)

Inputs

A static or stream bag (~ multiset) of integers.

    \[    X = \{1,3,5,7,5,2,7\} \]

A hash function.

    \[    h(x_i) = (3x_i+1)\; mod \; 5 \]

Steps of FM-algorithm

Step #1

To calculate a hash for each element from a bag

    \[x_i\]

    \[h(x_i)\]

1

    \[(3 \cdot 1 +1)\; mod \; 5=4\]

3

    \[(3 \cdot 3 +1)\; mod \; 5=0\]

5

    \[(3 \cdot 5 +1)\; mod \; 5=1\]

7

    \[(3 \cdot 7 +1)\; mod \; 5=2\]

5

    \[(3 \cdot 5 +1)\; mod \; 5=1\]

2

    \[(3 \cdot 2 +1)\; mod \; 5=2\]

7

    \[(3 \cdot 7 +1)\; mod \; 5=2\]

Step #2

To write a binary representation of a hash value

    \[x_i\]

    \[h(x_i)\]

    \[binary(h(x_i))\]

1 4 100
3 0 000
5 1 001
7 2 010
5 1 001
2 2 010
7 2 010

Step #3

To calculate the count of trailing zeros in each binary representation of a hash value

    \[x_i\]

    \[binary(h(x_i))\]

    \[r_i\]

1 100 2
3 000 0
5 001 0
7 010 1
5 001 0
2 010 1
7 010 1

Step #4

Let’s estimate unique elements by formula

    \[    R=2^{max(r_i)} = 2^2 =4 \]

By fact, distinct values is 5. So, almost equal 🙂

More formalistic way

The original algorithm looks below like that (presented like a pseudo-code)

for (i:=0 to L-1) do BITMAP[i]:=0;

for (all x in X) do
    begin
        index:=g(h(x));
        if BITMAP[index]=0 then
            BITMAP[index]:=1;
    end
R:=the largest index in BITMAP whose value equals to 1

    \[    L > \log_2{n},n=|X|  \]

    \[    g(y)=min\{k \geq 0 \; | \; bit(y,k)\neq 0\}, \forall y>0 \]

    \[    g(y)=L, \forall y=0 \]

    \[    y=\sum_{k \geq 0}{bit(y,k)2^k} \]

Let’s return to our example.

    \[    L =3 > \log_2{7} \approx 2.8 \]

Initiaize BITMAP vector by 0s for length L bits before our calculations.

Let’s write y as binary representation of hash value for each x

    \[    y_1(hash(x_1))=\sum_{k \geq 0}{bit(y,k)2^k}=1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0 \]

    \[    y_2(hash(x_2))=\sum_{k \geq 0}{bit(y,k)2^k}=0 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0 \]

    \[    y_3(hash(x_3))=\sum_{k \geq 0}{bit(y,k)2^k}=0 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 \]

    \[    y_4(hash(x_4))=\sum_{k \geq 0}{bit(y,k)2^k}=0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 \]

    \[    y_5(hash(x_5))=\sum_{k \geq 0}{bit(y,k)2^k}=0 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 \]

    \[    y_6(hash(x_6))=\sum_{k \geq 0}{bit(y,k)2^k}=0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 \]

    \[    y_7(hash(x_7))=\sum_{k \geq 0}{bit(y,k)2^k}=0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 \]

Let’s calculate all g(y) for each y.

    \[    g(y_1)=min\{k \geq 0 \; | \; bit(y,k)\neq 0\}=min\{2\}=2 \]

    \[    g(y_2)=min\{k \geq 0 \; | \; bit(y,k)\neq 0\}=\varnothing \]

    \[    g(y_3)=min\{k \geq 0 \; | \; bit(y,k)\neq 0\}=min\{0\}=0 \]

    \[    g(y_4)=min\{k \geq 0 \; | \; bit(y,k)\neq 0\}=min\{1\}=1 \]

    \[    g(y_5)=min\{k \geq 0 \; | \; bit(y,k)\neq 0\}=min\{0\}=0 \]

    \[    g(y_6)=min\{k \geq 0 \; | \; bit(y,k)\neq 0\}=min\{1\}=1 \]

    \[    g(y_7)=min\{k \geq 0 \; | \; bit(y,k)\neq 0\}=min\{1\}=1 \]

Substitute corresponding indexes in BITMAP vector based on g values

Choose largest index in BITMAP array with 1

    \[    R=2 \]

Estimate unique elements fy formula

    \[    2^R=4 \]

Estimate the cardinality of X

    \[    2^R/ \phi =4 / 0.77351 \approx 5.1 \]

The 2 main points of Flajolet-Martin algorithm are presented below.

  • Flajolet-Martin algorithm approximates the number of unique objects in a stream or a database in one pass.
  • if our “storage” (database or stream) contains n elements with m unique elements, so this algorithm runs in O(n) time and needs O(log(m)) memory.

Source articles and books

  • “Probabilistic counting algorithms for data base applications” Flajolet, Philippe; Martin, G. Nigel
  • “Mining of Massive Datasets” Anand Rajaraman, Jeffrey David Ullman

Leave a Reply