PostgreSQL extension for a hierarchical clustering algorithm

Irina Nesterova

Data science and data structures are tightly coupled. A Data Science (DS) team needs data, structured or semi-structured; in other words, it needs to know where the data is stored in order to run machine learning or data mining models. Collaboration between DS and Data Platform (DP) teams produces a variety of solutions for transporting, preparing, cleaning, and otherwise handling that data.

Sometimes you want to do data science research directly in the RDBMS, without a new physical layer or ETL processes, to stay within budget. In that case you really just want to run a built-in functional block (a procedure or function), pass the table with the data under study as a parameter, and get the result. It sounds awesome; moreover, RDBMS storage can handle several tens of petabytes of data without problems (at least, that is my experience with a PostgreSQL database in production).

So Irina decided to work closely with a hierarchical clustering algorithm, which is currently implemented as an EXTENSION (a specific library) for the PostgreSQL database. In the kernel, she uses a few approaches to cluster the data passed into her API.
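
As a generic illustration of what hierarchical (agglomerative) clustering does, and not the extension's actual implementation, the following minimal Python sketch repeatedly merges the two closest clusters until a requested number remains. The function name `single_linkage`, the choice of single linkage with Euclidean distance, and the sample points are all assumptions made for this example.

```python
from itertools import combinations
from math import dist


def single_linkage(points, n_clusters):
    """Agglomerative clustering sketch (hypothetical helper, not the extension's API).

    Starts with one cluster per point and merges the two closest clusters
    (single linkage, Euclidean distance) until n_clusters remain.
    Returns a list of clusters, each a sorted list of point indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")

    clusters = [[i] for i in range(len(points))]

    def linkage(pair):
        # Single linkage: distance between the closest members of two clusters.
        a, b = pair
        return min(
            dist(points[i], points[j])
            for i in clusters[a]
            for j in clusters[b]
        )

    while len(clusters) > n_clusters:
        # combinations() yields pairs with a < b, so deleting b keeps a valid.
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a].extend(clusters[b])
        del clusters[b]

    return [sorted(c) for c in clusters]


# Sample rows, standing in for the table that would be passed as a parameter.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(single_linkage(points, 3))
```

This brute-force version recomputes all pairwise distances on every merge, so it is only suitable for small inputs; it is meant to show the idea, not to scale.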


A schematic representation of the solution is presented below.

