A method for blocking unverified ad-hoc SQL traffic in relational databases

How do you detect “heavy” SQL in a relational database? That is actually a really good question, because, for example, someone inside a company with access to the production database can execute an unverified SQL statement and… all the other, verified SQL traffic suffers while it runs. But why?

So, nowadays there are two workload paradigms for relational databases: OLAP and OLTP.

OLTP – a mode of short, fast SQL queries, typical for transactional applications.

OLAP – the other side: queries that read and aggregate large volumes of data, where a single query can run for several minutes or even hours.

A “heavy” OLAP SQL statement needs more database resources (in other words, more memory buffers), and while it runs, the buffers holding the hot data used by fast OLTP traffic get evicted and replaced by buffers for OLAP.

Yep, of course, there are attempts to create a new paradigm of “diagonal” (OLAP + OLTP) databases, usually called HTAP (Hybrid Transactional/Analytical Processing).

Okay, but how do we tell that a particular SQL statement is “heavy”? For example:

SELECT user_id, count(*)
  FROM visits
GROUP BY user_id
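One practical heuristic (a sketch of the idea, not something the post prescribes) is to inspect the execution plan: full table scans and temporary B-trees built for GROUP BY are typical markers of heavy analytical queries. The snippet below uses SQLite's `EXPLAIN QUERY PLAN` purely as a self-contained illustration; for PostgreSQL or MySQL you would parse their own `EXPLAIN` output instead.

```python
import sqlite3

# Plan fragments that usually indicate a heavy analytical query:
# an unindexed full scan, or a temporary B-tree built for GROUP BY / ORDER BY.
HEAVY_MARKERS = ("SCAN", "TEMP B-TREE")

def looks_heavy(conn: sqlite3.Connection, sql: str) -> bool:
    """Heuristic: the query counts as 'heavy' if its plan mentions any marker."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    detail = " ".join(row[3] for row in plan)  # 4th column holds the plan text
    return any(marker in detail for marker in HEAVY_MARKERS)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, ts TEXT)")
print(looks_heavy(conn, "SELECT user_id, count(*) FROM visits GROUP BY user_id"))
```

With no index on `visits`, the aggregation above produces a full-scan plan and is flagged; an indexed point lookup would not be.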

First of all, we need to define “verified / unverified SQL”. In other words, how do we tell that a particular SQL statement is “verified”?

To solve this task easily, just add a multi-line comment inside the SQL statement before all reviews and load tests, like below.

SELECT /*
          {status:verified, 
          hash: ebbac752d3b33dd5d311ed5b04d0c04f,
          domain: finance,
          ms: app_worker, 
          ...}*/ user_id, count(*)
  FROM visits
GROUP BY user_id

To be honest, that is a really useful trick on its own: you get a piece of user-defined metadata about each SQL query during high load (for example, to understand which microservice generates a lot of traffic).
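The check itself is easy to automate. Here is a minimal sketch; the parsing format follows the example above, but the hashing scheme (MD5 of the statement with the metadata comment stripped and whitespace normalized) is my assumption, since the post does not define how `hash` is computed.

```python
import hashlib
import re

# Matches the first /* ... */ comment in a statement.
COMMENT_RE = re.compile(r"/\*(.*?)\*/", re.DOTALL)

def extract_metadata(sql: str) -> dict:
    """Parse 'key: value' pairs from the leading metadata comment."""
    m = COMMENT_RE.search(sql)
    if not m:
        return {}
    return dict(re.findall(r"(\w+)\s*:\s*([\w.-]+)", m.group(1)))

def is_verified(sql: str) -> bool:
    """A statement is verified if its metadata says so and the hash matches.

    Assumption: 'hash' is the MD5 of the statement with the metadata comment
    removed and whitespace normalized -- the post leaves the scheme undefined.
    """
    meta = extract_metadata(sql)
    if meta.get("status") != "verified":
        return False
    normalized = " ".join(COMMENT_RE.sub("", sql).split())
    return meta.get("hash") == hashlib.md5(normalized.encode()).hexdigest()
```

A proxy in front of the database could then drop or deprioritize every statement for which `is_verified` returns `False`.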

Actually, there is another, almost “rocket-science” approach, which Marina implemented in her project. Below you can find the main aspects of her work.

  • Machine Learning
  • Monitoring of activity in Database
  • Comparison of each incoming “heavy” SQL statement against an ML model

Her method consists of the steps below.

The ML algorithm uses knowledge of the LLD (Local Lexemes Dictionary) and the GLD (Global Lexemes Dictionary).

  • The LLD is built from all keywords of the existing SQL statements in verified traffic
  • The GLD is the dictionary of all keywords of the particular relational database
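These two dictionaries can be sketched as follows. Note the nine-keyword GLD below is deliberately truncated to the keywords used in the examples later on; a real GLD would enumerate every keyword of the target database engine, and the regex tokenizer is a simplification of real SQL lexing.

```python
import re

# GLD: a (deliberately tiny) global dictionary of SQL keywords; a real one
# would cover the full keyword list of the specific database engine.
GLD = ["SELECT", "FROM", "INNER", "ON", "RIGHT", "LEFT", "JOIN", "WHERE", "AND"]

def keywords(sql: str) -> list[str]:
    """All GLD keywords occurring in the statement, duplicates kept, in order."""
    tokens = re.findall(r"[A-Za-z_]+", sql.upper())
    return [t for t in tokens if t in GLD]

def build_lld(verified_sqls: list[str]) -> set[str]:
    """LLD: every GLD keyword seen at least once in the verified traffic."""
    lld = set()
    for sql in verified_sqls:
        lld.update(keywords(sql))
    return lld
```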

Let's take a look at how to build the matrix of frequencies for particular SQL statements.

SQL1 =

SELECT * 
  FROM a INNER JOIN b ON a.id = b.id 
         LEFT JOIN c ON c.id = a.id  
 WHERE a.c2 = :c2 AND b.c1 = :c1

The number of keywords, counting duplicates, is 10:

{SELECT, FROM, INNER, JOIN, ON, LEFT, JOIN, ON, WHERE, AND}

Let's calculate the frequency of each GLD unit based on this particular query.

|      | SELECT | FROM | INNER | ON   | RIGHT | LEFT | JOIN | WHERE | AND  |
|------|--------|------|-------|------|-------|------|------|-------|------|
| SQL1 | 1/10   | 1/10 | 1/10  | 2/10 | 0     | 1/10 | 2/10 | 1/10  | 1/10 |

vector of frequencies for SQL1

SQL2 =

SELECT * FROM a 
         RIGHT JOIN b ON a.id = b.id

|      | SELECT | FROM | INNER | ON  | RIGHT | LEFT | JOIN | WHERE | AND |
|------|--------|------|-------|-----|-------|------|------|-------|-----|
| SQL2 | 1/5    | 1/5  | 0     | 1/5 | 1/5   | 0    | 1/5  | 0     | 0   |

vector of frequencies for SQL2 (SQL2 contains 5 keywords in total, so each frequency is out of 5)

And finally (after this preparation phase for each verified SQL statement) we can construct the common matrix of frequencies.

|      | SELECT | FROM | INNER | ON   | RIGHT | LEFT | JOIN | WHERE | AND  |
|------|--------|------|-------|------|-------|------|------|-------|------|
| SQL1 | 1/10   | 1/10 | 1/10  | 2/10 | 0     | 1/10 | 2/10 | 1/10  | 1/10 |
| SQL2 | 1/5    | 1/5  | 0     | 1/5  | 1/5   | 0    | 1/5  | 0     | 0    |
| …    | …      | …    | …     | …    | …     | …    | …    | …     | …    |
| SQLn | …      | …    | …     | …    | …     | …    | …    | …     | …    |

data matrix of frequencies
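Computing these vectors is mechanical. Here is a minimal sketch (the function names are mine, and keyword extraction is simplified to a regex over the nine keywords used in the examples):

```python
import re

GLD = ["SELECT", "FROM", "INNER", "ON", "RIGHT", "LEFT", "JOIN", "WHERE", "AND"]

def frequency_vector(sql: str) -> list[float]:
    """Share of each GLD keyword among all keywords of the statement."""
    tokens = [t for t in re.findall(r"[A-Za-z_]+", sql.upper()) if t in GLD]
    total = len(tokens) or 1  # guard against keyword-free input
    return [tokens.count(k) / total for k in GLD]

def frequency_matrix(sqls: list[str]) -> list[list[float]]:
    """One row per verified statement: the training data for the model."""
    return [frequency_vector(s) for s in sqls]
```

Running it over SQL1 and SQL2 reproduces the two rows of the matrix above.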

This data matrix is the input for training the ML model, and Marina used several algorithms to achieve the best results:

  • OCSVM
  • Isolation Forest
  • Elliptic Envelope
  • Local Outlier Factor
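All four are standard novelty/outlier-detection models (available in scikit-learn as OneClassSVM, IsolationForest, EllipticEnvelope and LocalOutlierFactor). As a dependency-free illustration of the underlying idea, and explicitly not Marina's actual models, here is a simple nearest-neighbour distance check over the frequency matrix; the 0.25 threshold is an arbitrary illustrative value.

```python
import math

def euclidean(u: list[float], v: list[float]) -> float:
    """Plain Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_outlier(vector: list[float],
               matrix: list[list[float]],
               threshold: float = 0.25) -> bool:
    """Flag a query whose frequency vector is far from every verified vector.

    A deliberately simple stand-in for the OCSVM / Isolation Forest / LOF
    models mentioned above; the threshold is an illustrative assumption.
    """
    nearest = min(euclidean(vector, row) for row in matrix)
    return nearest > threshold
```

An incoming statement's frequency vector is compared against the matrix of verified traffic; if it is far from everything seen before, it gets blocked or escalated for review.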
