OLTP. Compiling SQL Bindings.

 

#DDMoG, Vol. VI
[[This is Chapter 20(h) from “beta” Volume VI of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing”. All the content published during Beta Testing is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

One thing which causes quite a bit of unnecessary work for developers working with databases is SQL parameter binding. Having seen quite a few mid-size projects, I have become quite a fan of an approach which, while possibly an overkill in the beginning, will help a LOT in the long run (and will help to avoid quite a few nasty and completely unnecessary bugs too). I’m speaking about making a compiler to generate the bindings for your prepared SQL statements.

First of all, let’s postulate that

ALL SQL Statements used for OLTP MUST be prepared.

A very quick recap: a prepared statement is something along the lines of “SELECT MONEY FROM USERS WHERE USERID=?” (instead of the non-prepared “SELECT MONEY FROM USERS WHERE USERID=’NoBugs’” and so on). The idea here is to have statements prepared once – and then to use them as many times as needed, effectively replacing ‘?’ with the value we need right now. Oh, and BTW – even if your app is based on stored procedures, it doesn’t relieve you from using prepared statements (as you should call your stored procedures via prepared statements too).
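To make this concrete, here is a minimal sketch of “prepare once, execute many” in Python with mysql.connector (the connection parameters, the USERS table, and the depositMoney stored procedure are my own assumptions, for illustration only):

#Python/mysql.connector - prepare once, execute many (a sketch)
import mysql.connector

dbConn = mysql.connector.connect(user='app', database='oltp')  #hypothetical credentials
cur = dbConn.cursor(prepared=True)  #the statement is compiled on first execute

#re-executing the same statement text reuses the already-prepared plan:
for userId in ('alice', 'bob', 'carol'):
    cur.execute('SELECT MONEY FROM USERS WHERE USERID=%s', (userId,))
    print(cur.fetchone())

#stored procedures are no exception - call them via a prepared statement too:
cur.execute('CALL depositMoney(%s, %s)', ('alice', 100))
dbConn.commit()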

I don’t want to elaborate on the reasons why prepared statements are important (it has been discussed ad nauseam all over the Internet), but very shortly – there are two main reasons to use prepared statements (and I’m not sure which one is more important):

  • Prepared statements allow your RDBMS to compile your SQL into execution plan only once – and then re-use the plan again and again.
    • If (which is typical for OLTP) we’re speaking about millions of transactions per day over just a few hundred different SQL statements – compiling those statements a million times (instead of a few hundred times) will be a dramatic waste of resources.
    • Caching non-prepared statements on the RDBMS side (even if your RDBMS does it) rarely cuts it for OLTP (as the parameter values embedded within the SQL text differ from call to call, so the cache rarely hits).
    • Since MySQL 4.1, they’re faster even on MySQL [Zaitsev].
  • Prepared statements allow you to eliminate “SQL injection” attacks.
    • “SQL injection” is one thing you REALLY don’t want to happen. The point here is that if you form your SQL statement from a Client-provided string – you may easily become vulnerable to SQL injection. For good examples of what SQL injection can do to your DB, look, for example, at [Friedl]; in short – a simple SQL injection can lead to an attacker obtaining full control over your DB (both read and write).
    • While proper escaping (whatever it is for the DB you’re using) solves this problem too – keeping an eye on all your SQL statements to make sure that all of them use escaping is quite tiresome and, worse, error-prone (see the sketch right after this list).
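To illustrate the difference in a few lines (a sketch, using the same hypothetical USERS table and a mysql.connector-style cursor as in the Python example below):

#Python - SQL injection in a nutshell (a sketch)
#VULNERABLE: the value becomes part of the SQL text, so a userId of
#  "x' OR '1'='1" rewrites the query itself (and crafted inputs can do
#  far worse than reading all the rows):
cur.execute("SELECT MONEY FROM USERS WHERE USERID='" + userId + "'")  #DON'T!

#SAFE: with a bound parameter the value travels out-of-band,
#  and can never be reinterpreted as SQL:
cur.execute('SELECT MONEY FROM USERS WHERE USERID=%s', (userId,))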

It would be possible to elaborate on these two items on and on – but as “use prepared statements” is pretty much a consensus among serious database developers,1 I’d rather save my breath to argue about something less conventional ;-).


1 for example, [Darren] says “Always [use prepared statements]. 100% of the time, use it. Always; and even if you don’t need to use it. USE IT STILL.”  🙂

 

Binding Prepared Statements: Ugly and Even Uglier

Let’s look at the typical code which uses prepared statements (keeping in mind that for OLTP we clearly want to prepare the statement once, store it somewhere, and use it later, so there are two separate pieces of code – the first one preparing the statement and the second one using it). Let’s note that the exact code is not the point here: what I’m trying to illustrate is that binding code is full of mundane tedious details, and is outright ugly.

In Python (with mysql.connector) prepared statements may look like this:

#Python/mysql.connector
#SOMEWHERE IN INIT: PREPARING
preparedMoneyFromUsersByUserIdStmt = \
  'SELECT MONEY FROM USERS WHERE USERID=%s'
preparedMoneyFromUsersByUserIdCursor = \
  dbConn.cursor(prepared=True)

#FUNCTION TO USE IT:
def moneyFromUsersByUserId( userId ):
  preparedMoneyFromUsersByUserIdCursor.execute(
    preparedMoneyFromUsersByUserIdStmt, (userId,))
  row = preparedMoneyFromUsersByUserIdCursor.fetchone()
  if row is None:
    raise EmptyResultSet()
  ret = row[0]
  row = preparedMoneyFromUsersByUserIdCursor.fetchone()
  assert row is None
  #assuming that USERID is a PRIMARY KEY
  #   - cannot possibly happen
  return ret

In Java/JDBC the same prepared statement will probably look along the following lines:

//Java/JDBC:
//SOMEWHERE IN INIT: PREPARING
preparedMoneyFromUsersByUserId =
  dbConn.prepareStatement(
    "SELECT MONEY FROM USERS WHERE USERID=?");

//FUNCTION TO USE IT:
public int moneyFromUsersByUserId(String userId) {
  preparedMoneyFromUsersByUserId.setString(1,userId);
  ResultSet rs =
    preparedMoneyFromUsersByUserId.executeQuery();
  if(!rs.next())
    throw new EmptyResultSet();
  int ret = rs.getInt(1);
  assert !rs.next();
  //assuming that USERID is a PRIMARY KEY
  //   - cannot possibly happen
  return ret;
}

And the pinnacle of ugly prepared statement code is certainly C/ODBC (the same goes for C/CLI):

//C/ODBC:
//SOMEWHERE IN INIT: PREPARING
SQLRETURN rc = SQLPrepare(preparedMoneyFromUsersByUserId,
  (SQLCHAR*)"SELECT MONEY FROM USERS WHERE USERID=?", SQL_NTS);
throwIfNotSuccess(rc);

//FUNCTION TO USE IT:
int moneyFromUsersByUserId(const char* userId) {
  SQLRETURN rc = SQLBindParameter(
    preparedMoneyFromUsersByUserId, 1, SQL_PARAM_INPUT,
    SQL_C_CHAR, SQL_VARCHAR, USERID_COLUMN_SIZE, 0,
    (SQLPOINTER)userId, strlen(userId), NULL
  );
  throwIfNotSuccess(rc);

  SQLINTEGER money;
  rc = SQLBindCol(preparedMoneyFromUsersByUserId, 1,
       SQL_C_LONG, &money, sizeof(money), NULL);
  throwIfNotSuccess(rc);

  rc = SQLExecute(preparedMoneyFromUsersByUserId);
  throwIfNotSuccess(rc);

  rc = SQLFetch(preparedMoneyFromUsersByUserId);
  throwIfNotSuccess(rc);//including SQL_NO_DATA

  int ret = money;
  assert( ret == money );//in case if truncation has happened
  rc = SQLFetch(preparedMoneyFromUsersByUserId);
  assert( rc == SQL_NO_DATA );

  //assuming that USERID is a PRIMARY KEY
  //   - cannot possibly happen
  return ret;
}

As we can see, all these pieces of code are ugly (the C/ODBC one being “extremely ugly”) and very tedious to write (imagine writing it 200 times for your 200 prepared statements) – but fortunately, they are very mechanical, so they can be generated from the SQL statements (and the DB structure) rather easily. As a result, here goes the idea of…

Compiling SQL bindings

Yacc is a… a Look Ahead Left-to-Right (LALR) parser generator, generating a… LALR parser, based on an analytic grammar written in a notation similar to Backus–Naur Form (BNF). — Wikipedia

As we’ve seen above – the code in SQL bindings ranges from “rather ugly” to “extremely ugly”, and as soon as you need a few dozen of these, there is value in creating a kinda-compiler which would take some descriptions and generate the code above (modified to fit your specific needs). To do it, we’ll need to make some kind of “SQL description file” which will describe our system.

First of all, let’s note that SQL grammar is a LALR(1) one, and there is even a YACC grammar for it (see, for example, [NdYaccLexTool], though I suggest writing your own grammar – supporting only the stuff you really need at the moment – as you write your SQL bindings compiler). This means that parsing SQL itself is simple (it will be done by YACC), so the only thing we really need to create our “SQL bindings compiler” is to make the rest of our “SQL description file” LALR(1) too – which is not a problem.
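To show just how little grammar is needed at first, here is a minimal self-contained sketch using PLY (a Python implementation of lex/yacc); it covers only single-table “SELECT … WHERE col=?” statements, and all token and rule names here are my own assumptions, to be grown as the need arises:

#Python/PLY - a deliberately tiny SQL-subset grammar (a sketch)
import ply.lex as lex
import ply.yacc as yacc

tokens = ('SELECT', 'FROM', 'WHERE', 'IDENT', 'COMMA', 'EQ', 'PARAM')

t_COMMA = r','
t_EQ = r'='
t_PARAM = r'\?'
t_ignore = ' \t\n'

keywords = {'SELECT': 'SELECT', 'FROM': 'FROM', 'WHERE': 'WHERE'}

def t_IDENT(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    t.type = keywords.get(t.value.upper(), 'IDENT')  #keywords win over IDENT
    return t

def t_error(t):
    raise SyntaxError('unexpected character %r' % t.value[0])

def p_select(p):
    'select : SELECT columns FROM IDENT WHERE IDENT EQ PARAM'
    p[0] = {'columns': p[2], 'table': p[4], 'where': [p[6]]}

def p_columns_one(p):
    'columns : IDENT'
    p[0] = [p[1]]

def p_columns_many(p):
    'columns : columns COMMA IDENT'
    p[0] = p[1] + [p[3]]

def p_error(p):
    raise SyntaxError('parse error near %r' % (p,))

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('SELECT MONEY FROM USERS WHERE USERID=?', lexer=lexer))
#-> {'columns': ['MONEY'], 'table': 'USERS', 'where': ['USERID']}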

For example, let’s consider the following “SQL description file”:

#SQL DESCRIPTION FILE FOR WHATEVER-OUR-PROJECT-IS
TABLE {
  SQL "CREATE TABLE users (userId VARCHAR(30),
                           money INTEGER,
                           PRIMARY KEY(userId)
                          )"
};

STMT {
  SQL "SELECT MONEY FROM USERS WHERE USERID=?" UNIQUE,
  GENERATE FUNCTION NAME moneyFromUsersByUserId
};

Of course, the exact syntax beyond SQL can be pretty much anything you like; my point here is to demonstrate that all those messy bindings can be generated from this simple “SQL description file”.
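And to make the “can be generated” claim concrete – here is a hedged sketch of what a Python back-end of such a compiler might emit for the STMT declaration above (the emitted function mirrors the hand-written mysql.connector example earlier; EmptyResultSet and the single-column assumption are carried over from there):

#Python - code-generation step: from STMT declaration to binding source (a sketch)
TEMPLATE = '''\
def {func}(cursor, {params}):
    cursor.execute({sql!r}, ({params},))
    row = cursor.fetchone()
    if row is None:
        raise EmptyResultSet()
    assert cursor.fetchone() is None  #UNIQUE: at most one row can match
    return row[0]  #single-column SELECT assumed, as in the example above
'''

def emit_binding(func_name, sql):
    sql = sql.replace('?', '%s')  #mysql.connector placeholder style
    params = ', '.join('p%d' % i for i in range(1, sql.count('%s') + 1))
    return TEMPLATE.format(func=func_name, params=params, sql=sql)

print(emit_binding('moneyFromUsersByUserId',
                   'SELECT MONEY FROM USERS WHERE USERID=?'))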

Note the TABLE statement in the “SQL description file” above; it is intended to provide a description of the tables of your DB – which in turn is necessary to figure out the column types for bindings.

A few notes:

  • I am advocating for our bindings compiler to enforce that all the tables have a PRIMARY KEY. Tables without a PRIMARY KEY shouldn’t be used for many reasons (starting with violating 1NF); this is such common knowledge that I don’t even want to discuss it here.
  • I am also a proponent of prohibiting “SELECT *” statements in this file; all the fields in OLTP SQL MUST be explicitly specified.
  • If you need to read multiple values – returning a tuple from your generated function is probably the best option (and is possible even in C++11).
  • Strictly speaking, the UNIQUE attribute is optional, but I still prefer to have it as an explicit one; also, our SQL bindings compiler SHOULD check that the DB itself enforces uniqueness before relying on it (in the example above it does, because USERID is the PRIMARY KEY) – see the sketch right below.
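A sketch of that uniqueness check might look as follows (the schema dict would be built from the TABLE declarations, and extracting the WHERE columns from the parsed SELECT is elided here):

#Python - UNIQUE sanity check (a sketch)
def check_unique(schema, table, where_columns):
    #UNIQUE is only accepted if the WHERE columns cover a key which
    #  the DB itself enforces (here: the declared PRIMARY KEY)
    pk = {c.lower() for c in schema[table]['primary_key']}
    if not pk <= {c.lower() for c in where_columns}:
        raise ValueError('%s: UNIQUE declared, but the DB does not enforce it'
                         % table)

#usage, with the schema coming from the TABLE declaration above:
schema = {'users': {'primary_key': ['userId']}}
check_unique(schema, 'users', ['USERID'])  #OK: USERID is the PRIMARY KEY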

Implementation Complexity

Of course, when undertaking an effort such as the one above – we need to estimate how long it will take to implement it. I would say that for me personally (having quite a bit of experience with YACC), a very simple compiler would take (including testing) about one working week. Once again – a compiler written this way will be very simple and able to handle only the stuff above (in particular, only a few basic types – additional ones can be added later) – with an intention to extend it later, as the need arises. Of course, if you’re not experienced with YACC – it will take longer; however, if doing it along the lines discussed in Chapter [[TODO]] – even for a not-so-experienced-in-YACC (but experienced otherwise) developer this very simple SQL bindings compiler shouldn’t take more than two to three weeks.

Sure, developing such a compiler is a significant effort. On the other hand, it is pretty much a one-time effort – and as soon as it is done, all other SQL-related development will be sped up significantly (and make sure to check the “Additional Goodies” section below).

A real-world story in this regard. Once upon a time, I observed the largest C++ file in my career – it was a 30’000-line file(!) consisting merely of ODBC bindings (and that was for just 300 or so SQL statements). As you can imagine, maintaining this file was quite a big pain in the …ahem… neck. It was at that point that I started to think about a way to generate all this stuff from a few hundred lines of “SQL description file”.

For the way I’d implement this kind of kinda-compiler, see the discussion in Chapter XII (while Chapter XII discussed an IDL compiler, the ideas behind IDL and SQL-bindings compilers are very similar).

Contrast with Code-First ORM

Object-relational mapping (ORM, O/RM, and O/R mapping tool) is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a ‘virtual object database’ that can be used from within the programming language. — Wikipedia

It should be noted that a “SQL bindings compiler” is not an ORM (at least not a “code-first” ORM which takes code and generates SQL out of it). In particular, it doesn’t try to use some “magic” to generate tables and SQL – and all the SQL is still under complete control, allowing for reasonably efficient implementations.

Instead, such a “bindings compiler” simply performs the mundane work of generating SQL bindings – and for this kind of obvious-but-tedious work, automated programs are The Way To Go™. In other words – unlike an ORM, a “SQL bindings compiler” does NOT attempt to write SQL for you; instead, it just takes manually-written SQL and generates code out of it.

Separation of Concerns

One important (and not-so-obvious) benefit of a SQL bindings compiler is that it provides a rather clean separation of concerns between SQL and the program-using-SQL. Moreover, this separation goes along the lines of responsibilities in quite a few teams.

More specifically – in many teams it is common that DBAs are responsible for SQL, and developers are responsible for using whatever-SQL-DBAs-have-provided. However, with usual prepared statements the SQL is buried within the code, which tends to lead to rather awkward and time-consuming interactions between DBAs and developers (it works, but slower than it can/should). With an “SQL description file” the situation improves significantly: all the SQL statements are gathered in one single place (without being scattered all over the code) – and DBAs can easily adjust them as necessary (to optimize performance, to denormalize certain things, to add/extend fields, etc. etc.) even without knowledge of the programming language in question.

[[TODO: generating stored procedures]]

Additional Goodies

Last but not least: while everything I’ve said above is correct, and the bindings themselves are indeed a sufficient reason to switch to such a compiler, actually I love such compilers not just because of the bindings as such, but also because they allow introducing additional features 🙂 . Moreover, these features can be added without modifying the code-in-the-vicinity-of-SQL-statements manually. Examples of such goodies include (note that some of them will require adding new keywords to the file above, but this should be trivial as you already control the grammar):

  • Comparing the real DB structure to the structure declared in the “SQL description file”
    • As a result – it becomes possible to auto-generate DDL (data definition language – the subset of SQL used for defining data structures, especially database schemas) statements to upgrade the DB from “whatever-state-it-has-now” to the state in the “SQL description file”. While we’re at it – normally, for OLTP purposes, generating such upgrade SQL should fail if it goes beyond adding/widening columns (a minimal sketch of such a schema diff follows this list).
  • Auto-generated audit tables. Most of the time, it will be sufficient to specify that you want a table to be audited (and the type of audit you want) – and bingo! You’ve got your audit table at almost-zero development cost(!) (and exactly to your liking too).
  • App-level replication (for single-write-connection DBs). We will discuss app-level async replication in Vol. 3 (tentatively Chapter [[TODO]]), but for now let’s note that the most reliable way of implementing it is via capturing UPDATE/INSERT SQL statements, and forwarding the information from them to the replica. A perfect job for a SQL bindings compiler (my apologies for sounding like a commercial).
    • Answering the question “why in the hell you’d want to do it yourself”:
      • first, it will work (unlike quite a few not-really-working RDBMS-provided solutions out there);
      • second, such app-level replication can easily run between different types of RDBMS (and this is one thing you can easily want; as discussed in section [[TODO]] above, running the OLTP DB on DB/2 or Oracle while running replicas on Postgres can be a very good combination to achieve extremely solid processing while limiting licensing costs); moreover – with some effort you can even replicate between SQL and NoSQL (and you may want to do so for quite a few reasons, in particular for analytics);
      • and last but not least – it will allow modifying the data along the replication path (for example – re-normalizing audit-like data so that it becomes more suitable for querying). More on app-level replication and its benefits is in Vol. 3 (tentatively Chapter [[TODO]]).
  • App-level caches and AUTOINCREMENT fields for single-write-connection DBs. For single-write-connection DB architectures, it is often more efficient to implement autoincrement at app level – and the relevant code can be generated easily too. The same goes for app-level caches.
  • I’m pretty sure that I forgot something important here – but you hopefully already got the idea 🙂 .
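For instance, the first goodie from the list above (schema comparison with auto-generated upgrade DDL) might be sketched as follows. Here live_columns would come from the system catalog and declared_columns from the TABLE declarations; both are simplified to {column_name: sql_type} dicts, and “widening” checks are reduced to refusing type changes – all of this is my own assumption, for illustration only:

#Python - schema diff: declared state vs. live DB (a sketch)
def upgrade_ddl(table, live_columns, declared_columns):
    dropped = set(live_columns) - set(declared_columns)
    if dropped:  #going beyond adding/widening columns - fail, as argued above
        raise ValueError('refusing to drop columns: %s' % sorted(dropped))
    stmts = []
    for name, sqltype in declared_columns.items():
        if name not in live_columns:
            stmts.append('ALTER TABLE %s ADD COLUMN %s %s'
                         % (table, name, sqltype))
        elif live_columns[name] != sqltype:
            raise ValueError('type change of %s.%s needs manual review'
                             % (table, name))
    return stmts

#usage: the live table lacks the money column, so an ALTER is generated
print(upgrade_ddl('users',
                  {'userId': 'VARCHAR(30)'},
                  {'userId': 'VARCHAR(30)', 'money': 'INTEGER'}))
#-> ['ALTER TABLE users ADD COLUMN money INTEGER']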

Summary

I hope that I’ve managed to convince you that if you’re at the start of a project which will eventually need a few hundred OLTP statements (and these statements will need to be prepared – there is no way around it), it makes perfect sense to start development by writing your own (and very basic at this point) SQL bindings compiler. In the long run it will save you lots of time, and will provide lots of other benefits too (while you may not need them right away – chances are that sooner or later you will).

[[To Be Continued…

Tired hare:This concludes beta Chapter 20(h) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter 20(i), where we’ll discuss very basics of database organization (for dummies)]]


References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Comments

  1. Wanderer says

    Hmm… I find this chapter a bit contradictory. Initially, you are saying that this approach is “not-ORM”. At the same time, it looks like one to me – especially taking into account the usual “migrations” concept, DDL, and auto-generated tables, all described in “Additional Goodies”.

    It’s just not a “code-first” ORM approach, which is a popular one, but just one way to do it. And neither it’s “database-first” (like “update EDMX from database” in .NET Entity Framework), but something like “SQL-first”.

    At the same time, it seems that you’re proposing manually written game entities/classes and manually written SQL queries and tables. So, the only generated stuff is the binding boilerplate code, right?

    If it’s so, then there will be 3 independent sources that should match each other – source code of game classes, SQL queries, and DDL for bindings. Do I get it correctly?

    • "No Bugs" Hare says

      About “whether it is ORM or not ORM” – for any vague enough term, there will be somebody who will interpret it in a different manner :-). Those ORMs I’ve seen (and which are despised by DBAs, and for a good reason too) were universally “code-first”. Can you refer to an existing ORM which is “SQL-first”?

      > So, the only generated stuff is the binding boilerplate code, right?

      Yes, this is basically the point. Boilerplate is bad – while control over SQL is good :-).

      > If it’s so, then there will be 3 independent sources that should match each other – source code of game classes, SQL queries, and DDL for bindings. Do I get it correctly?

      I’d say there are two sources: (a) SQL (which includes both DML and DDL, and is managed by DBAs; of course, DML and DDL need to be coherent, but as long as it is in the same hands and it is a very common requirement, it is never a real problem), and (b) functions (rather than objects) to be used within the game classes to manipulate DBs (and these represent a well-defined API provided by “SQL layer” and “app layer”). Overall, I do not like thinking about DB in terms of manipulating “objects” (it tends to lead to very ugly solutions way too often); instead, I clearly prefer to define stateless atomic operations (rather than stateful classes) as an API to the OLTP DB – and then to build potentially stateful Game Objects on top of these stateless atomic operations.

      • Wanderer says

        Regarding the “SQL-first” approach, the only thing that comes to my mind (while being a bit of a different beast, actually) is MyBatis. Never worked with it on a real commercial project, though. At the same time, I suspect it’s “that cute” only because of reflection goodies, and there is no way to do anything similar in C++ without code generators.

        As for “DB-to-objects”, I think your point is clear as soon as you say “Event sourcing” (or CQRS). And I’m pretty much with you here. Moreover, as we already discussed somewhere on your site, most ORMs out there have the Active Record approach. And those are despised not only by DBAs, but by programmers too, because more often than not you end up with those POCOs/PODs that auto-magically persist as soon as you do commit(). In other words, those push you to have a bunch of dumb “C-struct”-like objects as the core of business logic.

        • "No Bugs" Hare says

          Thanks for the ref to MyBatis; it indeed has significant similarities, but they don’t seem to name themselves an ORM, which seems to reinforce my point that this kind of “SQL-first” thing is not usually named an ORM. As for its “cuteness” – TBH, I would take an offline IDL compiler (with an ability to check the generated code) over SqlSessionFactoryBuilder (a Builder to build Factories which then build Objects, ouch!) which reads XML at run-time, any time of the day :-).

          > those push you to have a bunch of dumb “C-struct” like objects as the core of business logic.

          Exactly; while POD/POJO stuff is pretty much inevitable for data transfers/storing, pushing it “as is” into your core logic rarely qualifies as a good idea. I’d say that POD/POJO used for transfers/storing should be just one of many “projections” of the complicated core logic.
