OLTP DB Optimizations 101 – Thinking in Terms of Execution Plans

Author:  Follow: TwitterFacebook
Job Title:Sarcastic Architect
Hobbies:Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
DB Optimizations: Execution Plans

[rabbit_ddmog vol=”7″ chap=”Chapter 27(a) from “beta” Volume VII”]

After you have launched your game and it got successful – the first thing you’ll likely need to deal with as a result of your success1 is optimizing your OLTP database.

Hare wondering if you are crazy:hey, our database will scale, you'll just need to add more hardwareAs it was discussed at length in Vol. III’s chapter on Scalability, if you’re doing everything properly – your OLTP DB is extremely likely to become The Bottleneck(tm) of your system. Everything else can be usually scaled more or less easily (with one exception being scaling seamless Game Worlds – see Vol. III for discussion), but making sure your real-world database works as intended – is going to be a biiig ongoing problem. This stands regardless of whatever-your-DB-sales-person-will-say (“hey, our database will scale, you’ll just need to add more hardware”2), and of observations such as “hey, I can easily make 1000 updates/second on my $1K laptop, so Almighty Powerful $50K Server should be able to do 50x more than that easily”.3

1 that is, IF you followed the advice in this book, and got your other components such as Game Worlds scalable without problems
2 for most of real-world OLTP payloads no database scales even in a close-to-linear fashion
3 the whole thing is about WHICH transactions we’re speaking about, and HOW they’re interacting with each other. While obtaining high number of transactions in artificial tests is easy, making it to real-world transactions is a very-well-known Huge Headache; also we need to keep in mind that these days performance-wise servers are not that much more powerful than laptops, especially if speaking about per-core performance (which tends to dominate real-world OLTP loads due to inter-transaction interactions), and ignore stuff such as BBWC RAID or NVMe. What we’re paying for with $50K servers – is actually not performance, but improved reliability (for parallelizable performance/$ – $10K server will beat $50K almost for sure, and $1K-laptop will beat $10K server).



The key [1NF], the whole key [2NF], and nothing but the key [3NF], so save me Codd

— Unknown —

As it says on the tin, in this Chapter we’ll be speaking about OLTP performance – the one with LOTS of relatively-small update transactions. Optimizing read-intensive databases such as reporting DBs and Big Data analytical DBs is quite a separate art – and quite a few things we’ll be speaking about here, won’t apply to them. Sure, there is quite a bit of reads in OLTP DB too – but with OLTP it is like 50-50 split4 between reads and writes, while for analytical DBs it can easily be like 1% writes – 99% reads, which changes the optimization landscape radically.

3NF Third normal form is a normal form which ensures that... (1) the entity is in second normal form, and (2) all the attributes in a table are determined only by the candidate keys of that relation and not by any non-prime attributes.— Wikipedia —In addition – we’ll assume that you already know what you’re doing, and have done your homework with regards to your database:

  • Your DB is 3NF (ok, in some cases 2NF will do, but you REALLY shouldn’t go lower than that)
    • In particular, it implies that all your tables have PRIMARY KEY (which in turn can be used as a unique identifier for every row).
      • Whether to use surrogate keys vs “natural” keys – is not that obvious; personally, I tend to decide it on a case-by-case basis. Keep in mind though that surrogate keys DO have their significant merits (at least for those tables which are NOT junction tables a.k.a. associative tables) – especially if you have a table where your “natural” key is going to change. I have to admit that in quite a large system built quite a long ago, one of my biggest mistakes was to go for user-visible userId being a PK for almighty USERS table (given the assurance that users will be NEVER EVER renamed); of course – after some time NEVER EVER was relaxed to “almost NEVER EVER”, which has caused quite a bit of trouble with using “natural” keys (not fatal, but quite unpleasant).
  • ORM Object-relational mapping (ORM, O/RM, and O/R mapping tool) in computer science is a programming technique for converting data between incompatible type systems using object-oriented programming languages. This creates, in effect, a 'virtual object database' that can be used from within the programming language.— Wikipedia —You DO control your SQL (rather than relying on a 3rd-party tool such as some-magic-ORM to do it for you5)
  • Your DB transactions are aligned with business transactions (i.e. there is no such DB transaction as “take money from user A”, and “give money to user B” intended to be called one after another – but there is one transaction “transfer money from user A to user B”)6
    • Another way to see it – is to say that “database transactions are sufficient to ensure database integrity in whatever-manner-business defines it”
  • You don’t do silly things (such as SELECT *) [[TODO: what else?]]
  • You are NOT using temp tables (if you think you need them for your OLTP DB – think twice, and then throw them away regardless).
  • You DO use prepared statements7

4 actually, can be anywhere between 20-80 and 80-20
5 in theory, there can be ORMs which work reasonably fast. In practice – yet to see any such ORM
6 well, this one is actually not directly related to optimizations, but it is that important, that I’d better re-iterate it once more <wink />
7 BTW, using stored procedures usually doesn’t spare you from using prepared statements


RDBMS Under The Hood

One single most important thing when trying to optimize your DB query – is not to “use smart tools which will do everything for you” (they won’t), but rather to

Understand how your query is going to be executed

And as soon as we understand what it going on under the hood of our RDBMS – we can start making sense of your queries not only in terms of what they return, but also in terms of when they return it.

Pages, The Whole Pages, and Nothing but Pages, so Save Me <my-RBDMS-here>

The very first thing we need to realize on this way – is that

Both the data and indexes are stored in “database pages”

The reason for it is simple: as costs of seeking data is huge – it makes perfect sense to store data in page-size chunks, and read the whole page; it also simplifies modifications etc. As a very rough rule of thumb – page size usually happens to be somewhere between 8K and 64K (but usually you can override it at least for the whole database).

Hare pointing out:for most of the loads, costs of reading DB pages from disk tend to dominate query execution costsMore importantly for us now – we can say that for most of the loads, costs of reading DB pages from disk tend to dominate query execution costs (for OLTP, there are also very significant costs of syncing DB log file, but we cannot do much about it <sad-face /> – except for using BBWC RAID and/or NVMe which was discussed earlier). These page-reading costs are so important that I’d say that for a very rough estimate, we can even say that time taken by a DB query, can be estimated by counting number of DB pages we’re going to access while executing our queries. Sure, different disks can have different access times, and even more importantly – different pages have different chances of being cached (in particular, root pages of index are much more likely to be cached even with a DB-which-relies-on-filesystem-to-perform-caching, more on caching index pages below).

Moreover, even if our whole DB is cached (which, as it was discussed in [[TODO]] chapter, we should aim for OLTP DBs) – this estimate still tends to stand as a reasonably good first estimate. Even when the page is accessed from RAM – there is still quite a bit of work involved (parsing it etc.) – so we’re not that far off with assuming that number of accessed pages is a reasonable metric to understand performance. Sure, we cannot say that “a query accessing 6 pages will outperform another query accessing 7 pages” , but for queries accessing 1 page and accessing 100 pages, the latter will lose pretty much for sure.

As a result – for our very first estimates we’ll use this “number of accessed pages” metric to get an idea of relative performance of different approaches (and in practice – most of the time it won’t be too far off).

Thinking in Terms of Execution Plan

The second thing we need to do to realize how our queries are working –

we need to start thinking in terms of so-called “execution plans”.
Execution plan (a.k.a. query plan, or query execution plan) explains not what you’ll get (which is stipulated by SQL), but how database will get your results. This how is exactly what we need to understand when speaking about query performance.

The very first thing you need to use on this way – is a database-provided SQL statement such as EXPLAIN (or EXPLAIN PLAN), which will provide you with information about the plan. However, for quite a few DBs output of the EXPLAIN as such is outright ugly (or is stored into tables which are outright ugly), so you may need to use a 3rd-party tool to visualize results of EXPLAIN. This is fine – provided that you’re using the tool which faithfully visualizes the data, and does not try to do any smart things such as suggesting indexes to improve (more on such index-suggesting tools below).

As soon as you got your execution plan in a visible form – you will see what exactly your DB does to execute your SQL query. Within the plan, you’ll see several different primitives – which essentially define what exactly your DB is going to do with the data. Each of the primitives takes certain table (or a dataset-coming-from-another-primitive) – and produces a dataset. As a result – in a general case, execution plan is essentially a tree with tables and indexes as leaves, and then datasets from the tables and leaves combined-using-primitives into datasets, which are further combined in further datasets, and so on until we get one final dataset representing the result of our SQL query.

We’ll see some examples of real-world execution plans later, but for now we need to define what are those mysterious primitives they’ll be using. The list of primitives varies a bit from one DBMS to another, but there are lots of things which are the same across the board:

  • Table Scan. When performing a table scan, your RDBMS simply goes over all the rows in the specific table, running your WHERE clause over each row, and filtering out all the records which didn’t match. This what happens by default if you don’t have any indexes on your table – or if DB doesn’t see a suitable index for your query. For a 100-row table it might be even fine, but for a table with a billion rows – chances are that you won’t live for long enough to get results of your Table Scan <sad-face />
    • Judging hare:for any sizeable table, avoid Table Scans like plagueNumber of DB pages you’ll need to access during Table Scan – is the whole size of your table (which can be in hundreds of gigabytes very easily even for an OLTP DB). Which translates into “for any sizeable table, avoid Table Scans like plague”
  • Index Scan. It is pretty obvious that Index Scan has something to do with indexes. In general – there are at least three distinct kinds of Index Scan (though some DBs will present some of them as the same Index Scan with different restrictions):
    • Unique Index Scan (a.k.a. Index Search). Unique Index Scan happens when your DBMS knows for sure that there is at most one row satisfying your WHERE clause. For your DBMS being able to use Unique Index Scan – your query should have WHERE clause such as WHERE ID=?, and your table should have a UNIQUE index (or PK)8 on ID.Both B-tree indexes9 and hash indexes will work for Unique Index Scan.
      • Unique Index Scans are certainly the very best case for performance; as a rule of thumb – they need to read only SearchIx+1 pages: SearchIx pages from index, plus one page with actual data; for a discussion on SearchIx – see below.
    • Range Index Scan. Index scan starting from value X ending with value Y (with both X and Y being optional). Applicable to clauses such as WHERE VAL BETWEEN X AND Y, WHERE VAL >= X, and WHERE VAL <= Y10.
      • Note that for these clauses to use index, we need a B-tree index on X ASCENDING (or a bi-directional one); and for the clause such as WHERE VAL <=X, we need index on X DESCENDING (or once again – a bi-directional one).
      • If both X and Y are not specified, we’ll get so-called Full Index Scan, but it is usually Damn Expensive(tm)11, and is rarely used in practice at least for OLTP loads.
      • If the range is relatively small (i.e. fetches only a few rows) – Range Index Scan works reasonably well, though it is usually difficult to ensure in advance that the range is indeed small enough. Number of pages accessed during Range Index Scan, is very roughly SearchIx+ScanIx+number_of_returned_rows; here ScanIx is a number of pages involved in scan (and is roughly equal to number_of_scanned_rows/number_of_index_entries_per_page, see also calculation of K for index pages in [[TODO]] section below), and  number_of_returned_rows is necessary to account for fetching the data from the table to return it to the application; for large tables and relatively small returned sets – we’ll end up with reading one page per each returned row.
    • Hare thumb up:The reason for LIMIT option is simple: it allows to speed things up, sometimes by orders of magnitudeRange Index Scan With a LIMIT. In SQL, there is a not-so-well-known LIMIT clause (a.k.a. FETCH FIRST, SELECT TOP, and WHERE ROWNUM <=). The idea of LIMIT clause is to limit the number of rows returned by query; and while this option is non-standard – it is present for all the RBDMS-I-know (albeit with a very different syntax). The reason for LIMIT option is simple: it allows to speed things up, sometimes by orders of magnitude (the most popular example for this clause to be efficient – is for queries such as “gimme last N games played by player P”; we’ll discuss a real-world example with real-world gains observed in [[TODO]] section below).
      • all the limitations of the unLIMITed Range Index Scan (such as ASCENDING/DESCENDING or bi-directional) still apply
      • A related option is OFFSET – and it MAY speed things up (mostly due to reduction of number of table pages RDBMS needs to fetch), but for OFFSET we’re usually speaking about gains of up to 2x-5x (while LIMIT can get up to 100x quite easily).
  • Sort. Sort is just as it says on the tin – it takes an unsorted dataset, and gets it sorted (in-memory, if the set is small enough).
    • As a Big Fat Rule of Thumb(tm), sorts are to be avoided for OLTP loads at all costs. The only exception is ultra-small sets (including ultra-small lookup tables) consisting of <1000 rows or so, but even then it is better to avoid sorting if possible (and also we need to be extra-careful to be 100% sure that we won’t ever try to sort a multi-million-row set, which can easily have devastating results).
    • Note that Sort is not only necessary when you specify ORDER BY clause in your SQL; even more importantly – Sorts can be used to facilitate Sort-Merge Joins (more on them below).
  • Joins. As SQL is all about relations and JOINs12 – it means that we need to have a way to implement a JOIN (or an equivalent SELECT with sub-SELECT) efficiently. In this regard, most common are the following types of execution-plan-level joins (which should not to be confused with SQL-level JOINs):
    • Nested Loop Join. Nested Loop Join is the simplest form of execution-plan-level join. The idea is simple: we’re going over first table (more generally – over a first dataset) – and performing lookups in the 2nd table for each of the rows in the 1st table/dataset.
      • Hare with hopeless face:And if we need to perform a Table Scan on the 2nd table within the Nested Loop - well, we'd rather never ever do itPerformance-wise – nested loop joins are killers <sad-face />. Even if the 2nd table has a unique index on the column-we’re-using-for-join (so within the loop we’re able to use Unique Index Scan) – it still tends to be very expensive. And if we need to perform a Table Scan on the 2nd table within the Nested Loop – well, we’d rather never ever do it (except, maybe, for certain tables which fit into one single database page).
    • Sort Merge Join (a.k.a. Merge Scan Join). The idea of Sort-Merge Join is simple – we have two sorted sets, and then we’re joining them into one single sorted set (just as we’d merge two sorted lists).
      • Sort Merge Join is very efficient – as long as it doesn’t require separate Sort operation (i.e. if both subsets are already sorted in the desired manner).
      • Note that to get sorted subset, quite often it is not necessary to use separate sort. Instead – optimizer knows that our usual B-tree-based indexes are already returning sorted sets during Index Range Scans (note that Table Scans do NOT have this property – that is, unless we’re speaking about clustered indexes).
    • Hash Join. If we’re joining two datasets, one of which is small, and another one is large – we can easily take smaller one, make an in-memory hash table out of it, and then go through the larger set and query hash table for each of the rows in the larger set.
      • Hash Joins are quite efficient – as long as one of the datasets involved is really small (as a rule of thumb, for OLTP it should be at most in thousands of rows).

8 or a composite index starting with an ID, more on composite indexes below
9 BTW, for our purposes, the difference between B+-tree and B-tree is negligible, and we’ll ignore it, so whenever I’m saying “B-tree index” it should be understood as “B-tree index or B+-tree index”
10 strict inequations will work too
11 unless all the data you need is stored within the index, without the need to access the table itself
12 well, for OLTP is not that much the case as for the reporting DBs, but quite a bit of joins still exist even in OLTP SQL


On cost of Index Search

When speaking about the cost of Index Search (denoted above as SearchIx) – there are several things to be kept in mind:

  • Most of the indexes out there, are B-tree indexes. Which means that we have a root index page with K index values; very roughly – K ~= page_size/(per_entry_overhead+size_of_indexed_values), with per_entry_overhead being of the order of 20-30 bytes. In practice – it means that values of K are usually around several hundreds.
    • Each of the entries in the root index page, leads to ~K other pages, and so on.
    • In turn – it means that for B-tree index with K=100, 2 levels of the tree can represent roughly K^2 = 1e4 values, 3 levels of the tree – K^3 = 1e6 values, and 5 levels of the tree – can represent 1e10 (that’s 10 billion) values, which is likely to be sufficient for quote a long while.
      • Wtf hare:you're quite unlikely to see an index with number of B-tree levels exceeding 5-6In practice, it doesn’t only mean that index search costs grow only logarithmic with number of rows (it is expected from the tree), but it additionally means that you’re quite unlikely to see an index with number of B-tree levels exceeding 5-6; this, in turn, tends to make index searches Damn Fast(tm).
  • At least if speaking about B-tree indexes, root pages (and actually, level-2 pages) tend to be cached most-of-the-time, so accessing them is going to be rather fast. This tends to work better if DB uses its own caching – but even for those DBs which rely on filesystem, it is usually still not too bad.13 NB: this argument mostly disappears if we’re speaking about fully-cached DB, which IMO every OLTP should be.

On Cost Estimates and Cost-Based Optimization

When looking at your execution plan – you’ll usually see not only the plan itself, but also some kind of cost estimate (expressed in unspecified cost units, “timerons”, or “disk fetches”); also usually there are estimates of number of rows expected to be obtained from different stages of the execution plan. There is an all-important observation which applies to all such cost and row estimates:

All these estimates are just estimates, and are actually based on statistical guesstimate made by your SQL compiler

The way cost estimates work, goes as follows:

  • from time to time, you run RUNSTATS (a.k.a. ANALYZE, UPDATE STATISTICS, etc.) statement over tables within your database
    • Make sure to run your RUNSTATS without locking the whole table (that is, if your RDBMS does support it, if not – you’re in Deep Trouble(tm)).
    • It is of paramount importance to run your RUNSTATS on regular basis (such as “daily” if you can afford it) – otherwise, your cost-based optimizer is likely to occasionally fail really badly (more on it below).
    • There are some RDBMS out there which update some of the stats silently (on certain events happening) without you knowing it. For a seriously-loaded DB, this usually tends to complicate things <sad-face />, as you have no idea when exactly your almighty RDBMS will decide to update the stats (unless your RDBMS manages to guarantee stats to be always-in-sync, which I don’t think any existing-as-of-mid-2017 database really does <sad-face />).
  • RUNSTATS scans the table and stores results of the scan to one of the database catalog tables. Information stored varies, but usually it is information about tables and indexes, including such things as:
    • number of rows in the table (per-table)
    • Hare thumb down:it can be faster to use Table Scan rather than Index-Scan+fetch over such a low-selectivity indexnumber of unique values in index (sometimes named “index cardinality”) – per-index. This value is very important because it allows to calculate “index selectivity” (as a ratio of number_of_unique_values_in_index to number_of_rows_in_table). And if you have a low-selectivity index (such as “only 10 different values for a million-row table with each row being 200 bytes long”) – it can be faster to use Table Scan rather than Index-Scan+fetch over such a low-selectivity index. A quick exercise in arithmetics for the example above: while Index Scan itself will be fast, it will require fetching of 100’000 rows of table data – and with Index Scan, it will roughly translate into 100’000 page reads(!); on the other hand – with Table Scan, we’ll need to read million rows, but it will result in a sequential read of about 1’000’000*200/page_size ~= 200’000’000/~16K ~= 12’500 page reads, which is much lower than 100’000 page reads needed when using Index Scan (and this is not to mention that random reads for fetching-after-Index-Scan tend to be significantly more expensive then sequential reads for Table-Scan, even for SSDs).
      • Bottom line: avoid low-selectivity indexes, they won’t make things any better. As a very rough rule of thumb – make sure to question usefulness of any index which has selectivity worse than 1%.
    • minimum/maximum values in index. NB: while this thing can improve cost optimization significantly, it also has been seen to backfire when outdated (see a real-world example below).
    • information about key distribution in the index (pays off if index key distribution is heavily skewed – but honestly, you should try to avoid such heavily-skewed index key distributions)
    • other stuff, such as number of levels in index B-tree, number of leaf blocks, and so on.
  • Whenever a query compiler (more specifically – a cost-based query compiler) needs to prepare your statement (effectively creating an execution plan for your SQL statement) – it uses this information to determine which of the candidate execution plans is better.
    • All-important observation: in practice, by design, and as described above –  cost-based optimizers are pretty much bound to outdated statistics.

This, in turn, leads to quite unpleasant real-world problems. The worst case of such horror stories I know about, went along the following lines:

  • there was a very active OLTP DB with like 50M rows added per day to a certain table.
  • one of the indexes was by time.
  • once a day, RUNSTATS was run over the database, with maximum current value for all the indexes (including index-over-time) stored in the stats table
  • now, if the SELECT was using WHERE TIME > specific_date, where specific_date was somewhere after the RUNSTATS was run – the almighty optimizer decided “hey, if we do Ranged Index Scan by date with a limitation of specific_date, there will be (almost-)zero rows in the resulting set, so whatever-we-do with this set later, it will be cheap” – and decided on an execution plan using this Ranged Index Scan. However, in reality – this Ranged Index Scan returned several million rows, which has led to pretty bad problems with these execution plans <sad-face />

It illustrates one all-important point:

Outdated stats can easily cause LOTS of trouble, so make sure you do understand specifics of your database compiler – in particular, information it stores in the process of RUNSTATS/ANALYZE/UPDATE STATISTICS, and why it stores this information

BTW, a funny observation – if in our example above, we’d be using a prepared statement with “WHERE TIME > ?”, optimizer couldn’t use specific_date to make that incorrect guess about the size of the data set (it just didn’t know the specific_date at the point when the statement was compiled and execution plan prepared), so it wasn’t able to screw things up (in practice, only reports were directly affected by this bug-or-feature, but the load and cache poisoning they caused, was bad enough to affect operation in an observable way too). While it doesn’t really count as a “yet another argument for using prepared statements” – it is still good to keep this potential peculiarity in mind.

[[To Be Continued…

Tired hare:This concludes beta Chapter 27(a) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for beta Chapter 27(b), where we’ll continue our discussion on execution plans (and the way to understand them for a C++ developer <wink />)]]

Don't like this post? Comment↯ below. You do?! Please share: ...on LinkedIn...on Reddit...on Twitter...on Facebook


Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.

Join our mailing list:

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.