Gradual OLTP DB Development – from Zero to 10 Billion Transactions per Year and Beyond

 
Author:  Follow: TwitterFacebook
Job Title:Sarcastic Architect
Hobbies:Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 
10 transactions per second to 10K transactions per second

Cover of the upcoming book
[[This is Chapter XVII(j) from “beta” Volume 2 of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the “release” book to those who help with improving; for further details see “Book Beta Testing“. All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

Ok, after we spent all that time discussing various DB-related techniques in the context of multiplayer games – it is time to put all these pieces of the puzzle together. And, moreover, I am also going to describe a roadmap of DB development for an MOG (BTW, as it was discussed in Vol. 1 – I insist that both stock exchanges and banks are multiplayer games at least from technical point of view).

In this section, we’ll center our discussion on a way how a game-like DB can evolve from zero to many billions of transactions per year. It is based on several real-world observations, including seeing and leading development of a game which went from zero to half a million simultaneous players, and from zero to 10+ billion transactions per year.

I. Starting Point

Most of the time, I suggest a first version of your game DB to work along the following lines (though for larger games, you MIGHT want to start it with a bit more complicated implementation, more on it in [[TODO]] section below):

Fig XVII.1

CSR Customer Service Representatives interact with customers to provide answers to inquiries involving a company's product or services.— Wikipedia —The whole system is operating from two server boxes. The first box, “DB Server Box”, is running an RDBMS – and “DB Server App”. The second box, “Intra-Web Server Box”, provides an intranet web server for CSR reporting. Actually, for a really small game you can run everything on one single server box – but as soon as you can afford two servers, it is better to separate web from RDBMS (while keeping DB Server App and RDBMS on the same box – this is important).

Important Points and Implementation Details

First, some important points:

  • As it was discussed in Vol. 1 – all the interactions of Game Server Apps with DB Server App MUST go via “DB Server API” (with each request being atomic, and also expressed not in terms of SQL, but in terms of Game Logic).
  • DB Server App SHOULD be implemented using Single Writing DB Connection (as described in [[TODO]] section above)
    • As it was discussed in [[TODO]] section above – unless you have a DBA who develops and successfully runs 24×7 multi-connection DB with millions transaction/day – I do NOT recommend going for multiple-writing-DB-connection. While it is possible to do it this way – doing so requires way too much subtle expertise in specific DB (and there is a high risk of data consistency problems manifested only under the heavy load).
    • Hare pointing out:For Single Writing DB Connection, a BBWC RAID controller SHOULD be usedFor Single Writing DB Connection, a BBWC RAID controller SHOULD be used (as mentioned in [[TODO]] section above). Disks for RDBMS SHOULD be either within the server box itself, or within “directly attached storage” (and not within SAN or NAS).
    • RAID-1 is recommended with at least 2 logical RAID disks each consisting of 2 physical ones; one logical disk should be dedicated to DB logs, another one – to everything else (feel free to split the latter one further – as long as each of the split disks is RAID-1). More on it in Vol. 3 (tentatively Chapter XXI on Preparing for Launch)
    • DB Server App SHOULD run on the same server box as RDBMS.
  • All the DB requests from DB Server App to RDBMS MUST go via prepared statements
    • It is also a nice place to use that SQL Bindings Compiler we’ve discussed in [[TODO]] section above 🙂
  • If you RDBMS is a lock-based one – all “reporting” DB connections SHOULD use “Read Uncommitted” transaction isolation level.
    • this is necessary to reduce (in practice – almost-eliminate) locks by readers over writer. As MVCC-based RDBMSs don’t normally cause readers to lock – usually this requirement doesn’t apply to them.
  • When implementing your DB Server App, you SHOULD make it Reactor-based. This will facilitate quite a few things down the road (such as app-level cache).
    • Even if you decide not to do it – you MUST make sure that your DB Server App does NOT use any thread sync.

Besides those important points listed above, there are some details, which depend greatly on the nature of your game, so you should choose them yourself:

  • Which exactly RDBMS to use
  • Which programming language to use to write your DB Server App or your Reporting App
  • Which API to use between DB Server App and RDBMS
  • etc. etc. etc.

II. Enter App-Level Cache and Replicas

If all the necessary optimizations (indexes, physical DB layout, decent hardware, etc., see Vol. 3 [[TODO]] for a detailed discussion) are made – then the system shown on Fig XVII.1 should be able to handle of the order of single-digit million DB transactions per day.

To go further – I suggest to move in the direction shown on Fig. XVII.2:

Fig XVII.2

Two most important changes on Fig XVII.2 here are:

  • adding App-Level Cache (facilitated by using Single-Writing-DB-Connection).
  • adding eventually-consistent replica.

The idea behind app-level cache was described in [[TODO:Stage 2]] section above. Replicas are asynchronous replicas; we’ve discussed them briefly in [[TODO:Stage 3]] section above, and we’ll discuss DIY implementation in Vol. 3 [[TODO]].

Important Points and Implementation Details

Important implementation points for app-level caches and replicas:

  • To have a coherent app-level cache – you MUST use single-writing-DB-connection architecture
  • Replica MUST be an asynchronous one
    • In addition, I prefer to use app-level replication. In particular, it allows for heterogeneous replication, and for modifying data for replica in-any-way-you-want too (and we’ll need both these things a bit later).
      • We’ll discuss implementing app-level replication in Vol. 3 ([[TODO]]). For now, let’s just note that:
        • To implement app-level replication, you need to run your source as a single-writing-DB connection.
        • OTOH, target DB Replica Server implementing app-level eventually-consistent replica MAY use multiple writing connections (some restrictions apply, batteries not included, see Vol. 3 for further discussion)
      • Arguing hare:Usually, I am suggesting to truncate history on OLTP DBUsually, I am suggesting to truncate history on OLTP DB (see discussion in [[TODO]] section above); this will usually allow to keep it small (within 100G) – and to fit all of it in RAM of your OLTP DB Server; it will help your performance even if your RDBMS is not a really in-memory one.
        • This in turn means that Server Replicas are not exactly reconstructible from the DB Replica. Which in turn means that they need to be backed up (and that replication initialization procedure will become more complicated, more on it in Vol. 2).
      • I strongly prefer to have at least two replicas. Replicas are known to desync for quite a few reasons (from hardware and RDBMS failures to DBA errors), and as one replica reconstruction from the OLTP DB can easily take many hours (effectively leaving your CSRs without reporting) – second replica is a Very Nice To Have thing. Also it comes very handy when some ad-hoc request causing table scan brings your replica caches to their knees (causing 10-minute delays etc.)
      • As a rule of thumb, Server Replica boxes do NOT strictly need BBWC RAID with a directly attached storage – and so a SAN and cloud-based lower-latency storage can be used (a sigh of relief is expected from your admins at this point ;-)) – though NAS and higher-latency cloud storage is IMO still questionable for Replicas.
        • With the transaction volumes we’re speaking about, Server Replicas are going to grow Damn Huge (like 10+Tbytes in size) in just a few years. That’s when Archive DBs become necessary. Archive DB is essentially a one-time snapshot of a Replica DB (with irrelevant portions of historical data truncated) – and going, say, for a year of historical data. Such stale data is rarely requested in practice (but is still requested so throwing it away is rarely an option) – so keeping it separate doesn’t cause too much inconvenience.
          • Note that not all the historical data can be truncated from Replica DB – information such as financial one is both small enough and important enough to keep it in Replicas “pretty much forever”
          • An example of data which can go to the Archive while being truncated from Replicas – is such things as “stale” audits of all the matches/tournaments or all “stale” logins; while it is desirable to keep these “just in case” (and they will be used) – these uses will be very rare.
        • Hare thumb up:Archive DBs often allow to reduce Replica DBs in size – very roughly to the order of several terabytes or soArchive DBs often allow to reduce Replica DBs in size – very roughly to the order of several terabytes or so. And Archive DBs themselves can grow forever (actually, in implementations I’ve seen they were not really growing – instead, new Archive DB was created once a year or once half a year).
        • For Archives, pretty much any storage will do – on the diagram it is SAN, but it can easily be NAS or any kind of cloud storage too.

III. Analytics

One thing which is not really covered by the diagrams above – is analytics. Of course, reporting is possible from straightforward OLTP replicas, but analytics – as in star-schema analytics, or requests over aggregates (“give me per-hour stats of game occurrence X and correlate it with promotion Y”) – are not really feasible with the systems described above. That’s why the addition shown on Fig. XVII.3 is likely to become necessary (note that OLTP DB is not shown on Fig XVII.3):

Fig XVII.3

Here, we’re adding a new DBMS for analytical purposes. It can be pretty much anything you may need – from classical star-based fact table plus N dimension tables, to anything NoSQL-based. Moreover, it can easily happen that you need to run some analytics from an SQL DB, and some analytics from a NoSQL DB; this, of course, means that you’d need heterogeneous replication. However, if you’re using (as I suggested above) an app-level replication – you can easily do it.

Implementation-wise, the key on the Fig XVII.3 is all about the two dotted lines: “Initial Population (Modified)” and “Online Updates” leading to Modifying DB Replica Server App. The idea here is to have an analytics-oriented replica, with the data already optimized for those analytical requests you will want there.

Judging hare:to make an efficient representation usable for OLAP – we need to modify our data on its way to OLAP replicasA very important word along both these dotted lines is “modifying”; indeed, to make an efficient representation usable for OLAP – we need to modify our data (denormalize and/or aggregate and/or whatever-your-OLAP-DBA-says-she-needs-to-run-analytical-queries-efficiently) on its way to OLAP replicas. This modification is one of the things which are quite easy with app-level DIY replication, and that’s one of the Big Reasons why I positively love it. How exactly to modify the data – depends on the analytics you have; however, from what I’ve seen – most of the time it will be either some kind of denormalization and/or some kind of data aggregation.

IV. OLTP Scalability

I’ve seen the system shown on Fig XVII.2 (with or without analytics from Fig XVII.3) able to handle at around dozens of million transactions per day (that’s, assuming 24×7 operation with a more-or-less typical distribution of the load over 24 hours, roughly corresponds to 10B transactions/year). When/if it becomes insufficient – we can go further along the lines of Share-Nothing (almost-)Perfect Scalability discussed in [[TODO: Stage 5-6]] section above. I’ve seen it implemented and deployed along the lines shown on Fig XVII.4 (note that only OLTP DB server box is shown on Fig XVII.4):

Fig XVII.4

Here, we have several (almost-)Share-Nothing subsystems, as it was discussed in [[TODO: Stages 5-6]] sections above. However, what we didn’t mention in the previous discussion, is that it is perfectly possible to run separate Share-Nothing tuples of (single-writing-connection-DB Server,RDBMS,database) on top of the same RDBMS, merely separating the data which belong to different DB Server Apps, into separate tables within the same RDBMS. As long the tables belonging to different DB Servers Apps, do not interact in any manner whatsoever – the only resource which can cause contention between different (and supposedly Shared-Nothing DB Server Apps), is DB log, but I didn’t see it to cause significant troubles. That being said:

  • This single-RDBMS configuration is merely a convenience
    • if it causes any trouble – you can (and SHOULD) go for complete Shared-Nothing (with separate RDBMS, though usually they still may run off one single box).
  • Advantages of running it this way are (in no particular order):
    • license cost savings
    • simplified coherent backups
    • improved MTBFs without fault tolerance (running off one single HW server means that it is the only thing which can fail, and for a good 4S/4U server box we’re speaking about 1 crash without data corruption per 3-5 years, which is good enough for a vast majority of the games out there)

My estimates for such a system, when running on one single OLTP DB 4S/4U server box – is that it can reach tens of billions of transactions per year (of course, assuming very good optimization). Going further is possible too, see “Scalability” section below.

Important Points

Two all-important points with regards to Fig XVII.4:

  • Assertive hare:Each of the DB Server Apps is a replica master, but all replica targets are within the same Replica DB.Each of the DB Server Apps is a replica master, but all replica targets are within the same Replica DB. It is not a problem – as long as all of these DB portions land into separate tables within the Replica DB, so there can be no possible conflicts
    • As soon as Replica is one single DB again – all kinds of reporting requests can be run against it 🙂 – and without federation at OLTP level.
  • In spite of running off the same RDBMS, from app-level point of view it still MUST be Share-Nothing. In other words – each app should still treat its portion of DB as the only one it accesses – and should be redeployable onto completely separate DBs without any changes. This “running multiple-portions” instead of “running multiple-DBs” is merely a convenience option, not an architectural decision.

Scalability – both read and write

After we have your system with Fig XVII.2+Fig XVII.4 implemented – we have it near-perfectly scalable. Read scalability is achieved via Replicas (and maybe analytics) – and read-only stuff is inherently scalable (as adding new replicas to spread the load over them is trivial at least in theory).

And write scalability is achieved by app-level separation into different DB Server Apps (each having their own DBs or portions of DB). With this architecture, while the configuration on Fig XVII.4 is not really Share-Nothing – there is always an option to re-deploy exactly the same code into real Share-Nothing on different server boxes; and Share-Nothing means that there is no contention – and therefore, is a linear scalability.1


1 That is, after we account for the costs of keeping inter-DB transfers

 

V. Throwing in Fault Tolerance

As it was discussed in [[TODO]] section, if we manage to run our OLTP DB from one single 4S/4U server box, we’re speaking about MTBFs of the order of 3-5 years (even if we account for crashes of DBMS itself). As a result, as long as this “single OLTP server” stands, for quite a few games out there we won’t really need to think about fault tolerance (as an average game crashes much more frequently than that – well, there are much better points to optimize within your system). However, IF you want to do your system fault-tolerant – it is possible too. We’ll discuss implementing it in detail in Vol. 3; for now, let’s discuss a very brief sketch on Fig. XVII.5 (note that only one of DB Server Apps is shown on Fig. XVII.5):

Fig XVII.5

Here, the idea goes along the following lines:

  • After we already have our DB Server App (which can be a Game World DB Server App, User DB Server App, or whatever else DB Server App) working – we just add a Proxy App, having the same DB Server API as our original DB Server App.
    • This allows to deploy Proxy App seamlessly.
  • This Proxy App just takes each request – and sends it to both DB Server Apps sitting behind it. As everything is perfectly deterministic (which is possible because of Single-Writing DB Connections) – all the updates on both OLTP DB Server boxes are identical.
  • Surprised hare:If Proxy App realizes that one of DB Servers is not answering for-longer-than-allowed – well, it declares that OLTP DB Server Box is deadIf Proxy App realizes that one of DB Servers is not answering for-longer-than-allowed – well, it declares that OLTP DB Server Box is dead, raises a big red flag for admins (raising the hell for support team in all possible ways), and proceeds with working with one OLTP DB Server Box.
  • Now, we may seem to be back to square one – as our Proxy box needs to be fault-tolerant too. However, achieving fault tolerance on a small in-memory app like is a rather well-known task, and can be implemented in at least two ways (which were discussed in Chapter VII):
    • VM-based fault tolerance (note that for DB we don’t care much about additional latencies of about 100ms, so even checkpoint-based fault tolerance can fly)
    • Determinism-based fault tolerance (which was discussed in detail in Chapter VII).
  • Note that we still can run several DB Server Apps on the same DB Server box (in a pretty much the same manner as shown on Fig. XVII.4) – as long as there is always one instance of DB Server App (with an appropriate portion of DB) running on one physical server box, and another instance of the same DB Server App – running on another physical server box.

Bingo! We’ve got fault tolerance too. Of course, the process of recovery from the OLTP Server box failure – while being 100% online without stopping the processing whatsoever – won’t be a picnic, as we’ll need to create another perfectly synchronized box. However, it is achievable even via using nothing but ubiquitous online-backup-with-rollforward mechanism (causing a small stop in processing at the very end of this process to re-sync the DBs); for implementation details of such synchronization – please refer to Vol. 3.

On Game World State DBs

In some (though certainly not all) cases you may want to additionally have Game World State DBs – just to save current Game World State in a persistent manner. Usually, unless your game is asynchronous – I don’t feel it is necessary (preferring to store game state in-memory and syncing with DB only at the end of “game events” – see detailed discussion on it in Chapter VII).

However, if you do need Game World State DBs – in general, two cases are possible:

  • if our game allows to keep Game World State DB very separate from the rest of the system, syncing with main DB only at the end of “game events” – well, we’re fine with using pretty much anything to implement them (including NoSQL DBs).
    • BTW, doing so doesn’t imply that we won’t need to have a corresponding Game World Server DB App – we probably will (to sync the states to main DB after “game event” is over)
  • otherwise – we will need to consider Game World State as a part of our main DB, and store these States via appropriate Game World Server DB App (in a corresponding DB or portion of DB).

That’s Pretty Much It

Assertive hare:If you implement the architecture on Fig XVII.2 with additions from Fig XVII.3 and Fig XVII.4 – honestly, I know of only one game-like system which MIGHT be able to overwhelm it performance-wise: it is post-mid-2000 NASDAQ.If you implement the architecture on Fig XVII.2 with additions from Fig XVII.3 and Fig XVII.4 – honestly, I know of only one game-like system which MIGHT be able to overwhelm it performance-wise: it is post-mid-2000 NASDAQ. Handling current NASDAQ volumes (around million transactions/second) goes well beyond my experience (which was up to several thousand transactions/second), and while I can be reasonably comfortable about extrapolating my experience for another order of magnitude (I’ve got a feeling that it has ample reserves) – going beyond that is likely to reveal some things which I am still unaware of.

Still, as long as you’re staying within tens of thousands of transactions/second (with 24×7 operation and more-or-less usual distribution of the load per day, they will land you hundreds of billions DB transactions per year) – my pretty-well educated guess is that you should be ok (that is, if you did optimize the hell out of your system).

Starting Low – or a Bit Higher

This whole development (starting from Fig XVII.1 to Fig. XVII.2+ Fig. XVII.3+ Fig. XVII.4) is perfectly doable in a gradual manner.

And for an indie game I clearly suggest to start with Fig XVII.1 (and forget about the rest until you start expecting performance problems in 3 months from now).

On the other hand, if you’re going to launch big – well, you might need to start with something more elaborated that Fig. XVII.1; while gradual path does exist (and is pretty straightforward) – going alongside the whole path is going to take at least a year (probably two) – which can cause a certain trouble if you’re not prepared. Where exactly you need to start – depends on your expected DB transaction volume (with typical achievable numbers listed above).

BTW, the first thing you need to think about if you’re not fitting into Fig XVII.1 – is “are we Really Sure we want to write all this information that often?” – consider, in particular, writing to DB only at the end of “game events” (see relevant discussion in Chapter VII, especially ‘No Bugs’ Rule of Thumb for Multi-Player Games).

Final Comments

Above we’ve discussed one architecture which allows to achieve tens of billions of DB transactions per year – alongside with both CSR-oriented “real-time” reporting and analytics. While this is certainly not the only way to skin this cat, I’ve tried it myself, and am very fond of the results (both performance-wise and development-simplicity-wise).

While the architecture above is certainly not the only one possible – I suggest to take a close look at it2

2 as always, YMMV, and batteries not included

 

[[To Be Continued…

Tired hare:This concludes beta Chapter XVII from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter XVIII, where we’ll start discussing payment processing.]]

Don't like this post? Comment↯ below. You do?! Please share: ...on LinkedIn...on Reddit...on Twitter...on Facebook

Acknowledgement

Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.

Join our mailing list:

Comments

  1. Jesper Nielsen says

    I enjoyed reading the Reddit discussion where you touched on transaction sizes.
    In chapter XVII(e) you were advocating matching DB Server requests to exactly one DB transaction. But how do you feel about transaction batching?
    Several game-world transactions could be pulled from the queue and executed in a single SQL transaction (or at least as long as these game-world transactions don’t involve more than a single DB Server…)

    • "No Bugs" Hare says

      As Garfield the Cat has once said – “We have a technical term for it. We name it ‘Oops'”. Fixed, thanks!

Leave a Reply

Your email address will not be published. Required fields are marked *