OLTP Database Optimization 102 – DIY Heterogeneous Replication (Part II)

	Author:	“No Bugs” Hare Follow:
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

[rabbit_ddmog vol=”7″ chap=”Chapter 27(g) from “beta” Volume VII”]

In my previous post, we discussed basic algorithms related to DIY heterogeneous replication. A brief re-cap:

We have a Master OLTP DB which is handled via single-writing-DB-connection
We have all the DB updates in Master OLTP DB expressed in terms of updates with PRIMARY KEY known in advance.
Online Updates: for each update, we send a Replication Message to the Slave DB; it will be one of the following:
- INSERT INTO (with PK in it)
- UPDATE … SET … WHERE <PK> = …
- DELETE FROM … WHERE <PK> = …
- NB: Code generating Replication Messages for Online Updates MAY (and IMO SHOULD) be generated by “SQL Bindings Compiler” automatically, though this is not a strict requirement. Even as such a product isn’t publicly available and you’ll have to write it yourself – as soon as you have even several dozens of SQL statements, benefits of having code generated for you will outweigh the costs of writing SQL Bindings Compiler, and for a successful MOG you can be sure to have hundreds of them.
Initial Population works as follows:
- We start to write Replication Messages and set them aside
- We make a kinda-snapshot of the DB (“kinda-“ means that it can be a per-row snapshot rather than a completely consistent one). Techniques for obtaining kinda-snapshot include:
  - Online backup+restore. Usually the best thing if replication is homogeneous (=”to the same RDBMS”), though see below on re-population of Super-Replicas in case of Master DB crash.
  - Reading all the rows from Master DB via the same single-writing-DB-connection or a separate read-only connection-with-at-least-Read-Committed-transaction-isolation-level, and writing these rows to the Slave DB. Code for reading (and writing) those rows MAY and IMO SHOULD be generated by “SQL Bindings Compiler”.
  - Same as the previous item, but reading is via Uncommitted-Read connection. This seems to be possible, but for complicates things, as to avoid dirty rows in Slave DB, we have to issue special Rollback Replication Messages (only during Initial Population).
- Then, we apply all the Replication Messages (which had been started before we started to make our kinda-snapshot) to the Slave-DB (which is already restored from kinda-snapshot by that point). It is crucial to start applying Replication Messages only after kinda-snapshot is completely restored in Slave DB.
- As this process completes (i.e. all Replication Messages which were issued until kinda-snapshot was completed) – we have a more-or-less consistent¹ Slave DB, which is usually perfectly fine for reporting / analytical purposes.
- We continue to apply Replication Messages to keep Slave DB more-or-less consistent pretty much forever.

Now, we can proceed with considering certain more subtle implementation details, which happen to be very important for use in real-world mission-critical systems.

¹ If we’re doing everything right, Slave DB will become really consistent as soon as we stop Master DB processing for a few minutes, but at each moment it is not necessarily fully consistent.

Super-Replicas

“The point of Super-Replica is that it is a replica Slave DB which is larger than original Master DBOne important feature of DIY replicas discussed above, is that they allow for what I name “Super-Replicas”. The point of Super-Replica is that it is a replica Slave DB which is larger than original Master DB. Super-Replicas tend to appear in serious OLTP systems when we realize that (a) we have lots of audit/historical data in our OLTP DB, and (b) we don’t really need most of this data for OLTP operation itself (though this data is important for reporting and other types of analysis).

At this point, we can start truncating those historical tables from Master OLTP DB, leaving them only in Slave DBs. For real-world systems, this allows to reduce the size of OLTP DB by orders of magnitude, which has very good implications for performance and maintenance of Master OLTP DB. In particular:

Risks of issuing a disruptive overly-long statement over OLTP are significantly reduced
Online Backups of Master OLTP DB take much less time (and for sure we don’t want to run them during peak times).
Last but certainly not least – we can fit our OLTP DB 100% in memory. Whether we switch to an in-memory RDBMS or just rely on caching of our regular RDBMS, is another story, but even in the latter case having data 100% cached certainly qualifies as an Extremely Good Thing™ performance-wise (even with SSDs, we’re speaking about 100x performance advantage of an in-memory cache page read over SSD read).

To implement truncation, I strongly prefer to rely on SQL Bindings Compiler (already mentioned in several dozens of places along the course of this book); more on implementing truncation in Chapter 28.

With this in mind, our SQL Bindings Compiler has either to (a) support the concept of “non-replicated statements”, or (b) support the concept of “historical tables” (which implies that DELETEs on these tables are not replicated). Out of these two choices, I strongly prefer the latter (as non-replicated statements are way too easy to mess up/misuse).

As a result, I am arguing for the following approach:

Implement a clause for your SQL Bindings Compiler, saying “this table is a HISTORICAL table”.
For HISTORICAL tables, only INSERT INTO statements, and a special TRUNCATE_HISTORY operations are supported (and all the app-level INSERT/DELETE statements are flagged as errors by SQL Bindings Compiler).
For TRUNCATE_HISTORY operation, SQL Bindings Compiler generates DELETE statements which are not replicated to slaves.

Super-Replicas and Data Consistency

While Super-Replicas work nice for heavily-loaded OLTP environments, they have certain implications related to data consistency.

First, we have to realize that with Super-Replicas, it is NOT sufficient to backup only our Master OLTP DB anymore; in other words – at least one of Super-Replicas has to be backed up on a regular basis too (while it is not that big deal, it is a thing to keep in mind).

The second implication is a more subtle one. In case of Super-Replicas and Master DB failure (which includes hardware failure, OS failure, RDBMS failure, or Master DB Reactor failure) – Initial Population of the Super-Replicas becomes more complicated than a simple procedure described above. More on it in [[TODO]] section below.

Super-Replicas without writing to Master OLTP DB first

“if we DON’T need information from HISTORICAL tables in Master OLTP DB, why do we need to store it there in the first place?One may ask – if we DON’T need information from HISTORICAL tables in Master OLTP DB, why do we need to store it there in the first place? It is a valid question (and there are cases when we can avoid writing certain HISTORICAL data to Master OLTP DB).

Before we start, there are a few considerations:

Whatever we do, we DON’T want to have a two-phase commit between our Master DB and Slave DB. Doing this would introduce a dependency of Master processing on Slave operation, which is a Big No-No™ if we’re speaking about real-world mission-critical systems (the whole point of making our Slaves asynchronous was to remove this dependency. Re-introducing it back will bring us all kinds of trouble in case if the Slave is slowed down/crashes/etc.).
- Given that our HISTORICAL write to Slave DB has no possible legitimate reasons to fail, it means that in theory we can do a DIY kinda-two-phase-commit – first, committing to Master DB with a special “temporary” flag (but not releasing any messages-resulting-from-the-event-which-caused-the-transaction yet), and then, after receiving a confirmation from SlaveDB – removing “temporary” flag and releasing those delayed messages. This allows to avoid blocking (while providing full consistency guarantees, including Durability), but still, as of now I don’t see practical uses of this approach; most importantly, doing two transactions over OLTP DB instead of one is expensive and is likely to kill most of the the gains (if not all) from not writing HISTORICAL data <sad-face />.
If Master OLTP DB does need HISTORICAL information for a limited time, we MUST write this HISTORICAL information to Master DB first (but MAY truncate it later). Examples of such cases include all kinds of business cases such as “as chargebacks can go as far as 6 months back, we do want to keep all the payments for a year so we can process the chargebacks automatically, but going much further back is not necessary for automated processing”, “users are allowed to get their playing history for last month right from the Client, but to go further back they have to send request to support”, and so on. Certainly, not all the HISTORICAL tables will be like this, but from what I seen, such “we have to go back in OLTP DB only for N months” patterns are fairly common.
If the data is absolutely-critical, we SHOULD write it to Master OLTP DB too. As we’ll see below, there are cases (however fringe ones) when some portions of HISTORICAL data can be lost if we don’t write them into OLTP DB in the first place. Such cases will be extremely rare (as in “once in a year, a dozen of HISTORICAL records has a potential to be lost”) – but for some kind of data even such low probabilities of historical data loss are not acceptable.

Keep all these restrictions in mind, for that data which is NOT absolutely necessary and is NOT needed by OLTP itself – we MAY skip writing it to Master OLTP DB, only sending a Replication Message to the Super-Replica(s).

Risk Analysis

Now, let’s see what are the risks of a HISTORICAL record not making it to the Super-Replica (while the transaction itself did succeed within the Master DB). Of course, we’ll assume that we did implement all the kinds of keeping and re-sending the Replication Messages if connection to Super-Replica is temporarily lost, etc.

The worst case, which involves an irrecoverable loss of HISTORICAL data, happens only if Master DB fails (this includes hardware failure, or kernel panic, or RDBMS crash, or a crash of our MasterDBReactor); otherwise, MasterDBReactor still has this HISTORICAL data and will be able to ensure correctness of the Slave DBs. This loss of HISTORICAL data because of Master DB failures, in turn, can be further classified into:

HISTORICAL data loss which happens while Slave DBs are working fine, but records are still lost because corresponding messages didn’t make it to Slave DB. Here, we’re speaking about a dozen or so of Replication Messages being lost.
HISTORICAL data loss which happens while Slave DBs are failing too. These can cause much more records to be lost, but as this scenario requires a double-failure (first, Slave DB should fail, and then within a few hours, Master DB should fail too), it is much less likely. Moreover, if we have two separate SlaveDB replicas as suggested below, we’ll need three catastrophic failures to have this kind of problem.²

Keeping in mind this analysis – which means that in case of catastrophic Master DB failure all we’ll get will usually be a loss of a little bit of HISTORICAL data (while the result of the transaction itself still being valid, albeit non-audited) – in certain use cases it might be acceptable.

“this “don’t write some HISTORICAL data to Master DB” approach can easily allow to reduce the load on the Master OLTP DB by a factor of 2x-3xOn the positive side – this “don’t write some HISTORICAL data to Master DB” approach can easily allow to reduce the load on the Master OLTP DB by a factor of 2x-3x (historical data about games played tends to be rather large and unwieldy, so writing it down takes significant time). Still, I’d say this option should be left as a “last resort”, to be used in case if you’re running into big performance problems with your OLTP DB (and don’t have time to properly scale it as discussed in Vol. VI’s chapter on Databases); in such scenarios – it might save your bacon until the proper scalability is implemented.

² As long as SlaveDB failures are not correlated, which does usually stand at least for hardware failures, OS failures, and RDBMS failures, though MAY NOT be the case for SlaveDBReactor failures.

Replicas and Fault Tolerance

Continuing discussion on the Fault Tolerance of our Slave DB replicas, a few real-world observations:

“Even for reporting-only purposes, we’ll usually need to have at least two independent Slave DB Replicas.Even for reporting-only purposes, we’ll usually need to have at least two independent Slave DB Replicas.³ There are several reasons for it:
- Whatever we do, it remains possible to issue an ultra-heavy report (with some strange parameters filled in by CSR) against Slave DB, slowing it down to a crawl for half an hour (and even if your DBAs kill the offending request, it will take time to re-populate poisoned caches⁴). In such a case switching of all the reporting to an alternative Slave DB is an obvious – and very-well-working – solution.
- It has been observed that RDBMS-running-replica tends to crash MUCH more often than Master OLTP DB. This tends to happen because of (a) MUCH simpler processing within Master DB (hey, there is no potential for races within RDBMS whatsoever), and (b) because we’ll want to run newer versions of RDBMS on our Slave Replicas (to see when they become ready for Master DB).⁵
- If we’re speaking about Super-Replicas without writing some of the HISTORICAL data to the Master DB first – it becomes even more important (to reduce the chances of catastrophic failure causing loss of HISTORICAL data).
Master DB has to keep all the Replication Messages until they’re acknowledged by Slave DB (or until we declare our Slave DB dead).
- With load on Master DB being of the order of 10’000 transactions/second, and each Replication Message being 500 bytes or so, it means that we’ll have to store about 5 megabytes per second; assuming that Slave DB failure can last for up to 24 hours (if we cannot recover by that point – most likely, it will need to be re-populated from scratch anyway) – we’ll need as much as 400 Gigabytes of RAM to store all the Replication Messages on the Master DB side(!) – and that’s per replica.
  - Sure, if your numbers are much lower – you don’t need to bother, but at least do your own math before ruling this problem out.
  - One way to bypass this problem (while keeping all the guarantees intact) which I’ve seen to be used in the wild – was a system with an intermediate “proxy slave” server box which just receives Replication Messages and stores them persistently (in files or DB), acknowledging receipt of the Replication Message to the Master DB right away after this storing-by-proxy-slave is committed (more strictly, after fsync() is performed either directly or by proxy-slave DB).
    - One note – while these “proxy slave” boxes are usually not demanding hardware-wise, to improve Fault Tolerance it is desirable to keep them separate for each of the Slave DB Replicas. OTOH, if we’re not speaking about the volumes that high as used in calculations above -– it might be ok to run “proxy slaves” on the same server boxes which run Slave DB replicas (while keeping their data on separate drives from Slave DB).
- In case of regular shutdown – MasterDBReactor has to store all the outstanding Replication Messages in some kind of persistent storage (usually – DB itself).

³ We may need to have three in case of planned migrations etc. to reduce risks.

⁴ Unlike for OLTP DB, for Super-Replicas we can’t hope to fit everything in memory.

⁵ and whatever-your-RDBMS-vendor will tell you – the newer major versions (those with new features) tend to crash MUCH more often than the stable-ones-with-lots-of-bugs-already-fixed. I’ve seen previous-major-RDBMS-version to crash once per 5 years, while current-version-before-certain-fix crashing under heavy load like every month or so <ouch! />.

Recovery from Master DB faults

As discussed above – if we don’t write HISTORICAL data to Master OLTP DB, in case of Master DB failure we’re at risk of losing this HISTORICAL data. Now, let’s see how we will recover from Master DB failure in case when we DO write all the data to Master OLTP DB first.

After we had a catastrophic failure of our Master DB, and managed to restart it (that is, without restoring from the backup, which is a completely different story), there is a chance that some of our Replication Messages had been lost. It means that while all the information we need, resides somewhere in the system, answering a question “what exactly we have to do to get consistency restored” is not easy <sad-face />. Still, there are at least two different ways of handling it:

Full re-population from scratch. In this case, we’ll have two stages:
- Stage 1. Re-population of the consistent Slave DB but without “old history” stuff. If we managed to keep the size of our OLTP DB down to 100G or so (which is perfectly possible with Super-Replicas), and did everything else right, then making a new snapshot via online backup/restore is going to be done within an hour. Very rough calculation: if we reserve 50Mbyte/sec read bandwidth for our online backups – and this is perfectly possible if multiple SSDs are involved, we’re speaking about half an hour for backup/restore of the whole 100G Master DB (which effectively means that Slave DB – except for data which has already been removed from the Master DB – can be re-populated pretty quickly). After this Stage 1, we’ll have two Slave DBs:

- - one (“new”) Slave DB will be up-to-date but without old HISTORICAL info (old info being populated – ideally from newer to the older one).
  - another (“old”) Slave DB will be either not up-to-date (if we don’t send new Replication Messages there), or will have potential glitches (if we do send new Replication Messages there)
  - in theory, it might be possible to make a federated DB over these two Slave DBs to make things transparent – but this is a very RDBMS-specific question better to be left to your DBAs (and will likely require lots of jumping through the hoops, such as “drop all non-HISTORICAL data from ‘old’ Slave DB”).
- Stage 2. Population of “new” Slave DB with the “old” HISTORICAL data. This is going to take a loooong while (depending on the size of your history, it can take many days). There are several reasons for this process being much slower than online backup/restore process discussed above, in particular because we’re going to issue INSERT INTO statements against our Slave DB, and INSERT INTO is much slower than simple online backup/restore (writes are not linear, indexes need to be built,⁶ etc. etc.).
“Fortunately, it is possible to avoid full re-population after the Master DB crash.Fortunately, it is possible to avoid full re-population after the Master DB crash. In this case, we’re just taking existing Slave DB, going along the lines of the basic Initial Population process (“start writing Replication Messages” – “take kinda-snapshot of Master DB using simple row reads” – “apply those rows to Slave DB” – “apply Replication Messages to Slave DB”) as discussed above, but we’re going over the existing Slave DB instead of going over an empty Slave DB. This way, we won’t drop our “old” HISTORICAL data from the Slave DB, while effectively re-populating it with the up-to-date stuff from up-to-date Master DB.
- This process is going to be significantly slower than Stage 1 from the full re-population process above, and even for a rather small 100G Master DB can easily take as much as 4-8 hours or so.
- OTOH, if you have LOTS of “old” HISTORICAL data in your Super-Replicas – it is going to be faster than Stage1+Stage2 of the full re-population discussed above (just because the amount of the data we’re touching, is MUCH smaller).

⁶ BTW, if you have LOTS of “old” HISTORICAL data, and you don’t need to run reports while DB is being restored – it is usually significantly faster to drop indexes before re-populating Slave DB, and to rebuilt them after population is completed.

Replica Writing Performance

With an OLTP being time-critical, an obvious question is “hey, how we can be sure that Slave DBs can cope with the load?” Fortunately, it is not a problem⁷ because of one single observation:

Unlike for Master OLTP DB (where I am strongly in favor of single-writing-DB-connection), for Slave DBs, we can use multiple writing connections.⁸

The thing here is that I am not really against multiple DB connections (neither I am against multiple threads of execution); what I am against – is the need to think about interactions between different connections/threads (at least at app-level).

“Slave DB writes with very-well defined Replication Messages along the lines above, we can easily ensure that there are no interactions between different writing connections. However, for Slave DB writes with very-well defined Replication Messages along the lines above, we can easily ensure that there are no interactions between different writing connections. In particular:

As all our Replication Messages are explicitly touching only one table – it means that Replication Messages aiming different DB tables can be run via different DB connections.
Moreover, as all our Replication Messages are explicitly touching only one specific row within the table (identified by PK), it means that even within one single DB table we can make “kinda-partitioning”, separating all the PKs into several writing connections based (say) on a hash of the PK. Again, this will rule out any interactions between different DB connections.⁹
Bingo! We can easily scale populating and keeping updates of our Slave DBs using as many CPU cores as feasible.
NB: within each of connections, all relevant Replication Messages still MUST be applied in their correct order (!).

⁷ At least, I never seen it to be a problem

⁸ Under some conditions outlined below.

⁹ well, there are some kinds of implicit locks (such as “lock on adjacent record within the index”) which MAY come into play, but even in this case, they usually don’t cause too much trouble (though in theory you MIGHT need to react to deadlocks in Slave DBs, and re-issue rolled-back statements, I never seen this protection to work in practice).

[[TODO: replica snapshot]]

[[TODO: schema transformations]]

[[To Be Continued…

This concludes beta Chapter 27(g) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for Chapter 28, where we’ll briefly discuss certain issues related to running your system 24/7]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

Ferdi says

May 8, 2018 at 11:27 pm

Not directly related, but you should check out the paper Mojim: A Reliable and Highly-Available
Non-Volatile Memory System. Kinda neat system using Nvm storage for distributed storage with high availability, consistency and reliability.

- "No Bugs" Hare says
  
  May 9, 2018 at 11:22 am
  
  Thanks for the link. I took a look, and it would be really ground-breaking IF they could avoid app-level msync (which BTW should be doable via techniques similar to VMWare/Xen’s “fast checkpoints”).