OLTP Optimization 102: DYI Heterogeneous Replication. Part I (Basics)

	Author:	“No Bugs” Hare Follow:
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

[rabbit_ddmog vol=”7″ chap=”Chapter 27(f) from “beta” Volume VII”]

After we discussed (and hopefully you implemented <wink />) app-level caches and Group Commits, the next step in the quest for OLTP DB performance, is usually replication. As it was discussed in Vol. VI’s chapter on Databases, the point behind replication is to move vast majority of the read-only requests to special replica databases, and these replica databases, being read-only, can be trivially scaled (and replica fault tolerance can be trivially achieved too).

“moving read-heavy requests to a separate DB tends to help performance of your Master OLTP DB very significantlyOn the performance side – moving read-heavy requests to a separate DB tends to help performance of your heavily-loaded Master OLTP DB very significantly. The very first thing you’ll see after moving reporting to the replica, will be a performance improvement due to reduction in cache poisoning caused by heavy requests (such as a table scan of a 100M-row table which is next-to-impossible to prevent from happening once in a while on your reporting/analytical DBs, but is perfectly avoidable for OLTP). This will lead to a much more stable and more predictable performance of your OLTP processing, with most of unexplained delays (as in “hey, why this OLTP request, which usually takes 300us, started to take 300ms between 7:00 and 7:10?”) magically disappearing. And a bit later (when you move to Super-Replicas as discussed in [[TODO]] section below) – you’ll be able to limit the size of your OLTP DB to the size-small-enough-to-fit-into-caches, which means that your by-far-most-performance-critical-OLTP-DB is 100% cached; this, in turn, will allow to have the very-best-possible performance (comparable to that of the in-memory DBs¹).

Speaking of “which requests can be moved to replicas” – the answer is two-fold:

Usually, at least 99.9% of the CSR reporting doesn’t need to run from OLTP DB, and can be moved to read-only asynchronous replicas.
- A quick recap – CSR reporting is all about those reports which your support folks are using to answer player requests. When you have 100 people to answer player e-mails round the clock – this does cause lots of load on your DB. Most of CSR reports are user-centric (as in “give me all the {play|CSR remarks|payment|…} history of this particular player”), but dual-player stuff is not unheard of either (example: “tell me how often these two guys played together?”, which is common for collusion analysis).
- For CSR reporting replica – most of the time it is sufficient to have the same DB structure as in OLTP. And as long as synchronization delays between OLTP DB and replica are kept within 1-2 minutes – this is usually enough for CSRs to work efficiently.
100% of the analytics.
- This includes all kinds of analytics – with most analytical reports intended for management and/or marketing.
- For analytics DB – most of the time, data representation will differ from that of in the OLTP DB. On the positive side – delays are pretty much a non-issue.

¹ And if you happen to have tons of money, you’ll be able to migrate your OLTP to in-memory RDBMS too.

DIY Replication

BASE Eventually consistent services are often classified as providing BASE (Basically Available, Soft state, Eventual consistency) semantics, in contrast to traditional ACID (Atomicity, Consistency, Isolation, Durability) guarantees.— Wikipedia —First, let’s define what we want to achieve. We want to obtain a replica Slave DB, which doesn’t slow down Master DB.² This, in turn, means that our replica should be asynchronous. In other words – we’ll be speaking about BASE consistency guarantees between our ACID-compliant Master DB and Slave DB.

As a vast majority of the reporting and analytics is history-oriented (and for a properly designed DB, there is no such thing as “change of history”), BASE is perfectly fine for them. In other words, if it is noon right now, and we have a replica which is at most 2 minutes behind, all the historical requests limited to “11:57 and before” are perfectly correct by design.

² Ok, “to the extent possible”

Why DIY??

Almost universally, whenever I start speaking about DIY DB replication, I face a question: “hey, why doing it DIY when generous DB developers have already done it for us?” Well, while I do know about RDBMS-level replication, there are still several Big Fat Reasons™ to use DIY replicas:

Not all RDBMS-level replicas are created equal. I’ve seen more than one replica-by-Big-RDBMS-vendor which occasionally failed miserably (to make things worse – sometimes there was no real way to recover from the failure). In particular, I urge to stay away from trigger-based replicas (while DB-log-based replicas are usually ok).
- In contrast, DIY replica discussed below does work regardless of underlying RDBMS.
Our DIY replica allows for certain goodies which do NOT work with RDBMS built-in replicas. These goodies (which we’ll discuss in detail later) include such things as:
- Super-replicas which are larger than original DB.
  - The idea here is to move all-the-historical-data-not-really-necessary-for-OLTP (but still necessary for reporting and analytics) from OLTP DB to the super-replicas; more on it in [[TODO]] section below.
  - “Super-replicas can (and IMO SHOULD) be used to keep size of production OLTP DB in check.Such super-replicas can (and IMO SHOULD) be used to keep size of production OLTP DB in check. In particular, most of the audit and historical data can be truncated from OLTP DB after a certain period of time – which, in turn, allows for 100% caching of OLTP DB, speeding it up significantly.
  - NB/TODO: it seems that it is possible to do achieve a similar result at least for MS SQL Server (by not replicating DELETE statements on historical/audit tables).
- Heterogeneous replicas (i.e. replication from one RDBMS to another RDBMS, or even replication from RDBMS to NoSQL DB).
  - These allow using more optimal DBMS for each of the tasks. One example – while DB/2 tends to beat Oracle for pure OLTP load (dominated by writes), reporting tends to run better from Oracle.³
  - Or (as it was discussed in Vol. VI) you might want to run your small OLTP DB on DB/2 or Oracle, but run your replicas off the free MariaDB. Even more importantly – while OLTP works better with classical RDBMS, analytical DBs are known to benefit from NoSQL.

- Replicas with different data structure
  - The point here is that to ensure an efficient operation of the requests over the replica, we often have to have different data structure in the replica than in original OLTP.
  - These are necessary for analytics (which tend to require such stuff as aggregates, and/or OLAP cubes, and/or NoSQL-style Big Data).

³ Whether having two different DBs is worth the trouble – depends on your specifics, but it certainly might.

Prerequisites

The following are the prerequisites for our app-level DIY replication:

As before, we’re speaking about single-writing-DB-connection architecture. In other words – for each of our OLTP DBs,⁴ we have exactly one connection which can modify our OLTP DB.
- As it was discussed at length in Vol. VI’s chapter on Databases– I am a huge fan of such single-writing-DB-connection architectures, and they’re seen to work in practice very well.
ALL the tables-to-be-replicated have a PRIMARY KEY
- This has to be done regardless of replication, and TBH, you should already have all your PRIMARY KEYs in place.
“for all the modifying statements we must update only those rows which are specified by respective PRIMARY KEY in the WHERE clauseALL the modifying statements are working via PRIMARY KEY. In other words, for all the modifying statements we must updateonly those rows which are specified by respective PRIMARY KEY in the WHERE clause.
- Complying to this requirement is likely to require some effort (though usually is perfectly doable). In practice, when implementing this kind of thing for a serious real-world RDBMS – our team did run in about 1-3% of the statements which didn’t comply with this requirement. On the other hand:
  - For OLTP DB, the vast majority of modifying statements already works this way. In particular, all the stuff such as “UPDATE USERS SET GOLD=? WHERE USERID=?” are already compliant.
  - Even for those update statements which are not written via PRIMARY KEY, within single-writing-DB-connection it is trivial to rewrite them into two statements: (a) SELECT necessary PRIMARY KEYS, and (b) UPDATE using those PRIMARY KEYS. As massive updates in OLTP are extremely rare⁵ – this approach won’t hurt performance too much (in practice – in the real-world example mentioned above, we didn’t see performance to be hurt in any meaningful way).
- As SELECT statements are non-modifying – they are exempt from this requirement.
In addition, ALL the updates have to provide FINAL values of the affected fields; in other words – UPDATE statements are NOT allowed to rely on anything within the database. In practice, it leads to two additional kinda-requirements. While these kinda-requirements SHOULD stand at DB level, most of the time they CAN be satisfied at SQL Binding Compiler level, without app changes, that’s why they are “kinda“ prefix:
- At DB level, we MUST NOT have statements using AUTOINCREMENT in any form.
  - Fortunately, as long as we’re speaking about single-writing-DB-connections – keeping last_primary_key for each of tables with autoincrement-ed PRIMARY KEY in-memory is not a problem, and then generating the next value for INSERT becomes trivial. Moreover, most of the time it can be done behind the scenes by the code generated by SQL Bindings Compiler, so that at app-level we’re issuing a statement with AUTOINCREMENT, but then it is replaced by reading-from-last_primary_key-with-an-increment, and an SQL statement with a specific value for the PRIMARY KEY.
- At DB level, we MUST NOT use statements which use existing values of the fields; an example of such statement is “UPDATE USERS SET GOLD=GOLD+?”.
  - BTW, I am all for this kind of statements at app-level – they are significantly less error-prone than absolute writes. Fortunately, there is a way to use such incremental statements at app-level while converting them into suitable-for-replication statements at DB level. If we’re using (guess what? <wink />) SQL Bindings Compiler with a cache – it can automagically replace “UPDATE USERS SET GOLD=GOLD+?” into reading-GOLD-from-USERS-cache (which happens to be pretty much free), followed by “UPDATE USERS SET GOLD=?” statement to DB; all of this can be done completely without any changes to robust “SET GOLD=GOLD+?” SQL statement at app-level.

⁴ More precisely – “for each of non-interacting data sets”. In other words – we CAN have multiple writing connections to the same DB, as long as each of the connections operates only over its own set of tables.

⁵ With an exception of historical data truncation, but as discussed in [[TODO]] section below, this is a very different story anyway.

Implementation

To implement this kind of app-level replication, we need to support two distinct processes: (a) Initial Population, and (b) Ongoing Updates. Chronologically, first we have to perform Initial Population of the Slave DB (obtaining a snapshot of the Master DB), and then we have to keep it in sync by Ongoing Updates.

Ongoing Updates

“Ongoing Updates can be implemented entirely within the code-generated-by-SQL-Bindings-CompilerLet’s consider implementing Ongoing Updates first. In fact, as soon as we have our SQL Bindings Compiler (and our DB is compliant with the rules above), Ongoing Updates can be implemented entirely within the code-generated-by-SQL-Bindings-Compiler:

Whenever a function-generated-by-SQL-Bindings-Compiler is called – in addition to issuing a corresponding SQL statement, it will also generate a Replication Message which essentially describes an appropriate SQL statement for the Slave DB. This Replication Message will be stored in the same PendingRepliesQueue as it was discussed above in the context of Group Commits.
- Let’s note that, given the restrictions above, ALL the Replication Messages will be one of the following:⁶
  - INSERT INTO <table> VALUES (<list of values, including pk>)
  - UPDATE <table> SET <list of modified fields with specific final values> WHERE <primary_key_field_for_table> = <pk>
  - DELETE FROM <table> WHERE <primary_key_field_for_table> = <pk>
On successful commit() – we’re sending all the Replication Messages to the Slave DB. In case of rollback() for whatever reason – we’re dropping all the Replication Messages corresponding to current DB transaction, altogether.
On the Slave DB side – on receiving a Replication Message, we’re simply applying corresponding SQL statement to Slave DB, that’s it.

That’s pretty much it for Ongoing Updates.

⁶ If, by any chance, the list of the restrictions above is not strict enough, and we still have to have Replication Messages which are not listed here, it is the list of the restrictions which has to be revised.

Initial Population

The second component of our DIY replication, is Initial Population, aiming to produce initial snapshot of the Master DB within Slave DB. This is an extremely important part of the replication, especially as we’re speaking about populating at least hundreds of gigabytes of data.

Initial Population – Interaction with Ongoing Updates

There are several ways to implement Initial Population; however, before we get to specifics, let’s note that

With the Replication Messages described above, it is ok to apply Replication Messages to Slave DB which is newer than exactly necessary.

This all-important property stands as long as we the following kinda-errors will be handled in the following manner:

If INSERT INTO would lead to a duplicate ID (and would fail as a result) – old row should be removed, and new one inserted.⁷
If UPDATE is applied to a non-existing row – we can ignore it (we can be sure that there will be DELETE FROM statement soon – more specifically, even before our Replication Messages reach the current state of Slave DB).⁸
If DELETE FROM is applied to the row which has already been deleted – we can ignore it too.

Bold⁹ observation above leads us to a very important property of our DIY replication. It is apparently ok to:

First, start recording Replication Messages for our Master DB.
Second, make a snapshot of our Master DB.
- Actually, it doesn’t even need to be a consistent snapshot of the database; instead, it is ok to have this kinda-snapshot for any row independently. In other words, it is sufficient to run a process which records current values of each and every row in the DB; this process MAY run concurrently with normal DB operation (which will generate Replication Messages).
Third, when snapshot is ready (or process recording current values of each and every row is completed) – we apply it to the Slave DB.
Fourth, we start applying our Replication Messages to the Slave DB. We MIGHT (and actually WILL) run into situations when Replication Messages lead to kinda-errors described above – but as discussed above, it is ok.
- We have to keep in mind that ALL the Replication Messages which can affect the same data, MUST be applied strictly in order. On the other hand, order of Replication Messages for non-overlapping data (such as “different tables”, or even “rows with different PRIMARY KEYs in the same table”) doesn’t matter.
Fifth. We reach the point when all the Replication Messages issued before the Third step above, are applied. At this point, DB is consistent (and actually kinda-errors described above, shouldn’t happen anymore).

“Formal proof of the correctness of the method described above, is beyond the scope of this book, but here is the proof sketchFormal proof of the correctness of the method described above, is beyond the scope of this book, but here is the proof sketch:

We have a moment T1, when we started writing our Replication Messages, T2a – when we started creating the snapshot, and T2b – when we finished creating the snapshot. We want to prove that by the end of applying those-Replication-Messages-issued-between-T1-and-T2b, our Slave DB will correspond to the Master DB as of T2b.
if any specific row wasn’t modified between T1 and T2b – it will stay unmodified (and therefore correct).
If, between T1 and T2b, the row was modified – it can be only one of the following scenarios:
- The row did exist at T1 and was UPDATEd. In this case, all the fields which were UPDATEd between T1 and T2b, will be UPDATEd once again during the processing of Replication-Messages-issued-between-T1-and-T2b, and therefore will be correct by the end of applying these Replication Messages (and regardless of the exact state caught in the snapshot).
- The row didn’t exist at T1 but was INSERTed (and then maybe UPDATEd) between T1 and T2b. In this case, regardless of whether it made it into the snapshot, it will be INSERTed via Replication Messages (and all subsequent UPDATEs will be processed too).
- The row did exist at T1 but was DELETEd between T1 and T2b (with potential UPDATEs prior to DELETE). In this case, all the UPDATEs are irrelevant (and some may be ignored by Slave DB, as mentioned above) – and DELETE will be performed when processing a Replication Message if the row being present in the snapshot.
- The row (as always – identified by PRIMARY KEY) did exist at T1, and was DELETEd and then a different row with the same PRIMARY KEY was INSERTed. In this case, first the row will be DELETEd (along the lines of the row which was simply DELETEd), and then it will be re-INSERTed (along the lines of the row which was simply INSERTed).

⁷ Maybe we could ignore such duplicate INSERTs, just as all other kinda-errors are ignored – but I am less sure about ignoring them, so if you want it – make sure to prove its correctness yourself <wink />.

⁸ Alternatively, it is formally correct to re-create the row, but for partial UPDATEs it is not feasible, so it is better to simply ignore such UPDATEs.

⁹ Pun intended

Implementing Initial Population

Now, back to the mechanics of the Initial Population. At least, the following ways can be used to achieve that (kinda-)snapshot which we need to get our replica running:

Online backup of Master + online restore on Slave. Actually, for homogeneous replication (i.e. replication to the same RDBMS) – it is usually by far the fastest way of performing Initial Population.
- IF any of the methods below happen to be too slow, it is always possible to run them using an intermediate snapshot-obtained-via-online-backup-plus-online-restore instead of Master DB.
Reading all the rows of the Master DB via the same single-writing-DB connection, and sending them to SlaveDB. Can be done using some kind of “pseudo-batch processor” which will be discussed in Chapter 28. While this is known to work, quite obviously, it tends to eat quite a bit of resources.
- Note that even while during this method of taking snapshot, “reading rows” and Replication Messages interleave, still ALL the “reading rows” messages MUST be applied to Slave DB before any of the Replication Messages.
Reading all the rows of the Master DB via a different read-only DB connection, and sending them to SlaveDB. This option is not as bulletproof as the previous ones (it does involve transaction isolation, which is a well-known can of worms, and is dependent-on-specific-RDBMS at that too), but it seems to work ok, as long as we have transaction isolation levels in both write-connection and read-only-connection¹⁰-to-read-rows-for-Initial-Population, as Read Committed. Using lower-than-Read-Committed transaction isolation level will allow for “dirty reads” which are not acceptable, and going higher-than-Read-Committed – is very likely to kill performance (while isolation benefits provided by these higher isolation levels, such as removing transaction isolation artifacts,¹¹ are pretty much pointless for our app-level replication purposes, as they will be fixed by the described-above-logic-of-Replication-Message-handling anyway).
- “it seems to be possible to use Uncommitted Read for reading rows, as long as we’re issuing special Replication Messages for all the rollbacks in Master DBWhile I never done it myself, it seems to be possible to use Uncommitted Read for reading rows, as long as we’re issuing special Replication Messages for all the rollbacks in Master DB. These “rollback row” Replication Messages should contain full DB row (and processing them on Slave DB side should result to the whole row being rewritten¹²), as the row was observed in Master DB after the rollback has happened.¹³ This way, even if Uncommitted Read while “reading rows” has lead to a “dirty read” included into our (kinda-)snapshot, it will be fixed by this “rollback row” Replication Message.
  - With Uncommitted Reads, in case of DBReactor/RDBMS/DB Box crash, it still possible for “dirty” records to sneak into the replica, making it incorrect <sad-face />. On the other hand – it is only Initial Population which is vulnerable to this behaviour, so in case of crash we can simply restart it from scratch.
- In any case, to ensure consistency, all the “reading rows” MUST be applied to Slave DB before any of Replication Messages.

After the (kinda-)snapshot is obtained – we can start applying all the Replication Messages obtained from the some-point-before-point we started generating our snapshot; when this process is completed – we can be sure that our replica is 100% consistent with the original (well, with the potential delay of those few messages which happen to be in transit).

¹⁰ Or multiple read-only connections

¹¹ Such as non-repeatable reads or phantom reads

¹² or DELETEd if we’re rolling back INSERT

¹³ alternatively, for partial UPDATEs it might be possible to write only-those-fields-which-were-affected-by-the-UPDATE – but this looks unlikely to improve things, and I also feel safer with whole rows (different RDBMSs tend to have very different understanding of isolation levels, and whole-row is the least common denominator, which is likely to work pretty much everywhere).

[[To Be Continued…

This concludes beta Chapter 27(f) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for further parts of Chapter 27, where we’ll continue our discussion on DYI heterogeneous DB replication]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

Cory Long says

November 1, 2017 at 2:45 pm

I’m probably overlooking something but

“Fortunately, there is a way to use such incremental statements at app-level while converting them into suitable-for-replication statements at DB level. If we’re using (guess what? ) SQL Bindings Compiler with a cache – it can automagically replace “UPDATE USERS SET GOLD=GOLD+?” into reading-GOLD-from-USERS-cache (which happens to be pretty much free), followed by “UPDATE USERS SET GOLD=?” statement to DB; all of this can be done completely without any changes to robust “SET GOLD=GOLD+?” SQL statement at app-level.”

If you’re using the replica database to help set values in the primary, if the value in the replicate database is out of date, the update statement will be incorrect, no? Or am I misunderstanding?

- "No Bugs" Hare says
  
  November 9, 2017 at 4:14 pm
  
  > if the value in the replicate database is out of date, the update statement will be incorrect
  
  That’s exactly why “GOLD=GOLD+?” is bad, but “GOLD=?” is good for replication. My point in the paragraph you’re citing, was that IF we have app-level cache for table USERS within our single-writing-DB-App, it is trivial for SQL Bindings Compiler to replace “GOLD=GOLD+?” with “GOLD=?”; of course, to have things correct – we should use value of GOLD _taken_on_the_Master’s_side_ (and not on the Slave side).
  
  - Cory Long says
    
    November 10, 2017 at 3:21 pm
    
    Ah, the _taken_on_the_Master’s_side_ was the part that was not clear. Sounded like it was suggesting using a cache that may or may not equal Master at any exact moment.
    
    - "No Bugs" Hare says
      
      November 11, 2017 at 7:01 am
      
      It implicitly refers to http://ithare.com/oltp-db-optimizations-102-100-coherent-app-level-cache-for-single-writing-db-connection/ – where such a 100%-coherent cache is described in detail (for posts, it indeed wasn’t too obvious, but for the book it should be better I hope).
      
Mikhail says

November 21, 2017 at 4:06 am

“While these kinda-requirements SHOULD stand and DB level, most of the time they CAN be satisfied at SQL Binding Compiler level”

Typo: “stand _and_ DB level”

- "No Bugs" Hare says
  
  November 21, 2017 at 11:37 am
  
  Fixed, thanks! 🙂