Databases and RAIDs

Author:  Follow: TwitterFacebook
Job Title:Sarcastic Architect
Hobbies:Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
Server, RAID-1, and SSDs
#DDMoG, Vol. VII

[[This is Chapter 23(e) from “beta” Volume VII of the upcoming book "Development&Deployment of Multiplayer Online Games", which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the "release" book to those who help with improving; for further details see "Book Beta Testing". All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the "1st beta" of the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

After we discussed configuring our Game Servers, we can proceed to the question of configuring our Database Servers (and other storage servers).

For DB Servers and any other storage servers – not surprisingly, one of the most important features of them, is storage – more specifically, we’ll be speaking about HDDs, SSDs, NVMe, and RAIDs.

RAID 101

Before we start speaking about RAIDs, let’s define what they’re about. Very very briefly – the idea of RAID is to provide redundancy (BTW, RAID stands for “Redundant Array of Inexpensive Disks”), so if one of the drives fails – the system is still working (and the data is still intact). After you replace the failed drive – a so-called rebuilding of the RAID occurs, and after the rebuilding is completed – we again have a perfectly redundant RAID array.

RAID Levels

I don’t want to go into well-known details, but here is a very short summary of usable RAID levels:1

  • RAID-0 (“stripe”). Provides no redundancy – but improves overall throughput (but not latency). As a rule of thumb – pretty useless for operational databases (unless combined with some other RAID level, see below).
    • On the other hand – RAID-0 can be combined with pretty much any other RAID level, forming RAID-10, RAID-50 etc. – some of which are discussed below.
  • RAID-1 (“mirror”). Has 2x redundancy (which can be pretty expensive if your DB is large enough) – but performs really well both for writes and reads. As we’ll see, RAID-1 (and/or RAID-10) is a very popular choice for performance-critical drives.
    • The logic behind RAID-1 is very simple: everything is written to both drives in parallel, and can be read from either of them. As a result – under normal conditions, write performance (both latency and throughput) of RAID-1 is almost the same as the performance of a single drive;2 on the other hand, read throughput (though not latency) of RAID-1 can be up to 2x higher than for a single drive. Also – for RAID-1 performance-while-rebuilding is usually pretty good (so your system can still work even after one of the drives has failed and while RAID is being rebuilt).
    • Usually, RAID-1 consists of exactly two drives; however, at least in theory (and often also in practice) – it is possible to have 3-way RAID-1 (consisting of three drives instead of two).
  • RAID-5. Consists of at least of 3 drives, and survives failure of any single one.RAID-5. Consists of at least of 3 drives, and survives failure of any single one. It also means that for RAID-5 out of N drives – we have redundancy of N/(N-1) – which is much less than 2x for RAID-1. Performance-wise, RAID-5 has worse write performance (both latency-wise and throughput-wise) than both single drive and RAID-1, while having significantly better read throughput than single drive (roughly by factor of (N-1)x) and somewhat-better read throughput than RAID-1/RAID-10 with the same number of drives. In addition, for RAID-5 performance-while-rebuilding tends to suffer significantly (RAID-5 needs to do lots of drive accesses while rebuilding).
  • RAID-6. Consists of at least of 4 drives, and survives simultaneous failure of any two drives. Can be considered as an “elder brother” of RAID-5.
    • The point of RAID-6 is that if we have a RAID consisting of two dozen drives – then there is a chance that one drive fails, and then the second drive fails while the first one is still not replaced (or while it is still being rebuilt).
    • In practice, such double-failures on smaller RAIDs are extremely rare (and we need to have backup for catastrophic failure anyway), so merits of RAID-6 are usually not that large. OTOH, if the RAID is large (i.e. has lots of drives) – dedicating another drive to convert from RAID-5 to RAID-6 isn’t exactly expensive either.
  • RAID-10. A combination of RAID-1 and RAID-0.3 Arguably the fastest practical RAID when it comes to write performance; RAID-10 out of N drives has write latency which is roughly equivalent to single drive, and throughput roughly equivalent to N*throughput_of_single_drive. This performance comes at the cost of 2x redundancy (just as with RAID-1).
    • Strictly speaking, RAID-10 guarantees survival only if one drive fails. If two drives fail – RAID-10 may survive (or may not, if it is 2 drives within the same mirror which fail); contrast it with RAID-6, which is guaranteed to survive failure of any two drives. Still, as RAID-10 tends to outperform RAID-6 by a wide margin at least for writing, it still remains a primary choice for most of write-intensive DBs such as OLTP one.
  • RAID-50. A combination of RAID-5 and RAID-0. Think of it as of performance-improved RAID-5 for a huuuuge number of drives.

1 In theory, there are other levels such as RAID-2/3/4 – but they’re rarely practical.
2 More strictly – write performance of RAID-1 is minimum of write performance of 2 drives; unfortunately – in case of partial drives failures it has been observed to cause trouble <sad-face />; on the other hand – such occurrences are extremely rare.
3 For our purposes, we can ignore the question of “which is better – RAID-10 or RAID-01 (i.e. should be build a mirror over stripes, or stripes over mirror)”).


[[TODO: problems of most of SW RAIDs – “Write Hole”, inability to distinguish between healthy and broken drive on restart; usually not a problem for serious BBWC HW RAIDs; ZFS and RAID-Z claiming to fix “Write Hole” for SW RAID]]


Wtf hare:for quite a few applications out there, the most important feature of the RAID is not even redundancy, but greatly improved write latency. Moreover – this is the thing which we cannot possibly get from software RAID (!).To be honest, for quite a few applications out there, the most important feature of the RAID is not even redundancy, but greatly improved write latency. Moreover – this is the thing which we cannot possibly get from software RAID (!).

The idea behind BBWC feature of the RAID (BTW, BBWC stands for “Battery-Backed Write Cache”) is that on our RAID card, we can have not only a usual read cache – but also a write cache. Usually, any kind of write cache would be a Big No-No™: for example, if we’re stupid enough to enable a write cache on a usual drive, then if we have a power failure, we will lose information which is already supposed to be persistent (formally – we’ll have big problems with Durability requirement for our storage).

However, with BBWC RAID card, situation is completely different – as we have that battery which provides power to our write cache even while the main power is not available – it means that while the power is off, the battery will keep all the data in the cache intact,4 and after the power is restored, all the data will be written from the cache to the drive. In other words, a system battery-backed write cache is functionally indistinguishable from the non-write-cached system even in case of power failures (while still providing an extremely important write latency improvement).

On a computer with a BBWC RAID card, write request coming from the app running on CPU, needs to get over PCIe to the card, and then the card should write the data in the request to the (battery-backed) on-card DRAM – and that’s it: at this point the card is able to reply back-to-app that “the write has been successfully made and is guaranteed to be durable”; all the other lengthy work (such as writing to HDD/SSD) is done asynchronously, without any additional latencies for the app.

Moreover, as we discussed in Vol. VI’s chapter on Databases – for quite a few operational databases (and especially for the most-performance-critical OLTP database), it is write latency of the database log files which really matters (in short – it is the only blocking operation which prevents transaction from being completed; all the other writes can be done in a lazy manner not blocking anything). That’s exactly why I am almost-universally insisting on using hardware BBWC RAID (or NVMe) at least for OLTP database(s) – and usually also for “reporting replicas” as discussed in Vol. VI.

To give an idea about typical latencies of BBWC RAID– most of the time, we’re speaking about a few hundred of microseconds (that’s for end-to-end fsync() latency, coming from app to BBWC RAID and back).

4 Usually, at least for 72 hours


On SSD and NVMe

In recent years, I heard more than once “hey, why bother with this strange BBWC stuff when we have SSDs which solve everything! (and my desktop SSD has a latency of 30 microseconds, I read it in the manual5)”.

When speaking about SSDs and latencies, there are several common issues:

  • Swearing hare:SSD performance (especially write performance) tends to degrade with time (ouch!).SSD performance (especially write performance) tends to degrade with time (ouch!). In contrast, BBWC RAID, as the one using DRAM, doesn’t really degrade (not on our lifetime anyway).
  • Even if SSD itself has good latency, it is perfectly useless if you’re forced to access it via high-latency interface. In an extreme case, a cloud provider may sell you an “SSD-based storage”; however, if this storage is in effect a NAS-storage-which-uses-SSD – you can easily get delays which are not caused by SSD itself, but by limitations in shared NAS (and which can occasionally get up to tens of milliseconds!).
  • Speaking of NVMe, the most important part is that as far as I know, as of now, there is no such thing as hardware RAID with NVMe. This means that we’re facing a choice between (a) traditional BBWC RAID (and then for DB logs it doesn’t really matter what drives sit behind it), and (b) software RAID with NVMe. Out of these two, I will pick traditional BBWC RAID any time of the day, because:
    • software RAIDs have a long history of bad issues (which tend to be conveniently ignored by most of those-gamers-who-use-RAID-only-on-desktops). The most ugly issue with software RAID is so-called “Write Hole” (which occurs if we suddenly lose power in the middle of writing). In general, “Write Hole” can lead to inconsistent RAID states, which can easily cause behaviour which is worse than non-RAID-ed drives (how about your data changing on the fly several months after the power failure?). As for existing method for avoiding “write holes”, I know of two of them:
      • first, it is possible to close the “write hole” by full-disk-scan-compare-and-fix on unclean restart (such as “md resync”). If doing it – we MUST take into account that additional time which is necessary to preform such  resync. IF we’re speaking about doing it for the whole OLTP DB (which should keep within several hundred gigabytes anyway) – then resync will take of the order of several hours, which is rarely acceptable. On the other hand – if we’re speaking only about DB logs (which can be kept to a 10G RAID) – then resync times will be in the ballpark of 10 minutes, which is usually not that much of a problem if we have to restart the whole system anyway.
      • the second approach to deal with “write holes” is to use special file systems and RAID levels (such as “ZFS with RAID-Z”); however, we have to keep in mind that:
        • performance of such configurations needs to be tested, and
        • at the moment I am not 100% convinced that all the potential issues with RAID-Z and write holes are ironed out. In short – IMO, RAID-Z is still not mature enough for serious production (=”production which can have serious financial implications”).


Overall, as of 2017, my personal take on it goes as follows:

  • Personally, I still prefer good old BBWC RAID (from one of Big Three server vendors6) for both DB log drives and for data drives (if you can afford it – use SSDs behind your BBWC RAID, and use SSD-aware RAID cards, i.e. the ones supporting TRIM command; also make sure to keep an eye on SSD write performance to replace SSD when performance starts to degrade). However, depending on your specifics, the following configuration MIGHT be viable too:
    • for DB log drives only: use Linux’s software RAID-1 over two separate(!) NVMe cards, with automated resync in case of unclean shutdown, and limiting the size of this drive to 10G at most. This approach will provide somewhat better latency (with a potential to save 100-200 microseconds per commit), at the cost of the 10-minutes-or-so during the restart after unclean shutdown.
      • LOTS of testing and staging (=”using exact configuration-you’re-going-to-use in not-so-critical environments”) is required.
      • TBH, I still prefer BBWC RAID, aided with group commits if necessary (more on group commits, including DIY group commits, in Vol. VII’s chapter on “DB Optimizations Take 1”).

5 rant about accuracy of claims in consumer manuals: for many years, we read that our monitors have their latencies at 8-4–2ms; however, in practice, those 2ms was just “pixel latency”, which cannot be observed by user; practically observed latencies (which include latencies due to serialization, buffering etc.) were 10x-20x larger than advertised ones.
6 ={HP|IBM|Dell}; this is one case when paying for brand is well-worth it


Hot Spare Drives

In any serious RAID controller, it is possible (and highly desirable) to add a so-called “hot spare” drive. The idea is simple – if one of drives in your RAID fails, controller is able to use “hot spare” to rebuild your RAID and to restore redundancy even before you replace failed drive.

Formally, “hot spare” drive reduces a “vulnerability window” which exists while the RAID is in non-redundant state (between drive failure and restoring redundancy).

Database Servers

After we completed defining very basics of RAID – we can proceed with describing configuration issues for our database servers.

First, let’s not that most (though not all) considerations which apply to Game Servers (and which we discussed above), also apply to Database Servers, so I won’t repeat them here. On the other hand – there are a few things which differ between Database Servers and Game Servers (but are common for all DB Servers):

  • Choice of Linux distribution is different for DB Servers. While for Game Servers, it is usually possible to use pretty much any distro – for DB Servers situation is different. To avoid problems (and even more – fingerpointing between vendors) I strongly prefer to use a distro (and version!) which is officially supported by both your hardware manufacturer (including their RAID array, of course), and by your database vendor.
    • Hare with omg face:DB Servers is the case when you may need that support (mostly – to force vendors to fix their bugs and to provide you with temporary workarounds)It may happen that your system will work perfectly even if this doesn’t stand; however – unless everything is officially supported, you will stay without support, and DB Servers is the case when you may need that support (mostly – to force vendors to fix their bugs and to provide you with temporary workarounds7).
    • This, in turn, limits our choice of Linux distros for DB Servers greatly. The most popular commercially-supported distros are the following: RedHat8, SUSE, and Ubuntu9 (you may found some other distros also supported, but – they’re not supported across the board, so you may find it difficult to get support both for the hardware and for the DB). Among these (and after checking that both your DB and your hardware-including-RAID are officially supported for these distros) – feel free to choose whatever-tickles-your-fancy.
  • When configuring your OS, make sure to follow recommendations by your DB vendor. In particular:
    • Make sure that your database data is cached exactly once. Some serious DBs can cache data themselves (and, as a rule of thumb, are doing it better than OS as they have more information) – but then you need to configure them to disable OS-level caching for their files. Caching data twice, while possible, is usually a waste of RAM (and as a result – can significantly hurt performance).
      • If such an option is available, “locking” DB cache pages in RAM may eliminate swapping; as an alternative – earlier-discussed approach of eliminating swap system-wide can be used too.
    • When choosing your file system for database disks -– follow recommendations from your DB vendor.
  • When configuring your hardware RAID (if you’re using one), make sure to enable BBWC = “Battery-Backed Write Cache”.

The issues above are common for all the DB Servers; as discussed in Vol. VI’s chapter on Databases – while we could start with a single OLTP DB, in the long run our DB can easily include “reporting replica DBs”, “analytical DBs”, and “archive DBs”. On the other hand – within current “Preparing for Launch” chapter, we’ll concentrate only on OLTP DB (it is the only thing which you need to launch – and we’ll discuss the specifics of the other DB Servers later in Vol. IX).

7 You will be surprised, but for some of Big Three hardware vendors and some of Big Three DB vendors I’ve even seen such things as “temporary patch just for you so you can survive until official fixed version”. However, to get it (or even a simple configuration workaround), it is necessary to have everything officially supported; otherwise you will be turned away by 1st-line support saying “of course, the problem is with <this unsupported Linux distro>, and NOT with our obviously-perfect software/drivers/etc.”. Oh, and beware: to get the support with workarounds, patches etc. – you’ll need to jump through the hoops, provide very clean cases proving that it is their problem (and not yours), request elevation etc. (at least to support level 3 or a reasonable facsimile), and spend lots of time in general; still – with some persistence it can be done (though as always, YMMV).
8 NB: you MAY try running CentOS instead of RedHat, and it will work, but again – you will be without support, and on DB Servers you’re usually much better with support than without
9 Out of major vendors, as of 2017, Oracle DB doesn’t seem to support Ubuntu.



Hare thumb up:As a rule of thumb – I am trying to keep the size of OLTP DB small enough to fit into RAM of the OLTP DB Server.A short recap from Vol. VI’s chapter on Databases: OLTP Database Server is a DB server which handles all the operational requests; it is the most critical DB Server (both in terms of downtime, and in terms of performance). On the other hand – usually we can keep the size of OLTP DB reasonably small. As a rule of thumb – I am trying to keep the size of OLTP DB small enough to fit into RAM of the OLTP DB Server. These days, we can easily get a server with several terabytes of RAM (as of 2017, upper limit of RAM for 4S/4U servers by major vendors was between 6T and 12T – if you can afford it, of course), so we can easily survive for quite a while. And most likely, long before when we’re out of these terabytes – we’ll need those “reporting replica DBs” discussed in Vol. VI’s chapter on Databases. And as soon as we have “replica DBs” – we can start truncating older historical data from the OLTP DB to make sure it is as lean and mean as possible.10

As a result, for OLTP DB Server:

  • Make sure that you have enough RAM to fit your whole OLTP DB for a while (and that you have enough space for expanding RAM too; unlike expanding drives – expanding RAM is fairly easy and requires only limited downtime11).
  • As most of the processing goes in RAM, it means that the only I/O thing which really matters – is write performance (especially write latency) of database logs.
    • Make sure that you have 100% dedicated logical drive (=”RAID-1 made out of 2 physical drives”) for database logs. We do not want any kind of contention over this logical drive.
      • Make sure that you’re using BBWC-RAID (or some-kind-of-RAID over NVMe) for this logical drive.
        • Ideally – you should try having a dedicated BBWC-RAID card (or dedicated pair of NVMe’s with software RAID over them) just for database logs. However, using the same BBWC-RAID card to handle all the drives in the system is usually ok for a while (as long as drives for the database logs are completely separate).
      • Make sure that you have 100% dedicated disk partition for database logs (and make sure to re-format it anew just in case). We do not want any kind of fragmentation going on database logs partition.
    • Some kind of RAID for both database logs and database data is absolutely necessary. On the other hand – for database data, latencies are not critical, so “BBWC-RAID or NVMe” requirement doesn’t apply.
      • Given the typical sizes mentioned above – we don’t need anything more than simple RAID-1s. A typical configuration consists of 6(+1) physical drives:
        • 2 drives combined into RAID-1 for system etc. (no database logs or data there, though you MAY want to store temporary tablespaces here).
        • 2 drives combined into RAID-1 for database logs ONLY.
          • Even if your DB/RAID vendor says “hey, you can use your log drives for something-else such as temp tablespace” – do NOT trust this claim. For OLTP DB, log drives are by far the most important thing – and MUST be kept 100% dedicated.
          • If you’re using log files to enable “Hot Spare” DB Server (as described below) – size of log files should be selected to ensure that delay between primary OLTP Server and “hot spare” is reasonable.
        • 2+ drives combined into RAID-1 for database data ONLY. On the other hand, as OLTP sizes tend to be within single-digit terabytes at most – it is rarely a problem to keep it to 2 drives.
          • If you need more than 2 drives here – a question will arise about “what we want to use here”. More on it in the “On Handling DB Data Growth” section below.
        • In addition, hot spare drive is very desirable
        • On the other hand, for initial deployments, even 4(+1)-drive configurations (2 drives for system+database data, 2 drives for database logs, plus an optional hot spare drive) were observed to work pretty well.
  • If an option to pre-allocate data files is supported – pre-allocate your files in advance.  It both reduces fragmentation and avoids spending time on file expansion during high load.
    • If there is an option to run on “raw” partition instead of files – TBH, I still prefer (pre-allocated!) files.

10 As discussed in Vol. VI’s chapter on Databases – such a history truncation means that “replica DBs” are not pure replicas anymore, but the whole thing still works like a charm as soon as we acknowledge this and take measures to ensure that data is carefully preserved.
11 Though make sure to check with your ISP whether you’re able to schedule exact time for the RAM upgrade.


“Hot Spare” OLTP DB Server

Hare pointing out:it is usually a good idea to keep a whole DB Server (with hardware being identical to the primary one) as a “hot spare” in case if your primary OLTP DB Server fails.OLTP DB Server is one thing which is absolutely critical for operation of your system. As a result – it is usually a good idea to keep a whole DB Server (with hardware being identical to the primary one) as a “hot spare” in case if your primary OLTP DB Server fails.

Note that achieving full-scale fault tolerance is usually too tricky and/or too performance-affecting and/or too expensive – so at least for the time being, we’ll concentrate on an ability to restore our DB in case of a catastrophic failure of our primary OLTP DB Server; however, this restore (a) will require a system restart, and (b) (that’s much worse) – there is a risk that we do lose some limited amount of data in case of catastrophic failure of the primary OLTP DB Server (ouch!). Fortunately – such failures are extremely rare (like once per 10 years of operation – that is, if you’re using a reasonably good hardware and good RDBMS), but IMNSHO, you still should account for such a thing happening, from the very beginning (or at least “as soon as you can afford it”).

DIY Hot-Spare (online backup + ongoing rollforward)

One way of achieving such a “hot spare DB Server” in a generic manner (i.e. without using “some magical DB-specific software which will handle it for us”) – is to use online backup with ongoing rollforward.

Let me explain. Most of anywhere-serious DBs provide us with so-called “online backup” mechanism, which allows us to make a backup without stopping processing. (Almost-?)universally, online backup works as follows:

  • We need to make a DB backup specifying it is an “online backup”.
  • Regardless of the backup in progress – database generates database log files while transactions are coming through.
  • After restoring from an “online backup” – we do not get a consistent DB, but we get a DB in a “rollforward pending” state; to bring DB into a consistent state – we need to “rollforward”.
  • “rollforward” is a process of applying log files to a DB; to apply log files – DB MUST be in a “rollforward-pending” state.
  • To make DB consistent after restoring from online backup – we need to rollforward at least those database log files which were generated while “online backup” was in progress.

Good! Now let’s see how these procedures can be (ab)used to organize a “hot spare” DB Server:

  • We make an “online backup” from our primary OLTP DB Server.
  • We restore it on the “hot spare” DB Server box.
  • We take all database log files from operational DB Server as they come – and apply them to the “hot spare” DB Server; after we get all the log files which were made during online backup – we are able to stop rolling forward, but we still keep rolling forward just because we feel like it.
  • We keep retrieving database log files as they’re completed on primary DB Server – and applying them to the “hot spare” one. This process can go for many months easily.
    • Note that while applying log files, for some RDBMS (notably Oracle) the DB is even accessible (as a read-only one) – so we might be able to use it for reporting. On the other hand, most of DBs don’t allow access while in “being rolled forward” state
  • Formally, our “hot spare” DB Server is an “eventually-consistent” copy/replica of the primary one. In practice – the lag between the two DBs depends on the size of the log file, and can be brought down to 10-20 seconds.
    • BTW, it is perfectly possible to run one of “hot spare” DB Servers in another datacenter (or even on a different continent, though in the latter case the lag can grow by a few seconds occasionally)
  • If something-catastrophic happens to the primary OLTP DB Server – we’re stopping the rollforward on the “hot spare”, and can get our system up and running “as of” last_moment_recorded_by_primary_DB – current_lag_between_DBs.

Arguing hare:It is MUCH better than being able to restore after 10 days necessary to procure new server, and with losing a whole day of gameplay because we were only doing daily backups.In other words – most of the time, in case of catastrophic failure of the primary OLTP DB Server, we’ll be able to restore the system within an hour or so, losing about 10-20 seconds of the game play. While not exactly perfect – it is MUCH MUCH better than not being able to restore anything at all, and MUCH better than being able to restore after 10 days necessary to procure new server, and with losing a whole day of gameplay because we were only doing daily backups.

Replica-Based Hot Spare

If your DBMS supports asynchronous (and most importantly – working) replication – you are in luck; it means that your DBMS, while most likely still doing pretty much the same stuff as discussed above – will do it itself, without the need to jump through the hoops.

Still – it will be subject to the same restrictions as the method discussed above; specifically – in case of failure you will lose at least a few seconds of your game data. While (exactly as mentioned above) it is much better than having offline backup – it is still much worse than full-scale fault tolerance; still – for games-other-than-stock-exchanges – it may be a reasonable compromise at least for the first stages of your deployment.

On Handling DB Data Growth

When dealing with large data sizes (which is not too likely for OLTP DB, but is very likely for “reporting replicas” in the long run) – it is important to realize how exactly you’re going to expand your DB when historical data grows out of your current drive size.

In general, there are two conceptually similar, but practically very different approaches for organizing your database data into multiple disks:

  • Option 1. We can say that the whole database data drive is one single logical drive provided by whatever-RAID-we’re-using. In this case, for performance-critical database sets such as those by OLTP DB and probably “reporting replicas”, we should usually use RAID-10 (=”stripe over multiple mirrors”).
    • In this case, if we need to add a new pair of drives to our RAID-10 – it is usually perfectly possible, but there is a huuuuge potential caveat: rebalancing
      • Rebalancing happens when you’re increasing the size of your RAID array – and then your RAID decides to make sure that everything is perfectly balanced (formally – rebalancing is necessary to maximize our reading throughput, which we don’t care about at least for OLTP, but RAID of course always knows better than we do).
      • In theory, rebalancing is expected to be “seamless” for apps; in practice – it can easily bring your system to a crawl performance-wise. From what I’ve seen, rebalancing effects were significantly worse than the rebuilding
        • For a multi-terabyte logical drive being rebalanced under heavy load – rebalancing can easily take several days.
        • Hare with hopeless face:Even when database data is fully cached – rebalancing tends to have such effect that it still may kill your OLTP DB (and if it happens – we’re speaking about several days of downtime(!)).Even when database data is fully cached – rebalancing tends to have such effect that it still may kill your OLTP DB (and if it happens – we’re speaking about several days of downtime(!)).
      • There is no way to predict how bad the effect of the rebalancing is without real-world testing.
      • In theory – there might be a setting for a RAID controller to disable rebalancing; however – I didn’t see such a thing yet. What I did see – is a setting which sets relative priority of rebuild operations (and which might also affect rebalancing). Feel free to play with it – but don’t take it for granted; from what I’ve seen – “Low priority” setting, while it was a bit less intrusive than the “High priority” one, was still pretty bad <sad-face />.
    • Option 2. We can say that we organize our-physical-drives-intended-for-database-data as a bunch of RAID-1 logical drives – and let database itself to do “striping” over them. To the best of my knowledge – this option is currently available only for DB/2, Oracle, and MS SQL. In DB/2-speak, each tablespace is implemented over “containers” (the “containers” can be files placed on different logical drives), and we can organize the “containers” into several “stripe-sets” within the same tablespace. If we create a new “stripe-set” within the same tablespace – we’re instructing DB/2 to avoid the rebalancing entirely.12 For MS SQL, it seems that similar effect can be achieved by creating multiple files in a file groups, and spreading those files across our RAID-1 volumes; it seems that MS SQL doesn’t perform re-balancing at all (but as discussed above – this is exactly the behaviour we want for OLTP DB). For Oracle – the same effect is achievable via Automatic Storage Management (ASM).

Now to the ways how we can organize growth of our database:

  • If we’re lucky to run DB/2, Oracle, or MS SQL – IMO the best way is to:
    • At RAID level – organize our drives-intended-for-database-data as a bunch of separate RAID-1’s
      • Then, For DB/2:
        • Use DB/2 containers over these separate RAID-1’s, and combine them into tablespaces as required. BTW, as a side note: I tend to prefer file containers to drive-level containers.
        • When we need to add new pair of drives for our database data – we just add a new RAID-1 and add container-over-this-RAID-1 to a new “stripe set” within existing tablespace. Bingo! No rebalancing will occur13
      • To achieve the same thing for MS SQL (I hope):
        • Use multiple files in a file group, and spread these files over these separate RAID-1’s.
        • When we need to add new pair of drives for our database data – we just add a new RAID-1 and add new file to a file group. Make sure to test this procedure in advance to make sure that rebalancing does NOT occur!
      • For Oracle, recommended (by both Oracle and me <wink />) way is to have a bunch of hardware RAID-1’s and let Oracle ASM handle only striping (i.e. we’re still using hardware RAID-1’s, and specify EXTERNAL REDUNDANCY to ASM). It seems that by default rebalancing doesn’t occur (make sure to test it in advance!)
  • In other cases – we’re bound to have a single RAID logical drive for our database data (note that we MUST keep database logs completely separate in any case, and that we can and often should split database data logically into different tablespaces residing on different RAIDs – but it still won’t help to handle growth within one single tablespace). Then, when we need to expand a tablespace – we have two options:
    • Try expanding RAID logical drive where the tablespace resides, right on the RAID controller.
      • Beware of rebalancing! Make sure to test everything in advance!
    • As, most likely, as discussed above, we need a “hot spare” of the whole OLTP DB Server anyway – we can do one of the following:
      • Add drives to the “hot spare” DB Server (with rebalancing); even if “hot spare” Server will fall behind because of rebalancing – we will be able to catch up with “rollforwarding” after rebalancing is completed (for more information on “hot spare Servers” and “rollforwarding” – see [[TODO]] section above).
        • Optionally, we can even stop “rollforwarding” while rebalancing is performed (aiming to catch up after rebalancing is completed) – it may happen to be faster to do things sequentially rather than concurrently.
      • On the “hot spare” – we can drop all the data, add drive(s), create new larger RAID-10s, create new tablespaces, restore “hot spare” from online backup, bring it online via rollforwarding, switch production to the “hot spare”; then previously-production DB Server becomes a “hot spare” and we can do the same for it.
        • Note that this option is more risky (there is a “vulnerability window” while the data is not present on the “hot spare”), and should be generally avoided if possible at least for OLTP DB.

12 If we’re adding containers to the same stripe-set – rebalancing will occur, but as discussed above – this is not what we normally want for a highly-loaded operational DBs
13 still, don’t forget to test the whole procedure to make sure that you did provide all the right parameters to DB/2 commands


[[To Be Continued…

Tired hare:This concludes beta Chapter 23(e) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for beta Chapter 25, where we’ll proceed to discussing DevOps for MOGs]]

Don't like this post? Comment↯ below. You do?! Please share: ...on LinkedIn...on Reddit...on Twitter...on Facebook


Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.

Join our mailing list:


  1. JohnK says

    It may be helpful to note that RAID6 can survive any two drives failing while with RAID10 if the wrong two drives fail (a mirrored pair) then you’re dead.

  2. Tim Cuthbertson says

    You know your stuff. I was a database admin for most of my 34 year career, with at least my last 10 years specializing in storage configuration for maximum performance. I was pretty much floored when you mentioned turning off OS caching for dbms’s that do their own cache management. I only learned that in maybe my final 6 or 7 years. Props!

  3. Inari says

    I am extremely surprised you’re leaving out Microsoft’s DB stack. Given the recent changes in licensing, even the free versions support OLTP and everything necessary to run a game server. Not to mention the costs are significantly lower than that of Oracle in the non free versions.

    And while yes, this is focusing more on RAID, there are a LOT of details about MSSQL deployment and configuration that needs to be taken in consideration if RAID is used. It would be nice to see an expansion of the chapter to include information on alternative SQL server products and how they interact with RAID.

    • "No Bugs" Hare says

      This post is not really DB-specific; also – everything said in this post, applies to pretty much all the DBs (at least to all RDBMS) out there. For comparison of different DBMS for OLTP environments – see – with MS SQL discussed in detail.

      > And while yes, this is focusing more on RAID, there are a LOT of details about MSSQL deployment and configuration that needs to be taken in consideration if RAID is used.

      Being not an MS SQL person – I am not really aware of anything specific for RAID+MSSQL which doesn’t apply across the board (NB: exact commands/switches etc. are out of scope by design; it is concepts which are important).

      Care to elaborate about MSSQL+RAID specifics?

Leave a Reply

Your email address will not be published. Required fields are marked *