Scalability for MOGs - IT Hare on Soft.ware

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

Chapter 8(a) from “beta” Volume III

In Vol. II, we finished our first round of discussions about Client-Side (with more to follow in Vol. V), so now we can start considering Server-Side.

On the Server-Side, one of the very first things we need to take into account, is Scalability.

In general, poor Scalability isn’t usually observed until post-deployment; on the other hand –

we DO need to take Scalability into account while architecting and developing our system.

Otherwise – it can easily happen that at our finest hour – say, when we have a million players willing to play – we won’t be able to handle even half of them (or our Servers will be slowed down to the point of being completely unplayable). In this case, instead of being the finest hour – it can easily become a disaster, with our game beginning a downward spiral towards oblivion.¹

Let’s note that Scalability is a big topic, and that this Chapter is not the whole discussion on Scalability within this book. In particular, we already discussed the mechanics of scaling for seamless Game World Servers in Vol. I’s chapter on Communications, and we’ll also discuss DB Scalability in detail in Vol. VI’s chapter on Databases; in addition, there will be another round of discussion on Scalability (as it applies to mature highly loaded systems), in Vol. IX.

In this Chapter, we’ll merely try to define a few terms – and to describe some very high-level approaches to Scalability (which, as we’ll see a bit later, will affect our architecture greatly).

¹ no relation to Elder Scrolls

Performance != Scalability, but… Performance Still Matters

What to Scale?

In the context of multiplayer games, there are traditionally two main points which need to be scaled. The first one – is scaling Game Worlds; the second such point is scaling our Database.²

In addition, at some point you may need to scale your Matchmaking (and if you ever need to scale your Cashier and Payments – you’re certainly in luck <wink />) – but from our current abstract perspective, they will be scaled in a manner similar to scaling your Game Worlds.

As a result, at least for the time being, we will concentrate on scaling of (a) Game Worlds, and (b) Database.

² more generally – persistent storage, but for the purposes of this book we’ll name it Database anyway

Scaling Up – Doesn’t Help Much

To start discussing Scalability – we need to define some terms. In theory, there are two types of Scalability: “scaling up” and “scaling out”. “Scaling up” refers to just buying a better hardware Server; it is a purely hardware solution which doesn’t require anything from the software side.

Scaling up sounds like a really nice idea. Hey, we don’t need to think about scaling of our software – just tell admins to buy a new Server, and we’re done!

It all sounds very nice and easy for us developers. Unfortunately, there is one teensy problem with scaling up; the problem is that

as of 2017, scaling up doesn’t help much.

The reason for it is two-fold. First, we need to mention that

for scaling up, it is per-core performance which matters.

As we don’t want/need to make our software scalable when scaling up – then if our program runs on N cores, buying a new Server won’t give it any performance boost other than (N*performance-of-new-cores)/(N*performance-of-old-cores), i.e. the ratio of per-core performances; this pretty much translates into the statement above.

The second observation leading to scaling up not working, is the following:

These days, the best per-core performance we can hope for – is an around-4GHz CPU core.³

Moreover, this didn’t change much over the last 15 years or so.⁴

Based on these two observations, we can see that if you’re already running a 3GHz CPU (either a server one or a desktop one) – possibilities of scaling it up are extremely limited; all we’re speaking about – is a 1.5x difference GHz-wise; in addition – it might be possible to get an additional 2x per-core advantage or so due to better caches etc. etc. – but well, at least for scalability of Game Worlds even 1.5x*2x=3x is not really much.
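The arithmetic above can be written down as a tiny sketch. The function name and its parameters are purely illustrative (they are not from the book), and the 2x per-clock gain is the same rough guess as in the text rather than a measured figure.

```python
def scale_up_bound(old_ghz: float, new_ghz: float, per_clock_gain: float) -> float:
    """Best-case speedup from replacing the box while keeping the same core count.

    The number of cores N cancels out, so only per-core performance matters.
    """
    return (new_ghz / old_ghz) * per_clock_gain


# Assumed numbers: a 3GHz core replaced by a 4.5GHz one (1.5x GHz-wise),
# plus a guessed 2x per-clock improvement from better caches etc.
bound = scale_up_bound(old_ghz=3.0, new_ghz=4.5, per_clock_gain=2.0)
print(f"best-case scale-up gain: {bound:.1f}x")  # prints "best-case scale-up gain: 3.0x"
```

Whatever exact numbers are plugged in, the result stays in single digits, which is the whole point: a hardware upgrade alone buys only a small constant factor.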

³ over the past 10 years, IBM Power was the CPU having the highest clock rate – and the fastest Power goes around 4.7-5GHz. Recently, AMD has reached 4.7GHz too, and Intel was rumored to release a 5.1GHz CPU soon – though as of early 2017 it seems that these expectations didn’t materialize.

⁴ granted, per-MHz performance did increase over the last 15 years, but gains of the order of 3-4x or so do not help much with scalability

Improving Performance to Avoid the Need to Scale

In addition to Scaling Up – at least in theory, there is one more thing which we might be able to do to deal with the load; it is improving performance of our app, so that we can run our load on the same hardware.⁵

Just like Scaling Up, improving performance doesn’t provide infinite scalability. However, it happens that it is all about numbers. Let’s take a closer look at them.

Let’s consider a game which needs to handle 100’000 simultaneous players. As noted above – we have two major points to scale (Game Worlds and Database).

When scaling Game Worlds, and assuming a typical industry number of being able to support 100 players/core – we’ll need around 1000 cores to support the desired 100’000 players. Honestly – improving this performance by 1000x so we can run it on a single core, is not really realistic. It means that improving the performance of Game World Servers (while still being important to reduce Server costs(!)) won’t save us from the task of ensuring Scalability for Game Worlds.

On the other hand, if trying to scale our Database for the same game with 100’000 players – we’ll usually be speaking about the numbers of 0.1-1 DB transaction/player/minute (see also the TODO section below). It translates into 10’000-100’000 DB transactions/minute ~= 170-1700 DB transactions/second. And these numbers, as we’ll see in Vol. VI, can be quite achievable even over one single DB connection (that is, if we do a really good job optimizing our performance).

In practice, it means that for scaling Databases, at least in certain cases, we can try to avoid dealing with Scalability until our game becomes really large (and usually, unless we’re an AAA company, it doesn’t happen overnight).
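For those who prefer to see the numbers computed, here is a back-of-the-envelope sketch of the estimates above. The helper names are hypothetical, and the inputs (100 players/core, 0.1-1 DB transaction/player/minute) are just the rough figures from the text, so your own game will need its own measurements.

```python
import math


def cores_needed(players: int, players_per_core: int) -> int:
    """Cores required for Game Worlds at a given players-per-core figure."""
    return math.ceil(players / players_per_core)


def db_tps(players: int, tx_per_player_per_minute: float) -> float:
    """DB transactions per second for a given per-player transaction rate."""
    return players * tx_per_player_per_minute / 60.0


players = 100_000
print(cores_needed(players, 100))   # 1000 cores for Game Worlds
print(round(db_tps(players, 0.1)))  # 167 transactions/second at the low end
print(round(db_tps(players, 1.0)))  # 1667 transactions/second at the high end
```

The asymmetry is what matters here: a 1000x gap for Game Worlds cannot be closed by optimization, while a DB load in the low thousands of transactions per second is in the same ballpark as what a single well-optimized DB can handle.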

⁵ I don’t want to engage in discussion whether “improving performance” is actually one of the forms of “Scaling Up” or not; for the purposes of this book – let’s use “Scaling Up” in a narrow sense of “purely hardware upgrade”, so “improving performance” is a separate concept.

Scaling Out

Even for Databases, and even when trying to avoid dealing with Scalability by improving performance really hard, we’re likely to hit a wall somewhere between 10’000 and 1’000’000 simultaneous players. And as discussed above – for Game Worlds neither Scaling Up nor improving performance will allow us to achieve the required performance.

To deal with it –

we’ll need to Scale Out.

Unlike Scaling Up which relies on performance of one single core/Server-box – Scaling Out is all about spreading the load across different CPU cores (Server boxes etc.). As noted above, this is the only feasible way to scale your programs these days (when you DO need to scale, that is). As a result – whenever I’m speaking about “Scalability” or “scaling” within this book – I will mean “Scaling Out”.

On In-Memory States and Multi-Player Games

The show must go on

The show must go on, yeah yeah

Ooh, inside my heart is breaking

My make-up may be flaking

But my smile still stays on

— Queen —

After discussing the very basics of Scalability as such – let’s come back to our topic of multiplayer games.

Let’s observe that most multi-player games out there can be seen as a sequence of “game events”. Let’s define (very loosely) a multiplayer Game Event as “some dynamic interaction which involves more than one player, is limited in time and has an obviously observable outcome”. Examples of Game Events include such seemingly different things as:

Arena match
Poker hand
RPG fight (or talk)

Actually, all the multi-player games I know look to the player as a sequence of Game Events – with, maybe, some interspersed interactions which involve only one single player (for example, interactions with game environment and NPCs but not with other PCs).

Now, as we realize that from the multi-player point of view, our game is just a sequence of Game Events, we can make a few further observations.

Observation 8.1. If a Game Event is interrupted for more than a few dozen seconds,⁶ it is next-to-impossible to get all the players who participated in the Game Event, back to it.

For example, if you are running a bingo game with a hundred players, and you disrupt it for 10 minutes for technical reasons, you won’t be able to continue it in a manner which is fair to all the players, at the very least because you won’t be able to get all those 100 players back into playing at the same time. The problem is all about numbers: for a two-player Game Event getting these two players back might work, but for 10+ – succeeding in all the players returning to play at the same time is very unlikely.⁷

Of course, if your game is a final of a big tournament with a big cash prize, you’ll probably be able to reschedule it for the next day or something, but gathering the same people back after 15 minutes or so of a game being unresponsive, won’t be possible for the vast majority of Game Events out there.

Observation 8.2. If a Game Event is interrupted for more than a few dozen seconds, then even if we are able to reconstruct the same Game World State technically, it won’t be the same from the player’s point of view. Moreover, any substantial interrupt of the Game Event can easily provide an unfair advantage to some of the players.

Being interrupted in the middle of a sword fight and being asked to resume just from the middle of it (which was “who-knows-how-many-milliseconds-before-you-need-to-press-the-button”) – is not likely to be satisfying for the players. In addition, if the interrupt is rather long – then from the players’ perspective they will stay in a nervous state of “what exactly is my position within this fight” (which is quite unusual and therefore rather uncomfortable compared to usual “I am preparing for this fight”); this is usually worse than knowing that the whole thing will be rolled back and you can start anew (of course, if one of them was winning – it will be unfair, but well – there is no ideally fair solution here).

As for unfair advantages – quite often, it is possible for a player to obtain some important game-changing information while the Game Event is interrupted. This information can be pretty much anything – from noticing the start of an opponent’s move and preparing to counter it during the pause, to being able to run a sophisticated analysis tool in the middle of an interrupted chess blitz match.

These effects are known in the game industry – though way too often they’re taken into account only during deployment as an after-thought, and this can easily lead to ugly solutions and even uglier resulting problems. In “In-Memory Game World States: a Natural Fit for ‘No Bugs’ Rule of Thumb” section below, we’ll see an example of rather crazy crash recovery logic of a large multi-million-dollar game: after the crash they first restored a perfectly correct current Game World State (with this restore itself causing lots of trouble) – just to follow it with rolling this perfectly-correct-current-Game-World-State back to the start of Game Event – exactly because they weren’t able to resume the game due to lack of players.

⁶ unless we’re speaking about big tournaments or large prizes, I’d put more or less typical time at about 1-2 minutes (with all the necessary disclaimers about it depending on your game etc. etc.).

⁷ As always, there are some exceptions here and there, but they’re few and far between

‘No Bugs’ Rule of Thumb for Multiplayer Games

The weight of evidence for an extraordinary claim

must be proportioned to its strangeness

— Pierre-Simon Laplace —

Now, armed with these two observations, we can try to figure out what needs to be done if our Server app crashes in the middle of the Game Event (which inevitably causes a large interrupt in game play – that is, unless we’re going for full-scale fault tolerance for all our Servers, and the crash was a hardware one⁸). Personally, I prefer to state it as the following

“’No Bugs’ Rule of Thumb for Multiplayer Games”:

Whenever a Game Event is interrupted for a significant time, as a rule of thumb it is better to roll back the interrupted Game Event rather than trying to restore the exact Game World State in the middle of the Game Event.

This statement is very bold – and as such, requires quite a bit of explanation. Let’s consider two options: the first one is to restore the exact Game World State at the moment of crash, and the second one is to roll back our interrupted Game Event (i.e. we restore the exact Game World State as of the beginning of current Game Event); moreover, let’s assume that both options are feasible to implement (which is not often the case for Option 1, but for our current analysis we can afford a bit of daydreaming).

In both cases, the player’s experience will be hurt. Of course, roll back in Option 2 obviously changes the game landscape. However, as it follows from Observation 8.2, restoring exact Game World State in Option 1 is also far from ideal. I’d say that from the point of view of “providing the least possible disruption to the players”, for quite a few games out there both approaches are roughly equal (=”having bad impact on the game of the same order of magnitude”). In other words – whatever we’re doing after the crash, gameplay will be hurt, and players will be inevitably unhappy (and some players will be inevitably more unhappy than the others).

With this in mind, we should take into account considerations which have led us to Observation 8.1 – those about getting players back. If going the route of Option 1 (taking “restoring Game Event in the middle” route), we’re basically saying that “everybody who participated in that Game Event, needs to stay online for some unspecified time, just polling to see when we’re able to relaunch our Game Servers; anybody who doesn’t do it – will be punished by losing the Game Event by default”. Essentially, we’ll be punishing players for our own problems – which is certainly not good (and players tend to hate it too). On the other hand, if dealing with the crash via Option 2 – we can say “sorry, Bad Thing™ did happen – but at least we rolled back all the current Game Events, so you can come whenever-you-want and continue playing at your convenience”.

Overall, from what I’ve seen (both as a developer and as a player) – I like the second option (the “roll back to start of Game Event” one) much much better than the first one, at the very least – for most of the games out there. Which is exactly what is summarized in the “’No Bugs’ Rule of Thumb” boldly stated in bold⁹ above.

BTW, it is exactly the same strategy which has been traditionally applied to at least one brick-and-mortar game for centuries. As one poker pro has explained it: if a fire breaks out in a brick-and-mortar poker room while the hand is being played, then the whole hand gets cancelled, and all the chips are returned to their owners “as of the beginning of the hand” – regardless of the cards they were dealt and regardless of the state of the hand in general. Then, when the fire is extinguished, players can start a new hand – or some of them can leave; it is their choice, and certainly not the choice of the casino owner.

⁸ a software crash will be faithfully repeated on a reserve node (just as it happened with Ariane 5 rocket, see [Wikipedia.Cluster.LaunchFailure] for details), so no kind of Fault Tolerance will help against it.

⁹ pun intended

Good for Us (as developers)

Now, let’s note that all this analysis above stands even before we take into account the complexities of implementing perfectly-durable-intra-Game-Event-Game-World-States. In practice – these complexities are so big (essentially leading to pushing each-and-every-player-action into some-kind-of-durable-database – which in turn leads to increasing DB load anywhere from 10x to 1000x, and scaling DBs with that many transactions is going to be next-to-impossible) – that we’d be likely to choose Option 2 just to avoid these complexities. However, my point here is different; what I am trying to say is that

In case of crash, rolling back Game Event is usually a Good Thing™ from the player’s point of view

And the fact it also simplifies development – well, it just means that we as developers got lucky: if the simplest-for-developers solution (the one with in-memory Game World States – see “In-Memory Game World States: a Natural Fit for ‘No Bugs’ Rule of Thumb” section below) happens to be the best one for players too, it is certainly a Good Thing™ for everybody involved.

Exception: Stock Exchanges

A word of caution for stock exchanges. If your game is a stock exchange, you generally do need to save every-player’s-action persistently (to ensure strict correctness even in case of Game Server loss), so rolling back is not usually an option. Of course, technically we can say that with stock exchanges each single bid constitutes a Game Event, but well – it won’t really simplify our jobs down the road.

That being said, it should be noted that even for stock exchanges at least the Classical Game Architecture described in Chapter 9, has been observed to work very well despite DB transaction numbers being rather large. At least in part, it can be attributed to two further observations: first, that for stock exchanges the number of user interactions is usually not as high as for an MMORPG,¹⁰ and second, that the price of the hardware is generally much less of a problem for stock exchanges than for other types of games.

¹⁰ if we exclude post-2007-or-so NASDAQ with lots of bots playing

NOT applicable to Single-Player Games

It should also be noted that the logic above (and especially Observation 8.1) does not apply to single-player games (this includes over-the-Internet single-player games such as Internet slot machines etc.). For a single-player game (whether Internet-based or not), the whole thing tends to work exactly the other way around: there is only one player, and she expects to resume the game exactly at the point when the whole thing was interrupted; moreover, the interrupted gameplay is usually already supported for single-player games, so handling it differently for the crash of our Servers will feel pretty bad. Even worse, with single-player games where the player is playing against the game (such as casinos), rolling the game back for any reason will have pretty bad implications and will raise pretty bad suspicions too.

In short – single-player games and multi-player games are two extremely different beasts in this regard, so observations about multi-player games SHOULD NOT be blindly extended to the single-player ones, and vice versa.

In-Memory Game World States: a Natural Fit for ‘No Bugs’ Rule of Thumb

From the discussion above it follows that if we had a Server crash with a subsequent reboot, then (as long as crash-reboot cycle took more than 2 minutes or so) – we’ll need to roll back the interrupted Game Event, even if we have perfect data as of the exact moment of crash.

Now comes an all-important

Observation 8.3. Hey, but if we keep current Game Event in-memory only, we’ll get exactly the behavior we need without any “rollback” efforts (and will lower the DB load by orders of magnitude too)

It means that for most of the multi-player games out there, we can use the following paradigm:

we divide the game into Game Events, which need to be rolled back in case of Server crash or something
while Game Event is in progress, this progress is maintained as a part of in-memory Game World State
Game World States¹¹ SHOULD be written to DB only at the end of each Game Event, and not while the Game Event is in progress.
- As a side benefit – this allows for the result of the Game Event to be written to DB atomically, so if there was one artifact for two players before they fight – we can be 100% sure that in DB there will be exactly one artifact after the fight regardless of whatever-has-happened.
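To make the paradigm above a bit more tangible, here is a minimal sketch of a Game Event which lives in memory and touches the DB only once, at the very end. Everything in it is an assumption for illustration: the `Fight` class, the `inventory` table, and the use of SQLite as a stand-in for whatever RDBMS you actually run; it is not the implementation of any specific game.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")  # stand-in for the real persistent DB
    db.execute("CREATE TABLE inventory (player TEXT PRIMARY KEY, artifacts INTEGER NOT NULL)")
    db.executemany("INSERT INTO inventory VALUES (?, ?)", [("alice", 1), ("bob", 0)])
    db.commit()
    return db


class Fight:
    """Hypothetical Game Event: its progress is kept in memory until finish()."""

    def __init__(self, db: sqlite3.Connection, a: str, b: str):
        self.db = db
        self.players = (a, b)
        self.hp = {a: 100, b: 100}  # intra-Game-Event state, never written to DB

    def hit(self, attacker: str, damage: int) -> None:
        a, b = self.players
        target = b if attacker == a else a
        self.hp[target] = max(0, self.hp[target] - damage)  # no DB access here

    def finish(self) -> str:
        a, b = self.players
        winner, loser = (a, b) if self.hp[a] >= self.hp[b] else (b, a)
        with self.db:  # one transaction: committed as a whole, or rolled back on error
            moved = self.db.execute(
                "SELECT artifacts FROM inventory WHERE player = ?", (loser,)
            ).fetchone()[0]
            self.db.execute("UPDATE inventory SET artifacts = 0 WHERE player = ?", (loser,))
            self.db.execute(
                "UPDATE inventory SET artifacts = artifacts + ? WHERE player = ?",
                (moved, winner),
            )
        return winner


db = open_db()
fight = Fight(db, "alice", "bob")
fight.hit("bob", 100)  # bob knocks alice out; still nothing in the DB
winner = fight.finish()
total = db.execute("SELECT SUM(artifacts) FROM inventory").fetchone()[0]
print(winner, total)  # bob 1
```

Note that the two `UPDATE` statements go into a single transaction, so the artifact is moved atomically: the DB never sees a state with zero or two artifacts, whatever happens in the middle of the fight.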

Bingo! We can have our cake and eat it too! We’ve just got a very high-performance system (in-memory states without syncing to DB are about as fast as they go) – and it also provides very good player experience (well, as good as possible after something went horribly wrong).

BTW, if you choose to ignore this observation – you still can create a workable system, but the things can easily get rather ugly. I once saw an architecture which wrote all the user actions to an in-memory DB right away – effectively storing perfectly current Game World State in that in-memory DB. It took them quite an effort to implement this DB, but it did work. However – whenever their Server crashed – they needed to roll-forward the whole thing, which in turn, in quite a few cases led to the need to fix a bug-which-caused-the-crash “right on the fly” before roll-forward could be completed(!); as a result – the roll-forward implementation of the in-memory DB has been observed to cause quite a few long downtimes.

To add insult to injury – in fact, all these efforts and complexities of roll-forward were completely pointless – because, whenever their Server crashed, the recovery procedure went as follows:

first, they rolled forward all the DB logs to get a consistent DB state with all the user actions accounted for (including those actions within unfinished Game Events);¹²
and right after the roll-forward was completed – they ran an application-level rollback to remove all those unfinished Game Events from DB; the latter was necessary exactly because of the problems with getting the players back to the same Game Event (see Observation 8.1).

In short – they made a complicated custom DB-level rollforward, only to follow it up with a complicated custom application-level rollback.

A competing system (couldn’t help myself from bragging it was mine<wink />), simply didn’t write all those unfinished Game Events into the DB while Game Events were in progress (and wrote the whole Game Event after it was completed, instead). In case of crash (BTW, crashes were extremely rare) – it simply started from DB state (which, given the logic above, corresponded to the end of last-Game-Event), without any additional rollbacks.¹³ The whole thing was much simpler, scaled much better, and was observed to be much more reliable than the competing in-memory-DB-based one described above.

¹¹ or a part of the Game World State, which corresponds to the specific Game Event

¹² and, as noted above, completing the roll-forward could require fixing a bug, ouch!

¹³ In case of DB crash (though not in case of app crash), a DB-level roll-forward to get consistent DB state was still necessary, but – as DB was a standard log-based RDBMS (and RDBMSs have been doing log rollforwards for decades now), it worked like a charm

On Data Consistency

One thing which comes to mind when considering such in-memory state-based processing models, is a question about data consistency: “hey, how can losing information and data consistency possibly be a good thing?”. Here I need to mention that I am all for consistency; there is still a question, however, how to define this consistency.

As follows from the discussion above, from the player’s point of view (GDD, business requirements, etc.) it is necessary to include “interrupts” into our definition of consistency; and as soon as we’ve done it along the lines above – we will need to say something along the lines of “if the game was interrupted for significant time in the middle of Game Event, then the consistent state is defined as the state at the beginning of the interrupted Game Event”.

And as soon as we say it – our in-memory Game World State (synced to DB at the end of each Game Event) becomes a perfectly valid implementation (and a damn convenient one too <wink />) of the data consistency under the definition above. While another implementation discussed above – the one based on in-memory DB with a subsequent app-level rollback – is also valid under the same definition, it happens to be much less convenient in the real world.

In-Memory State Summary

TL;DR:

for multi-player games, if you disrupt a Game Event (such as match, hand, or fight) for more than a few dozen seconds – you won’t be able to continue it anyway because you won’t be able to get all the players-within-this-Game-Event back.
as a result, you’ll most likely need to roll your whole Game Event back.
and to implement this rolling-back-to-the-beginning-of-the-Game-Event, in-memory states (with syncing to DB at the end of each Game Event) are very natural and convenient.

As a result, the following processing model tends to work very well for multi-player games, so you SHOULD consider it very seriously:

Your gameflow needs to be separated into Game Events.
- These Game Events SHOULD be more-or-less natural from player’s point of view
You store intra-Game-Event Game World State in-memory only.
You synchronize your in-memory Game World State with DB around the end of each Game Event.
If your Server crashes in the middle of the Game Event – you lose your in-memory Game World State.
- On restart – your system will restore itself from the DB, which corresponds to rolling the state back to the beginning of the interrupted Game Event.
  - It is a Good Thing™, as this is exactly what is required in the vast majority of cases.¹⁴
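The crash-and-restart part of this processing model can be illustrated with a toy sketch. Both `ChipsDB` and `PokerTable` are hypothetical names invented for this example, and `ChipsDB` is merely a dict-based stand-in for a durable DB with an atomic commit; the point is only to show that the “rollback” comes for free.

```python
class ChipsDB:
    """Hypothetical stand-in for a durable DB: holds committed chip counts only."""

    def __init__(self, chips: dict):
        self._chips = dict(chips)

    def load(self) -> dict:
        return dict(self._chips)

    def commit(self, chips: dict) -> None:
        self._chips = dict(chips)  # assumed to be atomic in a real DB


class PokerTable:
    """In-memory Game World State; one hand is one Game Event."""

    def __init__(self, db: ChipsDB):
        self.db = db
        self.chips = db.load()  # restore = read the last committed state
        self.pot = 0

    def bet(self, player: str, amount: int) -> None:
        self.chips[player] -= amount  # in-memory only
        self.pot += amount

    def end_hand(self, winner: str) -> None:
        self.chips[winner] += self.pot
        self.pot = 0
        self.db.commit(self.chips)  # the only DB write per Game Event


db = ChipsDB({"ann": 100, "ben": 100})
table = PokerTable(db)
table.bet("ann", 30)
table.bet("ben", 30)
table.end_hand("ann")   # hand 1 is completed and committed: ann 130, ben 70

table.bet("ann", 50)    # hand 2 is in progress...
del table               # ...when the Server "crashes": in-memory state is gone

table = PokerTable(db)  # restart: no rollback code is run anywhere
print(table.chips)      # {'ann': 130, 'ben': 70}
```

After the restart, the chips are exactly as of the beginning of the interrupted hand, which is the same outcome as in the brick-and-mortar poker room; and we didn’t write a single line of rollback logic to get it.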

¹⁴ YMMV, void where prohibited

[[To Be Continued…

This concludes beta Chapter 8(a) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with stock exchanges in between)”.

Stay tuned for beta Chapter 8(b), where we’ll discuss different scalability schemas]]

References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.