Avoiding Ugly Afterthoughts. Part a. From Writing for Cross-Platform, to Writing for Debugging and Production Post-Mortem, with Error Handling in between

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

[[This is Chapter 12(a), starting “beta” Volume IV of the upcoming book “Development&Deployment of Massively Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing”. All the content published during Beta Testing is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MMOG: Table of Contents.]]

We’ve finished Volume 1 (“Architecture and Pre-Development”) with a list of “tools of the trade” necessary for pretty much any kind of serious development, ranging from source control to static code analysers. Now it is time to deal with another boring-but-necessary thing – a list of items to be kept in mind while coding.

At first glance, remembering about these things is a pain in the neck (and admittedly it is). But on the other hand, if you ignore these potential issues, addressing them later on as an afterthought will become way too time-consuming (and the end-result will look really ugly more often than not). I’ve seen many cases where the code base became so cluttered with these problems that getting rid of them at a later stage was outright impossible, and a complete rewrite was needed 🙁 . In short – it is MUCH better to keep these things in mind from the very beginning rather than to try fixing them as an afterthought.

Writing for cross-platform

One of the biggest mistakes you can make when developing a cross-platform project, is an approach of “we’ll develop it just for this platform, using everything we need for it, and deal with porting it to the other platforms later”.

With C/C++, it is an almost-sure way to disaster. But even with supposedly cross-platform languages, there are many ways to bind them to a specific platform (JNI-to-call-COM-object – directly or via JACOB – is not unheard of, and with C# similar things happen all the time).

If you go this “one-platform-now-other-platforms-later” route, then when your game is almost ready and you’re about to port it to a different platform – “Whammo!”, you suddenly realize that your code has sooo many dependencies on your development platform (usually Windows) that porting your game would amount to rewriting the whole thing. In one company I’ve even seen guys starting to implement windows.h and all the relevant APIs on Linux (effectively starting a homegrown WINE project¹) – needless to say, this effort didn’t go anywhere.

WINE: “Wine (recursive acronym for Wine Is Not an Emulator) is a free and open source compatibility layer software application that aims to allow applications designed for Microsoft Windows to run on Unix-like operating systems” (Wikipedia)

If your project is intended to be cross-platform, then at the very least (and regardless of using C++ or a cross-platform language), you should have a very firm policy of “we do not use platform-specific stuff, period”.² And if you see a violation of this policy, you should drop the offending piece of code immediately, even if it delays your other work. Otherwise you’ll start going down a slippery road of adding more and more platform dependencies, eventually becoming so intertwined with the platform that any attempt to disentangle will lead to an almost-complete rewrite 🙁 .

However, as we all know, declaring intentions is one thing, and following them is a completely different matter. In other words, all the good intentions are pointless unless they’re enforced (see, for example, [GDC2015.Butcher]). To make sure that your whole team (yourself included) follows your own guidelines –

it is strongly recommended to have your build server compile your game for at least two sufficiently different platforms from the very beginning

(ok, let’s make it “within a month from when you start development”). As soon as your game client compiles (and then runs) on, say, Windows and Mac OS (or your server compiles-and-runs on Windows and Linux), the chances that you have fatal dependencies on any of these platforms are lowered by many orders of magnitude.

Note though, that the platforms do need to be sufficiently different. Compiling your game under Windows 8 and Windows 10 is clearly not enough (that is, if you’re going to aim for Mac or console), and neither is compiling for Windows and Windows Phone, Windows and Xbox, or Mac and iOS – all these pairs are way too interrelated to ensure that your code is not too platform-dependent.

Note that if you’re developing in C++, then the task of staying cross-platform becomes even more complicated; with C++, you certainly need the two-platform approach above, plus a few other things in place to make sure your game stays a cross-platform one. More on it in Chapter [[TODO]].

¹ and WINE took 15 years to get to version 1.0

² of course, if it is only your client which is intended to be cross-platform, this policy should apply only to the client

Error Handling

Error handling is one of the most annoying things in software development, and it is forgotten about way too often. In short – it is one of those things which needs to be done right. If it is not, you will be spending lots of time on debugging error conditions which weren’t handled, and left your program in an invalid state.

In spite of all the criticism of “exceptions leading to difficult-to-find errors” (started at least 20 years ago by [Cargill], and recently resurrected in spades by Google in [Go.FAQ] and [Google.C++]), I am still sure that for development of games and other non-life-or-death stuff³ the alternative of error codes (with or without multiple-return-values) is even worse (IMNSHO, MUCH worse). Yes, an exception (if not handled correctly) MIGHT leave your object in an inconsistent state (honestly, I have yet to see it in practice, but it indeed MIGHT – there are a lot of plausible examples out there, it is just that I didn’t run into them causing trouble in the real world⁴). With error codes, however, forgetting to check the error code is much more likely (and I have seen a damn lot of missing-error-code-check bugs in reality).
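To make the “forgotten error code” problem concrete, here is a minimal Python sketch. The purchase scenario and all the names in it (`debit_with_code`, `debit_or_raise`, and so on) are hypothetical and invented for illustration only; they are not taken from any real code base.

```python
# Hypothetical example: debit a player's balance in two error-handling styles.

ERR_OK = 0
ERR_INSUFFICIENT_FUNDS = 1


def debit_with_code(balances, player, amount):
    """Error-code style: returns ERR_OK or ERR_INSUFFICIENT_FUNDS."""
    if balances[player] < amount:
        return ERR_INSUFFICIENT_FUNDS
    balances[player] -= amount
    return ERR_OK


class InsufficientFunds(Exception):
    """Raised when a player cannot afford a debit."""


def debit_or_raise(balances, player, amount):
    """Exception style: a failure cannot be ignored by accident."""
    if balances[player] < amount:
        raise InsufficientFunds(f"{player} has {balances[player]}, needs {amount}")
    balances[player] -= amount


def buy_item_with_code(balances, inventory, player, price, item):
    debit_with_code(balances, player, price)  # BUG: the return value is never checked
    inventory.append(item)                    # so the item is granted even if the debit failed


def buy_item_or_raise(balances, inventory, player, price, item):
    debit_or_raise(balances, player, price)   # raises on failure, so the next line is skipped
    inventory.append(item)
```

With a balance of 5 and a price of 10, `buy_item_with_code` silently hands out the item for free, while `buy_item_or_raise` propagates `InsufficientFunds` to whoever is prepared to handle it. Nothing in the first version looks wrong at the call site, which is exactly why this kind of bug is so easy to miss.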

When looking at those exception-related examples which cause trouble, we can see that they apply mostly if we need to consider that each and every function call can throw an exception. For our non-life-and-death gaming applications, this is generally not true. Let’s see why this is the case. Most of the trouble in achieving perfectly exception-safe programs is related to the dreaded question of “what will happen if this new operator throws an exception?” While it is a perfectly valid question in theory, in practice for business-like apps the answer is usually

we don’t really care. If allocation of 50 bytes causes an “out of memory” error, we’re probably already long dead because of unacceptable swapping. And even if we disabled swap file – chances that we will recover from this condition, are infinitesimally small

While this logic doesn’t fly at all for kernels or nuclear warheads (or any library which might be used in these contexts) – it has been seen to work perfectly even for stock exchanges, not to mention other game genres. We’ll discuss more of memory allocation error handling a few paragraphs below.

To avoid dealing with a perception of “each and every line of code MAY throw”, the best approach would be to have compile-time-enforced “catch or declare” rule (such as the one in Java⁵). In C++ it seems that there is no good alternative at the moment 🙁 .⁶

Also a few hints regarding exceptions in the context of games (for other applications, especially life-or-death ones, your mileage may vary greatly):

Instead of manually releasing your resources in finally blocks, use RAII (in C++) or try-with-resources/using-statement/with-statement in languages such as Java, C#, or Python. The reason is simple: finally blocks are just way too error-prone (and are poorly readable too). With respect to C++, see also discussion on RAII in Chapter [[TODO]].
DON’T use Java’s finalize(), C#’s Finalize(), or Python’s __del__()⁷ to do any meaningful job (asserting that your object has already been destructed correctly is usually ok). While there are certain use cases when finalizers might be more useful than harmful, they’re so few and far between that it is better to prohibit using finalizers completely rather than to allow them and deal with the resulting problems. The biggest problem with finalizers for practical purposes is that they’re called when (and only when) the garbage collector decides to call them. It means that your program becomes non-deterministic, and therefore not really testable (i.e. what was working on one run may fail on another run just because GC worked a bit differently this time; been there, seen that, and it wasn’t a picnic to put it mildly).
Whether to consider lack of memory an exception: well, when you’re writing a game, if you’re not doing outright stupid things (such as “allocating a buffer of the 64-bit size x, where x has come over the network”), it won’t really matter too much in practice. Sure, there are lots of zealots out there saying that all your code should have strong exception safety guarantees, but correct handling of each and every allocation is usually way too cumbersome for our non-life-and-death projects. Whether it is your client or your server that has run out of memory – the chances that you’re able to recover from a failed-alloc-of-50-bytes are minimal anyway, so you’re already in lots of trouble (and will most likely terminate really soon regardless of the formal exception safety guarantees).
- While we’re at it – no, it is not sufficient to catch the memory exception after you’ve tried to allocate that buffer of 64-bit size x; instead, you should check that x is sane before trying to allocate. The reason for it should be rather obvious: on most modern systems, allocating a buffer of several Gigabytes in size will be successful, but it might easily cause active swapping, slowing down your game to the point of being outright unusable (that’s even if your code is 100% formally exception-safe, and despite the fact that no exception has happened).
- Also keep in mind that for games such as stock exchanges it is often safer simply to terminate on lack of memory rather than to risk remaining in an incorrect state. Such termination-on-lack-of-memory also brings your app closer to the holy grail of formally strong exception safety guarantees (more on it below).
Try to keep your constructors exception-free (under the mentioned-above assumption that memory allocations of a sane size do not cause exceptions).
For your event processing object, follow VALIDATE-CALCULATE-MODIFY pattern described in Chapter V. Among other advantages, it makes exception handling within VALIDATE and CALCULATE phases trivial (as the state didn’t change, there is nothing to roll back).
Overall, strong exception safety (a.k.a. commit-or-rollback semantics) is a Good Thing(tm) to have for your game; that is, if you exclude each-and-every-allocation from the potential sources of exceptions. Under this assumption, achieving strong exception safety is not too time-consuming in practice. The whole thing will require rather little effort, simply because most of the lines of your code, especially in an event-driven program described in Chapter V, won’t throw.
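Two of the hints above (scoped resource release instead of finally blocks, and checking a network-supplied size before allocating) can be illustrated with one short Python sketch. The wire format (an 8-byte big-endian length prefix), the `MAX_MESSAGE_SIZE` limit and all the function names are assumptions made up for this example.

```python
import io
import struct

MAX_MESSAGE_SIZE = 64 * 1024  # assumed limit; pick one that fits your own protocol


class ProtocolError(Exception):
    """Raised when the peer sends something that violates the (hypothetical) protocol."""


def read_message(stream):
    """Read one length-prefixed message: an 8-byte big-endian size, then the payload."""
    header = stream.read(8)
    if len(header) != 8:
        raise ProtocolError("truncated header")
    (size,) = struct.unpack(">Q", header)
    # Check that the size is sane BEFORE reading or allocating anything of that size;
    # catching MemoryError after the fact is not enough.
    if size > MAX_MESSAGE_SIZE:
        raise ProtocolError(f"message size {size} exceeds limit {MAX_MESSAGE_SIZE}")
    payload = stream.read(size)
    if len(payload) != size:
        raise ProtocolError("truncated payload")
    return payload


def read_first_message(data):
    # The with-statement closes the stream on every exit path, exceptions included,
    # so there is no finally block to forget or to get wrong.
    with io.BytesIO(data) as stream:
        return read_message(stream)
```

Here an in-memory `io.BytesIO` stands in for a socket or a file; the same `with` pattern applies to any object that supports the context manager protocol.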

³ when speaking about a kernel (like Linus in [Linus]) or nuclear warheads (like Hoare in [Hoare]), the exception-vs-error-code decision indeed becomes much less obvious

⁴ given that I’ve been relying on exceptions since 1996 or so, it is quite telling per se

⁵ if you happen to be overwhelmed with number of exceptions to catch, just use throws Exception instead

⁶ while I like the idea behind C++11’s noexcept, it seems to lack ability to log the error properly at least under MSVC, so I cannot recommend it at this point

⁷ despite being called “destructor” in Python-speak, __del__() is similar to Java’s finalize(), and not to C++ destructor

Writing for Debugging

Smart folks developing those-silicon-chips-we’re-relying-on (and which fail MUCH less frequently than our programs) have noticed a long time ago that they save a lot of time when they design their chips for testing from the very beginning. In an IC-development world, this is known as Design for Testing (DfT). The very same idea can (and SHOULD) be applied to software development as “writing for debugging” (see also [[TODO- tentatively “ongoing testing”]] section below).

As we’re spending much more time debugging code than writing it, building debugging tools right into your code makes perfect sense.

Assert, and then assert some more…

One really important thing you need to do when you’re programming is to insert lots of assertions into your code. The syntax of assertions differs from one programming language to another,⁸ but the idea is more or less the same across the board: make sure that whenever you reach this point in code, a certain expression is true. If not – your program is either terminated, or a special exception is thrown.

I won’t argue here why assertions are a Good Thing™ – not only does my experience show that they help to speed things up greatly (and there is a mathematical model which explains why this is the case, see [NoBugs14]), but there is also a more-or-less consensus in the industry about it.⁹ Moreover, assertions are considered so important that they are supported as special statements (!) in programming languages such as Java and Python.¹⁰

Now, let’s note that from my perspective, assertions have two Big Advantages. More obviously, when you have a bug they cause your program to fail earlier (instead of the problem being there, but not manifesting itself for hours), which allows you to debug programs MUCH faster. Less obviously (but IMHO even more importantly), assertions help to improve code readability.

Whenever I see an assertion in the code – I know that this condition does hold. From this perspective, an assertion acts as a kinda-comment, but with one very significant advantage over the usual comments –

Unlike comments, asserts never become obsolete¹¹

Or, from a bit different perspective, you can see the very same assertions as implementing “programming by contract” concept. Whenever you have an assertion at the beginning of your function, it is an enforced pre-condition on what-you-expect-from-your-caller. Whenever you have an assertion at the end of your function – it is an enforced post-condition of what-you-guarantee-for-your-caller.
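As a small illustration of assertions acting as enforced pre- and post-conditions, here is a Python sketch. The `split_pot()` function and its rules are hypothetical and serve only to show where the assertions go.

```python
def split_pot(pot, num_winners):
    """Split a pot of chips between winners; returns (share, remainder)."""
    # Pre-conditions: what we expect from our caller.
    assert isinstance(pot, int) and pot >= 0, "pot must be a non-negative integer"
    assert isinstance(num_winners, int) and num_winners > 0, "need at least one winner"

    share, remainder = divmod(pot, num_winners)

    # Post-conditions: what we guarantee to our caller (no chips created or lost).
    assert share * num_winners + remainder == pot
    assert 0 <= remainder < num_winners
    return share, remainder
```

A reader of this function learns from the first two assertions which inputs are legal, and from the last two what can be relied upon afterwards, without having to reverse-engineer the arithmetic. Keep in mind that Python's built-in `assert` statement is stripped when the interpreter runs with `-O`, so these checks are active in debug-style runs only.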

Ok, I hope that at this point you agree that you do need to use assertions. But what exactly needs to be asserted? The answer is “pretty much everything”. At the very least – assert every non-trivial pre-condition and every non-trivial post-condition of all of your functions. As a rule of thumb:

If you ever wonder “is such-and-such condition always true here?” and you find that it is – put an assert into your code; it will save you the time of figuring it out once again 3 months down the road

⁸ for the way I prefer to handle assertions in C++, see Chapter [[TODO]]

⁹ sure, there are people out there who deny it, but they’re rarely coming from the industry, and are few and far between

¹⁰ in C, C++, and C# assertions are implemented either as a macro or function

¹¹ ok, they may become obsolete and invalid, but it doesn’t last long

Validation code: assertValidity() functions

One thing which is closely related to assertions is special validation code. For many objects with complicated and non-trivial states, you SHOULD have routines which make sure that the state of the object is valid; what exactly “state is valid” means depends on your object, but I really hope that you are able to describe it for any object in your system.

Such assertValidity() functions (asserting that the object is in a valid state) are useful for two reasons. First, if you find a bug, you can start inserting these assertValidity() calls pretty much everywhere, often finding that the bug has occurred MUCH earlier than you were able to see it otherwise. This technique works even better if you can “replay” your event-driven program (see Chapter V), but is often useful even without replay being available.

The second advantage of such assertValidity() functions is that they serve as documentation of “what is considered valid for this object”. And just like asserts (and unlike comments, which can and do become obsolete), assertValidity() stuff doesn’t become outdated as you’re making changes to other parts of your code (that is, assuming that these functions are called at least from time to time; and if they’re NOT called – it is better to remove them as misleading).
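What such a function might look like is sketched below in Python. The `PokerTable` class, its fields and its invariants are all hypothetical; the point is only that “valid state” is written down as executable checks in one place.

```python
class PokerTable:
    """Hypothetical game object: a table with seats, player stacks and a pot."""

    def __init__(self, num_seats):
        self.seats = [None] * num_seats  # player name (or None) for each seat
        self.stacks = {}                 # player name -> chips in front of the player
        self.pot = 0
        self.chips_brought_in = 0        # all chips ever brought to this table

    def sit(self, seat, player, chips):
        self.seats[seat] = player
        self.stacks[player] = chips
        self.chips_brought_in += chips

    def bet(self, player, amount):
        self.stacks[player] -= amount
        self.pot += amount

    def assert_validity(self):
        """Assert every invariant that 'valid state' means for this object."""
        seated = [p for p in self.seats if p is not None]
        assert len(seated) == len(set(seated)), "a player occupies two seats"
        assert set(seated) == set(self.stacks), "seats and stacks disagree"
        assert self.pot >= 0, "negative pot"
        assert all(chips >= 0 for chips in self.stacks.values()), "negative stack"
        total = self.pot + sum(self.stacks.values())
        assert total == self.chips_brought_in, "chips appeared or vanished"
```

When hunting a bug, calls to `assert_validity()` can be sprinkled after each processed event; the first call that fails points to the event which broke the state, which is usually much earlier than the place where the damage finally became visible.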

Event-driven programs: Asserts in production

Normally, assertions are activated only in “debug” mode (however it is defined for your programming language), and are disabled for “release” mode. This makes sense for two reasons: first, because in release checking everything can eat too many CPU cycles, and second – because for many programs out there there is no good answer to the question “what to do in case of assert() failing?”

On the other hand, as long as we’re staying within our event-driven programming model (see Chapter V for a detailed discussion of event-driven programs), and especially as long as our event-driven objects are written in VALIDATE-CALCULATE-MODIFY style (as discussed in [[TODO]] section of Chapter V) – production-time assertions start to make sense. To make them work, we should just:

If an assertion fails – instead of aborting the program, throw an exception
Then, the assertion becomes “yet another, though not exactly expected, exception”
Catch all the unexpected exceptions right outside of our state machine, doing nothing about them (except for logging/reporting). Effectively, we want to ignore incoming-event-which-has-caused-assertion (and to prepare to process the next one).

When an assertion fails within the VALIDATE or CALCULATE processing stages as described in Chapter V – we can be sure that the state of our state machine didn’t change, which means that the incoming event is ignored cleanly and safely. While ignoring offending messages is not guaranteed to keep your program working reasonably for longer – my experience shows that most of the time it helps rather than hurts.¹² Assertions within the MODIFY processing stage are less obvious, but they’re rare enough to ignore them for most games. For those games (such as stock exchanges, or any game where you want to keep strong exception guarantees) where you want to properly handle production-time assertions (and other exceptions) within the MODIFY portion of your code – you may either aim for formal strong exception safety, or employ different tactics (as one example, copy-on-write-based memory transactions will do the trick).
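The scheme above can be sketched in Python roughly as follows. This is a simplified illustration rather than a prescribed implementation: `game_assert()`, `AssertionFailed`, the `Account` object and its event format are all hypothetical names invented for the example.

```python
import logging

log = logging.getLogger("game")


class AssertionFailed(Exception):
    """Thrown instead of aborting, so that production code survives a failed assertion."""


def game_assert(condition, message="assertion failed"):
    # Unlike the built-in assert statement, this check is not removed by `python -O`.
    if not condition:
        raise AssertionFailed(message)


class Account:
    """Hypothetical event-driven object written in VALIDATE-CALCULATE-MODIFY style."""

    def __init__(self, balance):
        self.balance = balance

    def process_event(self, event):
        # VALIDATE: read-only checks of the event against the current state.
        game_assert(event.get("type") == "withdraw", "unknown event type")
        amount = event.get("amount")
        game_assert(isinstance(amount, int) and amount > 0, "bad amount")
        # CALCULATE: compute the new state without touching the object.
        new_balance = self.balance - amount
        game_assert(new_balance >= 0, "balance would go negative")
        # MODIFY: only now is the state actually changed.
        self.balance = new_balance


def run_event_loop(obj, events):
    """Feed events to obj; log and skip any event that raises. Returns the skip count."""
    skipped = 0
    for event in events:
        try:
            obj.process_event(event)
        except Exception:
            # Caught right outside the state machine: log it, ignore the event, move on.
            log.exception("event ignored: %r", event)
            skipped += 1
    return skipped
```

Because every `game_assert()` in `process_event()` runs before the single MODIFY line, a failed assertion leaves `balance` untouched, and skipping the offending event is safe. An object whose MODIFY stage consists of several steps would need more care, as discussed above.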

As you might have noticed, I tend to see assertions-in-production as useful for quite a few games out there. A question “how to implement them” (which includes a question “how to make them CPU-friendly”) heavily depends on the programming language you’re using. For C/C++ I am usually using multi-level ASSERT-like macros (more on it in Chapter [[TODO]]), for Java – something like Spring Framework’s Assert.isTrue() – which throws IllegalArgumentException from within if condition is false; for other programming languages – it is usually possible to design something along the same lines.

¹² just make sure not to try this approach for a control program of a real-world nuclear station

Post-mortem analysis

For any game (and for any non-life-and-death real-world system in general), you will be facing cases when your server (I hope that it happens relatively rarely), or one of your millions-of-clients (this will happen a lot) crashes. It is pretty much inevitable, so you need to prepare for it (and to make sure that it doesn’t happen next time). The most important thing in this case is to identify the bug as soon as possible, using some post-mortem data from the system where our program has crashed.

Traditionally, for post-mortem analysis, core dumps and text logs are used. However, core dumps contain only information at the very last moment of the program life, and by that time things are usually already pretty bad, so core dumps are usually of a limited use for post-mortem purposes.

From what I’ve seen, text logs are much more useful than core dumps (with really good logging allowing you to identify up to 80% of the bugs from a single post-mortem log).

Plus, in addition to traditional core dumps and text logs, I also suggest using the much less traditional event-driven replay (more on it below).

Text Logging

Text-based logging is a very important part of any production-oriented code. As a rule of thumb, you should have a lot of logging both on your server and on your client. And while we’re at it – note that you should have a “Send Logs” menu item (or at least “Save logs to…” menu item) within your client (instead of providing a 3-page-length instruction on “where to find our logs and how to ZIP them”).

As for the “what to log” question: as yet another rule of thumb, you should log all non-trivial input events (excluding those events which happen too often).

The ideal text log is one which allows you to identify 100% of bugs from a single post-mortem log
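A minimal sketch of this rule of thumb, using Python's standard `logging` module, might look as follows. The event names, the `NOISY_EVENT_TYPES` set and the one-line `key=value` format are assumptions chosen for illustration.

```python
import logging

log = logging.getLogger("game.client")

# Assumed examples of events that fire too often to deserve a log line each.
NOISY_EVENT_TYPES = {"mouse_move", "timer_tick"}


def format_event(event_type, fields):
    """Render an event as one greppable 'key=value' line."""
    parts = [f"event={event_type}"]
    parts += [f"{key}={value!r}" for key, value in sorted(fields.items())]
    return " ".join(parts)


def log_input_event(event_type, **fields):
    """Log one input event; returns False if the type is excluded as too frequent."""
    if event_type in NOISY_EVENT_TYPES:
        return False
    log.info("%s", format_event(event_type, fields))
    return True
```

Calling `log_input_event("bet", player="alice", amount=10)` produces a line such as `event=bet amount=10 player='alice'`. Note that with Python's default logging configuration INFO-level messages are not emitted, so a real client would first configure a handler, for example via `logging.basicConfig(level=logging.INFO)`.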

Of course, 100% is not achievable, but as noted above, I’ve seen systems for which 80% of the bugs were identifiable from a text log. If you have “replay” facilities for your event-driven system enabled (see below about event-based replay) – the importance of text-based logging is somewhat reduced, but text logs still remain a very useful tool for debugging of event-driven objects a.k.a. ad-hoc FSMs.

Which library to use for text logging – is up to you. Pretty much every language has a library for this purpose, so just pick one. For C++, we’ll discuss logging in Chapter [[TODO]] (hint: at the moment I prefer [cppformat] library over both stdio and iostream formatting).

Event-driven recording/replay

Another very important tool for debugging is the event-driven recording/replay feature described in Chapter V.

In short, the idea is two-fold:

first, you make your event-driven program perfectly deterministic (this does require some effort, but usually is not too bad)
second, you keep a (usually binary) circular log of all the events happening within your event-driven object, with at least one snapshot of the object state within the same circular log too.

As soon as you have both these things – then in case of a post-mortem you have the last NN seconds of your event-driven object perfectly recorded, which (given the “perfectly deterministic” property above) makes it perfectly replayable.

As a result, if your game client crashes (and the player is willing to send client logs to you, including this circular log) – then you can easily run these logs in the comfort of your development machine, and crash at exactly the same place where the production client has crashed. Bingo! The bug is identified, and now it is only a matter of time to fix it.
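The record-then-replay idea can be sketched in a few lines of Python. This is a deliberately simplified, in-memory illustration under stated assumptions: the `Counter` object, the `Recorder` wrapper and the `replay()` helper are hypothetical, and instead of a true binary circular log the recorder keeps only one snapshot plus the events processed since it (a real implementation would keep more history and write it to disk).

```python
import copy


class Counter:
    """Hypothetical deterministic event-driven object: state depends only on events."""

    def __init__(self):
        self.total = 0

    def process_event(self, event):
        if event < 0:
            raise ValueError("negative event")  # stands in for the production crash
        self.total += event


class Recorder:
    """Keeps a state snapshot plus the events processed since it, within a size bound."""

    def __init__(self, obj, max_events=1000):
        self.obj = obj
        self.max_events = max_events
        self.snapshot = copy.deepcopy(obj)
        self.events = []

    def process_event(self, event):
        if len(self.events) >= self.max_events:
            # Start a new segment: take a fresh snapshot and drop the old events.
            self.snapshot = copy.deepcopy(self.obj)
            self.events = []
        self.events.append(event)  # record BEFORE processing, so a crashing event is kept
        self.obj.process_event(event)

    def dump(self):
        """What would be sent to the developers after a crash."""
        return copy.deepcopy(self.snapshot), list(self.events)


def replay(snapshot, events):
    """Re-run the recorded events against a copy of the snapshot."""
    obj = copy.deepcopy(snapshot)
    for event in events:
        obj.process_event(event)
    return obj
```

If `Recorder.process_event()` raises in production, the result of `dump()` is enough to call `replay()` on a development machine and hit the same exception on the same event. This works only because `Counter` is deterministic; any dependency on wall-clock time, random numbers or thread timing would have to be recorded as an input event too.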

Of course, in practice not all event-driven objects can be made deterministic, and not all the bugs can be found using last NN seconds of replay, so replay cannot be seen as a Silver Bullet of event-driven debugging. That being said, it can be made an extremely useful tool for production post-mortem.

[[To Be Continued…

This concludes beta Chapter 12(a) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter 12(b).]]

[+]References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.