Production Crashes. Post-factum Debugging. Logging. Replayable Deterministic Re(Actors)

posted April 4, 2017 by "No Bugs" Hare, translated by Sergey Ignatchenko

	Author:	“No Bugs” Hare Follow:
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

[rabbit_ddmog vol=”6″ chap=”Chapter 22(b) from “beta” Volume VI”]

Of course, testing is a very important part of software development, but in practice it is never perfect, so occasionally we’re bound to be faced with

the ultimate developer’s nightmare: crash in production.

Of course, Clients crash much more frequently than Servers – but when you have a million players, even 1% of Client crashes per day means 10’000 disgruntled players every day – and this is quite a pain in the neck¹ for your Support team. Moreover, if your management is doing a good job – they will also make developers share this pain;² as much as I hate dealing with such things – I do realize that keeping players happy is the most important responsibility for the developer (and I certainly don’t want the game to fail because of players being unhappy with my software).

“I’ve seen game companies with hundreds of thousands of dollars lost per hour of unplanned server downtime.With crashing Servers – the effects can be much more drastic than that; I’ve seen game companies with hundreds of thousands of dollars lost per hour of unplanned server downtime.

This leads us to a (literally) multi-million-dollar question:

What do we do with a production crash?

The answer is not that easy; however – it is perfectly clear that we should be able to do something better that merely restarting our Server (or telling our player to re-install the Client).

In particular, IMNSHO,

a good system allows to identify 80% of the production bugs from one single crash.

Moreover – it is not only desirable, but is also achievable. In other words – I’ve seen such systems myself <wink />.

¹ Yes, ‘neck’ will do here too…

² Once upon a time, I worked for a tens-of-thousands-of-simultaneous-players game, where modus operandi was centered around the following quote coming from the management: “We have already got TWO complaints from our players about this particular issue. We MUST do something about it”. And while it probably was a bit too harsh – the game was well-known for its reliability, being user-friendly, and so on and so forth – and had no problems with growing to hundreds of thousand players (and IMO – this policy was a contributing factor for the growth).

Implementing post-factum debugging

One thing to keep in mind is that to achieve those 80% of the bugs being identified from one single crash – simple core dumps won’t do. This comes from two all-important observations:

When core dump happens, more often than not, memory of the system is already damaged beyond repair. As a result – well, there isn’t that much we can see from the core dump. And while production asserts usually manifest themselves a bit earlier – memory dump take at that point is still rarely sufficient to identify what has happened (most of the time – it will show some strange state, without a hint where this strange state has come from).
By far not all the bugs/issues are manifested as a core dump (or as a production assert).

Option 1: Text Logging

“One thing which can bring us pretty close to that holy grail of “80% of production bugs identified from one single crash” – is good old plain text logging.One thing which can bring us pretty close to that holy grail of “80% of production bugs identified from one single crash” – is good old plain text logging. The idea is simple: during normal operation, -you’re just logging everything worth attention to a plain text file – and on crash you just get the log and analyze last several seconds (or minutes) before the crash.

This approach happens to be immensely more powerful than simple core dumps; most of the additional power comes from the fact that with logging, we not only know what exactly was the observable problem, but (assuming sufficient logging) we can deduce why the problem has happened.

If you’re logging all the messages (and all the player actions) which come into your Client – chances are that you will be able to determine the cause of the crash/problem just by looking at the logs, and by analyzing what was so unusual about a few last seconds of the events (which supposedly have caused the problem).

For slow-paced games without app-level multithreading – I’ve seen text logging to be sufficient to get to that desirable 80%-of-crashes-fixed-from-one-single-log number. Of course, it wasn’t 100% sufficient – but in most cases (except for the most egregious 3^rd-party bugs, such as bugs in malloc()), if the problem wasn’t identifiable from one single crash – adding more logging (and waiting for the problem to happen for the second time) did help; while not really ideal – it still was a rather straightforward way of dealing with production crashes.

Two things to keep in mind about text logging:

Most importantly – text logging is pretty much powerless to deal with inter-thread races. At the very moment when we have mutexes etc. at our app-level – the value of the logging goes down tremendously. Except for a few very narrow cases, we won’t be able to log all the entries and exits to/from mutex locks; this, in turn, means that the-most-elusive-inter-thread-races won’t be traceable from the logs.
- More formally – explicit threading is inherently non-deterministic, which means that our log becomes pretty much useless to answer the “why the problem has happened” question
- Moreover, if we add more logging trying to identify the inter-thread race – very often the problem will hide under the carpet for a while (to jump out when we have completely forgot about it – and at the very worst possible time). This can be seen as a pretty bad case of an experiment where adding a new sensor changes the results; with not much other sensors available – the whole thing can easily become impossible to debug.³
- On the other hand – if we’re sticking to a “bunch of Share-Nothing single-threaded systems” (such as (Re)Actors from Vol. II’s chapter on (Re)Actors) – this problem doesn’t exist, so the whole system is still very much reproducible and debuggable. In other words – it is not threads which are bad; it is poorly controlled interactions between threads, which are causing the trouble in general (and next-to-impossible-to-debug races in particular).
When logging to a file – it will work only for rather slow-paced games. For faster-paced ones (think FPS shooter) logging all-the-necessary stuff 20 times per second to a file – is likely to cause too much load both for Clients and Servers.
- “for fast-paced games, this problem can usually be alleviated by writing the same text log into a circular RAM-based buffer – and dumping the log into a file in case of crash.On the other hand, for fast-paced games, this problem can usually be alleviated by writing the same text log into a circular RAM-based buffer – and dumping the log into a file in case of crash. These days, allocating even 50M of RAM for this purpose is not likely to kill the Client (and even less – the Server) – and 50M of history should be enough for quite a few games out there.

From what I have seen – text logging might be “just good enough” to deal with the production crashes in a reasonable manner (and get most of the bugs fixed after the first crash). I tend to say that thorough text logging (with new logging added when new bugs are found) should be the bare minimum for any serious game.

³ once upon a time, we’ve built a fiber-based system, simulating context switches at all relevant points within code, just to deal with races in a heavily multithreaded environment. While it was an interesting exercise – I firmly oppose this kind of debugging for application-level code

Things NOT to log

With all the benefits of the logging, there are a few things which should not be logged into file-based text logs. Two examples of such things include:

Credit card numbers. These are usually prohibited from being logged in unencrypted/unmasked form, by PCI DSS. Most of the time it is not difficult to exclude them from text logs – but it may require to be rather careful.
Very sensitive up-to-the-second information. This is related to potential attacks on your servers (or potential for your admin to watch the log and gain advantage – even if all your admins are perfectly trustworthy, convincing players about it will be very difficult). This usually can be dealt with just by logging such sensitive events twice: (a) when the event has happened – just write the type of the event (withholding really sensitive information from the log for the time being). (b) when information is no longer that sensitive – it can be logged as “you know, back then the event actually had such and such sensitive attributes”.
- A classical example of such extremely sensitive information is pocket cards in poker. They’re soooo important that pushing them into the log file right away would be imprudent. However, as soon as the hand is over – it is not a problem to log them anymore (and this is exactly what we should do – delay logging of the actual pocket cards until the end of the hand).
- Actions of players under RTS fog-of-war in certain cases can be at least as sensitive as pocket cards – and in such critical cases can/should be delayed too.

Option 2: Deterministic (Re)Actors with Post-factum Debugging

With all the apparent real-world efficiency of the text logging – it is not the best we can do to deal with production crashes.

One way to do production post-factum debugging better, is related to my favorite (Re)Actors. The mechanics of using (Re)Actors for post-factum debugging was discussed in Vol. II’s chapter on (Re)Actors, so I will provide only a very brief summary here:

We make our (Re)Actors deterministic; in other words – we’re providing guarantees that with the same inputs, our (Re)Actors produce the same outputs
- NB: we do not care about “cross-platform determinism”; what we need for the purposes of post-factum debugging – is merely “same-executable” determinism (which is much easier to achieve).
- This “same-executable” determinism is achievable in practice; see Vol. II for a rather detailed discussion of related techniques.
We keep circular log of all the events coming to our (Re)Actor.
- We guarantee that the log always has a serialized state of our (Re)Actor, followed by all the events following the serialization. For more details – see Vol. II.
Whenever there is a problem – we’re taking the log, and delivering it to developers.
- See [[TODO]] section below on “How to do it for Clients”
In development lab, using exactly the same executable – developers are able to debug the problem, exactly as it has happened in production(!!) – and without being affected by our debugging too. In particular:
- The problem is 100% reproducible
  - In particular, we can trace it from the observable effects back to those-inputs-which-has-really-caused-it
- We do not need to care about performance of the debugging box, or our debugging being too slow (or too fast).

“deterministic debugging is the best thing I have seen for production debuggingSuch deterministic debugging is the best thing I have seen for production debugging. Also I would say that with such a system in place – we as developers have very few excuses not to identify the bug within a few hours.⁴

On the other hand, of course, even such a powerful system is not a “silver bullet”. In particular, three classes of bugs remain uncovered:

Inter-thread races (as noted above – inter-thread is non-deterministic, so determinism won’t fly)
- Note, however, that “unusual sequences of messages/events” – which is apparently a very wide class of bugs – are perfectly debuggable this way.
- It means that for (Re)Actors – there are no non-debuggable races <phew… />
Bugs which happened before the available part of log has started (but which manifested only much later).
Some of memory corruption bugs are not guaranteed to be caught this way (especially if ASLR etc. is used).

This all sounds almost too good to be true – but it has been seen to work not in theory, but also in practice. For a very good example of determinism being used for an AAA game post-factum debugging – I strongly suggest to watch a GDC presentation [Aldridge] about Halo:Reach by Bungie – at least around 01:08. Very shortly:

Bungie developers have made their Halo:Reach Client deterministic.
“they added a controller button so player can complain that “game is lagging”.During internal playtesting, they added a controller button so player can complain that “game is lagging”.
Whenever “lagging” button was pressed – they took Client log (the one of input events coming to deterministic Client), and sent the log to their servers (they named them TicketTrack servers).
- BTW, TicketTrack servers were able to handle more that just logs resulting from “it is lagging” complaints. In particular, after crashes and detected abnormalities, logs were sent to TicketTrack servers too.
From TicketTrack servers, the logs went to their network developers, who were able to see the whole thing with both the picture-seen-by-player, plus all the relevant network-related information. They were able to go back and forth to see how the problems were developing during the real-world sessions – and observe how the system reacted to these problems.
This was Bungie’s main tool for debugging of their network subsystem for Halo:Reach – and helped to identify and fix several very important issues before the game was released.

I rest my case (the one for deterministic systems which allow post-factum debugging of the code-already-in-production).

⁴ fixing the bug is a different story, but with at least half-sane architecture and design – bugs shouldn’t require rewriting of 50% of the code to fix

Getting Logs from the Client

After all that really impressive stuff above, the next question is going to be pretty minor and mundane – but we still need to touch it. It is a question of “how to get logs from the Client?”

As a rule of thumb, I tend to argue for two things in this regard:

Do NOT send logs without an explicit request of the player to do so
As soon as the player wants it – DO make sending logs as easy as possible (hint: saying “please go to some folder – which depends on the version of Windows in use – zip such-and-such files, and send them to us” does not qualify as “easy”).

Besides approach from [Aldridge] discussed above, I’ve seen a pretty simple system which worked pretty well:

On the Client, there was an option “Send logs to support”. It just took all the log files, zipped them, and sent them to support (IIRC, it was sent over e-mail, but could be done in other ways too). The mail (or were it logs themselves?) also contained player’s ID.
When a player started to complain about problems with the Client – the support simply asked her to go to the menu, and to use “Send logs to support” option.

From what I have seen – it worked like a charm.

No Backdoors Whatsoever

“With all the desire to handle all the bugs – there is one thing which we as developers should never use: it is any kind of backdoorsWith all the desire to handle all the bugs – there is one thing which we as developers should never use: it is any kind of backdoors (including, but not limited to: superusers, God accounts, etc.) While it is extremely tempting to feel myself as a kind of an omnipotent “god” within the game (which is also a well-known practice for single-player games) – it is a big no-no for MOGs.

The problem with god accounts is that

God accounts can be abused.

I don’t want to say that it will be developers who will abuse this ability – but there is always a risk that somebody will do it, sooner or later.

The most well-known example of such an abuse goes to two scandals with (IIRC) the same poker software: to find information about these scandals – Google for “UltimateBet scandal” and/or for “Absolute poker scandal”. Long story short:

Developers created a special “superuser” account – able to see all the cards at the table.
- Supposedly, it was intended only as a debugging feature
When the software was already in production – this “feature” was left there.
Then, somebody related to company management (and with access to the system), realized that it can be used to cheat the money out of the other players.
This feature was abused by two separate teams of cheaters over two sites, for more than a year, and cheating millions of dollars from the players.
The backlash against the sites was very severe – and arguably has started a downturn spiral which caused collapse of both sites in mere three years.

While it was the most publicized case of this kind of backdoor-related cheating – it wasn’t the only one. And BTW, if you think that your game has nothing of value and is therefore immune to cheating – make sure to re-read Vol. I’s chapter on Cheating (in particular, “If You’re Popular Enough, They Will Find Reasons to Cheat” section).

Bottom line –

Never ever write any “backdoors” into your game – even if you think that “it will be removed before the game goes live⁵”

⁵ It won’t.

[[To Be Continued…

This concludes beta Chapter 22(b) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for beta Chapter 23, the one which will start discussion on the preparing for deployment]]

[+]References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

ITHare (8)

Comments

Cory Long says

April 5, 2017 at 8:22 pm

“they will also make developers to share this pain” should probably be ” they will also make developers share this pain” or ” they will also make developers have to share this pain”, no?

Reply
- "No Bugs" Hare says
  
  April 7, 2017 at 3:47 am
  
  Of course, fixed now, thanks!
  
  Reply
SerzaNT says

June 16, 2018 at 5:24 pm

I really like this idea and I would very like to achieve “Replayable Determinism” in my project. But when I started to experiment around it I discovered few problems. I wasn’t been able to find clear explanation/examples(sometimes stated as implementation details in the book) on how to go around them and so I would like to ask here(im not sure if this is the right place, but couldn’t find better one) for some help/additional explanation.
Specifically:

1 – should log contain only external events(player input, connection events,…) or include internal (inter-ReActor) messaging(which should go through same queue & react(ev), as I understand it) ?

2 – How to handle frequent events? There are two groups that I have in mind. First, ticking inputs e.g. mouse movement, analog stick input(gamepad). Second, ticking logic e.g. interpolation to new transform.

To explain the last, aim using UE4 with variable timestamp (DeltaTime each frame is passed). So, if I only log the “trigger event” that start some ticking logic then times when that logic is performed may vary(that doesnt soud very determinictis to me). On the other hand, if I log event every frame with its DeltaTime that will spam the log and potentially shorten the logged times.

I would be grateful for any advice.

Reply
- "No Bugs" Hare says
  
  June 18, 2018 at 5:34 am
  
  Very briefly – EVERYTHING which has a potential to change the state of the system, MUST be recorded, or we’re out of luck 🙁 .
  
  Therefore:
  1. Inter-(Re)Actor messaging has to be recorded (BTW, terminology-wise, we’re isolating each and every (Re)Actor, so everything but (Re)Actor itself should be seen as “external” for our determinism).
  2.First – stuff such as mouse movement has to be recorded
  2.Second In theory, you MIGHT be able to get away without recording timer-events-sent-to-yourself most of the time, but in practice, it is still MUCH simpler to record them – see also below.
  
  OTOH, if speaking of logged/replay times, all this stuff is usually minuscule. On modern box, there is usually no problem to dedicate, say, 32M of RAM for logging, and with more-or-less sane marshalling, we’ll be speaking about ~64 bytes per mouse-movement log frame (which has to have some coordinates) and about ~32 bytes for a ticking log frame (which usually has nothing but timestamp). It means that even with 60 ticks/second, it will take like 17’000 seconds (~=5 hours(!)) to fill the log (of course, there will be other stuff, but my point is that this kind of data volumes is pretty much nothing by the standards of modern x64 computers, where we can easily use not 32M, but if necessary, 128M or even more).
  
  Reply
  - SerzaNT says
    
    June 19, 2018 at 6:22 pm
    
    Thank you. I’m a little bit confused though. For clarification, let’s say we have our game logic reactor/thread, as shown in https://bit.ly/2I1tuYo, then there are sub-ReActors which communicate through the same interface ie. they send messages to the queue. So, if we have situation illustrated as follows https://bit.ly/2llGgrP, then “Event#2” will be duplicated during the replay. This way we have to pass the (Mode == REPLAY_MODE) or make it global const to ensure that logic can deal with this kind of duplicate events. But I didn’t saw any mention of similar logic in the book, on the contrary, there is suggestion that this kind of logic is dealt with on “Infrastructure” level of code. That is why I think there’s something wrong with this concept, but im not sure what.
    
    Reply
    - "No Bugs" Hare says
      
      June 20, 2018 at 5:34 am
      
      Thanks for asking, it is indeed not _that_ simple :-).
      
      If we add sub-reactors into the picture, the things indeed change. We still have to record EVERYTHING, but ONLY at the external boundary of the outermost (Re)Actor (i.e. _excluding_ sub(Re)Actors).
      
      In general, if going into non-trivial analysis, I prefer to think of deterministm+replay in terms of “isolation perimeters”: we define “isolation perimeter” which isolates our deterministic state from the rest of the world, AND record EVERYTHING which goes inside this “perimeter” (this includes BOTH events AND return values of system calls – for the latter see, for example, http://ithare.com/deterministic-components-for-interactive-distributed-systems-with-transcript/2/ ). As soon as we record EVERYTHING which goes into a certain perimeter, the whole thing inside this perimeter becomes deterministic and replayable.
      
      Now, back to the case in hand: in the diagram on https://bit.ly/2llGgrP, we want to have our “isolation perimeter” around THE WHOLE Game Logic Reactor, but we don’t need it around sub-reactors… In practice, I strongly prefer to have recording/replay implemented as a part of Infrastructure Code, so I can say “hey, let’s make this Reactor replayable by a simple change to its base class or something” without changing the Reactor logic itself 🙂 . BTW, we recently started work on an open-source implementation for replayable Reactors 🙂 – but no ETA is known for it yet 🙁 .
      
      Reply
      - SerzaNT says
        
        June 20, 2018 at 7:03 pm
        
        I see. So, events will be logged before pushing to the queue, right? Im asking to confirm my assumption that inter-ReActor communication goes through the same queue.
        
        I’m really excited to hear about the open-source project, it will be a great help.
      - "No Bugs" Hare says
        
        June 21, 2018 at 5:48 am
        
        > events will be logged before pushing to the queue, right?
        
        Formally, it would work, but in practice, it is usually MUCH better to log them “right after Infrastructure Code reads an event from the queue” (if logging before pushing to the queue, we’ll have to deal with thread sync, and it is going to be Quite Ugly for no reason 🙁 ).
        
        > assumption that inter-ReActor communication goes through the same queue.
        
        Yes it does.