My Personal Take on Testing (including Unit Testing and (A)TDD/BDD)

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

[Chapter 22(a) from “beta” Volume VI]

First of all, let’s note that whatever I discuss in this section is just my personal take on the subject of testing. There are very few universally-acceptable truths in testing, and every company has its own testing process, which is quite different from the others.

As a result (and unlike many other places in this book), I am not going to say “hey, this is THE way to do testing!”. Instead – I will outline my personal experiences and concerns about existing testing techniques (and will try to cross-reference them with existing works on the subject).

Automated Regression Testing is a Good Thing™

While just two paragraphs above I admitted that there are very few (near-)absolute truths when it comes to testing, at least one such truth clearly exists. It is that

For any sizeable agile project,¹ you DO need some kind of automated regression testing.

At this point, I do NOT try to specify how to implement this automated regression testing; instead – I am just saying that we DO need it, that’s it. The rationale for this rule is very obvious – as soon as we have a project with frequently changing requirements (and games pretty much inevitably fall under this definition) – making changes and just hoping that none of them broke anything else is way too optimistic. With the number of changes within a monthly build going into the hundreds – you can be pretty sure that at least one of them will break the existing behavior. As a result – having automated tests for regressions certainly qualifies as a Good Thing™ (and moreover, this stands both in theory and in practice).

¹ Here “agile project” is to be understood as “project where requirements change frequently”, regardless of any buzzwords being applicable

On “Common Wisdoms” a.k.a. Popular Misperceptions

So far so good – and I haven’t even deviated from “common wisdoms” yet. Well, I am just about to do so.

On Unit Testing

When people are speaking about “automated testing”, they very often imply “automated unit testing”. However, at the very least for distributed systems,

Unit testing is not enough – by far.

It means that if you think that by unit-testing your game, you’re going to have it error-free – you’re in for a Big Fat Surprise™ (and a very unpleasant one at that). From my experience, for distributed systems (and games in particular), the percentage of regressions which can be detected by unit testing is well below 20%.² The remaining 80+% tend to be bugs related to a generalized notion of “races”. NB: for our purposes here, we’ll consider both classical inter-thread/inter-DB-connection races, and “unusual sequence of incoming messages” as “generalized races”. While classical ones are not possible at application level as long as your architecture is (Re)Actor-based and single-writing-DB-connection-based (as discussed in Vol I-III) – the “unusual sequence” ones are inherent to all interactive distributed systems, including (Re)Actor-based ones.

As a result, I have to insist that

unit testing should not take more than a very small portion (such as “20%”) of your overall testing efforts.

Unfortunately, way too often when speaking about testing, there is a horrible misperception of “hey, we’re 100% unit-tested, so we should be fine” – which tends to fail very badly when facing real-world games (including stock exchanges, banks, etc.).

BTW, I don’t want to say that unit testing is pointless. If your code is 100% unit-tested – this is not bad, but you still need to do something like 80% of your testing work. OTOH, if you’re not 100% unit-tested but are using different testing techniques (in particular, those techniques discussed below) – these non-unit-testing efforts can still keep your game in very good shape. In other words, I am arguing that:

- You should do MUCH more than mere unit testing.
- You should look for the amount of unit testing which represents a “sweet spot” in terms of bringing the most improvement for your testing efforts. In general, each effort is subject to the “law of diminishing returns” – so if you keep spending more and more effort on unit testing, at some point it will most likely start yielding less than you’d get from spending the same effort on some non-unit testing.

Of course, in theory (and for military/nuclear stations/…) it is necessary to perform all the types of testing (and have both 100% unit-test coverage and use all other testing methods which we’ll discuss below). However, for quite a few games we’re operating with a limited budget of time for testing – and should allocate it to those techniques which provide the best bang for your hour spent.

BTW, apparently I am not alone here; for example, [Coplien], to the best of my understanding, seems³ to go even further and claim that “Most Unit Testing is Waste”.

² That is, if we can assume that the developer has already run the changed code in debugger at least once

³ I am trying to be very careful when trying to convey my feelings about James Coplien’s writings. Recently, he complained that I misinterpreted him, so I have to be very clear that it is just my interpretation, and not necessarily his understanding of his articles. The same disclaimer applies to all the references to [Coplien] in this book.

On Code Coverage

One popular metric in testing is the “code coverage” of the test suites. It is easy to obtain, and is even useful. However, like any other metric, it is often misused. In particular, it falls victim to Goodhart’s law, which says:

When a measure becomes a target, it ceases to be a good measure.

When applying it to code coverage, it means that as soon as you tell your teams that you’re looking at the code coverage – they will find a way to abuse this metric, so it will become pretty much useless. BTW, it doesn’t mean that your developers are dishonest or trying to game the system – Goodhart-law-related abuses can easily happen at a subconscious level.

On the other hand – as long as your developers don’t know that you’re using this metric (and you just come up with suggestions “let’s write a test for this use case”, without mentioning where you got it) – you’re ok to use code coverage as a way to find under-covered pieces of code.

BTW, [Coplien] seems to provide a very good real-world example of developers abusing “code coverage” when it became a CMM target (and in my interpretation – it is a very obvious application of Goodhart’s law).
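To make this kind of abuse concrete, here is a minimal sketch in Python. Everything in it (`apply_discount`, the two test functions, the deliberate bug) is hypothetical and invented for illustration; it is not taken from [Coplien]. The first "test" executes every line of the function, so a coverage tool would report the function as fully covered, yet it checks nothing and happily passes with the bug in place.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function with a deliberate bug: the discount is added, not subtracted."""
    if percent < 0 or percent > 100:
        raise ValueError("percent must be within 0..100")
    return price_cents + price_cents * percent // 100  # BUG: should be '-'


def coverage_only_test() -> None:
    """Executes every line of apply_discount(), but checks nothing about the results."""
    apply_discount(1000, 10)
    try:
        apply_discount(1000, 150)
    except ValueError:
        pass


def behavior_test() -> bool:
    """Checks what the function is supposed to do; returns False because of the bug."""
    return apply_discount(1000, 10) == 900
```

Both tests produce the same coverage number; only the second one says anything about whether the code works.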

On TDD, ATDD, and BDD

The idea behind TDD is very neat – “let’s write the test first, and the code second”. This way we can be sure that we have 100% of the functionality covered by our testing.

For quite a while, Test-Driven Development (TDD) was seen as The Way to make programs reliable. Apparently, the reality turned out to be not that bright. In particular (to re-iterate – these are just my personal takes on it, not to be seen as any kind of “universally acceptable truth”):

- I am arguing that changing design to enable testing should be avoided. In particular, I am very averse to any design changes which will affect the code readability (and “any design changes” includes “design changes to enable testing”).
  - A side note: in IC design, Design-for-Testing (DfT) is an (almost-)universally-acceptable practice. However, even there DfT is usually implemented as-non-intrusively-as-possible with relation to “normal” operation. I am arguing for doing the same in program design, optimizing for normal operation – and adding tests to the existing design (in general, well-designed code should also be testable, but forgetting about normal operation and caring only about testability leads to Really Ugly Results™ way too often).
  - For some examples of the stuff which I am trying to avoid, see [Hannson].
- I am arguing that changing code to enable testing should be avoided (on the same readability grounds as above). It means that I am pretty much ruling out extra levels of indirection just to enable testing.
- OTOH, I am not against mocking and stubbing – as long as they can be done without affecting the “normal” design and code.
- And – I am not against TDD as such (as long as the above rules stand).
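As an illustration of stubbing which leaves the “normal” code alone, here is a small Python sketch. It assumes a hypothetical `is_session_expired()` function and a made-up timeout value; the only real API used is `unittest.mock.patch.object` from the standard library. The production function reads the clock directly, with no injected clock interface and no extra level of indirection, and the test replaces the clock from the outside.

```python
import time
from unittest import mock

SESSION_TIMEOUT_SECONDS = 300  # hypothetical value, for illustration only


def is_session_expired(last_seen: float) -> bool:
    """'Normal' code: reads the clock directly, no indirection added for testing."""
    return time.time() - last_seen > SESSION_TIMEOUT_SECONDS


def test_session_expiry() -> None:
    # The stub is applied from the outside and is removed when the 'with' block ends.
    with mock.patch.object(time, "time", return_value=1_000.0):
        assert not is_session_expired(last_seen=800.0)  # 200 s idle
        assert is_session_expired(last_seen=600.0)      # 400 s idle
```

How easy such outside-in stubbing is depends heavily on the language; in a language without this kind of runtime patching, the same restriction may leave fewer places where a stub can be inserted, which is one more push towards testing larger pieces.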

If you think about it – you’ll probably see that following the way outlined above, will lead us to

Having larger pieces to test.

Sure, with these restrictions we still can test all our code base – but we’ll be doing it using larger modules-under-test – and essentially on a higher level of abstraction.

And – IMNSHO this is a Really Good Thing™ for several reasons:

- The higher we are in the abstraction level diagram – the more we’re moving from how we’re doing things to what we’re doing. This helps to avoid “Tautological TDD” (for discussion on TTDD, see, for example, [Pereira]) and other related anti-patterns.
- With larger pieces – we can test much more than when testing the smaller ones. In other words – we can test the whole larger piece, including interactions between smaller pieces comprising the larger one. As noted above – low-level unit testing has only limited use, so going from low-level unit testing towards functional testing and/or acceptance testing is a Good Thing™.
- We’re not cluttering the code (which, as we know, is read many more times than it is written) with not-really-relevant details. As a developer, I am saying it is a Good Thing™ too.

In a sense – IMO the approach outlined above, is well-aligned with Acceptance-Test-Driven Development (ATDD) and Behaviour-Driven Development (BDD) – which also tend to imply larger pieces for testing (with ATDD and BDD, at least as I understand them, you shouldn’t try to unit-test each and every function – but should test much larger units instead, keeping implementation details off the testing table). Also – while I hate taking yet another risk to misinterpret [Coplien] – it seems that his “Test at a coarser level of granularity” is pretty close too.

Automated testing for (Re)Actors

If (as I really hope) you have decided to implement your system as a bunch of (Re)Actors (as it was discussed at length in Vol. II’s chapter on (Re)Actors) – we can use (Re)Actors to introduce two more layers of automated testing.

Script-Driven Testing

As it was discussed in Vol. II, (Re)Actors provide a very well-defined interface to the outside world; in short – for (Re)Actors, pretty much everything is an input event (and most of the reactions generated by (Re)Actors in response to input events are outgoing messages).

This, in turn, means that it is possible to write a test script, which takes the (Re)Actor as a whole, throws messages at it – and observes messages generated by the (Re)Actor, checking these generated messages against the rules specified in the script.

With such a (Re)Actor-level script-based testing – we’re essentially treating our (Re)Actor as a “black box” – which is very well aligned with “we should test what the object does, not how it works” paradigm.
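A minimal sketch of such a script-driven test, in Python, could look as follows. The `TableReactor`, its message format, and the `run_script()`/`expect()` helpers are all hypothetical names invented for this illustration; a real (Re)Actor would have a much richer set of events, but the shape of the test stays the same: events in, messages out, rules checked per step.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]               # (message type, player id); hypothetical format
Check = Callable[[List[Message]], bool]


class TableReactor:
    """Hypothetical (Re)Actor: all inputs arrive via react(), all outputs are returned messages."""

    def __init__(self, seats: int) -> None:
        self._seats = seats
        self._players: List[str] = []

    def react(self, event: Message) -> List[Message]:
        kind, player = event
        if kind == "JOIN":
            if player in self._players:
                return [("ALREADY_SEATED", player)]
            if len(self._players) >= self._seats:
                return [("TABLE_FULL", player)]
            self._players.append(player)
            return [("SEATED", player)]
        if kind == "LEAVE":
            if player not in self._players:
                return [("NOT_SEATED", player)]
            self._players.remove(player)
            return [("LEFT", player)]
        return [("UNKNOWN_EVENT", player)]


def run_script(reactor: TableReactor, script: List[Tuple[Message, Check]]) -> List[int]:
    """Feeds each scripted event to the reactor; returns indices of steps whose check failed."""
    failures = []
    for step, (event, check) in enumerate(script):
        if not check(reactor.react(event)):
            failures.append(step)
    return failures


def expect(*messages: Message) -> Check:
    """Rule: the outgoing messages must be exactly these, in this order."""
    return lambda out: out == list(messages)


SCRIPT = [
    (("JOIN", "alice"), expect(("SEATED", "alice"))),
    (("JOIN", "alice"), expect(("ALREADY_SEATED", "alice"))),
    (("JOIN", "bob"), expect(("SEATED", "bob"))),
    (("JOIN", "carol"), expect(("TABLE_FULL", "carol"))),
    (("LEAVE", "carol"), expect(("NOT_SEATED", "carol"))),  # unusual sequence: LEAVE without JOIN
    (("LEAVE", "alice"), expect(("LEFT", "alice"))),
    (("JOIN", "carol"), expect(("SEATED", "carol"))),
]

failed_steps = run_script(TableReactor(seats=2), SCRIPT)  # an empty list means the script passed
```

Note that the script never looks inside `TableReactor`; replacing the list of players with some other data structure would not require touching the script at all.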

That’s it – but this simple testing has been observed to be very efficient (especially when compared to traditional unit testing). The reason for it is quite simple – (Re)Actor-level script-driven testing usually operates at a level which is much higher than usual unit-testing – which in turn:

- Tests things which are different from those things usually tested by unit testing
- Tends to be more resilient to implementation changes (in a sense – because it is closer to BDD/ATDD paradigm)

In general, script-driven testing at (Re)Actor level is a good fit for Functional Testing – and for Regression Testing.

[[TODO: databases/mocking]]

Replay-based Regression Testing

The idea of such testing goes along the following lines:

- The whole point of regression testing is to ensure that there are no unexpected changes introduced from version V into version V+1.
- With deterministic (Re)Actors, we can easily record all the inputs and outputs of selected (Re)Actors during real game play from a production system – while it is running version V.
- Then – we can separate all the changes intended for upcoming-but-not-yet-released version V+1, into two categories: (a) those which are supposed to change the existing behavior, and (b) those which are not supposed to change it. NB: most of “added functionality” changes tend to fall into (b) category (i.e. most of the time, existing behavior shouldn’t change unless enabled – and it certainly wasn’t enabled when version V was run).
  - This separation can be done quite easily – as long as you have a practice of attributing your commits to the issues within your issue tracker. And for issues – separating non-modifying-existing-behavior issues and modifying ones is rarely a problem.
    - Moreover, in practice you’ll see that like 80+% of all the commits are supposedly non-modifying-ones (the vast majority of commits tend to be about new features, and they’re usually not changing existing behavior unless enabled).⁴
- Then – we can build version V+0.5, based on version V, plus only those-commits-which-are-not-supposed-to-change-existing-behavior.
- As soon as we have this version V+0.5 – we can (and should) replay the logs which were recorded during normal operation of version V, against version V+0.5 – and see if there are any discrepancies.
  - Each such discrepancy should be seen as a bug unless proven otherwise (and the burden of proof should be on the developer who has made the offending commit). If it happens to be caused by a commit which was misattributed to a different issue – I won’t argue whether it is a bug or not, but what is clear is that it should be re-attributed – and then version V+0.5 should be re-built, and re-tested.
    - This process can be quite unpleasant at first, but pretty soon your developers will learn to be more careful with attributing their commits (which is a Good Thing™ for any sizeable development anyway).
  - BTW, if any non-trivial bugs are identified during replay testing – test cases for them may (and should) be introduced into script-based regression testing.
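The record-and-replay steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the (Re)Actor is reduced to a deterministic function of state and event, the log lives in memory rather than in production log files, and all names (`reactor_v`, `reactor_v_plus_half`, `record`, `replay`) are hypothetical. Version V+0.5 here adds a new `PING` event (which is not supposed to change existing behavior) and also carries an accidental change in the `BET` comparison.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, int]                 # (event type, amount); hypothetical format
State = Dict[str, int]
Reactor = Callable[[State, Event], List[str]]
LogEntry = Tuple[Event, List[str]]      # input event plus the outputs recorded for it


def reactor_v(state: State, event: Event) -> List[str]:
    """Version V: deterministic, output depends only on the state and the event."""
    kind, amount = event
    if kind == "DEPOSIT":
        state["balance"] += amount
        return [f"BALANCE {state['balance']}"]
    if kind == "BET":
        if amount > state["balance"]:
            return ["REJECTED"]
        state["balance"] -= amount
        return [f"BALANCE {state['balance']}"]
    return ["UNKNOWN"]


def reactor_v_plus_half(state: State, event: Event) -> List[str]:
    """Version V+0.5: new PING event, plus an accidental regression in BET."""
    kind, amount = event
    if kind == "PING":
        return ["PONG"]
    if kind == "BET" and amount >= state["balance"]:  # regression: '>' became '>='
        return ["REJECTED"]
    return reactor_v(state, event)


def record(reactor: Reactor, events: List[Event]) -> List[LogEntry]:
    """Runs the reactor on 'production' events and logs every input with its outputs."""
    state = {"balance": 0}
    return [(event, reactor(state, event)) for event in events]


def replay(reactor: Reactor, log: List[LogEntry]) -> List[Tuple[int, Event, List[str], List[str]]]:
    """Replays recorded inputs; returns (index, event, recorded, actual) for each mismatch."""
    state = {"balance": 0}
    discrepancies = []
    for index, (event, recorded) in enumerate(log):
        actual = reactor(state, event)
        if actual != recorded:
            discrepancies.append((index, event, recorded, actual))
    return discrepancies


log = record(reactor_v, [("DEPOSIT", 100), ("BET", 40), ("BET", 60), ("BET", 1)])
discrepancies = replay(reactor_v_plus_half, log)
```

In this sketch the first discrepancy is the all-in bet of 60 (recorded `BALANCE 0`, replayed `REJECTED`). Once the states have diverged, later events tend to mismatch too, so in practice it is the first discrepancy which is worth investigating.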

Bingo! We have our replay-based regression testing – which tends to provide much more thorough testing than simple unit-testing, and is also more thorough than script-based testing. I tend to attribute it to the following phenomenon:

Nobody can possibly predict all the scenarios which your players will throw at your system.

I’ve spent quite a bit of time looking at the players and cases they create – and I have to admit that predicting all the scenarios is well beyond not only my humble capabilities to generate test cases – but also well beyond capabilities of all the developers and QAs I’ve ever met. BTW, some of the QAs I had the pleasure to work with, were exceptionally good and were able to find like 3x-5x more significantly different test cases than developers; however, capabilities of 100K players pressing the buttons at the same time – go at least 5x-10x further than the very best of QAs.

⁴ As always, YMMV; here – YMMV even more than usual

Simulation Testing

One more thing which tends to help when testing games – is simulation testing. We just create 10’000 simulated players – and run them against the QA instance of our Server, keeping an eye on it to see whether everything makes sense (there are no failed assertions or crashes, database invariants still stand after we finish simulation, and so on).
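Here is a heavily simplified sketch of the idea in Python. It assumes an in-process `GameServerStub` standing in for the QA instance of the Server (a real simulation would run many clients over the network), and the class, its methods, and the chip-conservation invariant are all hypothetical. Simulated players throw random actions at the server, including invalid ones, and the invariants are checked once the run is over.

```python
import random
from typing import Dict


class GameServerStub:
    """Hypothetical in-process stand-in for the QA instance of the Server."""

    def __init__(self) -> None:
        self.balances: Dict[int, int] = {}
        self.pot = 0

    def join(self, player: int, chips: int) -> None:
        self.balances[player] = chips

    def bet(self, player: int, amount: int) -> bool:
        if amount <= 0 or self.balances.get(player, 0) < amount:
            return False
        self.balances[player] -= amount
        self.pot += amount
        return True

    def payout(self, player: int) -> None:
        if player in self.balances:
            self.balances[player] += self.pot
            self.pot = 0


def check_invariants(server: GameServerStub, total_chips: int) -> None:
    # In the spirit of "database invariants still stand after we finish simulation".
    assert all(balance >= 0 for balance in server.balances.values())
    assert sum(server.balances.values()) + server.pot == total_chips


def simulate(players: int, actions: int, seed: int) -> GameServerStub:
    rng = random.Random(seed)  # seeded, so that a failing run can be repeated
    server = GameServerStub()
    for player in range(players):
        server.join(player, chips=100)
    for _ in range(actions):
        player = rng.randrange(players + 10)  # occasionally an unknown player id
        if rng.random() < 0.9:
            server.bet(player, rng.randint(-5, 150))  # includes invalid amounts
        else:
            server.payout(player)
    check_invariants(server, total_chips=players * 100)
    return server


server = simulate(players=1_000, actions=50_000, seed=42)
```

Seeding the random generator is a design choice of this sketch: it trades some randomness between runs for the ability to reproduce a failure.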

Just like replay-based testing, simulation-based testing tends to find those scenarios which are impossible to predict in advance. However, it is neither a subset nor a superset of the replay-based testing, because:

- Bugs in new functionality cannot be tested by replay-based testing by definition – but there is a chance to find some of them in simulation.
- While simulators are good at finding randomly occurring interactions (including those “generalized races” mentioned above), they’re usually not able to find some non-random patterns which can be caused by players who’re trying to angle the system, or by players who just behave in a way the simulation writer wasn’t able to think of.

In addition, simulation testing is a reasonably good way to do basic Performance Testing, and it can be seen as a kind of Integration Testing too.

Summary: My Personal Recommendations for Testing

[[TODO: manual testing (see arguments in the comments below)]]

[[TODO: testing frameworks (“don’t matter much” for unit tests, your own one for (Re)Actor-level testing)]]

To summarize my own personal recommendations for testing of the games (and any other distributed system):

- Unit tests (at least as they’re usually understood) are usually not enough to provide sufficient test coverage
  - This can (and IMO should) be alleviated by moving tests up the abstraction level – and testing larger pieces of code⁵
  - As a Big Fat Rule of Thumb™, you should not spend more than 20% of your overall testing efforts on unit testing.
- Modifying design and/or code to make your code testable is generally a pretty bad thing (especially from the readability point of view)
- Whether to use TDD – is up to you, as long as:
  - You’re testing larger pieces of code with higher-level tests (think ATDD or BDD)
  - You modify neither design nor code to run your tests⁶
    - BTW, mock-ups and stubs are fine as long as the above restrictions stand
- Most importantly – DO use automated testing. Moreover, DO use as many different automated testing techniques as you can
  - Unit testing might be useful.⁷ What is important though is not to rely solely on unit testing. As a rule of thumb, for games (and any other interactive distributed system) unit testing should take at most 20% of overall testing efforts
  - DO use (Re)Actors – and you’ll have two other testing techniques to use, which are known to work very well:
    - Script-driven regression testing at (Re)Actor level (just throwing messages at (Re)Actor and seeing its reactions to them)
    - Replay-based regression testing. This one requires (Re)Actors to be deterministic – but tends to test things which no test writer can possibly think of.
  - Simulation-based testing provides yet another significantly different way to test your system (which can find bugs that the other testing methods miss).

⁵ Whether to name these larger-than-usual pieces of code “units” so as to keep the “unit test” name – is a purely terminological – and therefore completely pointless – argument

⁶ as always – don’t take it as a gospel; what’s important here is to make sure that code readability doesn’t suffer too much

⁷ though I’ve seen highly successful and unusually reliable games which didn’t have unit testing at all

[[To Be Continued…

This concludes beta Chapter 22(a) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for beta Chapter 22(b), about logging and production post-factum debugging]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

Jesper Nielsen says

March 28, 2017 at 8:50 pm

> Unit tests (at least as they’re usually understood) are usually not enough to provide sufficient test coverage

> This can (and IMO should) be alleviated by moving tests up the abstraction level – and testing larger pieces of code

Lemma to this – don’t be afraid to use tools built for unit-testing to perform higher level tests. I’ve had very good experience using Visual Studio’s Unit test framework in my day job to perform automated integration testing. At one point we had to perform a large series of mock business flows against a server run by an “authoritative entity” for certification. The tests later proved very valuable to ourselves for regression testing.

Don’t forget manual testing also – it’s always good to have QA screwing around trying to break things because they’re very good at that – actually most non-programmers, even if they aren’t trained for QA, are. (Nooo – why did you push THAT button???)
But you need to complement this with test plans – structured, documented manual tests to be performed before a release.

I don’t have much good to say about Testpartner, Silk or similar automated UI testing though, unless you’re ready to throw a lot of resources at making those tests with limited value in return. 99% of failed tests will be because someone moved a UI element a few pixels…

- "No Bugs" Hare says
  
  March 29, 2017 at 4:27 am
  
  > This can (and IMO should) be alleviated by moving tests up the abstraction level – and testing larger pieces of code
  
  Exactly – and this is argued for in OP too :-). BTW, (Re)Actors (FSMs/…) happen to be a very good unit for meaningful testing (and doesn’t lead to messy code inside).
  
  > But you need to complement this with test plans – structured, documented manual tests to be performed before a release.
  
  This I have seen a lot – and have never seen it be efficient (in fact, it was a horrible waste of time _in_all_cases_I_have_seen_it_). If mainstream scenarios are auto-regression-tested – there is practically no room for the-same-thing-done-in-manual, to fail. Sure, there can be occasional slips (something not really auto-tested) – but unless we’re speaking about military etc. – usually we have only a limited QA budget (can be “money”/”time”/”people” – doesn’t matter for us now), and within this budget there are MUCH better things your QA can do. Moreover, having manual tests where QA’s role is reduced to the role of a robot pressing the buttons – tends to alienate good QAs to the point of leaving the company. So while I agree that manual testing is good (yes, good QAs are very good at finding corner cases) – I will strongly oppose making QA go through the list of buttons to be pressed and reactions to be seen; this is the kind of mundane work computers are doing much better than people – and people tend to _hate_ doing it.
  
  > 99% of failed tests will be because someone moved a UI element a few pixels…
  
  Even if the test framework is _that_ bad (which I didn’t see in practice; a reasonably good test framework should go by IDs, not by pixels) – it is still better to have QAs look at those failed tests, and fix them (at least they won’t hate you _that_ much as when they need to press all the buttons themselves). Also – let’s keep in mind that as a rule of thumb, client-side logic shouldn’t be regression-tested via UI (the boundary between UI and logic should be well-defined, and automated testing should be applied at this boundary instead; then – we’ll see that at least for games, graphics/UI will change fairly rarely compared to the logic). If speaking about web testing – it is a different beast – but frameworks such as Selenium are not _that_ bad (and IMO clearly preferable to mundane manual button-pressing).
  
  - Jesper Nielsen says
    
    March 29, 2017 at 10:24 am
    
    ——
    > This can (and IMO should) be alleviated by moving tests up the abstraction level – and testing larger pieces of code
    
    Exactly – and this is argued for in OP too :-). BTW, (Re)Actors (FSMs/…) happen to be a very good unit for meaningful testing (and doesn’t lead to messy code inside).
    ——-
    Yes you were:) I was actually quoting your text there, I guess it was hard to see.
    
    And yeah – subsystems communicating only via queues definitely makes testing easier.
    
    I will still argue that especially in a graphical game there’s really no way to test end-to-end functionality than to actually testplay the game, and a list of features to check is not a bad thing. I’m not saying that all corner cases should be covered by manual testing, or that the test-plan should be rigid, because a realtime game is not end-to-end deterministic, no matter how deterministic the parts are, simply because of timing issues of input, networking etc.
    
    PS – I don’t think we’ll see a testing framework with powerful enough computer vision and other AI to properly test the user experience of a graphical realtime game before shortly after the singularity.
    
    - "No Bugs" Hare says
      
      March 30, 2017 at 4:18 am
      
      > I was actually quoting your text there, I guess it was hard to see.
      
      Oops :-). This is what one of my colleagues tends to call a “violent agreement” :-).
      
      > I will still argue that especially in a graphical game there’s really no way to test end-to-end functionality than to actually testplay the game
      
      For the first time – sure; however, regression testing is a very different beast – and it can be done automatically. Also – for manual testing it is very important not to force testers (who are ideally beta testers) to reproduce the bug (they should just say “here is the bug” by pressing the controller button) – and this is achievable too (see, for example, http://www.gdcvault.com/play/1014345/I-Shot-You-First-Networking for a very visual example of such post-factum debugging).
      
      > I don’t think we’ll see a testing framework with powerful enough computer vision and other AI to properly test the user experience of a graphical realtime game before shortly after the singularity.
      
      Yes – but see above about regression testing :-).