C++ for Games: Performance. Allocations and Data Locality

 
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 



Lack of Locality: How Bad Can It Be in Practice?

You may feel that the above is more like a horror story and/or FUD; after all, everybody uses allocations and dynamic data structures (and all non-C/C++ programming languages do it at an even larger scale). So, a reasonable question arises: how bad can the impact of poor data locality be in practice? I had this question myself for quite a while, until I got a very enlightening experience in this regard.

Once upon a time, there was a real-world program which performed certain calculations on a large in-memory ad-hoc tree-like data structure: there were structures, and collections of other structures within those structures, and so on, and so on. And it worked, but at a certain point it started to use a bit too much memory (several gigabytes), and we decided to “flatten” the structure – not to improve speed, but to save some RAM (by eliminating not-really-necessary extra pointers, and there were LOTS of them). So I rewrote the data structure, “flattening” it (more specifically, some of the structures were “flattened” into their immutable versions right after they were fully created; see more discussion in the “Mitigation: Flattening” section below).

And yes, as a result of the rewrite we got the roughly 2x reduction in memory size which we were planning for. What we didn’t plan for was that

The program as a whole started to work 5x faster!

I should note that this program was an almost-ideal case for speed-up due to flattening (it had small tree-like structures which flatten very nicely, and into small-size blocks too), so the 5x result should be seen as close to the maximum real-world speed improvement achievable via flattening. On the other hand, it does illustrate that gains due to flattening MIGHT be VERY compelling not only on paper, but in the real world too.

Almost-Natural Locality: Reactors

As we’ve discussed in Chapters V, VI, and VII, I suggest implementing as much as possible of your game (both on the Server-Side and the Client-Side) as Reactors (a.k.a. event-processing objects). We’ve already discussed many Very Significant Benefits of Reactors in those Chapters (including, but not limited to, Replay Testing and Production Post-Mortem), but from our current point of view, there is one additional benefit of Reactors which matters: Reactors tend to have relatively good locality themselves.

Indeed, each Reactor has its own state which is not accessed by anybody else. And as the size of that state is generally limited, it tends to help with locality a bit (though the larger the state of a single Reactor is, the more you may need to resort to other techniques to improve locality).

However, there is one thing which you need to do to exploit this property of Reactors:

You need to use a separate allocator for each of your Reactors

Otherwise, if you’re running dozens of Reactors within the same process, all of them using the same (usually default) allocator – then all the data from your different Reactors will be spread over one large default allocator, which will pretty much kill the better-Reactor-locality property. This negative effect is mitigated to a certain extent by allocators exhibiting some thread affinity themselves (such as tcmalloc and ptmalloc2), but when speed is an issue, I still prefer to have explicitly separated allocators for separate Reactors (or at least to have the ability to do it this way). As a side benefit, such per-Reactor allocators can usually be made thread-agnostic (avoiding all the locks completely) – at the cost of using a different (global and thread-safe) allocator for message composing/parsing.

To make sure that you’re using “correct” allocators, you will probably need to wrap all the allocating data structures (such as std::basic_string<> and std::vector<>) into your own classes using your own allocators, and use these “your own” classes instead of the usual std::string, std::vector, etc. IMHO, having such wrappers qualifies as a Good Idea even if you end up using the default std::allocator for your data later (at least you will be sure that you’ve tried everything and are working in the best possible mode), but most of the time you’ll be able to see some gains in this regard.
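One way to sketch such per-Reactor allocation in modern C++ is via the standard polymorphic-memory-resource machinery. The Reactor/Player names below are purely illustrative, and a real codebase might well roll its own allocator instead; this just shows the shape of the idea, assuming C++17’s std::pmr is available:

```cpp
#include <memory_resource>
#include <string>
#include <vector>
#include <cstddef>

// Illustrative sketch: each Reactor owns its own memory resource, so all of
// its allocations come from one pool (better locality), and no locks are
// taken (unsynchronized_pool_resource is NOT thread-safe, which is exactly
// what we want for a single-threaded Reactor).
class Reactor {
public:
    using String = std::pmr::string;
    template<class T> using Vector = std::pmr::vector<T>;

    Reactor() : pool_(std::pmr::new_delete_resource()),
                players_(&pool_) {}

    void add_player(int id, const char* name) {
        // the string data lands in THIS Reactor's pool, not in the global heap
        players_.emplace_back(Player{id, String(name, &pool_)});
    }

    std::size_t player_count() const { return players_.size(); }

private:
    struct Player { int id; String name; };
    std::pmr::unsynchronized_pool_resource pool_;  // per-Reactor allocator
    Vector<Player> players_;
};
```

Because `unsynchronized_pool_resource` takes no locks, this only works if the Reactor’s state is touched by one thread at a time – which is exactly the Reactor model described above.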

Mitigation: Reducing the Number of Allocations

The most obvious way to reduce the number of allocations and improve spatial locality is, well, avoiding allocations wherever possible. The three most common sources of allocations in most programs are (a) strings, (b) collections, and (c) “owning” pointers (which can be implemented either via smart pointers, or via manual allocations).

With strings, it is often possible to:

  • reduce the number of strings, or at least reduce the number of strings actively used during time-critical processing. One example is to use integer IDs instead of player names for all internal purposes (especially for search keys); you still MAY have that std::string member within your Player structure, as long as you do NOT use it most of the time. You may still need it for rendering on the Client-Side (which is not a problem performance-wise), and for state sync on the Server-Side (in this case, “Flattening as an Intermediate Stage” may help, see below).
  • rely on short string optimisation (a.k.a. SSO) if your std:: library does it. See [StackOverflow.StdString] for details.
  • use eastl::fixed_string (for more on eastl, see also [[TODO]] section below)
  • use fixed-size char s[SZ]; instead of std::string (DON’T forget to encapsulate the char array1 and enforce all the appropriate bounds checks to avoid nasty buffer-overrun bugs). If you know that your strings are up to 8 bytes in size, there is usually little sense in using std::string (if your std::string does support SSO, then you’ll end up with 24+ bytes instead of 8; otherwise, you’ll end up with an allocation, which is even worse); even for 20-30 byte strings, using some extra memory to avoid allocation is often worth the trouble.
    • On the other hand, making the decision “whether to use char[] or std::string in this particular case” is not always easy, so I suggest having your own class MyString<max_size> for use within long-term-saved structs, with MyString deciding itself whether it uses char[] or std::string depending on max_size. This way, you will be able to adjust the balance between memory usage and locality (i.e. speed) without any changes to your app-level code, just by changing MyString. Alternatively, MyString MAY use the fixed_part_size approach described below for vectors. In a sense, this MyString is very similar to the short string optimisation mentioned above, but is more flexible (we can also see MyString as a DIY version of eastl::fixed_string).
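A minimal sketch of what such a MyString<max_size> might look like (the 32-byte threshold and the tiny interface are arbitrary illustrative choices, not a recommendation):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical MyString<max_size>: below a threshold it stores the characters
// inline (no allocation at all); above it, it falls back to std::string.
template<std::size_t max_size, bool Inline = (max_size <= 32)>
class MyString;

// inline (fixed char array) version — no heap allocation, ever
template<std::size_t max_size>
class MyString<max_size, true> {
public:
    MyString(const char* s = "") {
        assert(std::strlen(s) <= max_size);
        std::strncpy(buf_.data(), s, max_size);
        buf_[max_size] = '\0';  // always NUL-terminated
    }
    const char* c_str() const { return buf_.data(); }
    std::size_t size() const { return std::strlen(buf_.data()); }
private:
    std::array<char, max_size + 1> buf_{};
};

// fallback version — delegates to std::string for large max_size
template<std::size_t max_size>
class MyString<max_size, false> {
public:
    MyString(const char* s = "") : s_(s) {}
    const char* c_str() const { return s_.c_str(); }
    std::size_t size() const { return s_.size(); }
private:
    std::string s_;
};
```

The point is that app-level code says only `MyString<8> name;` – whether that means an inline array or a heap-backed string is decided in one place, and can be re-tuned later without touching call sites.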

With collections, one very common case which can be optimized is vectors which (usually) happen to be very small (like up to 5-10 elements, though YMMV). Such vectors can often be replaced with fixed-size arrays. On the other hand, if an outright restriction on array size is too harsh (or the required size is too large), I often use my own class MyVector<T,fixed_part_size>; this class has T[fixed_part_size] as an array, plus some other fields related to potential allocation. Then, if the vector size is smaller than fixed_part_size, MyVector<> stores everything within the array and avoids allocation completely; if the vector grows larger than fixed_part_size, it resorts to allocation (for example, spilling into a usual std::vector). MyVector<> can even have an interface which is identical to std::vector<> (though dealing with iterators may be not-so-trivial). BTW, a similar thing is available in EASTL as eastl::fixed_vector<> (and they have eastl::fixed_string too; more on EASTL in the [[TODO]] section below).
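To make the idea concrete, here is a stripped-down sketch of such a MyVector<T, fixed_part_size> (only a few members are shown; the names and the spill policy are illustrative, and a real implementation would cover the full std::vector interface, with iterator-invalidation rules thought through carefully):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical MyVector: elements live in an in-object array until the vector
// outgrows it, then everything spills into a heap-allocated std::vector.
template<class T, std::size_t fixed_part_size>
class MyVector {
public:
    void push_back(const T& v) {
        if (!spilled_) {
            if (size_ < fixed_part_size) { fixed_[size_++] = v; return; }
            // grow past the fixed part: move everything to the heap once
            heap_.assign(fixed_, fixed_ + size_);
            spilled_ = true;
        }
        heap_.push_back(v);
    }
    std::size_t size() const { return spilled_ ? heap_.size() : size_; }
    T& operator[](std::size_t i) { return spilled_ ? heap_[i] : fixed_[i]; }
private:
    T fixed_[fixed_part_size] = {};  // no allocation while we fit here
    std::size_t size_ = 0;
    bool spilled_ = false;
    std::vector<T> heap_;            // used only after spilling
};
```

As long as the common case stays within fixed_part_size, no allocation ever happens, and the data sits right inside the enclosing object – which is exactly the locality win we’re after.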

For owning pointers, you CAN try to avoid them too. One common example is an owning pointer to a polymorphic base class, with the actual object pointed to being an object of a derived class (so you can benefit from all the virtualization goodies). To avoid allocation in this case, it is often possible to “guess” the maximum size (and maximum required alignment) of all your derived classes, and then to have something like alignas(GUESS_MAX_ALIGN) uint8_t buf[GUESS_MAX_SZ]; instead of your “owning pointer”, using “placement new” to construct your objects within buf. NB: use of asserts (or even better, static_asserts) to check that ALL your subclasses fit into this space (and are properly aligned too) is ABSOLUTELY REQUIRED! NB2: yes, it does expose your code to knowledge of all the subclasses; whether performance is worth this complication is up to you; as a rule of thumb, do NOT use this trick until you’re sure that you need it. NB3: this technique MAY be used alone, or with the offsets discussed in the “Mitigation: Flattening” section right below.
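A hedged sketch of this buf-plus-placement-new trick might look as follows (Base/Derived1/Derived2 and the size/alignment guesses are made up for illustration; note the static_asserts doing the ABSOLUTELY REQUIRED checks):

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <utility>

struct Base { virtual ~Base() = default; virtual int kind() const = 0; };
struct Derived1 : Base { int x = 1; int kind() const override { return 1; } };
struct Derived2 : Base { double y = 2.0; int kind() const override { return 2; } };

constexpr std::size_t GUESS_MAX_SZ = 32;
constexpr std::size_t GUESS_MAX_ALIGN = alignof(std::max_align_t);

// Holds any Derived object inline — no heap allocation at all
class InplaceHolder {
public:
    template<class T, class... Args>
    T* emplace(Args&&... args) {
        // checked at COMPILE TIME for every subclass ever emplaced here:
        static_assert(sizeof(T) <= GUESS_MAX_SZ, "subclass does not fit!");
        static_assert(alignof(T) <= GUESS_MAX_ALIGN, "subclass misaligned!");
        destroy();
        obj_ = ::new (buf_) T(std::forward<Args>(args)...);  // placement new
        return static_cast<T*>(obj_);
    }
    Base* get() const { return obj_; }
    ~InplaceHolder() { destroy(); }
private:
    void destroy() { if (obj_) { obj_->~Base(); obj_ = nullptr; } }
    alignas(GUESS_MAX_ALIGN) std::uint8_t buf_[GUESS_MAX_SZ];
    Base* obj_ = nullptr;
};
```

With the static_asserts in place, adding a too-large subclass later fails the build instead of corrupting memory at runtime – which is the whole reason they are non-negotiable here.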


1 probably within a template class

 

Mitigation: Flattening

As I’ve mentioned above, flattening can provide very substantial performance benefits for complicated data structures. Let’s take a closer look at what flattening looks like.

Let’s say that we have an on-heap struct which has a string, a vector of ints, and some other fields. Normally, such a struct uses three allocated blocks: one for the struct itself, one for the string data, and one for the vector data. To “flatten” it, we want to put all this data into one single allocated block.

To do it, we can:

  • allocate one block of sufficient size
  • place struct in the beginning of the block
  • then, place string-data
  • then, place vector-data

Pointers to the string- and vector-data MAY stay as pointers (pointing to the appropriate places within the same allocated block), or you MAY replace them with offsets (or, for the string-data, avoid the offset altogether, as it is constant by design).
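As an illustration, here is a minimal hand-written flattening of a hypothetical struct holding a string and a vector of ints into one block, with an offset kept for the vector data (alignment handling is deliberately crude, and production code would need proper padding for arbitrary field types):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// One block: [FlatHeader][string bytes][padding][int array]
// The string starts right after the header (constant offset by design),
// so only the vector offset needs to be stored.
struct FlatHeader {
    std::uint32_t name_len;
    std::uint32_t scores_off;    // byte offset of the int array
    std::uint32_t scores_count;
};

std::vector<std::uint8_t> flatten(const std::string& name,
                                  const std::vector<int>& scores) {
    std::size_t off = sizeof(FlatHeader) + name.size();
    off = (off + alignof(int) - 1) & ~(alignof(int) - 1);  // align for ints
    std::vector<std::uint8_t> block(off + scores.size() * sizeof(int));
    FlatHeader h{(std::uint32_t)name.size(), (std::uint32_t)off,
                 (std::uint32_t)scores.size()};
    std::memcpy(block.data(), &h, sizeof h);
    std::memcpy(block.data() + sizeof h, name.data(), name.size());
    std::memcpy(block.data() + off, scores.data(), scores.size() * sizeof(int));
    return block;
}

// read-only traversal over the flattened block — zero pointer chasing
int sum_scores(const std::uint8_t* block) {
    FlatHeader h;
    std::memcpy(&h, block, sizeof h);
    int sum = 0;
    for (std::uint32_t i = 0; i < h.scores_count; ++i) {
        int v;
        std::memcpy(&v, block + h.scores_off + i * sizeof(int), sizeof v);
        sum += v;
    }
    return sum;
}
```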

Bingo! We’ve got our flattened data structure, and can traverse it too. As discussed above, you can expect traversing such a flattened structure to be significantly faster than traversing the original data structure, due to significantly better data locality. However, this comes at a price: modifying such “flattened” structures, while possible, is more complicated and expensive 🙁 . In other words: “flattened” structures are optimized for reading (and non-size-changing modifications), while usual C++ structures-with-ad-hoc-allocations are more like all-around jacks-of-all-trades (not too bad for modifications, not too good for reading). If you keep this observation in mind, it MIGHT be possible to achieve significant optimizations without rewriting too much of your program 😉 .

Now let’s consider several flattening scenarios which are common for games, and see how they work with regards to this observation.

Flattening Modifiable Reactor State (such as Game World State)

One not-so-common-but-perfectly-feasible approach is to flatten the whole state of your Reactor (as one example, you may want to flatten your Game World State).

In this case, you’ll need the ability to modify your flattened state. As a result, you generally SHOULD NOT try to completely flatten your whole state into one single allocated block (otherwise most modifications will become hopelessly long).

Instead, it is better to reduce the number of allocated blocks while keeping the size of each allocated block limited. This way, you’ll be able to modify a flattened-structure-within-an-allocated-block at a limited cost (even if a block modification requires reallocation and a complete rewrite of the block, the cost stays bounded, as the block size is limited).

If flattening of the Game World State (the version which you normally work with) is your intention, then usually it is better to split your Game World State into reasonable-size objects (approx. 64-1024 bytes in size each), allocating each of them separately. This way, you will keep the ability to modify your flattened structures at a not-so-high cost, but will still save a lot on locality. You will still need to do some jumps in RAM to traverse your Game World State, but MUCH fewer of them compared to traditional unflattened C++ allocation-based data structures.

Flattening as an Intermediate Stage: Client-Side

In games, there are at least two specific scenarios when flattening can be used without the need to modify flattened data structures at all.

The first such scenario occurs on the Client-Side, when we need to render our Client-Side game world state for the next (Client-Side) tick/frame. Then it MIGHT be a good idea to create a flattened version of our game world state; after creating this flattened version, the two Big Client-Side Tasks we have can work perfectly in parallel.

The first Big Client-Side Task is rendering the current world state; this is done over an essentially-constant flattened structure, so all the traversals are very fast. The second Big Client-Side Task is modifying the current game world state to figure out the world state for the next tick; this is done over a usual C++ tree-like structure, which is easily modifiable.

Flattening as an Intermediate Stage: Server-Side

[[TODO: a case for Server-Side flattening suggested by Glenn Fiedler]]

Flattening using FlatBuffers

One interesting trick with regards to flattening is to use [FlatBuffers] (even when you do NOT want to send anything across the network). Indeed, FlatBuffers serialized data looks very much like the “flattened” structure which I’ve described above – and they already have all the automated code-generation stuff to make it happen without much effort on your side.

So, if you take your data structure and serialize it using FlatBuffers, you will be able to use this FlatBuffers-serialized version as a “better, faster” version of your data structure. Using serialized data as a faster way to access the data might sound crazy – but with FlatBuffers, most of the time it will indeed work exactly this way.

The only significant drawback in this regard is that, last time I checked, the only way to modify a FlatBuffer (known as mutation in FlatBuffers-speak) didn’t support size-changing mutations at all. Other than that – while I haven’t tried it myself – I love the idea of using FlatBuffers for flattening the data structures stored as part of the state of your Reactor (for example, the state of your Game World).

Oh, and keep in mind that if you want to use FlatBuffers to represent Game World State in a “Flattening Modifiable Reactor State” scenario, the recommendation to keep the size of allocated blocks reasonably small (usually on the order of 64-1024 bytes) still stands. For modification, you can either use FlatBuffers mutation, or (when FlatBuffers-provided mutation doesn’t fly) build a new FlatBuffer from the existing one, adding or removing fields as necessary.

On Data Locality and Hyperthreading

As you most likely know, most x86/x64 CPUs have a feature known as Hyperthreading (HT).2 Actually, one of the Biggest Ideas behind HT is to give the CPU’s ALUs etc. something to do while the CPU waits for data from RAM ;-). Among other things, it means that if the locality of your data is Really Good, your program MAY perform better without HT.3

While on the Client we do NOT normally control HT, on the Server I suggest trying to run your program both with and without HT, and seeing which way it performs better.


2 actually, HT is a reincarnation of SMT (“Simultaneous MultiThreading”), well-known in the non-x86 world
3 and yes, it has been confirmed by some practical observations too, though YMMV

 

On Prefetch and _mm_prefetch()

One further thing which has a significant impact on performance in the context of data locality is so-called “prefetch”. In short, if you read memory in sequence, modern CPUs recognize the pattern and fetch the next cache line “just in case” you might need it. I don’t want to go into lengthy discussions of “how prefetch is implemented” etc., but will concentrate instead on “what we can/should do about prefetch from the developer’s point of view”.

Most of the time, prefetch just silently works behind the scenes, and I haven’t seen cases when messing with prefetch at application level would be worth the trouble. At least in theory, however, such cases do exist.

If you still want to play with manual prefetch (also known as “software prefetch”), on x86/x64 you MAY want to take a look at the _mm_prefetch() intrinsic function (available both with GCC and MSVC). The idea of this function (actually an asm instruction from the x86/x64 instruction set) is to inform the CPU that you’re about to need a certain memory location, so it can start getting the corresponding cache line into cache (that is, if the line is not in cache yet). The idea is that by the time you’re done with whatever other work happens between _mm_prefetch() and the actual memory access, you will get your data from L1 cache. BTW, this implies that _mm_prefetch() is useless for writing; writing pretty-much-never blocks anyway (and doesn’t require prefetching).4

Once again – do NOT expect miracles from _mm_prefetch(); moreover (unlike some cases of flattening), it is very unlikely to be worth the trouble (except, maybe, for very specific, very tight algorithms already identified as Big Fat Bottlenecks). If you still feel you do need manual prefetch – make sure to read [LeeEtAl]. BTW, on ARM CPUs there is a somewhat similar PLD instruction (and at least on GCC, you have the __builtin_prefetch() intrinsic).5


4 for modern CPUs, all writes are buffered at the CPU level one way or another, so you don’t need to wait until the data is really written to RAM
5 technically, PLD is available only for ARMv7 and better, but this includes “pretty much any smartphone/tablet CPU out there”

 

On Locality within the Cloud

You might think, “hey, who cares about these locality things in the cloud age!” However, it is a bit more complicated than that.

First of all, let’s note that locality DOES apply to cloud-hosted apps. While for cloud-hosted apps locality effects will indeed be somewhat diluted by “noisy neighbours”, and you will indeed get less control, better-locality apps will still perform better than worse-locality ones (in particular, because the characteristic times for locality effects to exhibit themselves are several orders of magnitude shorter than the characteristic times of the VM/hypervisor).

Now, let’s note that whether it is worth spending time on Server-Side locality issues (or better to pay extra for hosting) is an open question. Ideally, I’d try to have a set of APIs which allows writing without caring about locality (letting you release your game earlier, a.k.a. ensuring better time-to-market), but which also allows working on better locality later, saving you on server costs when they start to matter.

On Writing Your Own Allocator

The fact that allocations are expensive is rather well-known among gamedevs. As a result, it is fairly common for them to write their own allocators. I won’t say whether this is a good idea or a bad one (it Really Depends), but will note a few important observations instead:

  • unless your allocator is Very Specific to your case (as in “all my allocations have exactly the same size”), it is Very Difficult to beat a good modern general-purpose allocator algorithm-wise
    • At the very least, make sure to read [ptmalloc2] to see what ptmalloc2 is doing, and realize which of their stuff you want and which you don’t
    • If developing your own allocation algorithm, DO take locality into account. For example, LIFO allocators tend to work significantly better than FIFO ones due to better temporal locality.
  • That being said, not all the platforms have built-in allocators which are that good
    • Therefore, there MIGHT be merit in using the same library-based allocator across all your Client platforms. In this case, ptmalloc2 and/or tcmalloc are two rather usual candidates for porting
  • While it is difficult to beat existing allocators algorithm-wise in the general case, it IS possible to beat them if you restrict allocator functionality and create specific allocators (such as a “no-lock thread-agnostic allocator”, or a “LIFO-only allocator”)
  • Usually, I argue for having several different allocators for different purposes. At the very least, having one-instance-of-allocator-per-Reactor (as mentioned above in the “Almost-Natural Locality: Reactors” section) is usually a Good Idea
    • In particular, such a per-Reactor allocator can be made thread-agnostic (avoiding all the nasty locks). NB: if you’re using this technique, you usually need to make sure that a different, thread-safe allocator is used for inter-Reactor messages.
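To illustrate the kind of restricted allocator mentioned above, here is a minimal sketch of a thread-agnostic LIFO-only arena (all names and the 16-byte alignment policy are illustrative; a production version would need to handle over-aligned types, out-of-memory policy, etc.):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Thread-agnostic LIFO ("stack") allocator: allocation is a pointer bump,
// and freeing in reverse order just moves the pointer back. No locks
// anywhere — safe only because a Reactor's state is touched by one thread
// at a time.
class LifoArena {
public:
    explicit LifoArena(std::size_t capacity)
        : buf_(new std::uint8_t[capacity]), cap_(capacity) {}
    ~LifoArena() { delete[] buf_; }
    LifoArena(const LifoArena&) = delete;
    LifoArena& operator=(const LifoArena&) = delete;

    void* alloc(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);      // round up to 16-byte granules
        if (top_ + n > cap_) return nullptr;  // out of arena space
        void* p = buf_ + top_;
        top_ += n;
        return p;
    }
    // LIFO-only free: only the MOST RECENT allocation may be released
    void free_last(void* p, std::size_t n) {
        n = (n + 15) & ~std::size_t(15);
        assert(buf_ + top_ - n == p && "LIFO discipline violated");
        top_ -= n;
    }
    std::size_t used() const { return top_; }
private:
    std::uint8_t* buf_;
    std::size_t cap_;
    std::size_t top_ = 0;
};
```

Note how restricting the functionality (LIFO-only, single-threaded) is exactly what makes it trivially faster than a general-purpose allocator – and, as a bonus, LIFO order gives good temporal locality for free.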

[[TODO: optimizing data structures depending on the way they’re going to be processed, example vector-of-structs vs struct-of-vectors]]

[[To Be Continued…

This concludes beta Chapter XIII(a) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter XIII(b), continuing the discussion of the performance of C++ game code beyond allocations and data locality.]]



Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Comments

  1. says

    > use fixed-size char s[SZ]; instead of std::string.

    Bare arrays are bug magnets, don’t use them lightly. Don’t skip the step of wrapping them in a template. Eliminating arrays from my C++ code and switching to a string class and a vector class for arrays (both of which do bounds checks in DEBUG builds) dramatically reduced the initial number of bugs in my code.

    Also, nice timing on this post – I started load testing my server last week and server performance optimization is very much on my mind. I have a couple more tricky bugs left before I’ll reach my initial goal of 50 clients connected.

    • "No Bugs" Hare says

      > Don’t skip the step of wrapping them in a template.

      Of course, I’ve added a sentence about encapsulating it, THANKS!

    • Dmitry says

      Another reason to _not_ replace std::string in most of the cases is short string optimizations in modern C++ standard library implementations. E.g. libc++ does not allocate any memory for strings with length up to 20 chars on x86_64.

      • "No Bugs" Hare says

        …and has size of 24 bytes instead of 8 for char[8] (that’s 3x overhead). I am not saying that SSO is bad (the answer is, as always, “it depends”, and for most of usage scenarios it is a good thing), but std:: is by design “one size fits all”, and as a result cannot possibly compete with solutions optimised for one single use case.

  2. luc says

    Question 1

    On page 1. Curiosity.

    “Once upon a time, I’ve seen two games.
    One has had 400K simultaneous players, and another one – around 100K, and from the player’s perspective the game
    was 99%-the-same.” (…)

    Could you share with us more information about these two games? For example: game title, tech stack, programming language(s), engines, architecture and so on.

    Maybe you cannot say much about these two games since it seems some kind of privileged information.
    I understand these details are not that important for the point of the text. Also, accurate conclusions cannot be taken since we do not know the whole scenario.
    But, you know! As developers, we love to know these kind of stuff! 🙂

    Question 2

    On page 2. Regarding Reactors (FSM).

    In the following text:

    “Otherwise, if you’re running dozens of Reactors within the same process,
    all of them using the same (usually default) allocator – then all the data
    from your different Reactors will be spread over one large default allocator,
    which will pretty much kill better-Reactor-locality.”

    What if we had one Reactor (FSM) per process. Could it solve the allocator issue? Would we add another issue?

    • "No Bugs" Hare says

      > Could you share with us more information about these two games?

      I would certainly like to brag ;-), but I am always trying to stay on a Very Safe Side from the point of view of professional ethics. In other words – if I tell names, some of potential clients can think – “hey, he might also tell about us too”, and cease to be potential clients :-(. Overall, I’m trying to treat employers and clients the same way doctors treat their patients – while telling “how this disease is usually cured” is usually ok, telling patient’s names is a Big No-No.

      > What if we had one Reactor (FSM) per process. Could it solve the allocator issue?

      Yes, it will :-).

      > Would we add another issue?

      Yes, it will :-(. (a) it limits you to a single Reactor per process (and this is not always optimal, especially as having multiple thousands of processes per box is usually a Bad Idea). (b) the costs of running a process are higher compared to the costs of running a thread (there are a dozen subtle things which are better for threads, including inter-thread vs inter-process communications).

      Bottom line about Reactor-per-process: there ARE cases when doing it will help, but – it is not universal :-(.

  3. int19h says

    >> If you know that your strings are up to 8 bytes in size, there is usually VERY little sense to use std::string

    Don’t most STL implementations use small-string optimization these days (which is basically shoving as much of the data into the string object itself as they can to avoid allocation)?

    MSVC does that for sure, and avoids allocations for strings up to 15 chars (on x64), so it would seem that the proper advice here when it is the target is the reverse – if you know that your strings are up to 15 bytes in size, _do_ use std::string, since there’s no allocation/non-locality overhead, and you do get a much more convenient interface than a char[].

    • "No Bugs" Hare says

      THANKS for mentioning SSO, I’ve added it. However, for <=8-byte string there is still little sense to use std::string, as SSO-ed std::string itself is 24+ bytes (32 for MSVC, ouch) - instead of 8 for char[8] :-).

      • int19h says

        Yep, if you know that you will never ever cross that boundary, it is overkill (albeit a convenient one). On the other hand, you get the convenience of having short strings be blazing fast, while still allowing to store the occasional long string that you might need, and paying the penalty only for access to that occasional string.

        Obviously this is still jack-of-all-trades & master of none approach. I just think that it’s “good enough” (that you wouldn’t have to bother with micro-optimizations) in way more scenarios than a non-SSO optimized std::string.

        BTW, even if you do have to stick to arrays, I think there might be some advantage to using std::array instead, to get normal pass-by-value semantics without pointer decay etc. You could even overload some operators for array to make it behave like STRING*N in BASIC dialects of old.

  4. Maxim says

    Didn’t you forget about data-oriented design? And as a most notorious example something like “struct of vectors instead of vector of structs”?

    • "No Bugs" Hare says

      You mean “Data-oriented design” as it was presented by Mike Acton a few years ago, right? I have to say that I tend to agree with most of he’s saying (except for a few things, such as his formulation of Lie #2, IMO it is overgeneralized – and his example with 99.9% waste without mentioning that even with the waste, assuming 60 frames/second 2Mbytes over 10K frames takes just ~5e-7 of the total RAM bandwidth – hardly worth to spend time to optimize). On the other hand – I don’t feel that “Data-oriented design” understood as a very generic “the way how we should look at the code” (from code POV or from data POV) is within the scope at least for this Chapter.

      As for the significantly narrower concept of “optimizing the data depending on how it will be processed” (and this includes ‘structs of vectors vs vector of structs’) – I’ve added a note to discuss it in 2nd beta, THANKS!

      • Maxim says

        Yes, that’s from his talk. I don’t agree with a lot of things from it (perhaps, cuz I’m a DBMS internals engineer, not a game backend 🙂 and we have a little bit different restrictions ), but rewriting data structures for better usage was a good point.

        • "No Bugs" Hare says

          Oh, the beautiful world of DBMS internals… From whatever little I know in this regard, within DBMS you should be very much RAM-bandwidth-bound (and this is quite unusual situation for the rest of us), am I right?
