C++ for Games: Performance. Allocations and Data Locality

 
Author: "No Bugs" Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 


C++ performance

#DDMoG, Vol. V
[[This is Chapter 16(a) from “beta” Volume V of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing”. All the content published during Beta Testing is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

[[TODO: cross-platform C++, with a real-world example going across PC/Mac/NDK]]

Using C++ in the context of games implies quite a few considerations, many of them being performance-related. As usual with performance, there are many things which are subjective and/or depend on the specifics, so it is important to take everything you hear about it (this book included) with a good pinch of salt. Still, I’ll try to provide a few pointers and my personal feelings in this regard.

On Performance in General

One thing which is very important when we’re speaking about performance is the so-called 90/10 law:

90% of the time is spent in 10% of the code, and only 10% of the time in the remaining 90% of the code

This law tends to hold for games too (ok, for some games the ratio might be closer to 70/30, but the distribution is still quite uneven, as there are corner cases, and corner cases of corner cases, and so on).

This observation means that

Not all the code needs to be absolutely optimized

In particular, the main question we’re facing during development is: “is this code bad enough performance-wise to be rewritten right away, or can it wait until we run into performance problems, when we can use a profiler to identify the real bottleneck instead of guessing?”

My approach in this regard usually goes along the following lines:

  1. Using code which is orders-of-magnitude inefficient (and is executed more frequently than once-in-a-blue-moon) usually qualifies as a Pretty Bad Idea even for not-so-frequently-executed code. It becomes an even worse idea if we’re speaking about inefficiencies comparable to O(N), where N is the total number of your players (or even the number of players-per-server).
    • It becomes even more important to make sure that all the things which dictate the code to be that inefficient, and which are used in many places across the code, are fixed as soon as possible. In other words, if your interfaces are forcing your code to be inefficient – they generally should be changed at the earliest opportunity (as a later rewrite can easily amount to rewriting The Whole Damn Thing).
    • On the other hand, if the inefficiency in question is very local and can be fixed without affecting anything else, it can be allowed to live longer.
  2. It is usually good to use some practices which allow you to optimize “for free”. In other words, if all it takes is to use “++it” instead of “it++” (see also the [[TODO]] section below for more examples, and the sketch right after this list) – it is usually a Good Idea to do it.
  3. It is Really Important to have Very Clean separation between different parts of your code, at the very least between your infrastructure-level code and game-logic-level code (but different parts of game-logic-level code usually need to be separated too – for example, as separate Reactors, see Chapter V for details). Such separation has numerous benefits, but from an optimization point of view, such Very Clean separation opens the door for optimizing your code later, when it becomes necessary (and when it becomes apparent which pieces of code are bottlenecks in real-world scenarios).
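To illustrate point 2 above with the simplest possible example: preferring pre-increment over post-increment for iterators is one of those things which cost nothing to write and MAY save a redundant iterator copy (with modern optimizers the difference is often zero – which is exactly why it is a “free” habit rather than an optimization worth fighting for). A minimal sketch:

#include <map>
#include <string>
#include <cstdio>

// Illustrative only: for non-trivial iterators, it++ has to construct and
// return a copy of the iterator, while ++it does not; writing ++it costs
// nothing, so it is a reasonable default habit.
void print_all(const std::map<int, std::string>& m) {
    for (auto it = m.begin(); it != m.end(); ++it) {  // ++it rather than it++
        std::printf("%d=%s\n", it->first, it->second.c_str());
    }
}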

As you can see, despite providing some (hopefully useful) hints, the approach outlined above still leaves LOTS of room for interpretation. Which, in turn, brings us to the most important statement of all:

Finding “what to optimize and what to leave as is for now” doesn’t really qualify as science. It is an art.

On “noisy neighbor” code and “Resource Hogs”

One further thing to keep in mind with regard to the 90/10 (or 70/30) rule is that even if only the performance of 10% of the code matters,

the rest of the code can still affect the performance of the critical 10% in a Pretty Bad Way 🙁

This is related to the fact that LOTS of CPU (and more generally – computer) resources are de-facto shared between different pieces of code. Moreover, this effect will be observed even if we dedicate 100% of one of our cores to our critical code1 :-(. Such sharing happens because at least some of the CPU caches are shared between different pieces of code (and if we don’t dedicate 100% of a given core to a single task, even per-core caches will be effectively shared due to context switching), memory is shared too, and even some blocks of the CPU are often shared between threads concurrently running on the same core due to hyper-threading a.k.a. simultaneous multithreading a.k.a. SMT.
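For reference, here is what “dedicating a core via an affinity mask” (the mechanism footnote 1 refers to) can look like – a minimal Linux-only sketch, shown purely as an illustration; error handling is omitted, and Windows has SetThreadAffinityMask() for the same purpose:

// Pins the current thread to the given core via an affinity mask.
// Linux-specific; illustration only.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

void pin_current_thread_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}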

And as soon as we have a resource which is shared between a non-critical piece of code and a critical piece of code – well, it can happen that the non-critical piece of code will overuse this resource to the point that it starts affecting the critical code 🙁 . For example, if our non-critical code reads LOTS of RAM – it can easily cause the CPU to start kicking data-which-belongs-to-critical-code out of the L3 cache (an effect known as “cache pollution”), which in turn will affect the performance of our critical code.

These observations are usually not that bad in practice, but they lead us to one important consequence:

Even being non-critical doesn’t give a right to be a Resource Hog

1 for example, via affinity masks

 

On Server Performance

In certain circles of gamedevs (especially among some of those developers coming from the client side) there is a belief that performance only matters on the client. In extreme cases, this may result in statements along the lines of “[performance is] not an issue on the backend, where scaling up a service can mean nothing more than dragging a slider to the right” [GlazerMadhav].

Well, there is only one objection to this statement, and it is that the statement is dead wrong. First of all, linear scalability should not be taken for granted (hint: scaling a DB in a linear manner is very difficult at the very least, see also Chapter [[TODO]]). Second, even when you do have linear scalability,

“dragging a slider to the right” is going to cost your company.

Let’s consider a hypothetical example first. The current industry standard for multi-player games revolves around 1000 players/2S-workhorse-server (or 100 players/core, which is more or less the same thing). Now let’s consider a scenario where you didn’t care about server performance at all, and wrote your game in a way that it runs only 10 players/server. And then your game got Really Successful and grew to a million simultaneous players, which means that you need to run 100’000 servers. Not only will finding 100’000 servers to rent be quite a task by itself, but even at $250/server/month you’ll need $25M/month just in server costs (effectively requiring each of your players to pay at least $25/month just for servers, and that’s without any profit or even money to pay your salary). Things become even worse when we’re speaking about costs for Free-to-Play (a.k.a. F2P) games, where most of the players are not paying; for F2P even 100 players/server MAY easily put your game at risk (and each and every bit of extra cost will put you at a competitive disadvantage compared to your better-performing competition).

And depending on many factors, performance differences between different implementations CAN be REALLY HUGE. Coming from hypothetical clouds back to the real world, let’s consider another example. Once upon a time, I’ve seen two games. One had 400K simultaneous players, and another one – around 100K, and from the player’s perspective the games were 99%-the-same. The former company, however, was running its game off a mere 50 servers, while the latter one took 400 servers to run. It means that the first one managed to get 8’000 players/server, while the second one was at 250 – a whopping 32x difference in the number of players per server (and therefore in server costs per player). And the difference was pretty much mirrored by the size of the admin teams, too, ouch. Now you’re in a good position to make a guess – which of these two companies had more money left for advertising?2

One last note in this regard – I do NOT mean that you should aim to absolutely-optimize the performance of your Servers3 (and I do NOT rule out languages-other-than-C++ on the server side either). All I want to say is that

Server-Side Performance Does Matter Too

And when going beyond this statement – see above about optimization being an art rather than a science; which level of inefficiency to consider “unacceptable from the very beginning” is one of those choices which you’ll need to make for yourself.

For the rest of this Chapter, whenever speaking about best practices performance-wise, I’ll mean that they apply BOTH to Client-Side and to Server-Side (that is, unless it is explicitly stated otherwise).


2 Yep, you guessed it right
3 Neither should you aim to absolutely-optimize the performance of your Clients. While this may sound like heresy to some of the Client-Side gamedevs out there, I stand by this statement

 

Allocations are Evil. Even More So Because of Locality

In the context of performance, first of all, let’s discuss one thing which IMHO tends to affect the performance of C++ programs the-third-most4 – allocations.

The first important thing to remember for all allocations (whether it is C malloc() or C++ new) is that

Allocations are Expensive

Sure, there have been quite a few improvements to allocators during the last 10 years (including so-called thread caching in tcmalloc and per-thread arenas in ptmalloc2), but still – an allocation remains a rather slow operation (give or take, we’re speaking about the order of magnitude of 200 CPU clocks per allocation/deallocation pair, and more if you’re not so lucky with your allocator).
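If you want a feel for the numbers on your own box, a very crude measurement sketch (not a real benchmark – results depend heavily on the allocator, the size class, and on the fact that allocate-then-immediately-free is the allocator’s best case) could look as follows:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Crude ballpark of a malloc()/free() pair. The sum of pointer values is
// printed only to keep the optimizer from eliding the whole loop.
int main() {
    constexpr int N = 1'000'000;
    std::uintptr_t sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        void* p = std::malloc(64);
        sink += reinterpret_cast<std::uintptr_t>(p);
        std::free(p);  // immediate free: best case for thread-caching allocators
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("~%.1f ns per malloc/free pair (sink=%zu)\n",
                double(ns) / N, static_cast<size_t>(sink));
    return 0;
}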

BTW, when it comes to discussions of “smart pointers are evil” – actually it is not smart pointers per se, but the allocations-inevitably-arising-from-smart-pointers, which affect the performance of your code; on the other hand, if you replace a smart pointer with a naked-pointer-allocated-via-malloc – it is very unlikely that you’ll see any performance gain. More on it in the [[TODO]] section below.
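To make the same point in code (type names here are purely illustrative): the cost sits in the heap allocation, not in the smart-pointer wrapper, so the only version which actually avoids the cost is the one which avoids the allocation altogether:

#include <memory>

struct Point { float x, y, z; };

struct WidgetA {                       // one heap allocation per Point
    std::unique_ptr<Point> pos = std::make_unique<Point>();
};

struct WidgetB {                       // exactly the same allocation cost as
    Point* pos = new Point{};          // WidgetA (and now also leak-prone) –
};                                     // no gain from dropping the smart pointer

struct WidgetC {                       // no allocation at all – this is where
    Point pos{};                       // the real savings (and locality) are
};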

Allocations and Data Locality

In addition to allocations being relatively slow, even worse is the fact that

Allocations tend to reduce data locality

In many (most?) cases, poor data locality caused by allocations incurs a slowdown which is worse than just the cost of the allocation itself :-(. However, the concept of data locality is not that obvious (nor can it be easily separated from other effects), so let’s see what it is all about.

During the times of Ancient Gamers (also known as the times of Atari, ZX Spectrum, and Commodore 64), CPUs were running at clock speeds of single MHz (and one operation usually took several clocks). At the same time, memory ran at a speed comparable to that of the CPU. For example, on the Z80 CPU5 one register-register op took 4 CPU clocks, and the same operation but register-memory took 6 clocks.

Since that time, however, things have changed considerably. A register-register operation currently routinely takes less than 1 CPU clock (don’t ask me how it is possible 😉6), but reading from (main) memory takes 100+ CPU clocks. In other words, while memory latencies have been reduced since ancient times by a factor of 10x,7 over the same time frame the CPU’s pure number-crunching speed has improved by a factor of 1000x+. This discrepancy was addressed by all kinds of caches (the Z80 didn’t have any cache at all, while in recent years it is common to have at least three levels of CPU caches, known as L1-L3, with an occasional L4). Not really surprisingly, these caches and their mechanics tend to have a profound impact on performance.

I don’t want to go into too much discussion of caches etc. (for this, you can read [Drepper07], though beware of some outdated concepts such as Northbridge and Front-Side Bus, which have by now been replaced with modern NUMA-related stuff, discussed, for example, in [Lameter13] and [GaudEtAl15] [[TODO: if you know an up-to-date all-around reference covering BOTH usual RAM stuff and NUMA, please LMK]]). However, I will make a few observations which are applicable to caches and programming (with a small measurement sketch right after the list):8

  • At the physical level, all the work with memory is usually done via so-called “cache lines”. On x86/x64, the cache line has been 64 bytes for many years now
    • It means that as soon as you’ve accessed any single byte in a cache line, all the other 63 bytes are already in the L1 cache
    • In addition, if your program exhibits a pattern of sequential reading, CPU usually does so-called “prefetch”, getting you the next-in-pattern cache line just in case you MIGHT need it.
  • Register-register operation is around 1 CPU clock
  • L1 read is like 4 CPU clocks these days (typical L1 cache data size is around 32K). Usually L1 is per-core
  • L2 read is around 10-12 clocks (typical L2 cache size is around 256K). L2 is usually per-core these days
  • L3 read is 30-50 clocks (typical L3 cache size is around 4-12M). Unlike L1-L2, L3 is usually shared between cores and is per-socket(!)
  • Main RAM access is around 100-150 CPU clocks
  • NUMA accesses. “Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).” — Wikipedia. NUMA applies only to multi-socket configurations, i.e. to servers and (almost-)only to servers:
    • With NUMA, each CPU has its own RAM connected to it (and only to it); there is no more SMP/FSB-style “RAM directly connected to all the CPUs”
      • Whenever CPU A needs to access RAM connected to CPU B, CPU A issues a request to CPU B which goes over ultra-fast CPU-to-CPU connection (for example, Hypertransport for AMD, or Quick Path Interconnect a.k.a. QPI for Intel)
    • NUMA access to another socket, with data happening to be in other-socket L3, costs 100-200 CPU clocks
    • NUMA access to another socket, with data happening to be in RAM, costs 200-300 CPU clocks

Armed with this information, let’s take a look at a classical in-memory binary tree structure (the one used by STL’s map<> etc.); let’s assume that our tree/map has a million elements. Now, let’s see what happens if we want to look for something within the tree: even if our tree is reasonably well-balanced (which it is with STL), we’ll need to go through about 20 nodes of the tree (as 2^20 ~= 1e6). And if each of our elements is allocated on the heap, it means that we’ll need to make about 20 jumps in memory. In practice, some of these tree nodes will happen to be cached, but some won’t (causing a so-called cache miss when we’re trying to access them), so from a certain point in our lookup we’ll start incurring that 100+-clock penalty for each of these jumps(!).

Which leads us to a very important conclusion:

Data Locality DOES matter.9

In this regard, let’s first note that some of the allocators10 use some (though usually limited) trickery to improve data locality. This is a Good Idea, and in quite a few scenarios having such trickery MIGHT more than compensate for the allocator formally exhibiting a bit less performance than the competition. The moral of the story: do NOT trust 3rd-party measurements of allocator performance, and instead compare such performance exactly within the context of your game.

Let’s further note that while the data locality problem with STL-like trees can usually be addressed by switching to hash tables11 a.k.a. std::unordered_*<> containers (though they do have their drawbacks in worst-case scenarios, so you need to be careful), most of our programs also have their own ad-hoc trees. One example is a list-of-players consisting of Player objects, each of the Players containing a player-name and a list of Friends, and so on, and so forth; an even more obvious example is a classical Scene Graph routinely used for 3D rendering (see, for example, Chapter 16 from [McShaffryGraham]). For such ad-hoc tree-like structures, the solution is not THAT easy :-(, though if you run into problems, at least some mitigating approaches DO exist 🙂 .
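One of those mitigating approaches (shown below purely as an illustration, with hypothetical Player/Friends names) is to keep all the objects of one kind in a single contiguous container and to link them by index rather than by pointer-to-individually-allocated-node; traversals then walk memory which is much more likely to share cache lines (and to be prefetched):

#include <cstdint>
#include <string>
#include <vector>

// Instead of Player objects scattered over the heap and linked by pointers,
// all Players live in one contiguous vector, and "friend" links are indices
// into that vector; following a link no longer means jumping to an arbitrary
// heap address.
using PlayerId = std::uint32_t;

struct Player {
    std::string name;                // (a fixed-size char buffer would improve
                                     //  locality further – see the comments below)
    std::vector<PlayerId> friends;   // indices into PlayerList::players
};

struct PlayerList {
    std::vector<Player> players;
    const Player& get(PlayerId id) const { return players[id]; }
};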


4 after (a) using an O(N^2) algo instead of an O(N) one, and probably after (b) using blocking code instead of non-blocking one
5 or did it stand only for Z80’s predecessor i8080, and was improved for Z80 a bit? Doesn’t matter much for our purposes anyway
6 Ok, if you’re Really Interested – you may want to take a look at [Wikipedia.MicroprocessorDesign] as a starting point
7 Note that here we’re speaking about “RAM access as it is seen from CPU”, not about “CAS latency”, and these two, while being of the same order of magnitude, happen to be quite different
8 the numbers below are for modern-as-of-2016 x64
9 actually, both spatial locality and temporal locality matter, but we’ll speak mostly about spatial locality here
10 at least [ptmalloc2], though make sure to check your favourite allocator – it might be doing it too
11 with a hash table and no collisions, you have at most 2 jumps over RAM, but with a 1e6-element tree you can get up to 20, and this can make LOTS of difference

 


Comments

  1. says

    > use fixed-size char s[SZ]; instead of std::string.

    Bare arrays are bug magnets, don’t use them lightly. Don’t skip the step of wrapping them in a template. Eliminating arrays from my C++ code and switching to a string class and a vector class for arrays (both of which do bounds checks in DEBUG builds) dramatically reduced the initial number of bugs in my code.

    Also, nice timing on this post – I started load testing my server last week and server performance optimization is very much on my mind. I have a couple more tricky bugs left before I’ll reach my initial goal of 50 clients connected.

    • "No Bugs" Hare says

      > Don’t skip the step of wrapping them in a template.

      Of course, I’ve added a sentence about encapsulating it, THANKS!

    • Dmitry says

      Another reason to _not_ replace std::string in most of the cases is short string optimizations in modern C++ standard library implementations. E.g. libc++ does not allocate any memory for strings with length up to 20 chars on x86_64.

      • "No Bugs" Hare says

        …and has size of 24 bytes instead of 8 for char[8] (that’s 3x overhead). I am not saying that SSO is bad (the answer is, as always, “it depends”, and for most of usage scenarios it is a good thing), but std:: is by design “one size fits all”, and as a result cannot possibly compete with solutions optimised for one single use case.

  2. luc says

    Question 1

    On page 1. Curiosity.

    “Once upon a time, I’ve seen two games.
    One has had 400K simultaneous players, and another one – around 100K, and from the player’s perspective the game
    was 99%-the-same.” (…)

    Could you share with us more information about these two games? For example: game title, tech stack, programming language(s), engines, architecture and so on.

    Maybe you cannot say much about these two games since it seems some kind of privileged information.
    I understand these details are not that important for the point of the text. Also, accurate conclusions cannot be taken since we do not know the whole scenario.
    But, you know! As developers, we love to know this kind of stuff! 🙂

    Question 2

    On page 2. Regarding Reactors (FSM).

    In the following text:

    “Otherwise, if you’re running dozens of Reactors within the same process,
    all of them using the same (usually default) allocator – then all the data
    from your different Reactors will be spread over one large default allocator,
    which will pretty much kill better-Reactor-locality.”

    What if we had one Reactor (FSM) per process. Could it solve the allocator issue? Would we add another issue?

    • "No Bugs" Hare says

      > Could you share with us more information about these two games?

      I would certainly like to brag ;-), but I am always trying to stay on a Very Safe Side from the point of view of professional ethics. In other words – if I tell names, some of potential clients can think – “hey, he might also tell about us too”, and cease to be potential clients :-(. Overall, I’m trying to treat employers and clients the same way doctors treat their patients – while telling “how this disease is usually cured” is usually ok, telling patient’s names is a Big No-No.

      > What if we had one Reactor (FSM) per process. Could it solve the allocator issue?

      Yes, it will :-).

      > Would we add another issue?

      Yes, it will :-(. (a) it limits you to single-Reactor-per-thread (and this is not always optimal, especially as having multiple thousands of threads per box is usually a Bad Idea). (b) the costs of running a process are higher compared to the costs of running a thread (there are a dozen subtle things which are better for threads, including inter-thread vs inter-process communications).

      Bottom line about Reactor-per-process: there ARE cases when doing it will help, but – it is not universal :-(.

  3. int19h says

    >> If you know that your strings are up to 8 bytes in size, there is usually VERY little sense to use std::string

    Don’t most STL implementations use small-string optimization these days (which is basically shoving as much of the data into the string object itself as they can to avoid allocation)?

    MSVC does that for sure, and avoids allocations for strings up to 15 chars (on x64), so it would seem that the proper advice here when it is the target is the reverse – if you know that your strings are up to 15 bytes in size, _do_ use std::string, since there’s no allocation/non-locality overhead, and you do get a much more convenient interface than a char[].

    • "No Bugs" Hare says

      THANKS for mentioning SSO, I’ve added it. However, for <=8-byte string there is still little sense to use std::string, as SSO-ed std::string itself is 24+ bytes (32 for MSVC, ouch) - instead of 8 for char[8] :-).

      • int19h says

        Yep, if you know that you will never ever cross that boundary, it is overkill (albeit a convenient one). On the other hand, you get the convenience of having short strings be blazing fast, while still allowing to store the occasional long string that you might need, and paying the penalty only for access to that occasional string.

        Obviously this is still jack-of-all-trades & master of none approach. I just think that it’s “good enough” (that you wouldn’t have to bother with micro-optimizations) in way more scenarios than a non-SSO optimized std::string.

        BTW, even if you do have to stick to arrays, I think there might be some advantage to using std::array instead, to get normal pass-by-value semantics without pointer decay etc. You could even overload some operators for array to make it behave like STRING*N in BASIC dialects of old.

  4. Maxim says

    Didn’t you forget about data-oriented design? And as a most notorious example something like “struct of vectors instead of vector of structs”?

    • "No Bugs" Hare says

      You mean “Data-oriented design” as it was presented by Mike Acton a few years ago, right? I have to say that I tend to agree with most of what he’s saying (except for a few things, such as his formulation of Lie #2 – IMO it is overgeneralized – and his example with 99.9% waste, which doesn’t mention that even with the waste, assuming 60 frames/second, 2 Mbytes over 10K frames takes just ~5e-7 of the total RAM bandwidth – hardly worth spending time to optimize). On the other hand – I don’t feel that “Data-oriented design” understood as a very generic “the way we should look at the code” (from code POV or from data POV) is within the scope, at least for this Chapter.

      As for the significantly narrower concept of “optimizing the data depending on how it will be processed” (and this includes ‘structs of vectors vs vector of structs’) – I’ve added a note to discuss it in 2nd beta, THANKS!

      • Maxim says

        Yes, that’s from his talk. I don’t agree with a lot of things from it (perhaps, cuz I’m a DBMS internals engineer, not a game backend 🙂 and we have a little bit different restrictions ), but rewriting data structures for better usage was a good point.

        • "No Bugs" Hare says

          Oh, the beautiful world of DBMS internals… From whatever little I know in this regard, within a DBMS you should be very much RAM-bandwidth-bound (and this is quite an unusual situation for the rest of us), am I right?
