C++ Performance: Common Wisdoms and Common “Wisdoms”

 
Author: “No Bugs” Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 
Premature Optimisation - KISS - Premature Pessimization

#DDMoG, Vol. V
[[This is Chapter 16(b) from “beta” Volume V of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing“. All the content published during Beta Testing is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

 

The opposite of a fact is falsehood,

but the opposite of one profound truth may very well be another profound truth.

Niels Bohr

There are quite a few common wisdoms when it comes to C++ and games. As is always the case when facing a bunch of common wisdoms, some of them have their merits, some are obsolete-beyond-belief, and some are just taken from a very different context and are not really applicable. Let’s take a look at the most popular ones.

“No C++, just plain C” – not for app-level programming

There is a school which teaches that C++ is no good, for performance and reliability reasons.

Many proponents of this point of view quote [Linus], who has said that “C++ is a horrible language”. However, it should be noted that while Linus MAY have a point against using C++ for operating systems,1 his rant should not be casually generalized to such a different area as games.

In short – code in OS’s is changed some orders of magnitude less frequently than code in games (or any other business app), so the problem of code maintenance is of much more importance in the game world. On the other hand, things such as “we need to handle even the most extreme corner cases, such as being unable to allocate 20 bytes of memory” are of much less importance for a usual game app (whether client or server). As a result, I am firmly against the camp which says that “C++ is no good for game development” (over(ab)using C++ features is a different story; we’ll discuss these features below on a case-by-case basis).


1 Note that this point is arguable, as quite a few OS’s currently rely heavily on C++, including such a serious thing as OS/400. Well, fortunately we don’t need to answer the question “whether we want to write an OS in C or in C++”.

 

On “C++ as C-with-Classes”

A bit less extreme point of view is that C++ is good, but only as long as we’re using it only as “C with classes”. Well, often I myself am guilty of using C++ along these lines (with a few notable exceptions, pun intended) ;-).

Interpretation of the term “C-with-classes” varies between different developers,2 so when discussing it, I strongly prefer to separate those-C++-features-going-beyond-plain-C, and discuss them one-by-one, to see whether these features are worth using in the context of game development.

In any case, there is more or less a consensus that “classes” in “C with Classes” is a Good Thing™ (in particular, as it encourages encapsulation and decoupling). Now let’s discuss more controversial features of C++.


2 I’ve seen people who were saying that anything with C-style array in it is not real C++, so it is “C-with-classes”

 

C++ Exceptions – (Almost-)Zero-Cost until Thrown

There is a “common wisdom” that C++ exceptions are Bad, because they’re damn expensive. Another school of thought says that modern table-driven C++ exceptions are zero-cost until they’re thrown. Well, while I am not 100% convinced that exception handling is indeed exactly “zero-cost” until the exception is thrown, it is certainly very, very close. The most common exception-related thing which may affect performance of your app is the additional cost of calling a non-inlined function (and in ancient times exceptions added to this cost, in the range of single-digit CPU clocks). However:

  • It seems that this per-function-call penalty is gone now (though I’m not 100% convinced whether it is 100% gone, and that it is gone on 100% of the widely-used compilers). On how this is achieved – see, for example, [TR18015].
  • Moreover, there are claims that C++ exceptions are less expensive (until an exception is thrown) than C-style error-code checking after the function returns. While I didn’t run my own tests to validate this claim, I can tell that the performance difference (if any) between error checking and C++ exceptions in the non-error case is certainly negligible3,4
  • To be honest, if you have a function which is that performance-critical, you should force-inline it anyway (more on inlines and force-inlines below), and then even for older compilers the problem will go away

On the other hand, the cost of an exception when it is thrown is pretty large; by the very definition of exception handling, it has to perform RTTI or RTTI-like operations (and usually it is direct RTTI; see, for example, [Brailovsky]). In one experiment, the cost of an exception thrown-and-caught has been measured to be on the order of 1000-2000 CPU clocks [Ongaro].

Practical conclusions with regards to exceptions:

  • DO use exceptions; error codes are usually much more error-prone in the context of games.5
  • The cost of exception handling until the exception is thrown is negligible
  • The cost of an exception when it is thrown is significant. Therefore:
    • DON’T use exceptions to return from the inner loop
    • However, using exceptions to perform a not-so-normal return from a top-level event handler is generally ok (as top-level event handlers usually take at least hundreds of thousands of clocks anyway); see the sketch below.
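
And here is the promised sketch (the event type and the validity check are hypothetical; this is merely one way to structure it): the throw propagates out of deeply-nested game logic and is caught once per event, so its ~1000-2000-clock cost is paid only on the rare abnormal path.

#include <cstdio>
#include <stdexcept>

struct GameEvent { bool valid; };//hypothetical

void processEvent(const GameEvent& ev) {//deeply-nested game logic
  //any of the nested calls may throw on abnormal input
  if( !ev.valid )
    throw std::runtime_error("malformed event");
  //...normal processing...
}

void onEvent(const GameEvent& ev) {//top-level event handler
  try {
    processEvent(ev);
  } catch( const std::exception& e ) {
    //rare abnormal path: ~1000-2000 clocks for the throw is negligible
    //  compared to the hundreds of thousands of clocks of the handler itself
    std::fprintf(stderr, "event dropped: %s\n", e.what());
  }
}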

3 except, maybe, a very few extreme corner cases
4 and this “negligible difference” is so small that I have no idea whether it is positive or negative
5 for nuclear reactors and operating systems the balance of pros and cons can be different, but for games and business apps the choice in favor of exceptions is very clear

 

Smart Pointers and STL: Allocations are Still Evil

Smart pointers are often considered a Big No-No for games. However, I contend that it is not smart pointers per se, but rather the related allocations which will hurt your performance (see more on allocations and the related lack of spatial locality in the “Allocations are Evil. Even More So Because of Locality” section above). In other words, if you replace an std::unique_ptr<> with a simple manually-handled new (or, Stroustrup forbid, with (YourClass*)malloc(sizeof(YourClass))), you will get nothing except an additional headache. Note though that std::shared_ptr<> is different performance-wise (and usually you CAN beat it as long as you’re sharing only within the same thread).
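
To illustrate (YourClass being, of course, a placeholder), a minimal sketch: both versions below pay for exactly the same allocation and deallocation, so the smart-pointer one is no slower – it merely removes the leak-on-early-return hazard.

#include <memory>

struct YourClass { /* ... */ };

void manual() {
  YourClass* p = new YourClass();
  //...use *p; every early return or exception is now a potential leak...
  delete p;
}

void smart() {
  auto p = std::make_unique<YourClass>();//same single allocation as above
  //...use *p; destroyed automatically on every exit path...
}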

When it comes to using STL in the context of games, I should admit that it is quite controversial. Overall, I love STL, but it does have its shortcomings when it comes to high-performance stuff, games included. My main concern about STL is that it uses way too many allocations for my taste (and, more importantly, for the poor CPU cache’s taste too), and as was discussed in the “Allocations are Evil. Even More So Because of Locality” section above, allocations are one of the nastiest things performance-wise out there 🙁 .

As a result, I tend to advise along the following lines (YMMV, batteries not included):

  • DO use STL containers sparingly.
  • DON’T use STL containers just for the sake of using them.
    • DO use arrays (including std::array<>) when size is known or very limited (that is, properly wrapped with bound checks) where applicable. Even better, DO use eastl::fixed_*<> containers (see more on EASTL below).
    • The same applies to char arrays in lieu of std::string (though for std::string it is usually mitigated these days by small string optimisation (SSO), see [[TODO]] section above). Once again, proper wrapping (such as in eastl::fixed_string) is necessary.
  • As a rule of thumb, DO favour std::vector<> over std::list<> (as it has MUCH better locality). Use cases for std::list<> do exist, but they’re rather few and far between.
    • BTW, one thing which makes our life better under C++11 is that std::list<>::size() is now required to be constant-time a.k.a. O(1) (pre-C++11 it was allowed to be – and often was – O(N), which was counterintuitive and caused significant problems 🙁 ). If this is critical for you, I’d still suggest double-checking that it holds for your compiler/library.
  • As a rule of thumb (and unless you need ordering), DO favour std::unordered_map<> over std::map<> (the same applies to all tree-based containers, namely multimap<>, set<>, and multiset<>). std::unordered_*<> containers tend to cause MUCH fewer cache misses when performing searches, and as a result tend to be MUCH more efficient. Another way of reaching the same conclusion is to say that unless you need ordering – you should tell the world about it by using std::unordered_*<>, making the code more readable.
    • Caveat: hash table collisions can be Quite a Bad Thing for std::unordered_*<> containers, including potentially opening a door for DoS attacks (if your keys come from Clients) 🙁 . Be careful: make sure your hash function is good for your keys, have run-time reporting on the number of collisions for critical collections, and see Vol.3 for a discussion of DoS.
    • If you do need ordering, DO consider sorted arrays/vectors instead of std::map<> (or even better, eastl::vector_map<> etc.; more on EASTL below) – see the sketch after this list. While not always possible (due to MUCH higher addition/deletion costs), quite often it can work wonders.
  • DO use optimized STL versions such as [EASTL]
    • DO use eastl::fixed_*<> containers (they have “overflow feature” too).
    • DO use eastl::intrusive_*<> containers where applicable
    • DO use eastl::vector_map<> etc.
  • Most importantly, DO understand the underlying implementation of STL containers. In particular:
    • std::vector<> is a single allocated block with some reserve (reallocated when you go over reserved portion). Which can be translated to “very good data locality” 🙂 🙂
    • std::list<> is a double-linked list. Which means “locality is poor” 🙁
    • std::map<>, std::multimap<>, std::set<>, and std::multiset<> are all binary trees. Means “lots of allocations, very poor data locality” 🙁 🙁
    • std::unordered_*<> containers are hash maps. Can be translated into “risk of collisions, and quite a few allocations but pretty good data locality for your usual find()” 🙂
  • DO consider [short_alloc] where applicable – that is, after you understand its implications (it is an on-stack allocator). Whenever it works (and this is usually limited to temporaries only) – it saves quite a bit of CPU clocks while allowing you to use your usual std:: containers.
  • In addition to STL containers, DO use STL algorithms too.
    • Most common candidates are those things which are not that trivial to write yourself, including binary search stuff (binary_search()/lower_bound()/upper_bound()), random_shuffle() (yes, it is very easy to make a mistake in shuffle), all kinds of sort, and all kinds of operations on sorted ranges.
      • However, DON’T think that STL will do everything for you (thinking so will likely cause you to believe that everything out there is a nail). In some cases, you DO need to write your own algos; it is just that whenever the-algo-you-need is already written in STL – well, it is usually better to use it :-).
    • Whether to use trivial algos such as find_if() etc. – is up to you.
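
And here is the sketch promised above: a minimal illustration of the sorted-vector-instead-of-std::map<> idea (eastl::vector_map<> packages essentially the same thing as a ready-made container; the function name is, of course, just an example):

#include <algorithm>
#include <utility>
#include <vector>

using KV = std::pair<int,int>;//(key,value); the vector is kept sorted by key

const int* findValue(const std::vector<KV>& v, int key) {
  //binary search over one contiguous block: O(log N), good locality
  auto it = std::lower_bound(v.begin(), v.end(), key,
      [](const KV& kv, int k) { return kv.first < k; });
  return (it != v.end() && it->first == key) ? &it->second : nullptr;
}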

Using boost:: – Case by Case and Only If You Do Need It

You may ask: “ok, you’ve said about STL, but what about boost::?” My position on boost:: usually goes as follows:

  • almost all the MUST-HAVE stuff from boost:: is already in std::, phew (and THANKS A LOT to the boost:: guys for making it happen)
    • one thing which you MAY still want is boost::Container; note though that it has significant overlaps with EASTL (for example, boost::container::flat_*<> containers are very similar to eastl::vector_{set|map}<>, and boost::container::static_vector<> and boost::container::small_vector<> are very similar to eastl::fixed_vector<>). Overall, from the point of view of containers, as of the beginning of 2016 I still tend to prefer EASTL.
  • boost:: as a whole is a Damn Large set of libraries
  • as a result, while there CAN be things in boost:: which you MIGHT need, I do NOT recommend saying “we’re using all the boost::”. Instead, I suggest an approach of:
    • waiting until a need for a specific boost:: library arises
    • discussing it (and implications, including performance ones) by the team
    • and deciding that “yes, this specific boost:: library is a good fit for our purposes”

Polymorphism – Two-Fold Impact

Yet another performance-related “common wisdom” says that polymorphism is a Bad Thing performance-wise and should be avoided. Well, I need to say that in practice, there are usually two different costs associated with polymorphism. The first (explicit) cost is the cost of taking a VTABLE pointer from the body of your polymorphic class and going into the VTABLE (which MAY cause a cache miss, though it is unlikely to cause any real issues, as the VTABLE is going to be cached if it is accessed anywhere close to often). In addition, there is usually an additional cost similar in nature to that of a branch misprediction (see below).

However, there is a second (implicit) cost of polymorphism as-it-is-traditionally-used. Usually, to exploit polymorphism, we’re creating our polymorphed objects via new. And, as I noted above quite a few times, new is a Bad Thing performance-wise (in particular, because in practice it causes that dreaded cache miss much more frequently than VTABLE access does). On the positive side, this second part of polymorphism-related costs CAN be avoided (see the discussion in the “Mitigation: Reducing number of Allocations” section above).

Overall, unless you’re using polymorphism in some Really Crazy manner, I haven’t seen it cause any significant performance degradation (that is, beyond the costs of allocation + poor locality).
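
To illustrate that the two costs are separable, a minimal sketch (the class names are hypothetical): the virtual call works just as well on an object which was never allocated via new.

struct Weapon {
  virtual int damage() const { return 1; }
  virtual ~Weapon() {}
};
struct Fireball : Weapon {
  int damage() const override { return 42; }
};

int apply(const Weapon& w) {
  return w.damage();//VTABLE-based call: the first (cheap) cost only
}

int example() {
  Fireball f;//automatic storage: no new, no extra cache-missing heap block
  return apply(f);//still fully polymorphic
}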

RTTI – Can be Easily Avoided Most of the Time

RTTI (a.k.a. Run-Time Type Information) is quite expensive when it’s used. Honestly, I haven’t seen any real-world uses for RTTI beyond dynamic_cast<> (while such uses MIGHT exist, I certainly missed out on them).

And when it comes to dynamic_cast<>, my position goes as follows:

  • Most of the time, it is trivial to replace dynamic_cast<> (together with related code) with your own virtual function – see the sketch below. If your dynamic_cast<> happens within performance-critical code – just DO it.
  • If you CANNOT replace dynamic_cast<> with your own virtual function (and the only case I know for it, is when you’re not allowed to modify base class) – then it MAY be a legitimate use for dynamic_cast<>.
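
Here is the sketch promised above (all names are hypothetical):

//before: type check at the call site, RTTI involved:
//  if( Unit* u = dynamic_cast<Unit*>(&obj) )
//    u->takeDamage(dmg);
//after: the behaviour lives in the hierarchy, no RTTI involved:
struct GameObject {
  virtual void takeDamage(int /*dmg*/) {}//default: not damageable
  virtual ~GameObject() {}
};
struct Unit : GameObject {
  int hp = 100;
  void takeDamage(int dmg) override { hp -= dmg; }
};

void hit(GameObject& obj, int dmg) {
  obj.takeDamage(dmg);//plain virtual call
}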

And that’s about it on RTTI.

On Branch (mis)-Predictions and Pipeline Stalls

One further topic which is routinely raised in the context of performance is branch prediction and pipeline stalls. I won’t go into a lengthy discussion about it (see, for example, [Wikipedia.BranchPredictor] for Branch Prediction 101), but what really matters from our perspective is the observation that:

If there is an “if” statement in the program,6 and execution goes along the “wrong” path7 – it is going to cost us

While the cost of misprediction is not that high (10-20 CPU clocks), in some of the Very Inner Loops it might start playing its role.

Dealing with mispredictions can be done in one of the two following ways.

The first way is to remove conditional processing completely (NB: do NOT do this until you are sure that your bottleneck is within this piece of code). For example,

a += (b ? 0 : c);//all operands including b are uint32_t
                 //  b is guaranteed to be either 0 or 1

can be rewritten into:

a += (b-1)&c;//no conditional processing here
assert( ((b-1)&c) == (b ? 0 : c) );
//NB: if your b is not guaranteed to be 0 or 1, 
//    you can just use (!!b-1)&c - most of the time it will compile 
//        without any branching too

Here (b-1) can take values of 0 or (uint32_t)(-1) (the latter equal to 0xFFFFFFFF). Then, bitwise-anding it with c, we’ll get either 0 or c, which is exactly what we need. Note that operand types are Really Important here (and DON’T try to use such tricks until (a) you’re sure that your bottleneck is here, (b) you know EXACTLY what you’re doing).

In theory, the compiler should be able to optimize these things for us. In practice, however, more often than not it doesn’t 🙁 (things MAY improve in the future, though). So, for Really Performance-Critical pieces of code we MAY try doing it ourselves (NB: after making such a change, we need to measure performance, and if the trick didn’t help – roll back the change to keep the code more readable).

[[TODO: add ref to “Bit Hacks for Games” from “Game Engine Gems 2”]]

As with pretty much any other optimisation trick – make sure NOT to use such things unless you have already identified your bottleneck; otherwise you risk ending up with lots of convoluted and barely-readable code, and from my experience, convoluted unreadable code tends to inhibit further (often MUCH more significant) optimizations.

The second way of dealing with branch misprediction is to tell the compiler which of the branches is the “right” (more likely) way for the code execution to go. If your code gets into each branch with 50% probability, you can’t really define “right” and “wrong” (and you cannot do much about branch misprediction besides replacing it with arithmetic), but if the probability is not that even, you DO have a chance to deal with it.

To tell GCC/Clang which branch is “right”, you can use the __builtin_expect() intrinsic pseudo-function (better still, use the likely()/unlikely() macros as described in [Kerrisk]; also, if you’re about to use these things, make sure to read the whole post of his, there is a lot of useful stuff there). For MSVC, there are no such intrinsics, but there is an observation (which stands in most, but not all, cases [TODO]) that the “if” branch of the “if” statement is usually treated by the MSVC compiler as “likely” (and the “else” branch as “unlikely”).
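
For reference, the likely()/unlikely() macros from [Kerrisk] boil down to something along these lines (the no-op fallback for non-GCC/Clang compilers is a common convention):

#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x),1)
#define unlikely(x) __builtin_expect(!!(x),0)
#else
#define likely(x)   (x)//no-op fallback (e.g., for MSVC)
#define unlikely(x) (x)
#endif

//usage: declaring the error path “unlikely” keeps the hot path straight:
//  if( unlikely(packet == nullptr) ) { /*rare error handling*/ }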

However, the most important thing about using likely() is the following:

DON’T use likely() etc. until you’re sure that It Is Your Bottleneck! And keep in mind that making a wrong guess can easily kill performance 🙁 .

6 or equivalent; in particular, ?: operator is often guilty of being compiled into equivalent of if
7 we don’t define what is a “right” path and what’s a “wrong” path – yet.

 

On inline asm and SSE

There are cases where you may want to write inline asm. However, they should be REALLY few and far between. Contrary to one popular belief, inline asm still CAN improve your program; and contrary to another popular belief, it is Very Rarely worth the trouble. If you decide to go against my advice and use inline asm for app-level code, keep in mind that:

  • Correct operation of inline asm MAY easily depend on the context where you use it.
    • In particular, if you have your inline asm as a part of inline function, and make a mistake with a clobbered register list for GCC – all kinds of things can happen (the code may work for one particular invocation of this inline, and when you call it from a different place – a dragon can eat your code, or even worse – an “undefined behavior” can happen).
  • Inline asm is Very Different for GCC and MSVC (and on MSVC it doesn’t even exist for x64 and ARM)

At some point in time, x86 SSE was an excuse to use inline asm; these days, fortunately enough, SSE intrinsics have been added, so you can program SSE code without the hassle of inline asm (and in a mostly compatible-between-GCC-and-MSVC manner, though obviously still restricted to x86/x64). In short – for SSE, DO use intrinsics rather than inline asm (well, as long as it is possible).
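
Just to illustrate what such intrinsics look like, a minimal sketch (this particular add-four-floats function is merely an example):

#include <xmmintrin.h>//SSE intrinsics; same header for GCC, Clang, and MSVC

//adds two arrays of 4 floats each with a single SSE instruction
void add4(const float* a, const float* b, float* out) {
  __m128 va = _mm_loadu_ps(a);//unaligned load of 4 floats
  __m128 vb = _mm_loadu_ps(b);
  _mm_storeu_ps(out, _mm_add_ps(va, vb));
}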

On inlines and force-inlines

One thing which is often underestimated in the performance department is the cost of function calls. [Efficient C++] gives the cost of a function call at 25-250 clocks (depending on the number of arguments), and my personal experience for your-usual-function-with-one-to-three-arguments is about 20-30 clocks, so we’re pretty much on the same page here.

If it happens in the innermost loop of your algorithm, such a cost can be very considerable. And worse than that,

From my experience, as of 2016 compilers are still notoriously bad at deciding whether to ignore your inline specification or not 🙁 .8

Yes, you read it correctly: if you designated your function as inline, it doesn’t mean that it will be inlined. The compiler will use all its Next-to-Divine Wisdom to show you that it is smarter than you are, and will ignore most of those inline specifications you carefully wrote. In such cases, forcing the inline usually helps; it can be done with __forceinline in MSVC, and __attribute__((always_inline)) in GCC (and BTW, make sure to make a cross-platform #define FORCEINLINE that works everywhere; see the sketch below).
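
One possible shape of such a cross-platform define is sketched below (the last branch, degrading to a plain inline, is my assumption for other compilers):

#if defined(_MSC_VER)
#define FORCEINLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define FORCEINLINE inline __attribute__((always_inline))
#else
#define FORCEINLINE inline//fall back to a plain hint
#endif

//usage, for an already-identified bottleneck function:
FORCEINLINE unsigned square(unsigned x) { return x * x; }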

However, as with other performance tricks, I need to re-emphasize:

DON’T overdo it. Use force-inline ONLY after you have identified the bottleneck. Indiscriminate use of force-inline can easily hurt performance.

One further inline-related trick applies if you’re too lazy to put all your inlinable functions into headers (which you normally SHOULD); in this case, the -flto option of GCC/Clang has been observed to allow usual (non-forced) inlining across your .obj files (it also allows a whole bunch of other optimisations, so it is worth trying anyway).


8 I’ve seen code whose performance improved over 1.5x after placing just two or three force-inlines – BOTH on MSVC and GCC 🙁

 

On templates

One thing which was a long-running scare of early-days C++ is “template bloat”. It refers to the compiler creating numerous instances of pretty-much-the-same code when it instantiates templates. In practice, I haven’t seen it be a real problem (that is, unless you’re doing Really Crazy Things), and these days C++ templates are even used on MCUs, which have a very limited amount of program memory. As we’re not on MCUs (phew), we should generally be fine with templates (that is, unless we’re into deeply recursive templates – these DO have the potential to get ugly 🙁 ).

If code bloat is still a concern, a technique similar to the one in [Druy] can be used, though I’ve never seen it Desperately Needed in practice.

On compiler flags

One of the first things novice developers try when optimizing for performance is playing with compiler flags. Well, usually it helps; OTOH, it helps only a little bit compared to what reasonable code optimization can achieve (and even less compared to what removing outright-crazy-stuff can achieve). The main problem with compiler flags is that usually they’re applied indiscriminately to the whole project, and most of the optimizations are not as universally useful as we’d like. Better results can often be achieved by applying flags on a per-subproject (or even per-file) basis.

For GCC, the most popular performance-affecting options are:

  • -O2 or -O3 (note that -O3 often produces worse-performing code than -O2, so you need to try both(!))
  • -march (well, at least for the Server-Side you know your architecture exactly, and for the Client-Side, finding anything less capable than -march=corei7 will be quite difficult these days). Also, for the Server-Side you MAY want to consider compiling on exactly the target server with -march=native.
  • -fomit-frame-pointer (not much gain, but why not?)
  • -funroll-loops (results vary, but certainly worth trying for critical code)
  • -flto (make sure to read GCC/Clang recommendations on using it)
  • -fno-math-errno
  • -ffast-math (includes -fno-math-errno and a whole bunch of other stuff) and -ftree-vectorize. NB 1: these options MAY cause not-exactly-IEEE-compliant behaviour for floating-point math. NB 2: I’ve heard about a significant synergy between these two options, so you SHOULD try them together (in addition to trying them one by one).
  • and don’t forget about -fprofile-generate; it is going to take (quite) a bit of effort, but is often worth it (that is, if you’re concerned about performance)

Note that I’m not saying that these options will necessarily help; what I’m saying is that you MAY want to try them, run your compiled app, and see whether there is any difference.

On Manual Loop Unrolling

Using a compiler option such as -funroll-loops can make your app run faster – or slower, if you’re not that lucky. However, this doesn’t mean that loop unrolling is useless – it just means that indiscriminate loop unrolling can have its problems.

As a result, I’ve seen manual unrolling of the innermost most-critical loop help quite a bit (like “up to 2x improvement”). It is a Big Headache though, so once again –

NEVER EVER DO IT until you’re sure that it is your bottleneck.
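
For illustration only, a minimal sketch of manual 4x unrolling (assuming, to keep the sketch short, that n is a multiple of 4; a real version would handle the remainder separately):

#include <cstddef>
#include <cstdint>

uint32_t sum(const uint32_t* p, size_t n) {//n assumed to be a multiple of 4
  //4 independent accumulators also help the CPU pipeline
  uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for( size_t i = 0; i < n; i += 4 ) {
    s0 += p[i]; s1 += p[i+1]; s2 += p[i+2]; s3 += p[i+3];
  }
  return s0 + s1 + s2 + s3;
}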

[[TODO: x64-specific: on float-to-int switching (also on SSE)]]

Putting Things into Perspective: Rough Costs in Clocks

We should forget about small efficiencies, say about 97% of the time:

premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%

Donald Knuth

The following table tries to summarize some of the CPU clock costs which typically occur in the program:9

Operation | Cost in CPU clocks
“Simple” register-register op (ADD, OR, etc.) | <1
Memory write | ~1
“Right” branch of “if” | ~1
Integer multiplication | 3-6
L1 read | 4
L2 read | 10-12
“Wrong” branch of “if” (branch misprediction) | 10-20
L3 read | 30-50
Function call | 25-250
Additional cost of the function call being polymorphic (NOT accounting for cache misses due to poor locality if allocation is used) | 20-30¹⁰
Main RAM read | 100-150
NUMA: different-socket L3 read | 100-200
NUMA: different-socket main RAM read | 200-300
Allocation+deallocation pair | 200+¹¹
User-to-kernel switch and back (i.e. minimal cost of a kernel function) | 300-500
Exception thrown+caught | 1000-2000
Context switch (direct costs) | 2000
Context switch (total costs, including cache invalidation) | 10K-1M

And a practical suggestion based on the numbers above:

Three things Really Matter for performance: the first one is the Algorithm, the second one is your code being Non-Blocking, and the third one is Data Locality. The rest of the tricks, while potentially useful, should be used only on a case-by-case basis, when a bottleneck is identified. Otherwise it will be Premature Optimisation.

9 for modern x64 CPU, YMMV, batteries not included
10 amortized
11 less in certain special cases

 

“For-free” Optimizations a.k.a. Avoiding Premature Pessimization

One exception to the “DON’T optimize too early” rule above is those “optimizations” which do NOT cause any trouble while you’re developing (in particular, which do NOT reduce readability or compatibility). In some sources (see, for example, [Sutter]), not doing such “for-free” optimizations is even referred to as “premature pessimization”.

One example of such a “for-free” optimization is making sure that all your loops have standalone increments written as “++it” instead of “it++”. The point here is that any kind of postfix operation on a non-integer type (such as an iterator), strictly speaking, requires a temporary copy. While the compiler MAY be able to eliminate this temporary as a part of optimization, well – why not save it the trouble (and avoid any chance that it will be suboptimal on some-other-compiler) –

Especially as it costs you pretty much nothing?

Other tricks of a similar “cost-nothing” nature include:

  • passing structs and classes by const reference (rather than by value). If you are not doing it yet – chances are that you’re doing something wrong.
    • Note that since C++11, due to copy elision (which is not mandated, but is de-facto standard among modern compilers), it is usually better performance-wise to pass classes/structs/collections by value – but ONLY if you’re going to create a copy of the relevant parameter within your function anyway (see the discussion in [StackOverflow.PassByValue]).
    • Note that returning values via non-const reference parameters (essentially output parameters), while still popular, is falling out of favour, and with move constructors it is rarely really necessary
  • using initialization over assignment
    • a special case: using “: str(s)” instead of “str=s;” in constructors to initialize string and other allocated data members
  • specifying “move constructors” and “move assignment operators” where applicable
    • specifying them as noexcept wherever applicable (both illustrated in the sketch after this list)
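
And here is the sketch promised above, combining initialization-over-assignment, pass-by-value-and-move, and noexcept move operations (the Player class is hypothetical):

#include <string>
#include <utility>

class Player {
  std::string name_;
public:
  //pass-by-value-and-move plus initialization instead of assignment:
  //  at most one copy of the string, and name_ is never
  //  default-constructed-then-assigned
  explicit Player(std::string s) : name_(std::move(s)) {}

  //noexcept moves, so std::vector<Player> will move (not copy)
  //  elements on reallocation
  Player(Player&&) noexcept = default;
  Player& operator=(Player&&) noexcept = default;
  Player(const Player&) = default;
  Player& operator=(const Player&) = default;
};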

 

[[To Be Continued…

This concludes beta Chapter 16(b) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter 16(c), dedicated to some of C++ “best practices” (well, at least as I understand them ;-)).]]



Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

 


Comments

    • Paavo says

      And requiring O(1) std::list::size has seriously hurt one of the very few usage cases of std::list – splicing.

      In my experience I have not yet encountered a use case for std::list. Either std::vector or std::deque has always been a better match for the task.

      • "No Bugs" Hare says

        > And requiring O(1) std::list::size has seriously hurt one of the very few usage cases of std::list – splicing.

        Just curious – what is the mechanism of splicing being seriously hurt? Intuitively, I’d expect that updating count member within list header as a result of splicing should be an O(1) (and a Very Small O(1) too as all the headers are already cached).

        > In my experience I have not yet encountered a use case for std::list.

        One example: a prohibition on moving elements (because a ptr/iterator is stored externally) will lead either to list<>, or to vector<> of pointers, and under these circumstances, list<> MAY become a winner, having fewer allocations (and often a specialised allocator too).

        • Malcolm Parsons says

          > what is the mechanism of splicing being seriously hurt?

          Splicing a range from one list to another using
          void splice( const_iterator pos, list& other,
          const_iterator first, const_iterator last);
          used to be O(1), but it now has to calculate std::distance(first, last) to update the size of both lists.

          • "No Bugs" Hare says

            Ah, I see – it is not splicing as such which is hurt, but the performance of post-splicing (but still-within-splice()-function) processing. Well, not that I really care about splicing anyway… ;-). OTOH, making list::size() have O(1) complexity has opened quite a few other use cases (an O(N) size() is an abomination which prevents unwrapped use in pretty much any case except for splicing).

  1. Jean-Michaël Celerier says

    > -O2 or -O3 (note that O3 often produces worse performing code than O2(!))

    Most recent benchmarks (c.f. phoronix) tend to show that -Ofast > -O3 > -O2 > -O1 > -O0;

    > passing structs and classes by const reference (rather than by value).

    As per https://web.archive.org/web/20140205194657/http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/
    if you have a function that will copy its arguments at some point, it will be better to pass them by value and move the value.

    e.g.

    struct Bar {
        // Yup
        void set(std::vector<int> vec) {
            m_vec = std::move(vec);
            std::sort(m_vec.begin(), m_vec.end());
        }

        // Nope
        void set(const std::vector<int>& vec) {
            m_vec = vec;
            std::sort(m_vec.begin(), m_vec.end());
        }

        std::vector<int> m_vec;
    };

    Of course, if the argument is just read and not copied nothing changes and it should always be passed by const reference.

  2. "No Bugs" Hare says

    > Most recent benchmarks (c.f. phoronix) tend to show that -Ofast > -O3 > -O2 > -O1 > -O0;

    Well, it is not possible to run benchmarks on ALL the code, and the last code I was profiling myself was worse under -O3 than under -O2 (and I can assure you it is not the only such case). Therefore, I stand by my point that it is still necessary to try both.

    > As per https://web.archive.org/web/20140205194657/http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/

    You’re right, I’ve added this thing, THANKS!

    • David says

      Speaking of optimization flags, another one worth considering is -Os (at least for gcc and clang) which is equivalent to -O2 with a few optimizations disabled to try to make code smaller. I’ve had a case where this caused something to fit better into cache and run a good bit better.

      • "No Bugs" Hare says

        Yes, I’ve seen it too, but from my experience it is very rare. Pretty much any optimisation flag can cause some improvement in some case (and I don’t want to list ALL the flags, but only the-most-likely-to-help-ones).

  3. David says

    Would it be worth mentioning the potential benefit of link-time optimization? I’ve worked on projects where this accomplished a lot of cross-module inlining, giving me a nearly 4x speedup (I had a *really* hot function that needed to be inlined).

    Additionally, http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html (from 2012, to be fair) has some more advice on when to use compiler branch hints, and it supports my experience that a wrong branch prediction hint can absolutely murder performance. On the flip side, getting a hint right can make things scream (for a ~2 element LRU cache, the vast majority of the time had the 0th element matching in my case)

  4. "No Bugs" Hare says

    > Additionally, http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html (from 2012, to be fair) has some more advice on when to use compiler branch hints, and it supports my experience that a wrong branch prediction hint can absolutely murder performance.

    I’ve already referred to this very post :-), but yes – I will add a note that wrongly placed likely() can kill performance (it is my experience too, and developers are notoriously bad with placing likely()). THANKS!

    > Would it be worth mentioning the potential benefit of link-time optimization? I’ve worked on projects where this accomplished a lot of cross-module inlining, giving me a nearly 4x speedup (I had a *really* hot function that needed to be inlined).

    Do you mean -flto or something different? (I myself am an old-style guy, and prefer to keep my inlinable functions in headers ;-)).

    • David says

      Yes, I’m referring to -flto on clang (I don’t know flags on other compilers). It’s a similar idea to doing unity builds where every function and variable is available to the optimizer so that cross-file inlining and other nice things can happen. A lot of the benefit can be attained by putting the appropriate functions in headers, but I’m willing to trade compile time for not needing to worry about that.

      • "No Bugs" Hare says

        Yep, -flto is one interesting option (and not only for inlining, but just overall). Added it, THANKS!

  5. says

    Please have a look at the Boost naming conventions: http://www.boost.org/development/requirements.html#Naming_consistency

    When you refer to the Boost organization, you should write “Boost”, not “boost::”. When you refer to a single Boost library, you should write “Boost.Container”, not “boost::Container”. When you refer to contents of the “boost::” namespace, “boost::” is appropriate (You already follow the same convention with regards to the STL: When you refer to the library, you use STL, when you refer to the namespace, you use “std::”).

    • "No Bugs" Hare says

      From the same document: “This page describes requirements and guidelines for the content of a library submitted to Boost.” As this is certainly NOT “a library submitted to Boost” (neither it is documentation for such a library), I fail to see how it can possibly apply to whatever I am writing.

  6. says

    In the section “On Templates” I expected to find a discussion on the “template hoisting” technique presented in “Designing and Coding Reusable C++”, by Carroll and Ellis, Addison-Wesley, 1995, p. 80. See also here: http://cpptips.com/hoisting

    Given the age of the book and the fact that it says this technique could become obsolete with a smart compiler, it would be interesting whether it is still relevant or not.

    • "No Bugs" Hare says

      As I understand, it is essentially the same as the one in referenced [Druy], and I prefer Druy’s explanation as (a) readily available online and (b) more clear than the one on cpptips.

  7. Matthew Fioravante says

    For optimization flags, I would also mention -fno-math-errno. Some standard C math.h functions check for errors and set the global errno variable, the most common one being sqrt(x), which checks if x < 0 and sets errno if it is. In all of my career so far I've never yet had a use case where I needed to check errno after calling sqrt().

    You can easily see the differences from using this flag using an online compiler like gcc.godbolt.org. Without -fno-math-errno sqrt() will either be an opaque function call or you'll see the assembly for checking the input and setting errno. With -fno-math-errno, sqrt() gets inlined and optimized to a single sqrt hardware instruction on x86.

    • "No Bugs" Hare says

      It is ok in some cases, but it is allocator-based too, which means that it needs to be used with care 🙁 on case by case basis. Do you mean any specific scenario for using it?

  8. Malcolm Parsons says

    “std::unordered_* containers are hash maps. Can be translated into “risk of collisions, but few allocations and pretty good data locality” ”

    The std::unordered_* containers use chaining, so they allocate on every insertion.

    • "No Bugs" Hare says

      > The std::unordered_* containers use chaining, so they allocate on every insertion.

      Yes, what I’ve meant (indeed rather clumsily) is that there are few indirections involved when performing usual searches. I’ve changed the wording into “Can be translated into “risk of collisions, and quite a few allocations but pretty good data locality for your usual find()””, THANKS!

  9. Alex Guteniev says

    On it++ vs ++it.

    Another similar piece of advice to avoid premature pessimization in loops.
    Consider the following:

    for (std::vector<T>::size_type i = 0 ; i != MyVector.size() ; ++i ) // premature pessimization

    for (std::vector<T>::size_type i = 0, m = MyVector.size() ; i != m ; ++i ) // fixed

    Why MyVector.size() may be expensive:
    1. A vector range may be implemented as a pointer to begin and a pointer to end, with no size stored. At least in MSVC it is.
    Pointer subtraction is pointer value subtraction plus division of the result by sizeof(MyType). If sizeof(MyType) is a power of two, you’re lucky. If not, the division is likely to be implemented with an imul instruction.
    2. For the first example, the compiler is likely to access memory on each iteration (as if the size could be changed by another thread).
    Applies not only to size() in a ranged loop, but to the end() iterator too.

    If you don’t need an index or iterator, it is safer to do the loop with a range-based for.

    • "No Bugs" Hare says

      I see your point – and indeed it _might_ be a good thing to do, but I’m not 100% sure that 2nd version is _always_ beneficial. Losses due to compiler having to reserve one extra register for ‘m’, can easily outweigh benefits (especially if sizeof(MyType) is indeed 2^N); this is especially true for older x86 with its very limited number of registers, but – for certain programs _might_ apply to x64 too.

      > If don’t need index or iterator, it is safer to do loop with range-based for.

      This! :-).

  10. dédé says

    I think that the noinline attribute is better (more useful) than force_inline, because it lets you tell the compiler when a function is likely to be not-often-called (like the grow() function of a vector implementation). Then the compiler can know where to break the inline greed. This information is more useful to the compiler than the fact that you *really want* this function to be inlined. The compiler is already trying to INLINE ALL THE THINGS!!!

    • "No Bugs" Hare says

      > The compiler is already trying to INLINE ALL THE THINGS!!!

      That’s what they’re saying, but they.do.not.do.it (at least it stands for GCC and MSVC, and _probably_ also for CLang, but I didn’t check Clang asm for a while, so things might have changed). Just one example: with intensive use of force-inline, I got monolithic functions as large as 60K in size, which never happens without force-inline.

      Actually, from what I’m still seeing, it is even worse than that: even in 2018, it is perfectly possible to get serious performance degradation by adding another layer of doing-nothing wrapper function (_at_least_ it was observed with very-recent GCC and MSVC, though Clang wasn’t caught red-handed yet); and FWIW, this is not even related to force-inlines, but merely to some limitations when they’re trying to optimize things across different functions :-(. Overall, pretty much everything compiler writers want us to believe performance-wise is arguable at best, and misleading at worst :-(( (hey, they told us that inline functions are zero-cost abstractions, IIRC about 20 years ago – and it is STILL not the case).

      That being said, there ARE indeed cases for noinline attribute, but I wouldn’t say it is because the compiler is “greedy” with inlines; rather, it is because the compiler doesn’t have enough information at compile-time 🙁 .

  11. says

    About avoiding branches: I measured branch avoidance, and the bottom line is: first improve the cache hit rate; only when you have done this will you see improvement from branch avoidance.
