#CPPCON2017. Day 2. Why Local Allocators are a Good Thing(tm) Performance-Wise, and Why I am Very Cautious about C++17 STL parallelized algos

 
 
 

At CPPCON2017 Day 2, two talks were of special interest to me. One was a 2-hour talk about Local Allocators – and another about C++17 STL parallelised algorithms.

Local Allocators

The talk on Local Allocators by John Lakos was brilliant both in terms of content and in terms of presentation (I wish I were able to speak as convincingly as he does some day <sigh />). I highly recommend that all performance-conscious C++ developers watch his talk as soon as it becomes available on YouTube.

Very briefly:

  • using local allocators can easily improve performance by a factor of 10
    • It doesn’t matter much which of the global allocators you’re using (the default one, tcmalloc, whatever else) – compared to local allocators, all of them lose hopelessly.
    • It is NOT something new – however, it MIGHT easily come as a revelation to LOTS of developers out there
  • pretty often (IMO – most of the time), the most important thing is not the time spent in allocation/deallocation – but rather the spatial locality of the allocated data. It may come as a revelation to (and can cause friction with) LOTS of developers trying to optimize their allocators for pure alloc/dealloc times – but here I am 100% on John’s side.
    • diffusion (the randomisation of memory-access patterns which naturally occurs over a program’s lifetime) IS a performance killer. Local allocators can help against diffusion too – but keep in mind it is all about your typical access patterns, so your allocators should be optimized for the access patterns you’re mostly facing.
  • the penalty of C++17-style virtualized allocators is low (and often, though not always, non-existent); the reason is twofold (for what such a local allocator looks like in code, see the sketch right after this list):
    • for C++17-style virtualized allocators, most of the time, modern compilers can inline the virtualized calls (while inlining of virtualized calls has been known since time immemorial, the fact that it can eliminate most of the virtualized calls in such usage scenarios is new to me; still, I am ready to trust John’s experience and experiments in this regard, though IMO there will be restrictions on the programming patterns which allow this, so in most-performance-critical scenarios we should be aware of them).
    • [not really articulated by John as a reason, but IMNSHO still standing] – most of the time, direct alloc/dealloc costs are minor compared to the benefits of improved locality.
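To make this a bit more tangible, here is a minimal sketch (my own illustration, not code from John’s talk) of a C++17-style local allocator in action, using std::pmr::monotonic_buffer_resource as a per-request arena; the buffer size is an arbitrary assumption:

```cpp
// Minimal sketch (not from John's talk): a C++17 std::pmr local arena used to
// keep a container's allocations tightly packed for better spatial locality
#include <cstddef>
#include <memory_resource>
#include <vector>

void process_request() {
    // stack-based buffer acting as a per-request ("local") arena; allocations
    // are bump-allocated from it, so they end up next to each other in memory
    std::byte buf[64 * 1024];  // 64KB is an arbitrary, illustrative size
    std::pmr::monotonic_buffer_resource arena{buf, sizeof(buf)};

    // pmr containers receive the arena via the C++17-style *virtualized*
    // memory_resource interface discussed above
    std::pmr::vector<int> data{&arena};
    for (int i = 0; i < 1000; ++i)
        data.push_back(i * i);

    // ... use 'data' ...
    // everything is released wholesale when 'arena' goes out of scope
}
```

To be clear, the 10x figure above does not come from this toy example; the point is merely how little code it takes to switch a container over to a local arena.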

Sure, going for local allocators if you don’t care about performance still qualifies as premature optimization and has to be avoided – but IF you’re after performance, they’re one of the very first things to look at (IMO – the second one, right after getting rid of unnecessary thread context switches, which can be even more devastating).

Local Allocators for (Re)Actors

Last but not least: while it wasn’t discussed in John’s talk, I hope he would agree <wink />: (Re)Actors (a.k.a. event-driven programs, ad-hoc finite state machines, etc. etc., discussed in many places of this blog) tend to play really nicely with local allocators; at the very least, a per-(Re)Actor allocator can help A LOT; if necessary, it can be subdivided further into even-more-local allocators, but it is still a very-good-to-have first step. For a discussion of (Re)Actor allocators – see also my own recent series of articles in Overload: “Allocator for (Re)Actors with Optional Kinda-Safety and Relocation”, “A Usable C++ Dialect that is Safe Against Memory Corruption”, and the third one coming in a few days in the October issue of Overload, titled “Allocator for (Re)Actors. Part III – “Speedy Gonzales” Serializing (Re)Actors via Allocators”. While the issues raised by me and those in John’s talk are pretty much orthogonal, we can easily reap both the performance-benefits-discussed-by-John and the other-benefits-discussed-by-me from using per-(Re)Actor allocators.
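While it also wasn’t part of John’s talk, here is a minimal sketch of the per-(Re)Actor allocator idea using C++17 std::pmr (class and event names are hypothetical, purely for illustration):

```cpp
// Minimal sketch (my own illustration): a (Re)Actor owning its own
// memory_resource, so that all of its state comes from one per-(Re)Actor arena
#include <memory_resource>
#include <string>
#include <vector>

struct Event { int type; std::pmr::string payload; };  // hypothetical event

class MyReactor {  // hypothetical (Re)Actor
    // per-(Re)Actor allocator; as a (Re)Actor processes its events on a single
    // thread, an unsynchronized pool is enough (no locking overhead)
    std::pmr::unsynchronized_pool_resource mem_;
    // all of this (Re)Actor's state lives in the arena above
    std::pmr::vector<std::pmr::string> log_{&mem_};

public:
    void react(const Event& ev) {
        // the copied string is also allocated from mem_, because pmr
        // containers propagate their memory_resource to their elements
        log_.emplace_back(ev.payload);
    }
};
```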

STL parallel algorithms

Another talk of significant interest to me was the one by Anthony Williams on concurrency, parallelism, and coroutines. I have to admit that even before the talk I was already cautious about the newer STL parallel algorithms, but I was wondering whether I was missing something.

Anthony Williams. Concurrency, Parallelism and Coroutines

And Anthony’s presentation (which BTW was brilliant, and answered all my questions without me even asking) had a probably-undesired-by-him effect – it convinced me further that STL parallel algorithms are way way too easy to misuse, and as a result I am strongly against recommending them to the average developer.

The idea behind C++17 parallel algos looks nice at first glance – “hey, you can just add this “policy” parameter, and your algo will magically become parallel!”. However, it is exactly this perceived simplicity which gives me the creeps. I see the following major problems with this approach (to be very clear: this is NOT what Anthony said, these are my own feelings about the subject; to make your own judgement based on his talk – make sure to watch it when it becomes available on YouTube); a minimal sketch of what the one-parameter change looks like – and of how easily it can be misused – follows the list:

  • adding parallelism only looks simple; it is not.
    • Most importantly, thread sync is still the responsibility of the developer <ouch! />. As getting thread sync right is Extremely Difficult (just one example – even WG21 members are guilty of providing and proposing code-with-grave-MT-bugs, more on it in my Friday talk), anything which requires thread sync makes the program inherently fragile. And while thread sync in parallel algos might indeed be necessary – still, providing a very syntactically simple way of making your program extremely fragile is hardly a good thing.
      • In other words – I am of the very strong opinion that the standard should not only allow doing simple things in a simple manner, but should also make doing dangerous things more complicated. Put differently – IMNSHO, we should make an effort to make shooting ourselves in the foot more difficult. Of course, it is a philosophical question and as such can be debated at length – but well, personally I am very sure about it.
    • Another major problem is that in practice, one of the most important things needed to make an algo not just formally parallel, but actually faster, is the granularity of the work offloaded to different threads. This absolutely-important issue is completely missing from the STL parallel algos. And while in theory it might be fixed solely by the policies – I am extremely sceptical that policies can handle it magically and efficiently without significant help from a developer who has intimate knowledge of the algo involved (and having per-algo policies, while technically possible, effectively defeats the idea of the separation between algos and policies)
      • Sure, these concerns might be mitigated by providing some benchmarks – but these are conspicuously missing from all-the-talks-I’ve-seen-on-the-STL-parallel-algos (!). This certainly doesn’t improve my confidence in their current practical value.
      • And without paying attention to making our algos coarse-grained, at least on non-GPU architectures, we can easily run into a situation I’ve seen many times myself: a developer happily says “hey, I was able to utilize all 8 cores”, while at a second glance it turns out that his program-utilizing-all-8-cores actually takes more wall-clock time to execute (!!).
  • the level of control provided by the default policies is extremely limited – and in practice this is another extremely important thing.
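Here is the sketch promised above (my own code, not Anthony’s): the first call shows the advertised one-parameter change, while the second compiles just as happily but contains a data race:

```cpp
// Minimal sketch (my own, not from Anthony's talk) of C++17 parallel algorithms
#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

void example(std::vector<double>& v) {
    // the "magic": adding std::execution::par is the only change needed to
    // request a parallel sort
    std::sort(std::execution::par, v.begin(), v.end());

    // the danger: this also compiles, but the unsynchronized 'sum' is a data
    // race (undefined behavior) as soon as the policy really runs in parallel
    double sum = 0.0;
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [&sum](double x) { sum += std::sqrt(x); });  // BUG: data race
    (void)sum;

    // a correct version needs explicit synchronization (atomics/mutex), or a
    // different algorithm such as std::transform_reduce; either way, thread
    // sync remains the developer's responsibility
}
```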

Given all this, I remain extremely cautious about the merits of the STL parallelised algorithms for real-world use on non-GPU processors; in particular, for Intel CPUs I still strongly prefer TBB (a sketch of the kind of explicit control TBB provides follows below). On the other hand, for GPGPUs, the STL parallelised algos mentioned by Anthony might happen to be much more interesting, for two big reasons: (a) writing GPGPU code is a major headache to start with; (b) for GPGPUs, the issue of coarse granularity, while still somewhat present, is not that important, so automated balancing has much better chances to work.
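For comparison, here is a minimal sketch (my own; the grain size is a made-up number, not a recommendation) of the kind of explicit grain-size control which TBB exposes, and which I find missing from the standard execution policies:

```cpp
// Minimal sketch (my own illustration): explicit grain-size control with
// Intel TBB's parallel_for
#include <cmath>
#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void process(std::vector<double>& v) {
    const std::size_t grain = 10'000;  // hypothetical value; tune per algo/CPU
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, v.size(), grain),
        [&](const tbb::blocked_range<std::size_t>& r) {
            // TBB will not split ranges below 'grain' elements, so each task
            // gets a coarse-grained chunk and scheduling overhead is amortized
            // over many elements
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                v[i] = std::sqrt(v[i]);
        });
}
```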

Overall: even if STL parallelised algos can be made to perform well after a significant amount of work (which is yet to be observed; most likely, even if it happens eventually, there is still a loooong road to achieve this), and (given the right tools) they can easily provide significant value for GPGPUs, what is clear is that

by pretending that parallelisation is simple, they have an enormous potential for an unsuspecting developer to try using them – and to get the whole project badly burned

This can very easily lead to improper thread sync, which will in turn lead to bugs-exhibited-only-once-a-month-on-the-client-site (and increasing the already-unacceptably-high number of programs which fail occasionally is hardly a good idea). In addition, as noted above, it is very easy to write a parallel algo which performs worse even wallclock-wise than a non-parallel one.1


1 BTW, consumed-power-wise, parallel algos pretty much inevitably perform worse than serial ones, especially if thread sync is involved (!) – and this should also be articulated, to avoid developers using parallel stuff just because it is “cool”.

 

Tired hare: It was another long day – and there is another one coming. Tomorrow, for me, it will be the SG14 (“Game Development and Low Latency”) series of meetings.
