#CPPCON2017. Day 4. Async Rulezzz!

 
 
 

During Day 4 of CPPCON, I had to prepare for my own talk tomorrow; still – I was able to attend all the talks I was interested in.

But before I can start – let’s discuss a thing which I came to understand only during CPPCON17; it is related to…

Big Fallacies of Parallel Programming for Joe Average Developer

Twice today, I heard pretty much the same thing, related to the concept of “utilization” of the available CPU cores. And the funniest thing about it was that while on one occasion a reference to “core utilization” felt like a fallacy, on the other occasion it made perfect sense. After thinking about it – I realized that the whole difference was about whether the person realized what it really means.

The first time, I heard about CPU core utilization from a vendor trying to sell their profiler, saying something along the lines of “hey, if your core is idle – we’ll show it to you in red, so you can see your cores are idling, which is a Bad Thing(tm)”.1 Unfortunately, if interpreting core utilization this way – it is extremely easy to fall into the trap of assuming that utilization is The Only Thing Which Matters(tm). And with such an assumption – the concept of “core utilization” becomes a Really Big Problem. Unfortunately, it is perfectly possible to have a parallelized program which utilizes all the available cores – while working slower (by wall-clock(!)) than the original single-threaded one. Moreover, even if we’re using exactly the same algorithms in the single-threaded implementation and in the parallel one – the single-threaded one can still win wall-clock-wise. The point here is related to so-called granularity: very briefly, if we’re doing our job in chunks which are too small – we’ll be incurring LOTS of thread context switches, and with each context switch taking between 2’000 and 1’000’000 CPU clock cycles2 – the overhead can easily make the whole thing perform really badly.

This, in turn, means that if we’re going to run a program which simply calculates the sum of an array in parallel, with each sub-task doing only one addition before doing a thread context switch back to the main one – then, while we can easily utilize all the 24 cores of our box, the wall-clock time of the calculation can easily be LARGER than the single-threaded one; that’s while consuming 24x more power, and causing a 24x larger CO2 footprint too.
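To make the granularity point more tangible – here is a deliberately terrible sketch (array size and the use of plain std::async are made up purely for illustration): the “parallel” sum spawns one task per element, so nearly all the CPU time goes into thread management rather than into additions – exactly the kind of code which shows great “utilization” while losing to the single-threaded loop by wall-clock.

// A deliberately terrible illustration of granularity (hypothetical sizes,
// plain std::async - nothing to do with any specific profiler or library).
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> data(10000, 1);

    // Single-threaded baseline: one pass over the array, zero synchronization.
    long long serial_sum = std::accumulate(data.begin(), data.end(), 0LL);

    // Absurdly fine-grained "parallel" version: each sub-task does ONE addition's
    // worth of work, then the main thread immediately waits for it - i.e. at least
    // one context switch (plus thread creation) per element.
    long long parallel_sum = 0;
    for (int x : data) {
        auto f = std::async(std::launch::async, [x] { return (long long)x; });
        parallel_sum += f.get();   // overhead dwarfs the useful work
    }

    std::printf("serial=%lld parallel=%lld\n", serial_sum, parallel_sum);
}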

As a result, an interpretation of “core utilization” being The Only Thing Which Matters – is not just wrong, but Really Badly Wrong.

The second time, I heard about CPU core utilization within the talk “The Asynchronous C++ Parallel Programming Model” by Hartmut Kaiser. However, in this talk – it was in the context of HPC, and more importantly – it was accompanied by a discussion on overheads and granularity. In this context – sure, utilizing all the cores, as a rule of thumb, is indeed a Good Thing(tm).

Observing this difference in potential interpretation of “core utilization” has led me to a feeling that there are several different layers of understanding of parallelizing our calculations, and that the understanding of a Joe Average developer (such as myself) can be very different from the understanding of an HPC pro. This, in turn, means that

taking an HPC pro’s comments out of context and without proper understanding can easily be devastating for our (=”Joe Average”) programming efforts

Here is a list of concepts which, while being true in the proper context (and being so obvious to HPC pros that they use them without specifying the context), can be very dangerous when pulled into the context of what-most-of-us-are-doing-for-a-living:

  • the concept of “core utilization”, briefly described above.
    • For HPC pros, it is indeed one of the most important metrics (mostly because they already handled everything else, including choosing proper granularity)
    • For us, “Joe Average” developers – it is not that obvious, and pretty often having the wrong granularity can kill our efforts faster than we can say “Joe Average” – while keeping 100% utilization.
  • “Let’s parallelize as many things as humanly possible”.
    • While this recommendation comes right from Amdahl’s Law, strictly speaking, even for HPC it stands only if we postulate that “we want the result ASAP at all costs” (and if we bring things such as “power costs” or “CO2 footprint” into consideration – things will be somewhat different). Still, for real-world HPC I am perfectly ready to defer to HPC pros on this one 🙂 .
    • However, for us “Joe Average” programmers, most of the time such an approach is just plain wrong.
      • First of all – more often than not, we’re speaking not about long calculations, but about interactive stuff. Sure, there MAY be a requirement to finish our work before-next-frame-comes (before-user-dies-from-old-age, …) – but all the other things which are not time-critical are better left without parallelization. Most importantly, non-parallelized code is simpler and much less risky. In addition – it will have less overhead, which in turn will leave more CPU power available for other applications, reduce electricity bills, and be more environment-friendly.
      • Second – if we try to parallelize all the things in the context of interactive programs (which tend to be rather short) – we’ll likely end up with lots of algos with calculation chunks too short to be efficient – which, in turn, will cause severe losses of performance.
      • Overall, for us (=”Joe Average developers working on interactive programs”), the rule of “Let’s parallelize as many things as humanly possible”, most of the time, becomes its exact opposite: “Let’s NOT parallelize anything which still fits into the time bounds required by interactivity”.
  • “Oversubscribe and utilize all resources”
    • For HPC, it makes perfect sense – after all, if calculation chunks are large enough, any idle core is indeed a waste of resources.
      • Indiscriminate oversubscription can still be a problem (running a million threads per core is rarely a good idea) – but HPC guys do know their stuff, so they know how much to oversubscribe.
    • However, for interactive programs – oversubscription is risky. In particular – having all the cores in use all the time reduces the responsiveness of the system. In other words – to reduce latencies, we have to keep at least one core idle more or less at all times (see the sketch after this list).
      • NB: while we can improve responsiveness by using thread priorities – in practice, it happens to be rather difficult (it is easy to forget to raise the priority of one of the threads on the critical path, priority inversion can kick in, etc.)
      • Moreover, heavy oversubscription can kill an interactive program much more easily than an HPC one.
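As an illustration of the “keep at least one core free” point – here is a minimal sketch of an under-subscribed worker pool. Everything in it (the SmallPool name, the naive job queue) is invented for illustration; the only thing that matters here is the sizing rule.

// A minimal sketch of the "leave (at least) one core for the rest of the system"
// idea for interactive programs; the pool itself is as naive as it gets.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SmallPool {
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
public:
    SmallPool() {
        unsigned hw = std::thread::hardware_concurrency();
        // deliberately under-subscribe: keep a core free so the main/UI thread
        // and the OS stay responsive
        unsigned n = hw > 1 ? hw - 1 : 1;
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lk(m_);
                        cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                        if (stop_ && jobs_.empty()) return;
                        job = std::move(jobs_.front());
                        jobs_.pop();
                    }
                    job();   // run the posted work outside the lock
                }
            });
    }
    void post(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
    ~SmallPool() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
};

int main() {
    SmallPool pool;
    pool.post([] { /* some non-time-critical background work */ });
}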

1 BTW, later the guy admitted that he’s just a sales person, so he has no idea what it all means <sad-face />
2 that is, on modern desktop/server CPUs, not on GPUs or MCUs

 

Why HPX kicks a** of current parallel STL

Now, after all those necessary disclaimers and clarifications above – I can convey my feelings about the talk “The Asynchronous C++ Parallel Programming Model” by Hartmut Kaiser. Very briefly – he was speaking about HPX, which is IMHO inherently superior to the new parallelized STL algos in C++17. The reason for it is the difference in programming paradigm. While C++17 STL follows a traditional OpenMP-like model of “hey, we have this loop, let’s parallelize it” – HPX goes along the lines of describing “how the thing-we-need can be calculated”, and then hands it to the infrastructure code to ensure that the stuff is calculated efficiently.

'The Asynchronous C++ Parallel Programming Model' by Hartmut Kaiser

For example, with C++17’s parallel algos, if we have two loops – then first, we need to incur lots of context switches to start loop #1, then we have to wait until all the threads finish their calculations (incurring another bunch of expensive context switches), and only then can we start calculating the second loop. On the other hand, with HPX (while the code actually looks reasonably similar to the C++17-based one) – in practice, we’re just describing the way certain futures can be calculated, so as soon as all the information necessary to calculate a certain future becomes available – HPX can start calculating it. This allows avoiding all the unnecessary thread synchronizations – which, in turn, provides both better core utilization and reduced overall overheads.
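To illustrate the difference in paradigm (and only the paradigm – this is my own sketch, NOT code from the talk): the C++17 part below is real parallel-STL code with an implicit barrier between the two loops, while the HPX-style fragment is written from memory, so treat the exact names (hpx::async, hpx::when_all, future::then) as assumptions and check the HPX docs before relying on them.

// A rough sketch of the difference in paradigm - NOT code from the talk.
#include <algorithm>
#include <execution>
#include <vector>

// C++17 parallel STL: two fork/join "loops" with an implicit barrier in between.
void cxx17_style(std::vector<double>& a, std::vector<double>& b) {
    std::for_each(std::execution::par, a.begin(), a.end(),
                  [](double& x) { x *= 2.0; });   // all worker threads must finish here...
    std::for_each(std::execution::par, b.begin(), b.end(),
                  [](double& x) { x += 1.0; });   // ...before loop #2 is allowed to start
}

// HPX-style (names such as hpx::async / hpx::when_all / future::then are written
// from memory and are an assumption, not a quote from the talk): we only describe
// HOW each value can be computed, and the runtime starts each piece as soon as its
// inputs become ready - no global barrier in between.
//
//   hpx::future<double> fa = hpx::async(sum_part, std::cref(a));
//   hpx::future<double> fb = hpx::async(sum_part, std::cref(b));
//   auto fc = hpx::when_all(fa, fb).then(
//       [](auto both) { /* combine the two partial results */ });

int main() {
    std::vector<double> a(1000000, 1.0), b(1000000, 2.0);
    cxx17_style(a, b);
}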

For more details – make sure to watch the presentation when it appears on YouTube. What is clear though – is that

once again, asynchronous systems have demonstrated their advantage over the systems-relying-on-explicit-synchronization (such as mutexes)

NB: I heard that a similar future-based implementation is planned for the C++ standard too; when we can hope for it – is unclear though.

Naked Coroutines

Another excellent presentation (and once again, demonstrating the power of the asynchronous stuff <wink />) was a talk by Gor Nishanov titled “Naked coroutines live (with networking)”. The point was to take just bare coroutines + the new networking stuff from the Networking TS – and to make a non-blocking networked app out of it – live, within one hour, no less <smile />. When the video of the talk becomes available on YouTube – make sure to watch it with your own eyes to see how easy writing an async network app has become.


Disclaimer: for the time being, for serious network development I’d rather still use native OS APIs, as they’re more likely to provide more knobs to turn – and at least for now, this is often necessary to achieve optimal networking performance. However, nothing prevents us from using coroutines with our-own-async-calls-built-on-top-of-{select()|poll()|epoll()|kqueue()|whatever-else} – in pretty much the same manner as Gor has described in his talk.
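To illustrate what “our own async calls on top of poll() and friends” might look like – here is a bare-bones sketch (C++20 coroutines + POSIX poll()). All the names in it (Reactor, wait_readable, task) are invented for this example and have nothing to do with Gor’s code or with the Networking TS.

// A bare-bones sketch: a coroutine awaitable built on top of poll().
// Invented names, no error handling, single-threaded, for illustration only.
#include <coroutine>
#include <cstdio>
#include <poll.h>
#include <unistd.h>
#include <unordered_map>
#include <vector>

// The "reactor": a map of fds to suspended coroutines, plus a poll() loop.
struct Reactor {
    std::unordered_map<int, std::coroutine_handle<>> waiting;

    void run() {
        while (!waiting.empty()) {
            std::vector<pollfd> fds;
            for (auto& kv : waiting)
                fds.push_back(pollfd{kv.first, POLLIN, 0});
            ::poll(fds.data(), (nfds_t)fds.size(), -1);   // block until something is readable
            for (auto& p : fds)
                if (p.revents & POLLIN) {
                    auto h = waiting[p.fd];
                    waiting.erase(p.fd);
                    h.resume();                            // re-enter the waiting coroutine
                }
        }
    }
};
inline Reactor g_reactor;

// Awaitable: suspend the current coroutine until `fd` becomes readable.
struct wait_readable {
    int fd;
    bool await_ready() const noexcept { return false; }
    void await_suspend(std::coroutine_handle<> h) const { g_reactor.waiting[fd] = h; }
    void await_resume() const noexcept {}
};

// Bare-bones fire-and-forget coroutine type (no result, no cancellation, nothing).
struct task {
    struct promise_type {
        task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
};

task echo_once(int fd) {
    co_await wait_readable{fd};            // non-blocking wait; other coroutines may run meanwhile
    char buf[256];
    ssize_t n = ::read(fd, buf, sizeof(buf));
    if (n > 0) std::printf("got %zd bytes\n", n);
}

int main() {
    echo_once(0);          // e.g. wait until stdin becomes readable
    g_reactor.run();
}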

Once Again on Metaprogramming

As you might have noticed from my previous post – I am wildly excited about the coming-to-a-compiler-near-you-some-time-next-decade “metaprogramming” features for C++. As a result – I didn’t really have an option to miss the talk “Language Support for Metaprogramming in C++” by Andrew Sutton, and I certainly wasn’t disappointed with it. Very shortly – it is another MUST-watch talk when it appears on YouTube.

It was another day at CPPCON2017; tomorrow is the last day, and I’ll be speaking myself – so while I still hope to write something about it, please don’t expect too much <wink />.
