Using Parallel <algorithm> Without a Clue: 90x Performance Loss Instead of 8x Gain

 
Author: "No Bugs" Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 


Parallel vs Single-Threaded

Ubi nihil vales, ibi nihil velis
Where you are worth nothing, there you should want nothing

— Arnold Geulincx, XVII century —

Disclaimer: if you’re doing parallel programming for a living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programming can be made easy-as-pie – make sure to read it.

With C++17 supporting1 parallel versions of the std:: algorithms, there are quite a few people saying “hey, it became really simple to write parallel code!”.

Just as one example, [MSDN] wrote: “Only a few years ago, writing parallel code in C++ was a domain of the experts.” (implying that these days, to write parallel code, you don’t need to be an expert anymore).

I always had my extremely strong suspicions about this position being deadly wrong, but recently I made an experiment which demonstrates Big Fat Dangers(tm) of implying that parallelization can be made as simple as just adding a policy parameter to your std:: call.


1 well, at least on paper; to the best of my knowledge, both libstdc++ and libc++ do NOT support it yet (all my GCC/Clang compilers, as well as Godbolt, are choking on #include <execution>)
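
BTW, if you want your code to keep compiling on toolchains which don’t ship <execution> yet, one possible guard (just a sketch on my side; HAS_PARALLEL_STD is my own macro name, and experimental implementations may or may not define the feature-test macros) is along the following lines:

// Sketch: enable the parallel path only where <execution> is actually usable.
// HAS_PARALLEL_STD is my own macro name, not anything standard.
#ifdef __has_include
#  if __has_include(<execution>)
#    include <execution>
#    ifdef __cpp_lib_execution
#      define HAS_PARALLEL_STD 1
#    endif
#  endif
#endif
#ifndef HAS_PARALLEL_STD
#  define HAS_PARALLEL_STD 0  // fall back to the sequential versions
#endif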

 

Task at Hand

Let’s consider the following very simple scenario: we have to calculate the sum of the elements in a million-element integer array. The code we’re starting from is as simple as

template<class Iterator>
size_t seq_calc_sum(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(begin, end, [&](int item) {
    x += item;
  });
  return x;
}

When running this code on my pretty old Windows box (compiled with MSVC VS2017 in Release mode2) – it takes about 450us.


2 as noted above – as of now and to the best of my knowledge, MSVC is the only compiler+lib able to handle this kind of stuff; note that even in MSVC it is still “experimental”
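
In case you want to reproduce the measurement yourself – the exact harness is in [demo] below, but a minimal sketch of this kind of timing (the names here are mine, not from the original harness) could look like:

#include <chrono>
#include <iostream>
#include <vector>

// seq_calc_sum as defined above
int main() {
  std::vector<int> v(1000000, 1);  // the million-element array from the Task at Hand
  auto start = std::chrono::high_resolution_clock::now();
  auto sum = seq_calc_sum(v.begin(), v.end());
  auto stop = std::chrono::high_resolution_clock::now();
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
  std::cout << "sum=" << sum << ", took " << us << "us\n";
  // in practice, the run is repeated many times and the results are averaged
  return 0;
}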

 

Adding parallelism: Take 1 (badly broken)

First, we have to note that simply adding std::execution::par to our std::for_each() call will NOT work. If we try to write something like

//DON'T USE: WILL CAUSE WRONG RESULT
template<class Iterator>
size_t par_calc_sum_deadly_broken(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(std::execution::par,begin, end, [&](int item) {
    x += item;//data race; often causes wrong calculation result(!)
  });
  return x;
}

– it will compile and run, but we’ll easily get a wrong result (in my experiments with a million-element array, the result was wrong each and every time, but YMMV, which only makes things worse <sad-face />).
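
One cheap way to catch this kind of breakage while experimenting (a sketch on my side; check_par_sum is my own helper name, not from the original code) is to cross-check the parallel result against the known-good sequential one:

// Sketch: cross-check the parallel result against the sequential baseline.
// Needs <iostream> in addition to the headers used above.
template<class Iterator>
void check_par_sum(Iterator begin, Iterator end) {
  size_t seq = seq_calc_sum(begin, end);
  size_t par = par_calc_sum_deadly_broken(begin, end);
  if (seq != par)
    std::cout << "data race detected: parallel sum " << par << " != " << seq << "\n";
}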

Adding parallelism: Take 2 (90x performance hit)

IF we were observant enough to notice this problem – and to find the neat recommendation in [CppReference] – we’d realize that in addition to specifying std::execution::par, we also have to use std::mutex (or std::atomic) to make our program correct.

Ok, so our next (still very naive BTW) version would be along the following lines:

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_mutex(Iterator begin, Iterator end) {
  size_t x = 0;
  std::mutex m;
  std::for_each(std::execution::par,begin, end, [&](int item) {
    std::lock_guard<std::mutex> guard(m);
    x += item;
  });
  return x;
}

This does work correctly – and if we take a look at taskman, we’ll see that it DOES use all the cores (4 physical x 2-due-to-HT = 8 logical ones in my case). And if we didn’t measure the performance against the sequential version – we could think that everything is good here. Not so fast <sad-face />.

Measurements have shown that the function above takes about 40 milliseconds of wall-clock time; so instead of the expected speedup of about 4x-8x, we get about a 90x slowdown compared to the sequential version

(BTW, if you have doubts and want to run it yourself – the code is available at [demo]).
To make things even worse, the code above is written strictly along the lines recommended in [CppReference] (actually, it is almost exactly the same code).

Adding parallelism: Take 3. Atomics to the rescue? (still 50x performance hit)

Ok, as a next step we could think “hey, mutexes are bad for performance, so we should use atomics instead”. So, we rewrite it as

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_atomic(Iterator begin, Iterator end) {
  std::atomic<size_t> x = 0;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    x += item;//changing it to x.fetch_add(item) doesn't change anything - nor should it
  });
  return x;
}

Well, it is still correct, AND (as expected) it is faster than our previous mutex-based version. The problem is that

It is still 50x slower than the sequential one

Oh, and BTW – replacing std::execution::par with std::execution::par_unseq (with an assertion on x.is_lock_free() to prevent potential for deadlocks due to vectorization) didn’t make it better.3


3 it is unclear whether using std::atomic when it is_lock_free() is safe for par_unseq; IMO it is, but there are voices out there that formally it isn’t; in any case, currently MSVC doesn’t really implement par_unseq (falling back to par), so as of now, it is a moot issue.
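
For completeness, here is what such a par_unseq variant might look like (the function name is mine; the assertion is the just-in-case is_lock_free() check mentioned above) – and as noted, on my box it didn’t perform any better than the par version:

//DON'T USE: DEADLY SLOW (shown only to illustrate the par_unseq variant discussed above)
template<class Iterator>
size_t par_calc_sum_atomic_unseq(Iterator begin, Iterator end) {
  std::atomic<size_t> x = 0;
  assert(x.is_lock_free());  // just-in-case check; needs <cassert>
  std::for_each(std::execution::par_unseq, begin, end, [&](int item) {
    x += item;
  });
  return x;
}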

 

Results

Box #1 (4 physical, 8 logical cores):
  non-parallelized:                           470+-4us
  std::execution::par with std::mutex:        41200+-900us (90x slower, 600x+ less power-efficient)
  std::execution::par with std::atomic:       23400+-140us (50x slower, 300x+ less power-efficient)
  std::execution::par_unseq with std::atomic: 23400+-140us (50x slower, 300x+ less power-efficient)

Box #2 (2 physical, 4 logical cores):
  non-parallelized:                           900+-150us
  std::execution::par with std::mutex:        52500+-6000us (60x slower, 200x+ less power-efficient)
  std::execution::par with std::atomic:       25100+-4500us (30x slower, 100x+ less power-efficient)
  std::execution::par_unseq with std::atomic: 21800+-2800us (25x slower, 100x+ less power-efficient)

As we can see, not only is our naive parallel code hopelessly inefficient, greatly increasing the CO2 footprint for absolutely no reason,4

it also punishes users with more powerful boxes

(more strictly, it seems that the more cores we have – the more penalty we get; this will start making sense in the second part of this mini-series).


4 a desire to utilize all the cores, or to make the program parallel, does not qualify as a reason

 

Intermediate Conclusions and To Be Continued

As we have seen from the results above, naive attempts to make our code parallel (while having no clue about the way parallel code works) can easily cause HUGE problems – ranging from wrong results and even crashes, to otherwise-correct programs slowing down by factors of 50x-90x.

In other words (arguing with the quote from [MSDN] cited in the very beginning of this post):

Writing parallel code in C++ is still a domain of the experts.5

OTOH, the point of the exercise above is NOT to say that it is not possible to write efficient code with parallel std:: functions (it is). However, to do it – it is still necessary to understand what we’re doing. An explanation of what-is-wrong-with-the-code-above is planned for the next post of mine (spoiler: it is all about overhead which happens to be waaay too high in the code above).


5 It MIGHT change with the async stuff such as [HPX], but current support for parallel algos in std:: (except for std::reduce()), which forces us to mix calculations with thread sync, is not going to make writing of correct-and-high-performance programs any simpler <sigh />
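
And since std::reduce() got an honorable mention in the footnote above (and comes up repeatedly in the comments below) – here is a sketch of what a reduce-based version might look like (the function name is mine; measurements are deferred to the next post):

template<class Iterator>
size_t par_calc_sum_reduce(Iterator begin, Iterator end) {
  // std::reduce combines per-chunk partial sums internally, so no mutex/atomic
  // is needed in user code; requires <numeric> in addition to <execution>.
  return std::reduce(std::execution::par, begin, end, size_t(0));
}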

 


Acknowledgement

Cartoons by Sergey Gordeev IRL from Gordeev Animation Graphics, Prague.

 


Comments

    • "No Bugs" Hare says

…and probably also some allocations (for good measure). Anyway – this post is not really for those people who know why it is so slow – but for those who don’t (and the whole point is that those-people-who-DON’T-understand-why-it-is-slow SHOULD NOT write parallel code until they DO understand it).

  1. std::everything_is_a_nail says

I’ve read the disclaimer, but to be honest this just shows that every tool (like std::for_each) can be misused. “Choose the right tool for the job” is a guideline that does not only apply to parallel programming, and I do not see how it invalidates the usability promises of parallel std algos.

    The article would be a lot better if it also included the right tool for the job: std::reduce, which does not need mutexes or atomics. What’s the performance of naively using that?

    • "No Bugs" Hare says

      The whole point of the article is to counter claims “hey, we can just make things parallel without having a clue of what we’re doing” (and IMNSHO, to replace for_each with reduce, we _have_ to have a clue). As for reduce() – sure, it would help in this particular case, but it is way too narrow for LOTS of real-world scenarios (also, rewriting not-so-trivial logic into reduce() is usually difficult; I’d rather go with HPX-style data-driven stuff which is more like most-of-existing-serial-code – AND it will outperform code-with-multiple-reduce()’s anyway as it doesn’t need to sync at the end of each op). In any case, given the designated target audience, it qualifies as a “very advanced” subject ;-).

  2. Benjamin Buch says

    std::for_each is the wrong algorithm, try std::reduce. It should massively speed up.

template<class Iterator>
size_t seq_calc_sum(Iterator begin, Iterator end) {
  return std::reduce(begin, end);
}

    for_each will speed up if you have independent operations. Forming the sum requires synchronization and must be done manually in for_each. But synchronization is always slow. However, it is not necessary to synchronize every call. reduce hides this implementation detail. I didn’t measure it, but it should be much faster. I would be happy if you could try it and write another post about the results. 😀

    • "No Bugs" Hare says

      Right (and there are some other ways to achieve the same thing), but to write it this way – you do have to have a clue about what you’re doing, which is the whole point of this post 🙂 .

    • says

Any parallel algorithm (including std::reduce) can cause performance degradation if parallel threads are trying to access sibling memory cells (so-called false sharing – a hardware-design effect which causes synchronization even if your code doesn’t need it in theory).

      • "No Bugs" Hare says

        TBH, false sharing looks quite unlikely in this particular context; after all, most of the data is constant, and any modifiable stuff is within std::reduce, where it _should_ have been taken care of by lib writers (BTW, this is achieved automatically if they’re using stack-only vars as accumulators, which I guess is quite likely).

  3. Andrew says

What are the results if you used std::accumulate, where everything is localized (no sharing or locking necessary)? I agree that developers shouldn’t blindly use the new shiny parallel algorithms without testing. However, the method used should’ve been accumulate from the beginning rather than for_each. It wasn’t the right algorithm to begin with.

    • "No Bugs" Hare says

      > It wasn’t the right algorithm to begin with.

      Arguable (in serial processing, there is no discernible difference between the two). And FWIW, parallel version of std::accumulate is known as std::reduce (and I’ll discuss it in the next instalment).

      • sv90 says

        The names of std::accumulate and std::reduce describe your intent better than std::for_each and hence one should prefer them always.

        • "No Bugs" Hare says

This is a style-like subject we can argue ad infinitum (I agree that the “intent” argument does fly to a certain extent, but also agree that a counterargument flies, that accumulate() is an unnecessary entity which has to die under Occam’s razor). More practically – just as one example, IF I need to get two things calculated (such as “sum” and “sum-of-squares”), writing it as TWO accumulates is (arguably) a premature pessimization (single loop with two adds will do it faster pretty much by definition due to caching and memory bandwidth issues), and writing a special two-component accumulator just to shoehorn accumulate for this purpose is unnecessarily-verbose (which reduces readability!) and overall ugly (that is, unless we’re doing it in parallel – THEN the tables are turned, more on it in next(next(this_post))). And IF I need to BOTH modify my vector AND calculate something over it – shoehorning accumulate() there is a recipe for disaster. Hence (one may argue with a straight face) accumulate is not ALWAYS the best thing to use. Q.E.D. (and overall, “always” is a waaaay too strong predicate to stand in the real world; even my favorite “mutex MUST DIE” doesn’t always stand 😉 – though it stands “as a Big Fat Rule of Thumb”).

  4. Marcello Maggioni says

Of course it is slower. You are trying to do on multiple cores something that is inherently sequential. There is no code that is parallelizable in the example you developed. All the code needs to be serialized to be correct. Typically, code that is run in parallel has a bunch of code run in parallel and some synchronization points here and there to exchange data between the cores. The ratio between synchronization points and parallelizable code needs to be as low as possible (ideally 0). Your example has only one line of code and it is a synchronization point, so the ratio is infinite.

    It’s not a C++ problem, this code would suck performance in any other parallel programming paradigm.

    • "No Bugs" Hare says

      > It’s not a C++ problem, this code would suck performance in any other parallel programming paradigm.

      Sure. But the point “you still have to have a clue of what you’re doing”, still stands.

    • Sothatsit says

Am I misunderstanding your point, because parallelising this code would simply require separating the data into 8 chunks, passing each to a separate thread to be summed, and then aggregating each thread’s results?

      Whether this would be more efficient for only a million integers is a question of benchmarking, however it would definitely be more efficient as the dataset grows. Therefore claiming that this is an “inherently sequential” problem is incorrect, or am I missing your point?

  5. Smitty says

    The problem, surely, is that the statement
    x += item
    is atomic.

    This problem (like every parallelizable problem?) requires the input data *and* the operation to be partitioned in parallel.

A naive implementation might consider loop unrolling and then recursive descent into a function which splits the input array into N_Core sub-arrays, recursively, until the input array had size(A) <= 2. Each descent would create N_Core threads waiting for the return value of the recursive functions.

    • "No Bugs" Hare says

      > The problem, surely, is that the statement x += item is atomic.

      Or more generally – that calculation is “too small” for the overhead incurred. I’ll try to write about it in the next post.

    • "No Bugs" Hare says

      Didn’t know about it, thanks for the link! While I am not sure whether it will fly with consumers (do we REALLY need more than 4 cores on our home/office boxes?) – it is IMO a Good Thing(tm) that more people will start to think about NUMA 🙂 .

      • Jesper Nielsen says

        The funny thing is, that for Threadripper, by default you’re not able to think about it for consumer software…
        “By default, Threadripper actually hides its NUMA architecture. AMD instead runs Threadripper in a UMA configuration”

          • Jesper says

            Well at least the inter-node latency is not supposed to be that bad, and since it has SMT the core can use the wait to perform useful work. I still haven’t had my hands on any Numa systems but it could be fun to play with, although I’m kinda restricted by being a C# coder so a little removed from the metal.

          • "No Bugs" Hare says

            > I still haven’t had my hands on any Numa systems

            Pretty much all the multi-socket servers are NUMA these days (and if you REALLY want it – it is possible to rent one for $69.99/month right now at https://www.leaseweb.com/dedicated-servers/build-your-own – no warranties of any kind, but I’d guess that 2xE5620 box is NUMA – at least it is almost-for-sure NUMA in hardware). Although TBH, NUMA-related issues on modern x64 boxes are not really noticeable – UNTIL you start looking for them.

    • "No Bugs" Hare says

      What kind of synchronization is meant, is not 100% clear (just a few lines above, when providing a rationale, it talks about _blocking_ synchronization being dangerous because of deadlocks – and atomic is non-blocking and cannot possibly cause any deadlocks). If speaking in terms of sequencing – it is still safe (and is practically safe for a pretty much any feasible implementation). Have to wait until final wording is released.

      EDIT: BTW, formally wording-you-referred-to says about “_standard_ functions”, which technically has nothing to do with my own functions ;-).

      • Billy O'Neal says

        > If it says “synchronizes with”, at all, you can’t use it in par_unseq. That means no atomics, no shared_ptr, etc.

        It’s likely that the current spec is written that way because specifying how you could use atomics in a way that wouldn’t create deadlocks is hard. 🙂

        > as you sound like a person from MSFT – can you tell whether the absence of parallel version for transform() in 15.5.5 is intentional – or it is just an accidental omission

        Yes, I implemented most of the parallel algorithms library. 15.5 doesn’t contain a complete parallel algorithms implementation. I believe the first release we are calling “complete” there is 15.7; I’m not sure if 15.6 contained transform or not.

        • "No Bugs" Hare says

          > specifying how you could use atomics in a way that wouldn’t create deadlocks is hard

          IMO, it is quite easy – as long as atomic is_lock_free() (or type is_always_lock_free()) – it should be inherently deadlock-safe (this certainly stands for x86/x64, but also for other platforms where is_lock_free atomics are based on hardware CAS instructions, and I don’t know of any platform where they aren’t). Any objections? (if not – it is probably a good idea to try pushing this clarification into the par_unseq spec…)

          > I believe the first release we are calling “complete” there is 15.7

          I see, thanks (but 15.7 is not out yet, right?)

          • Billy O'Neal says

            >as long as atomic is_lock_free() (or type is_always_lock_free()) – it should be inherently deadlock-safe

            But with atomics, users can create their own spinlocks, which would not be deadlock safe. Consider:

std::atomic_flag f = ATOMIC_FLAG_INIT; // always lock free

{ // inside parallel algorithm
  while (f.test_and_set()) /* spin until we get the flag */;
  // do protected thing
  f.clear();
}

If you run this in a GPU-like context (the intent of par_unseq), and the warp decides to take the “didn’t get the flag” branch here, you deadlock: all 31 other “threads” in your warp will forever not be the winner of the test-and-set, and the hardware won’t ever try to run the lucky winner thread because a join point is never reached.

            You can build things like atomic_flag here out of all the other atomic operations (except perhaps plain load and store?), so right now the standard just says no, not ever. Even no synchronization can still be helpful in a lot of domains.

            There are probably examples of patterns that could be made safe (maybe reference counting?) but for now the standard says “if it says Synchronizes with thou shalt not call it in par_unseq”.

          • "No Bugs" Hare says

Well, IF you really want it – you can get into trouble even with atomics 😉 (via no less than a busy loop – gimme a break…). As for the standard – my current reading of it is that it says essentially that _with par_unseq it is the developer’s responsibility to avoid deadlocks even if things get interleaved_ – and my implementation (with a just-in-case assert() added) is perfectly fine under this understanding (and under pretty much any sensible implementation too).

          • Billy O'Neal says

            I don’t quite get how you can read http://eel.is/c++draft/algorithms.parallel#exec-6 as “atomics are OK”

            “A standard library function is vectorization-unsafe if it is specified to synchronize with another function invocation, or another function invocation is specified to synchronize with it, and if it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions may not be invoked by user code called from execution​::​parallel_­unsequenced_­policy algorithms.”

          • "No Bugs" Hare says

IMO it is obvious: as your quote explicitly starts with a reference to “_standard_ library functions”, and as I am not writing a _standard_ library, this whole paragraph _explicitly_ does NOT apply to my app-level code, plain and simple. If they wanted to speak about _any_ function, why would they put the “_standard_” word in?

            The way it is written, this paragraph does apply to _you_ (as std library implementor), but does NOT apply to _me_ (as an app-level developer). As for why it is written this way – ask guys from WG21 ;-).

    • "No Bugs" Hare says

      BTW, as you sound like a person from MSFT – can you tell whether the absence of parallel version for transform() in 15.5.5 is intentional – or it is just an accidental omission?

  6. says

    “what-is-wrong-with-the-code-above […] spoiler: it is all about overhead” – One could argue it’s all about shared state, really. The overhead is just a consequence of handling the shared state (that’s what your naïve solutions in this article do), and a (the) survival technique is to identify such state and either refactor it (the classic “divide the loop into N chunks and give each chunk its own copy of the state, and then summarize” – basically apply map-reduce, whether via specialized algorithms or by hand) or refrain from parallelizing altogether, which of course sometimes will just be the best course of action.

    “it is all about overhead” feels misleading as the “what-is-wrong”, don’t you think? I’ll be happy to read the next article, see where you’re going.

    • "No Bugs" Hare says

      Well, with “overhead” being a direct and inevitable result of “shared state”, I don’t see much point in arguing about it (it becomes a purely-terminology-argument, so it is always possible to argue it either way ;-( )). Strictly speaking, this example exhibits all four of SLOW (Starvation-Latency-Overhead-Waiting) components in a classification-proposed-by-HPX – but to get observed 90x slowdown, you do have to rely on a huge overhead, there is no way around it 😉 .

And FWIW, the next post went out 5 minutes ago or so (and yes, it includes both manually-implemented reduce-like stuff and the library-provided one – which happen to be within 2-3% of each other performance-wise 😉 ).

  7. Eno says

Saying it has been made easy to use parallelism does not necessarily mean parallelism is always beneficial. That seems to be an assumption you make yourself (unconsciously maybe). As with every effort to speed up your code, the only way to know if it is an improvement is to measure. That counts for parallelism, just as it counts for other coding. It doesn’t matter if parallelism is done in a simple way like this, or in a difficult way. In both situations you still have to measure to know if it is useful. So one can still safely say that parallelism has been made easier, in my opinion. And saying that the example code comes directly from cppreference does not mean anything. They probably just show an easy-to-understand example, regardless of whether it is useful to use that example code directly in your own code. That should always be judged by the programmer.

    • "No Bugs" Hare says

      > That should always be judged by the programmer.

Which is EXACTLY the point of the OP. OTOH, I STRONGLY disagree with the parallel tutorials which mention mutexes within the first few pages (and no, I do NOT agree that “should always be judged” excuses the authors of such tutorials). Tutorials are by definition written for NOVICES, and for them – mutexes (AND those situations-which-need-mutexes) should be banned outright, plain and simple.

  8. Thor Lancaster says

    In addition to the advice in this article…

    Any parallelization should be performed at the highest level possible. For example, if you wanted to sort the words alphabetically in 100 text files, it would be much better to have a single thread working on each file than to try to parallelize each sorting operation.

    If you are only doing expensive inter-thread data sharing to start large tasks that each run sequentially, the overhead will be negligible.
