Using Parallel <algorithm> Without a Clue: 90x Performance Loss Instead of 8x Gain

 
Author: "No Bugs" Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 


Parallel vs Single-Threaded

Ubi nihil vales, ibi nihil velis
Where you are worth nothing, there you should want nothing

— Arnold Geulincx, XVII century —

Disclaimer: if you’re doing parallel programming for a living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programming can be made easy-as-pie – make sure to read it.

With C++17 supporting1 parallel versions of the std:: algorithms, there are quite a few people saying “hey, it became really simple to write parallel code!”.

Just as one example, [MSDN] wrote: “Only a few years ago, writing parallel code in C++ was a domain of the experts.” (implying that these days, to write parallel code, you don’t need to be an expert anymore).

I always had my extremely strong suspicions about this position being deadly wrong, but recently I made an experiment which demonstrates Big Fat Dangers(tm) of implying that parallelization can be made as simple as just adding a policy parameter to your std:: call.


1 well, at least on paper; to the best of my knowledge, both libstdc++ and libc++ do NOT support it yet (all my GCC/Clang compilers, as well as Godbolt, are choking on #include <execution>)
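
BTW, if you want your code to keep compiling on toolchains which don’t ship <execution> yet, one possible guard (just a sketch on my side; HAS_PARALLEL_STD is my own macro name, and experimental implementations may or may not define the feature-test macros) is along the following lines:

// Sketch: enable the parallel path only where <execution> is actually usable.
// HAS_PARALLEL_STD is my own macro name, not anything standard.
#ifdef __has_include
#  if __has_include(<execution>)
#    include <execution>
#    ifdef __cpp_lib_execution
#      define HAS_PARALLEL_STD 1
#    endif
#  endif
#endif
#ifndef HAS_PARALLEL_STD
#  define HAS_PARALLEL_STD 0  // fall back to the sequential versions
#endif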

 

Task at Hand

Let’s consider the following very simple scenario: we have to calculate the sum of the elements in a million-element integer array. The code we’re starting from is as simple as

template<class Iterator>
size_t seq_calc_sum(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(begin, end, [&](int item) {
    x += item;
  });
  return x;
}

When running this code on my pretty old Windows box (compiled with MSVC VS2017 in Release mode2) – it takes about 450us.


2 as noted above – as of now and to the best of my knowledge, MSVC is the only compiler+lib able to handle this kind of stuff; note that even in MSVC it is still “experimental”
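
In case you want to reproduce the measurement yourself – the exact harness is in [demo] below, but a minimal sketch of this kind of timing (the names here are mine, not from the original harness) could look like:

#include <chrono>
#include <iostream>
#include <vector>

// seq_calc_sum as defined above
int main() {
  std::vector<int> v(1000000, 1);  // the million-element array from the Task at Hand
  auto start = std::chrono::high_resolution_clock::now();
  auto sum = seq_calc_sum(v.begin(), v.end());
  auto stop = std::chrono::high_resolution_clock::now();
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
  std::cout << "sum=" << sum << ", took " << us << "us\n";
  // in practice, the run is repeated many times and the results are averaged
  return 0;
}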

 

Adding parallelism: Take 1 (badly broken)

First, we have to note that simply adding std::execution::par to our std::for_each() call will NOT work. If we try to write something like

//DON'T USE: WILL CAUSE WRONG RESULT
template<class Iterator>
size_t par_calc_sum_deadly_broken(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(std::execution::par,begin, end, [&](int item) {
    x += item;//data race; often causes wrong calculation result(!)
  });
  return x;
}

– it will compile and run, but we’ll easily get a wrong result (in my experiments with a million-element array, the result was wrong each and every time, but YMMV, which only makes things worse <sad-face />).
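
One cheap way to catch this kind of breakage while experimenting (a sketch on my side; check_par_sum is my own helper name, not from the original code) is to cross-check the parallel result against the known-good sequential one:

// Sketch: cross-check the parallel result against the sequential baseline.
// Needs <iostream> in addition to the headers used above.
template<class Iterator>
void check_par_sum(Iterator begin, Iterator end) {
  size_t seq = seq_calc_sum(begin, end);
  size_t par = par_calc_sum_deadly_broken(begin, end);
  if (seq != par)
    std::cout << "data race detected: parallel sum " << par << " != " << seq << "\n";
}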

Adding parallelism: Take 2 (90x performance hit)

IF we were observant enough to notice this problem – and to find the neat recommendation in [CppReference] – we’d realize that in addition to specifying std::execution::par, we also have to use std::mutex (or std::atomic) to make our program correct.

Ok, so our next (still very naive BTW) version would be along the following lines:

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_mutex(Iterator begin, Iterator end) {
  size_t x = 0;
  std::mutex m;
  std::for_each(std::execution::par,begin, end, [&](int item) {
    std::lock_guard<std::mutex> guard(m);
    x += item;
  });
  return x;
}

This does work correctly – and if we take a look at taskman, we’ll see that it DOES use all the cores (4 physical x 2-due-to-HT = 8 logical ones in my case). And if we didn’t measure the performance against the sequential version – we could think that everything is good here. Not so fast <sad-face />.

Measurements have shown that the function above takes about 40 milliseconds of wall-clock time; so instead of the expected speedup of about 4x-8x, we get about a 90x slowdown compared to the sequential version

(BTW, if you have doubts and want to run it yourself – the code is available at [demo]).
To make things even worse, the code above is written strictly along the lines recommended in [CppReference] (actually, it is almost exactly the same code).

Adding parallelism: Take 3. Atomics to the rescue? (still 50x performance hit)

Ok, as a next step we could think “hey, mutexes are bad for performance, so we should use atomics instead”. So, we rewrite it as

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_atomic(Iterator begin, Iterator end) {
  std::atomic<size_t> x = 0;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    x += item;//changing it to x.fetch_add(item) doesn't change anything - nor should it
  });
  return x;
}

Well, it is still correct, AND (as expected) it is faster than our previous mutex-based version. The problem is that

It is still 50x slower than the sequential one

Oh, and BTW – replacing std::execution::par with std::execution::par_unseq (with an assertion on x.is_lock_free() to prevent potential for deadlocks due to vectorization) didn’t make it better.3


3 it is unclear whether using std::atomic when it is_lock_free() is safe for par_unseq; IMO it is, but there are voices out there that formally it isn’t; in any case, currently MSVC doesn’t really implement par_unseq (falling back to par), so as of now, it is a moot issue.
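
For completeness, here is what such a par_unseq variant might look like (the function name is mine; the assertion is the just-in-case is_lock_free() check mentioned above) – and as noted, on my box it didn’t perform any better than the par version:

//DON'T USE: DEADLY SLOW (shown only to illustrate the par_unseq variant discussed above)
template<class Iterator>
size_t par_calc_sum_atomic_unseq(Iterator begin, Iterator end) {
  std::atomic<size_t> x = 0;
  assert(x.is_lock_free());  // just-in-case check; needs <cassert>
  std::for_each(std::execution::par_unseq, begin, end, [&](int item) {
    x += item;
  });
  return x;
}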

 

Results

Box #1 (4 physical, 8 logical cores):
  non-parallelized:                           470+-4us
  std::execution::par with std::mutex:        41200+-900us (90x slower, 600x+ less power-efficient)
  std::execution::par with std::atomic:       23400+-140us (50x slower, 300x+ less power-efficient)
  std::execution::par_unseq with std::atomic: 23400+-140us (50x slower, 300x+ less power-efficient)

Box #2 (2 physical, 4 logical cores):
  non-parallelized:                           900+-150us
  std::execution::par with std::mutex:        52500+-6000us (60x slower, 200x+ less power-efficient)
  std::execution::par with std::atomic:       25100+-4500us (30x slower, 100x+ less power-efficient)
  std::execution::par_unseq with std::atomic: 21800+-2800us (25x slower, 100x+ less power-efficient)

As we can see, not only is our naive parallel code hopelessly inefficient, greatly increasing the CO2 footprint for absolutely no reason,4

it also punishes users with more powerful boxes

(more strictly, it seems that the more cores we have – the more penalty we get; this will start making sense in the second part of this mini-series).


4 a desire to utilize all the cores, or to make the program parallel, does not qualify as a reason

 

Intermediate Conclusions and To Be Continued

As we have seen from the results above, naive attempts to make our code parallel (while having no clue about the way parallel code works) can easily cause HUGE problems – ranging from wrong results and even crashes, to otherwise-correct programs slowing down by factors of 50x-90x.

In other words (arguing with the quote from [MSDN] cited in the very beginning of this post):

Writing parallel code in C++ is still a domain of the experts.5

OTOH, the point of the exercise above is NOT to say that it is not possible to write efficient code with parallel std:: functions (it is). However, to do it – it is still necessary to understand what we’re doing. An explanation of what-is-wrong-with-the-code-above is planned for the next post of mine (spoiler: it is all about overhead which happens to be waaay too high in the code above).


5 It MIGHT change with the async stuff such as [HPX], but current support for parallel algos in std:: (except for std::reduce()), which forces us to mix calculations with thread sync, is not going to make writing of correct-and-high-performance programs any simpler <sigh />
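
And since std::reduce() got an honorable mention in the footnote above (and comes up repeatedly in the comments below) – here is a sketch of what a reduce-based version might look like (the function name is mine; measurements are deferred to the next post):

template<class Iterator>
size_t par_calc_sum_reduce(Iterator begin, Iterator end) {
  // std::reduce combines per-chunk partial sums internally, so no mutex/atomic
  // is needed in user code; requires <numeric> in addition to <execution>.
  return std::reduce(std::execution::par, begin, end, size_t(0));
}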

 


Acknowledgement

Cartoons by Sergey Gordeev IRL from Gordeev Animation Graphics, Prague.

 


Comments

    • "No Bugs" Hare says

…and probably also some allocations (for good measure). Anyway – this post is not really for those people who know why it is so slow – but for those who don’t (and the whole point is that those-people-who-DON’T-understand-why-it-is-slow SHOULD NOT write parallel code until they DO understand it).

  1. std::everything_is_a_nail says

I’ve read the disclaimer, but to be honest this just shows that every tool (like std::for_each) can be misused. “Choose the right tool for the job” is a guideline that does not only apply to parallel programming, and I do not see how it invalidates the usability promises of parallel std algos.

    The article would be a lot better if it also included the right tool for the job: std::reduce, which does not need mutexes or atomics. What’s the performance of naively using that?

    • "No Bugs" Hare says

      The whole point of the article is to counter claims “hey, we can just make things parallel without having a clue of what we’re doing” (and IMNSHO, to replace for_each with reduce, we _have_ to have a clue). As for reduce() – sure, it would help in this particular case, but it is way too narrow for LOTS of real-world scenarios (also, rewriting not-so-trivial logic into reduce() is usually difficult; I’d rather go with HPX-style data-driven stuff which is more like most-of-existing-serial-code – AND it will outperform code-with-multiple-reduce()’s anyway as it doesn’t need to sync at the end of each op). In any case, given the designated target audience, it qualifies as a “very advanced” subject ;-).

  2. Benjamin Buch says

    std::for_each is the wrong algorithm, try std::reduce. It should massively speed up.

template<class Iterator>
size_t seq_calc_sum(Iterator begin, Iterator end) {
  return std::reduce(begin, end);
}

    for_each will speed up if you have independent operations. Forming the sum requires synchronization and must be done manually in for_each. But synchronization is always slow. However, it is not necessary to synchronize every call. reduce hides this implementation detail. I didn’t measure it, but it should be much faster. I would be happy if you could try it and write another post about the results. 😀

    • "No Bugs" Hare says

      Right (and there are some other ways to achieve the same thing), but to write it this way – you do have to have a clue about what you’re doing, which is the whole point of this post 🙂 .

    • says

Any parallel algorithm (including std::reduce) can cause performance degradation if parallel threads are trying to access sibling memory cells (so-called false sharing – a hardware-design effect which causes synchronization even if your code doesn’t need it in theory).

      • "No Bugs" Hare says

        TBH, false sharing looks quite unlikely in this particular context; after all, most of the data is constant, and any modifiable stuff is within std::reduce, where it _should_ have been taken care of by lib writers (BTW, this is achieved automatically if they’re using stack-only vars as accumulators, which I guess is quite likely).

  3. Andrew says

What are the results if you used std::accumulate, where everything is localized (no sharing or locking necessary)? I agree that developers shouldn’t blindly use the new shiny parallel algorithms without testing. However, the method used should’ve been accumulate from the beginning rather than for_each. It wasn’t the right algorithm to begin with.

    • "No Bugs" Hare says

      > It wasn’t the right algorithm to begin with.

      Arguable (in serial processing, there is no discernible difference between the two). And FWIW, parallel version of std::accumulate is known as std::reduce (and I’ll discuss it in the next instalment).

      • sv90 says

        The names of std::accumulate and std::reduce describe your intent better than std::for_each and hence one should prefer them always.

        • "No Bugs" Hare says

This is a style-like subject we can argue ad infinitum (I agree that the “intent” argument does fly to a certain extent, but also agree that a counterargument flies, that accumulate() is an unnecessary entity which has to die under Occam’s razor). More practically – just as one example, IF I need to get two things calculated (such as “sum” and “sum-of-squares”), writing it as TWO accumulates is (arguably) a premature pessimization (single loop with two adds will do it faster pretty much by definition due to caching and memory bandwidth issues), and writing a special two-component accumulator just to shoehorn accumulate for this purpose is unnecessarily-verbose (which reduces readability!) and overall ugly (that is, unless we’re doing it in parallel – THEN the tables are turned, more on it in next(next(this_post))). And IF I need to BOTH modify my vector AND calculate something over it – shoehorning accumulate() there is a recipe for disaster. Hence (one may argue with a straight face) accumulate is not ALWAYS the best thing to use. Q.E.D. (and overall, “always” is a waaaay too strong predicate to stand in the real world; even my favorite “mutex MUST DIE” doesn’t always stand 😉 – though it stands “as a Big Fat Rule of Thumb”).

  4. Marcello Maggioni says

Of course it is slower. You are trying to do on multiple cores something that is inherently sequential. There is no code that is parallelizable in the example you developed. All the code needs to be serialized to be correct. Typically, code that is run in parallel has a bunch of code run in parallel and some synchronization points here and there to exchange data between the cores. The ratio between synchronization points and parallelizable code needs to be as low as possible (ideally 0). Your example has only one line of code and it is a synchronization point, so the ratio is infinite.

    It’s not a C++ problem, this code would suck performance in any other parallel programming paradigm.

    • "No Bugs" Hare says

      > It’s not a C++ problem, this code would suck performance in any other parallel programming paradigm.

      Sure. But the point “you still have to have a clue of what you’re doing”, still stands.

    • Sothatsit says

Am I misunderstanding your point, because parallelising this code would simply require separating the data into 8 chunks, passing each to a separate thread to be summed, and then aggregating each thread’s results?

      Whether this would be more efficient for only a million integers is a question of benchmarking, however it would definitely be more efficient as the dataset grows. Therefore claiming that this is an “inherently sequential” problem is incorrect, or am I missing your point?

  5. Smitty says

    The problem, surely, is that the statement
    x += item
    is atomic.

    This problem (like every parallelizable problem?) requires the input data *and* the operation to be partitioned in parallel.

A naive implementation might consider loop unrolling and then recursive descent into a function which splits the input array into N_Core sub-arrays, recursively, until the input array had size(A) <= 2. Each descent would create N_Core threads waiting for the return value of the recursive functions.

    • "No Bugs" Hare says

      > The problem, surely, is that the statement x += item is atomic.

      Or more generally – that calculation is “too small” for the overhead incurred. I’ll try to write about it in the next post.

    • "No Bugs" Hare says

      Didn’t know about it, thanks for the link! While I am not sure whether it will fly with consumers (do we REALLY need more than 4 cores on our home/office boxes?) – it is IMO a Good Thing(tm) that more people will start to think about NUMA 🙂 .

      • Jesper Nielsen says

        The funny thing is, that for Threadripper, by default you’re not able to think about it for consumer software…
        “By default, Threadripper actually hides its NUMA architecture. AMD instead runs Threadripper in a UMA configuration”

          • Jesper says

            Well at least the inter-node latency is not supposed to be that bad, and since it has SMT the core can use the wait to perform useful work. I still haven’t had my hands on any Numa systems but it could be fun to play with, although I’m kinda restricted by being a C# coder so a little removed from the metal.

          • "No Bugs" Hare says

            > I still haven’t had my hands on any Numa systems

            Pretty much all the multi-socket servers are NUMA these days (and if you REALLY want it – it is possible to rent one for $69.99/month right now at https://www.leaseweb.com/dedicated-servers/build-your-own – no warranties of any kind, but I’d guess that 2xE5620 box is NUMA – at least it is almost-for-sure NUMA in hardware). Although TBH, NUMA-related issues on modern x64 boxes are not really noticeable – UNTIL you start looking for them.

    • "No Bugs" Hare says

      What kind of synchronization is meant, is not 100% clear (just a few lines above, when providing a rationale, it talks about _blocking_ synchronization being dangerous because of deadlocks – and atomic is non-blocking and cannot possibly cause any deadlocks). If speaking in terms of sequencing – it is still safe (and is practically safe for a pretty much any feasible implementation). Have to wait until final wording is released.

      EDIT: BTW, formally wording-you-referred-to says about “_standard_ functions”, which technically has nothing to do with my own functions ;-).

      • Billy O'Neal says

        > If it says “synchronizes with”, at all, you can’t use it in par_unseq. That means no atomics, no shared_ptr, etc.

        It’s likely that the current spec is written that way because specifying how you could use atomics in a way that wouldn’t create deadlocks is hard. 🙂

        > as you sound like a person from MSFT – can you tell whether the absence of parallel version for transform() in 15.5.5 is intentional – or it is just an accidental omission

        Yes, I implemented most of the parallel algorithms library. 15.5 doesn’t contain a complete parallel algorithms implementation. I believe the first release we are calling “complete” there is 15.7; I’m not sure if 15.6 contained transform or not.

        • "No Bugs" Hare says

          > specifying how you could use atomics in a way that wouldn’t create deadlocks is hard

          IMO, it is quite easy – as long as atomic is_lock_free() (or type is_always_lock_free()) – it should be inherently deadlock-safe (this certainly stands for x86/x64, but also for other platforms where is_lock_free atomics are based on hardware CAS instructions, and I don’t know of any platform where they aren’t). Any objections? (if not – it is probably a good idea to try pushing this clarification into the par_unseq spec…)

          > I believe the first release we are calling “complete” there is 15.7

          I see, thanks (but 15.7 is not out yet, right?)

          • Billy O'Neal says

            >as long as atomic is_lock_free() (or type is_always_lock_free()) – it should be inherently deadlock-safe

            But with atomics, users can create their own spinlocks, which would not be deadlock safe. Consider:

std::atomic_flag f = ATOMIC_FLAG_INIT; // always lock free

{ // inside parallel algorithm
  while (f.test_and_set()) /* spin until we get the flag */;
  // do protected thing
  f.clear();
}

If you run this in a GPU-like context (the intent of par_unseq), and the warp decides to take the “didn’t get the flag” branch here, you deadlock: all 31 other “threads” in your warp will forever not be the winner of the test-and-set, and the hardware won’t ever try to run the lucky winner thread because a join point is never reached.

            You can build things like atomic_flag here out of all the other atomic operations (except perhaps plain load and store?), so right now the standard just says no, not ever. Even no synchronization can still be helpful in a lot of domains.

            There are probably examples of patterns that could be made safe (maybe reference counting?) but for now the standard says “if it says Synchronizes with thou shalt not call it in par_unseq”.

          • "No Bugs" Hare says

Well, IF you really want it – you can get into trouble even with atomics 😉 (via no less than a busy loop – gimme a break…). As for the standard – my current reading of it is that it says essentially that _with par_unseq it is the developer’s responsibility to avoid deadlocks even if things get interleaved_ – and my implementation (with a just-in-case assert() added) is perfectly fine under this understanding (and under pretty much any sensible implementation too).

          • Billy O'Neal says

            I don’t quite get how you can read http://eel.is/c++draft/algorithms.parallel#exec-6 as “atomics are OK”

            “A standard library function is vectorization-unsafe if it is specified to synchronize with another function invocation, or another function invocation is specified to synchronize with it, and if it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions may not be invoked by user code called from execution​::​parallel_­unsequenced_­policy algorithms.”

          • "No Bugs" Hare says

IMO it is obvious: as your quote explicitly starts with a reference to “_standard_ library functions”, and as I am not writing a _standard_ library, this whole paragraph _explicitly_ does NOT apply to my app-level code, plain and simple. If they wanted to speak about _any_ function, why would they put the “_standard_” word in?

            The way it is written, this paragraph does apply to _you_ (as std library implementor), but does NOT apply to _me_ (as an app-level developer). As for why it is written this way – ask guys from WG21 ;-).

    • "No Bugs" Hare says

      BTW, as you sound like a person from MSFT – can you tell whether the absence of parallel version for transform() in 15.5.5 is intentional – or it is just an accidental omission?

  6. says

    “what-is-wrong-with-the-code-above […] spoiler: it is all about overhead” – One could argue it’s all about shared state, really. The overhead is just a consequence of handling the shared state (that’s what your naïve solutions in this article do), and a (the) survival technique is to identify such state and either refactor it (the classic “divide the loop into N chunks and give each chunk its own copy of the state, and then summarize” – basically apply map-reduce, whether via specialized algorithms or by hand) or refrain from parallelizing altogether, which of course sometimes will just be the best course of action.

    “it is all about overhead” feels misleading as the “what-is-wrong”, don’t you think? I’ll be happy to read the next article, see where you’re going.

    • "No Bugs" Hare says

      Well, with “overhead” being a direct and inevitable result of “shared state”, I don’t see much point in arguing about it (it becomes a purely-terminology-argument, so it is always possible to argue it either way ;-( )). Strictly speaking, this example exhibits all four of SLOW (Starvation-Latency-Overhead-Waiting) components in a classification-proposed-by-HPX – but to get observed 90x slowdown, you do have to rely on a huge overhead, there is no way around it 😉 .

And FWIW, the next post went out 5 minutes ago or so (and yes, it includes both manually-implemented reduce-like stuff and the library-provided one – which happen to be within 2-3% of each other performance-wise 😉 ).

  7. Eno says

Saying it has been made easy to use parallelism does not necessarily mean parallelism is always beneficial. That seems to be an assumption you make yourself (unconsciously maybe). As with every effort to speed up your code, the only way to know if it is an improvement is to measure. That counts for parallelism, just as it counts for other coding. It doesn’t matter if parallelism is done in a simple way like this, or in a difficult way. In both situations you still have to measure to know if it is useful. So one can still safely say that parallelism has been made easier, in my opinion. And saying that the example code comes directly from cppreference does not mean anything. They probably just show an easy-to-understand example, regardless of whether it is useful to use that example code directly in your own code. That should always be judged by the programmer.

    • "No Bugs" Hare says

      > That should always be judged by the programmer.

Which is EXACTLY the point of the OP. OTOH, I STRONGLY disagree with the parallel tutorials which mention mutexes within the first few pages (and no, I do NOT agree that “should always be judged” excuses the authors of such tutorials). Tutorials are by definition written for NOVICES, and for them – mutexes (AND those situations-which-need-mutexes) should be banned outright, plain and simple.

  8. Thor Lancaster says

    In addition to the advice in this article…

    Any parallelization should be performed at the highest level possible. For example, if you wanted to sort the words alphabetically in 100 text files, it would be much better to have a single thread working on each file than to try to parallelize each sorting operation.

    If you are only doing expensive inter-thread data sharing to start large tasks that each run sequentially, the overhead will be negligible.
