Parallel Coding: From 90x Performance Loss To 2x Improvement

 
Job Title:Sarcastic Architect
Hobbies:Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 


Parallel vs Single-Threaded: Speed vs Power Consumption

Disclaimer: if you’re doing parallel programming for a living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programs can be achieved just by adding a few things here and there to your usual single-threaded code – make sure to read it.

In my previous post, we observed pretty bad results for calculations as we tried to use mutexes and even atomics to do things in parallel. OTOH, I promised to show how parallel <algorithm> CAN be used both correctly and efficiently (that is, IF you need it at all, which is a separate question); this is what we’ll start discussing in this post.

Parallel Code Which Made It 90x Slower on 8-Core Box

As mentioned in the previous post, the code below – while being correct – was observed to be REALLY REALLY slow (90x slower than the original, and that’s while using all 8 cores instead of one):

//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_mutex(Iterator begin, Iterator end) {
  size_t x = 0;
  std::mutex m;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    std::lock_guard<std::mutex> guard(m);
    x += item;
  });
  return x;
}

And BTW, while replacing the mutex with an atomic did improve things, it was still 50x slower than the original serialized code.

The Big Fat Overhead Problem

The problem we’re facing here is actually a problem of overhead. It is just that the cost of a thread context switch (which is likely to happen pretty often as we’re trying to obtain the lock_guard) is sooo huge,1 that the addition (which costs about 1 CPU cycle2) happens to be about zero-cost compared to the costs of synchronization. That’s exactly what happens here – and in this extreme case, we have this generic problem amplified sooooo badly that ALL the calculations are actually serialized by this mutex (so effectively, there is no parallelism at all3).


1 in general, it can be within 2000-1’000’000 CPU cycles [NoBugs2016], though here it is on the lower side of things, as cache invalidation doesn’t really happen
2 NB: this is the wrong place to go into a long discussion of whether it is 1 CPU cycle or we should speak about it statistically being 3/4th of a CPU cycle
3 however, if not for the overhead, 100% serialization would mean only that we’re not gaining anything from going parallel; it is the overhead which is responsible for 90x slowdown
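To see for yourself that it is the per-element synchronization (and not parallelism per se) which kills performance, the same take-the-lock-for-EVERY-element pattern can be re-created with bare std::thread (so it can be reproduced even without a parallel-<algorithm>-capable toolchain). This is a hedged sketch of mine – locked_sum and all the other names are assumptions, NOT the code from [demo2]:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Re-creation of the per-element-lock anti-pattern: every single addition
// pays for a lock acquisition (and, under contention, for context switches),
// while the addition itself costs ~1 CPU cycle.
size_t locked_sum(const std::vector<int>& v, unsigned nthreads) {
  size_t total = 0;
  std::mutex m;
  std::vector<std::thread> pool;
  size_t chunk = v.size() / nthreads;
  for (unsigned t = 0; t < nthreads; ++t) {
    pool.emplace_back([&, t] {
      size_t from = t * chunk;
      size_t to = (t + 1 == nthreads) ? v.size() : from + chunk;
      for (size_t i = from; i < to; ++i) {
        std::lock_guard<std::mutex> guard(m); // contended on EVERY element(!)
        total += v[i];
      }
    });
  }
  for (auto& th : pool) th.join();
  return total;
}
```

Timing this against a plain sequential loop (with std::chrono, say) reproduces the same orders-of-magnitude gap discussed above – the result is correct, but all the real work is serialized behind the mutex.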

 

Tackling The Overhead Manually

Let’s see what we could do to speed things up; note that the code below is NOT intended to be code-for-real-world-use – it is rather code-to-provide-some-“feeling”-of-what-is-going-on. While performance-wise it is a huge improvement over the previous code, compared to the alternatives-which-we’ll-see-later it is certainly NOT the best option.

  //DON'T USE: UGLY, ERROR-PRONE, AND A-BIT-SLOWER-THAN-OPTIMAL
  size_t TestFuncManualParSum::run(RandomAccessIterator begin, RandomAccessIterator end) {
    std::atomic<size_t> x = 0;
    constexpr int NCHUNKS = 128;
    assert( (end-begin) % NCHUNKS ==0 );//too lazy to handle it for sample code
    RandomAccessIterator starts[NCHUNKS];
    size_t sz = (end - begin) / NCHUNKS;
    for (int i = 0; i < NCHUNKS; ++i) {
      starts[i] = begin + sz * i;
      assert(starts[i] < end);
    }
    std::for_each(std::execution::par, starts, starts + NCHUNKS, [&](RandomAccessIterator start) {
      size_t y = 0;
      for (auto it = start; it < start + sz; ++it) {
        y += *it;//NO synchronization here
      }
      x += y;//synchronization
    });
    return x;
  }

Associative Property: “Within an expression containing two or more occurrences in a row of the same associative operator, the order in which the operations are performed does not matter as long as the sequence of the operands is not changed.” — Wikipedia

The basic idea here is to avoid synchronization on each and every element; instead, we’re splitting our array into 128 chunks – and running for_each over these large chunks (and within each chunk, we’re calculating a partial sum in a local variable y, avoiding any synchronization until the very end of the chunk).4 Let’s note that in the code above, we’re actually relying on the sum being associative to achieve the speed improvement (we DO change the order of additions compared to the sequential code).
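To make the reliance on associativity concrete, here is a purely sequential model of the same re-association (illustrative sketch of mine, not the benchmark code): per-chunk partial sums are computed first and then combined, and for integer ‘+’ this is guaranteed to match the plain left-to-right sum.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sequential model of what the chunked parallel code does: compute a
// chunk-local partial sum y, then fold the partials into x. For integer '+',
// re-associating the additions this way cannot change the result.
size_t chunked_sum(const std::vector<int>& v, size_t nchunks) {
  size_t sz = v.size() / nchunks; // assuming v.size() % nchunks == 0, as in the sample
  size_t x = 0;
  for (size_t c = 0; c < nchunks; ++c) {
    size_t y = 0; // chunk-local partial sum - needs NO synchronization
    for (size_t i = c * sz; i < (c + 1) * sz; ++i)
      y += (size_t)v[i];
    x += y; // combining partials re-orders the additions vs the sequential code
  }
  return x;
}
```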

If we run this code – we’ll see that it is ACTUALLY FASTER than the sequential one (in my measurements – see also the table below – by a factor of almost 2x on an 8-core box, while using all 8 cores).

This is a HUGE improvement over our previous 90x-slower-than-sequential code – AND is the first time we’ve got it faster than sequential.


4 As noted above – it is NOT a production-level code, so I didn’t bother with creating an iterator-which-jumps-over-each-sz-elements, and created an ugly temporary array instead; however, it should be good enough to illustrate the point

 

Enter std::reduce()

Well, while we did manage to improve speed by using multiple cores in our previous code <phew />, it is NOT the kind of code we really want to write at the app level (and that’s to put it mildly). Fortunately, there is a function which will do pretty much the same thing (actually, even a bit better) for us: it is std::reduce().

The point of std::reduce() is that it – just like our code above – exploits the fact that the operation-which-is-used-for-reduce (the default is ‘+’) is associative.5 With it, the whole thing we wrote above can be written as follows (it is MUCH less ugly, MUCH less error-prone, and a tiny bit more efficient):

  //DO USE
  size_t TestFuncReduceParSum::run(Iterator begin, Iterator end) {
    return std::reduce(std::execution::par, begin, end, (size_t)0);
       //NB: (size_t)0 specifies not only an initial value, but also the TYPE of the accumulator
       //    which ensures that accumulation is done in terms of size_t, and NOT in terms of *begin
       //    This, in turn, is necessary to keep our calculation EXACTLY the same as our previous ones
  }
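The accumulator-type point from the comment above can be demonstrated with the policy-free overload of std::reduce() (kept policy-free here so the snippet compiles without a parallel-capable standard library); the function names are mine, for illustration only:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// std::reduce() accumulates in the type of its init value, NOT in the element
// type - so the choice of the init value literally changes the arithmetic.
size_t sum_wide(const std::vector<std::uint8_t>& v) {
  return std::reduce(v.begin(), v.end(), std::size_t{0}); // accumulates in size_t
}
std::uint8_t sum_narrow(const std::vector<std::uint8_t>& v) {
  return std::reduce(v.begin(), v.end(), std::uint8_t{0}); // accumulates in uint8_t: wraps mod 256
}
```

With 300 elements equal to 1, sum_wide() returns 300, while sum_narrow() wraps around to 44 – exactly the kind of silent difference the (size_t)0 in the code above guards against.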

Interestingly enough, when I ran the code with std::reduce(), I observed that it was merely 2-3% faster6 than the manual parallelization we used above. While it is certainly not a proof, it is an indication that the methods used within std::reduce() are conceptually similar to the stuff-we-wrote-ourselves-above.7 This approximation, in turn, can be useful to “feel” what we can realistically expect from std::reduce(). After all, there is NO magic involved, and everything-what-happens-in-the-world can (and IMO SHOULD) be explained at least at a very high level of abstraction (and BTW, if you have a better explanation – without going into “we’re experts, just trust us to do things right” – please LMK).


5 or at least almost-associative, which happens with floats, more on it in the next post
6 actually, given the standard deviation being 3%, it qualifies as “barely noticeable”
7 except for “work stealing” which is most likely used by std::reduce(), but I don’t think we’re ready to discuss work stealing at this point

 

Per-Element Mutations

In terms of std::reduce() we can express LOTS of parallelizable things (a bit more on related trickery in the next post), but all of them are inherently limited to our array-being-processed being unmodified in the process (in other words, std::reduce() can be seen as a “pure” function). However, as in C++ we’re NOT confined to the world of “pure” functions such as std::reduce(), we MAY consider certain mutation operations over our large collection. Having learned from our experience with mutexes, though, we WON’T try to modify more-than-one-single-element of our collection in our parallel algorithm. And as long as all our mutations are limited to only one element of our array – we are good to go without mutexes / atomics / etc. <phew />:

void TestFuncForEachParModify::run(Iterator begin, Iterator end) {
  std::for_each(std::execution::par, begin, end, [](int& item) {
    item += 2;//modifies ONLY the current element
  });
}
In general, this is The Right Way(tm) of doing things in parallel; however – in this particular case of ultra-simple ops (which happen to make our app consume all the memory bandwidth, at least on my box) – the parallel version still happens to be slower than the sequential one(!).8 This is NOT to say that parallel mutations are always slower than serial ones – but rather that while with some experience it MIGHT be possible to predict which parallelization WON’T work (one such example is our mutex-based calculation above), it is next-to-impossible to predict which parallelization WILL work for sure; hence – whatever we think about our parallelization, it is necessary to test its performance (and under conditions which are as close as possible to real-world ones).
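On the “make sure to test its performance” point: a minimal wall-clock helper of the kind needed for such measurements might look like the sketch below (my own illustration, NOT the actual harness from [demo2]):

```cpp
#include <cassert>
#include <chrono>

// Times a single run of `f` in wall-clock microseconds. For real benchmarking
// you'd run many iterations and report mean +- stddev (as in the table below),
// and measure CPU time separately to estimate power consumption.
template <class F>
double elapsed_us(F&& f) {
  auto t0 = std::chrono::steady_clock::now();
  f();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```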

Results

| Method | Cores Used | Wall-Clock Time (us) | CPU Time (us) | Wall-Clock Compared to Sequential | Power Consumption Compared to Sequential (guesstimate) |
|---|---|---|---|---|---|
| “Pure” (mutation-free) Calculation | | | | | |
| sequential (for) | 1 | 2960±20 | 2960 | 1x | 1x |
| sequential (for_each)9 | 1 | 4480±50 | 4480 | 1.5x slower | 1.5x higher |
| std::accumulate()10 | 1 | 2940±30 | 2940 | 1x | 1x |
| std::reduce(seq) | 1 | 2960±60 | 2960 | 1x | 1x |
| naive std::mutex11 | 8 | 201’000±4000 | 1’600’000 | 70x slower | 300x higher |
| naive std::atomic11 | 8 | 68’600±20 | 550’000 | 25x slower | 100x higher |
| manual chunking | 8 | 1620±50 | 13000 | 1.82x faster | 2.5x higher |
| std::reduce(par) | 8 | 1580±50 | 12600 | 1.87x faster | 2.5x higher |
| Mutation | | | | | |
| sequential (for) | 1 | 3310±20 | 3310 | 1x | 1x |
| sequential (for_each) | 1 | 3300±4 | 3300 | 1x | 1x |
| sequential (in-place transform) | 1 | 4510±30 | 4510 | 1.36x slower12 | 1.3x higher |
| manual chunking | 8 | 3540±90 | 28300 | 1.07x slower | 4x higher |
| for_each(par) | 8 | 3520±100 | 28100 | 1.06x slower | 4x higher |

8 once again, manually-parallelized version performs pretty much the same as the best-library-provided one, see [demo2] for source
9 To the best of my understanding, it SHOULD NOT be the case; hopefully it is merely a flaw in MSVC’s handling of lambdas
10 see [demo2]
11 from previous post, see also [demo2]
12 probably – but not 100% sure – this is another under-optimization by MSVC

 

 

Hey fellas, don’t be jealous
When they made him they broke the mould
So charismatic with an automatic
Never prematurely shooting his load

— 'Man for All Seasons' song from Johnny English film —

My observations from the table above:

  • All sequential methods are equivalent (all the differences are within measurement errors).
    • The only exceptions are related to sequential for_each() and in-place transform() – but hopefully they’re merely flaws in MSVC’s optimizing-out of lambdas; OTOH, the for_each() case does highlight the risks of using lambdas in performance-critical code even in 2018(!).
  • Out of the parallel methods – those which serialize too often (“naive” versions above) fail BADLY performance-wise.
  • Those parallel methods which serialize “rarely enough” MAY (or may not) outperform the serialized version latency-wise, but are always losing energy-consumption-wise; in fact, it is a UNIVERSAL observation that parallel versions are (almost-)always losing to serial ones in terms of power consumed, and CO2 footprint. Which in turn means that – as long as serial code provides good-enough response times from the end-user perspective – DON’T PARALLELIZE; you’ll do a favour BOTH to your user (she will be able to use the cores for something else, AND will pay less in electricity bills), AND to the planet. Or from a bit different perspective – out of all the premature optimizations, premature parallelization is the most premature one.
    • mutexes are evil;13 even atomics are to be avoided as long as possible
    • Even IF we’re mutex-free, parallelization MAY slow things down. So after you found the bottleneck and parallelized – make sure to re-test performance – and under close-to-real-world conditions too.
    • FWIW, the effects of our manual parallelization look very close to those of std::reduce(); this is NOT to argue for doing things manually – but to get a very-simplified idea of how std::reduce() does its magic internally.

13 especially so as you’re reading this post (see disclaimer at the very beginning)

 

Intermediate Conclusions and To Be Continued

In the previous post, we took a look at “how NOT to write parallel programs” (which is apparently where LOTS of parallel programming tutorials will lead you). In this one, we have taken a look at “how efficient parallel programs CAN be written”. In the next (and hopefully last-in-this-mini-series) post we’ll discuss some guidelines-and-hints for NOVICE parallel programmers, which MAY allow speeding up their programs here and there, without the risk of jeopardizing program correctness – and without the risk of slowing things down by 50x either. Sure, it won’t be the-very-best optimization – but it MAY bring SOME benefits to the table (and it is rather low-risk too, so as a whole it MAY fly).


References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Comments

  1. Jesper Nielsen says

    When I made some tests before with a substantial workload/sync ratio I would get very close to 200% performance from 2 threads on separate cores, and around 160% performance on the same core (hyperthreading) – scaling at least to 4 cores (8 logical).
    One thing I didn’t do though was test with large memory usage so cache invalidation wasn’t much of an issue.
    I’m getting inspired to perform a more realistic experiment for my purposes: A “large” number of quadratic maps each with a “large” number of dummy monster objects moving around and trying to kill each other, respawning when they die.
    Each map is a work item, and a game-loop consists of updating all maps – possibly in parallel.
    In particular I want to see the effect of:
    Always processing maps on the thread they ran on last gameloop
    Always processing them on a different thread
    Preferring to process on same thread but allow work stealing
    Object pools vs new / Garbage collector
    Shared object pool
    Per thread object pool (perhaps allowing pool stealing similar to work stealing – or letting thread local object pool “spill over” into a shared object pool)
    Per map object pool

    • "No Bugs" Hare says

      > I would get very close to 200% performance from 2 threads on separate cores,

      Sure, that can happen easily (if there is no contention and you’re NOT memory-bound). Still, it will be “a bit less” than 200% :-).

      > and around 160% performance on the same core (hyperthreading)

      Obviously possible too, though specific number HUGELY depends on the nature of load (interplays between threads in SMT/hyperthreading are next-to-impossible to predict).

      > One thing I didn’t do though was test with large memory usage so cache invalidation wasn’t much of an issue.

      From my experience, UNLESS you have high-contention mutexes – this is NOT likely to be a problem (no warranties of any kind, batteries not included). OTOH, synchronizing even 1000 times per video frame can easily be catastrophic 🙁 (practically, I’ve seen thread-context-switch-costs-including-cache-invalidation to be very roughly within 10K-50K CPU cycles, and wasting 50K*1000*60fps = 3000M cycles per second means wasting about 100% of the available CPU power 🙁 ).

      > I’m getting inspired to perform a more realistic experiment for my purposes…
      > In particular I want to see the effect of:…

      This would be REALLY interesting (and BTW, if by any chance you’ll ever want to write about the results – I would be REALLY happy to post this kind of research here). My wild guess is that two most important things will be (a) eliminating contention on mutexes etc., and (b) improving memory locality – but the farther we go into analysis, the more app-specific we get, so my guess can be FAR off in this case.

      • Jesper Nielsen says

        I’ll get back to you with results:)
        I’m making this test in C#/.Net so some things – like assigning processor affinities to specific threads – are a bit out of my reach unless I do P/Invoke or rely on deprecated APIs.
        I’ve seriously considered moving my server codebase to C++ – especially after discovering that many of the things we “just do” in C# because the garbage collector allows us to be lazy when making regular business software won’t cut it – and also because of newer things like smart pointers and boost.asio that weren’t around back when I dabbled in C++. But for now I feel it would be too large of a setback to start from scratch again and having to deal with that learning curve while also keeping a dayjob to make it feasible.

        Locks and other syncronization methods in C# aren’t “that” bad as long as the contention is low because they’re implemented in a light-weight manner, only utilizing system-mutexes when wait times become “large”.
        This means using locks for securing queues is quite feasible as long as you do serious work with whatever you dequeue 🙂 It can even compete with a ConcurrentQueue for dequeueing items in batches (not supported by ConcurrentQueue unfortunately)

        Locality of reference is somewhat out of my hands, although objects allocated at the same time are typically near each other in memory. I think that would be hard to get right even for C++ though when dealing with hordes of monsters moving around all over the map the chance of them being on the same page as their neighbor becomes slim over time. There is bound to be a lot of cache misses with such a dynamic model no matter how you slice it I believe, unless you go for objects stored in arrays with move semantics etc. instead of pointers/references. I don’t really want to go there too much because it makes polymorphism difficult.

        I think it’s important – even though going for a very simplistic model of a game – not to make it too simple because you risk making a microbenchmark that doesn’t really tell you anything. Therefore I’ll also be using a simple grid-based spatial index for entities similar to the one used in the actual server code.

        After all the goal of this is to take what I learn from this experiment and then apply it to the actual server code.
        So far my expectations based on single threaded performance is to be able to handle ~500 CCU/core @ 15-20 hz, not taking into account resources needed for networking – I hope those will be significantly less than resources needed for simulation but I really have no idea…

        • "No Bugs" Hare says

          > like smart pointers and boost.asio that weren’t around back when I dabbled in C++.

          Take a look at co_await – it will work with any asynchronous I/O (and TBH, I wouldn’t use boost:: and would get directly to OS calls instead – or at least to libuv on the Client-Side).

          > only utilizing system-mutexes when wait times become “large”.

          And spinning like crazy while it is “small”; TBH, whenever you have to resort to spinlocks – chances are you’re doing something wrong 🙁 .

          > when dealing with hordes of monsters moving around all over the map the chance of them being on the same page as their neighbor becomes slim over time.

          The question here is: which are the memory patterns to access these monsters? If monsters are large and are mostly rendered/simulated – locality is not going to be too bad (as long as each monster itself is compact – i.e. allocated-as-one-single-chunk). Yes, you will get cache misses when iterating through the monster list, but one-single-cache-miss is nothing compared to rendering. And if you need to find-some-monster-quickly (which is rather likely for the server-side) – a special very-local structure can be devised for it too.

          • Jesper Nielsen says

            I’m talking purely serverside here, so no rendering 🙂
            I don’t think I can find the time for rewriting the server in C++ (in addition to (re)learning C++ 🙂 ) but perhaps if I do a few POC simulations like this demonstrating massive performance differences I would change my mind:)

            Preliminary results are in.
            Hardware: Intel I7 6700, 4 cores (8 logical), L1: 4×32 KB L2: 4×256 KB L3: 8 MB

            Benchmark:
            1000 zones, each a 1000×1000 square.
            1000 mobs/zone, all vs all scenario, no pathfinding or blocking – movement is just directly towards target.
            Zone has a simple grid of fixed-size cells for spatial lookups (find nearest enemy within range)
            Each zone is implemented as a Reactor with an Update() method that is called each game tick.
            This method updates all mobs in the zone. Absolutely no shared data is touched inside the Update() method.

            Sequential:

            private void UpdateSequential()
            {
            foreach (var zone in Zones)
            {
            zone.Update();
            }
            }

            Parallel – yes it’s really that easy when you don’t access shared data:)
            Not accessing shared data is the tricky part in real life though…

            private void UpdateParallelFor()
            {
            Parallel.ForEach(Zones, zone => { zone.Update(); });
            }

            Results:
            Memory usage ~200-600MB depending on grid cell sizes between 16 and 64 with smaller cell size equaling higher memory usage. (GC.GetTotalMemory())
            350%-380% parallel performance vs sequential – best for cell size around 64 it seems.
            Varying parameters ( number of zones, sizes, number of mobs) I’ve been able to get to around 465% parallel performance – but with much less memory usage.
            I just reran another less memory-intensive parallel test I had made before (~15MB usage) and got around 600% vs sequential.
            Theoretic maximum is somewhere between 400% and 800% because of hyperthreading.
            Sync overhead is less than 1%, tested by only using 1 mob/zone.

            Conclusions (so far):
            Even with no sharing of data the cache limit seems to be an important bottleneck (my guess is L3 contention is the biggest sinner since it’s shared by all cores but I have no idea how to test this assumption, because L2 and L1 contention only affects hyperthreading and there are many other factors at play here because logical threads on the same core also share the ALU)

            There might also be effects from objects belonging to different threads sharing memory pages. In theory I can’t do anything about that because the CLR is free to move stuff around but in practice I would like to see if it makes a difference if the initial allocation of a zone and everything in it is done on the same thread that updates it.
            This gives me another reason to want to play with (semi-)permanently assigned threads for reactors.

            If it makes a difference at all then I would like to see the effect of work-stealing too, and perhaps also delve into methods to avoid oscillation if it’s observed as a problem (implement hysteresis?).

            Effect of object layout is something I really would like to test. Examples:
            Collection types.
            DIY dynamic arrays with reactor bound array pool for resizing
            Intrusive linked lists (prev/next references)
            Intrusive arrays/lists for faster removal (Mob knows its location in the list – on removal plug the hole with the last Mob on the list instead of moving the entire tail)
            “hole plugging” list removal without intrusion on the object – still O(N) reads but perhaps it doesn’t matter at all?

          • "No Bugs" Hare says

            As it is 100% server-side, and you already split it into zones, a question – do you REALLY need to go parallel? Usually, it is better to split it into smaller zones, and that’s it :-). Going into parallel stuff on the Server-Side can easily lead to tons of strange difficult-to-test problems (for example, if you’re going parallel with NTHREADS=NCORES, you SHOULD ensure that there is EXACTLY ONE process which goes parallel like this – which might be doable, but simple inter-(Re)Actor scaling without going into intra-(Re)Actor parallelism is usually simpler and performs better – YMMV, batteries not included).

          • Jesper says

            I’m not sure if I understand 100% what you mean? Are you asking why the zones share a heartbeat because that’s really the only intra-reactor thing going on.

          • Jesper says

            Wait the heartbeat is inter-reactor. I don’t have any intra-reactor parallelism

          • Jesper Nielsen says

            I had to reread what you wrote, and I think you just misunderstood the code I posted because of the names UpdateSequential() and UpdateParallelFor()? These are just methods of the World class, and as far as I can see I am actually doing simple inter-(Re)actor scaling for this benchmark, since my zones are the reactors – deterministic even (except for a single Random object per zone that could have its seed recorded if need be). Zone.Update() doesn’t spin up any more threads or queue any worker items.
            (You could argue that the World class is itself a (super)reactor with intra-reactor parallelism I guess?)

            I don’t know what the optimal number of threads is. In my experiment – pure simulation – it’s equal to the total number of logical cores, but an actual MMO world server would also need resources for networking.

            Anyway – more results:
            It doesn’t matter in my environment if it’s always the same thread calling Update() on a zone. I guess it shouldn’t surprise me that this is the case – after all both L3 and RAM is equally accessible to all cores, and L2 and L1 is very unlikely to hold any data specific to one zone after updating another. My guess is they’re only able to hold very fresh short-term data (and perhaps also very long term data like often used method tables)
            With Numa I think this picture would change – possibly big time. Perhaps I should give it a try some time – I can rent a baremetal cloud AMD EPYC 7401P (24 cores, 4 Numa nodes) for $1/hour so it’s not going to break the bank:)

            All in all it’s been a fun little experiment that’s given me a lot of ideas for going multithreaded. Unfortunately it has also made me a little more pessimistic about the practical scalability gains of multiple cores than I was before.

            And I’ve got a nice little testbed now to try things out and get results fast without polluting my “real” code.
            Some things I found were surprising. Among the strangest things I found was how incredibly hard it is to choose between array based dynamic lists and “intrusive” linked lists (prev and next references for mob objects) both per zone and per cell. Things like cell size, mob density, mob behaviour etc. all makes huge differences because array based lists generally are superior for iteration while linked lists are vastly superior for addition and especially removal.

            Not so surprising was that object pools beat allocations. But I was still surprised by how much.

            I guess a million mobs running around and fighting, living long enough to get promoted by the GC a couple of times before dying leaves a bloody mess on the battlefield that noone wants to clean up…

          • "No Bugs" Hare says

            > misunderstood the code I posted because of the names UpdateSequential() and UpdateParallelFor()

            …rather because of Parallel.ForEach() within. But you’re right – as long as it is only about (Re)Actors – AND you DO know how exactly the threads are created (to avoid creating N-threads-per-each-incoming-request-from-user, which is obviously catastrophic), it should be ok. However, such a system implies one single _shared_ pool of zones – which in turn prevents you from scaling beyond one single box, am I right?

            > It doesn’t matter in my environment if it’s always the same thread calling Update() on a zone.

            Performance-wise, it does (and significantly too). Cache invalidation (when your state jumps cores) is an ugly beast, PLUS NUMA memory-to-core affinity is also a factor (though I cannot say how important the latter is for your app).

            > after all both L3 and RAM is equally accessible to all cores,

            Whether we like it or not, this does NOT stand for the server-side (servers are overwhelmingly multi-socket these days; “workhorse” server is 1U/2S for last 20 years, and is NUMA for last 15 years, and TBH I don’t see it changing; ultra-cheap Atom- or ARM-based stuff which is indeed single-socket doesn’t really fly for serious server-side).

            > array based lists generally are superior for iteration while linked lists are vastly superior for addition and especially removal.

            Just thinking aloud – did you consider something like deque (IIRC, it is implemented as a list of fixed-size arrays, so you MIGHT get a good balance between iteration and addition/removal)?

            > Not so surprising was that object pools beat allocations. But I was still surprised by how much.

            You should watch https://cppcon2017.sched.com/event/Bgsc/local-arena-memory-allocators-part-1-of-2 (plus part 2). It is consistent with my experience with it : my take on it is that “whatever penalty we can think of because of poor memory locality, we will be underestimating” ;-(

        • Jesper says

          What I meant was – there was absolutely no measurable difference in my test environment. And I think it’s because after going through 125 zones there isn’t a trace left of zone 1 in L1 or L2 – or even L3 – so not much is won by running it again on the same thread. Numa I expect to be totally different of course, because the state is (permanently?) bound to Ram local to a node.

          Nothing stops me from scaling out to other boxes. A global heartbeat is not a requirement – the only requirement is that each zone can’t be moved from one heartbeat to another.

          So far my guess is that work-stealing will work fine as long as a zone stays on the same Numa node – and each Numa node should probably be given its own heartbeat to avoid rotten apple issues.

          Parallel.Foreach was just the fastest way I could write it 🙂 I’ve since moved to designated threads and a producer-consumer pattern with blocking queues. Not much performance difference at this point but it’s the way I must move to ensure reactor-thread affinity

  2. Billy O'Neal says

    > except for “work stealing” which is most likely used by std::reduce()

    The MSVC++ implementation only uses work stealing for std::sort; all other ops just chop the work in pieces and throw it at the system thread pool. Work stealing has some overhead and really only benefits you when your work items vary in size a lot, which is unusual for typical functors passed to reduce(). We use it for sort due to quicksort’s behavior where the left and right partition may have wildly varying sizes, and the algorithm is n lg n making overhead less noticeable.

    Note that reduce enables commutative and associative math on our implementation which can allow floats to be vectorized, even for no execution policy or for std::execution::seq.

    > All sequential methods are equivalent (all the differences are within measurement errors).

    On MSVC the execution::seq overloads are implemented by calling the serial versions, except the parallel versions with seq run the algorithm inside a noexcept function to implement http://eel.is/c++draft/execpol.seq#2

    template<class _FwdIt, class _Pr> inline
    _FwdIt _Remove_if_unchecked(_Sequenced_policy_tag, const _FwdIt _First, const _FwdIt _Last, _Pr _Pred)
        { // remove each satisfying _Pred, serially
        return (_STD remove_if(_First, _Last, _Pred));
        }

    template<class _FwdIt, class _Pr> inline
    _FwdIt _Remove_if_unchecked(_Parallel_policy_tag, const _FwdIt _First, const _FwdIt _Last, _Pr _Pred)
        { // remove each satisfying _Pred, in parallel
        /* parallel implementation omitted */
        }

    // _NOEXCEPT on the parallel signature enforces termination
    template<class _ExPo,
        class _FwdIt,
        class _Pr,
        _Enable_if_execution_policy_t<_ExPo>/* = 0 */>
    _NODISCARD inline _FwdIt remove_if(_ExPo&& _Exec, _FwdIt _First, const _FwdIt _Last, _Pr _Pred) _NOEXCEPT
        { // remove each satisfying _Pred
        _DEBUG_RANGE(_First, _Last);
        return (_Rechecked(_First,
            _Remove_if_unchecked(_STD forward<_ExPo>(_Exec), _Unchecked(_First), _Unchecked(_Last), _Pass_fn(_Pred))));
        }

    > The only exceptions are related to sequential for_each(), and in-place transform() – but hopefully they’re merely a flaw in optimizing lambdas out by MSVC; OTOH, for_each() case does highlight risks of using lambdas in performance-critical code even in 2018(!).

    Sounds like a bug in which the optimizer team would be interested if you’d care to send our way.

    • "No Bugs" Hare says

      > Work stealing has some overhead and really only benefits you when your work items vary in size a lot,

      Sure.

      > which is unusual for typical functors passed to reduce().

      For an absolutely-generic lib such as std:: this is arguable – but it is your design choice (and you had to make your guesses), so the only thing I can do is to keep it in mind :-). It also explains why my manual version is pretty much within margin of error of std::reduce().

      > Sounds like a bug in which the optimizer team would be interested if you’d care to send our way.

I recently had 3 codegen bugs opened against a certain Eric 😉 ; while 2 of them are reportedly fixed – I’d still probably wait with reporting under-optimizations until they fix the last codegen one (I positively _hate_ codegen bugs, in my books (pun intended) they’re ranking WAAAAAAAY above any other class of bugs).

  3. Andrea says

    Really interesting, could you please elaborate on why “[…] here it is on the lower side of things, as cache invalidation doesn’t really happen” as you say in note 1 at the very beginning?
    Isn’t each thread actually just stomping on the same variable “x” during the parallel for_each execution? Or am I missing something?

    • "No Bugs" Hare says

      > each thread actually just stomping on the same variable “x” during the parallel for_each execution?

Yes, it does, but it is a synchronization cost, not a cache invalidation cost. Cache invalidation occurs when we have just loaded a megabyte into L2 cache (or 10M into L3) – and suddenly run into a mutex, which causes a context switch – and another thread starts to load its data from L3/RAM (because the 1st thread has eaten all the L2/L3). This is one of the worst possible patterns performance-wise – but fortunately, it DOES NOT happen here.

      • Jesper Nielsen says

It looks to me like x’s page will get invalidated in L1 and possibly also L2 every time it’s updated? Depending on the architecture, that page must then be loaded from either L3 or RAM.

        • "No Bugs" Hare says

          > x’s page will get invalidated in L1 and possibly also L2 every time it’s updated?

It depends on the terminology we’re using, on MESI/MESIF/MOESI, and on who-knows-what. Still, I tend to see it as a case of synchronous processing (with invalidation being a side effect) rather than of pretty-much-asynchronous cache invalidation.
