‘Speedy Gonzales’ Serializing (Re)Actors via Allocators

 
Author: 'No Bugs' Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 
Serializing Allocators

As we briefly discussed in Part I of this mini-series [NoBugs17a], message-passing technologies such as (Re)Actors (a.k.a. Actors, Reactors, ad hoc FSMs, and event-driven programs) have numerous advantages – ranging from being debuggable (including post-factum production debugging), to providing better overall performance.

In [NoBugs17a] and [NoBugs17b], we discussed an approach to handling allocations for (Re)Actors, which allowed us to achieve a safe dialect of C++ (safe, that is, as long as we’re following a set of well-defined local rules). Now, let’s take a look at another task which can be facilitated by per-(Re)Actor allocators – specifically, the task of serializing (Re)Actors that are later going to be de-serialized by the same executable. While one solution for this task was provided in [Ignatchenko-Ivanchykhin16], the proposed ‘ultra-fast’ serialization is rather cumbersome to maintain, and in most cases it can be beaten performance-wise by serializing at the allocator level.

#define (Re)Actors

To make this article self-contained and make sure that we’re all on the same page with terminology, let’s repeat the definition of our (Re)Actors from [NoBugs17a].

Let’s name a common denominator for all our (Re)Actors a GenericReactor. GenericReactor is just an abstract class – and has a pure virtual function, react():

class GenericReactor {
public:
  virtual void react(const Event& ev) = 0;
  virtual ~GenericReactor() {}
};

Let’s name the piece of code which calls GenericReactor’s react() Infrastructure Code. Quite often this call will be within a so-called ‘event loop’ (see Listing 1).

//Listing 1
std::unique_ptr<GenericReactor> r = reactorFactory.createReactor(...);
while(true) { //event loop
  Event ev = get_event();
    //from select(), libuv, ...
  r->react(ev);
}

Let’s note that the get_event() function can obtain events from wherever we want; from select() (which is quite common for servers) to libraries such as libuv (which is common for clients).

Also let’s note that an event loop such as the one above is by far not the only way to call react(): I’ve seen implementations of Infrastructure Code ranging from one running multiple (Re)Actors within the same thread, to one which deserialized the (Re)Actor from a database (DB), called react(), and then serialized the (Re)Actor back to the DB. What’s important, though, is that even if react() can be called from different threads, it must be called as if it were always called from one single thread. In other words, if necessary, all thread sync should be done outside of our (Re)Actor, so react() doesn’t need to bother about thread sync regardless of the Infrastructure Code in use.

Finally, let’s name any specific derivative from GenericReactor (which actually implements our react() function) a SpecificReactor:

class SpecificReactor : public GenericReactor {
  void react(const Event& ev) override;
};

In addition, let’s observe that whenever the (Re)Actor needs to communicate with another (Re)Actor – adhering to the ‘Do not communicate by sharing memory; instead, share memory by communicating’ principle – it merely sends a message, and it is only this message which will be shared between (Re)Actors. In turn, this means that we can (and should) use single-threaded allocation for all (Re)Actor purposes – except for the allocation of those messages intended for inter-(Re)Actor communications.

Task: Serialization for the same executable

Now, let’s define what we’re going to do with our (Re)Actor in this article (and why). Basically, as discussed in [Ignatchenko-Ivanchykhin16], when dealing with (Re)Actors, we often need to serialize the current state of our (Re)Actor so that we can deserialize it in exactly the same executable (though this executable can run on a completely different computer etc.). Applications of such ‘serialization for exactly the same executable’ are numerous; in particular, it is useful for (see, for example, [NoBugs17c]):

  • Migrating our (Re)Actors (for example, to load-balance them)
  • Low-Latency Fault Tolerance for (Re)Actors
  • Production post-mortem debugging (serializing the (Re)Actor’s state, plus all its inputs after that state, and then transferring them to a developer’s box in case of a crash)

‘Speedy Gonzales’ (Re)Actor-level serialization

Now, as we have both our (Re)Actors and our task at hand more-or-less defined, let’s see how well we can do with our per-(Re)Actor allocator.

Actually, it is fairly simple:

  • We re-define the global allocator (for example, malloc()/free(), though depending on compiler specifics and options, YMMV) so that it acts as a bunch of per-(Re)Actor allocators (or at least per-thread allocators) – more on this below.
    This means that within our (Re)Actor, we do not need to specify allocators for each call to ‘new’ and for each collection <phew! />
  • Within our per-(Re)Actor allocator, we allocate/deallocate OS pages ourselves (via calls such as VirtualAllocEx() or mmap()); of course, we also keep a list of all the OS pages we’re using.
  • Whenever we need to serialize our (Re)Actor, we simply dump all the pages used by the allocator of this (Re)Actor (with the address of each page serialized alongside it) – that’s it! (See the sketch below this list.)
  • When we need to deserialize our (Re)Actor, we try to allocate OS pages at exactly the same (virtual) addresses as they were originally allocated at. If such allocation succeeds for all our serialized pages (which is common – though, strictly speaking, not guaranteed – when we’re deserializing a (Re)Actor into a dedicated-for-this-(Re)Actor process, which in turn is common for debugging), we just need to copy the pages back from the serialized form into the allocator (that’s it). If allocation at the same addresses isn’t possible for whatever reason, we have to use a process which is essentially similar to the relocation discussed in [NoBugs17a]. Very briefly:
    • We have a relocation map, which gives the mapping between ‘old’ page addresses and ‘new’ page addresses.
    • At OS level, we make sure the pages with ‘old’ addresses are unreadable.
    • We run traversing (as discussed in [NoBugs17a]) over the state of our (Re)Actor. During this traversing process, we merely try to access all the elements of the state of our (Re)Actor. Whenever any pointer within our (Re)Actor state happens to point to an ‘old’ page, such access will fail (causing an access violation exception). We catch each such exception, and update the pointer which caused the exception to the ‘new’ value within the corresponding ‘new’ page (calculating the ‘new’ value using the relocation map).
      Note that while the traversing of our own collections can easily be handled along the same lines as above, traversing and fixing standard collections can be outright impossible without adding a few lines to them ☹. How to handle this in practice? It depends, but one way to do it is to take a cross-platform STL implementation (such as EASTL), and to add the few lines implementing traversing for each collection you require (it is NOT rocket science for any specific STL).
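
To make the serialization bullet above more tangible, here is a minimal sketch of the ‘dump all pages’ logic. PerReactorAllocator, SerializedPage, and PAGE_SIZE are hypothetical names, and a real implementation would likely stream the pages out rather than collect them into a vector:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t PAGE_SIZE = 4096; //assuming 4K OS pages

struct SerializedPage {
  uintptr_t original_addr; //virtual address the page had when dumped
  uint8_t contents[PAGE_SIZE]; //raw page bytes
};

class PerReactorAllocator { //hypothetical per-(Re)Actor allocator
public:
  //serialization boils down to dumping every OS page we own,
  //  together with its original address
  std::vector<SerializedPage> serialize() const {
    std::vector<SerializedPage> out;
    out.reserve(pages_.size());
    for (void* page : pages_) {
      SerializedPage sp;
      sp.original_addr = reinterpret_cast<uintptr_t>(page);
      std::memcpy(sp.contents, page, PAGE_SIZE); //this memcpy() IS the serialization
      out.push_back(sp);
    }
    return out;
  }

private:
  std::vector<void*> pages_; //all the OS pages obtained via VirtualAllocEx()/mmap()
};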

Bingo! After such traversing of the whole state of our (Re)Actor is completed, we can be sure that all the pointers to the heap within our (Re)Actor are updated with the new values. In turn, this means that all the relocations due to deserialization are already handled, and we can proceed with normal (Re)Actor processing.
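
For the relocation map itself, the pointer fix-up boils down to translating an ‘old’ address into a ‘new’ one while preserving the offset within the page. A minimal sketch (assuming 4K pages; the names are illustrative only):

#include <cstdint>
#include <unordered_map>

constexpr uintptr_t PAGE_MASK = 4096 - 1;

//maps 'old' page base addresses to 'new' page base addresses
std::unordered_map<uintptr_t, uintptr_t> relocation_map;

//translates a pointer into an 'old' page into its 'new' equivalent
void* relocate(void* old_ptr) {
  uintptr_t p = reinterpret_cast<uintptr_t>(old_ptr);
  uintptr_t old_page = p & ~PAGE_MASK; //page base
  uintptr_t offset = p & PAGE_MASK; //offset within the page
  return reinterpret_cast<void*>(relocation_map.at(old_page) + offset);
}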

One very important property of such serialization is that it is extremely difficult to beat performance-wise. Deserialization is a slightly different story due to potential relocation, but it is still pretty fast. Also, for several of the use cases mentioned in the ‘Task: Serialization for the same executable’ section above, it is only the performance of serialization which really matters. Indeed, all we need to do is memcpy() large chunks of RAM, and with memcpy() speeds being of the order of 5–10 gigabytes/second at least for x64 (see, for example, [B14]), even serializing a (Re)Actor which has a 100MB state will take on the order of 10–20ms. Serializing the same thing using conventional serialization methods (even really fast ones, such as the one discussed in [Ignatchenko-Ivanchykhin16]) is going to be significantly slower. The exact numbers depend on the specifics of the organization of the data, but if we have a randomly filled 100MB std::map, just iterating over it without any serializing is going to take on the order of 100ms – almost an order of magnitude longer (!).

Parallel serialization aided by (kinda-)copy-on-write

For those apps where even 10-20ms of additional latency per 100MB of state is too much, it might be possible to reduce it further.

One of the implementations would work along the following lines (which are ideologically similar, though not identical, to classic Copy-on-Write):

When we want to serialize, we (in our ‘main’ (Re)Actor processing thread):

  • create a list of pages to be serialized
  • pre-allocate space in some other area of RAM where we want to serialize
  • for all the pages to be serialized, set an ‘already serialized’ parameter to false
  • mark all the pages to be serialized as ‘no-write’ (using VirtualProtect() or mprotect()).
  • start another ‘serialization’ thread (use an always-existing dedicated serialization thread, take one from a thread pool, etc.)
  • continue processing the (Re)Actor messages in the main thread

The ‘serialization’ thread just takes pages from the list to be serialized one by one, and for each such page:

  • checks whether it has already been serialized: if yes, skips it; if not, marks it as ‘already serialized’ (which should be done using an atomic CAS-like operation to prevent potential races with the main thread)
  • if it wasn’t already serialized:
    • serializes the page into pre-allocated space
    • removes the ‘no-write’ protection from the page

Whenever the main thread tries to write to one of the ‘no-write’ pages, we catch the appropriate CPU-level exception, and within the exception handler:

  • check whether the page being accessed has already been serialized (this can happen due to races with the ‘serialization’ thread); this should be done in an atomic manner, similar to the ‘serialization’ thread as described above
  • if the page isn’t serialized yet:
    • serialize it into the pre-allocated space
    • remove the ‘no-write’ protection from the page, so future writes no longer cause any trouble.

That’s it. While with such processing we do have to copy some of the pages within the main thread (causing some latency), for typical access patterns this will happen relatively rarely, significantly reducing the overall serialization latency observed within the ‘main’ (Re)Actor thread. For example, if out of our 100MB (~=25K pages) of (Re)Actor state, only 1000 pages are modified during our 20ms serialization – then the latency cost of the serialization will drop by approximately a factor of 25x, bringing observable latency to around 1ms (which is acceptable for the vast majority of the systems out there, even for first-person-shooter games). A sketch of this scheme is shown below.
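
To illustrate, here is a minimal sketch of this scheme for POSIX, using mprotect() plus a SIGSEGV handler (on Windows, VirtualProtect() plus a vectored exception handler would play the same roles). All the names here are illustrative, the linear page lookup would be replaced with an O(1) one in real code, and – strictly speaking – mprotect() is not formally async-signal-safe (though this technique is widely used in practice):

#include <atomic>
#include <csignal>
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

constexpr size_t PAGE_SIZE = 4096;
constexpr size_t N_PAGES = 25600; //~100MB of state, as in the example above

struct PageEntry {
  char* addr; //start of the 'live' page
  char* dst; //pre-allocated space to serialize into
  std::atomic<bool> serialized; //'already serialized' flag
};
PageEntry g_pages[N_PAGES];

//serializes one page exactly once; the atomic CAS prevents races between
//  the 'serialization' thread and the exception handler in the main thread
void serialize_page(PageEntry& e) {
  bool expected = false;
  if (e.serialized.compare_exchange_strong(expected, true)) {
    std::memcpy(e.dst, e.addr, PAGE_SIZE); //copy the page out
    mprotect(e.addr, PAGE_SIZE, PROT_READ | PROT_WRITE); //re-enable writes
  }
}

//the 'serialization' thread simply walks the list of pages
void serialization_thread() {
  for (auto& e : g_pages)
    serialize_page(e);
}

//SIGSEGV handler: the main thread tried to write to a not-yet-serialized page
void on_write_fault(int, siginfo_t* si, void*) {
  char* addr = static_cast<char*>(si->si_addr);
  for (auto& e : g_pages) { //real code: O(1) lookup by page address
    if (addr >= e.addr && addr < e.addr + PAGE_SIZE) {
      serialize_page(e); //after this, the faulting write is restarted and succeeds
      return;
    }
  }
  //not one of our pages: real code should re-raise with the default action
}

//called by the main (Re)Actor thread to start serialization
void start_serialization() {
  for (auto& e : g_pages) {
    e.serialized.store(false);
    mprotect(e.addr, PAGE_SIZE, PROT_READ); //mark as 'no-write'
  }
  struct sigaction sa = {};
  sa.sa_sigaction = on_write_fault;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, nullptr);
  //...then launch serialization_thread() and continue processing events
}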

Per-(Re)Actor allocators in a usable manner

Up to now, we have merely been assuming that our allocators can be made per-(Re)Actor; one obvious way of doing this is to have our (Re)Actor code specify our own allocator for each and every allocation within our (Re)Actor (we’d need to cover both explicit calls to new, and all implicit allocations such as those within collections).

While such a naïve approach would work in theory, it is way too inconvenient to be used in practice. Fortunately, changing an allocator to a per-(Re)Actor one happens to be possible without any changes to the (Re)Actor code. In particular, it can be done along the following lines.

First, we replace malloc()/free(). (Important: make sure that your global ::operator new/::operator delete and your default std::allocator also use the replaced functions(!). The latter might be rather tricky unless your std library already uses ::operator new()/::operator delete(), but usually it can be taken care of; in particular, for GCC, see [GCC] and the --enable-libstdcxx-allocator option for ./configure of libstdc++.)
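
As a minimal sketch of the ::operator new/::operator delete part (assuming our replaced malloc()/free() are already in place; the std::allocator part is library-specific, as noted above):

#include <cstdlib>
#include <new>

//forwards global operator new to our replaced malloc(), so that 'new'
//  within a (Re)Actor also goes to the per-(Re)Actor allocator
void* operator new(std::size_t sz) {
  if (void* p = malloc(sz))
    return p;
  throw std::bad_alloc{};
}

void operator delete(void* p) noexcept {
  free(p);
}

void operator delete(void* p, std::size_t) noexcept { //C++14 sized deallocation
  free(p);
}

//NB: a production version should also handle sz==0 and the
//  nothrow/aligned overloads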

To implement our own malloc(), we’re going along the lines of Listing 2. (Of course, free() should go along the same lines; see the sketch right after the listing.)

//Listing 2
thread_local OurAllocator* current_allocator = nullptr;
void* malloc(size_t sz) {
  if( current_allocator )
    return current_allocator->malloc(sz);
  else
    return non_reactor_malloc(sz);
}
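
For completeness, free() might look as follows; non_reactor_free() is the hypothetical counterpart of non_reactor_malloc() above:

void free(void* p) {
  //NB: this simple dispatch assumes that each block is freed under the same
  //  current_allocator it was allocated under (which holds within the
  //  (Re)Actor model); a production version would typically derive the
  //  owning allocator from the pointer itself (e.g., from its page)
  if( current_allocator )
    current_allocator->free(p);
  else
    non_reactor_free(p);
}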

The point here is that our Infrastructure Code (the one which calls our (Re)Actor) sets the current_allocator pointer before every call to GenericReactor::react() (see Listing 3).

//Listing 3
current_allocator = create_new_allocator();
std::unique_ptr<GenericReactor> r = reactorFactory.createReactor(current_allocator,...);
current_allocator = nullptr;
while(true) { //event loop
  Event ev = get_event();
    //from select(), libuv, ...
  current_allocator = r->allocator;
  r->react(ev);
  current_allocator = nullptr;
}
current_allocator = r->allocator;
r.reset();
current_allocator = nullptr;

Of course, this is a kind of trick – but it will work. Very briefly: first, we confine our current_allocator variable to the current thread by using thread_local, and then, within this single thread, we can easily control which allocator is currently in use by simple assignments within our Infrastructure Code. One thing to remember when using this approach is to make sure that we set current_allocator before each and every method call into our (Re)Actor (including its constructor and destructor(!)); a small RAII guard, sketched below, helps to avoid forgetting this.
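
One possible way to make these assignments both harder to forget and exception-safe is a small RAII guard along the following lines (a sketch; the name AllocatorGuard is hypothetical):

class AllocatorGuard {
public:
  explicit AllocatorGuard(OurAllocator* a)
    : saved_(current_allocator) {
    current_allocator = a; //set on entry...
  }
  ~AllocatorGuard() {
    current_allocator = saved_; //...restore on exit, even if react() throws
  }
  AllocatorGuard(const AllocatorGuard&) = delete;
  AllocatorGuard& operator=(const AllocatorGuard&) = delete;

private:
  OurAllocator* saved_;
};

//usage within the event loop of Listing 3:
//  AllocatorGuard guard(r->allocator);
//  r->react(ev);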

That’s it: we’ve made our (Re)Actor use a per-(Re)Actor allocator – and without changing a single line within our (Re)Actor’s code too ☺.

Summary

To summarize this part III of the mini-series on ‘Allocators for (Re)Actors’:

  • Allocator-based serialization for (Re)Actors is both:
    • Easy to implement in a very generic manner, and
    • Extremely fast (for x64 – around tens of ms per 100MB of state)

If necessary, parallel serialization may reduce latencies further (in some cases, down to – very roughly – single-digit ms of latency per 100MB of state).

  • Allocators can be replaced to make them per-(Re)Actor, without any changes to the (Re)Actor code(!)

And as this part III concludes this mini-series, let’s summarize all our findings (from all three parts).

Part I

  • (Re)Actor-based allocators allow for very efficient allocation, with three potential modes:
    • ‘Fast’ mode (no protection, but faster than regular malloc())
    • ‘kinda-Safe’ mode (with protection from some of the memory corruptions)
      Here, we introduced a potentially novel method of implementing ‘dangling’ pointer detection at runtime – one based on ID comparison. Compared to traditional ‘tombstones’, it has better locality, and will usually outperform them.
    • ‘kinda-Safe with Relocation’ mode, with the added ability to relocate heap objects (this, in turn, allows us to avoid the dreaded external fragmentation, which tends to eat lots of RAM in long-running programs).
  • A simple ‘traversing’ interface is sufficient to ensure that all the pointers in the (Re)Actor state are updated

Part II

By adding a few more easily understood guidelines, we can extend our ‘kinda-Safe’ mode from Part I into a ‘really safe’ C++ dialect.

All the guidelines/rules we need to follow are local, which enables reasonable tool-aided enforcement and allows us to keep the code maintainable.

Part III

  • A custom (Re)Actor-based allocator can be used for the all-important (for (Re)Actors) task of serialization for the same executable. It is (a) very easy to maintain at the (Re)Actor-code level, and (b) extremely fast.
  • Per-(Re)Actor allocators can be implemented without any changes within the (Re)Actor itself (i.e. all the necessary changes can be confined to the Infrastructure Code).

Phew. It was a rather long mini-series, but I hope I have made my points about the significant advantages of allocators specialized for (Re)Actor purposes reasonably clear.


References


Acknowledgements

This article was originally published in Overload Journal #141 in October 2017 and is also available separately on the ACCU website. Re-posted here with the kind permission of Overload. The article has been re-formatted to fit your screen.

Cartoons by Sergey Gordeev IRL from Gordeev Animation Graphics, Prague.


Comments

  1. Jesper says

    Have you considered reactor-local DIY “pointer” types? Raw performance might suffer but perhaps being able to use 32 bit addressing instead of 64 might make things more cache friendly? Obviously it’s not easy to do this transparently, putting a bit of a burden on the developers. I’m expanding my experiment from before (still C#) to use structs inside arrays and use array indexing instead of object references. Only for map topology for now – dynamic and polymorphic objects like mobs are a bit trickier – in particular players because they can move between zones. There are solutions of course like composition over inheritance and copying by value instead of passing by reference but it’s a bit beyond what I have the time for atm.

    • "No Bugs" Hare says

      > Have you considered reactor-local DIY “pointer” types?

      Did you see http://ithare.com/a-usable-c-dialect-that-is-safe-against-memory-corruption/ ? In this model (which we’re currently trying to release as open source, together with an allocator based on this post) – both “owning pointers” and “weak pointers” are customized and MAY be implemented as 32-bit (TBH, we didn’t think about it, but it is certainly an option as they’re perfectly encapsulated). As for “naked” pointers – under the model above, they have to stay 64-bit for 64-bit systems, but as they’re inherently temporary (and on-stack only), I have doubts whether replacing them is necessary / beneficial…

      What do you think?

      > I’m expanding my experiment from before (still C#) to use structs inside arrays and use array indexing instead of object references.

      If you didn’t watch Lakos’ talk on CPPCON2018 – you should. He discusses very similar things there. Very briefly – it is all about locality, and I’ve seen “flattened” read-only representations (replacing on-heap vectors with “in-place” non-growable analogues) outperform usual STL stuff by as much as 5x (!) (the idea was to reduce memory consumption, but improving performance by 5x was a really nice side effect ;-)).

      • Jesper Nielsen says

        Are you talking about his “Local (Arena) memory allocators” talk from 2017? I did watch it, and it has in fact contributed to the idea of doing the experiment I mentioned, in addition to thinking more about locality of reference and caching because of several of your articles here.

        It was also that talk that made me dream of having reactors – or in general subsystems within a process – with their own address spaces and relative “pointers”.

        The “actual” address of the subsystem root can’t be stored in the pointer type for obvious reasons – a scope guard in combination with a thread local pointer, or stack of pointers, to the root(s) could be a possible solution for accessing it easily. Not a very pretty solution for an OO person who is normally wary of global variables but with logic encapsulated in “safe pointers” this implementation detail can be hidden from reactor developers.

        In principle it should be possible – the same way it’s possible for an OS to assign a process its own address space. And it should be possible for a process to do this without too much of a performance hit, the same way a virtual machine is able to run with good performance, with the exception – (simplification even)- that reactors running on the same threads are scheduled with the granularity of React() calls, instead of being scheduled nondeterministically by the OS. (Obviously the infrastructure threads can still be scheduled by the OS – process and thread priorities might help)
        A programming language could provide support for this – I have to reread your “usable c dialect” article, since I only skimmed it some time ago.

        • "No Bugs" Hare says

          Yes, this is the talk (sure it was 2017 not 2018 ;-)).

          > The “actual” address of the subsystem root can’t be stored in the pointer type

          But at least for (Re)Actors it can be stored in thread_local – see above. What we’re doing looks pretty close to your thoughts – it is just that we allow conversion from “owning/weak pointer” (which are encapsulated) to a “naked” pointer (which is “native”) – but “naked” pointers are restricted to be very temporary, so they won’t be able to hurt much (and by the time we’re out of react(), there are no “naked” pointers whatsoever(!)). And as for owning/weak pointers – we’ll think about an option to make them relative (might simplify deserialization too).

  2. Jesper Nielsen says

    Preliminary results with grid cells and line segments (walls) stored as structs in arrays instead of as objects on heap with references in lists (similar to std::vector) :

    Simple line of sight algorithm runs 35% faster. At first I thought it was 300% but that was because of another inefficiency in the old code 🙂

    This possibly only confirms that object access using array indices is a lot faster than object access using arrays of references, which is not surprising.

    To see if array indices can compete directly with object references I need to run another test, where “Next” references are compared to “Next” indices. Pathfinding will do nicely:)

    • "No Bugs" Hare says

      If I understand you correctly, you’re removing one level of indirection, which is inherently a Good Thing(tm), and 35% looks a perfectly feasible number in such scenarios (each indirection over large scattered data structures, when caching doesn’t work well, can easily cost 100 CPU cycles, which is a Damn Lot(tm)).

      • Jesper Nielsen says

        Exactly – that’s why I’m comparing apples to oranges here and not really learning anything about references vs array indices from the example.
        But 35% is 35% either way and I’m pleased – even more pleased with the actual 300% better performance because of an “accidental” optimization of the purely numerical line traversal algorithm. In particular given how critical this algorithm is for the world simulation.
