Ultra-fast Serialization of C++ Objects

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

Task definition

Recently, we were working on a system which required an extremely fast (ideally, the fastest possible) serialization of the state of the Reactor/Finite State Machine (FSM). In addition, we knew for sure that deserialization would happen with exactly the same executable; in other words, we didn’t care at all about either (a) cross-platform issues or (b) extensibility.

Where it came from

The whole task comes from exploiting deterministic Reactors/FSMs. As discussed in [NoBugs15] and [NoBugs16], as soon as we have a deterministic Reactor/FSM, it is possible to use this determinism to achieve such things as production post-mortem analysis, and low-latency fault tolerance. For example, for post-mortem analysis, it is sufficient to write all the inputs of the deterministic Reactor/FSM, and in the case of a crash to replay them from the very beginning.

On the other hand, keeping the whole history of the Reactor inputs is usually impractical, so we need to resort to some kind of ‘circular buffer’ [NoBugs15]. To be able to observe the last N seconds of the life of the Reactor/FSM before the crash, the ‘circular buffer’ needs to contain (a) a snapshot of the current state of the Reactor/FSM, and (b) all the inputs received after this snapshot is taken. To achieve low-latency determinism-based fault tolerance, the logic is more complicated, but the snapshot of the current state is still required.
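To make the 'snapshot plus inputs since the snapshot' idea a bit more tangible, here is a minimal sketch. CounterReactor and SnapshotLog are hypothetical names invented for this illustration (they are not from [NoBugs15]), and a plain copy of the state stands in for real serialization.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical deterministic Reactor: its state depends only on the
// sequence of inputs it has processed so far.
struct CounterReactor {
  int64_t sum = 0;
  void react( int input ) { sum = sum * 31 + input; }
};

// Hypothetical recorder: one snapshot plus every input received after it.
struct SnapshotLog {
  CounterReactor snapshot;       // (a) state at the moment of the snapshot
  std::vector<int> inputsSince;  // (b) inputs received after the snapshot

  void takeSnapshot( const CounterReactor& r ) {
    snapshot = r;         // a plain copy stands in for serialization here
    inputsSince.clear();  // inputs before the snapshot are no longer needed
  }
  void recordInput( int input ) { inputsSince.push_back( input ); }

  // Rebuilds the latest state by replaying logged inputs over the snapshot.
  CounterReactor replay() const {
    CounterReactor r = snapshot;
    for( int input : inputsSince )
      r.react( input );
    return r;
  }
};

// Runs a live Reactor next to the recorder and checks that replaying the
// log reproduces the live state.
inline bool replayMatchesLive() {
  CounterReactor live;
  SnapshotLog log;
  log.takeSnapshot( live );
  for( int i = 1; i <= 10; ++i ) {
    if( i == 6 ) log.takeSnapshot( live );  // periodic snapshot
    log.recordInput( i );
    live.react( i );
  }
  return log.replay().sum == live.sum && log.inputsSince.size() == 5;
}
```

The more often takeSnapshot() is called, the shorter the input log gets, which is exactly why the cost of taking a snapshot matters so much.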

And as soon as we’ve said ‘we need to make a snapshot’, we need to serialize our state one way or another. Moreover, we need to do it Damn Fast – otherwise this debugging/fault tolerance feature would become too expensive. On the positive side – in practice, serialization will happen to memory (and in case of a production post-mortem – it won’t even be used in any way until program crashes), so we’ll be dealing with purely serialization code, with very little overhead to mask any of our performance blunders.

Note that in both these cases we can be 100% sure that we’ll be deserializing this state on the executable which is identical to the executable which serialized the state. In other words, all the usual serialization/marshalling problems such as different alignments, endianness, etc. do NOT apply here. 🙂

One more case when we know for sure that it is exactly the same executable is when we’re serializing data for inter-thread transfers within the same process; as a result, the techniques discussed below will work in this case too. However, whether our serialization is optimal in such scenarios is not that obvious. In some cases (specifically, if you do NOT need to reconstruct a modifiable state on the receiving side and are just passing messages around), flattening techniques such as those used by FlatBuffers MAY still happen to be faster (on the deserialization side, that is).

The fastest way to serialize – C

Now, as we have our task defined as ‘the fastest possible serialization for in-memory structure, assuming that it will be deserialized by exactly the same executable’, we can start thinking about implementing it.

First, let’s consider serializing a state in a C program.

Usually, FSM/Reactor state can be described as a kind of generalized tree, with each of the nodes being a C struct, and containing ‘owning’ pointers to other allocated C structs. As a simple example, see Listing 1.

//LISTING 1
struct Y {
  int yy;
};

struct X {
  int xx;
  struct Y* y; // allocated via
               // malloc(); 'owning' pointer
  int z;
};

And the fastest way to serialize struct X will be something along the lines of Listing 2. Unless we’re resorting to some trickery with allocators or ‘flattening’ of our original structure, it is extremely difficult to beat this code performance-wise.

//LISTING 2
struct OutMemStream { 
  uint8_t* pp;
  uint8_t* ppEnd;
}; 

inline void writeToStream( struct OutMemStream* dst,
                           const void* p, size_t sz ) {
  assert( dst->pp + sz <= dst->ppEnd ); //in the real-world,
                                   //think what to do here 
  memcpy( dst->pp, p, sz );
  dst->pp += sz;
}

void serializeX( struct OutMemStream* dst, const struct X* x ) {
  writeToStream( dst, x, sizeof(struct X) );
  writeToStream( dst, x->y, sizeof(struct Y) );
  //that's it! 
}

Deserialization would work along the lines of Listing 3. Deserialization is inevitably slower than serialization (there is an expensive malloc() within, ouch) – but it is pretty much inevitable for the kind of data structure we’re working with. Also, for our use cases described above, deserialization will happen MUCH more rarely than serialization (on program crash or on hardware catastrophic failure), so we don’t really care too much about the performance of deserialization – we just need it to work.

//LISTING 3
struct InMemStream {
  uint8_t* pp;
  uint8_t* ppEnd;
};

inline void readFromStream( struct InMemStream* src,
                            void* p, size_t sz ) {
  assert( src->pp + sz <= src->ppEnd );
  memcpy( p, src->pp, sz );
  src->pp += sz;
}

void deserializeX( struct InMemStream* src, struct X* x ) {
  readFromStream( src, x, sizeof(struct X) );
    // x->y contains garbage at this point(!)
    // ok, not exactly garbage - but a pointer
    // which is utterly invalid in our current space
  x->y = malloc( sizeof(struct Y) );
    //phew, no garbage anymore
  assert( x->y );
  readFromStream( src, x->y, sizeof(struct Y) );
}
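To see Listings 2 and 3 working together, a round trip along the following lines can be used. It is a sketch only: it is written in C++ rather than C, and MemStream, writeBytes(), readBytes() and roundTripX() are names made up for this illustration, with a single stream struct standing in for both OutMemStream and InMemStream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

struct Y { int yy; };
struct X { int xx; Y* y; int z; };  // 'y' is an 'owning' pointer

// Hypothetical stream, used here for both writing and reading.
struct MemStream { uint8_t* pp; uint8_t* ppEnd; };

inline void writeBytes( MemStream* dst, const void* p, std::size_t sz ) {
  assert( sz <= static_cast<std::size_t>( dst->ppEnd - dst->pp ) );
  std::memcpy( dst->pp, p, sz );
  dst->pp += sz;
}

inline void readBytes( MemStream* src, void* p, std::size_t sz ) {
  assert( sz <= static_cast<std::size_t>( src->ppEnd - src->pp ) );
  std::memcpy( p, src->pp, sz );
  src->pp += sz;
}

// Serializes an X (with its Y) into a buffer, deserializes it into a fresh
// X, and reports whether all the fields survived the trip.
inline bool roundTripX() {
  Y y{ 7 };
  X original{ 1, &y, 3 };

  uint8_t buf[ sizeof(X) + sizeof(Y) ];
  MemStream out{ buf, buf + sizeof(buf) };
  writeBytes( &out, &original, sizeof(X) );   // whole struct, pointer included
  writeBytes( &out, original.y, sizeof(Y) );  // then the pointee

  MemStream in{ buf, out.pp };
  X restored;
  readBytes( &in, &restored, sizeof(X) );     // restored.y is a stale pointer here
  restored.y = static_cast<Y*>( std::malloc( sizeof(Y) ) );
  assert( restored.y );
  readBytes( &in, restored.y, sizeof(Y) );    // now restored.y owns valid data

  bool ok = restored.xx == 1 && restored.z == 3 &&
            restored.y->yy == 7 && restored.y != original.y &&
            in.pp == in.ppEnd;
  std::free( restored.y );
  return ok;
}
```

Note that the stale pointer written as part of the whole struct is never dereferenced; it is simply overwritten by the result of malloc().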

From C to C++

C++ serialization

Ok, now let’s try to rewrite it into C++ (where we’re no longer restricted to Plain Old Data a.k.a. POD). To make things closer to reality, let’s serialize the class X in Listing 4, which contains (directly or indirectly) two std::strings, a std::vector, and a std::unique_ptr.

//LISTING 4
class OutMemStream {
  public:
    inline void write( const void* p, size_t sz );
      // implemented along the lines of the
      // writeToStream() above

    inline void writeString( const std::string& s ) {
      size_t l = s.length();
      write( &l, sizeof(size_t) );
      write( s.c_str(), l );
    }

  template<class T>
  inline void writeVector( const std::vector<T>& v ) {
    // NB: can be further optimized by writing the
    // whole v.data() at once.
    size_t sz = v.size();
    write( &sz, sizeof(size_t) );
    for( const auto& it : v ) // by reference, to avoid copying each element
      it.serialize( this );
  }
};

class PreDeserializer; // defined in Listing 5
class InMemStream;     // defined in Listing 5

class Y {
  public:
    int yy;
    std::string zz;
    std::string zz2;

    void serialize( OutMemStream* dst ) const;
    Y( const PreDeserializer& );
      //pre-deserializing constructor, see below
    Y( InMemStream* src );
      //deserializing constructor
};

class X {
  public:
    int xx;
    std::unique_ptr<Y> y;
    std::vector<Y> vy;

    void serialize( OutMemStream* dst ) const;
    X( const PreDeserializer& );
      //pre-deserializing constructor
    X( InMemStream* src );
      //deserializing constructor
};

void Y::serialize( OutMemStream* dst ) const {
  dst->write( this, sizeof(Y) );
  dst->writeString( zz );
  dst->writeString( zz2 );
  // NB: we do NOT serialize POD members
  // such as 'yy' separately
}

void X::serialize( OutMemStream* dst ) const {
  dst->write( this, sizeof(X) );
  y->serialize( dst );
  dst->writeVector( vy );
  // NB: we do NOT serialize POD members
  // such as 'xx' separately
}

Once again, it is very difficult to beat this serialization performance-wise (that is, unless playing some dirty tricks with flattening or allocators). Nonetheless, it contains all the necessary information (in fact, a little bit more than that) to deserialize our object when/if we need it.
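The NB comment in writeVector() of Listing 4 mentions that the whole v.data() can be written at once. One possible way to do it is sketched below, assuming C++17 and restricting the single-memcpy path to trivially copyable element types. The constructor and the pos() accessor are additions made only to keep the sketch self-contained, and bytesWrittenForInts() is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

class OutMemStream {
  uint8_t* pp;
  uint8_t* ppEnd;

  public:
    OutMemStream( uint8_t* begin, uint8_t* end ) : pp( begin ), ppEnd( end ) {}

    void write( const void* p, std::size_t sz ) {
      assert( sz <= static_cast<std::size_t>( ppEnd - pp ) );
      if( sz ) std::memcpy( pp, p, sz );  // skip memcpy for empty ranges
      pp += sz;
    }

    const uint8_t* pos() const { return pp; }  // hypothetical accessor

    template<class T>
    void writeVector( const std::vector<T>& v ) {
      std::size_t sz = v.size();
      write( &sz, sizeof(sz) );
      if constexpr( std::is_trivially_copyable_v<T> ) {
        // POD-like elements: one memcpy for the whole array
        write( v.data(), sz * sizeof(T) );
      } else {
        // anything else still goes element by element
        for( const auto& item : v )
          item.serialize( this );
      }
    }
};

// Writes a vector of three 32-bit ints and returns the number of bytes used.
inline std::size_t bytesWrittenForInts() {
  uint8_t buf[64];
  OutMemStream out( buf, buf + sizeof(buf) );
  std::vector<int32_t> v{ 1, 2, 3 };
  out.writeVector( v );
  return static_cast<std::size_t>( out.pos() - buf );
}
```

For std::vector<Y> from Listing 4 this changes nothing (Y has std::string members, so it is not trivially copyable); the shortcut only pays off for vectors of plain structs and built-in types.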

C++ deserialization – Take 1

However, deserialization in C++ is not going to be that simple. The problem here is that we didn’t store data on a per-field basis, which means that unless we do something, on deserialization we’ll be overwriting ‘owning’ pointers with their values from the old program (and replacing this garbage with a pointer to allocated data later). While this was ok for POD types in C, in C++ it can cause all kinds of trouble (such as an attempt to free a non-allocated pointer) unless we’re careful. The approach in Listing 5, however, is very clean in this regard.

//LISTING 5
class PreDeserializer {
}; 
// just an empty class, to be used as a tag
// for constructor

class InMemStream {
  uint8_t* pp;
  uint8_t* ppEnd;

  public:
    inline void read( void* p, size_t sz );
      // implemented along the lines
      // of the readFromStream() above
    inline void constructString( std::string* s ) {
      size_t l;
      read( &l, sizeof(size_t) );
      assert( pp+l <= ppEnd );
      new( s ) std::string(reinterpret_cast<const char*>(pp), l );
      pp += l;
    }

  template<class T>
  inline void constructVector( std::vector<T>* v ) {
    size_t sz;
    read( &sz, sizeof(size_t) );
    new( v ) std::vector<T>;
    v->reserve(sz);
    for( size_t i=0; i < sz ; ++i ) { 
      v->push_back( T( this ) );
    }
  }
};

Y::Y( const PreDeserializer& ) {
  // here we need to construct a valid object
  // just ANY valid object, preferably the
  // cheapest one to be constructed-destructed,
  // as it will be destructed right away
}

Y::Y( InMemStream* src ) {
  // at this point 'zz' and 'zz2' are already
  // constructed we cannot call src->read(this) as
  // it will overwrite valid 'zz'/'zz2' causing all
  // kinds of trouble.
  zz.~basic_string();
    //no idea why zz.~string() doesn't work
  zz2.~basic_string();
    // now 'zz'/'zz2' are no longer constructed,
    // and we can overwrite them safely. On the
    // other hand, starting from this point, we're
    // NOT exception-safe
  src->read( this, sizeof(Y) );
    //at this point 'zz'/'zz2' contain garbage
  src->constructString( &zz );
  src->constructString( &zz2 );
    // phew, no garbage anymore,
    // 'this' is once again a valid object
    // and we're again exception-safe
}

X::X( const PreDeserializer& ) {
// nothing here; we do NOT really need
// anything from here
}

X::X( InMemStream* src ) {
  // at this point 'y' is already constructed
  // we cannot call src->read(this) as it will
  // overwrite valid 'y' and 'vy' causing all
  // kinds of trouble.
  vy.~vector<Y>();
  y.~unique_ptr<Y>();
    // now 'y' and 'vy' are no longer constructed,
    // and we can overwrite them safely. On the other
    // hand, starting from this point, we're NOT
    // exception-safe
  src->read( this, sizeof(X) );
    //at this point 'y' and 'vy' contain garbage
  new(&y) std::unique_ptr<Y>( new Y(src) );
  src->constructVector( &vy );
  // phew, no garbage anymore,
  // 'this' is once again a valid object
  // and we're again exception-safe
}

Overall, deserialization of a class T goes as follows:

We construct an object of our class T, constructing all its non-POD members using Pre-Deserialization constructors (we don’t need to construct the members at all, but there is no way to avoid it in C++)

Within the object deserializing constructor, we have the following ‘sandwich’:
- We destruct all non-POD members by explicitly calling their respective destructors. It gives us the right to overwrite them.
- We overwrite the whole object T via memcpy(). At this point, non-POD members will contain garbage (more precisely, pointers which are invalid in our current space).
- We re-construct all the non-POD members via their deserializing constructor. No garbage anymore, and we’re ready to go. 
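The 'sandwich' above can be condensed into a small self-contained round trip, sketched below for a class with a single std::string member. It is an illustration under assumptions rather than the code we used: a std::vector<uint8_t> replaces OutMemStream/InMemStream, append() and sandwichRoundTrip() are hypothetical helpers, and (just like Listing 5) it relies on overwriting a non-trivially-copyable object with memcpy(), which works in practice on mainstream compilers but is not something the C++ standard promises.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <string>
#include <vector>

struct Y {
  int yy = 0;
  std::string zz;

  Y() = default;

  // Whole-object dump first, then the payload of the non-POD member.
  void serialize( std::vector<uint8_t>& out ) const {
    append( out, this, sizeof(Y) );
    std::size_t len = zz.size();
    append( out, &len, sizeof(len) );
    append( out, zz.data(), len );
  }

  // Deserializing constructor: the destruct/overwrite/re-construct sandwich.
  explicit Y( const uint8_t* src ) {
    zz.~basic_string();                     // 1. destruct the non-POD member
    std::memcpy( static_cast<void*>( this ), src, sizeof(Y) );
    src += sizeof(Y);                       // 2. whole object overwritten;
                                            //    'zz' holds garbage here
    std::size_t len;
    std::memcpy( &len, src, sizeof(len) );
    src += sizeof(len);
    new( &zz ) std::string( reinterpret_cast<const char*>( src ), len );
                                            // 3. 'zz' is valid again
  }

  private:
    static void append( std::vector<uint8_t>& out,
                        const void* p, std::size_t sz ) {
      const uint8_t* b = static_cast<const uint8_t*>( p );
      out.insert( out.end(), b, b + sz );
    }
};

// Serializes one Y and deserializes it back; true if the copy matches.
inline bool sandwichRoundTrip( const std::string& text ) {
  Y original;
  original.yy = 42;
  original.zz = text;
  std::vector<uint8_t> buf;
  original.serialize( buf );
  Y restored( buf.data() );
  return restored.yy == 42 && restored.zz == original.zz;
}
```

Trying it with both a short string and a long one exercises both the in-object (small-string) and the heap-allocated storage of std::string.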

Our Take 1 approach will work well – that is, until we need to deal with base classes, and especially polymorphic classes. 🙁 Polymorphic objects, among other things, contain a so-called ‘Virtual Table Pointer’, and overwriting it almost universally qualifies as a ‘pretty bad idea’. 🙁 Which leads us to the following…

C++ deserialization – Take 2, inheritance-friendly

Let’s consider the same classes X and Y, with class X having a unique_ptr<Y>, but let’s say that Y is a polymorphic base class, so the unique_ptr<Y> can point either to an instance of Y, or to an instance of YY.

Strictly speaking, our original serialization already has all the information we need; however, extracting it can be quite cumbersome without knowing the exact class layout (and this is compiler-specific). So, we’ll modify our serialization a bit (see Listing 6).

//LISTING 6
class Y { //polymorphic base
  public:
  int yy;
  std::string zz;
  std::string zz2;

  void polymorphicSerialize( OutMemStream* dst ) const;
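    // virtual, so that polymorphicSerialize()
    // reaches YY::serialize() for an instance of YY
  virtual void serialize( OutMemStream* dst ) const {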
    dst->write( this, sizeof(Y) );
    serializeAsBase( dst );
  }

  void serializeAsBase( OutMemStream* dst ) const {
    // non-POD ONLY for serializeAsBase()
    dst->writeString( zz );
    dst->writeString( zz2 );
  }

  explicit Y( InMemStream* src );
  explicit Y( const Y* that );
    // constructor from struct serialized by
    // child class
  
  void deserializeAsBase( InMemStream* src );
  static std::unique_ptr<Y>
    polymorphicCreateNew( InMemStream* src );

  virtual size_t serializationID() const { return 0; }
  virtual ~Y() {}
};

class YY : public Y {
  public:
    int yy2;
    void serialize( OutMemStream* dst ) const {
      dst->write( this, sizeof(YY) );
      Y::serializeAsBase(dst);
    }

  explicit YY( InMemStream* src );
  size_t serializationID() const override { return 1; }
};

void Y::polymorphicSerialize( OutMemStream* dst ) const {
  size_t id = serializationID();
  dst->write( &id, sizeof(size_t) );
  serialize( dst );
}

class X {
  public:
    int xx;
    std::unique_ptr<Y> y;
    std::vector<Y> vy;
    void serialize( OutMemStream* dst ) const;
    X( InMemStream* src );
    //deserializing constructor
};

void X::serialize( OutMemStream* dst ) const {
  dst->write( this, sizeof(X) );
  y->polymorphicSerialize( dst );
  dst->writeVector( vy );
    // we still do NOT serialize POD members
    // explicitly in Take 2; we will deserialize
    // them explicitly though
}

Here, we’re sacrificing a tiny bit of performance on serialization (sigh) to keep things very cross-platform and not to depend on the exact class layout; on the other hand, the penalty here is pretty small (we’re speaking at most about 1–2 CPU clocks plus a pipeline stall per polymorphicSerialize(), though in practice usually it will be much less than that due to branch predictions).

Now to deserialization. When deserializing inherited/polymorphic objects (and let’s not forget about multiple inheritance and virtual bases) we cannot really overwrite the whole object without the risk of overwriting virtual table pointer(s)¹. As a result, the best way we can see for deserializing such objects is on a per-field basis (see Listing 7).

//LISTING 7
class InMemStream {
  uint8_t* pp;
  uint8_t* ppEnd;

  public:
    inline void read( void* p, size_t sz );
      //same as before
    inline void* readInPlace( size_t sz ) {
      assert( pp + sz <= ppEnd );
      void* ret = pp;
      pp += sz;
      return ret;
    }
    inline void* fetchInPlace( size_t sz ) const {
      assert( pp + sz <= ppEnd );
      return pp;
    }
    inline std::string readString() {
      size_t l;
      read( &l, sizeof(size_t) );
      assert( pp+l <= ppEnd );
      pp += l;
      return std::string(reinterpret_cast<const char*>(pp - l), l );
    }
    template<class T>
    inline void readVector( std::vector<T>& v ) {
      size_t sz;
      read( &sz, sizeof(size_t) );
      v.clear();//just in case
      v.reserve(sz);
      for( size_t i=0; i < sz ; ++i ) {
        v.push_back( T( this ) ); 
      } 
    } 
}; 

Y::Y( InMemStream* src ) {
  Y* that = reinterpret_cast<Y*>( src->readInPlace(sizeof(Y)) );
  yy = that->yy;
  deserializeAsBase( src );
}

std::unique_ptr<Y> Y::polymorphicCreateNew( InMemStream* src ) {
  size_t id;
  src->read( &id, sizeof(size_t) );
  switch( id ) {
    case 0:
      return std::unique_ptr<Y>( new Y(src) );
    case 1:
      return std::unique_ptr<Y>( new YY(src) );
    default:
      assert( false );
      return nullptr; // unknown id; keeps all paths returning a value
  }
}

Y::Y( const Y* that ) {
  // NB: on non-x86/x64 CPUs, there may be a need
  // to memcpy 'that' into a temporary aligned
  // variable, along the lines of:
  // alignas(Y) uint8_t tmp[sizeof(Y)];
  // memcpy(tmp,that,sizeof(Y));
  // and then use 'tmp' instead of 'that'
  // this applies to ALL the cases where
  // readInPlace()/fetchInPlace() are involved
  yy = that->yy;
}

void Y::deserializeAsBase( InMemStream* src ) {
  //non-POD ONLY for deserializeAsBase()
  zz = src->readString();
  zz2 = src->readString();
}

YY::YY( InMemStream* src ) 
: Y( reinterpret_cast<YY*>( src->fetchInPlace(sizeof(YY)) )) {
  YY* that = reinterpret_cast<YY*>(src->readInPlace(sizeof(YY)) );
  yy2 = that->yy2;
  Y::deserializeAsBase( src );
}

X::X( InMemStream* src ) {
  X* that = reinterpret_cast<X*>(src->readInPlace(sizeof(X)) );
  xx = that->xx;
  y = Y::polymorphicCreateNew( src );
  src->readVector( vy );
}

Phew. This kind of code should be able to handle pretty much any kind of inheritance – and at extremely high serialization speeds too. Still, a virtual call to serializationID() is a slowdown (however minor it is), and apparently it can be avoided.
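The alignment remark in the comment of Y::Y( Y* that ) in Listing 7 can be illustrated for the simplest case, a purely POD struct. The sketch below copies the bytes into a properly aligned temporary before touching any field; loadUnaligned(), Pod and unalignedLoadWorks() are hypothetical names used only here. For a class such as Y (which is not trivially copyable), the alignas buffer from that comment would be needed instead.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Copies a trivially copyable T out of a possibly misaligned position.
template<class T>
inline T loadUnaligned( const uint8_t* p ) {
  static_assert( std::is_trivially_copyable<T>::value,
                 "memcpy-based load needs a trivially copyable type" );
  T tmp;                               // properly aligned temporary
  std::memcpy( &tmp, p, sizeof(T) );   // byte-wise copy, no alignment needed
  return tmp;
}

struct Pod {
  int32_t a;
  double b;
};

// Places a Pod at a deliberately misaligned address and reads it back.
inline bool unalignedLoadWorks() {
  alignas(Pod) uint8_t buf[ sizeof(Pod) + 1 ] = {};
  Pod src{ 5, 2.5 };
  std::memcpy( buf + 1, &src, sizeof(Pod) );  // off by one byte on purpose
  Pod got = loadUnaligned<Pod>( buf + 1 );
  return got.a == 5 && got.b == 2.5;
}
```

In our stream such misalignment appears naturally: everything which follows a string of arbitrary length is very unlikely to start at an aligned address.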

¹ Yes, there can be more than one virtual table pointer per object – at least, in the case of virtual base classes.

C++ deserialization – Take 2.1, deducing object type from VMT pointers

Strictly speaking, when we’re writing the whole object it already contains everything we need to deserialize. In particular, it already contains a Virtual Method Table (VMT) pointer which is equivalent to serializationID(); in other words, it is not really necessary to invoke a rather expensive virtual serializationID() on serialization. The only problem is how to deduce object type from the VMT pointers (and that’s without making too many assumptions about object layout, which is very platform- and compiler-dependent).

One thing which seems to work (still to be double-checked) to deduce object type from VMT pointers is as follows:

At some point (say, when our program starts), we’re creating an instance of each of the polymorphic objects we’re interested in

To avoid dealing with garbage, we’re creating these instances over zeroed memory (for example, using placement new over pre-zeroed buffer)

When creating a child object, we’re initializing the parent object within the child, in exactly the same manner as we’re initializing standalone parent object
We memcpy() each of such objects, creating an ‘object dump’ of each of the polymorphic objects
Then, as soon as we have a child object and a parent object and their respective dumps, we can:
- Cast child pointer to parent pointer to determine offset at which parent sits within the child
- Now we can compare byte-by-byte dumps of the parent-within-child (using the offset mentioned above) and standalone-parent. Normally, the only different bytes within the parent-within-child, and standalone-parent (given that they were created as described above), are VMT pointers; moreover, the dumps should differ in at least one byte. Therefore, we can distinguish between a polymorphic child and polymorphic parent (by one of them having certain byte(s) at certain offset(s) as certain pre-defined value(s))

Bingo! These bytes are equivalent to the serializationID().

Therefore, we can avoid writing the serializationID() during serialization, saving a few more CPU cycles (and bringing performance back to the C structure performance level) – all of that without any prior knowledge about class layout(s).

It should be noted that we didn’t try this approach ourselves, but it still looks perfectly plausible ;-).
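For the curious, the dump-comparison idea can be sketched roughly as follows. This is a toy under heavy assumptions: Base, Derived, TypeProbe, dumpOf(), buildProbe() and classify() are all hypothetical names, only single non-virtual inheritance is covered, and the whole thing leans on implementation-specific object layout which the C++ standard does not guarantee. The parent is deliberately declared without padding bytes on typical platforms, so that the byte comparison is not affected by padding contents.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

struct Base {                 // hypothetical polymorphic parent
  int32_t a = 0;
  int32_t a2 = 0;
  virtual ~Base() {}
  virtual int id() const { return 0; }
};
struct Derived : Base {       // hypothetical polymorphic child
  int32_t b = 0;
  int id() const override { return 1; }
};

// Constructs a T over zeroed memory and returns a byte dump of it.
template<class T>
std::vector<uint8_t> dumpOf() {
  alignas(T) uint8_t raw[ sizeof(T) ];
  std::memset( raw, 0, sizeof(raw) );
  T* obj = new( raw ) T();
  std::vector<uint8_t> dump( sizeof(T) );
  std::memcpy( dump.data(), static_cast<const void*>( obj ), sizeof(T) );
  obj->~T();
  return dump;
}

struct TypeProbe {
  std::vector<std::size_t> offsets;  // offsets where the two dumps differ
  std::vector<uint8_t> parentBytes;  // bytes of a standalone parent there
  std::vector<uint8_t> childBytes;   // bytes of a parent-within-child there
};

inline TypeProbe buildProbe() {
  std::vector<uint8_t> parent = dumpOf<Base>();
  std::vector<uint8_t> child = dumpOf<Derived>();
  Derived d;  // cast child to parent to find where the parent sits
  std::size_t shift = static_cast<std::size_t>(
      reinterpret_cast<const uint8_t*>( static_cast<const Base*>( &d ) ) -
      reinterpret_cast<const uint8_t*>( &d ) );
  TypeProbe probe;
  for( std::size_t i = 0; i < sizeof(Base); ++i ) {
    if( parent[i] != child[shift + i] ) {
      probe.offsets.push_back( i );
      probe.parentBytes.push_back( parent[i] );
      probe.childBytes.push_back( child[shift + i] );
    }
  }
  return probe;
}

// Returns 0 for a standalone Base, 1 for a Base within a Derived,
// and -1 if the bytes match neither dump.
inline int classify( const TypeProbe& probe, const Base* obj ) {
  if( probe.offsets.empty() ) return -1;
  const uint8_t* bytes = reinterpret_cast<const uint8_t*>( obj );
  bool isParent = true, isChild = true;
  for( std::size_t k = 0; k < probe.offsets.size(); ++k ) {
    uint8_t b = bytes[ probe.offsets[k] ];
    isParent = isParent && b == probe.parentBytes[k];
    isChild = isChild && b == probe.childBytes[k];
  }
  return isParent ? 0 : ( isChild ? 1 : -1 );
}
```

On deserialization, the same comparison would be run over the bytes of the dumped object in the stream instead of over a live object; this sketch does not attempt that part.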

Other stuff

Of course, this is not really an exhaustive list of problems you can encounter during ultra-high-speed serialization. However, most of the other problems you’ll run into are typical for any kind of C++ serialization. In particular, non-owning pointers (and abstract graphs) need to be handled in pretty much the same manner as for any other C++ serialization (see, for example, [ISOCPP] for a relevant discussion).

Performance

From what we’ve seen, this kind of Ultra-Fast Serialization is extremely fast; it is pretty much on par with C raw-structure-dump serialization, and is around 5–10 times faster than FlatBuffers (this is also consistent with the numbers provided by FlatBuffers themselves here: [FlatBuffers]). Even when comparing with home-grown code with per-field serialization, our Ultra-Fast Serialization still wins (up to 1.5x-2x) due to memcpy() over the whole struct having significant advantage over per-field copying.
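To make the 'per-field' vs 'whole struct' distinction concrete, the two styles are sketched below for a small struct. Pod, writePerField(), writeWhole() and sameBytes() are hypothetical names; the struct is chosen to have no padding, so that both functions produce identical bytes. The sketch does not measure anything; it only shows what is being compared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Pod {  // hypothetical struct, four 32-bit fields, no padding
  int32_t a;
  int32_t b;
  int32_t c;
  int32_t d;
};

// Per-field serialization: one small copy per member.
inline std::size_t writePerField( uint8_t* dst, const Pod& p ) {
  uint8_t* start = dst;
  std::memcpy( dst, &p.a, sizeof(p.a) ); dst += sizeof(p.a);
  std::memcpy( dst, &p.b, sizeof(p.b) ); dst += sizeof(p.b);
  std::memcpy( dst, &p.c, sizeof(p.c) ); dst += sizeof(p.c);
  std::memcpy( dst, &p.d, sizeof(p.d) ); dst += sizeof(p.d);
  return static_cast<std::size_t>( dst - start );
}

// Whole-struct serialization: a single copy of sizeof(Pod) bytes.
inline std::size_t writeWhole( uint8_t* dst, const Pod& p ) {
  std::memcpy( dst, &p, sizeof(Pod) );
  return sizeof(Pod);
}

// Checks that both styles produce the same bytes for this struct.
inline bool sameBytes() {
  Pod p{ 1, 2, 3, 4 };
  uint8_t a[ sizeof(Pod) ];
  uint8_t b[ sizeof(Pod) ];
  std::size_t na = writePerField( a, p );
  std::size_t nb = writeWhole( b, p );
  return na == nb && std::memcmp( a, b, na ) == 0;
}
```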

However, comparing the performance of our Ultra-Fast Deserialization with FlatBuffers is neither very interesting nor really relevant. It is not very interesting because, for the use cases described above, we’ll be doing serialization orders of magnitude more frequently than deserialization (as deserialization occurs only when something goes wrong). It is not really relevant, because (unlike FlatBuffers) we need to restore a data structure to exactly its original state (which is usually built in the manner described above, and is not easily flattenable); as a result, we’re bound to make all those expensive allocations (and they will eat most of the CPU clocks on deserialization).

On code generation

It is always a good idea to move all this mundane serialization code into some kind of code generator; as described in [NoBugs16a], writing a code generator which will generate code along the lines above is not rocket science.

Still, even in a manually-written form, this technique is actually usable in practice (that is, unless your data structures are very elaborate).

Limitations

One all-important caveat of our Ultra-Fast Serialization technique is the following:

DON’T even think of using it unless you can GUARANTEE that the serialized data will be deserialized by EXACTLY the same executable as the one which serialized it.

This explicitly prohibits ALL of the following:

Deserializing using the same code compiled by a different compiler/for different platform

Deserializing using the same library within different executables (well, this MIGHT fly, but we’d rather not risk it). Exactly the same .so/.dll library is ok, however.

Deserializing by different version of the same executable/shared library

If you do one of these things, most likely, Ultra-Fast Serialization will work for some time – but using it under these conditions is akin to sitting on a powder keg with a fuse already lit. Still, if you know for 100% sure that all the serialization/deserialization will happen in EXACTLY the same executable, it will be very difficult to beat this serialization technique performance-wise.

If the ‘same executable’ prerequisite doesn’t apply to your case, use FlatBuffers (or any of their competitors) instead. As usual, there is no such thing as ‘The Best Technique for Everything in Existence’, so you DO need different tools for different types of job. And ‘serialization for transfer over the network for client code compliant with the protocol’ and ‘serialization for exactly the same executable’ are two rather different beasts.

Acknowledgements

This article was originally published in Overload Journal #136 in December 2016 and is also available separately on the ACCU web site. Re-posted here with the kind permission of Overload. The article has been re-formatted to fit your screen.

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

Marek says

May 19, 2017 at 6:11 pm

Line 28 in LISTING 5: Why the hell are you not using reserve?

Line 46 in LISTING 5: Because templates and typedefs.

Line 30 in LISTING 7: Use reserve.

Did you consider fork-ing the process?

- "No Bugs" Hare says
  
  May 20, 2017 at 5:36 am
  
  Thanks for the comments! As for reserve() – you’re right, I added it. A (lame) excuse: for our tasks (serializing all the time just in case, but deserializing only IF/WHEN the crash happens – to debug how it happened) speed of deserialization didn’t really matter, so we didn’t optimize for it; our aim was to optimize serialization only (as for our use cases, it happens A LOT, and in production).
  
  About forking – we think about using Copy-on-Write (though without fork), and hope to write about it for August issue of Overload. It should be even faster – but will require writing an allocator.
  
- oleg says
  
  November 10, 2017 at 7:04 am
  
  lol
  
Dahrkael says

May 19, 2017 at 10:28 pm

Interesting matter,
if the data inside Y is big enough (or the whole memory is in one chunk), it would be interesting to try blosc, it claims to copy memory around faster than memcpy itself! (mix of multithread and L1 cache sized chunks)

- "No Bugs" Hare says
  
  May 20, 2017 at 5:54 am
  
  Thanks for the reference (didn’t see it before). OTOH, in practice, I don’t really expect chunks to be large enough for this kind of stuff to be efficient; that is – until we’ll start speaking about serializing state as a whole heap (which we plan to write about for August issue of Overload) :-).
  
Hugh Wang says

June 7, 2017 at 11:35 am

Why don’t you resort to an allocator? I know it’s a low-level trick, but for *ultra-fast* serialization, it’s worthy. What’s more, it’s platform independent to some extent.

- "No Bugs" Hare says
  
  June 13, 2017 at 3:51 am
  
  Allocator is indeed a yet another way to skin this cat :-). My take at allocations and (Re)Actors just got published in June issue of Overload: https://accu.org/var/uploads/journals/Overload139.pdf (with serialization to follow in August I hope) :-).
  
lpc says

March 29, 2018 at 9:31 pm

Line 9 of Listing 2: should be assert( dst->pp + sz ppEnd ); instead of assert( dst->pp + sz < ppEnd );

- "No Bugs" Hare says
  
  March 30, 2018 at 12:38 pm
  
  huh? Looks that wordpress has ate some symbols from your comment :-(. But if you meant it should be “<=", you're right, I fixed it :-).