Part IIb: Protocols and APIs (continued) of 64 Network DO’s and DON’Ts for Game Engine Developers

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

This post continues an article about implementing network support for game engines (this includes game engines intended for one single app). Our analysis includes all genres and all platforms, with the only exception being browser-based games, which are out of scope.

In response to your SYN packet as referenced above, we're pleased to inform you that...

Previous parts:

Part I. Client Side
Part IIa. Protocols and APIs

Present Part IIb continues discussion on specifics of protocols and APIs – i.e. how to organize interactions between different nodes, and between different layers.

Upcoming parts include:

Part IIIa. Server-Side (Store-Process-and-Forward Architecture)
Part IIIb. Server-Side (deployment, optimizations, and testing)
Part IV. Great TCP-vs-UDP Debate
Part V. UDP
Part VI. TCP
Part VIIa. Security (TLS/SSL)
Part VIIb. Security (concluded)

12. DON’T provide blocking RPC-like API

Blocking RPC-like APIs, such as DCE RPC, MS RPC, MS COM, etc. should be avoided. Blocking on a network call is generally a Bad Idea (see, for example, item #3 in Part I); it becomes even worse over the Internet, where round-trip times (RTTs) are at least 100 ms for a trans-atlantic connection. NB: trans-atlantic RTT is not going to become an order of magnitude better than it is now; for example, from London to New York the distance is 5570 km, for which the best theoretically possible RTT (assuming that the signal propagates at the speed of light in vacuum, with zero processing times) is 37 ms.
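To make the arithmetic behind that 37 ms figure explicit, here is a tiny sketch. The function names (`minRttMs`, `blockingTotalMs`) are made up for illustration; the only inputs are the distance quoted above and the speed of light in vacuum.

```cpp
#include <cassert>

// Speed of light in vacuum, in km/s.
constexpr double kSpeedOfLightKmPerSec = 299792.458;

// Lower bound for the round-trip time (in milliseconds) over a given one-way
// distance, assuming vacuum propagation and zero processing time.
constexpr double minRttMs(double distanceKm) {
  return 2.0 * distanceKm / kSpeedOfLightKmPerSec * 1000.0;
}

// Total time (in milliseconds) spent waiting when `calls` blocking RPCs are
// issued strictly one after another, each costing one full RTT.
constexpr double blockingTotalMs(int calls, double rttMs) {
  return calls * rttMs;
}

// London to New York, 5570 km: the bound comes out at about 37 ms.
static_assert(minRttMs(5570.0) > 37.0 && minRttMs(5570.0) < 37.3,
              "theoretical trans-atlantic RTT should be about 37 ms");
```

With a realistic 100 ms RTT, a hypothetical thousand sequential blocking calls already add up to `blockingTotalMs(1000, 100.0)`, i.e. 100 seconds of pure waiting, regardless of how fast the server is.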

One real-world example of why blocking RPCs are a Bad Thing for over-the-Internet apps. Once upon a time, I saw an app which used CORBA (without structures passed by value, as it wasn’t available yet) as an underlying transport; it meant that the coding style was similar to “call-RPC-constructor-to-create-remote-object – call-RPC-method-of-remote-object-to-set-property-A – call-RPC-method-of-remote-object-to-set-property-B – …“. While the app performed reasonably well in a LAN, it became an outright disaster when the team tried to test it over the trans-atlantic Internet (where it was intended to work); all those RTTs for each RPC call accumulated and led to a 20-minute time required to open a single form. Later on, it was rewritten to a simple message exchange, which resulted in a 400x improvement in form opening times (and brought them to more or less reasonable numbers).

It should be noted that no-return-non-blocking RPC calls (such as those used in Unity 3D) are not discouraged under present item. The primary problem with RPCs is not whether it looks as a function call or as a message (personally I prefer messages, but it is an issue of much less importance); the real problem with RPCs is their blocking nature, which tends to cause all kinds of problems with apps intended to work over the Internet.

12a. DO implement callback-on-reply instead of RPC

As we’ve seen that blocking-RPC-with-reply is not exactly a good thing, the question arises: how to support replies to messages? One way to support it is the following: whenever game developer sends a message which implies reply, she may specify a callback which will be called when the reply comes; according to items #1-3 from Part I, this callback should be called within the main event-processing thread. An example code for game developer might look as follows:


class GotReplyFromXYZ : public CallbackObject {
  public:
  void gotReply(Message& reply) {
    //... will be called when the reply comes back
    //...
  }
};

//...
sendMsgWithReply(msg,new GotReplyFromXYZ());

For Unity3D, to implement this kind of things, they’re suggesting to use no-reply-no-block-RPC initiated in response to no-reply-no-block-RPC; while technically correct, I don’t really like their approach and prefer explicit callback with matching of request and response supported by network library (as described right above), which allows to make game logic a bit simpler.
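To illustrate what “matching of request and response supported by network library” might look like inside the library, here is a minimal sketch. It is not the API of any real engine: `ReplyDispatcher`, the `requestId` field and the `std::function`-based callback are assumptions made for the example, and the actual sending is left out. Everything is meant to run within the main event-processing thread, so there is no locking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical message type; a real library would carry a serialized payload.
struct Message {
  std::uint64_t requestId = 0;  // correlates a reply with its request
  std::string body;
};

// Matches incoming replies to the callbacks registered at send time.
class ReplyDispatcher {
 public:
  using Callback = std::function<void(const Message&)>;

  // Assigns a fresh request id, remembers the callback, and returns the
  // message as it would be handed over to the transport.
  Message sendMsgWithReply(Message msg, Callback onReply) {
    msg.requestId = nextId_++;
    pending_.emplace(msg.requestId, std::move(onReply));
    return msg;  // a real library would enqueue it for sending here
  }

  // Called from the event loop when a reply arrives. Returns false for
  // unknown or duplicate replies, which are simply ignored.
  bool onReplyReceived(const Message& reply) {
    auto it = pending_.find(reply.requestId);
    if (it == pending_.end()) return false;
    Callback cb = std::move(it->second);
    pending_.erase(it);  // erase first, so the callback may send new requests
    cb(reply);
    return true;
  }

  std::size_t pendingCount() const { return pending_.size(); }

 private:
  std::uint64_t nextId_ = 1;
  std::unordered_map<std::uint64_t, Callback> pending_;
};
```

A production version would also need timeouts (so that callbacks for lost replies do not pile up forever) and a policy for what to do on disconnect; both are omitted here to keep the matching logic visible.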

13. DO encourage „authoritative servers“

Traditionally, especially within real-time-simulation world, there are two kinds of beasts: „authoritative servers“ and „non-authoritative servers“. Authoritative servers make all decisions on „what is going on in the world“ themselves, and non-authoritative ones just blindly trust the client to make decisions. In general, non-authoritative servers can possibly work only if all the clients can be trusted, which stops being the case as soon as the game becomes more or less popular (even with a million registered users you can pretty much count on some of them being willing and able to hack your client). Non-authoritative servers are still used relatively often because they make the job of the game developer more straightforward (and feel more natural when coming from single-player 3D development). Still – authoritative servers are clearly preferred, especially if the game involves some kind of money (even indirectly); if there is money to be obtained from hacking of the game client or the protocol – you can count on a multi-million army of your users to do it. To make things worse, there is no real protection from this kind of hacking when a player wants to hack his own machine; all protection for the client is by definition only “security by obscurity”. The nightmare scenario in this regard, from the point of view of game developers, is the following:

developers wrote the game based on non-authoritative server
it became successful with millions of players
developers added monetizing with in-game items bought for real money
in-game items got traded on eBay, which created a strong incentive to hack the game
the game got hacked, which is pretty much inevitable under the circumstances. You shouldn’t count on outsmarting all those millions of highly motivated hackers operating in an environment under their own control. The task of protecting the client is very similar to the task of bot fighting (which is a subject for a separate article), but the impact which bots have on the game, unless the bot is published on the Internet, is well-contained (and if it is published – disabling it is not a big deal); on the other hand, a hacked client which allows manipulating gameplay is pretty much a guaranteed disaster, even if the hack is not published.
developers got into lots of trouble with paying customers, AND need to rewrite the whole thing pretty much from scratch – or to scrap it completely

At the very least, I suggest:

to explicitly say (in documentation etc.) that you discourage („don’t support“, „don’t guarantee security for“, even better – all of the above) non-authoritative servers. If you don’t say it loudly – they will come to you saying „hey, we’ve got this game of ours and everything is working perfectly, so now it got very popular, now just make it secure“ – and you will find yourself in hot water
not to provide any kind of explicit support for non-authoritative servers. While functions such as GetSomethingFromClientAndUpdateServerStateAccordingly() can usually be written by game developers themselves, at least do not make it easier for them.

Oh, and to provide a pointer on “how to handle latencies for authoritative servers in fast-paced multiplayer environment” – I refer you to an excellent piece by [Gambetta].

14. DO provide your own addressing

One thing which you certainly SHOULD encapsulate within your network library, is addressing. Game developer shouldn’t need to say „send this message to a server with an IP of… on port of…, specifying as a part of the message that it is intended to ‘Oasis #29’“; instead, the game developer should say „send this message to an object ‘Oasis #29’ within the world named ‘Africa’“, and the translation of this information into IPs/ports (and probably something else which goes over those IPs/ports) should be handled within the network layer.

This serves two purposes:

first, it makes life of game developers simpler and frees them from bothering about irrelevant details (believe me, they already have enough to think about).
second, it allows the network layer to handle changing IPs (and mapping of objects between servers) without worrying that game developer has stashed stale IP somewhere. Overall, the whole issue is about clean separation of concerns, and in practice it helps a lot.

Oh, and BTW – as soon as you have your own addressing – it opens the door to all kinds of optimizations, such as multiple-logical-connections-over-one-single-TCP-connection, prioritization and so on.

Whether your own addressing should be text-based or integer-based – it is up to you and it depends on your requirements; in practice, I’ve seen very little difference between these two approaches; what really matters is to have your own addressing, but whether it is text or int (or ASN.1-style tree) – it doesn’t usually matter too much for internal addressing.
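As a rough illustration of such encapsulation, here is a sketch of a directory which maps logical addresses to physical endpoints. All the names (`LogicalAddress`, `Endpoint`, `AddressDirectory`) are hypothetical, the mapping is per-world only for brevity, and the IPs are from the documentation-only 192.0.2.0/24 range.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// What the game developer sees: a world name and an object name.
struct LogicalAddress {
  std::string world;   // e.g. "Africa"
  std::string object;  // e.g. "Oasis #29"; routed within the world's server
};

// What the network layer uses internally.
struct Endpoint {
  std::string ip;
  std::uint16_t port;
};

// Owned by the network layer; game code never sees IPs or ports.
class AddressDirectory {
 public:
  // (Re)assigns a world to a physical server, e.g. after a migration.
  void assignWorld(const std::string& world, const Endpoint& ep) {
    assert(!world.empty());
    worldToServer_[world] = ep;
  }

  // Resolved on every send, so a remapped world is picked up at once and
  // there is no stale IP stashed anywhere in game code.
  bool resolve(const LogicalAddress& addr, Endpoint& out) const {
    auto it = worldToServer_.find(addr.world);
    if (it == worldToServer_.end()) return false;
    out = it->second;
    return true;
  }

 private:
  std::unordered_map<std::string, Endpoint> worldToServer_;
};
```

The point of the design is that the translation happens at send time inside the library; whether the key is a string, an int, or a tree is secondary, as noted above.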

15. DO think twice about providing multi-cast

Multi-cast APIs might be necessary, but they’re tricky, especially if you’re going to provide multi-cast to many thousands of users. First of all, keep in mind that most likely, you won’t be able to use UDP multicast over the Internet (which will be discussed in part IV and part V). It implies that you’ll need to implement multi-cast yourself, sending multiple uni-cast packets.

Implementing multi-cast as multiple uni-cast packets is fine – until you need to send your packet to dozens of thousands of users; sending too many packets at the same time creates a powerful burst, which starts causing overload of network devices and/or links, which can lead to delays, unnecessary packet loss, and even device reboots (the latter is obviously a bug of the device, but it still may happen). In practice, a burst consisting of a certain number (very roughly – between 10’000 and 100’000) of simultaneous packets may cause problems to your single server; a burst consisting of several million packets may start causing observable issues for a mid-size ISP. In general, bursts are to be avoided, which should be taken into account when implementing multi-casts (for example, by providing a time frame over which the multi-cast may be spread, see publisher/subscriber APIs for details).
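One simple way to spread a multi-cast over a time frame is to compute a send offset for each recipient up front. The sketch below assumes an even spread; the function name `spreadSchedule` is made up for the example, and a real implementation would feed these offsets into whatever timer mechanism the library already has.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// For each of `recipients` uni-cast sends, returns the offset from "now" at
// which it should go out, so that the sends are spread evenly across
// `window` instead of being fired as one burst.
std::vector<std::chrono::microseconds> spreadSchedule(
    std::size_t recipients, std::chrono::microseconds window) {
  assert(window.count() >= 0);
  std::vector<std::chrono::microseconds> offsets;
  offsets.reserve(recipients);
  for (std::size_t i = 0; i < recipients; ++i) {
    // i-th send goes out at i/recipients of the window (integer arithmetic).
    const long long offset = static_cast<long long>(window.count()) *
                             static_cast<long long>(i) /
                             static_cast<long long>(recipients);
    offsets.push_back(std::chrono::microseconds(offset));
  }
  return offsets;
}
```

For example, four recipients over a 1000-microsecond window get offsets of 0, 250, 500 and 750 microseconds. The window itself is the “acceptable delay” the game developer is willing to tolerate for this particular piece of data.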

Overall, whenever applicable – publisher/subscription model (as described in the next item) is generally preferable to multi-casts. At the very least, publisher/subscriber model allows for a better abstraction level; it also allows to isolate publisher from subscribers via front-end servers (see below), which helps scalability a lot.

16. DO provide “publisher/subscriber“ APIs where applicable

As we’re speaking about multi-player games (with an emphasis on “multi-“…), it usually implies that there are some bits of information which need to be distributed to many (N) players. However, sending such data as a multi-cast might not be optimal for quite a few reasons (some of them are explained in the previous item #15). In practice, it has been observed that the “publisher/subscriber“ model is usually more efficient (and is more obvious for the game developer) than manually tracking updates.

For the “publisher“ side (usually server-side) an API may look along the lines of (it is C++, but the same thing can be written in pretty much any programming language):

class MyPublisher : public GenericPublisher {
  public:
  MyPublisher() : GenericPublisher("MyPublisherName",
    MY_PUBLISHER_PRIORITY,MY_PUBLISHER_MAX_ACCEPTABLE_DELAY) {}
    /* see below on acceptable delays */

  void updateX(int x) { GenericPublisher::updateField("x",x); }
  void updateY(int y) { GenericPublisher::updateField("y",y); }
};

For the „subscriber“ side:

class MySubscriber : public GenericSubscriber {
  public:
  MySubscriber(): GenericSubscriber() {}

  void subscribe() { GenericSubscriber::subscribe(server_id,"MyPublisherName"); }
  /* NB: server_id should use own addressing, as described in item #14 above */
};

One of the advantages of publisher/subscriber approach is that it can easily support (without any changes to game logic) the following deployment architecture:

publisher-server -> two-dozens-of-front-end-servers -> million-of-users

In this deployment model (which you as a network library developer should support), each node sends updates to all the nodes which have requested them (i.e. “subscribed” to the publication); the trick is that publisher-server has only front-end-servers as subscribers, and serving two dozens of connections won’t take too much time of valuable mission-critical publisher-server (which also has game logic to run); serving millions of users will take time, but it can be done on cheap-and-dispensable front-end servers.
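The fan-out can be modelled with a toy in-memory sketch, where the same `Node` class plays the publisher, the front-end servers and the clients. The class and its methods are purely hypothetical, and direct calls stand in for network sends; the only purpose is to show that the publisher's cost depends on the number of front-ends, not on the number of players.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A node keeps its own copy of the published state and forwards every
// update to the nodes subscribed to it.
class Node {
 public:
  void addSubscriber(Node* downstream) {
    assert(downstream != nullptr && downstream != this);
    subscribers_.push_back(downstream);
  }

  // Applies an update locally and forwards it downstream. In a real
  // library each forward would be a message over the network.
  void onUpdate(const std::string& field, int value) {
    state_[field] = value;
    for (Node* s : subscribers_) {
      ++sent_;
      s->onUpdate(field, value);
    }
  }

  bool get(const std::string& field, int& out) const {
    auto it = state_.find(field);
    if (it == state_.end()) return false;
    out = it->second;
    return true;
  }

  // Number of messages this particular node had to send.
  std::size_t sentCount() const { return sent_; }

 private:
  std::map<std::string, int> state_;
  std::vector<Node*> subscribers_;
  std::size_t sent_ = 0;
};
```

With two front-ends and six clients split evenly between them, a single update costs the publisher two sends and each front-end three; scale the numbers up and you get the two-dozens-of-front-ends picture from above.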

Serving all the million of users directly from a single server is problematic (at the very least it will require a behemoth server which will cost you an arm and a leg, but I have doubts that a million is even feasible on one server); serving the same million of players via two dozens of front-end servers is much easier (I’ve personally seen half a million of simultaneous highly active users served in such an architecture, using a 4-socket $50K server as a publisher, and $10K servers as front-end ones, and it worked like a charm).

CDN is a large distributed system of servers deployed in multiple data centers across the Internet. (Wikipedia)

This architecture, in turn, may be used to provide CDN-style deployment, with front-end-servers placed in ISPs around the world, high-speed low-latency channels (such as frame relay) between publisher-server and front-end-servers, and apps choosing their best server based on Geo-IP information, or on latencies, or even (if everything else fails) on user choice. While I didn’t see such deployments on a large scale, they still might be a good thing to reduce latencies (though the latency gain needs to be carefully measured).

If we’re speaking about millions of simultaneous users, then we should take bursts (as described in previous item) into account. To deal with it, publishers SHOULD specify acceptable delay when publishing data, and your library on front-end servers SHOULD spread updates over this time when they receive the update from the publisher-server.

“Subscriber/publisher“ model can work both for guaranteed and non-guaranteed message delivery; for non-guaranteed delivery it will become very close to Unity 3D’s “state synchronization“ (though, as far as I know – please correct me if I’m wrong – Unity doesn’t directly support goodies such as front-end-server-support etc.).

Bottom line. Of course, not all the data in the system will be publisher/subscriber data, but in general, whenever you have multi-cast, you SHOULD consider replacing it with publisher/subscriber model.

17. DO provide prioritization for connections and subscriptions

From the network point of view, there are different things which are happening, and many of them have different priorities. For example, game traffic usually has higher priority than download-DLC traffic, and real-time-combat traffic usually has higher priority than update-top-players-list traffic. All traffic needs to be delivered, the question is about the order. In general, you SHOULD allow the game developer to specify priority for your connections and publishers. The realistic-to-implement contract on the priority with the game developer may read as follows:

within single priority, all the messages (publish updates) are guaranteed to be delivered in exactly the same order as they were issued
lower-priority messages are guaranteed not to go faster than higher-priority ones
higher priority is not guaranteed to be delivered faster than lower priority; in the extreme case implementation may ignore priorities altogether.

Different types of contracts are possible (specifically, guarantee #2 may be tricky to implement if multiple connections are used, and may be dropped in some implementations); however, in any case, providing priority even with these quite weak rules (and good will to provide prioritization even if it isn’t strictly required by the rules, but is feasible at the moment), has been observed to be a Good Thing which provides quite a few benefits to the end-user.

Unfortunately, implementing prioritization is quite tricky; I’ve done it myself for a single-TCP-between-nodes underlying model, and it worked pretty well (though with some limitations). A very sketchy explanation of this model is provided in the item #17c below.

Implementing prioritization is quite different for guaranteed (such as TCP and reliable UDP) and non-guaranteed delivery (such as bare UDP) messages. For guaranteed delivery mechanisms (such as TCP or reliable UDP with flow control) it can be implemented at least as single-connection with internal prioritization as briefly described in the item #17c below (which I’ve seen personally working quite well); implementation based on the item #17d looks possible, though I didn’t see it in practice and cannot vouch for it working well in practice. For non-guaranteed delivery messages I could only suggest to try relying on packet size (item #17d); if anybody knows any other ways of prioritization – I would really like to hear about them.

17a. Implementing prioritization: DO know where your bottleneck is

When speaking about prioritization, way too many developers start to think in terms of „how to increase thread priority for the thread which handles this traffic“. In general, your threads should not be a bottleneck (if they are for any kind of kinda-real-time experience – you should work really hard to eliminate it), and if they’re not a bottleneck, playing with thread priorities won’t do much (if anything at all).

Normally, the bottleneck (and an inevitable one) is the network itself. And for most players (assuming that game servers are not exceeding their network capacity), the worst bottleneck will be for the player’s incoming traffic on the „last mile“ between player’s ISP and player; second worst bottleneck will be on the same „last mile“, but for the player’s outgoing traffic (from player to server).

It is the thing which you should consider first when implementing prioritization.

17b. Implementing prioritization: DON’T hold your breath over QoS

In general, despite all the good intentions behind Quality of Service, and it being supported by vendors, it is not working on the Internet [NetEqualizer]. QoS can be made to work in the LAN and in the Intranet, but on the Internet the current state of affairs (and for a foreseeable future [CISCO]) is “forget about it“. It is a sad fact, but we have to live with it.

NB: I’ve heard rumors that specifying QoS MIGHT work for outgoing packets; it might help a bit, though as for the most users the main bottleneck (as mentioned above in the item #17a) is on the “last mile“ in the direction from server to client, i.e. on players-ISP-router (which is extremely unlikely to honour QoS) – I have doubts about the practical advantages of specifying QoS; on the other hand – I didn’t perform any experiments in this regard recently, so feel free to experiment (and please let me know if you get any positive results).

17c. Implementing prioritization: single-flow-controlled-connection-between-nodes approach

As QoS doesn’t really work, is there anything we can do? In fact, there is something. It is quite a complicated subject, so I will highlight only the main points. If you have a flow control (such as TCP, but other ways of flow control will also fly) between the nodes in your network, then you have some kind of “outstanding data which is intended for node B, but is still located on node A“. As soon as you have this outstanding data – you can prioritize it (for example, by placing the data into the prioritized queue). However, for this approach to work efficiently – you need a single flow control (for TCP it translates into having a single TCP connection) between each pair of nodes (otherwise prioritization won’t work across these independent flow controls).

This model has its limitations, and prioritization is not perfect (especially when it goes over TCP, due to TCP buffers being quite large and difficult-to-control from the application level at least on Windows), but if we’re speaking about times on the order of 1-3 seconds – it (if implemented carefully) has been observed to work surprisingly well in practice (substantially better than multiple connections for which you cannot control the priorities at all).
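The “prioritized queue” for outstanding data can be sketched as follows. This is only an illustration of the queueing part, with made-up names (`PrioritizedSendQueue`, `popNext`); deciding when the connection is ready for more data (i.e. the flow control itself) is left out. The sequence number is what keeps the order-within-single-priority guarantee from item #17.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// One chunk of outstanding data waiting on node A to be sent to node B.
struct Outgoing {
  int priority;       // larger value means more urgent
  std::uint64_t seq;  // enqueue order, used to keep FIFO within a priority
  std::string payload;
};

// Ordering for std::priority_queue: returns true if `a` must be sent
// later than `b`.
struct SendLater {
  bool operator()(const Outgoing& a, const Outgoing& b) const {
    if (a.priority != b.priority) return a.priority < b.priority;
    return a.seq > b.seq;  // same priority: earlier message goes first
  }
};

class PrioritizedSendQueue {
 public:
  void enqueue(int priority, std::string payload) {
    queue_.push(Outgoing{priority, nextSeq_++, std::move(payload)});
  }

  // To be called whenever flow control allows more data on the single
  // connection to the peer. Returns false if nothing is waiting.
  bool popNext(std::string& out) {
    if (queue_.empty()) return false;
    out = queue_.top().payload;
    queue_.pop();
    return true;
  }

  bool empty() const { return queue_.empty(); }

 private:
  std::priority_queue<Outgoing, std::vector<Outgoing>, SendLater> queue_;
  std::uint64_t nextSeq_ = 0;
};
```

Note that prioritization only affects data which is still sitting in this queue; whatever has already been pushed into the TCP send buffer is out of reach, which is one reason to hand data over to the socket in small portions.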

17d. Implementing prioritization: independent connections and small packets

Ok, previous item covers TCP and other flow-control-enabled ways of communication. But is there something for UDP? Well, mostly no (at least I’m not aware of any really good ways to prioritize UDP over the Internet), but at least there is something.

There exists an observation that quite a few of the Internet routers out there tend to prefer small IP packets to large ones (in a sense that short packets get higher priority and/or less chances to be dropped). While I cannot vouch for it really being the case, it is one thing worth trying (especially if you don’t have any other options). If small packets are indeed prioritized at least to some extent, you could try to send your urgent data in smaller packets; this would work for UDP, but for TCP the following restrictions are to be observed (for this model only): (a) low-priority and high-priority communications need to go over different TCP connections; (b) TCP stream needs to be artificially “split“ into smaller packets using TCP_NODELAY setsockopt() parameter (see item #45 in Part VI for some details on TCP_NODELAY, and on ways how you SHOULD NOT use it).
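For reference, setting TCP_NODELAY on a POSIX system looks roughly as follows. The wrapper names (`setNoDelay`, `getNoDelay`) are mine; `setsockopt()`, `getsockopt()`, `IPPROTO_TCP` and `TCP_NODELAY` are the standard sockets API. On Windows the call is similar, but headers and types differ, so this sketch is POSIX-only.

```cpp
#include <cassert>
#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/tcp.h>  // TCP_NODELAY
#include <sys/socket.h>   // socket(), setsockopt(), getsockopt()

// Enables or disables TCP_NODELAY (i.e. disables or enables Nagle's
// algorithm) on a TCP socket. Returns true on success.
bool setNoDelay(int fd, bool enable) {
  assert(fd >= 0);
  int flag = enable ? 1 : 0;
  return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) == 0;
}

// Reads the current TCP_NODELAY setting back. Returns true on success.
bool getNoDelay(int fd, bool& enabled) {
  assert(fd >= 0);
  int flag = 0;
  socklen_t len = sizeof(flag);
  if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &len) != 0) return false;
  enabled = (flag != 0);
  return true;
}
```

With TCP_NODELAY set, each write to the socket tends to go out as its own segment instead of being coalesced with later writes, so the packet size is effectively controlled by how much you pass to each send call; this is exactly why the option is easy to misuse.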

In general, if you have guaranteed connections with flow control (i.e. TCP or guaranteed-UDP-with-flow-control), I would not suggest to use „small packets“ approach; I expect single-connection-between-nodes model described in the item #17c above to work better on average (however, it is not carved in stone, and if you get any results showing otherwise – please let me know).

To be continued…

The next part, “Server Side”, is planned for release next Monday. It will cover quite a few issues, including Store-Process-and-Forward architecture, server-side communications, application-level balancing, and others.

EDIT: The series has been completed, with the following parts published:
Part IIIa. Server-Side (Store-Process-and-Forward Architecture)
Part IIIb. Server-Side (deployment, optimizations, and testing)
Part IV. Great TCP-vs-UDP Debate
Part V. UDP
Part VI. TCP
Part VIIa. Security (TLS/SSL)
Part VIIb. Security (concluded)

References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.