TCP Peculiarities as Applied to Games, Part II

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

[Chapter 13(i) from “beta” Volume IV]

Nagle’s algorithm

Nagle’s algorithm is probably the very first thing you’ll hear about when starting to speak about using TCP for games – and yes, disabling it does provide an improvement; however, it is (by far) not a silver bullet. Very briefly – Nagle’s algorithm aims to deal with those [CENSORED] developers who’re trying to send data over TCP in really small chunks (like 1 byte each time). And to achieve this – all the not-so-[CENSORED] developers¹ are paying the price of decreased responsiveness of TCP (though Nagle’s algorithm can be disabled, phew).

The idea behind Nagle’s algorithm is to avoid so-called “small packets”; it is done by ensuring that a new packet is sent only if one of the following conditions applies:

A full packet² is available on the sender’s side
There are no packets which have been sent by the sender but are not acknowledged yet

As a result – Nagle’s algorithm can easily introduce latencies of the order of RTT; as typical RTTs are around 80-100ms even within one single continent – it can easily affect fast-paced games in a bad way.

Fortunately, Nagle’s algorithm can be disabled on all the systems I know; to do it – use setsockopt() function with TCP_NODELAY parameter.

Note that using TCP_NODELAY will not make your game as responsive as UDP (in particular, head-of-line blocking and exponential back-off will still apply(!)) – but it is indeed a prerequisite to having somewhat-responsive communication over TCP.
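
As a minimal sketch of the setsockopt() call mentioned above (assuming a POSIX-style sockets API; the helper name disableNagle() is made up for illustration and is not from any library):

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disables Nagle's algorithm on a TCP socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt()). */
int disableNagle(int sock) {
  assert(sock >= 0);
  int on = 1;
  return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

It is usually enough to call it once per socket, right after socket() or accept().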

¹ Of course, the readers of this article are automatically included ;-(

² Formally – determined by MSS or by PMTUD (more on the latter below)

Caution when using TCP_NODELAY

A word of caution:

when using TCP_NODELAY – make sure to prepare the whole message before calling send()

When using TCP with Nagle’s algorithm on (the default, i.e. without TCP_NODELAY option) – it is reasonably forgiving to “calling send() too many times” (well, avoiding “too many small chunks” is all Nagle is about). In other words – if you’re doing something along the following lines:

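// Anti-pattern: two send() calls per message.
// 'sock' is assumed to be a connected TCP socket defined elsewhere.
void sendMsg( const uint8_t* msg, size_t msgSz ) {
  assert(msgSz < 60000);
  uint16_t sz = (uint16_t)msgSz;
  send(sock, &sz, 2, 0 );  // first small chunk: 2-byte size, host byte order
  send(sock, msg, sz, 0 ); // second chunk: the message body
}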

– then for a default socket (without TCP_NODELAY) things won’t be too bad.³ However, if you use the same thing with TCP_NODELAY on – things can get quite ugly (especially for a loaded Server-2-Server TCP connection).

Moral of the story – when using TCP_NODELAY, make sure to prepare the whole message on the sending side – and send it with a single call to send(). Actually, it is better to do it even without TCP_NODELAY in sight – but with TCP_NODELAY it goes from SHOULD-do to MUST-do category. NB: instead of preparing the whole message, on some systems it MIGHT be possible to use so-called scatter/gather calls; for Linux – see sendmsg() function and struct msghdr, for Windows – see WSASendMsg. Still, whenever it is possible without extra allocations (and in 99% of the cases it is), I prefer to prepare the whole message at user level.
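
A rough sketch of “prepare the whole message, then send it in one go”, using the same (2-byte-size, message) format as sendMsg() above. The names buildFrame() and sendWholeMsg() are hypothetical, the little-endian size prefix is an assumption (the original snippet sends the size in host byte order), and a blocking socket is assumed:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define MAX_MSG_SZ 60000 /* same limit as in sendMsg() above */

/* Builds (2-byte size, message) in 'out'. Returns the frame size,
 * or 0 if the message is too large or 'out' is too small. */
size_t buildFrame(uint8_t* out, size_t outCap, const uint8_t* msg, size_t msgSz) {
  assert(out != NULL);
  if (msgSz >= MAX_MSG_SZ || outCap < msgSz + 2) return 0;
  out[0] = (uint8_t)(msgSz & 0xFF); /* little-endian size: an assumption */
  out[1] = (uint8_t)(msgSz >> 8);
  if (msgSz > 0) memcpy(out + 2, msg, msgSz);
  return msgSz + 2;
}

/* Sends one message as a single chunk; send() is called again only if the
 * kernel accepted just a part of the frame or the call was interrupted. */
int sendWholeMsg(int sock, const uint8_t* msg, size_t msgSz) {
  static uint8_t frame[MAX_MSG_SZ + 2]; /* static buffer: not thread-safe */
  size_t frameSz = buildFrame(frame, sizeof(frame), msg, msgSz);
  if (frameSz == 0) return -1;
  size_t sent = 0;
  while (sent < frameSz) {
    ssize_t n = send(sock, frame + sent, frameSz - sent, 0);
    if (n < 0) {
      if (errno == EINTR) continue; /* interrupted before sending: retry */
      return -1;
    }
    sent += (size_t)n;
  }
  return 0;
}
```

The static buffer avoids a per-message allocation at the cost of thread safety; a per-connection buffer would be the obvious alternative.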

³ it will still result both in unnecessary packets being sent, and in extra calls to kernel – but at least the number of unnecessary packets will be kept in check

[[TODO: PSH flag]]

OOB data (don’t hold your breath)

One TCP feature which frequently raises lots of hope among game developers who want to use TCP (only to become a frustration as soon as its nature is understood) is so-called Out-Of-Band (OOB) data. It may look like just-the-ticket to solve that annoying Head-Of-Line Blocking problem – or at least to introduce priorities. However, TCP OOB comes with two big caveats:

You cannot really send some out-of-band data; instead – you’re sending an urgent pointer to upcoming data (with the data still sent as a part of the regular TCP stream). Among other things, it means that even with OOB, in cases of flow control kicking in we still can suffer from Head-Of-Line Blocking (though it should be pretty rare, especially in the context of games); on the other hand, if the packet with the OOB byte has managed to make it to the receiving side – it could be delivered to our app without waiting for the missing segments.
Even more importantly – you cannot have more than one OOB byte in transit at any given moment.⁴ BTW, while you can call send(…,MSG_OOB) with more-than-one-byte, usually only the last byte of the buffer will be interpreted as OOB.

The only real-world use I know for TCP OOB, is sending an abort command during a long file transfer (such as FTP); in this case, receiver would skip all the data in the buffer until reaching TCP OOB marker – which allows receiver to re-sync with the sender a bit faster (and without completing all the no-longer-necessary work).

Unfortunately, as of now, I don’t know of any uses for TCP OOB for games; however, it doesn’t mean that such uses cannot exist. If you want to research OOB in more detail – I will refer you to [Stevens] (also available online at [Masterraghu]).

⁴ well, you can – but then you cannot guarantee that all OOB data will be received. It comes from an observation that size of OOB buffer on the receiving side is exactly one byte.

Termination Handshake and SO_LINGER

Normally, when TCP connection is terminated, it goes through so-called “termination handshake” (which involves at least three packets, though usually there are four). The problem solved here is that while we’re terminating TCP connection, it is assumed that we still want to be sure that the “last data we’ve just sent” still reaches the other party (which means being acknowledged etc.).

While for file transfers and HTTP transfers such behavior is a must – for gaming purposes it is rarely required. As a result – I’ve seen pretty large games disabling normal TCP termination – and with very good results too. In particular, a pretty bad problem with the Server having tons of connections in the TIME_WAIT state (and eating up all the resources – especially ports) under certain network conditions and/or attacks doesn’t happen with the termination handshake disabled.

To disable termination handshake – use setsockopt() with SO_LINGER parameter, and struct linger having fields of l_onoff=1 and l_linger=0.
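
A minimal sketch of this call (POSIX-style sockets assumed; the helper name disableTerminationHandshake() is hypothetical):

```c
#include <assert.h>
#include <sys/socket.h>

/* With l_onoff=1 and l_linger=0, close() on a connected TCP socket
 * aborts the connection (RST) instead of going through the normal
 * termination handshake. Returns 0 on success, -1 on failure. */
int disableTerminationHandshake(int sock) {
  assert(sock >= 0);
  struct linger lg;
  lg.l_onoff = 1;
  lg.l_linger = 0;
  return setsockopt(sock, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```

As the caveats in this section suggest, it is meant for connected sockets only (not for listening ones).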

While (as noted above) I’ve seen it working with very good results on a large real-world game – there are still a few potential caveats:

If using it indiscriminately – it may cause trouble; in particular – you should not use SO_LINGER on listening sockets – and also should not use it if you’re using shutdown() on the same socket (you won’t need either – but you never know what that-3rd-party-library-you’re-using will call on your behalf). For details – see [Nybek] and [Nybek2] (they also indicate that using l_onoff=1 and l_linger=0 on connected sockets and without shutdown() works exactly the same – and exactly as expected – across the board, for both blocking and non-blocking sockets).
During deployment, your admins (and/or their monitoring tools) may go crazy because of the number of RST packets you receive.
- Note that while taming your admins is only a matter of being patient (ok, let’s admit you should be very patient), dealing with their tools can be more problematic. Still, usually it can be solved.
There is a potential problem of stray packets arriving late and potentially breaking connections. I have never seen it to cause any kind of trouble during intensive real-world usage – but you never know what can potentially hit you. Still, my educated guess is that overload coming from TIME_WAITs is much more likely to hit you than stray packets.
It is usually inappropriate for scenarios when you need to be sure that you have delivered something to the other side of communication.⁵ While this is usually not the case for game traffic – be careful, for some stuff you may want exactly the default-TCP termination handshake kind of behavior.

⁵ On the other hand – if we need it, we can always use an app-level acknowledgement

“Hung” TCP Connections and Keep-Alives

One nasty thing about TCP connections is that in the real world they often “hang” (such connections are also known as “black-holed connections”). In other words – a TCP connection looks as if it is alive, but in reality nothing goes through it; however, Internet connectivity is present, so if we try to establish a new connection to the same address – we’ll see it coming through(!). NB: when pressing Refresh on the “stuck” browser page and obtaining the page back – at least in 50% of the cases we’re observing this exact thing.

“Hung” TCP connections can be attributed to several different reasons (including such different things as exponential back-off – which can easily double the time after the connectivity has been broken and restored – and changing-IP-by-an-intermediate-NAT). However, for our current purposes, it is not really important to know why connections “hang” – it is much more important that they do “hang”, so if we’re going to use TCP even for not-so-fast-games, we still should address it.

One common way to detect “hung” TCP connections is by using some kind of Keep-Alives. In general, there are two types of keep-alives which might be usable to detect a “hung” TCP connection:

TCP Keepalive (SO_KEEPALIVE in setsockopt()). This is a neat option which does pretty much what we need – however, the default TCP Keepalive timeout of 2 hours (!!) made it perfectly useless for games. With a recent addition of socket-level options such as TCP_KEEPINTVL/TCP_KEEPCNT/TCP_KEEPIDLE (for Linux) and SIO_KEEPALIVE_VALS (for Windows), though, it became possible to control the timeouts of TCP Keepalives (reducing them to single-digit seconds⁶).
Application-level Keep-Alive. If we don’t want to rely on system-specific options (or just happen to run on a box which doesn’t have them) – we can always use app-level Keep-Alives. For example, if we’re using (2-byte-size,message-of-specified-size) format described above – nothing prevents us from integrating app-level Keep-Alives into this format (either marking Keep-Alives with 0 size, or adding some message-type field etc.). In this case – we need to do all the handling of Keep-Alives ourselves, but in return we’ll get significant benefits:
- App-level Keep-Alives are perfectly cross-platform
- We get an option to use “opportunistic” connection recovery mentioned above. Very briefly: when we feel that the connection is in danger of being “hung” – we can try to establish a new connection, without closing the existing one. If the existing connection springs back to life – we can continue using it; otherwise – we switch to the new one (which is very likely to work as we just completed a handshake on it). This approach allows us to avoid the unpleasant dilemma of “how large we want to have our Keep-Alive timeout” (if we make it too large – we get too much delay; if we make it too small – we’ll start breaking completely workable connections, potentially losing on delays too); this, in turn, allows us to use smaller timeouts than otherwise possible (as with “opportunistic” recovery – we have nothing to lose). Still, even with opportunistic connection recovery we’re speaking about timeouts in the range of single-digit seconds and up.
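
To make the app-level option a bit more tangible, here is a sketch of the timer logic only. Everything in it is hypothetical (the KaState/KaAction names, the idea of calling kaOnTick() from a periodic timer); it assumes that Keep-Alives are zero-size messages in the (2-byte-size, message) format and that the peer sends them too:

```c
#include <assert.h>
#include <stdint.h>

typedef enum { KA_NONE, KA_SEND_KEEPALIVE, KA_TRY_NEW_CONNECTION } KaAction;

typedef struct {
  uint32_t lastRecvMs;     /* when we last received anything from the peer */
  uint32_t lastSendMs;     /* when we last sent anything (incl. Keep-Alives) */
  uint32_t sendEveryMs;    /* send a Keep-Alive after this much silence from us */
  uint32_t suspectAfterMs; /* peer silent this long: connection may be "hung" */
} KaState;

/* A Keep-Alive is a message of zero size: just the 2-byte size field. */
static const uint8_t KA_FRAME[2] = { 0, 0 };

/* Decides what to do at time nowMs; the caller performs the action and
 * updates lastSendMs / lastRecvMs whenever it sends or receives data. */
KaAction kaOnTick(const KaState* ka, uint32_t nowMs) {
  assert(ka != 0);
  if (nowMs - ka->lastRecvMs >= ka->suspectAfterMs)
    return KA_TRY_NEW_CONNECTION; /* open a second connection, keep this one */
  if (nowMs - ka->lastSendMs >= ka->sendEveryMs)
    return KA_SEND_KEEPALIVE;     /* send KA_FRAME with a single send() */
  return KA_NONE;
}
```

On KA_TRY_NEW_CONNECTION the existing connection is deliberately left open: if data arrives on it before the new connection is ready, the caller can simply drop the new one.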

⁶ While Windows allows specifying milliseconds – I don’t think going below 1 second or so qualifies as a good idea
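
For the TCP Keepalive option, a Linux-only sketch could look as follows (the helper name and its parameters are hypothetical; on Windows, SIO_KEEPALIVE_VALS would be needed instead and is not shown here):

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enables TCP Keepalive with short timeouts (Linux-specific options).
 * idleSec:     idle time before the first probe is sent
 * intervalSec: time between unanswered probes
 * probeCount:  unanswered probes before the connection is dropped
 * Returns 0 on success, -1 on failure. */
int enableFastKeepalive(int sock, int idleSec, int intervalSec, int probeCount) {
  assert(sock >= 0 && idleSec > 0 && intervalSec > 0 && probeCount > 0);
  int on = 1;
  if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
    return -1;
  if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idleSec, sizeof(idleSec)) != 0)
    return -1;
  if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &intervalSec, sizeof(intervalSec)) != 0)
    return -1;
  if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &probeCount, sizeof(probeCount)) != 0)
    return -1;
  return 0;
}
```

With, say, enableFastKeepalive(sock, 3, 2, 2), an idle “hung” connection should be detected after roughly idleSec + intervalSec × probeCount seconds; these particular numbers are only an illustration.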

SACK, Fast Retransmit

One of the interesting features which may help a little bit to improve latencies over TCP is related to the mechanism known as Fast Retransmit (which is in turn closely related to Selective Acknowledgement, also known as SACK). Very briefly:

SACK allows acknowledging a non-contiguous portion of the TCP stream; this, in turn, allows the sender to avoid retransmitting the-data-already-got-to-receiver. However, SACK as such doesn’t really improve latencies (and Head-of-Line Blocking is still there).
Fast Retransmit causes the sender to retransmit the packet before the RTO (discussed above) has expired. Traditionally, it is based on the number of duplicated ACKs received in a row.

In general, we cannot do much to affect the way Fast Retransmit is handled – with one notable exception. On our Linux server, we can try setting a tcp_thin_dupack parameter (which usually resides in /etc/sysctl.conf) to the value of 1 – which might help to improve latencies for our gaming app a little bit. While results are not guaranteed and in any case are likely to be rather mild – I don’t see any reasons not to try it. This is known to play well with another parameter discussed above – tcp_thin_linear_timeouts (which disables exponential back-off if our connection is “thin” enough).
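
As a configuration fragment, the two parameters mentioned above would look roughly like this in /etc/sysctl.conf (the net.ipv4. prefix is how these parameters are named for sysctl; whether a particular kernel still has the tcp_thin_dupack knob is something to check on the box itself, as newer kernels may not provide it):

```
# /etc/sysctl.conf (fragment); apply with "sysctl -p"
net.ipv4.tcp_thin_dupack = 1
net.ipv4.tcp_thin_linear_timeouts = 1
```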

Buffer Sizes

As it was noted above – TCP has buffers on both sides of the connection; as it was also noted above – the default size of these buffers these days is at least 16-32K⁷. However, in some cases (and in the context of games) it might make sense to reduce⁸ these default buffer sizes (using setsockopt() with SO_SNDBUF and SO_RCVBUF). This can have the following benefits:

Reduced resource usage (often, for these buffers TCP uses so-called “non-paged memory”, which can be a pretty scarce resource on the Server-Side).
Reduced latencies in case we need to push something unexpectedly urgent to the same TCP connection (to achieve it – we should keep our own queue of outstanding messages, and move messages from the queue to the socket as soon as space appears in the TCP send buffer).
Improved overall control; in particular – it can be used to implement conflation (in a manner similar to that of [Lightstreamer], as it was discussed in [TODO] section).

On the other hand – while reducing buffer sizes, make sure that they’re larger than your maximum-message-sent-over-TCP.⁹
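
A sketch of reducing the buffers (the helper name setSocketBuffers() is hypothetical, and any sizes passed to it are for illustration only). Note that on Linux the kernel doubles the requested value for its own bookkeeping and enforces a minimum, so getsockopt() will usually report a different number than the one requested:

```c
#include <assert.h>
#include <sys/socket.h>

/* Requests smaller send/receive buffers for a TCP socket.
 * Best called before connect()/listen(), so that the sizes are in
 * effect from the very beginning of the connection.
 * Returns 0 on success, -1 on failure. */
int setSocketBuffers(int sock, int sndBytes, int rcvBytes) {
  assert(sock >= 0 && sndBytes > 0 && rcvBytes > 0);
  if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndBytes, sizeof(sndBytes)) != 0)
    return -1;
  if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvBytes, sizeof(rcvBytes)) != 0)
    return -1;
  return 0;
}
```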

⁷ and often even more – especially if you’re messing with parameters such as tcp_rmem/tcp_wmem

⁸ Sic! – most of “TCP tuning” guides will speak about increasing these buffers “to improve performance” – but for games, it is decreasing that we need.

⁹ It doesn’t need to be “strict maximum” – but keeping it above “more-or-less common maximum” is usually very desirable.

PMTUD

One of the TCP features which happens to be quite-annoying-for-games is so-called Path MTU Discovery a.k.a. PMTUD. The very basic idea of PMTUD is to get (and update as network conditions change) an estimate of the maximum packet size that fits over the whole network path from our Client to our Server. While PMTUD has its obvious benefits for file transfers – it is quite a headache for games. In particular:

Using PMTUD requires ICMP Fragmentation Needed (ICMP Type 3, Code 4) to go through the whole path from Client to Server and back – otherwise your connection will “hang”.
- OTOH, these days – it is very rarely a problem (that is, if you make sure to push your admin not to filter out ICMP Type 3, Code 4 regardless of him reading somewhere-on-the-Internet that all ICMPs are evil).
PMTUD logic causes packets to be lost periodically (this is a result of PMTUD trying to detect an increase in the already-discovered Path MTU, more on it in Section 6.3 of [RFC1191]). While it is a small percentage of packets – losing something which we could avoid losing is rarely a good idea (especially over inherently lossy Client-2-Server connections where it will interplay with other losses etc.).

As a result – I would consider disabling PMTUD on your gaming servers (no warranties of any kind, batteries not included). Note that when disabling PMTUD – you will usually need to specify the hard-coded number for the packet size which will be used; usually – the best results are achieved somewhere between 576 and 1280 bytes.
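
One possible per-socket way to do it on Linux is sketched below; this is an assumption about how to implement the advice, not the only option (there is also a system-wide ip_no_pmtu_disc sysctl). The helper name is hypothetical, and the “packet size minus 40 bytes” arithmetic assumes IPv4 and TCP headers without options:

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Linux-specific: switches PMTUD off for this socket and caps the TCP
 * segment size so that packets stay within maxPacketBytes.
 * Intended to be called before connect()/listen().
 * Returns 0 on success, -1 on failure. */
int disablePmtud(int sock, int maxPacketBytes) {
  assert(sock >= 0 && maxPacketBytes > 40);
  int mode = IP_PMTUDISC_DONT; /* never set DF, do not probe the path */
  if (setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof(mode)) != 0)
    return -1;
  /* 20 bytes of IPv4 header + 20 bytes of TCP header (assumption: no options) */
  int mss = maxPacketBytes - 40;
  return setsockopt(sock, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
}
```

For example, disablePmtud(sock, 1280) would correspond to the upper end of the range mentioned above.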

Congestion Control/Avoidance

As it was discussed in [[TODO]] section above – Active Queue Management (AQM) and closely related congestion avoidance algorithms play a significant role for TCP. In particular, for our latency purposes, the popular concept of “halving congestion window on congestion event” tends to cause pretty unpleasant effects at the moment when such halving happens.

As a result, it seems that “proportional rate reduction” (PRR, reportedly used by Linux kernels by default starting from 3.2) performs for gaming purposes a little bit better than the alternatives (for some background, see [RFC6937]).

Another Congestion-Control feature which is rather annoying for games – is so-called “slow start”; while connection establishment is relatively rare – “slow start” can (and will) affect our recovery-from-“hung”-connection. It is usually not a major problem compared to all the delays caused by “hung” connection, but can easily add a few additional RTTs to the delay.

As a result, I would certainly advise you to (a) measure real-world latencies on your live server (one of the not-so-important ones); (b) test different congestion algorithms on your Linux box (this can usually be done via /proc/sys/net/ipv4/tcp_congestion_control); (c) choose whatever-works-better for your specific game.
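
Besides the system-wide /proc setting, Linux also allows selecting the congestion algorithm per socket via the TCP_CONGESTION option, which may be handy for such experiments. A sketch (helper names are hypothetical; which algorithms are actually available depends on the kernel build and its configuration):

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Linux-specific: selects the congestion control algorithm for one socket,
 * e.g. "reno" or "cubic". Returns 0 on success, -1 on failure. */
int setCongestionAlgo(int sock, const char* name) {
  assert(sock >= 0 && name != NULL);
  return setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION, name,
                    (socklen_t)strlen(name));
}

/* Reads the algorithm currently used by the socket into 'out' as a
 * NUL-terminated string. Returns 0 on success, -1 on failure. */
int getCongestionAlgo(int sock, char* out, socklen_t outCap) {
  assert(sock >= 0 && out != NULL && outCap > 1);
  memset(out, 0, outCap);
  socklen_t len = outCap - 1; /* keep the last byte as the terminator */
  return getsockopt(sock, IPPROTO_TCP, TCP_CONGESTION, out, &len);
}
```

Setting the algorithm per socket makes it possible to compare two algorithms side by side on the same server, which fits step (b) above.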

[[To Be Continued…

This concludes beta Chapter 13(i) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for beta Chapter 13(j), where we’ll discuss how to make TCP as responsive as UDP (more strictly – how to pass UDP over TCP without increasing latencies).]]


Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

danadam says

March 15, 2017 at 11:33 pm

Any reason why footnote 4 is in “Termination Handshake and SO_LINGER” instead of “OOB data (don’t hold your breath)”?

Regarding buffer size, what about tcp_rmem and tcp_wmem? man tcp says: tcp_rmem (since Linux 2.4) … TCP dynamically adjusts the size of the receive buffer from the defaults listed below, in the range of these values, depending on memory available in the system.

On my machine (Ubuntu 14, 24 GB RAM), newly created socket reports rcvbuf=87’380, sndbuf=16’384 and after connect() rcvbuf=1’061’808, sndbuf=2’626’560.

- "No Bugs" Hare says
  
  March 16, 2017 at 11:02 am
  
  re: footnote 4 – the reason was missing shortcode in the text ;-), fixed now, thanks!
  
  About tcp_rmem – you’re right, actually sizes can be even larger than I originally wrote, THANKS! Practically – it means that we have even more reasons to restrict them for games. These large buffers are optimized for file transfers – and games are very very different.
  
wodny says

April 1, 2017 at 11:16 pm

One more interesting thing to remember is that setting TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT to have a per application hung connections prevention timeout values does not influence the timeout when the connection is not idle and is waiting for an ACK. The TCP_USER_TIMEOUT option might be useful but it overrides the TCP_KEEPCNT value. Also as far as I remember results of my tests the TCP_USER_TIMEOUT value is not exactly the time between calling send() and getting ETIMEDOUT.