TCP Peculiarities for Games, part 1

	Author:	“No Bugs” Hare Follow:
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

TCP Exponential Back-Off

[rabbit_ddmog vol=”4″ chap=”Chapter 13(h) from “beta” Volume IV”]

For quite a few games out there, TCP is a no-no. Additional latencies and especially head-of-line blocking, which haunt TCP whenever packets can be lost – are indeed a problem for fast-paced stuff. In general – whenever TCP packet is lost – we’re speaking about additional delay (which is equal to initial re-transmission time out a.k.a. RTO) of about 200ms (see further discussion in “RTO” section below). In case of loss of two packets in a row – which DO happen very frequently too, the delay grows to about 3x of initial RTO, making it into half-a-second range. Below is a very rough table of chances to lose N packets in row – and associated delays for TCP (all assuming purely random packet loss¹; also “Fast Retransmit” mechanisms are not accounted for – though IIRC their impact is usually pretty minor for game-like scenarios):

Table XI.1

Number of packets lost in row N	Chances of it happening (assuming 5% random loss)	Frequency for such a loss to happen (assuming 20 packets/second)	Additional Delay²
1	5e-2	Every second	Init_RTO³ ~= 200ms
2	2.5e-3	Every 20 seconds	3*Init_RTO⁴ ~= 600 ms
3	1e-4	Every 6 minutes	7*Init_RTO ~= 1.5 sec
4	6e-6	Every 2 hours	15*Init_RTO ~= 3 sec
5	3e-7	Every 2 days	31*Init_RTO ~= 6 sec

¹ In practice, it is never completely random, so the frequency which such delays are observed with, is usually much higher

² that’s on top of usual ~RTT/2 delays; see also discussion in “Retransmit Timeout” section below

³ Initial RTO (before exponential back-off takes its toll)

⁴ increase is due to exponential back-off, see “Exponential Back-off” section below

“the best we can hope when using a single TCP connection over 5%-loss channel – is having 1.5-second “lag spike” every 5 or so minutes, and a 3-second “lag spike” every 2 hours.As we can see – the best we can hope when using a single TCP connection over 5%-loss channel⁵– is having 1.5-second “lag spike” every 5 or so minutes, and a 3-second “lag spike” every 2 hours. For most of the fast-paced games out there – it won’t fly well (though it may fly somehow – more on it below).

Still, there are game genres (social, turn-based strategies, casino-like games, some of RPGs – think Sims, etc.) where such latencies are perfectly acceptable; also – as we’ll see below – there are some subtle ways to mitigate these latencies too. And whenever we can push latencies out of the way – TCP has quite a few significant benefits; most important ones include such things as simplified development (compared to UDP), and – IMNSHO much more importantly – TCP being MUCH more firewall-friendly than UDP. As noted in [https://www.ietf.org/proceedings/88/slides/slides-88-tsvarea-10.pdf], only 91-94% of users can make outbound UDP connections to the Internet; whether this in number is “high” or “low” – depends on the point of view (and on genre involved; this number will be surely MUCH lower for hardcore gamers, but will be pretty much as mentioned above, for casual gamers), but it is certainly enough to start thinking about it⁶As discussed in Vol. I, such an analysis of minor player groups becomes especially interesting if we consider highly competitive field with 10+ of games competing for players; then being able to cater for these 6-9% of potential players can mean than all these 6-9% of players can be grabbed (i.e. these 6-9% are non-competitive), providing a Big Advantage™ – in particular, in terms of critical mass.[/rabbit_footnote].

Last but not least, I want to note that we won’t discuss those-things-which-are-both-well-known-and-are-barely-relevant to game development. In particular, we won’t make an umptieth description of mechanics of TCP window, or of 3-way handshake; sure, they will be working when implementing a game over TCP – but they’re soooo easy to Google for, and are quite difficult to misuse in our context, so I clearly prefer to concentrate on much-less-known and/or much-less-misusable aspects of TCP.

⁵ and while 5% represents a pretty bad packet loss – if you have a million players, at least a few tens of thousands will have at least this much. Moreover, current tendency (lead by a battle for increase of channel utilization by ISPs) tends to cause packet loss increased from year to year

⁶ As discussed in Vol. I’s chapter on GDD, such an analysis of minor player groups becomes especially interesting if we consider highly competitive field with 10+ of games competing for players; then being able to cater for these 6-9% of potential players can mean than all these 6-9% of players can be grabbed (i.e. these 6-9% are non-competitive), providing a Big Advantage™ – in particular, in terms of critical mass.

TCP Fundamentals

From an app-developer perspective, very basics of TCP consist of two all-important properties.

TCP is a stream, nothing more, nothing less

First of all, TCP is a stream, whole stream, and nothing but stream (so help me Stevens). There is no concept of “message” in TCP (though, as discussed below, we can implement messages ourselves on top of TCP). Moreover –

nobody guarantees that calls to send() on one side of communication will correspond to calls to recv() on the other side.

This 1-to-1 mapping between send*() and recv*() calls, while present for UDP, is NOT guaranteed for TCP. In other words, if we want to send messages over TCP (as we usually do), the following needs to be done:

On sending side, messages need to be formatted in a way that allows to find message boundaries.
- One popular formatting goes as (2-bytes-representing-message-size, message-body-of-corresponding-size). Note that going for more generic 4-byte size is not that obvious: for security and reliability purposes we DO need a limit on message size (and 4G size is usually way too high); at least, if you want to have 4-bytes as message size – still make sure to limit it at about a-few-megabytes on receiving side.
“On receiving side, our code MUST be prepared to handle all possible splits of the stream, from receiving-a-dozen-of-messages-as-one-chunk on one side of the spectrum – to receiving everything byte-by-byte on another one, with all the possibilities in betweenOn receiving side, our code MUST be prepared to handle all possible splits of the stream, from receiving-a-dozen-of-messages-as-one-chunk on one side of the spectrum – to receiving everything byte-by-byte on another one, with all the possibilities in between. In particular, assuming that we’re using message formatting described above:
- First, we need to receive 2-bytes-representing-message-size. Still – it may come as two separate bytes in two separate recv() calls, so we SHOULD account for scenario when we get just one byte – and store it somewhere⁷ until we get the other one.
- Then, we can allocate message (according to its size). Strictly speaking, allocation is not exactly required (for example, we can take message buffer from some kind of message pool) – but is quite common for implementing communication stuff.
- Now, we can start reading message body; once again – it can arrive as any number of recv() calls, so we’ll need to keep “number of received bytes from the message” – most likely, within the same somewhere as above.

Ignoring these things is one of the most popular mistakes when trying to use TCP (for the first time, that is). To make things worse – if you’re illegally relying on one-to-one correspondence between send() and recv() – it is likely to work for a while – especially as long as you’re testing your program locally and/or in LAN; this may lead to Somebody trying it and saying “hey – ‘No Bugs’ has no idea of he’s saying, I’ve done it against his advice – and much simpler too – and it still works!”. Don’t worry, Mr. Somebody – it will start to fall apart as soon as you’re out from LAN to the Internet (detailed mechanics of this phenomenon are outside the scope of this book, but in general – both delays and packet losses tend to break this never-promised 1-to-1 correspondence very quickly).

⁷ If we’re using (Re)Actors – as a part of (Re)Actor state

TCP works over IP

The second all-important property of TCP is that

While TCP is nothing but (reliable) stream, it is implemented on top of inherently unreliable IP packets.

Two communicating TCP hosts have no other way to communicate than to exchange IP packets, plain and simple. And as we discussed before – each and every IP packet can be lost.

“TCP needs to detect those lost packets, and to re-send themAs a result – TCP needs to detect those lost packets, and to re-send them. Primary method for such retransmits is based on ACKs (of course, ACK is a part of a TCP/IP packet). This basic mechanics (which goes back to RFC793 published in 1981) consists of the following elements:

receiver sends ACKs back to sender to indicate that it has received portions of the stream (known as “segments” in TCP speak)
if sender doesn’t receive ACK within certain timeout – it re-sends the segment. Calculations of timeout (known as RTO) are quite convoluted and even controversial, see “RTO” section for details.

It is worth noting that original ACKs were acknowledging only the last contiguous position which has been received; as a result – one lost packet could have caused the re-send of quite a long chunk. To address it, modern TCP stacks implement so-called Selective Acknowledgements (SACKs) – more on them below.

Send/Receive Buffers

In addition to two all-important properties above – one implementation detail is so ubiquitous (and actually unavoidable) that it is well-worth mentioning.

To enable flow-controlled stream on top of unreliable packets – TCP stacks have two buffers, one on sending side – and another on receiving side. Default sizes of these buffers are in the range of a few kilobytes (like 8-16K, but if the exact size is important for you – it is better to double-check, these things change all the time).

Receive-side buffer is especially important – among other things, it serves as a buffer to temporarily store out-of-order packets (which need to be stored at least until the missing portion arrives).

To make sure that receive-side buffer is never overflown – flow control mechanism (based on so-called TCP Window a.k.a. RWIN – very shortly, receiver advertising free space in receiving buffer to the sender) is used; and that’s all we need to know about TCP Window for our current purposes.

Not-So-Fundamentals

Head-of-Line Blocking

All-important-for-our-purposes consequence of TCP being stream and nothing but stream, is that if there are several packets in transit from sender to receiver, and one of the packets is lost – then

packets after the one which is lost, even if present on the receiving host, are not given to the app-level.

This phenomenon is known as “head-of-line blocking” (a.k.a. HOL blocking).

“we may already have the-information-we-need on the receiving host – but this information is hidden from us by layers of abstraction on receiving side.BTW, let’s note that with TCP and lost packets, we find ourselves in a funny situation – we may already have the-information-we-need on the receiving host (this information may have arrived in those packets-after-the-one-lost) – but this information is hidden from us by layers of abstraction on receiving side.

In theory, it is possible to provide access to these out-of-order packets (more strictly – not-contiguous-yet portion of the stream) – just by adding an API on the receiving side and without any changes whatsoever to the protocol itself – but I don’t know of any implementations or proposals on doing it. Which is a pity, as this approach would allow to build more responsive apps over TCP.

Retransmit Time-Out (RTO)

Formally, calculation of retransmit time-out SHOULD be made according to [RFC6298]. And most of it is indeed followed.

Very rough reasoning behind calculations in RFC6298 can be described as follows:

Unless we’re willing to risk making unnecessary retransmits – we need to wait at least RTT before retransmit.
So, we need to measure RTT.
However, as RTT is not guaranteed to be the same all the time – we also need to measure RTT variance too, and use Init_RTO = RTT + 4*variance(RTT).
- If RTT is not known yet (like “while TCP connection is still being established”) – fixed times are used; per RFC6298 – it is at least 1 second. BTW, it is an improvement over than pre-RFC6298 times, I still remember those sequences of unacknowledged SYNs going at T, T+3, and T+9 seconds.

All the logic above makes sense, and implemented more-or-less the same across the board. However, the next point is quite controversial. After making all those calculations above, RFC6298 says that if we’ve got a value less than one second, we SHOULD “round up” our Init_RTO to 1 second. However:

As – except for satellite connections – RTT+4*variance(RTT) over 1s is very rare, all that nice logic and reasoning above is effectively thrown out of the window (and would make timeouts excruciatingly sssslllooowww too).
Fortunately, existing implementations tend to use significantly smaller minimum for RTO; in particular, Linux seems to set the minimum to 200ms by default (see [Groš], and it is also supported by personal observations, though YMMV).
- BTW, on Linux minimum RTO can be adjusted (improving situation for games) – using “ip route change … rto_min” command (see [StackExchange.ChangeTcpRto] for details).
  - For non-gaming traffic, when playing with rto_min, we MIGHT still have a need to keep in mind so-called “Delayed ACKs” (explained, for example, in [Wikipedia.TCPDelayedAck]); delayed ACKs allow receiver to delay sending ACK for received packet for up to 500 ms (though in practice, much more modest times, like 40 ms for Linux and 200 ms for Windows, were reported). If we set rto_min too low – then an ACK to a received packet can be delayed for up to delayed-ACK-timeout of those 40-200 ms, which can in turn cause an unnecessary (spurious) retransmit. However, I would argue that for games, risks and costs of such spurious retransmits will be normally pretty much negligible compared to gains in player experience (due to reduced latency). Also, if we have a steady stream of packets at regular intervals in both directions (which is quite common for fast-paced games) – ACKs are effectively piggy-backed on the existing packets, so “Delayed ACKs” do NOT apply at all.

Exponential Back-Off

“in TCP, on each and every retransmit, retransmit timeout is doubledOne further feature of TCP RTO (and very annoying for gamedevs one) is so-called exponential back-off. Very shortly – in TCP, on each and every retransmit, retransmit timeout is doubled (that is, until we receive an ACK). It is exactly exponential back-off which is responsible for exponential growth of timeouts as a result of linear growth in number of packets lost, in Table XI.1.

Exponential back-off was introduced over 20 years ago, with the idea behind supposedly being to preserve stability of the Internet in case of some strange and undetermined problems. However, these days its merits are disputed [MondalEtAl]. I won’t go into a lengthy discussion about this matter, noting merely that whether exponential back-of is important for the well-being of the Internet or not – as of now, we DO need to live with it…

A few interesting consequences of exponential back-off:

If we have a relatively long interruption in IP connectivity (such as 1-2 minutes which are typical for BGP convergence time) – at TCP level we can get up to 2x longer disruption. I.e. if the IP packets weren’t delivered for 2 minutes – at TCP level the interruption can be anything between 2 minutes and 4 minutes (and quite evenly distributed too).
After some delay, it may happen that it is better latency-wise to re-establish new connection rather than wait for TCP to retransmit. In practice – I’ve seen the “sweet spot” for it being between 5 and 20 seconds, though YMMV can vary greatly.
- “an “opportunistic” switch to new connection has been observed to work quite wellIn this regard, an “opportunistic” switch to new connection has been observed to work quite well. With such an “opportunistic” approach:
  - Most likely, we’ll need to have some kind of “Keep-Alives” (more on them below) to detect that connection has “hanged”⁸
  - If existing TCP connection stays “quite long” without Keep-Alives – Client attempts to establish second TCP connection, without dropping the 1^st one
  - If, while doing it, first connection springs back to life – we can drop the second one, and continue with the first one.
  - However, if second connection is established while the 1^st one is still dead – we can drop the 1^st one and use already-established 2^nd one to move all our communications there.

BTW, we can try to disable exponential back-off at least on the server side (assuming that our Servers run Linux) – Google for tcp_thin_linear_timeouts parameter in /etc/sysctl.conf. While it is not exactly disabling exponential back-off for all connections – it effectively disables them for thin connection (and most of the time, our game connections are expected to qualify as “thin”).

⁸ there are several of reasons for TCP connection to “hang” – or to feel as it has “hanged”: from exponential back-off on top of several lost packets – to NAT on the way changing its IP

[[To Be Continued…

This concludes beta Chapter 13(h) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”.

Stay tuned for beta Chapter 13(i), where we’ll continue our discussion of not-that-well-known TCP features into Nagle algorithm, SACK and Fast Retransmit, lingering options and eliminating 4-way handshake, “hanged” connections and keep-alives, peculiarities related to PMTUD, and effects of congestion avoidance and AQM.]]

[+]References

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

David Turner says

March 9, 2017 at 8:20 am

Like the bit about reading messages out of a TCP stream, but there’s one other common mistake that wasn’t mentioned: you have to account for the recv() that yields the end of one message also to contain the start of the next one. Or possibly one or more complete messages followed by a partial one. Again the partial one may only be part of the message length.

- "No Bugs" Hare says
  
  March 9, 2017 at 9:33 am
  
  > there’s one other common mistake that wasn’t mentioned: : you have to account for the recv() that yields the end of one message also to contain the start of the next one
  
  From what I’ve seen, it is not that common, and is actually quite easily avoidable: at least if using the size-then-message format mentioned above – it is always possible to avoid reading too much (and that’s what most of the people I’ve seen, are doing after learning about stream nature of TCP).
  
  Sure, doing smaller reads makes the things not-100%-optimal (there will be unnecessary kernel calls) – but as we’re speaking about such simple mistakes as not reading the stream properly, I don’t think it is good idea to speak about perfectly-valid-but-sub-optimal stuff (at least not in the same breath).
  
  - David Turner says
    
    March 9, 2017 at 7:28 pm
    
    Ah, interesting. So you advocate starting with a recv(2) to get the length (handling the case where it only yields 1 byte by calling recv(1) of course) and then repeatedly recv() to get the rest of that message but nothing else?
    
    - "No Bugs" Hare says
      
      March 12, 2017 at 1:58 pm
      
      Not that I am necessarily advocating it, but (a) this is how I see people usually writing it; (b) it actually works (not 100% efficient, but not *too* bad either). 🙂