Packet Loss for an App-Level Developer. Part I. Router Failures, BGP Convergence Time, AQM, Traffic Shapers.

	Author:	“No Bugs” Hare
	Job Title:	Sarcastic Architect
	Hobbies:	Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

[[This is Chapter 13(f) from “beta” Volume IV of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the “release” book to those who help with improving; for further details see “Book Beta Testing“. All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

[[TODO: move to the beginning of Chapter 13]]

When speaking about programming a network-related app – usually very little attention is paid to the question “whether the app is going to run over the Internet, or locally in a LAN”. And in theory – indeed, there is no difference whatsoever: there is a long-standing perception that we should just use our TCP (HTTP, whatever-else), and “smart TCP/IP stack will do everything-we-need for us”.

In practice, however – it is not that simple; and whenever we’re starting to speak about low-latency scenarios such as those involved in games (or stock exchanges) – the statement above tends to fall apart really really quickly.

Fundamental Properties of the Internet

The effects I’ll describe below are all related to the nature of the Internet, and apply regardless of the transport-level and app-level protocols being used. As a result – for the time being we will NOT speak in terms of TCP or UDP; rather – we’ll concentrate on a good old IP packet (which belongs to Layer 3 a.k.a. L3 of the OSI/ISO networking model).

The IP packet, the whole IP packet and nothing but the IP packet, so save me Stevens

First of all, we need to mention two absolutely fundamental properties of the Internet

Everything on the Internet is transferred via IP packets.

Whatever we’re doing on the Internet – is implemented on top of exchanging IP packets. TCP streams are nothing but sequences of IP packets, UDP datagrams are just IP packets (with an 8-byte UDP header starting their payload), IPSEC VPN is yet another type of IP packet, and so on. Those few network-related protocols we’re using which do not fit into this pattern (such as ARP which effectively binds L3 to L2, or L2-and-lower Ethernet/Framerelay/ATM/SONET/…) – are inherently confined to our local networks (or “local” networks of respective ISPs¹). Or looking at it from a slightly different angle – if we want to send something between two boxes which are sitting at arbitrary points on the Internet – we MUST use IP packets to achieve it.

¹ They’re “local” even when they’re covering half of the globe – in a sense that none of them covers “the whole Internet”.

Each and Every IP Packet is Inherently Unreliable

The second fundamental property of the Internet says that

Each and every IP packet is in danger of being lost

All the routers, switches (actually – pretty much each and every device which forms Internet infrastructure) – are allowed to drop each and every packet. There is absolutely no way to tell them “hey, this packet is sooooo important, so you MUST deliver it – just this single packet, pretty please…”. This property simplifies implementation of the Internet greatly (and moreover – arguably, without it, the Internet wouldn’t be able to work as it works now) – but has very significant implications.

In particular, if we need reliable delivery of anything – it is a responsibility of the communication end-points to (a) detect that the packet is lost, and (b) to re-send the packet.

This is exactly what TCP or Reliable UDP is doing. However, it has further implications. More specifically, detecting a lost packet takes time, and re-sending it takes time too. This (combined with the fact that TCP is a stream, so out-of-order delivery is not really an option) is the primary reason why TCP tends to cause higher latencies (that is, in presence of lost packets) even after Nagle’s algorithm is disabled; more on it in [[TODO]] section.
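To get a feeling for how much latency “detect and re-send” can add, here is a minimal back-of-the-envelope sketch. It is NOT how any real TCP stack computes things: it assumes (my assumption, not a property of TCP) that each transmission attempt is lost independently with the same probability, and that the sender re-sends after a fixed timeout. The function name and the sample numbers are hypothetical.

```python
def expected_delivery_ms(one_way_ms: float, rto_ms: float, loss: float) -> float:
    """Mean time until a packet finally arrives, in a deliberately simple model.

    Assumptions (illustrative only): every transmission attempt is lost
    independently with probability `loss`, and the sender retransmits after
    a fixed timeout `rto_ms` (real stacks back off and use fast retransmit).
    """
    if not 0.0 <= loss < 1.0:
        raise ValueError("loss must be in [0, 1)")
    # The number of lost attempts before the first success is geometric,
    # with mean loss / (1 - loss); each lost attempt costs one timeout.
    expected_lost_attempts = loss / (1.0 - loss)
    return one_way_ms + rto_ms * expected_lost_attempts


if __name__ == "__main__":
    for loss in (0.0, 0.01, 0.05, 0.2):
        print(f"loss={loss:.0%}: {expected_delivery_ms(50, 200, loss):.1f} ms")
```

Note that the mean hides the really unpleasant part: the unlucky packets are delayed by at least a whole timeout, and with a stream everything behind them has to wait too.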

Channel-Level Retransmits

In some cases (especially if physical channel is expected to be very poor and/or unstable), channel-level protocol (usually at L2) can have its own L2-level acknowledgements and initiate a retransmit if the packet is not acknowledged (one example of such protocol is Bluetooth, which in turn is often used for tethering).

However, this trickery doesn’t affect the fundamental property of all packets being losable: even if one single channel provides this kind of guarantee, others normally don’t (and routers normally don’t provide guarantees about their processing abilities either). So, while indeed some of the channels over which your IP packets travel can be sorta-reliable – when speaking about the Internet as a whole, each and every IP packet still MUST be considered at risk of being lost.

Real-World Issues

Now, with these two fundamental properties in mind, we can start discussing more subtle issues of the real-world Internet. Let’s start with discussing things which happen “on the Internet itself” – or from another perspective, between ISP’s routers (i.e. not including connections such as connection from home modem to ISP, etc.).

Router Failures. BGP Convergence Time and Manually Handled Failures

Everything out there can fail, and with about 20-30 routers on a typical over-the-Internet path, chances of any one of them failing are rather high. What happens if a router somewhere on the Internet fails – is that adjacent routers will eventually understand that one of their peers has failed, and will switch to an alternative path to route the packets. For all the even half-decent ISPs, having that alternative path (known as “multi-homing”) has been a de-facto standard for at least 20 years. It means that most of the time, the effects of a failed router are limited to the (usually 100%) packet loss during that “detect-and-switch-to-alternative-path” window.

This time is known as “BGP convergence time”, with typical values being between 1 and 2 minutes. From our app-level developer perspective, it means that one common scenario will be 100% packet loss lasting between 1 and 2 minutes. And as you can imagine – it is going to be a very serious disruption for affected players.
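The “20-30 routers on the path” arithmetic can be sketched as follows. The per-router failure probability used in the example is purely hypothetical (I don’t have a real-world figure to offer), and the model assumes that routers fail independently, which is a simplification.

```python
def path_failure_probability(per_router_failure: float, hops: int) -> float:
    """Probability that at least one of `hops` routers on the path is down,
    assuming independent failures with the same per-router probability."""
    return 1.0 - (1.0 - per_router_failure) ** hops


def expected_affected(connections: int, path_failure: float) -> float:
    """Expected number of connections whose path currently contains a failed router."""
    return connections * path_failure


if __name__ == "__main__":
    # Hypothetical figure: each router is inside a failure window 0.001% of the time.
    p_router = 0.00001
    for hops in (20, 30):
        p_path = path_failure_probability(p_router, hops)
        print(hops, p_path, expected_affected(100_000, p_path))
```

The point is not the exact numbers, but that a per-router probability which looks negligible gets multiplied first by the number of hops, and then by the number of your simultaneous connections.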

While such disruptions are not going to be frequent for each of the customers/players, it is still important to account for them for two reasons:

- If you have 100K simultaneous connections – you’ll see some of it happening on a very regular basis. Not that it is a problem per se – but at least you’ll need to make sure that your Client behaves reasonably under such circumstances.
- Oh, and be prepared to answer claims coming from your players/customers such as “hey, the whole Internet works for me except for your site, so it is not my problem, so it MUST be yours”. Quite often such complaints come when a major router somewhere on the way goes down.

Worse than that – if such a failed router happens to be close to your Server (or it is a border router of a large ISP – which I’ve seen fail much more frequently than backbone routers) – you can experience such a 2-minute packet loss for many thousands of your connections at the same time. And (especially if you’re using your own protocols) – you’d better think about such scenarios and test your system against them. Once upon a time, one of my own systems exhibited a problem related to an O(N) lookup in a place where N was deemed to be very small; however, under one-way 100% packet loss over 10K Clients – it wasn’t really small, which unnecessarily increased recovery time from the problem by 20 minutes or so. I got lucky at that time – as effects could have been much worse.
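One way to “test your system against them” is to put a deliberately unreliable link between your Client and Server code in tests. Below is a minimal sketch of such a test double; the class name `LossyLink` and its interface are hypothetical (they are not from any real library), and time is passed in explicitly so that a 2-minute blackout can be simulated without actually waiting for 2 minutes.

```python
import random
from typing import Callable, Sequence, Tuple


class LossyLink:
    """Test double for a one-way link: drops packets at random and during blackouts."""

    def __init__(self, deliver: Callable[[bytes], None], loss: float = 0.0,
                 blackouts: Sequence[Tuple[float, float]] = (), seed: int = 0):
        self._deliver = deliver
        self._loss = loss
        self._blackouts = list(blackouts)   # [(start_s, end_s), ...] with 100% loss
        self._rng = random.Random(seed)     # seeded, so test runs are reproducible
        self.sent = 0
        self.dropped = 0

    def send(self, now_s: float, packet: bytes) -> bool:
        """Pass `packet` on unless it is dropped; returns True if delivered."""
        self.sent += 1
        in_blackout = any(start <= now_s < end for start, end in self._blackouts)
        if in_blackout or self._rng.random() < self._loss:
            self.dropped += 1
            return False
        self._deliver(packet)
        return True


if __name__ == "__main__":
    received = []
    # One-way 100% loss for 2 minutes, starting one minute into the test.
    link = LossyLink(received.append, blackouts=[(60.0, 180.0)])
    for t in range(240):                    # one packet per second for 4 minutes
        link.send(float(t), b"tick")
    print(link.sent, link.dropped, len(received))
```

Using a separate `LossyLink` for each direction lets you reproduce exactly the kind of one-way loss which bit me in the story above.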

In addition, after router failure, occasionally the switch-to-alternate-path logic doesn’t work for whatever reason;² if a large ISP (think Comcast or AT&T) is involved – this can easily lead to problems lasting from half-an-hour to 2-3 hours, and affecting as much as 20% of all your Client population at the moment. Smaller ISPs can have problems which last much longer (whole day is not the limit) – but as they affect much smaller population of your players/customers – it rarely causes system-wide problems.

Handling Mass Disconnects

Whenever a large ISP affecting 10-20% of your players fails – you might want to declare an emergency state of “mass disconnect” and handle this situation gracefully for those disconnected.

A real-world example. I’ve personally seen a turn-based-game-with-time-control implementing detection of such “mass disconnects” and saying something along the lines of “hey, an Internet-wide event occurred limiting access to a large portion of players; as a result – all the timeouts are increased by 10x to limit the damage to those disconnected”. BTW, this system was accepted very well by players. The logic behind such a system is two-fold:

- the problem is certainly beyond control of those affected, so there is no chance for abuse, and
- as a universal rule “for everybody”, such graceful handling is well-accepted because it can easily happen to anybody (i.e. each player thinks that she can be the one affected next time).

For other types of games – YMMV, but even for a game such as MOBA I’d still consider going as far as declaring all the current ranking matches as non-ranking – on the very same grounds as the turn-based game has increased their timeouts.
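A minimal sketch of how such “mass disconnect” detection might look is given below. I don’t know how the real-world system mentioned above was implemented, so everything here (the class name, the 2-minute window, the 10% threshold) is an assumption for illustration; only the 10x multiplier comes from the example itself.

```python
from collections import deque


class MassDisconnectDetector:
    """Flags a 'mass disconnect' when too many players drop within a short window."""

    def __init__(self, window_s: float = 120.0, fraction: float = 0.10):
        self.window_s = window_s      # hypothetical: look back two minutes
        self.fraction = fraction      # hypothetical: 10% of connected players
        self._events = deque()        # timestamps of recent disconnects

    def on_disconnect(self, now_s: float) -> None:
        self._events.append(now_s)

    def is_mass_disconnect(self, now_s: float, connected_before: int) -> bool:
        # Forget disconnects that fell out of the sliding window.
        while self._events and self._events[0] <= now_s - self.window_s:
            self._events.popleft()
        return (connected_before > 0
                and len(self._events) >= self.fraction * connected_before)

    def timeout_multiplier(self, now_s: float, connected_before: int) -> int:
        # 10x mirrors the real-world example; 1 means "business as usual".
        return 10 if self.is_mass_disconnect(now_s, connected_before) else 1
```

Here `connected_before` is the number of players which were connected before the trouble started; how exactly you track it is up to your Server. The important part is that the decision is made on aggregate numbers, never on a single player’s disconnect.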

Autonomous system (AS): “An autonomous system (AS) is a collection of connected Internet Protocol (IP) routing prefixes... that presents a common, clearly defined routing policy to the Internet.” (Wikipedia)

As noted above, for smaller ISPs – failures may be not that much of a problem for your whole ecosystem (both technically and psychologically). Still, you might want to detect “massive failures in IP clusters/AS ranges” – and, say, for a turn-based game – you could still try to increase timeouts (though just for those affected). Note that:

- You SHOULD explain to all the players what is going on (i.e. why this mode is introduced, or why the player gets special handling)
- Doing the same thing for individual players won’t fly – because it is very simple to build an abuse by simulating a network failure; however, when a hundred players from the same city/ISP are disconnected at the same time – chances for such an abuse go down tremendously.
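Detecting failures per IP cluster/AS can be sketched along the same lines, just with a counter per group. How you map a player’s IP address to a group key (AS number, ISP name, city) is outside this sketch and is assumed to be done by the caller; the class name, the window and the AS numbers in the usage example are hypothetical (the latter are taken from the range reserved for documentation).

```python
from collections import defaultdict, deque


class GroupDisconnectTracker:
    """Counts recent disconnects per network group (e.g. an AS number or ISP name)."""

    def __init__(self, window_s: float = 60.0, threshold: int = 100):
        self.window_s = window_s              # hypothetical sliding window
        self.threshold = threshold            # "a hundred players" at the same time
        self._by_group = defaultdict(deque)   # group key -> disconnect timestamps

    def on_disconnect(self, now_s: float, group: str) -> None:
        self._by_group[group].append(now_s)

    def affected_groups(self, now_s: float) -> set:
        """Groups with at least `threshold` disconnects within the window."""
        affected = set()
        for group, events in self._by_group.items():
            while events and events[0] <= now_s - self.window_s:
                events.popleft()              # drop events older than the window
            if len(events) >= self.threshold:
                affected.add(group)
        return affected


if __name__ == "__main__":
    tracker = GroupDisconnectTracker()
    for _ in range(100):
        tracker.on_disconnect(5.0, "AS64500")
    tracker.on_disconnect(5.0, "AS64501")     # a lone disconnect gets no special handling
    print(tracker.affected_groups(10.0))
```

Only players from the groups returned by `affected_groups()` would get the extended timeouts, which keeps a single player who pulled the cable on purpose out of the picture.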

Router/Link Overload

If we lived in a world where the only cause for lost packets is faulty routers, our life would be much better, I can assure you. However, packet loss is an inherent property of the modern Internet even when there are no faults – at the very least because of the Internet being heavily oversubscribed.

If we add up all the capacities sold to the end-users, they won’t fit into the Internet backbone by far (by at least 2 orders of magnitude³). This is quite a bit of oversubscription (to put it very mildly). In other words – if even 5% of us started using all the bandwidth which we’re paying for, at the same time – the Internet would be fatally overloaded.⁴
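The arithmetic behind the “5% of us” statement is simple enough to write down. In the sketch below, the 20:1 residential ratio is within the range from footnote ³, while the 5:1 ratio between ISPs is a made-up figure, chosen only to arrive at the “2 orders of magnitude” overall.

```python
import math


def total_oversubscription(*tier_ratios: float) -> float:
    """Overall ratio of sold capacity to real capacity; per-tier ratios multiply."""
    return math.prod(tier_ratios)


def offered_load_ratio(oversubscription: float, active_fraction: float) -> float:
    """Offered load divided by real capacity.

    `active_fraction` is the share of customers using all the bandwidth they
    pay for at the same time; a result above 1.0 means more traffic than the
    network can carry.
    """
    return oversubscription * active_fraction


if __name__ == "__main__":
    overall = total_oversubscription(20, 5)       # 20:1 residential, 5:1 upstream
    print(overall, offered_load_ratio(overall, 0.05))
```

With an overall 100:1 ratio, even 1% of customers going full throttle at the same time would already fill the pipes completely; 5% is several times more than the network can carry.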

As “how many customers are using the Internet at any given moment” is not exactly subject to the law of large numbers (in particular – because there are world-wide events), it means that from time to time certain routers (and/or certain links) on the Internet get overloaded. And when it happens – it is usually manifested as a certain queue of the router (either an incoming one – if it is router hardware which is overloaded, or an outgoing one – if it is a channel which is overloaded) becoming full. And as soon as the queue is full and a new packet arrives – well, the packet gets dropped. And as “each and every IP packet can be lost” – routers are perfectly within their rights to do it. Later on, this simplistic and obvious policy (which came naturally as first routers were implemented) got a fancy name of “drop-tail queue discipline”.
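The drop-tail policy is simple enough to fit into a dozen lines. The sketch below is a toy model (real routers count bytes and buffers, not Python objects), and the arrival/service rates in the usage example are made up.

```python
from collections import deque


class DropTailQueue:
    """Fixed-capacity FIFO that drops arriving packets when it is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._queue = deque()
        self.dropped = 0

    def enqueue(self, packet) -> bool:
        if len(self._queue) >= self.capacity:
            self.dropped += 1      # queue full: the newcomer (the "tail") is dropped
            return False
        self._queue.append(packet)
        return True

    def dequeue(self):
        """Next packet to put on the wire, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)


if __name__ == "__main__":
    q = DropTailQueue(capacity=50)
    for tick in range(100):
        for i in range(12):        # 12 packets arrive per tick...
            q.enqueue((tick, i))
        for _ in range(10):        # ...but the link can send only 10 per tick
            q.dequeue()
    print(len(q), q.dropped)
```

Note that nothing is dropped at all until the queue fills up; after that, everything above the link capacity is lost.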

From our developer’s perspective – such an overload will look like a certain percentage of the IP packets on the way between Client and Server being lost. And in case of such an overload – the percentage of IP packets lost can easily go as high as 20-30%.⁵ On the positive side, though, we can say that most of the time, such packet loss is probabilistic – so packets are usually not lost in bursts (well, as long as loss percentage is relatively low).

² From what I’ve heard – most of the time, it is some kind of misconfiguration; OTOH it doesn’t really matter too much for our current purposes

³ “normal” residential ISP oversubscription ratios are like 20:1 to 50:1, but then there are oversubscriptions between ISPs too

⁴ Ok, first it would be residential ISPs failing really badly, but as there is an oversubscription between these ISPs and their upstreams, and then their upstreams – they would get overloaded too

⁵ in theory – it can go all the way up to 90%+, but as far as I know, this is quite rare in practice.

AQM

For a while,⁶ the simplistic “drop-tail” policy (i.e. dropping the packets only when the router doesn’t really have any other options) was the only mechanism by which a non-faulty router was dropping packets.

However, with time it was realized that such a sharp difference between “good” router state (nothing-is-dropped) and “bad” router state (when quite a few packets are dropped) reduces overall bandwidth utilization (or, if looking at it from a bit different perspective – prevents ISPs from selling even more oversubscribed channels).

To deal with it – a bunch of techniques collectively known as “Active Queue Management” (AQM) was introduced.

In a nutshell, the idea behind the AQM goes along the following lines:

- Routers start dropping packets before the queue is full (though only when it is “loaded enough”).⁷
- AQM-induced packet loss always happens in a probabilistic manner, so it is a certain percentage of IP packets which gets dropped.
  - Percentages of packet losses for different classes of IP packets may vary depending on the IP header fields, such as “protocol” (TCP/UDP/ICMP/…), “DiffServ” and so on.
    - This, in turn, allows for prioritization of different classes of traffic. And this could be a really nice side effect of AQM – that is, if such prioritization would really occur on a wide scale (beyond common thrashing of ICMP into oblivion).
  - Exact algorithms vary quite a bit – but their differences are out of the scope of our discussion, and wouldn’t affect it much anyway. Examples of different “queue disciplines” include RED, RRED, CoDel – and a dozen others.
- TCP, when it observes a dropped packet, interprets it as an indication of congestion – and slows down.
  - These mechanisms have been built into TCP for a long while,⁸ though recently much more effort has been spent on TCP congestion control (with a dozen different algorithms developed – the most popular ones being CUBIC and Compound TCP) exactly to optimize throughput when facing an even-more-overloaded Internet (which is in turn enabled by AQM).
- As the vast majority of Internet traffic is TCP – the whole thing does indeed help with solving the problem of “improving average bandwidth utilization” (a.k.a. “how to allow ISPs to oversubscribe even more while keeping congestion in check”).
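To illustrate the “drop before the queue is full, and do it probabilistically” idea, here is a simplified sketch in the spirit of RED. It leaves out parts of the real algorithm (such as spacing the drops out based on the number of packets since the last drop), and the thresholds and the averaging weight are illustrative assumptions, not recommended settings.

```python
import random


def red_drop_probability(avg_queue: float, min_th: float, max_th: float,
                         max_p: float) -> float:
    """Simplified RED: no drops below min_th, a linear ramp up to max_p as the
    average queue approaches max_th, and every packet dropped at or above max_th."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def update_average(avg_queue: float, current_queue: int, weight: float = 0.002) -> float:
    """Exponentially weighted moving average of the queue length, so that
    short bursts do not trigger drops (the weight here is illustrative)."""
    return (1.0 - weight) * avg_queue + weight * current_queue


def should_drop(avg_queue: float, min_th: float, max_th: float, max_p: float,
                rng=random) -> bool:
    """Decide the fate of one arriving packet."""
    return rng.random() < red_drop_probability(avg_queue, min_th, max_th, max_p)
```

Compare it with drop-tail: instead of a sharp “nothing is dropped / everything above capacity is dropped” switch, the loss percentage grows gradually as the queue gets more loaded, giving TCP senders time to slow down.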

For us as developers, the main takeaway from AQM is that

It is likely that even if there is no overloaded router⁹ on the path between Client and Server, packets will still be lost.

From my experience, I can tell that this effect (of losing packets without pronounced channel overload) was much less prominent like 15 years ago, but it has a steady (though rather slow) tendency to increase.

AQM and Latencies

On the positive-for-gamedevs-side, in quite a few scenarios AQM can improve latencies. One common scenario goes as follows (and is especially important in practice for ISP-router-on-the-other-side-of-Client’s-Last-Mile):

- there is an overloaded router (or link) somewhere on the path
  - the traffic over the overloaded router/link is a mix of “heavy” TCP downloads and gaming traffic
- without AQM, the overloaded router/link would be hopelessly overloaded by those “heavy” TCP downloads
- with AQM, the affected router will start dropping packets a bit earlier (while there is still some space in its queue)
  - these “early” packet drops will cause those “heavy” TCP downloads to slow down
    - which, in turn, will improve the chances of non-download traffic (which is our gaming traffic) getting through (and without too much delay too).
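The “without too much delay” part is easy to quantify: a packet arriving at a router has to wait until everything queued ahead of it is sent. The sketch below shows the arithmetic; the link speed, packet size and queue lengths are hypothetical, picked only to show the orders of magnitude involved.

```python
def queueing_delay_ms(packets_in_queue: int, packet_bytes: int,
                      link_mbit_s: float) -> float:
    """Time a newly arrived packet waits behind the packets already queued."""
    bits_ahead = packets_in_queue * packet_bytes * 8
    return bits_ahead / (link_mbit_s * 1_000_000) * 1000.0


if __name__ == "__main__":
    # Hypothetical 10 Mbit/s link carrying 1500-byte packets.
    print(queueing_delay_ms(1000, 1500, 10))   # long queue kept full by downloads
    print(queueing_delay_ms(25, 1500, 10))     # queue kept short by early drops
```

With these made-up numbers, a queue of 1000 packets means 1200 ms of added delay, and a queue of 25 packets means 30 ms; for a game, this is the difference between “unplayable” and “fine”.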

These effects have been observed in practice – however, at the moment I cannot say how pronounced they are within modern Internet (and it is also not-so-important for our purposes, as from our app-level we cannot do much about it).

⁶ IIRC, until early 2000s

⁷ strictly speaking, AQM can issue “explicit congestion notifications” (ECN) before starting to drop packets, but in practice as of 2016 it still was happening only occasionally over the Internet

⁸ At least for 25 years

⁹ i.e. “router which has no space in its queue”

Traffic Shaping

A related, but quite different form of ISPs messing up Client’s connectivity, is so-called “traffic shaping”. It applies whenever there is a limit on end-user connectivity, imposed not by the “last mile” channel (i.e. “channel from end-user to ISP”) itself, but by some logic “behind” the “last mile” on the ISP’s side. These days, it often comes with some kind of arrangement such as “unlimited traffic” with a fine print saying that “after the first 100G of traffic per month, you’re restricted to 1Mbit/s connection”. Traffic Shaping comes into play to enforce this “1Mbit/s” limit.

In the past, when facing a Traffic Shaper, things often went really bad; IMO, two different processes contributed to the problems caused by Traffic Shapers being so severe:

- Quite a few Traffic Shapers of the past were operating along the lines of “having a pre-defined interval, calculating traffic allowance from the start of the interval, and starting to drop the packets when the allowance is exceeded”. If used with intervals significantly larger than RTT – it caused long packet loss bursts, which in turn exhibited pretty bad overall effects (generally much worse than probabilistic loss AQM-style).
- As of early 2000s, TCP wasn’t really good at operating over connections with more than 0.5%-or-so packet loss; and 7% loss was able to make those ancient TCP stacks pretty much unusable. Things were even worse in case of bursty packet loss which lasted longer than RTT.
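To see why the interval matters so much for such “allowance per interval” shapers, here is a toy simulation. It assumes a sender emitting exactly one packet per millisecond, and a shaper which lets through a fixed fraction of that rate; both assumptions (and all the numbers) are for illustration only, and real shapers are certainly more elaborate.

```python
def loss_bursts_ms(interval_ms: int, allowed_fraction: float,
                   duration_ms: int) -> list:
    """Lengths (in ms) of consecutive-loss periods under a fixed-interval shaper.

    The sender emits one packet per millisecond; the shaper counts packets
    from the start of each interval and drops everything once the allowance
    for that interval is used up.
    """
    allowance = int(interval_ms * allowed_fraction)   # packets allowed per interval
    bursts, current, used = [], 0, 0
    for t in range(duration_ms):
        if t % interval_ms == 0:
            used = 0                 # a new interval starts: the allowance is reset
        if used < allowance:
            used += 1                # packet passes
            if current:
                bursts.append(current)
                current = 0
        else:
            current += 1             # allowance exhausted: packet is dropped
    if current:
        bursts.append(current)
    return bursts


if __name__ == "__main__":
    print(loss_bursts_ms(1000, 0.5, 3000))   # long interval: long loss bursts
    print(loss_bursts_ms(50, 0.5, 200))      # short interval: short loss bursts
```

In both runs exactly half of the packets are lost; however, with a 1000ms interval they are lost in 500ms-long bursts (way longer than a typical RTT), while with a 50ms interval the bursts are only 25ms long.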

These days, though, the situation has significantly improved. While your Client may still experience regular packet loss bursts¹⁰ when its traffic runs over a Traffic Shaper – two improvements are usually observed: (a) the number of Traffic Shapers out there having atrociously long packet loss bursts has decreased,¹¹ and (b) at least modern TCP stacks tend to handle packet loss much more gracefully.

Actually, if a Traffic Shaper behaves nicely – it is pretty much indistinguishable from an AQM router; still, I cannot say that all currently existing Traffic Shapers are indistinguishable from AQM routers – some of them are not, and do cause quite a bit of pretty nasty packet loss bursts.

¹⁰ maybe “intermittent connectivity” is a better term?

¹¹ TBH, I don’t know how they’re doing it now, but my understanding is that even a simple decrease of that monitor-and-drop interval below roughly 100ms should do the trick

[[TODO: route change, different routes at the same time (the latter – allowed but quite uncommon)]]

[[TODO: priorities – incl. DiffServ and smaller packets]]

[[To Be Continued…

This concludes beta Chapter 13(f) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter 13(g), covering other ways packets can be lost – and how it affects app-level programming.]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.

Comments

wrdieter says

January 16, 2017 at 10:32 pm

The up side of AQM is lower latency over heavily loaded links. In some cases the improvement is two orders of magnitude (seconds of delay turn into tens of milliseconds).

Also, newer implementations of CoDel and FQ-CoDel (and possibly others) use ECN before dropping when ECN is enabled (which is unfortunately not that often).

- "No Bugs" Hare says
  
  January 17, 2017 at 10:12 am
  
  > The up side of AQM is lower latency over heavily loaded links. In some cases the improvement is two orders of magnitude (seconds of delay turn into tens of milliseconds).
  
  IIRC, there are at least two different phenomena related to AQM and latencies. First of all – as AQM effectively starts to slow down TCP before the queue becomes full – it can indeed allow to reduce latencies for traffic-competing-with-TCP (such as game traffic). OTOH, it comes at the cost of game UDP packets becoming at risk of being dropped (and this will hit observed latency). EDIT: I added a subsection on it, THANKS!
  
  In a separate development – AQM is often accompanied with reduced queue sizes, which also helps to reduce latencies – but TBH has very little to do with AQM itself (but rather with better overall understanding of the importance of interactivity for modern Internet – opposed to original vision of TCP as a thing-where-2-hour-delays-are-perfectly-normal).
  
  Did I miss anything?
  
  > Also, newer implementations of CoDel and FQ-CoDel (and possibly others) use ECN before dropping when ECN is enabled (which is unfortunately not that often).
  
  Yes – but from the app-level point of view the difference between AQM dropping or issuing ECN, is pretty much invisible.
  
Evgeny says

January 18, 2017 at 1:49 am

“Each and every IP packet can be lost” just does not sound right. “Each and every” means “all” like “literally all”, i.e. no packets can be delivered. On the other hand “Any IP packet can be lost” means exactly what you mean – a packet that cannot be lost does not exist. Subtle difference but still… maybe just me -)

“Quite often such complaints come from the major router somewhere on the way going down.” – probably you meant “… come when a major router …”, routers do not (usually) file any complains.

Thanks. Good article!

- "No Bugs" Hare says
  
  January 18, 2017 at 5:21 am
  
  > no packets can be delivered.
  
  I could argue about this one (along the lines that it is indeed one of possibilities indicated by “can”) – but I won’t, changing it into “each and every IP packet is in danger of being lost” 😉
  
  > probably you meant…
  
  Yep, thanks!
  
  P.S. A (rather lame, but I don’t have any better) excuse – this is just a “1st beta” of the book, so the language is indeed very poor :-(. I hope it becomes better by “3rd beta” – and then it’ll go into “line editing” and proofreading (who should be able to catch this kind of stuff).
  
Dave Taht says

August 24, 2019 at 5:25 am

Pretty good post. A few good things have happened since you wrote it. These two things made it into linux mainline:

fq_codel for wifi: https://www.usenix.org/system/files/conference/atc17/atc17-hoiland-jorgensen.pdf
sch_cake: https://arxiv.org/abs/1804.07617

as did fq_codel for freebsd.