Packet Loss for an App-Level Developer. Part II. Last Mile, Wi-Fi, and Summary

 
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
 
 

#DDMoG, Vol. IV
[[This is Chapter 13(g) from “beta” Volume IV of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing”. All the content published during Beta Testing is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

Last Mile

After having discussed the mechanisms of packet loss “on the Internet itself”, let’s take a look at another big source of packet loss – the infamous “Last Mile”.

Definitions of the “Last Mile” vary across the board. In the context of this book, we’ll use the term “Last Mile” to describe the connection between the home router of our Customer and the ISP.

As we defined it, the “Last Mile” usually goes over (A)DSL or over a cable modem ([PointTopic] gave these two technologies a combined share of about 75% of broadband access in Q1’2013). On the other hand – FTTx technologies are ramping up, and this is generally a Good Thing™ (see below about the problems with DSL and cable); however – don’t expect miracles from FTTx (in the long run, it will be subject to the same oversubscription problems as the others, so the overall improvement will likely be relatively mild). As for dial-up connections, they’re pretty much too dead to justify any commercial development aimed specifically at them (in the US and EU, dial-up usage went below 5% several years ago,1 so targeting it is rarely worth the trouble). From our network-app developer perspective, this means that we can safely assume at least ADSL for our customers (phew).

However, a more important (and a bit less obvious) property of the Last Mile is that in practice, whenever we’re speaking about a home-to-Server connection, most of the packet loss happens over the Last Mile. This tends to be a result of three separate issues:

  • Very often, the Last Mile connection just plain sucks from a physical point of view. This applies to (A)DSL technologies in spades – in particular, if “conditioning” of the phone line has been done poorly (which happens all the time), our Customer’s DSL connection will suck big time, causing lots of trouble.
    • Among other things, DSL-over-phone-lines is inherently vulnerable to burst losses due to electromagnetic interference on pretty-much-unprotected phone wires.
    • Another thing which indicates a poor connection is so-called “modem retrains” (such as “ADSL modem retrains”). These are essentially the same as good ol’ dial-up modem retrains – and tend to indicate significant problems with the physical link. At app-level, they are usually observed as 100% packet loss during the retrain (which can last for up to a minute or so).
  • Last Mile oversubscription. This is a problem more typical for cable connections (though – unfortunately – not restricted to them). As soon as several Customers share bandwidth (and all cable modems do share bandwidth by design) – well, as soon as the ISP connects enough people, they will start to compete for the bandwidth, damn. A characteristic symptom of such oversubscription is that your Customer’s Internet works very well in the morning, while barely crawling at night (when everybody and their dog are watching Netflix).
    • While this problem is indeed more typical for cable connections, similar symptoms can be seen over (A)DSL too. They can come either from electromagnetic interference between poorly shielded phone wires (and they are poorly shielded by design), or from oversubscription of the channel going from the first-router-on-ISP-side further upstream.
  • Overload of your Customer’s channel. While the items above were about residential ISPs being greedy (and/or having no idea what they’re doing2) – this one is an inherent problem which won’t go away (ever). As the capacity of the Customer’s channel is limited – whenever he tries to push more traffic through than this capacity allows, his channel becomes overloaded, with all the processes discussed in the “Router/Link Overload” section above kicking in. Very shortly: as all the packets won’t fit – some will get dropped (see the sketch below).
    • This is an especially common scenario when the Customer’s downstream channel gets overloaded with downloads/video streams/… . In this case, two effects can be observed: (a) packet loss (which in turn causes the Server’s TCP stacks to slow down – which may be a relief for UDP-based games(!)), and (b) delays (which are especially prominent if the ISP’s router has long queues, though there is a recent tendency toward reducing router queue sizes).

1 not to mention that people using dial-up are not really likely to run anything beyond a browser and e-mail
2 BTW, I do not mean that all residential ISPs are like this; there are good residential ISPs out there – but unfortunately, they’re few and far between. With hosting/server-side ISPs, my experience is completely the opposite – they tend to be professional (though quite a few are still trying to charge completely unjustified prices for their services)
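
To make the “packets won’t fit – some will get dropped” point concrete, here is a minimal sketch of the tail-drop behavior of a (grossly simplified) router queue; the queue capacity and burst size are purely illustrative, and real-world routers usually add AQM on top of plain tail-drop:

```
#include <cstddef>
#include <cstdio>
#include <queue>

struct Packet { int id; };

// Grossly simplified router queue: fixed capacity, plain tail-drop on overflow.
class TailDropQueue {
    std::queue<Packet> q_;
    const std::size_t capacity_;
public:
    explicit TailDropQueue(std::size_t capacity) : capacity_(capacity) {}

    // Called whenever a packet arrives from the (fast) upstream side.
    bool enqueue(const Packet& p) {
        if (q_.size() >= capacity_)
            return false;   // queue is full => the packet simply vanishes
        q_.push(p);
        return true;
    }
    // In a real router, dequeuing is driven by the (slower) downstream link
    // rate; it is omitted here, which models a burst arriving "all at once".
};

int main() {
    TailDropQueue link(64);        // illustrative 64-packet bottleneck queue
    int dropped = 0;
    for (int i = 0; i < 100; ++i)  // a 100-packet burst hits the queue
        if (!link.enqueue(Packet{i}))
            ++dropped;
    std::printf("dropped %d of 100 packets\n", dropped);  // "dropped 36 of 100"
}
```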

 

Sudden IP Change

Most of the time (these days – like “at least 95+% of the time”), your home Client won’t get a static IP – but rather a dynamic one. Moreover, depending on the ISP, they can change your IP as they wish. This can happen for a number of reasons – from a DHCP lease expiring, to PPPoE obtaining a new IP on reconnect – and depends heavily on ISP specifics; in my experience, the frequency of IP changes varies from “once per month” to “more than once per day” depending on the ISP.

If the PC-where-your-Client-runs is the one directly affected by an IP change – you might be able to detect it; however – if there is a NAT device on the way from your Client to your Server – such a sudden IP change can occur with absolutely no notification to the Client whatsoever. And for home users, in 98%+ of cases there is a NAT device on the way – which means that most of the time your Client won’t even know about the IP change, but all the packets coming from your Client to your Server will have their source IP address (and often also source port) different from what-they-had-just-10-ms-ago. Ouch.

Let’s note that such a “sudden IP change” is very different from a simple route change3 – and has much worse effects on connectivity. A route change doesn’t change the packets – it is just the way the packets travel that changes; as a result – protocols don’t really notice it much. Being very different, a “sudden IP change” does change the IP packets – which breaks the mechanics of almost all protocols out there. In most-but-the-most-latency-critical cases, route changes can be more-or-less ignored at app-level; for sudden IP changes the situation is very different – and usually requires app-level handling (that is, if you want to keep your customers/players happy).

Overall, sudden IP changes are not too bad – unless they happen while there are live connections. If there was a connection at the moment of the sudden IP change – all hell breaks loose.

The specifics of such a hell depend heavily on the nature of the protocol. For example, in case of a sudden IP change, any TCP connection will go dead in an irreparable way4 – and depending on many factors, it will either be reported as dead (most likely – via an RST packet), or it won’t; in the latter case – the connection may “hang” in such a dead state forever. Worse than that – it may very well be that in case of a sudden IP change, the TCP connection will be reported dead on the Client – and not on the Server (or the other way around).

For UDP-based connections, on a sudden IP change, packets will stop going through in both directions; whether you’ll get an ICMP saying that something went wrong – depends (i.e. “it may easily happen that you don’t”). And just as with TCP – it may be reported on one end of the communication, while staying unreported on the other one.
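
For our own UDP-based protocols, one way to survive a sudden IP change on the Server-Side (a technique of my choosing here – not the only possible one) is to identify sessions by an app-level session ID carried in every datagram, rather than by the source IP:port. A minimal sketch, with all the names below being illustrative:

```
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Source IP:port of a datagram, as seen by the Server.
struct Endpoint {
    uint32_t ip = 0;
    uint16_t port = 0;
};
inline bool operator!=(const Endpoint& a, const Endpoint& b) {
    return a.ip != b.ip || a.port != b.port;
}

class SessionTable {
    // app-level session ID -> endpoint the session was last seen from
    std::unordered_map<uint64_t, Endpoint> sessions_;
public:
    // Called for every incoming datagram. The lookup is keyed on the session
    // ID, NOT on the source endpoint - so a sudden IP change merely updates
    // the mapping instead of killing the session.
    void onDatagram(uint64_t sessionId, const Endpoint& from) {
        auto [it, isNew] = sessions_.try_emplace(sessionId, from);
        if (!isNew && it->second != from) {
            // Endpoint changed mid-session; in a real protocol we MUST
            // re-authenticate (e.g. via a MAC over the datagram) before
            // trusting the new endpoint - otherwise it is session hijacking.
            std::printf("session %llu migrated to a new endpoint\n",
                        (unsigned long long)sessionId);
            it->second = from;
        }
    }

    // Replies go to wherever the session was last (validly) seen from.
    const Endpoint& replyEndpoint(uint64_t sessionId) const {
        return sessions_.at(sessionId);
    }
};
```

Conceptually, this is similar to what QUIC achieves with its connection IDs.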

As a result, I strongly suggest dealing with “hanged connections” (including, but not limited to, those caused by “sudden IP changes”). And the best way I know of doing it goes along the following lines (two sketches follow a bit below):

  • detect the connection being explicitly dropped (either via TCP RST, or via ICMP-in-response-to-UDP) – and re-connect
    • if your game is at least somewhat latency-sensitive – make sure to handle those ICMP-in-response-to-UDP packets. While there is a lot of advice out there (in particular, on Stack Overflow) claiming that such ICMP is useless – it is not; allowing your player to reconnect a tad earlier, when we already have sufficient information, is a thing which we SHOULD do for our players (at least if we care about them). OTOH, detecting ICMP doesn’t mean that we can skip keep-alives – see below.
  • Have app-level keep-alive packets to track whether the connection is really alive – and reconnect when the timeout is exceeded.
    • Keep-alive packets should be sent in both directions, and the timeout should be detected by both Client and Server
    • For latency-critical scenarios – an option of “opportunistic reconnect” can be interesting (as described in [[TODO]] section below).

BTW, the logic above addresses not only “sudden IP changes” – but pretty much any “hanged” connection – which makes it even more important.
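
To illustrate the ICMP point above: at least on Linux, if your UDP socket is connect()-ed, the kernel reports an incoming ICMP error back to the app as an errno on a later send()/recv() call – no raw sockets required. A minimal POSIX sketch (error handling trimmed, and the function name is mine):

```
#include <cerrno>
#include <cstdio>
#include <sys/socket.h>
#include <sys/types.h>

// Returns true if recv() failed because an ICMP error (e.g. "port
// unreachable") came back for this connect()-ed UDP socket - in which case
// we can start reconnecting immediately, instead of waiting for a timeout.
bool icmpSaysReconnect(int udpSock, char* buf, size_t bufLen) {
    ssize_t r = recv(udpSock, buf, bufLen, 0);
    if (r >= 0)
        return false;               // a normal datagram; nothing to do here
    if (errno == ECONNREFUSED) {    // pending ICMP error reported by kernel
        std::fprintf(stderr, "UDP peer unreachable - reconnecting\n");
        return true;
    }
    return false;                   // EAGAIN etc.; handled elsewhere
}
```

And here is a transport-agnostic sketch of the keep-alive bookkeeping; both Client and Server would run one such tracker per connection, polling it from their event loops. The timeout values are purely illustrative – tune them to your game’s latency budget:

```
#include <chrono>

// Per-connection keep-alive bookkeeping; poll it from the event loop.
class AliveTracker {
    using Clock = std::chrono::steady_clock;
    Clock::time_point lastReceived_ = Clock::now();
    Clock::time_point lastSent_ = Clock::now();
public:
    // Illustrative values only - tune to your game:
    static constexpr std::chrono::seconds kKeepAliveEvery{1};
    static constexpr std::chrono::seconds kDeclareHangedAfter{5};

    void onPacketReceived() { lastReceived_ = Clock::now(); }  // any packet counts
    void onPacketSent()     { lastSent_     = Clock::now(); }  // incl. keep-alives

    // Nothing sent recently? Send a keep-alive, so that the tracker on the
    // other side doesn't time out on us.
    bool shouldSendKeepAlive() const {
        return Clock::now() - lastSent_ >= kKeepAliveEvery;
    }
    // Nothing received for too long? Declare the connection "hanged" and
    // initiate a re-connect (opportunistic or otherwise).
    bool isHanged() const {
        return Clock::now() - lastReceived_ >= kDeclareHangedAfter;
    }
};
```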


3 though it is perfectly possible that both a sudden IP change and a route change happen at the same time
4 you can still establish a new TCP connection

 

Wireless Last Mile

Recently, wireless (usually LTE-based) Last Mile connections have emerged. These do work with varying success – though they tend to exhibit all the problems described above. In addition – they’re often affected by the current weather (in particular, fog tends to hit them a lot).

Still, if the landline is particularly bad (like my current one) – an LTE-based Last Mile can behave better latency- and packet-loss-wise.

Wi-Fi: a Mile beyond the Last One

When speaking about Internet connectivity for games – we shouldn’t forget about the now-ubiquitous Wi-Fi. From the point of view of packet loss and latencies, Wi-Fi IMO usually qualifies as a “relatively mild annoyance”; when working under more-or-less good conditions – it tends to add single-digit milliseconds to the RTT, and well-below-1% packet loss. On the other hand – there are quite a few things which tend to affect its performance.

Interference

When speaking about Wi-Fi, IMO the most annoying thing is interference from competing Wi-Fi stations – as well as from other devices working in the same frequency range (including such weird things as microwave ovens, which also use 2.4GHz frequencies).

Oh, and to make things even more complicated – these effects tend to manifest themselves much more severely within apartment buildings than within standalone houses. The reason is quite obvious: the more crowded with electronic devices the area is – the more interference Wi-Fi will experience, plain and simple.

Penetration and Reflection

Another issue related to Wi-Fi is its dependence on configuration. In general, Wi-Fi does like direct line of sight – and anything less than that can easily affect Wi-Fi performance, causing extra packet loss etc. It is next-to-impossible to estimate the exact behavior of Wi-Fi (beyond configurations with direct line of sight5); however, the impact of complicated configurations can easily push packet loss into the range of 10% or so – and it won’t be pleasant.

Technically speaking – the Wi-Fi signal is subject to both penetration losses and reflection (for realistic measurements of these effects for different types of materials – see [[TODO]]). The ugliest things, however, tend to be either metallic surfaces (causing lots of reflection – and metallic meshes such as those used in reinforced concrete often aren’t that much different) – or multi-layer structures (with internal reflections on intra-material boundaries being further absorbed and completely useless – unlike external reflections, which have a chance to carry a meaningful signal).

2.4 GHz vs 5 GHz

As a rule of thumb – 5GHz Wi-Fi tends to perform significantly better than 2.4GHz. This can be attributed to several separate phenomena:

  • Wider band available at 5GHz (the 2.4 GHz band can actually host only 3 non-overlapping Wi-Fi channels, ouch)
  • Currently, most routers are still 2.4 GHz, so the 5GHz range is less crowded. However, this is going to change in the future.
  • Less interference from non-Wi-Fi devices. Right now, microwave ovens are at 2.4 GHz – though strictly speaking, some of them might move to the 5GHz band in the future6 (OTOH, at the moment there is no apparent reason for manufacturers to undertake this effort, so it might stay this way for a rather long while – or forever, whichever comes first).
  • 5GHz has worse penetration than 2.4GHz (both through air and through walls); while this is a headache if you’re not within direct view of your Wi-Fi router – it helps to reduce unwanted interference from your neighbors.

On the other hand – as already noted above, 5 GHz has worse penetration than 2.4 GHz, which can easily cause trouble whenever your player goes out of direct line of sight. Still, unlike interference, this one is within your customer’s power to address.


5 strictly speaking, even line of sight can be non-obvious – especially if there are lots of metal surfaces around
6 in spite of reports of 2.4 GHz being a resonant frequency of water molecules – it doesn’t look so; but TBH, if you’re really interested in it, it is better to ask somebody with more hands-on knowledge of physics than me.

 

Wi-Fi Extenders

Sometimes it happens (especially in the US) that the range of a Wi-Fi router is not sufficient to cover the whole house. In this case, a “Wi-Fi Extender” can be used.

On the other hand, classical Wi-Fi extenders use the same radio channel for receiving the packet – and for retransmitting the packet further. This increases contention on the radio channel – and causes increases in packet drops and latencies.

To avoid these effects – technology such as Netgear Orbi looks very promising.7 Very roughly, the idea behind Orbi goes as follows: the Orbi router – having more than one transmitter – uses a different frequency for communications between the “main” router and the extender (see, for example, [Higgins]). This, in turn, means significantly reduced contention (both on the channels-going-from-routers-to-PCs, and on the channel-used-between-router-and-extender).

Summary of Packet Loss Mechanisms – and Their Observable Effects

Ok, we’ve discussed quite a few effects which can cause packet loss and delays between a home-based Client (such as a Game Client) and the Server. Now – let’s see how they can be observed from our developer’s point of view.

Phenomenon – Commonly Observable As…

  • Internet router failure – 100% packet loss for any number of your customers (from “1” to “all of them”). Usually recovers within 1–2 minutes, though in some weird cases it can go on for hours.
  • Route change – sharp change in connection parameters, including delays and random packet loss.
  • Router/Link overload (with queues being full) – random packet loss of 10%+, significantly increased latencies.
  • Router loaded but not overloaded, with AQM kicking in – random packet loss below 1%, somewhat increased latencies.
  • Traffic Shaper on the way – either AQM-like effects, or burstable packet loss (with bursts lasting hundreds of milliseconds).
  • Last Mile electromagnetic interference (especially over ADSL) – burstable packet loss.
  • Last Mile oversubscription – significant (up to 10% and beyond) and/or burstable packet loss.
  • Modem retrain – 100% packet loss for up to a minute, then back to normal. Has a tendency to repeat.
  • Heavy download (overload of the Last Mile downstream) – effectively a case of Link overload, with the same effects.
  • LTE Last Mile in poor weather (fog or rain) – increased packet loss.
  • Sudden IP change – changed source IP address for packets coming from the Client to the Server. Often leads to “hanged” connections.
  • Poor Wi-Fi – packet loss; may be affected by such minor things as some door being open/closed, or by somebody standing in the “preferred” path of the radio waves.
  • Wi-Fi interference – burstable packet loss. May be caused by a neighbor preparing their breakfast.

If we take a look at the right-hand side of each item above – we’ll see that most of the observable problems boil down to one of the following:

  • 100% packet loss
    • a 100% packet loss event can easily affect a damn large portion of your players (up to 30%, and this is even when your own Server-Side hosting ISP is working perfectly)
    • more often than not – it will recover within 1–2 minutes. This is especially true when multiple players are affected. However, I’ve seen pretty much the whole of Comcast going down for more than an hour, weird.
  • Random packet loss. Each packet has a chance to be lost, with packet losses being more-or-less independent.
  • Burstable packet loss (i.e. losing a few packets within 0.1-5 seconds – and then going back to normal)
    • Often comes with a significant chance of repetition.
  • “Hanged” connections. NB: As of now, I am not really sure that “sudden IP change” is the only cause of “hanged” connections; what I am sure about, however – is that “hanged” connections DO happen, and happen a lot.

As we can see – an abstract “packet loss” number (which is usually quoted with a silent assumption that all the losses are independent) – is certainly not sufficient to describe all the ways in which the connectivity infrastructure can cause us trouble. Taking into account all the troubles listed above means, IMNSHO, at least the following:

  • We (as app-level developers) MUST handle “hanged connections”. It is possible – and doing it without forcing our customer/player to close-and-reopen the app (potentially losing a game meanwhile) is the minimum professional courtesy (just as for doctors – while we cannot fix all the problems, causing unnecessary pain should still be a big no-no).
  • We also SHOULD prepare to handle “mass disconnects” – as described above.
    • At the very least – we need to be ready to handle an all-of-a-sudden disconnect-then-reconnect of a whole bunch of players (which, in turn, can easily cause problems with public-key-crypto-based stuff such as TLS).
    • In addition, it might be a good idea to provide some kind of “safety net” for your players in such cases (mitigating the impact of the mass disconnect on them). Examples of such handling may include very different things, from increasing time-outs to making all affected matches non-ranking ones.
  • When testing your app in-lab – we should make sure to go beyond a simple “probabilistic packet drop with drops being independent”, and to simulate all the patterns mentioned above – at the very least to make sure that your Server handles them without crashing (see the sketch after this list).
  • We should make sure to test our app outside of the lab (over the worst possible link); once upon a time I myself used AOL for this purpose (and it worked great to reveal lots of problems in the app), and I’ve heard of a major AAA gamedev company using the Internet connection from a Whistler cabin for the same purpose. The specifics don’t matter too much; the only thing we can be sure about is that a poor real-world Internet connection will throw a lot of problems at our app – and then some more; testing over such a connection in advance is likely to reveal quite a few of them (and will save you quite a few bits of embarrassment).
  • In addition, these patterns need to be taken into account when developing your own protocols – and when analyzing the behavior of existing ones (such as TCP) over the real-world Internet.
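
To make the in-lab testing point concrete: a classical way to produce “burstable” (rather than independent) packet loss is a Gilbert-Elliott-style two-state model – my choice here, not something mandated by the text above. All parameters are illustrative and should be calibrated against real-world traces:

```
#include <cstdio>
#include <random>

// Two-state (Gilbert-Elliott-style) loss model: a mostly-clean "good" state,
// and a "burst" state with heavy loss, so that drops come in clusters.
class BurstyLossModel {
    std::mt19937 rng_{std::random_device{}()};
    std::uniform_real_distribution<double> u_{0.0, 1.0};
    bool inBurst_ = false;
    // Illustrative parameters - calibrate against your own measurements:
    static constexpr double kEnterBurst = 0.0005;  // bursts start rarely
    static constexpr double kLeaveBurst = 0.05;    // avg burst ~20 packets
    static constexpr double kDropGood   = 0.001;   // ~0.1% background loss
    static constexpr double kDropBurst  = 0.7;     // heavy loss inside a burst
public:
    bool shouldDrop() {                  // call once per packet to be sent
        if (inBurst_ ? (u_(rng_) < kLeaveBurst) : (u_(rng_) < kEnterBurst))
            inBurst_ = !inBurst_;        // state transition
        return u_(rng_) < (inBurst_ ? kDropBurst : kDropGood);
    }
};

int main() {                             // quick sanity check of the model
    BurstyLossModel model;
    int dropped = 0;
    const int total = 1000000;
    for (int i = 0; i < total; ++i)
        if (model.shouldDrop()) ++dropped;
    std::printf("dropped %d of %d packets\n", dropped, total);
}
```

Hooking such a model into a simple UDP proxy (or into your own socket wrapper) already gets you much closer to the field behavior described above than an independent-drop simulator does; 100%-loss events and “sudden IP changes” can be scripted on top of it.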

[[To Be Continued…

This concludes beta Chapter 13(g) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with stock exchanges in between)”. Stay tuned for beta Chapter 13(h), about not-so-well-known peculiarities and mechanics of TCP (note: it won’t be an umpteenth discussion of the 3-way handshake, I promise).]]



Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.
