(Almost-)Zero-Additional-Latency UDP-over-TCP

 
UDP over TCP


It is rather well-known that TCP latency kinda suxx – and we have discussed in detail why this is the case, above (in particular, head-of-line blocking is a Big Enemy of latency).

However, it is really important to understand what exactly is meant by “TCP latency is bad”. Actually, on closer examination we will see that at least two different things can be understood by this umbrella statement. The first one is that

We cannot use convenient TCP-like reliable-streaming API without incurring a latency penalty.

Let’s note, however, that this effect can be at least partially mitigated by using Reliable UDP (RUDP). For the time being, I won’t concentrate on RUDP (and on whatever-we-could-get-from-it) – and will just say that if we use reliable-and-ordered RUDP, it will still suffer from Head-of-Line blocking (that’s by definition) – and as a result, any gains from using reliable-and-ordered RUDP will be rather mild compared to TCP1 (and if it is some different flavor of RUDP – we’re not really using TCP-like reliable-streaming APIs anymore).

The second potential – and quite wrong at that – statement says that

Whenever we don’t have UDP (for whatever reason) and are forced to use TCP – we’ll inevitably get a latency penalty.

While the statement above can look obvious, and what-I-am-about-to-say may sound like obvious heresy – I am about to demonstrate

How to carry UDP-style packets over TCP without incurring (much of) additional latencies.

1 Sure, exponential back-off can be turned off in RUDP, and more aggressive RTO strategy can be used too – but this is still peanuts compared to what can be gained by using, say, state sync algorithms.

 

Why would we want it?

Why would we want such a thing? It is simple. Let’s consider the following scenario:

  • We write our app in UDP style (using anything we like – including state sync algorithm such as the one described in Vol. I’s chapter on Communications and in [Fiedler])
  • Then we try to run it on real-world Clients – and we realize that:
    • For some of the Clients – UDP just doesn’t work because of some weird firewall between Client and Server (hotel and airport connections are particularly guilty of having such firewalls; some mobile connections are reported to have this problem too). Overall, according to [QUIC], about 6 to 9% of all Google users cannot reach Google servers via UDP.
    • For some of the Client-platforms-we-need – UDP is not supported at all.
      • In particular, as of now, browsers have this problem. NB: I am not sure whether the approach I am describing will work over WebSockets – at the very least it depends on the question of “whether different WebSockets use the same TCP connection or different ones” (and we need different TCP connections for the algorithm below to work). While reportedly ‘force new connection’:true should do the trick – I have no idea how universal this solution is (and no idea whether there are some peculiarities of WebSockets which prevent the approach below from working). [[BTW – if you find out whether the algorithm below can be made workable with WebSockets – please let me know.]]

Whether it is worth the trouble – depends on your needs, but IMO it is still nice to know that whenever you need it – such an option of tunneling UDP-like packets over TCP (and without incurring a latency penalty) does exist.

 

The Algorithm

Actually, the idea is dead simple:

  • We have N TCP connections between Client and Server
    • For each of these TCP connections – there is a “packet en route” flag. This flag is set to false after the TCP connection is established.
  • On the sending side, whenever a UDP packet comes in – we send it over one of the TCP connections which does not have the packet-en-route flag set. Also, we set the packet-en-route flag for the TCP connection over which we sent the packet.
    • Of course, we still need to wrap the packet to send it over TCP (to denote the packet size and/or boundary)
  • On the receiving side – whenever we get an incoming UDP-over-TCP packet – we send an app-level acknowledgement (as small as 1 byte) back to the sender, over the same TCP connection where we got the incoming message.
  • On the sending side, when we get this app-level acknowledgement – we reset the packet-en-route flag

The description above is sufficient to understand how the algorithm works – though of course, to become practical, it will need other stuff (such as detecting a connection being “hung” – for example, via app-level keep-alives, or simply via having had to wait for the app-level ack for too long – creating new connections to replace “hung” ones, using the same N TCP connections for traffic in both directions, auto-detecting the number N, and so on). Still, all this other stuff seems to be perfectly viable. A minimal sketch of the sending-side bookkeeping follows below.
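
To make the bookkeeping above more tangible, here is a minimal sending-side sketch in C++ over plain POSIX sockets. Everything in it (the class name UdpOverTcpSender, the 2-byte length prefix used to wrap each packet, the return-minus-one convention) is merely my illustration of the algorithm above, not code from any existing library; a production version would also need the keep-alives, “hung”-connection detection and reconnects just mentioned.

#include <cstddef>
#include <cstdint>
#include <vector>
#include <sys/socket.h>
#include <sys/types.h>

struct TcpChannel {
    int fd = -1;                   // already-connected TCP socket
    bool packet_en_route = false;  // set on send, cleared by the 1-byte app-level ACK
};

class UdpOverTcpSender {
public:
    explicit UdpOverTcpSender(const std::vector<int>& connected_fds) {
        for (int fd : connected_fds) channels_.push_back(TcpChannel{fd, false});
    }

    // Sends one "UDP-style" packet over some free TCP connection; returns the
    // channel index used, or -1 if no channel is currently free (the caller may
    // then drop the packet, or wait a few ms - see the discussion in the comments).
    int send_packet(const std::uint8_t* data, std::uint16_t len) {
        for (std::size_t i = 0; i < channels_.size(); ++i) {
            if (channels_[i].packet_en_route) continue;
            // 2-byte length prefix marks the packet boundary within the TCP stream
            std::uint8_t hdr[2] = { std::uint8_t(len >> 8), std::uint8_t(len & 0xff) };
            if (!send_all(channels_[i].fd, hdr, sizeof(hdr)) ||
                !send_all(channels_[i].fd, data, len))
                return -1;  // real code would treat this connection as broken and reconnect
            channels_[i].packet_en_route = true;
            return static_cast<int>(i);
        }
        return -1;  // no free channel right now
    }

    // To be called when the 1-byte app-level ACK arrives on channel #idx
    void on_ack(std::size_t idx) { channels_[idx].packet_en_route = false; }

private:
    static bool send_all(int fd, const std::uint8_t* p, std::size_t n) {
        while (n > 0) {  // loop to handle partial send()s
            ssize_t rc = ::send(fd, p, n, 0);
            if (rc <= 0) return false;
            p += rc;
            n -= static_cast<std::size_t>(rc);
        }
        return true;
    }

    std::vector<TcpChannel> channels_;
};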

Now, let’s see why this thing will work (almost) without incurring the additional latencies typical for TCP. The answer is simple –

That’s because we always send a packet only over those connections which we’re sure have no outstanding packets.

It means that whenever we’re sending the packet, several properties are guaranteed to hold:

  • there is nothing to block us (so head-of-line blocking doesn’t apply).
  • there are no outstanding retransmits (i.e. everything which went to the other side has already been acknowledged). This means that exponential back-off doesn’t apply either.2
  • there are no outstanding bytes. This has an important consequence:
    • As “slow start” and most3 congestion avoidance algorithms work in terms of “congestion window” – which in turn is expressed in terms of outstanding bytes – it seems that we won’t run into problems with slow-start/congestion-avoidance either (at least as long as our sends() are small enough to fit into MSS – and most of the time MSS is at least 1200-1400 bytes – once again making it a direct analogy to typical UDP packet limits).
  • At least in theory, even the Nagle algorithm shouldn’t hurt us under these circumstances (though I’d still probably disable it just in case – see the snippet below).
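
For the record, disabling Nagle is a one-liner with the standard TCP_NODELAY socket option (shown here purely as the just-in-case measure from the last bullet; nothing below is specific to this schema):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disables Nagle's algorithm on an already-created TCP socket.
static void disable_nagle(int fd) {
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}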

Bingo! We’ve eaten our cake (got our UDP-over-TCP) – and have it too (got no additional latencies).


2 Even if there is “residual” RTO (i.e. if RTO has been previously doubled per RFC2988 section 5.5) – the first packet will be sent right away anyway, and we don’t care much about subsequent retransmits
3 if not “all”

 

Calculating N

However, before we can claim it as a win, there is still a very important consideration here. In particular: this algorithm is realistically usable only if the number N is small enough (in other words, if we’d need 100 TCP connections per user – it will probably start causing problems).

Let’s make some back-of-the-envelope calculations. For example, for a game with 20 “network ticks” per second (which is quite typical), and for a Client with 100ms RTT (which is more than enough for intra-continent stuff) – under ideal circumstances we’ll need only 0.1/(1/20)=2 TCP connections for the algorithm above to work. In practice – we’ll certainly need to add at least another 2 connections to account for packet loss (which will result in some connections keeping their packet-en-route flag set longer than usual). Still, 4 or 5 TCP connections seem like a reasonably good starting point for such 20 network-ticks/second, and 100 ms RTT.
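
The same back-of-the-envelope estimate, expressed as a tiny helper (the function name and the default reserve of 2 extra connections are just my reading of the reasoning above, not a hard rule):

#include <cmath>

// Rough starting value for N: a packet keeps a connection "busy" for about one
// RTT, and we emit a new packet every 1/ticks_per_second seconds.
static int starting_n(double rtt_seconds, double ticks_per_second,
                      int reserve_for_packet_loss = 2) {
    int ideal = static_cast<int>(std::ceil(rtt_seconds * ticks_per_second));
    return ideal + reserve_for_packet_loss;  // e.g. 0.1 * 20 = 2, plus reserve => 4
}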

Other scenarios and likely-to-be-reasonable values of N:

  • 20 network ticks/second, 200ms RTT (this is enough to have global RTT). Reasonable starting N for this case is probably around 6-7 (4 in ideal case, 2-3 reserve).
  • 60 network ticks/second, 100ms RTT. Reasonable starting N: 9-10 (6 for ideal case, 3-4 reserve).

As we can see, even in more stressful scenarios N is expected to be relatively small. Now, let’s see where having multiple connections will hurt us:

  • more memory used on the Server-Side per player. This one can be partially mitigated by reducing the size of the TCP buffers (see the snippet below). With the schema above – we’re not likely to need more than 4K per TCP connection, so to run 1’000 players per 2S/2U server (which is a kind of “industry standard” number for 2017), we’d need around 4K/TCP * 10 TCP/player * 1000 players/server = 40MBytes of RAM, which is pretty much nothing by today’s standards. Even if speaking about a “really communications-heavy” server with 10K players – it is still a mere 400MBytes.
  • More TCP ports used. However – on the Server-Side we won’t see these additional ports (they will still be terminated by the same Server-Side TCP port), so I don’t expect it to be a real problem.
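
As for reducing the TCP buffers mentioned in the first bullet – one possible way is via the standard SO_SNDBUF/SO_RCVBUF socket options; note that the exact effect is OS-specific (for example, Linux doubles the requested value and enforces its own minimums), so the 4K figure below is only a ballpark:

#include <sys/socket.h>

// Caps the kernel send/receive buffers for one of the per-player TCP connections;
// with our always-(almost-)empty channels, a few KBytes per direction should do.
static void shrink_tcp_buffers(int fd, int bytes = 4 * 1024) {
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}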

As we can see, the analysis and the numbers above seem to indicate that UDP-over-TCP using the schema above is usable (no warranties of any kind, batteries not included). Moreover, as discussed above – we do NOT intend to use this thing for all the players – but just for those platforms (and/or firewalled users) which don’t have UDP at the moment. As a result – the overall impact of using additional resources for multiple connections will most likely be scaled down even further.

Disclaimer and Conclusion

Beware of bugs in the above code; I have only proved it correct, not tried it.

— Donald Knuth —

Now, a necessary disclaimer:

Beware of problems with the algorithm above. I have only kinda-demonstrated its correctness, not tried it.

However, assuming that the analysis above stands – it opens the door to the following approach for writing games:

  • Write all your time-critical protocols using UDP; this includes both different flavours of RUDP and state sync algorithms
  • However, for those players who cannot use UDP for whatever reason – your code can use UDP-over-TCP as an (almost-)no-added-latencies backup.
    • An additional disclaimer: at least in case of firewalls, some of them will do weird things with your TCP stream (in particular, re-assembling it and forwarding the stream further only after re-assembling) – and the algorithm above will not be able to deal with them. Still, from my experience I feel that such firewalls are not that frequent (though as always, YMMV, and batteries are not included).

Phew, I think it is enough for today. And BTW, if you see any mistakes in the reasoning above (or have tried it in reality and got either positive or negative results) – please LMK.

[[To Be Continued…

This concludes beta Chapter 13(j) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with stock exchanges in between)”.

Stay tuned for beta Chapter 22, where we’ll start discussing all the boring stuff about debugging, tracing, logging – and not-so-boring post-factum beta/production debugging]]

 


Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.


Comments

  1. David Turner says

    Seems a nice idea.

    What do you do if all your connections have the packet-en-route flag set? Do you open another connection? How about speculatively opening more connections when you start to run out, in order to ensure you have some spare ones? Or send it down an in-use connection and take the latency hit?

    • "No Bugs" Hare says

      > What do you do if all your connections have the packet-en-route flag set?

      This is a tough one: ideally, it shouldn’t normally happen.

      > Do you open another connection? How about speculatively opening more connections when you start to run out, in order to ensure you have some spare ones? Or send it down an in-use connection and take the latency hit?

      Sure, we need to have some spare ones (if we need 2 just to cover ideal case – I suggested to have another 2 in reserve). OTOH, opening new ones forever-and-ever is rarely a good idea (when all connections, including reserve, are being stuck – we’re probably dealing with a temporary “black hole”); I’d rather play around the “if we have stuck connection – try ‘opportunistic reconnect'” (described, for example, here: http://ithare.com/tcp-and-websockets-for-games/ ) – without increasing overall number of connections too much. Still, it is a field for further research :-).

      • Andrew Armstrong says

        Why not just discard the packet? UDP is potentially lossy, so dropping packets during periods of poor network performance shouldn’t impact the protocols built on top.

        • "No Bugs" Hare says

          > Why not just discard the packet?

          EDIT: at this point, I misunderstood the question – please see another reply below.

          • Jesper Nielsen says

            I think what he’s trying to say is, that if all of the connections are busy then just “drop” the packet before even sending it.

          • "No Bugs" Hare says

            Yes, this could happen to be a good option (as long as other mechanisms are present to detect and restore “hung” connections).

  2. David Turner says

    One other cute observation is that this way you can send ‘packets’ that are much larger than you would be able to over UDP. Of course they’d be fragmented, but even with fragmentation you can’t send a UDP datagram larger than 64k.

    • Jesper Nielsen says

      The maximum ‘packet’ size you would be able to send like this would be the size of your output buffer if I understand this correctly.

      • "No Bugs" Hare says

        Yes, but if fighting for latencies – I’d rather not try sending stuff which is larger than MSS (usually within 1.4K).

  3. Jesper Nielsen says

    >However – on the Server-Side we won’t see these additional ports (they will be still terminated by the same Server-Side TCP port), so I don’t expect it to be a real problem.

    I learned something new there for sure:) I wasn’t aware that the server port of an accepted connection didn’t change to a dynamic port.

    >An additional disclaimer: at least in case of firewalls, some of them will do weird things with your TCP stream (in particular, re-assembling it and forwarding the stream further only after re-assembling) – and the algorithm above will not be able to deal with them. Still, from my experience I feel that such firewalls are not that frequent (though as always, YMMV, and batteries are not included).

    I’m not sure I understand exactly what is meant by this? Could this be mitigated by listening on a number of server ports instead of only 1, and avoid using the same server port for more than one connection to the same client? I suppose if there are many clients behind the same NAT IP this will require too many server ports to be feasible?

    One thing that has me a little worried was that a nonblocking Send()/BeginSend() is not guaranteed to send all of the bytes. I suppose if there are no outstanding packets then the outgoing buffer must be empty and as long as your “package” is smaller than the buffer then you should be able to Send() and you could in fact even perform a blocking Send(), knowing that it wouldn’t block?

    • "No Bugs" Hare says

      > I wasn’t aware that the server port of an accepted connection didn’t change to a dynamic port.

      Well, I wouldn’t bet my life on it, but – last time I looked in Wireshark, it looked so :-).

      > I’m not sure I understand exactly what is meant by this?

      This means that “there are ugly TCP firewalls out there which cause crazy delays which we cannot do much about” 🙁

      > I suppose if there are no outstanding packets then the outgoing buffer must be empty and as long as your “package” is smaller than the buffer then you should be able to Send() and you could in fact even perform a blocking Send(), knowing that it wouldn’t block?

      Yes – though I’d still suggest to test it ;-).

  4. Michael Shuen says

    With regards to the suitability of the algorithm in web browsers, I would say that it could be done. The Mozilla Developer Network documentation on WebSockets (available at https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API) implies that with default settings, each opened WebSocket object uses a different TCP connection). I’ll do some more testing shortly. Even so, depending on the firewall/NAT situation, we might be able to use a narrow range of ports (N ports, one for each connection) to ensure each connection is separate. As for the rest of the algorithm, it seems quite workable. On the client-side, the browser gives us access to see the number of untransmitted bytes, so we can pick a connection that is ready to go. On the server-side, we can use the “packet en route” flag. The only minor problem would be to find a UDP stack for JavaScript. If that is a problem, the client can just use faux-UDP of a sort. I’ll do my best to post a demo of the algorithm shortly.

    • "No Bugs" Hare says

      > I’ll do my best to post a demo of the algorithm shortly.

      Would be great! I am really eager to see whether it works in real world (we’re going to include it to our product – but the context is different and it is going to take a while…). Oh, and if there are any questions/problems – feel free to ask (ideally – with Wireshark dumps, otherwise any kind of analysis is pretty much hopeless).

      • Michael Shuen says

        Progress is quite good so far, stay tuned!
        Things I’ve learned so far:
        Wireshark doesn’t lie.
        GDB and Wireshark are my friends.

        As I had hoped, the browser APIs are actually pretty good. Watching the communication reveals that each websocket goes over a separate TCP connection by its nature. There is only one round-trip to set up a websocket before it is ready – not sure if that will influence the possibility of speculatively opening new connections when bad things happen.

        I heard somewhere on StackOverflow that Chrome has a global limit of 255 websockets open at any time, but I don’t think that will be a problem.

        The only bug at the moment is the fact that the server mysteriously crashes when the demo page is refreshed (it works fine if you open another copy).

        I’ll keep you posted on my progress.

        • "No Bugs" Hare says

          Sounds very promising… Please keep posting the updates.

          > Wireshark doesn’t lie.

          Yep – this is one of the very very few things out there which doesn’t lie :-).

          • Michael Shuen says

            Quick semi-update:
            As before, coming along very well.

            Server side scalability seems pretty good. I’m using the uWebSockets server (https://github.com/uWebSockets/uWebSockets) and it works pretty good. Frontend servers just need to route/forward all traffic from a single session to the same WebSocket server.

            The server does need to hold maps of sockets to sessions, though, and needs to do multiple lookups whenever sending. I’ll have some performance numbers later when I test with more clients connected.

            The channels can definitely be bidirectional – can you advise on whether this would be wise? The browser *is* able to track if it has any unsent packets, and client code can be modified to count channels as busy if an ACK has not been received. However, if something happens to the packet going to the server and the server doesn’t know about it, we could lose the channel for a short time.

            Finally, I’ve traced the crash bug from earlier down to a threading race between the sending code and the code that handles disconnects. There shouldn’t be a performance penalty, as only a single user’s data needs to be locked at a time.

            I’ll keep you posted on my progress!

          • "No Bugs" Hare says

            Thanks a lot for the update! 🙂

            The first most important question is IMO – ‘whether the latency improvement is observed with real-world TCP stacks in the presence of packet loss and delays?’ If I had time – I would run an experiment with packet loss simulation (based on an intermediate linux box/VM with netem, netem adding both like 10% packet loss and 50-100ms delay in each direction to simulate real-world RTT of 100-200ms) – and would try to observe the difference in latencies on the receiving side depending on the number of channels in use. Sure, in theory the algorithm is solid – but I’d rather see experimental data to be 100% sure (it doesn’t make sense to go into fine details unless the effect is not only predicted theoretically, but is also observed in the experiment which somewhat-simulates real-world). Also – such a test bench is IMO a very important tool to test subtle modifications (and to see that at least they’re not making things worse).

            > The channels can definitely be bidirectional – can you advise on whether this would be wise? The browser *is* able to track if it has any unsent packets, and client code can be modified to count channels as busy if an ACK has not been received. However, if something happens to the packet going to the server and the server doesn’t know about it, we could lose the channel for a short time.

            I am not 100% sure what is the exact processing model you have in mind – could you elaborate (a.what exactly you’re using now – you mentioned ‘number of untransmitted bytes’, which can be tricky, and b. what you’re planning to add)? OTOH – those bidirectional systems I can think of, will cause additional latencies (due to interplay between app-level ACK and ‘UDP packet’ tunneled in the opposite direction but going over the same TCP), and TBH I have no idea how much this additional latency will be. My suggestion in this regard would be – (a) to make a test bench along the lines outlined above (with simulated packet loss and delays), and (b) to test it while using sockets only as unidirectional – and while using sockets as bidirectional. Trying to estimate delays theoretically would lead to pretty complicated math – with quite a bit of room to make a mistake, and quite a bit of things which are very difficult to account for; on the other hand, such an experiment will show behaviour under a more-or-less reasonable approximation of the reality.

            > I’ve traced the crash bug from earlier down to a threading race between the sending code and the code that handles disconnects

            I _hate_ multithreading… :-((.

    • Jesper Nielsen says

      >The only minor problem would be to find a UDP stack for JavaScript.

      My guess is if you’re already having to resort to websockets vs. straight TCP then UDP is definitely not an option:)

  5. Dahrkael says

    So if I understand correctly, this somewhat mitigates the head-of-line issue, creating a TCP version of what reliable UDP libraries usually call channels.
    Let’s say we have 200ms of RTT and 5 channels, and we are synchronizing states 20 times per second. This gives us exactly 20 packets every second in the best case scenario, right? There’s no way we can go up without more channels, even worse if a packet delays more than expected.
    So we need an additional channel every time we want to cut latency (between packets) by half. This can stack pretty quickly with RTTs above 200 (mobile games), especially if, as you say, channels are not bidirectional (this doubles the amount of channels). Also bidirectional channels can incur extra delay if the other side needs to resend a lost packet before sending the acknowledgement.
    For sure this is only for special cases; I see it as a bit fragile, but anyway an interesting approach.

    • "No Bugs" Hare says

      Yes, however – the whole point is that channels are cheap. There is very little cost in terms of ports (they’re consumed on the Client-Side only), and we should be able to limit TCP buffers to ~4K, which makes memory usage on server pretty much negligible (80M for 1000 players is pretty much nothing by today’s standards).

      > this is only for special cases

      We’re going to try using it as a backup for UDP (in cases when UDP doesn’t work – because of firewalls); plus, it _might_ also work for those platforms where UDP is not supported (such as browsers).

  6. Angela Collier says

    A network newbie here. I was told that TCP and UDP don’t play nice together and to just use one or the other. Was it a big fat fib or is there some truth to it?

    • "No Bugs" Hare says

      Short answer: They don’t _always_ play nice when used alongside each other (in other words – you need to be REALLY careful when doing it). However, in some limited well-controlled cases it can fly.

      A bit longer answer: the most common problem when using TCP alongside UDP is that TCP can easily eat up all the “last mile” capacity, hurting UDP (and overall game experience). However, if you know for sure what’s going on within TCP (and use your own flow control, significantly more aggressive than built-in TCP one) – it can be avoided.

  7. Michael Shuen says

    Sorry for the lack of news yesterday. So far:

    Server stability seems to be much better. Unfortunately, that came at the cost of only allowing one thread to send() at a time. I’m pretty sure it’s the library’s fault, but no matter.

    So far, preliminary testing with netem indicates that more WebSocket connections will be necessary than anticipated. Somewhere on this site you mentioned tcp_thin_dupack and tcp_thin_linear_timeouts; I turned those on but saw little effect. I’ll get better statistics and data before saying whether they have an effect, however.

    Using 100ms latency and ~30ms jitter with 10% packet loss, I got occasional events where the server had no “ready” sockets when there were 6 sockets, but little to none with 10 sockets. I’ll give more concrete numbers soon.

    Stay tuned!

    • "No Bugs" Hare says

      > So far, preliminary testing with netem indicates that more WebSocket connections will be necessary than anticipated.

      > Using 100ms latency and ~30ms jitter with 10% packet loss, I got occasional events where the server had no “ready” sockets when there were 6 sockets, but little to none with 10 sockets. I’ll give more concrete numbers soon.

      Thanks (BTW, what is your ‘network tick’?)! TBH, I am not _too_ surprised – and IMO it is still a pretty good result :-). BTW, do you try to handle “hung” connections?

      Two thoughts on “how we might _try_ to reduce number of TCP connections”, which might (or might not) help.

      I. Now, we have two states of the connection – “good” (ACK’ed), and “bad” (non-ACK’ed). However, when we run out of “good” ones – there are quite a few connections which are actually “probably good” – but we just don’t know about it yet. I.e. when we run out of “good” connections – we could try to find out these “probably good” connections (for example – those which aren’t ACK’d yet – but which have time-since-last-post about RTT – though not much more, as being delayed too long means likely problems with specific TCP).

      II. Or – we may delay a packet-which-is-about-to-be-sent, for a few ms – waiting for a channel to be ACK’d; this can be still better than dropping our outgoing packet completely.

      IF using any of these two schemas – the criteria to be measured, should be not “number of times we’ve found a ‘free’ socket”, but rather “histogram of packet losses/delays on receiving side”.

      Just thinking aloud… 🙂

  8. Michael Shuen says

    Many apologies for not responding sooner, I’ve been somewhat busy lately. I’m currently working on refitting the server to collect performance metrics, and should have an update soon.

    • "No Bugs" Hare says

      Absolutely no problem (to start with, this whole project is just your goodwill).

      Of course, it is indeed a _very_ interesting project – and I am really eager to hear about the results :-). In other words – I am all ears (which is not that difficult for a rabbit ;-)).

  9. Michael Shuen says

    So, I finally found the time to revisit this, and I have some good results! Using “tc qdisc replace dev lo root netem delay 100ms 30ms drop 10%”, 10 WebSocket connections, and 20 messages per second, over one minute, all sockets were busy only 5 times. With tcp_thin_linear_timeouts and tcp_thin_dupack, it only happened twice.

    I’ll probably do some more scientific testing later. I’ve put the code up as a Gist at https://gist.github.com/Zirientis/16c60215c13880ada43014a533bf8e9e and I used the uWebSockets library at https://github.com/uWebSockets/uWebSockets. To run it, download the code, compile uWebSockets, and build with ‘g++ server.cpp -g -std=c++14 -luWS -lssl -lz -pthread -Wall -Wextra -fsanitize=address -o server’. Connect to localhost on port 3000.

    I look forward to hearing from you!

    • "No Bugs" Hare says

      Wow! Actually, even 10 sockets/player doesn’t look prohibitively expensive. It means that we’ll be able to handle about 6000 players/single IPv4 address – and an “industry standard” number for simulation-based games being ~1000 players/server-box means that it isn’t likely to cause any exhaustion of available TCP ports. To be 100% sure, I’d still check TCP memory usage (and to limit the size of TCP buffers if necessary) – but honestly, I don’t really expect problems on this way (as our channels are always-empty, even as little as 4K/connection should work fine).

      > I’ll probably do some more scientific testing later.

      Would be great! And BTW, if you ever have a problem “where to publish this research” (it can be tough these days) – I would be Really Happy(tm) to post it as a guest post here (and if not – please LMK when you publish it).

    • MS says

      I am also following this development closely. Tested Michael Shuen’s code on CentOS and it worked as advertised – opened 10 browser clients and left them running for 3 hours on an unmodified project & settings. All clients had all sockets busy only 5-6 times in a 3 hour nonstop run. I would like to see load capacity performance benchmarks of a uWebSockets server.

      I am looking into using this method as an alternative to TURN servers for my WebRTC multiplayer game.

  10. Michael Shuen says

    Sorry – got really caught up in other things! I’m definitely interested in pursuing this, so I’ll try and make some time soonish. Thanks for the interest!

  11. Jackson says

    “On the receiving side – whenever we get an incoming UDP-over-TCP packet – we send an app-level acknowledgement (as small as 1 byte) back to the sender, over the same TCP connection where we got the incoming message.”

    Isn’t TCP already sending an (invisible to the programmer) acknowledgement back for each packet? We’d be doubling up by sending an application-level acknowledgement, so it would be good if this internal acknowledgement could somehow be checked instead to keep track of the connection state (Is that possible?). Also, while we might only use one byte for our application-level acknowledgement, the TCP header and whatnot would be added. I read that the minimum size of both UDP and TCP packets with all the extras added is about 60 bytes, so to emulate the sending of one ~60 byte + data UDP packet would presumably require the original 60+ byte TCP packet, the application-level ~60 byte acknowledgement, and the internal acknowledgement (also ~60 bytes? Not sure.). In other words, it seems to me like we’d be sending a lot of data.

    I don’t know much about network programming. I found this post because I had the same idea, although I was planning to not keep track of which ports are busy and to just accept that I may sometimes be sending on busy ports.

    • "No Bugs" Hare says

      > Isn’t TCP already sending an (invisible to the programmer) acknowledgement back for each packet?

      Sure, there are acknowledgements but they’re NOT for each packet (they try to piggy-back, and what not).

      > so it would be good if this internal acknowledgement could somehow be checked instead to keep track of the connection state (Is that possible?).

      The problem is that while this internal acknowledgement WILL be sent, usually it will be sent too late for our purposes.

      > I read that the minimum size of both UDP and TCP packets with all the extras added is about 60 bytes, so to emulate the sending of one ~60 byte + data UDP packet would presumably require the original 60+ byte TCP packet, the application-level ~60 byte acknowledgement, and the internal acknowledgement (also ~60 bytes? Not sure.). In other words, it seems to me like we’d be sending a lot of data.

      Assuming 60byte payload: from sender to receiver it will be like 60bytes of payload+40-60 bytes of IP+TCP header ~= 100-120 bytes. In the opposite direction it will be 1 byte of app-level payload + 40-60bytes of IP+TCP header ~= 40-60bytes (almost-always, internal acknowledgement will piggy-back on app-level acknowledgement using one of already-existing fields in TCP header, so it is “free”). Not much degradation compared to bare TCP if you ask me (bare UDP is a different story though…)

      • Jackson says

        Thanks for your response 🙂

        > In the opposite direction it will be 1 byte of app-level payload + 40-60bytes of IP+TCP header ~= 40-60bytes (almost-always, internal acknowledgement will piggy-back on app-level acknowledgement using one of already-existing fields in TCP header, so it is “free”).

        Oh, I didn’t realize that the internal acknowledgement piggy-backs, which makes a lot of sense. Nonetheless, our acknowledgements could raise overall network traffic by quite a lot – up to double (depending on the size of the regular packets we want to send).

        I’ve implemented the system, minus the acknowledgements, using WebSockets, but unfortunately don’t have the facilities to test its performance. I’m operating on the assumption that the performance will be better than regular TCP as far as head-of-line blocking is concerned, though it occurred to me that whatever network conditions cause a packet to be dropped on one socket may also cause drops on all the other sockets (How often this occurs vs how often a lone packet is dropped seems key to whether or not the whole multiplexing technique is worthwhile). See here – https://www.gamedev.net/forums/topic/700489-using-concurrent-tcpwebsockets-to-mitigate-head-of-line-blocking/ – for more info about what I’m trying to do.

        As for the acknowledgements, I’m not sure whether they would provide a lot of benefit over my naive approach, or whether the extra network traffic would outweigh the benefits. Only thorough, real-world testing could determine that.

        There was some discussion above about what to do if all sockets are blocked waiting for acknowledgements. Someone suggested that the attempts to send more packets should deliberately drop the packets. I think an alternative is to pick the socket that became blocked first and send the packet on that one on the assumption that it will probably be the first to become unblocked. In other words, if all connections are blocked, it might be time to just let TCP do its thing. Or mix the two options and allow the application-level code to specify which packets can be dropped if all sockets are blocked and which ones must be sent anyway (which would mimic a traditional reliability layer built on top of UDP).

        Anyway, I’ll add to this post the connection-management system I came up with for WebSockets, which can only be opened client-side:

        Connecting:

        1. When a client wants to connect to the server, it opens N connections all at once.
        2. When the server detects a new connection, it adds that connection to its list of “unbound” connections.
        3. When the client detects that all its N connections are opened, it sends a packet to the server along any connection requesting a unique client ID.
        4. When the server receives a request for a new client ID, it generates a new ID, sends the ID back along the same connection, and adds a new client object with that ID to its list of client objects.
        5. When the client receives its client ID, it sends a packet along each of the connections requesting that the server “bind” that connection to the client object with that ID.
        6. When the server receives a request to bind a connection to a specific client object, it removes that connection from its list of unbound connections and adds it a list in the corresponding client object. The server also checks to make sure this client is not trying to establish/bind too many connections (to prevent any funny business by modified clients). Optionally, if the server sees that this client object has now bound all N connections, it can send a message to the client to tell it that the whole process has succeeded.
        7. As a fail-safe, unbound connections are forcibly closed and removed by the server after, say, 5 seconds.

        Disconnects: I’m not yet sure to best handle the case that the server or client detects that one of the connections between them has failed. At the moment, the server simply removes the connection from the relevant client object’s list of connections and lets the client establish a new connection, which is as simple as detecting the dropped connection, opening a new socket, and sending a bind message on it (see above). However, if TCP is very reliable and not susceptible to drop-outs (?), all that might be unnecessary. A better system might be for both server and client to simply consider one dropped connection a total disconnect and respond accordingly.

        • "No Bugs" Hare says

          > Nonetheless, our acknowledgements could raise overall network traffic by quite a lot – up to double (depending on the size of the regular packets we want to send).

          Not really; if we don’t send app-level acknowledgments – internal acknowledgments will be sent anyway, so at least for predominantly-one-way traffic adding ACKS won’t change the traffic much (need to test, but my somewhat educated guess is that we’ll be increasing a 60-byte internal-ack packet into a 61-byte app-level one, which is not much difference at any rate).

          > I’ve implemented the system, minus the acknowledgements,

          But without an app-level ack, how do you know if the channel is already free, so you can use it without the risk of packet-you’re-sending being hopelessly delayed? Sending something into a ‘stuck’ TCP channel will kill the whole idea, and the whole problem with TCP-for-fast-paced-games is that it DOES get stuck pretty often 🙁 .

          > As for the acknowledgements, I’m not sure whether they would provide a lot of benefit over my naive approach, or whether the extra network traffic would outweigh the benefits. Only thorough, real-world testing could determine that.

          I don’t really see how you choose which channel to send your packet to, but if you’re merely sending a packet over a random/round-robin channel – then IMNSHO, “your” delays (or packet losses) are pretty much guaranteed to be MUCH worse than “mine”. If the channel you’re sending to, is currently “stuck” (which is BTW _exactly_ the problem we’re trying to deal with in the first place) – sending packet there is likely pointless (due to TCP retransmit mechanics and RTO calculations, we’re speaking about at the very least hundreds of milliseconds of delay – and all the way up to hours(!)).

          > However, if TCP is very reliable and not susceptible to drop-outs (?), all that might be unnecessary.

          It is susceptible to both drop-outs, AND (even worse) to being stuck for a long while :-(. Overall, on the first glance, your connection logic makes sense, but SEE ABOVE RE. APP-LEVEL ACKS : I feel that without them (and using round-robin without ‘stuck’ connection detection) the whole point is defeated.

          • Jackson Allan says

            > Not really; if we don’t send app-level acknowledgments – internal acknowledgments will be sent anyway

            Right, I guess if both server and client are only using each socket every 250 or 500ms, then that TCP socket would be forced to send its internal acknowledgement as a stand-alone packet instead of piggy-backing it onto another outgoing packet? If that’s the case, then certainly, sending our own app-level acknowledgements won’t make a difference in terms of bandwidth. Rather, the very act of using N connections would do that.

            R.e. your comment about predominantly one-way traffic, I don’t think that applies to games?

            > But without an app-level ack, how do you know if the channel is already free, so you can use it without the risk of packet-you’re-sending being hopelessly delayed?
            > I don’t really see how you choose which channel to send your packet to, but if you’re merely sending a packet over a random/round-robin channel – then IMNSHO, “your” delays (or packet losses) are pretty much guaranteed to be MUCH worse than “mine”.
            > I feel that without them (and using round-robin without ‘stuck’ connection detection) the whole point is defeated.

            Yes, my plan was to naively round-robin. The idea wasn’t to ‘solve’ the problem of head-of-line blocking but to mitigate it. If a packet does get sent on a blocked connection, well, we’re just back to regular TCP – no biggie. But the round-robining would still be beneficial because 1) if just one connection is blocked, the overwhelming majority of packets would still get through via the other channels and 2) a blocked connection has time to recover before it gets used again. If we use 10 connections and send packets at 20Hz, each connection would have 500ms to re-transmit and recover from a dropped packet, which seems like it would usually be enough.

            > and all the way up to hours … It is susceptible to both drop-outs, AND (even worse) to being stuck for a long while

            Does that happen often? It seems to me that if TCP sockets often became silently blocked forever, then that would pose a major problem for the millions of applications that rely on a single connection. Naturally, if the client and server in our application initiate a full disconnect as soon as one of the N connections fails, then it stands to reason that our clients will be dropped at N times the rate that drops occur in applications using a single socket. So in that case, I think I should allow for individual dropped connections to be reopened by the client. (Not yet sure what to do about a perpetually blocked connection as described above, though.)

            Anyway, I implemented the acknowledgement system. Adding it only took a few extra lines of code, and I can turn it on and off easily. I still round-robin to ensure that the sockets are all being used roughly evenly. In other words, when I look for an unblocked connection to send a packet on, I start at the connection after the last one I used in the list of connections.

            So now I can run some tests when I find some time when I don’t want to use the internet for other things. My plan is to test bandwidth usage and average packet travel time for one socket (i.e. standard TCP arrangement), 5 sockets without acknowledgements, 10 sockets without acknowledgements, 5 sockets with acknowledgements, and 10 sockets with acknowledgements. Then I can report back with my findings.

          • "No Bugs" Hare says

            > guess if both server and client are only using each socket every 250 or 500ms, then that TCP socket would be forced to send its internal acknowledgement as a stand-alone packet instead of piggy-backing it onto another outgoing packet?

            Yep.

            > Does that happen often? It seems to me that if TCP sockets often became silently blocked forever, then that would pose a major problem for the millions of applications that rely on a single connection.

            Most of the time it is not technically ‘forever’, but for, say, 10 seconds (on a noisy connection 10-second delay can happen as frequently as once per hour; this will be accompanied with 2x 5-second delays during the same hour, 4x 2.5 seconds delays, and so on…); ‘forever’ does happen, but is indeed quite rare (and has different mechanics).

            That’s why, say, for a time-critical game (where “10 seconds” effectively == “forever”) which dares to operate over TCP, it is mandatory to implement app-level keep-alives and automated reconnects (and one of the major reasons why, say, Starcraft eventually moved from TCP to UDP). Other than that, way too often it is left up to the end-user to press ‘refresh’ button if things got stuck 🙁 (that’s how the vast majority of websites operate, effectively relegating error handling to the end-user 🙁 ; 10 seconds is nothing for a website, and if it gets stuck forever – well, user will be forced to press ‘Refresh’ and to deal with the ill effects of the session being in weird state, POSTs being re-sent etc.).

  12. Jackson Allan says

    Unfortunately, it was premature of me to declare earlier today that I’d added the acknowledgement system with a few lines of code as further testing revealed some issues.

    The first minor point is that if you are ever planning to send packets over connections marked as busy (because all connections are busy), then the packet-en-route flag needs to be not a flag but a counter indicating how many acknowledgements the connection is waiting to receive at a given time. Otherwise, a connection will eventually be marked ready when it’s actually still busy. Simple, but worth noting.

    Secondly, I think a major conceptual issue lies in this step:

    > On the receiving side – whenever we get an incoming UDP-over-TCP packet – we send an app-level acknowledgement (as small as 1 byte) back to the sender, over the same TCP connection where we got the incoming message.

    So we receive a packet on a given connection and need to send an acknowledgement back on the same connection. There are two possibilities:

    1) That connection is current flagged as ready.
    2) That connection is currently flagged as busy.
    In case 1, we can send the acknowledgement, but the problem is that we can’t ourselves expect an acknowledgement of our acknowledgement back from the client. Otherwise, the client and server would enter into an infinite loop. So we’re left with a connection that is actually busy but must be flagged as ready.
    In case 2, the same problem applies, but we have the added issue that we’re sending across a connection that we already know is busy.

    The only solution to this issue that I can see is unidirectional connections. In other words, we have X connections for the client to send data to the server and the server to send only acknowledgements, and Y connections for the opposite. Maybe this was already what you had in mind, but it wasn’t obvious to me until now.

    • "No Bugs" Hare says

      > but the problem is that we can’t ourselves expect an acknowledgement of our acknowledgement back from the client.

      I’d say it is not a problem, but merely a task to solve (we should distinguish between acks and normal data anyway, so not acknowledging an ack is perfectly feasible). The real problem lies with case 2 (at the moment I cannot say how sending an ack over busy connection will affect observable delays 🙁 )…

      > Maybe this was already what you had in mind, but it wasn’t obvious to me until now.

      IIRC I meant unidirectional ones (in particular, otherwise sending 1-byte ack won’t be possible due to that need to distinguish acks and non-acks).

      As for bidirectional connections – this is a field for further research (hey, I just threw in a basic idea, not the Ultimate Solution(tm) which cannot be improved 😉 ). In this regard, experiments are needed to see how bidirectional ones (sending ack over a busy connection) will operate over connections with packet loss (which BTW can/should be simulated with netem; these days 5% loss is quite common, and 10% is not unheard of, and worse than that – situation with packet loss is expected to worsen due to the economy of ISPs and ironically improvements in TCP throughput when operating over lossy connections), and compare it to the behaviour of unidirectional ones.

      Hope it helps.

      • Jackson Allan says

        Thanks for all your responses!

        Just to be completely clear, with the bidirectional channels, the problem with sending an acknowledgement on a channel (busy or non-busy) is that we potentially block the channel with a lost packet, but we have to leave it flagged as ready or else enter into an infinite feedback loop. Assuming the client and server each send packets at similar rates, then approximately half our packets will be acknowledgements, which means that half the time, we’re going to be sending out packets without flagging our connections as busy. This seems like it would defeat the whole effort. I apologize for repeating myself here – I just wanted to ensure I’m expressing myself clearly.

        If the acknowledgement system means that the connections must essentially be unidirectional, the question again becomes, will it justify its cost? In other words, would it be better for the client and server to each have N / 2 connections at their disposal to use intelligently, or would it be better for each one to simply round-robin over N connections? (What we really need to know here is also, what is the cost associated with each additional connection?)

        > these days 5% loss is quite common, and 10% is not unheard of

        Wow! I was imagining the number to be more like 1% because that’s what’s reported by games running on UDP.

        I’ve tested the five-socket round-robinning technique on a micro scale (i.e. two clients and one server all running on the same machine, each sending updates via the internet at 20Hz with pings fluctuating around 150ms) and found it to provide a far smoother experience than a single socket. However, the issue now is that there are other variables at play. For example, I’m sending my packets through an ssh service (serveo.net), and my own application-level networking code is at work trying to minimize the effects of lag. So my initial impression is just that – an initial impression.

        Also, for any future readers, I’d just like to add something to the system of initializing/negotiating multiple connections between server and client that I mentioned earlier: The server can give the client, along with his ID, a unique password that the client then uses when telling the server which connections belong to it. This prevents a hacked client from doing disruptive things like trying to “bind” its connections to other ids/clients.
