Heartbeating is a common technique for checking whether a network connection is alive. The idea is that each end of the connection sends a small packet of data called a "heartbeat" once in a while. If a peer doesn't receive a heartbeat for some time (typically a multiple of the interval between heartbeats), the connection is considered broken.
Interestingly, the TCP protocol doesn't provide heartbeats (there are optional keep-alives that operate on a scale of hours, but these are not really useful for swift detection of network disruption). In short, if a network connection is broken, the TCP endpoints are not notified of the fact.
The problem is mitigated by the fact that if the application at the endpoint terminates or crashes, the peer is notified via a FIN or RST packet and is thus aware of the connection failure. Therefore, you are going to experience the problem of missing failure notification only if the network itself is broken, e.g. if your router crashes, a cable is cut, your ISP experiences a failure, etc. In such a case the TCP endpoint will live on forever, trying to communicate with the inaccessible peer.
For many applications, e.g. web browsers, this kind of behaviour makes sense. If the user feels the browser is stuck, they can just hit the "reload" button. The browser will then try to open a new TCP connection, the attempt at the initial TCP handshake will fail, and the user will be notified of the fact.
For other kinds of applications, specifically those with high-availability requirements, the missing failure notification is a big deal. Imagine a critical application that has redundant access points to the network, for example via two different ISPs. The idea is to use one of the providers and fall back to the other one only if the first one becomes inaccessible. The idea is nice, but there's a catch: if the TCP implementation doesn't let you know that the peer is inaccessible, you have no reason to switch to the fall-back provider. You will continue using the failed provider, possibly missing some critical data.
That's why critical applications, as well as tools intended for building such applications (e.g. message queuing systems), re-implement heartbeating on top of TCP over and over again. This article tries to explain the problems encountered when doing so.
Typically, business data (messages) are mixed with heartbeats within a single TCP bytestream. The main concern is thus to prevent one from messing with the other. In other words, heartbeats shouldn't disrupt data transfer, and data transfer should not cause heartbeating to misbehave, for example by reporting false connection failures or, conversely, by not reporting actual connection failures.
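To make the mixing concrete, here is a minimal framing sketch: each frame carries a type byte and a length, so heartbeats and messages can share one bytestream. The frame layout, the type values and the function names are illustrative assumptions, not taken from any real protocol.

```python
import struct

# Hypothetical frame types for a mixed heartbeat/data bytestream.
HEARTBEAT = 0
DATA = 1

def encode_frame(frame_type, payload=b""):
    # 1-byte type + 4-byte big-endian length, followed by the payload.
    return struct.pack("!BI", frame_type, len(payload)) + payload

def decode_frame(buf):
    # Returns (type, payload, remaining bytes), or None if the buffer
    # doesn't yet contain a complete frame.
    if len(buf) < 5:
        return None
    ftype, length = struct.unpack("!BI", buf[:5])
    if len(buf) < 5 + length:
        return None
    return ftype, buf[5:5 + length], buf[5 + length:]
```

Note that this framing alone doesn't solve anything yet; as the rest of the article explains, the heartbeat frames can still get stuck behind unread data frames.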
The problem hits once the application stops reading data from the TCP connection. The heartbeats may be arriving; however, they cannot be read, because there are messages stuck in the TCP receive buffer in front of them. Given that TCP doesn't provide a way to read data from the middle of the receive buffer, the application has no idea whether the heartbeats are arriving or not.
It's not immediately obvious whether this is an actual problem or whether it can be solved by some clever trick. To get a better understanding, consider the following questions and answers:
Q: Well, we can read those obnoxious business data, store them in memory for a while and check whether heartbeats are still arriving, right?
A: The problem with that is that the application may not be reading data for a long time. For example, it may be processing some complex and lengthy task and ignoring incoming messages while doing so. Or it may be waiting for user input while the user went out for lunch. Whatever the reason, in the meantime all the data from the TCP connection has to be read into memory to be able to check whether heartbeats are arriving as expected. So, if there's a lot of incoming data, the application is ultimately going to exhaust all available memory and get killed by the operating system.
Q: A-ha! That's what we have message brokers for. They are supposed to have big disk space available and can store large amounts of data without failing. Thus, if we have a message broker in the middle, we are safe, no?
A: The problem is that heartbeats have to flow both ways. Even if the message broker is able to store large amounts of data, the client application has to do the same thing to make sure that it receives heartbeats from the broker. And given that the application is likely to run on a modest desktop, on a mobile phone or on a blade server with no local disk space, memory is going to be exhausted pretty quickly.
Q: OK, but wait a second! If the application is not receiving data at the moment, we don't care whether a network failure is detected straight away. It should be sufficient to detect the failure once the application starts receiving data again, and to do all the failure handling, such as falling back to a different ISP, at that time.
A: Unfortunately, no. Imagine you want to send data instead of receiving it. In that case you won't detect the connection failure, because you are not reading the data, including any heartbeats, from the connection. Consequently, you will send the data into a broken connection. The data is not going to be delivered and, even worse, you are not going to be notified of the fact!
Taking the previous Q&A into account, there's only one way to deal with the problem: preallocate a buffer at the endpoint to store any outstanding inbound data and introduce some kind of flow control to make sure that the buffer never overflows. For example:
- Preallocate 100 bytes of buffer.
- Send a control message to the peer letting it know there are 100 free bytes that can be filled in.
- Peer gets the control message and it knows it can send at most 100 bytes.
- Say it has 30 bytes of data to send. So it sends them and it is aware of the fact that it can still send 70 more bytes, if needed.
- User reads 20 of the 30 received bytes. That leaves 90 bytes in the buffer free to be used.
- User sends a control message to the peer letting it know there are 20 more bytes available.
- The peer gets the control message and adds the new bytes to its current credit. Now it knows it can send 90 bytes (70+20=90).
- Etc.
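The steps above can be sketched in code. This is a toy model of the credit-based scheme, not any particular wire protocol; the Sender/Receiver classes and the way credit messages travel between them are assumptions made for illustration.

```python
class Receiver:
    """Endpoint with a preallocated buffer; issues credit as space frees up."""

    def __init__(self, buf_size):
        self.buf = bytearray()
        self.buf_size = buf_size

    def initial_credit(self):
        # Advertise the whole preallocated buffer as initial credit.
        return self.buf_size - len(self.buf)

    def on_data(self, data):
        # The peer never sends more than its credit, so this cannot overflow.
        assert len(self.buf) + len(data) <= self.buf_size
        self.buf.extend(data)

    def read(self, n):
        # Application consumes n bytes; the freed space is new credit
        # to be sent back to the peer in a control message.
        n = min(n, len(self.buf))
        del self.buf[:n]
        return n


class Sender:
    """Endpoint that only sends as much as its current credit allows."""

    def __init__(self, credit):
        self.credit = credit

    def send(self, data):
        assert len(data) <= self.credit, "would overflow the peer's buffer"
        self.credit -= len(data)
        return data

    def on_credit(self, n):
        self.credit += n
```

Replaying the example from the list: with a 100-byte buffer the sender starts with 100 credits, sending 30 bytes leaves it with 70, and a credit message for the 20 bytes the user read brings it back up to 90.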
This kind of algorithm ensures that there is no intervening data in the TCP buffers that would prevent heartbeats from passing through the connection.
Thus, when you are evaluating a solution that implements heartbeating on top of TCP, here is a checklist to help you find out whether it actually works:
- If there are heartbeats but there's no flow control, the solution won't work.
- There must be a way to split larger messages into smaller units that will fit into the preallocated buffer. If there's no such mechanism, the solution won't work.
- The credit in the flow-control mechanism must be expressed in terms of bytes, not messages. If it uses number of messages to control the flow, the solution won't work.
- If the protocol allows issuing more credit than the space available in the receive buffer, the solution won't work.
That being said, there is one subtler aspect of the problem, an aspect that hints at a more general issue with the Internet stack itself.
In the case of heartbeats on top of TCP, as well as in the case of multiplexing on top of TCP (read the related article here), one has to re-implement a big part of TCP (windowing and flow control) on top of TCP proper. It sounds almost like an engineering anti-pattern. Functionality should not be copy-pasted among the layers of the stack; rather, it should be localised at one well-defined layer. For example, routing is implemented by the network layer. End-to-end reliability is implemented by the transport layer. And so on. However, what we are getting here is the same feature implemented redundantly at various layers of the stack.
Obviously, the right solution would be to implement a new transport protocol directly on top of the IP protocol, one that would provide the desired functionality (failure detection and/or multiplexing) directly, without duplicating features.
And, as a matter of fact, this has already been done. The protocol built directly on top of IP with both heartbeating and multiplexing is called SCTP, and it is available out of the box in most operating systems.
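As a quick illustration (assuming a Linux system with SCTP support compiled into, or loadable by, the kernel), the standard socket API exposes SCTP directly; opening a one-to-one-style SCTP socket looks just like opening a TCP one:

```python
import socket

def sctp_stream_socket():
    # One-to-one ("TCP-style") SCTP socket. This requires kernel SCTP
    # support; on systems without it, the call raises OSError
    # (protocol not supported).
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                         socket.IPPROTO_SCTP)
```

From there the usual bind/connect/send calls apply, with SCTP providing heartbeating and multi-streaming underneath, which is exactly the functionality this article describes re-implementing by hand on top of TCP.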
And here comes the problem: SCTP is not used, even for projects where those features are needed. Instead, they are lousily re-implemented on top of TCP over and over again.
What's going on here?
I mean, SCTP has all features necessary for networking business-critical applications. By now, 13 years since its standardisation, it should be used almost exclusively by banks and such.
Except that it's not.
What happens is that developers are not aware of SCTP. That admins are not fond of having non-TCP packets on the network. That firewalls are often configured to discard anything other than TCP or UDP. That NAT solutions sold by various vendors are not designed to work with SCTP. And so on, and so on.
In short, although the designers of the Internet cleverly split the network layer (IP) from the transport layer (TCP/UDP) to allow different transports for users with different QoS requirements, in reality, propelled by the actions of a multitude of a-bit-less-clever developers, the Internet stack has gradually fossilised, until, by now, there is no realistic chance of any new L4 protocol gaining considerable traction.
The process of fossilisation of the Internet stack seems to be proceeding even further. Gradually, HTTP traffic is becoming dominant (consider, for example, the move from traditional SMTP-based email to Gmail) and at some point it may well turn any non-HTTP protocol into a persona non grata in the Internet world.
Even further: with the advance of WebSockets we now have a full-blown transport protocol, an equivalent of TCP, on top of HTTP!
And it's unlikely that that will be the end of the story.
So, while re-implementing TCP functionality on top of TCP may seem like a silly engineering solution when you look at it with a narrow mindset, once you accept that the Internet stack is formed by layers of gradually fossilising protocols, it may actually be the correct solution, one that takes political reality into consideration as well as the technical details.
And hopefully, at some point, when everybody has already migrated to some future TCP-on-top-of-TCP-on-top-of-TCP-on-top-of-TCP-on-top-of-TCP protocol, we can create a shortcut and get rid of those good-for-nothing intermediate layers.
Martin Sústrik, April 24th, 2013
"Given that TCP doesn't provide a way to read data from the middle of the receive buffer, the application has no idea whether the heartbeats are arriving or not" - doesn't TCP's "urgent data" feature solve exactly this problem?
I've never used urgent data personally, but as far as I understand it, the urgent bit is part of the bytestream. So, if data transfer is blocked by backpressure, even urgent data won't get through.
Feel free to correct me though.
Actually I have no idea if urgent data obeys TCP window or not. Seems like a good candidate for experiment :)
Apparently (according to Wikipedia) there are incompatibilities between TCP implementations which effectively limit OOB data to 1 byte. For a heartbeat, though, that is enough. More troubling are network devices (the presentation linked from Wikipedia mentions Cisco PIX) which clear the urgent pointer.
In any case, the urgent byte was designed to send Ctrl+C over the network and is barely used anymore. I would be pretty concerned if I saw a protocol using it for heartbeats.
Btw, I recall some horror stories from long ago about using urgent byte as message delimiter :)
Apparently you're right - RFC 6093 contains a thorough discussion of TCP urgent data. I was misled by SIGURG, which indirectly implies that urgent data is delivered out-of-band.
Good to know! I had that impression for a long time but never really bothered to check.
As another try: you could decrease the TCP keepalive timeouts (on Linux you can set them individually per socket, not just globally) and use them as a heartbeat mechanism? Don't know whether it's portable though.
Nope, it's not. The keep-alive options are Linux-specific.
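For reference, this is roughly what those Linux-specific knobs look like. The timeout values below are arbitrary examples, and TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-only socket options, which is exactly the portability problem.

```python
import socket

def enable_fast_keepalive(sock, idle=5, interval=1, count=3):
    # Turn on keep-alives for this socket only (Linux-specific tuning).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between subsequent probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Number of unanswered probes before the connection is reset.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With the example values above, a dead link would be detected roughly 5 + 3×1 = 8 seconds after the connection goes idle, instead of the default two hours plus.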
Also, there's another aspect that I've deliberately omitted to keep the article easy to understand: if you want adaptive heartbeats that "just work" regardless of link latency, you need something like SCTP's heartbeating algorithm. TCP keepalives won't do that for you. Even worse, you either need access to TCP's RTO value (there's no standard way to get it) or you have to re-implement the RTO measurement on top of TCP.
It really gets messy once you start thinking about it in depth.
Apparently I cannot add links here so see gist catwell/5451026 ;)
This kind of problem is not limited to TCP, it's happening everywhere.
In networks, one of the most ridiculous examples IMO is L2TP. I mean, come on, a protocol explicitly designed to make loops in the hourglass model… Here the missing functionality (analogous to heartbeat in your example) is tunneling. The right solution to this problem would be something like MPLS, which despite a bad reputation has the right design: adding a layer (2.5). And the solution people use instead is… encapsulating layer 2 protocols inside IP!
Another example is mobile OSs. The problem is distribution; the Web solves it, so let's re-implement a whole OS in the browser, right? Encapsulate applications in a stack that was designed for online documents by twisting it as much as we can, and pretend JavaScript is an assembly language. Mozilla criticized Microsoft because they could not implement an IE competitor efficiently on Windows 8. I wonder how efficient alternative browsers for Firefox OS will be without a way to write native code.
As you said all this happens because people who use a technology see its shortcomings and want to solve them, but they don't understand the technology and its design enough to change it, or they don't have enough leverage to make it happen. They know how to use it though, so they hack on top of it.
This is why we keep reinventing square wheels since the 70s. By layering stuff on top of flat tires instead of replacing them.
The question is: Do we have an alternative?
If you stick to the right thing (TM) you'll go the way of SCTP — you'll create an obscure niche solution.
Not accepting the political reality invariably turns out to be a lousy solution in the end. So, what's needed IMO is hacking simultaneously on the technical and the political level. It's not at all obvious how to do that, but it's definitely a challenge.
I agree. For instance, I criticize things like Firefox OS on a technical level, but I still think it is very interesting because it solves the distribution problem. In an ugly way, but it does.
Regarding SCTP, the problem is even worse than that: now for lots of things we can't use TCP either because incompetent system administrators (or rather their clueless bosses) decide to only leave port 80 open. So we end up with HTTP as a transport protocol and applicative firewalls.
How can we change that? I agree that not accepting the political reality is not the solution for us hackers, and we have to make do with what we have. But I think large players could, if they were bold enough. For instance, how could we solve the IPv6 migration problem fast? Deprecate IPv4 for all Google services with a 3-month deadline. A.k.a. the "a dancing turtle is not enough incentive" solution.
I am impatient to see what will happen with HTTP 2 / SPDY and maybe QUIC (Google's UDP replacement). Maybe they will turn to TCP next? :)
My feeling about it is that we should simply accept the process of protocol fossilisation and gradual movement up the stack. It seems to be a fact of life and something you cannot really fight.
However, when you look at the process closer, as layers are added on top, the waist of the hourglass seems to move upwards as well. Once it was at the IP level; now it's almost at the TCP level and heading upwards towards HTTP.
The nice side effect is that any technology that gets *below* the waist (and if we accept the movement-upward model, every technology will ultimately get there) can be relatively easily replaced. Once, replacing the L2 layer was a complex problem. Today, nobody really worries about replacing 1GbE with 10GbE, although the two are pretty different beasts. The reason is that L2 is below the waist of the hourglass. You just don't have to care.
One day, TCP, HTTP and WebSockets will get below the waist as well. Then we can replace them with something that fixes the problems.
Of course, the time horizon here is measured in decades rather than years.
Interesting. Now the problem is that the upper layers have not been designed to do what they do, so it still makes the whole thing more complex. HTML5 is not a good application development platform and HTTP is not a good transport protocol.
But it's still nice to think we will be able to replace some of these things someday because people will standardize above them…
And now apparently DJB is trying to replace TCP+TLS ;) http://cr.yp.to/tcpip/minimalt-20130522.pdf
Thanks for the pointer. I'll give it a read.
Hi Martin, thanks for the great article.
I have used the heartbeating technique for some time over both WebSockets and ZeroMQ. While the technique works well in both cases, it's far more pleasant over WebSockets since that protocol includes frame types for ping and pong (as well as allowing control frames, e.g. ping, to be interleaved between the many data frames of a WebSocket message).
Are you considering adding better support for heartbeating to nanomsg? It always feels like a hack when I add this to protocols built on top of ZMQ.
Hi Martyn,
The problem, as explained in the article, is that to implement *sane* heartbeats on top of TCP you have to re-implement a large part of TCP's functionality. Which is a task that would require a considerable amount of work.
Also, the system often used with ZeroMQ, passing heartbeats on a separate connection, seems to work quite well and makes investing in a TCP re-implementation questionable.
One thing that's missing in ZeroMQ though, is the ability to funnel all the communication through a single open port. That's definitely on the roadmap for nanomsg.
As for WebSockets, I've tried to participate in the standardisation with insights like those in this article, but I didn't have much time back then. My feeling at the time was that while the spec defined the wire format for heartbeats and multiplexing, it was kind of short on specifying the actual semantics. Anyway, glad to hear it works for you.
I believe I mentioned this in #nanomsg, but this may be of interest to anyone who wants to learn more about the ossification of the transport layer: [Argh, it won't let me post links. Look up the DeDiS group at Yale, project 'Tng']
They suggest fixing what they term the 'transport logjam' by further subdividing the transport layer into:
Semantics (flow control, in-order vs out-of-order, e2e reliability) ["I want TCP/SCTP/etc"] over
Isolation (encryption, datagram integrity) [DTLS or similar] over
Flow (congestion control) [DCCP sans ports] over
Endpoint (ports) [UDP or similar]
That lets NAT deal with the endpoint layer, flow middleboxes like wireless optimizers deal with the flow layer, and everybody stays out of your transport semantics.
Isn't it kind of a chicken-and-egg problem? To introduce the new fine-grained layering at L4, they would first have to beat the very transport logjam they are trying to solve.
Not really - they use a system they're calling 'minion' in order to bypass that, by using existing protocols where possible. For instance, using UDP as the endpoint layer (thus it looks like a normal UDP packet) and sticking a portless DCCP on top of it as the flow layer is the trivial example.
One of their more ambitious ones is using COBS (Constant-Overhead Byte Stuffing, a record separation format) and a new sockopt (TCP_UNORDERED) in order to add out-of-order delivery to the TCP *API* without changing the wire protocol. That lets them use it as the endpoint and flow layers.
They took the same tack with TLS, too, so you can have what the network thinks is a normal TLS session but the API for the user is out-of-order unreliable datagrams.
Btw, I've allowed posting links in comments.
Isn't there a userspace SCTP over UDP solution? That both plays well with firewalls and gives you the advantages of SCTP with just a small amount of extra overhead.
Oh, and as far as fossilization goes, don't forget SPDY which is basically the subset of SCTP that is useful for HTTP, plus an HTTP specific compression library implemented on top of TCP.
As for SCTP over UDP, I thought about it myself, but I am not aware of such a thing actually being out there. If you have any pointers, I would love to check it out.
WebRTC DataChannels are using SCTP over UDP+DTLS. FWIW, I have a writeup @ hpbn.co/webrtc. A couple of deep-links into the chapter:
http://chimera.labs.oreilly.com/books/1230000000545/ch18.html#_real_time_network_transports
http://chimera.labs.oreilly.com/books/1230000000545/ch18.html#_delivering_application_data_with_sctp
Available in Chrome and Firefox today, which is pretty neat!
The generic SCTP-over-UDP thing would be extremely useful in business messaging scenarios. What's the actual availability of the solution? Is it an actual C library that can be used anywhere? Does it require a daemon to watch for incoming connections etc.?