Implementing network protocols in user space is not as uncommon as it may seem. Compared to implementing them in kernel space, the advantages range from easier development, more freedom to experiment and greater portability, through less hassle with distribution (the pain of getting a patch into the mainline kernel vs. simply uploading a user space library to GitHub) and improved performance thanks to kernel by-pass, to the fact that for proprietary platforms like Windows there is simply no way to ship a kernel space implementation.
While my experience in the area is based on my work on ZeroMQ and, more recently, nanomsg, there are many other libraries that face the same challenge. One may mention, for example, OpenPGM (reliable multicast, RFC 3208), UDT (a bulk transfer protocol) or OpenOnload (Solarflare's kernel by-pass networking).
This article gives an overview of the technical challenges of designing an idiomatic API for a network protocol implemented in user space, and provides the background for my recent EFD_MASK patch to the Linux kernel.
When implementing a network protocol in user space, you will almost certainly start with some kind of clone of the BSD sockets API:
struct myproto_sock {
...
};
struct myproto_addr {
...
};
struct myproto_sock *myproto_socket (void);
int myproto_close (struct myproto_sock *s);
int myproto_bind (struct myproto_sock *s, struct myproto_addr *addr);
int myproto_connect (struct myproto_sock *s, struct myproto_addr *addr);
ssize_t myproto_send (struct myproto_sock *s,
const void *buf, size_t len, int flags);
ssize_t myproto_recv (struct myproto_sock *s,
void *buf, size_t len, int flags);
However, this API is not sufficient for anything but the simplest client applications. To handle more than one socket in parallel, you need polling. For example, your server application may want to wait until at least one of several myproto sockets becomes readable.
Once again, the most obvious solution is to fork the polling API from BSD sockets:
struct myproto_pollfd {
struct myproto_sock *fd;
short events;
short revents;
};
int myproto_poll (struct myproto_pollfd *fds, nfds_t nfds, int timeout);
The next stumbling block is combining classic TCP or UDP sockets with myproto sockets in a single pollset. What if the server application wants to handle a set of myproto sockets along with a set of TCP sockets? The above API doesn't support that. We have to modify the myproto_pollfd definition to allow for using either a myproto_sock pointer or a regular OS-level file descriptor:
/* rawfd field is used instead of fd when the latter is set to NULL */
struct myproto_pollfd {
struct myproto_sock *fd;
int rawfd;
short events;
short revents;
};
The API is getting a little ugly, but it's still more or less usable. Unfortunately, that's not the end of the trouble. Actually, this is the point where things start to get hairy. Let's inspect the next use case.
There are many widely used asynchronous networking frameworks out there. Their main goal is to shield the user from the complexity of handling many connections in parallel. Obviously, the central piece of each such framework is a loop that polls on a set of sockets and invokes user-supplied handlers when the poll reports a network event. The polling itself is done via one of the system-level polling mechanisms, such as select, poll, epoll, kqueue, IOCP or /dev/poll.
However, myproto sockets require a special polling function (myproto_poll) and thus cannot be integrated with such frameworks.
In theory, a framework could be modified to use myproto_poll instead of the system poll, but such a patch is never going to get into the mainline. The most obvious reason is that it makes the framework dependent on the myproto library, but the real problem occurs when there's a need to handle two different user space protocols. One asks the framework to use the myproto_poll function, the other asks it to use someoneelsesproto_poll. There's no way to reconcile the two.
In short, the protocol developer's only option is to somehow provide a native file descriptor for the frameworks to poll on. It can be done, for example, in the following way. This example assumes that myproto is a protocol built directly on top of the IP layer:
struct myproto_fd {
int raw; /* underlying IP (SOCK_RAW) socket */
...
};
int myproto_getfd (struct myproto_fd *s)
{
return s->raw;
}
The user is expected to retrieve the underlying system file descriptor from myproto_fd (via the myproto_getfd function) and use it for the actual polling (select, poll, epoll etc.):
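For instance, a minimal sketch (assume s is an already opened myproto_fd):
struct pollfd pfd;
pfd.fd = myproto_getfd (s);
pfd.events = POLLIN;
poll (&pfd, 1, -1);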
The problem with this approach is that the underlying file descriptor signals individual poll events based on what's happening at the IP layer, rather than at the myproto layer. For example, if a myproto control packet with no embedded user data arrives, the file descriptor will signal POLLIN, but a subsequent myproto_recv will return no data.
To solve this problem we have to implement a new function that will check whether the socket is really readable. Something like this:
struct myproto_fd {
int raw; /* underlying IP (SOCK_RAW) socket */
uint8_t rx_buffer [128 * 1024]; /* receive buffer */
size_t bytes_in_rx_buffer; /* amount of user data in receive buffer */
...
};
int myproto_getfd (struct myproto_fd *s)
{
return s->raw;
}
int myproto_can_recv (struct myproto_fd *s)
{
if (s->bytes_in_rx_buffer > 0)
return 1;
else
return 0;
}
The intended usage is as follows:
- Retrieve the raw file descriptor from the myproto socket.
- Poll on it.
- When POLLIN is signaled, use myproto_can_recv to find out whether the socket is really readable.
- If so, receive the data via myproto_recv. The call is now guaranteed to return at least 1 byte.
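Put together, the receive path might look like this (a sketch based on the functions above):
char buf [1024];
ssize_t nbytes;
struct pollfd pfd;
pfd.fd = myproto_getfd (s);
pfd.events = POLLIN;
poll (&pfd, 1, -1);
if ((pfd.revents & POLLIN) && myproto_can_recv (s)) {
    nbytes = myproto_recv (s, buf, sizeof (buf), 0);
    /* at this point the call is guaranteed to return at least 1 byte */
}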
That's already pretty ugly. In the real world the API tends to get even worse. There are multiple reasons for that:
- The need to report special conditions like POLLERR, POLLPRI or POLLHUP.
- Using multiple underlying raw sockets.
- Handling underlying raw sockets in a background thread.
- Etc.
Let's consider the example of ZeroMQ. In ZeroMQ, the underlying sockets are managed by a worker thread. The worker thread communicates with the user thread by sending it events via a socketpair. One event may mean, for example, "new messages have arrived". However, for efficiency reasons the event is not sent for every single arriving message, only when there was no message in the rx buffer beforehand. This kind of approach is called edge-triggering.
So, when the user thread wants to poll on a ZeroMQ socket, it retrieves the socket's internal file descriptor (the receive side of the socketpair) and uses it for polling. Given that the descriptor is edge-triggered, the user has to deal with edge-triggering in the application. And it turns out that edge-triggering is very counter-intuitive for most users, that it adds significant complexity to the API, and that it leads to subtle bugs in applications.
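To illustrate, the resulting usage pattern looks roughly like this (a sketch using the ZMQ_FD and ZMQ_EVENTS socket options; zsock stands for a ZeroMQ socket, error handling omitted):
int fd;
size_t fdsize = sizeof (fd);
zmq_getsockopt (zsock, ZMQ_FD, &fd, &fdsize);
struct pollfd pfd;
pfd.fd = fd;
pfd.events = POLLIN;
poll (&pfd, 1, -1);
/* POLLIN on the descriptor doesn't mean a message is available;
   the actual socket state has to be fetched via ZMQ_EVENTS. */
int events;
size_t evsize = sizeof (events);
zmq_getsockopt (zsock, ZMQ_EVENTS, &events, &evsize);
/* Because of edge-triggering, all pending messages have to be drained
   before polling again; otherwise the descriptor may never signal again. */
while (events & ZMQ_POLLIN) {
    zmq_msg_t msg;
    zmq_msg_init (&msg);
    zmq_msg_recv (&msg, zsock, ZMQ_DONTWAIT);
    zmq_msg_close (&msg);
    zmq_getsockopt (zsock, ZMQ_EVENTS, &events, &evsize);
}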
If you want to have a closer look at the convoluted API, check the documentation for zmq_poll as well as for the ZMQ_FD and ZMQ_EVENTS socket options. OpenPGM gets into a pretty similar situation — see the PGM_SEND_SOCK, PGM_RECV_SOCK, PGM_PENDING_SOCK and PGM_REPAIR_SOCK socket options. The UDT implementation, if I recall correctly, just gives up and doesn't provide any generic polling mechanism at all. OpenOnload, on the other hand, hijacks the whole system socket API and replaces it with its own implementation.
As can be seen, all the possible solutions are basically ugly hacks. The question, then, is what can be done to fix the problem in a systematic manner.
Here's where the proposed EFD_MASK patch for the Linux kernel kicks in.
The fundamental idea is that it should be possible to create a system-level file descriptor to work as a placeholder for a socket implemented in the user space. Additionally, user space should be able to specify which events will be returned when the descriptor is polled on.
And as it turns out, Linux already offers an object that almost fits the bill. It's called eventfd.
For those not familiar with eventfd: it is basically a counter. By writing to the eventfd you increase the counter; by reading from it you decrease the counter. When you poll on an eventfd, it signals POLLIN when the counter contains a value greater than zero. POLLOUT is signaled all the time, except when the counter reaches its maximum value of 0xfffffffffffffffe.
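A minimal sketch of the counter semantics (error handling omitted):
int efd = eventfd (0, 0);
uint64_t val = 3;
write (efd, &val, sizeof (val)); /* counter += 3 */
read (efd, &val, sizeof (val));  /* val == 3; the counter is reset to 0 */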
There are a couple of things missing, though:
- There's no way to associate user space data (socket state) with eventfd.
- While eventfd is great for signaling POLLIN and more or less viable for signaling POLLOUT, you can't force it to signal special events, such as POLLHUP or POLLPRI.
- You can't even make it signal all possible combinations of POLLIN and POLLOUT. Specifically, there's no way to signal !POLLIN && !POLLOUT, a pretty common condition in network protocols that occurs when the receive buffer is empty and the send buffer is full.
To solve these problems we need, first, to associate an opaque pointer with the eventfd and, second, to replace eventfd's counter semantics with mask semantics, i.e. the network protocol implementation should be able to explicitly specify the mask of events to be signaled when the file descriptor is polled on.
At the moment, the addition to the existing Linux eventfd API looks like this:
#define EFD_MASK 2
struct efd_mask {
uint32_t events;
union {
void *ptr;
uint32_t u32;
uint64_t u64;
};
} __attribute__ ((__packed__));
You can use the eventfd system call in combination with the EFD_MASK flag to create a special type of eventfd object with mask semantics:
s = eventfd (0, EFD_MASK);
You can use the write system call to set the events and the opaque data on the eventfd:
struct efd_mask mask;
mask.events = POLLIN | POLLHUP;
mask.u32 = 1234;
write (s, &mask, sizeof (mask));
Afterwards, you can use the read system call to get the currently set events and opaque data from the eventfd:
struct efd_mask mask;
read (s, &mask, sizeof (mask));
assert (mask.u32 == 1234);
Finally, when you poll on the eventfd using the select, poll or epoll_wait function, you'll get the events specified by the last mask written to the eventfd:
struct pollfd pfd;
pfd.fd = s;
pfd.events = POLLIN | POLLOUT;
int cnt = poll (&pfd, 1, -1);
assert (cnt == 1);
assert (pfd.revents == (POLLIN | POLLHUP));
What follows is an example of a network protocol implemented in user space. It takes advantage of the EFD_MASK functionality to provide socket-like behaviour to the user. There are only three functions in the example (opening a socket, closing a socket and receiving data). The other functions (send, setsockopt, connect, bind etc.) are left as an exercise for the reader:
struct myproto_state
{
/* Underlying raw sockets, protocol state machine etc. go here. */
};
int myproto_socket (void)
{
int s;
struct myproto_state *state;
struct efd_mask mask;
/* Create the file descriptor to represent the new myproto socket. */
s = eventfd (0, EFD_MASK);
/* Create socket state and associate it with eventfd. */
state = malloc (sizeof (struct myproto_state));
mask.events = 0;
mask.ptr = state;
write (s, &mask, sizeof (mask));
return s;
}
int myproto_close (int s)
{
struct efd_mask mask;
struct myproto_state *state;
/* Retrieve the state. */
read (s, &mask, sizeof (mask));
state = mask.ptr;
/* Deallocate the state and close the eventfd. */
free (state);
close (s);
return 0;
}
ssize_t myproto_recv (int s, void *buf, size_t len, int flags)
{
struct efd_mask mask;
struct myproto_state *state;
ssize_t nbytes;
/* Retrieve the state. */
read (s, &mask, sizeof (mask));
state = mask.ptr;
... move data from protocol's rx buffer to the user's buffer, setting nbytes ...
/* If there are no more data in rx buffer, unsignal POLLIN. */
if (state->rx_buffer_size == 0)
mask.events &= ~POLLIN;
/* Store the modified poll flags. */
write (s, &mask, sizeof (mask));
return nbytes;
}
The code is relatively self-explanatory, so let me make one final remark. All the socket functions (except for socket creation and termination) follow the same pattern:
- Read from the eventfd to get the socket state.
- Do the actual work, modifying the poll flags in the process if needed.
- Write the state back to the eventfd.
Following this pattern should make implementing network protocols in user space easy. Or, if not easy (network protocols rarely are), it at least allows developers to focus on the functionality of the protocol rather than forcing them to fight the constraints and deficiencies of the underlying operating system.
Martin Sústrik, February 8th, 2013
What is wrong with the approach used by OpenSSL, Postgres and probably more libraries?
For non-blocking I/O you basically do this:
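Something along these lines, where myproto_process and the MYPROTO_* codes are hypothetical names mirroring OpenSSL's SSL_get_error pattern:
switch (myproto_process (s)) {
case MYPROTO_OK:
    ... /* done; data is available via myproto_recv */
case MYPROTO_WANT_READ:
    ... /* poll the raw fd for POLLIN, then run this again */
case MYPROTO_WANT_WRITE:
    ... /* poll the raw fd for POLLOUT, then run this again */
case MYPROTO_WANT_READ | MYPROTO_WANT_WRITE:
    ... /* poll for both */
default:
    ... /* protocol error */
}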
Once you detect POLLIN or POLLOUT you just run the code above again. This isn't much different from using raw sockets in non-blocking mode.
Also, this should be doable in user space. If your myproto implementation uses threads in the background, you just return a pipe or eventfd file descriptor that you control the other end of.
For blocking I/O it should be even simpler, since myproto_recv() will only return something positive or a protocol error.
"just return a pipe or an eventfd filedescriptor that you control the other end of"
Well, exactly. However, there's no way for a pipe or eventfd to signal !POLLIN & !POLLOUT.
Also, there's no way to signal POLLERR, POLLHUP, POLLPRI etc.
That's what the kernel patch allows you to do.
Right, so what I can't seem to find in your post is why it is important that poll() returns POLLERR or POLLPRI, and why it is not enough that myproto_recv() or myproto_write() returns an error or something like MYPROTO_GOT_PRIORITY_MESSAGE.
That's just another case in the switch above.
There's no problem with read and write themselves. They are fully implemented by the library and can do whatever the library implementer wants them to do.
The problem is with poll/select. These functions are global, not protocol-specific, i.e. they span across all the possible protocols: TCP, UDP, SCTP, myproto, yourproto etc.
What that means is that you, as the library implementer, have no control of how they work. Still, you need your protocol implementation to be pollable.
To do that you need a native file descriptor (poll/select won't accept anything else).
The only ways to create a file descriptor in user space are pipe, socketpair and eventfd (Linux-only). All of them suffer from the same problems: there's no way to signal !POLLIN && !POLLOUT, nor to signal special events such as POLLERR, POLLHUP or POLLPRI.
Yet one more try:
Imagine that your protocol implementation needs to both read more and write more. Thus, the first function in your example returns MYPROTO_WANT_READ|MYPROTO_WANT_WRITE. The user retrieves the file descriptor (which is a pipe under the covers) and polls for POLLIN|POLLOUT.
However, there's no way to signal !POLLIN && !POLLOUT on the pipe. So at least one of the two events is signaled at any given time. Thus, poll(POLLIN|POLLOUT) is never going to block!
Yes, but this still doesn't explain why you need to signal !POLLIN & !POLLOUT in the first place.
In other words, my question is this: why isn't it enough to have POLLIN mean "please call my library as soon as you can" and !POLLIN mean "chill"? Then, after the user calls your library, you can return all sorts of messages to the caller, like "EAGAIN" or "you've got a new priority message" or "the other end hung up unexpectedly" etc.
Well, that's what everybody is doing. I've implemented that several times myself…
So it starts with this:
struct myproto_socket {
int in_pipe [2];
int out_pipe [2];
int err_pipe [2];
int hup_pipe [2];
…
};
One problem with that is the number of fds used. Users start to hit the system fd limit pretty quickly.
So you change the implementation to use just one pipe plus a set of functions to test readability/writability/error/hangup etc. That's a rather weird API, but users are mostly still able to cope with it.
Then you start hitting performance limits, so you switch the pipe from level-triggered to edge-triggered mode. Now the users have to cope with edge-triggering. At this point most of them are profoundly confused.
I can point to many discussions on the ZeroMQ mailing list where people are confused by this API. I think I saw one such email just yesterday.
I see. So basically the idea is to cram 3 or 4 different kinds of events into a single fd instead of 1. I just don't see why this code:
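(A sketch, with s being the single fd:)
struct pollfd pfd;
pfd.fd = s;
pfd.events = POLLIN | POLLOUT;
poll (&pfd, 1, -1);
if (pfd.revents & POLLIN)
    ... /* read */
if (pfd.revents & POLLOUT)
    ... /* write */
if (pfd.revents & POLLERR)
    ... /* handle the error */
if (pfd.revents & POLLHUP)
    ... /* handle the hangup */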
is much easier than this:
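(Again, with hypothetical names:)
switch (myproto_wait (s)) {
case MYPROTO_READABLE:
    ... /* read */
case MYPROTO_WRITEABLE:
    ... /* write */
case MYPROTO_ERROR:
    ... /* handle the error */
case MYPROTO_HUP:
    ... /* handle the hangup */
}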
And surely, if you have performance problems with the last bit of code, you'll have the same problems with the first bit. It'll just be the event loop's job to fan out the code paths depending on the event flags rather than the switch. Right?
Also, the second version can easily be extended to protocols that have more interesting events than the TCP/stream-derived POLLIN, POLLERR, POLLPRI and POLLHUP.
First of all, this post is about *API* problem. I am not arguing the thing cannot be done with pipe/socketpair/eventfd. What I am saying is that when doing so the API is convoluted and confusing. Check ZeroMQ mailing list for messages from confused users using the API.
What I want to achieve is to enable the user of the library to write the following code:
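(A sketch based on the myproto functions from the article:)
char buf [1024];
int s = myproto_socket ();
struct pollfd pfd;
pfd.fd = s; /* a myproto socket is just a file descriptor */
pfd.events = POLLIN;
poll (&pfd, 1, -1);
if (pfd.revents & POLLIN)
    myproto_recv (s, buf, sizeof (buf), 0);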
Note that the above is pure POSIX, something that everyone is familiar with.
To put it in a different way: Thanks to this patch, user-space implementations of network protocols can expose exactly the same API to the user as kernel-space implementations.
I think the best approach is to define an abstraction of a notification API, like a simplified version of libevent etc., and allow the user of your library to implement it. For example, the library would define this:
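(A sketch; the names are hypothetical:)
struct myproto_io {
    void *ctx;
    ssize_t (*send) (void *ctx, const void *buf, size_t len);
    ssize_t (*recv) (void *ctx, void *buf, size_t len);
    /* ask the user's event loop to call back when the underlying
       transport becomes readable/writable */
    void (*want_read) (void *ctx);
    void (*want_write) (void *ctx);
};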
Then make everything in your library do I/O via this abstract API. On the other side, the user of the library is free to implement the API however he wishes, be it an existing event loop library (libevent, libev, glib, Qt) or his own creation.
Well, you are definitely free to define such abstraction. The only problem is that for it to be useful it has to be standardised and widely adopted. Which, of course, is hard to achieve :)
Anyway, we have one such standardised and widely adopted abstraction, called "file descriptors". They work perfectly OK and there's no technical need to replace them with a different standard. There's only one little piece missing, namely the ability to create fully functional file descriptors in user space. My kernel patch solves this problem.
Yes, you need a kernel patch - on every operating system you want to support, and every user would need to have it. On the other hand I just want working software, now, on existing systems.
In the end, if you want your library to be completely portable, i.e. independent of any particular operating system and using only the C/C++ language, such user-implementable interfaces are the only way. For that you would need a higher I/O level interface instead of file descriptors, since file descriptors are OS-specific.
Some real libraries use such interfaces. For example take NSS/NSPR - you can use the NSS library to do SSL on top of absolutely anything as long as you can make it appear like TCP (reliable stream), in non-blocking mode (I've done it, it works).
" On the other hand I just want working software, now, on existing systems."
I have a working solution now. No problem with that. The problem is that it's ugly and confusing for the end users. That's what I am trying to solve.
"Yes, you need a kernel patch - on every operating system you want to support, and every user would need to have it."
Exactly. Linux seems to be a good starting point as it is rather widely deployed.
"For that you would need a higher I/O level interface instead of file descriptors, since file descriptors are OS-specific."
Why so? File descriptors are defined in POSIX. (Unless you are speaking of Windows, but nothing works as expected on Windows, so there's little point in caring.)
"For example take NSS/NSPR - you can use the NSS library to do SSL on top of absolutely anything as long as you can make it appear like TCP (reliable stream), in non-blocking mode (I've done it, it works)."
AFAICS the only way to make a user-space implementation of a protocol behave exactly like TCP is to use the kernel patch. OK, there is one other option: re-implement the whole BSD socket API in a library and overload the system functions by linking with that library. Some products do that (SDP, OpenOnload) but it's kind of a brute-force approach.
Yes, I have systems other than POSIX in mind. Including Windows, and systems with no operating systems at all like microcontrollers. A library that can't work on top of abstract interfaces is useless in such cases.
It isn't the point to make it appear *exactly* like TCP. After all SSL isn't defined to work on top of TCP. From RFC 5246: "At the lowest level, layered on top of some reliable transport protocol (e.g., TCP), is the TLS Record Protocol."
Kernel implementations of protocols utilizing such abstract libraries also come to mind.
Ah, OK, I see.
Anyway, we've gone pretty far away from the topic of the original article. It was only about slightly extending POSIX to make implementing protocols in user space easier. Implementing protocols in a non-POSIX environment is a different, although extremely interesting, topic.
I strongly disagree with this. Sure, you need to implement the interface - but this is actually pretty easy as long as your program is not broken by design. If you use an existing event loop like libevent, implementing this interface is literally trivial.
Having such an interface standardized sure would be nice, but any interface is much better than no interface at all. It could be the difference between using an existing library by writing a hundred or so lines of glue code, and reinventing whatever you need just because the developers of the library didn't have such abstract usage in mind.
I think we are not on the same page here.
Specifics aside, the problem discussed is about a generic interface for communication between different protocol implementations and different applications. It's a standardisation problem. It only works when all parties agree on the same interface.
I've opted for file descriptors because that's what everybody uses anyway, and enhanced them to support one hitherto unsupported corner case (user-space to user-space signaling). You can, of course, go for a different interface, but the more obscure it is, the less useful it will be.
Hi Martin,
Actually, I'm trying to implement a network protocol (GeoNetworking) in user space and to make it work over the 802.11 MAC layer. I found your post very helpful. Still, I want to know:
1- Does the EFD_MASK patch to the Linux kernel have to be installed first?
2- Is the patch compatible with the linux-2.6.38 kernel?
3- If yes, are there any recommended parameters to enable in the .config file before compiling the kernel?
4- If possible, could you provide the remaining functions not implemented in this post (send, setsockopt, connect, bind)?
5- Could you give a small description of how send() and receive() will communicate with the low-level handlers of the sk_buff structure at the PHY and MAC level?
6- Finally, it would be very helpful if you could share an already implemented example.
Thanks a lot.
Hi,
The patch can be found on LKML (https://lkml.org/lkml/2013/2/8/67). It has not yet been merged into the mainline kernel, as I don't have much time to actively push it. If you want to help with that, it would be great.
So yes, at the moment you have to apply the patch by hand and build the kernel yourself.
The patch was developed against a 3.x kernel, but I guess backporting it to 2.6.38 should be easy; maybe no work is needed at all beyond applying the patch.
No, nothing special has to be done with the .config.
Send, setsockopt etc. are not covered by the patch, as you can implement those in user space easily with no special support from the kernel: just use geonetworking_send(), geonetworking_setsockopt() etc.