If one looks closely enough at green threads, such as those launched by Go's 'go' statement, it's not hard to realise that they are really a flow control mechanism similar to 'if', 'for' or 'while' statements. Where 'if' allows you to skip a block of code, green threads give you a way to easily switch between different points of execution. Very much like 'if' or 'while', they are glorified jump statements.
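To make the analogy concrete, here's a minimal sketch in C using libmill (the green-thread library mentioned further down); the exact API details aside, the point is that launching a green thread reads just like any other control flow statement:

    #include <stdio.h>
    #include <libmill.h>

    coroutine void greet(void) {
        printf("world\n");
    }

    int main(void) {
        if (1)                 /* 'if' jumps over a block of code */
            printf("hello ");
        go(greet());           /* 'go' jumps to a new point of execution */
        yield();               /* give the green thread a chance to run */
        return 0;
    }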
However, to use them as routinely and carelessly as we use ifs or whiles, the green threads would have to be as cheap to use as ifs and whiles are.
That sounds like a rather tall order, given that we tend to think of threads as being on a much higher abstraction level than the humble 'if' statement, but maybe we are misjudging the difficulty of the problem. Let's have a look!
First of all, the 'go' statement in Go does parallelism in addition to simple green-threading. With all the talk about 'concurrency is not parallelism' that seems to be part of Go's DNA, I find the design decision to support both with a single statement a bit weird. In any case, let's ignore the parallelism part and assume that our ideal 'go' statement does just the concurrency part. In other words, that it never involves switching to a different CPU core. If it were not so, the implementation would have to handle synchronisation between cores, locking of cache lines and so on. That would in turn make it slower than other flow control structures.
Next, let's ignore all the hardware and software corner cases, such as chips with no virtual memory, no caches, 16-bit CPUs or similar. Our goal is not to demonstrate that 'go' can be as fast as 'if' in every possible case but that it can be made comparably fast if we really want to. We should also show that doing so doesn't require major changes in our mainstream technology stacks.
That being said, let's focus on the technical challenge now.
There are two parts to the problem: Stack management and context switching.
So, can we allocate a stack in time comparable to the execution of an 'if' statement?
Imagine that instead of allocating the stacks as we go, we simply preallocate a bunch of them at startup. This really means just reserving part of the virtual address space for stacks. If we reserve 4GB and assume each stack is 1MB long, we have enough space for 4096 concurrently running green threads.
With memory overcommit this doesn't even consume any physical memory. The pages in question are mapped to physical memory only if they are actually used.
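As a rough sketch of what the reservation could look like on Linux (a 64-bit system and enabled overcommit are assumed; MAP_NORESERVE just makes the intent explicit):

    #include <stdio.h>
    #include <sys/mman.h>

    #define STACK_SIZE (1024 * 1024)   /* 1MB per green thread */
    #define NUM_STACKS 4096            /* 4GB of address space in total */

    /* Reserve address space for all stacks up front. No physical memory
       is committed until a page is actually touched. */
    static char *reserve_stacks(void) {
        void *p = mmap(NULL, (size_t)STACK_SIZE * NUM_STACKS,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        return p == MAP_FAILED ? NULL : (char*)p;
    }

    int main(void) {
        char *stacks = reserve_stacks();
        if (!stacks) { perror("mmap"); return 1; }
        /* the i-th stack lives at [stacks + i * STACK_SIZE,
           stacks + (i + 1) * STACK_SIZE) */
        printf("reserved %d stacks at %p\n", NUM_STACKS, (void*)stacks);
        return 0;
    }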
One has to consider the impact of page faults, of course, but the same is true of classic call stacks: when an 'if' statement happens to cause a page fault it will be slow. Green threads don't change anything in this respect.
But wait, won't creating a new green thread *always* cause a page fault?
Not if the stacks are cached in a sane way. If the most recently freed stack is re-used by the new green thread, the page will already be mapped to physical memory. Even better, given that the new green thread will start messing with the top of the stack (which is exactly the last location the previously finished green thread was accessing), chances are good that the memory in question will still be sitting in one of the CPU caches.
In the end, allocating stacks boils down to doing a single pop operation on a singly-linked list, which can be super fast.
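A minimal sketch of such a stack cache, assuming the first bytes of a free stack can be used to hold the link pointer:

    /* Freed stacks are kept on a singly-linked list, using the stack
       memory itself to store the link, so allocation is a single pop
       and the most recently freed stack is re-used first. */
    typedef struct stack_node {
        struct stack_node *next;
    } stack_node;

    static stack_node *free_stacks = NULL;

    static void stack_free(void *stack) {
        stack_node *node = (stack_node*)stack;
        node->next = free_stacks;
        free_stacks = node;               /* push */
    }

    static void *stack_alloc(void) {
        stack_node *node = free_stacks;
        if (node)
            free_stacks = node->next;     /* pop */
        return node;   /* NULL: cache empty, carve out a fresh stack */
    }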
Additionally, if this kind of programming model becomes common, one can imagine optimising the stack allocation mechanism even on the microarchitecture level.
The second part of the problem is context switching.
To do a context switch one has to store the current state of all registers and load a different, previously saved, state instead. This is going to be slower than stack allocation, but still doable in 20 or so CPU cycles.
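For illustration, here's what the save-and-restore dance looks like with POSIX ucontext. This is just a sketch of the principle: swapcontext also saves the signal mask, which costs a system call on Linux, so a real green-thread library would use a much leaner hand-rolled switch.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, green_ctx;
    static char green_stack[64 * 1024];

    static void green_thread(void) {
        puts("green thread: first run");
        swapcontext(&green_ctx, &main_ctx); /* save our registers, load main's */
        puts("green thread: resumed");
    }                                       /* on return, uc_link takes over */

    int main(void) {
        getcontext(&green_ctx);
        green_ctx.uc_stack.ss_sp = green_stack;
        green_ctx.uc_stack.ss_size = sizeof(green_stack);
        green_ctx.uc_link = &main_ctx;      /* where to go when it returns */
        makecontext(&green_ctx, green_thread, 0);

        swapcontext(&main_ctx, &green_ctx); /* save main's registers, run it */
        puts("main: back in main");
        swapcontext(&main_ctx, &green_ctx); /* resume the green thread */
        puts("main: done");
        return 0;
    }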
When doing a context switch we also have to choose the next green thread to schedule. I don't want to dive deep into scheduling algorithms, but the simplest possible one is a plain FIFO queue, i.e. a singly-linked list, which is easy and quick to manipulate: we need one push and one pop operation for each context switch.
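A sketch of such a run queue, assuming each green thread descriptor carries an intrusive 'next' pointer:

    /* The simplest possible scheduler: a FIFO of runnable green threads
       kept as a singly-linked list. A context switch pushes the current
       thread at the tail and pops the next one from the head. */
    typedef struct green_thread {
        struct green_thread *next;
        /* saved register state, stack pointer etc. would live here */
    } green_thread;

    static green_thread *runq_head = NULL;
    static green_thread *runq_tail = NULL;

    static void runq_push(green_thread *t) {
        t->next = NULL;
        if (runq_tail) runq_tail->next = t;
        else runq_head = t;
        runq_tail = t;
    }

    static green_thread *runq_pop(void) {
        green_thread *t = runq_head;
        if (t) {
            runq_head = t->next;
            if (!runq_head) runq_tail = NULL;
        }
        return t;   /* NULL means nothing is runnable */
    }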
Once again, there's a lot of optimisation that CPU designers could do if this became a mainstream model of computation.
In summary, 'go' can't be exactly as fast as 'if', but it can be comparable with execution times deep in the nanosecond range.
Once that happens programmers will be free to use green threads to solve even the smallest problems — like those that we now solve using ifs and whiles — and to do so even in performance-critical environments.
Martin Sústrik, May 8th, 2016
Wouldn't it be similar to how Erlang programmers have been using processes for the last 30 years?
Sure. Do you happen to know what the performance looks like in Erlang? Is it closer to if/while or rather closer to an OS thread?
Erlang thread creation is comparable to invoking 10 Erlang function calls.
No amount of caching, pre-allocating, or any other kind of trick is going to be able to come anywhere close to the speed of
cmp.l #$something, d0
bne.s Somewhere
.Else: …
That all happens in one or two clock cycles so we are talking, what, nanoseconds? To execute all the code making use of the pre-cached stack, even if it takes only four to ten instructions, we are already in millisecond territory. Multiplied by a number of threads… I just can't see it.
Do you have any measurements with which to prop up these wild, wild assertions?
4 instructions taking 1 millisecond? You are off by many orders of magnitude.
The measurements with libmill show that it's able to launch ~20 million green threads per second, i.e. ~50ns per green thread.
Of course, with proper support at the language/OS/microarchitecture level one would expect it to be faster.
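For the record, the measurement boils down to something like the following sketch (a hypothetical benchmark: it assumes each empty coroutine finishes before the next one is launched, so its stack is recycled from the cache every time):

    #include <stdio.h>
    #include <libmill.h>

    coroutine void worker(void) {
        /* empty: we measure pure launch-plus-teardown cost */
    }

    int main(void) {
        int count = 10000000;
        int64_t start = now();             /* libmill: current time in ms */
        for (int i = 0; i != count; ++i)
            go(worker());
        int64_t elapsed = now() - start;
        printf("%d green threads in %lld ms (~%lld ns each)\n",
               count, (long long)elapsed,
               (long long)(elapsed * 1000000 / count));
        return 0;
    }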
Yes, the admin is way off. And yet not.
Careful with microbenchmarking! Unless one has a very profound understanding of a variety of fields, ranging from the inner workings of a processor to the OS to abstract reasoning, chances are one will end up with funny numbers without much meaning.
I'm currently working on a project with some extremely time-critical parts and, despite having 30+ years of experience, had plenty of painful learning to do. And when I say time-critical I mean it; time-critical not only as in "using custom functions rather than libc ones" but also as in "avoiding an if (let alone one in a loop)".
I don't believe in those 50ns green threads, mainly for 3 reasons:
1) (and most important) The user of your library. Chances are extremely slim that (s)he won't cripple your Ferrari down to a taxi cab or even to a tractor.
2) Optimization tends to shift responsibility and effort elsewhere. Yes, one can build a "launch_green_thread()" working within 50 ns, but in most high-performance libs I've seen the fat price tag comes on the user side. Example case: your code might succeed in avoiding a cache miss but someone *will* cause one. Hunting down yet another 5 ns (so we look bright) may well turn out to be a trap for the user, and the end result is worse (which would be point 4 in this list: it's not the single routines and not even your lib, it's the whole program that counts).
3) Often it simply doesn't make sense. And by "often" I mean exactly the classical cases like networking. You may use a blocking, pseudo-blocking or non-blocking approach, but in the end there will be OS calls involved, and one ends up optimizing one inch of a mile-long train.
Let me end with a comparison from the security area: everyone and his cat is hunting for ever more secure crypto. In reality, though, SSL is messed up, and even that is not important because attackers don't attack crypto (as in "mathematics") - they attack implementations. Or worse, they attack the victim's unpatched Flash player.
So, security efforts must not be about even-more-secure-crypto, they must not be about optimizing crypto from 99.995% to 99.997% - no, they must enhance those 95% of code of extremely lousy quality; they must bring the security there from 9% to 80%.
In a similar way, the problem isn't code that wastes 3% or even 15% of performance. The problem is the vast amount of code where "new connection" equals "oh well, let's bring up a new OS process", or where the developers went for select rather than at least poll, let alone aio.
And finally, your 50 ns green threads aren't worth anything if the vast majority of users get the wall time (expensive OS call) and then use snprintf (about 150 ns) for error message formatting.
That, sir, is what we call a big fat attack vector (by DoSing your server into spitting out error messages and letting it strangle itself in the process).
While green threads may be faster than LWPs at creation and context-switch operations, they lack the basic reasons for having threads in the first place: parallelism and fast activation on external events.
Using "green threads to solve even the smallest problems" ? You are joking here, right ?
No, why? Here's my problem: imagine a network stack with, say, 10 thin layers. The best way to model a single layer is 2 green threads (one for sending, one for receiving). If you avoid using green threads you get an unreadable and unmaintainable mess. So, each byte going out and each byte coming in has to pass through 10 green threads. To do this at network speed, green threads cannot be much more expensive than ifs and whiles. Doing inter-core synchronisation, for example, is definitely not an option.
No, the best way to model a network stack layer doesn't involve a thread for each direction of each layer, especially not green threads. First of all, your network stack will have zero parallelism; second, the whole stack's speed will be dictated by the slowest layer processing the packet data, not in a parallel way but in an additive, self-propagating-delay way. Also, there is absolutely no reason for a context switch when passing from one layer to another.
Interlocking in a network stack is usually avoided by using lockless network stacks distributed over multiple cores, stacks that use the different queues exposed by the NIC for enqueueing and dequeueing frames.
I understand where you are coming from, but my use case wasn't processing incoming IP packets and dispatching them to different sockets.
It was rather the stack above the TCP connection: a message delimitation layer, an encryption layer, a websocket layer, etc. In short, think Netty.
Now, there's not much to gain from parallelism as we are operating on ordered byte streams anyway. Quite the opposite: the whole thing may end up slower because of pointless synchronisation between cores.
As for the reason to use green threads, consider an object that takes TCP packets on one side and transforms them into websocket messages on the other. Without green threads you'll end up with a super-ugly and hard-to-maintain state machine.
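To give a flavour of what that looks like with green threads, here's a sketch of a single such layer written against libmill's channel API (the layer name and the transform are made up; the point is that it reads as straight-line code instead of a state machine):

    #include <libmill.h>

    /* One protocol layer: a green thread that reads bytes from the layer
       below and passes transformed bytes to the layer above. Stacking
       ten layers is just launching ten of these. */
    coroutine void decoder_layer(chan in, chan out) {
        while(1) {
            int b = chr(in, int);   /* take a byte from the layer below */
            /* ...delimit messages, decrypt, unmask, etc. ... */
            chs(out, int, b);       /* hand the result to the layer above */
        }
    }

    int main(void) {
        chan below = chmake(int, 64);
        chan above = chmake(int, 64);
        go(decoder_layer(below, above));
        chs(below, int, 42);        /* feed a byte in at the bottom */
        int b = chr(above, int);    /* read the transformed byte at the top */
        (void)b;
        return 0;
    }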
I hear that exact same thing about every 10 years.
Somebody comes up with a new abstraction, and then somebody figures out how to make it efficient, and people realize they can use it to make their lives easier, and then some old person from before step 2 appears and says "WTF?"
Won't 1MB stacks kill locality of reference? Firing up the same thread 1000 times might be reasonably fast, but you'll probably want to use $core_count threads, at least..
With a 1MB preallocated stack and register capture, the "green" thread would start looking exactly like a native thread!
Except that everything is running on a single core, i.e. it's very fast.