I've spent the last month rewriting the nanomsg code to use state machines internally, passing asynchronous events around instead of using random callbacks between the components. The change is complex, requires a lot of work, and is not visible to the end user, so the question is: why do it at all? The time could be put to better use implementing sexy new features that would make the users happy. Instead, progress on the library seems stalled, and such a major rewrite may even result in regressions. Why even bother then?
I'll try to provide the answer in this article. It will introduce a generic argument against callbacks, so if you are using callbacks, give it a read even if you are not interested in nanomsg per se. By the way, note that I am not going to say anything new here. The knowledge has been around for literally decades. However, given the amount of callback-driven programming being done, and the apparent — although completely incomprehensible to me — craving of some ZeroMQ and nanomsg users for callback-based APIs, I believe that re-iterating the basics is useful.
The problem is that the ZeroMQ codebase has evolved to the point where the internal interactions within the library are so complex that adding any new core functionality has become virtually impossible. For example, although the library has existed for six years, no new transport mechanism has been added in that time. There have been many attempts, but none has produced fully stable code that could be merged back into the mainline. Instead, people resort to writing bridges between ZeroMQ and other transports.
One of the goals of nanomsg was to improve the internal architecture of the library in such a way that adding new functionality would be easy. In the first iteration I tried to avoid state machines once again (in some future article I'll try to explain why) and I failed. The complexity gradually crept back. It seems that the only way to keep the complexity at bay is to avoid callbacks.
Consider this code:
struct B;

struct A
{
    void foo ();
    void bar () {++i;}
    B *b;
    int i;
};

struct B
{
    void baz () {a->bar ();}
    A *a;
};

void A::foo () {b->baz ();}
It's pretty simple. A has a pointer to B and vice versa. If you invoke A::foo, it will invoke B::baz, which in turn will invoke A::bar. A::bar increments a member variable of A. There's no catch here. The program works as expected.
However, imagine we make A::foo a little bit more complex:
void A::foo ()
{
    int tmp = i;
    b->baz ();
    assert (tmp == i);
}
We copy a class member variable into a local variable and invoke a function. Nothing wrong with that. However, right afterwards we do a sanity check and test whether the local variable still equals the member variable. Surprisingly, it does not. The value of the member variable has mysteriously changed, just because we've invoked a method of a different object.
Of course, it's easy to spot what went wrong in this example. We can fix the problem in a simple way. For example, we can re-read the value of the member variable into the local variable after the call:
void A::foo ()
{
    int tmp = i;
    b->baz ();
    tmp = i;
    assert (tmp == i);
}
Now the program works as expected. It looks like those pesky callbacks are not that hard to handle after all!
Well, not quite. In what follows I'll try to explain why the code above is a tarpit just waiting to eat you alive.
First, imagine the callback happens in a more complex setup. Object A calls object B, which calls object C, which calls object D, which calls object E, which in turn calls back to object A.
It's pretty obvious that handling the callback is going to be much more complex in this case. The problem stems from the fact that when A invokes B, it has no idea that there, nested five levels deep, is a call back to A. Thus, when the call to B returns, the developer will be genuinely surprised that the state of A has mutated in the meantime.
If there were only a single call in each function (i.e. if the call to B were the only function invocation in A, the call to C the only function invocation in B, and so on), spotting the cycle would still be possible. However, let's suppose that each function contains on average three function invocations. That means that down there, five levels deep, are 3^5 = 243 functions, one of which may — or may not — be a call back to A. That kind of thing is extremely hard to spot just by looking at the code.
Furthermore, many of the functions in the call graph are invoked only when a certain condition is met, and some of those conditions are pretty rare. If the path from A back to A contains several rare conditions, the probabilities multiply and the callback will almost never happen. Thus, even with extensive testing it is entirely possible that the callback will never be triggered and the problem it causes will slip through testing unnoticed — only to occur in production, presumably at the worst possible moment.
Add to that that callback cycles are often not 5 but 10 or 15 steps long. In such an environment there is basically no way to make sure that the program will behave decently. The best you can do is perform some testing, ship the product, then fix the bugs reported by the users. Even then you can be pretty sure there are some very rare bugs still lurking in the codebase.
I am going to suppose that by now I've persuaded you that long cycles in the call graph are a really bad idea. So let's get back to our original example. A calls B, which in turn calls back to A. It's the simplest possible case of a callback. The cycle is immediately visible, and the developer can carefully tune it to work in all circumstances. He can document the cycle and put a big WARNING comment before each function invocation so that no future maintainer of the codebase can accidentally overlook it. What can possibly go wrong?
The problem is that in any realistic scenario the call graph is much more complex than the simple two-node graph shown at the top of this article. There are other functions called by A, and B gets called from other functions as well.
Now imagine that at some point in the future some random developer adds a call from C to D. He's not even aware of the existence of A and B, let alone of the cycle between them. However, the change introduces a new six-node-long cycle.
Suddenly, it may happen that component E makes a call to B — which worked with no problems before — and finds its own state modified when the function returns. Which, of course, makes it fail, or, even worse, misbehave.
To understand the scale of the problem we've created, consider this: the developer of C has made a local change to C. The change has interacted with the small cycle in a completely unrelated subsystem and created a big cycle, which in turn causes yet another completely unrelated component (E) to fail. Now let's assume that the developer of C, the developer of A and B, and the developer of E are three different people, maybe even working in different departments or — if 3rd-party libraries are used — in different companies. A perfectly reasonable change done by the first developer interacts in a bad way with code written 15 years ago by the second developer and results in a bug reported to the third developer, who works for a different company in some distant country and doesn't even speak English. I would rather not be in that guy's shoes.
The common way to deal with the problems caused by cycles in the call graph is to introduce a new member variable ("executing") that is turned on when the object is already being used and turned off when it is not. By checking this variable the component can detect the case where there is a cycle on the call stack and handle it appropriately.
What follows is a simple example of such code. executing and i are member variables of class A. If the function is called in a cycle it does nothing; if there's no cycle it increments the variable i:
void A::foo ()
{
    if (executing)
        return;
    executing = true;
    ++i;
    executing = false;
}
This approach can help to get rid of a particular bug, but it's a hacky solution that may cause even more trouble in the future. Imagine that the code is modified like this:
void A::foo ()
{
    if (executing) {
        delete this;
        return;
    }
    executing = true;
    b->bar ();
    executing = false;
}

void B::bar ()
{
    a->foo ();
}
Can you spot the problem?
The program will fail trying to access an invalid memory location when doing executing = false. We can't really touch the member variable after the call, because the object can change while we are inside the b->bar() call, and the "change" can actually mean that it gets deallocated.
The only real solution here is to delay the callback: simply make a note that it has to be executed and execute it later on, when the call to A::foo() exits. And that, of course, is the first step towards a full-blown state machine approach.
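To make "making a note" concrete, here is a minimal sketch of the idea (the post() helper and the pending queue below are illustrative, not the actual nanomsg implementation): instead of B calling back into A directly on the stack, the call is enqueued, and the queue is drained only after the outermost call has exited.

#include <deque>
#include <functional>

static std::deque<std::function<void ()>> pending;   // callbacks noted for later execution
static bool dispatching = false;                      // are we already draining the queue?

void post (std::function<void ()> ev)
{
    pending.push_back (ev);
    if (dispatching)
        return;                  // someone above us is already draining; run it once the stack unwinds
    dispatching = true;
    while (!pending.empty ()) {
        std::function<void ()> next = pending.front ();
        pending.pop_front ();
        next ();                 // by now the call that made the note has already exited
    }
    dispatching = false;
}

With this in place, B::baz () would call post ([this] () {a->bar ();}) instead of invoking a->bar () directly, and the whole chain would itself be entered through post (), so by the time a->bar () runs, A::foo () has already exited.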
If you are interested in the topic, there's a lot of literature about state machines on the web, as well as a lot of tools that will help you with implementing them. I would also like to explore the matter further on this blog. Specifically, there are two questions I am interested in: First, why are developers willing to jump through hoops just to avoid using state machines? Second, are there any good rules of thumb (as opposed to actual software tools) that a developer should keep in mind when implementing a state machine?
Stay tuned.
Martin Sústrik, May 28th, 2013
Very nice summary of callbacks' fallbacks!
Thanks!
The reason why all those problems exist in the code you presented is that you are doing things after a callback.
So you ask, how do I deal with cases when I do need to do something after a callback?
Push it into a global LIFO callback queue! The LIFO queue is managed by the event loop, and the event loop gives the events in the LIFO queue maximum priority.
NOTE: I use some pseudo-callbacks in this code, real code would use std::function or similar mechanisms.
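Roughly, the pattern looks like this (a minimal sketch reusing the A and B classes from the article; the lifo_push()/run_lifo() helpers and the after_foo() step are illustrative names only):

#include <functional>
#include <vector>

std::vector<std::function<void ()>> lifo;   // owned by the event loop, drained with top priority

void lifo_push (std::function<void ()> ev) {lifo.push_back (ev);}

void run_lifo ()                            // called by the event loop before anything else
{
    while (!lifo.empty ()) {
        std::function<void ()> ev = lifo.back ();
        lifo.pop_back ();                   // LIFO: the most recently pushed event runs first
        ev ();
    }
}

void A::foo ()
{
    // instead of doing work after b->foo() returns on this stack frame,
    // note it in the LIFO first; it will run only after b->foo()'s callback
    // and everything that callback pushes, recursively
    lifo_push ([this] () {after_foo ();});
    b->foo ();                              // the callback; it may push further LIFO events
}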
Note that the LIFO nature of the queue ensures that the event we push before calling b->foo() is executed not only after the b->foo() callback itself, but after all the LIFO events that it pushes, recursively. The LIFO queue can be viewed as a kind of secondary stack for callbacks. You can even do recursive callbacks via the LIFO queue without overflowing the stack.
If you want to know more about this pattern, check out my SO answer here http://stackoverflow.com/questions/10064229/c-force-stack-unwinding-inside-function/10065950#10065950 . Also, please do note that I'm not just giving you theory here - I've used this pattern extensively in my software for years, with great results.
Note: the LIFO is best implemented as a doubly linked list. A singly linked list or an array won't suffice, because sometimes we want to de-queue already pushed events.
Yes. An event queue is the first step towards fully formalised state machines.
I think you're misunderstanding my design pattern as some kind of formalism you need to deal with on top of the problem you're solving. You don't need to prove any theorems or do anything else formal to use this pattern. The goal is event-driven code that's easier to write and maintain, not an extra burden on development.
You can check out some of my code to get a better idea. This piece of code is especially simple:
https://code.google.com/p/badvpn/source/browse/trunk/ncd/extra/BEventLock.c
It implements an asynchronous "lock". The BEventLock represents a resource, and the BEventLockJob represents an operation that requires exclusive access to a resource. BEventLockJob_Wait() is called to request access. When access is granted, a callback is invoked. Access can either be granted immediately, or can be delayed (put in a queue), depending on whether the resource is currently locked. In the former case, the invocation of the callback is done by pushing to the LIFO (BPending_Set()) directly from _Wait(); in the latter, the pushing is done from BEventLockJob_Release() when the resource is released.
See how, from the perspective of someone who wants to lock the resource, it doesn't look any different. There's no return code indicating whether access was granted immediately or whether a wait is necessary. This avoids a lot of hard-to-reproduce bugs (in code that would be handling the "need to wait" return code, since that case happens rarely).
And there's no formalism here, just nice and correct code ;) As far as formalism is concerned, the clear separation of event processing into individual LIFO events actually makes proofs about the behavior of code easier, whether they are formal or informal.
Out of interest, two questions:
1. Why LIFO? Traditionally FIFO is used for event queues.
2. I've never seen de-queueing used in cases like this. Is there any systemic reason for having it there?
I guess we are both speaking about the same thing. You call it LIFO, I call it event queue. The goal is to keep individual "steps" in processing fully atomic.
I agree that using the term "state machine" implies a bit more than an event queue, specifically an explicit state transition diagram, so forget what I said about state machines and substitute the term "event queue" instead.
A LIFO because it provides useful guarantees about the order of event processing.
So if we start with EventA pushed, the event queue will change like that:
You could claim that a LIFO is bad because you could get caught in an infinite loop of event processing and stop processing new events. That can certainly happen, but the same can happen if you "just do things after calling callbacks", and it is your responsibility to write code that doesn't get caught in such loops (such as counting invocations where there is potential for infinite loops). The advantage of using the LIFO is that such a loop will not crash your program via stack overflow.
So the LIFO is basically just a safe replacement for the usual "do something after calling a callback". Instead of doing it after the callback, you push it before the callback.
De-queuing is very useful when the callback does something that removes the need for the "do after callback" code to be called. The best use of it is when the callback destroys the object that is calling it, and the "destructor" of that object simply unqueues the callback (if it didn't, the program would crash when the queued call fired). And this is something you need to do quite often. For example, in a Client you have a Socket, and the Socket calls the Client to tell it that the connection broke, and the Client in turn destroys itself including the Socket (so the Socket is being destroyed while it's calling into the Client).
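To make that scenario concrete, a minimal sketch (the PendingQueue type and its push/dequeue operations are illustrative stand-ins for the real LIFO and its de-queue operation; the Client/Socket roles follow the example above):

#include <functional>
#include <list>

struct PendingQueue
{
    // a doubly linked list, so an already pushed event can be removed in O(1)
    std::list<std::function<void ()>> events;
    typedef std::list<std::function<void ()>>::iterator Handle;

    Handle push (std::function<void ()> ev) {events.push_front (ev); return events.begin ();}
    void dequeue (Handle h) {events.erase (h);}
};

struct Socket
{
    PendingQueue *queue;
    PendingQueue::Handle pending;
    bool has_pending;

    Socket (PendingQueue *q) : queue (q), has_pending (false) {}

    void connection_broke (std::function<void ()> notify_client)
    {
        // note the follow-up work first, then call the client back
        pending = queue->push ([this] () {has_pending = false; /* e.g. release buffers */});
        has_pending = true;
        notify_client ();        // the Client may 'delete' this Socket right here
    }

    ~Socket ()
    {
        // if we are destroyed from inside the callback, the queued event must not
        // run on freed memory, so remove it from the LIFO
        if (has_pending)
            queue->dequeue (pending);
    }
};

The important bit is the destructor: because the pending event can be removed before it fires, the Client is free to destroy the Socket from inside the callback.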
"The program will fail trying to access invalid memory location when doing executing=false"
Pretty much the problem I've mentioned above. In the delete, the destructor would de-queue any pending LIFO events, preventing their execution and avoiding a crash.
"To simply make a note that it has to be executed and execute it later on when call to A::foo() exits"
This is the exact reason a LIFO is useful. Pushing A before calling B makes sure that A executes after B is done. There's no need to complicate things with partial orders and a "full-blown state machine approach".
Another thing I noticed about your examples is that the objects (A and B for instance) are coupled to each other. This coupling is one factor that makes the system harder to maintain. You can use runtime dispatch (std::function or virtual methods or function pointers) or C++ templates to decouple them (unless they really have to be coupled).
I beg to differ. With straight functions you can at least check what's going to happen when you invoke it. On the other hand, virtual functions being "abstract" in a sense, you have no idea what the call graph looks like, whether invoking the function can possibly result in a cycle etc.
That's one way to look at it. But I usually design software by breaking problems down into smaller ones, resulting in a "dependency graph" that is a tree. After you define what the behavior of a component needs to be, you can implement and verify it independently of components "above" it. If your dependency graph has cycles, it's harder to look at components in isolation.
Oh, and there's an effect directly relevant to the callback problem. If you have coupling all around your codebase, then the question "will calling this callback call me back" can indeed be a very hard one. But what if you fixed your coupling? Well, if you are the owner of the set of objects {A,B,C}, and you call into A, then you *know* only the callbacks you gave to A could be invoked in response. If B and C have nothing to do with A, they couldn't possibly call you back in response to you calling into A. Not so much if A, B and C were coupled deeply in your circular dependency graph.
"Now imagine that at some point in the future some random developer adds a call from C to D".
In a properly decoupled program you would first have to restructure the code a bit, because C has no way to access and doesn't know about D. The restructuring would very likely reveal the new circular calls.
Is there any specific reason why you think C won't know about D?
In your first call graph C doesn't ever communicate with D. The only way C communicates with other components is by being called by the component above it. So in a properly designed system, C would only be aware of this component above, and maybe not even that, if it doesn't need to call back to it.
Here we come to your previous problem about "Contexts". If you turn objects (like D) into global state, that makes it much easier for C to get access to D, possibly going against the design. On the other hand if you always require a pointer to D in order to access it, the problem is more obvious, because C wouldn't have a pointer to D - it didn't need it, before someone had the idea to make it talk to D.
Right. Avoiding global state would help to some extent.
Still, there is another problem: any network stack is entered from two distinct directions, from above (the user API, such as BSD sockets) and from below (network interrupts). In the first case the call sequence is, say, TCP->IP->Ethernet. In the latter case it is Ethernet->IP->TCP. That in turn means that there's no way to create a simple one-directional hierarchy of references. The upper layer has to have a pointer to the lower layer and vice versa.
So what? In my (C) code, the upper layer owns the lower layer (so it calls it directly), and the lower layer has callbacks (function pointers) to the upper layer. It looks something like this:
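(A minimal sketch of the wiring; the Socket, StartRecv and recv_done_handler names follow the description below, the rest is invented for illustration.)

// lower layer: knows nothing about its owner except the callback it was given
struct Socket
{
    void (*recv_done_handler) (void *user, int len);   // callback into the upper layer
    void *user;                                        // opaque pointer back to the owner

    void Init (void (*handler) (void *user, int len), void *user_arg)
    {
        recv_done_handler = handler;
        user = user_arg;
    }

    void StartRecv (char *buf, int size)
    {
        // start the asynchronous receive; when it completes, the socket is
        // allowed to call recv_done_handler exactly once
        (void) buf; (void) size;
    }
};

// upper layer: owns the Socket and calls it directly
struct Client
{
    Socket sock;
    char buffer [4096];

    void Start ()
    {
        sock.Init (&Client::RecvDone, this);
        sock.StartRecv (buffer, sizeof (buffer));
    }

    static void RecvDone (void *user, int len)
    {
        Client *self = (Client*) user;
        // handle the received data, then ask for more
        (void) len;
        self->sock.StartRecv (self->buffer, sizeof (self->buffer));
    }
};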
The lower layer can't just do anything it wants - the only way it can communicate with the upper layer is via defined callbacks, and of course the invocation of the callbacks is subject to contracts. Here, the Socket is only allowed to call the recv_done_handler once after a StartRecv.
All my C code looks like that, and it's very maintainable.
P.S. When I said the upper layer calls the lower layer directly, that's not always the case. Sometimes thin interface classes using function pointers may be involved, such as a Read/Done interface as used here. Because sometimes you don't want to couple your code to a Socket, you want it to work with anything that can receive data.
The problem here is not the direction of function calls, there sure can be cycles in the call graph. What matters more from the design perspective is the dependency graph. My Client class knows about Socket, but the Socket doesn't really know anything about the Client other than the defined callbacks it has provided.
Yes. That's the correct and maintainable approach. But I wouldn't call it a callback anymore. It's an event queue.
Depending on the complexity of problem, you may decide to stick with simple event queues, or move further towards fully formalised state machines.
Btw, I don't know much about your solution, but it may be that using LIFO instead of FIFO and dequeuing events is in your case an alternative way to deal with the problems normally addressed by state machines.
Hi Martin !
At the very beginning, you wrote :
void A::foo ()
{
    int tmp = i;
    b->foo ();
    assert (tmp == i);
}
Shouldn't it be :
void A::foo ()
{
    int tmp = i;
    b->baz ();
    assert (tmp == i);
}
?
Oops! Fixed. Thanks for spotting it!
I'm not convinced. You have started with a conclusion (callbacks are evil) and then you have shown some very contrived and unrealistic examples to prove that conclusion.
In a more reasonable scenario there are no callbacks between peer components at the same level of the software stack, but they will rather exist across layers, which makes them easier to manage.
Also, circularity can be approached in other ways than detection - you could anticipate it and execute the callback in the state where subsequent actions on the same component would be perfectly OK and even not distinguishable from first-level calls. This approach was taken in the YAMI4 messaging library and guess what? No bugs in this area since the beginning of the project. So - it can be done properly with proper design in place. There's no hell there.
That's pretty much my opinion when I mentioned that he should decouple the components. You should read my comment about the LIFO queue, it makes callbacks "not distinguishable from first-level calls" easier to implement.
The examples are short to make the post readable.
In the real world the problem happens when a single component is "steered" from multiple directions. A typical example is a network stack. There are calls originating in the user-facing API and there are calls originating from interrupts raised by the networking hardware. The former traverse the layers from top to bottom, the latter from bottom to top. Unless you keep the two call graphs completely separate, it's very easy to form cycles in such an environment.
And yes, the solution you are proposing is an event queue delaying the execution of actions till it's safe to run them. Event queues are a principal component of state machine implementations.
If you have to search for evidence like "ZeroMQ doesn't make it easy to add transports" as proof that callbacks are bad, your argument is broken. The difficulty in adding transports to libzmq stems from poor internal design from the start, resulting in lack of internal abstractions. This was the design you built in there from the very start. Callbacks may be involved but are not the cause. The cause was rather more profound, the notion that a tiny team could accurately predict the needs of a widely-used product.
It is most certainly not due to increasing technical debt. The libzmq engine is getting cleaner and more extensible over time, not worse. For example it now supports multiple protocol versions very nicely.
I agree that callbacks are a pain but you should not use broken argumentation, it doesn't help your overall thesis. Also if you step aside from the "ZeroMQ is crap and Nano is fantastic" theme that seems to drive your thinking, and consider where the design problems in ZeroMQ actually came from (your vision of engineering, largely), you might realize how to save Nano from being an interesting experiment that finally, no-one really uses. I'm surely not the only person who wants Nano to succeed.
What's wrong with the argument? Callbacks cause local changes to have global repercussions. Which is a maintenance nightmare. In the end they are pretty similar to using gotos. I guess Dijkstra explained the problem better than I did.
BTW, your experience with OpenAMQ would be valuable in this context. What are the problems of state-machine-based approach? Can they be solved by introducing callbacks? Etc.
Your argument was (unless I misunderstood) that callbacks caused technical debt, i.e. a build up of bad code, in libzmq that made it impossible to extend six years later. Whereas in fact adding transports was hard from day one, due to the lack of internal abstractions (i.e. internal APIs designed to be extended).
Callbacks can be entirely local, this is how CZMQ/zreactor works (passing object instances around). The resulting code is easier to maintain in some ways but harder to understand because it's fractured.
State machines can provide extreme leverage, but are poorly implemented by most people. And even when well implemented they create a barrier to entry that is the real problem. You can win on engineering but lose on participation. OpenAMQ was a prime case of this.
If you look at how we used e.g. tools like Libero in the past you will see very clean, ultra-maintainable code, all callback based. I've used the same style in FileMQ, partly to show how to use state machines in protocol engines. But it's only maintainable once you've learned the model, and that's a barrier.
Perhaps it's safe to say that callbacks in general lead to arbitrary one-off abstractions that are very hard to learn. It was the same problem with GOTOs. Nothing to do with local vs. global. GOTOs let you create arbitrary structures that follow no patterns at all. Replacing them with WHILE, IF, DO meant we could learn a small fixed set of patterns instead.
The callbacks were there from day one. Interestingly, I opted for callbacks because of the barrier to entry we experienced with OpenAMQ. However, it seems that avoiding state machines altogether is not a good option either. There must be some kind of middle ground. What I was thinking of was, instead of using a special language for state machines (Ragel, Libero, mbeddr, etc.), just explicitly documenting a small set of rules that the state machines have to adhere to. Anyway, it's hard to say in advance what kind of compromise would work best.
Perhaps part of the barrier to entry with OpenAMQ was the XML/C DSL and code generation it uses, not sure at what version this started, but it looks like it's always used a DSL+CodeGenerator, possibly Libero.
Are you considering a code generator for nanomsg or a spec/api for the state machine?
I've already written it by hand — i.e. no code generation.
Although it means a lot of boilerplate code in the codebase (yuck!), on the other hand it allows any developer to look directly at the source code and understand what's going on without having to learn a new language.
I took a look at the state machine in cipc.c on the aio2 branch; it's very clean and the transitions are easy to understand. I prefer the boilerplate to code-gen or macros any day.
One thing I did find a little confusing was the nested "switch (type)"; it looks like type is the state of another state machine, as in "switch (sock->state)".
There is also the possibility of using a state transition table and/or function lookup tables to remove some of the boilerplate. With this approach your default: case handles all the common transitions (make a call and set the next state) and the more complex cases get their own case statement, "case NN_COMPLICATED_STATE:". If you can make the lookup arrays easy to read and maintain this could be a win. You could even list the boilerplate states as cases and let them all fall through to the common lookup->call->set case, while still keeping the default case for trapping invalid states. Lookup tables can also eliminate a lot of branching.
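For illustration, a tiny sketch of the table-driven idea (the state, event and handler names are made up for the example, not nanomsg identifiers):

enum State {STATE_IDLE, STATE_ACTIVE, STATE_DONE, STATE_COUNT};
enum Event {EVENT_START, EVENT_STOP, EVENT_COUNT};

struct Machine {State state;};

typedef State (*Handler) (Machine *self);   // each cell does the work and returns the next state

static State on_start (Machine *self) {(void) self; return STATE_ACTIVE;}
static State on_stop (Machine *self) {(void) self; return STATE_DONE;}
static State on_ignore (Machine *self) {return self->state;}   // or assert on an invalid event

// the boilerplate lives in one table indexed by [current state][event];
// only genuinely complicated transitions need their own hand-written case
static const Handler transitions [STATE_COUNT][EVENT_COUNT] = {
    /* IDLE   */ {on_start,  on_ignore},
    /* ACTIVE */ {on_ignore, on_stop},
    /* DONE   */ {on_ignore, on_ignore},
};

static void feed (Machine *self, Event ev)
{
    self->state = transitions [self->state][ev] (self);
}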
In even more complex state transitions a stack can be used to pass information between nested states; not sure that applies here, but it's useful in lexing/parsing.
Another nice advantage to the state machine approach is debugging, just log the states or even keep the last n states in a ring buffer.
I don't see any barrier to entry here; all the states and transitions are clearly defined in a hundred lines of very readable code. If you need any clarification, email or reply. I have built quite a few DFSMs.
Several good points here. Let me comment on them one by one.
I don't feel good about the name 'type' myself. It's not another state. It's the type of the event being processed. A single source object can emit different kinds of events, say a socket can emit SENT and RECEIVED events. The 'type' argument is used to distinguish between the two. Suggestions for a better-fitting name are welcome.
As for state transition tables, I have deliberately not used them. The rationale is the same as the rationale for not using code generation: state transition tables make the code very hard to follow. Instead, I've opted for a single transition function with three nested levels of switches (in this order): the state, the source of the event, and the type of the event. That makes it relatively easy to find your way through the state machine by simply scrolling the source code up and down.
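For illustration, the overall shape of such a handler (the state, source and event names below are invented for the example; they are not the actual nanomsg identifiers):

#include <assert.h>

enum {STATE_CONNECTING, STATE_ACTIVE};     /* states of this machine */
enum {CONNECTED, ERROR_OCCURRED};          /* event types */

struct usock {int unused;};

struct machine
{
    int state;
    struct usock sock;                     /* one of the event sources this machine owns */
};

static void handler (struct machine *self, void *source, int type)
{
    switch (self->state) {                 /* first level: the current state */
    case STATE_CONNECTING:
        if (source == &self->sock) {       /* second level: the source of the event */
            switch (type) {                /* third level: the type of the event */
            case CONNECTED:
                self->state = STATE_ACTIVE;
                return;
            case ERROR_OCCURRED:
                /* handle the failure, move to some error state ... */
                return;
            }
        }
        break;
    case STATE_ACTIVE:
        /* ... events arriving while ACTIVE are handled here ... */
        return;
    }
    assert (0);                            /* unexpected state/source/event combination */
}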
As for nested state machines, I actually use them. For example, when the cipc state machine is in the ACTIVE state it hands execution over to the sipc sub-state-machine. When the sipc state machine terminates it hands execution back to the cipc state machine. The same trick can be applied recursively, allowing arbitrary nesting depth.
The debug log is a pretty neat idea. It would be worth implementing, I guess.
Great to hear you don't see any barrier to entry. That's what I was trying to achieve. Typically, when code generation or macros are used, people feel there's a barrier to overcome.
A source (usock) initiates an event (NN_USOCK_CONNECTED) on the target (cipc). The target state machine looks at the current state, looks at the source of the event and then acts on a valid event. target->state->source->event
If this is correct you might consider a convention such as NN_USOCK_EVENT_CONNECTED, and it might even be useful to distinguish between an action and an event, where an action ACTION_CONNECT is initiated and EVENT_CONNECTED moves the machine into STATE_CONNECTED; action and event would still reside at the same level in the state machine.
Regarding the nested state machines:
I could not find the hand-off in cipc.c's NN_CIPC_STATE_ACTIVE; can you point me to the source?
Good suggestion about distinguishing the events and the actions. Currently, events have no specific prefix and actions, confusingly, have the prefix _EVENT_. Let me fix that.
As for handing off control from cipc to sipc, it's done here:
nn_sipc_start (&cipc->sipc, &cipc->usock);
What happens is that the sipc object takes ownership of the usock object and redirects any events from it to itself. When the connection breaks, it hands ownership of the usock back to the cipc object.
One unrelated thought: I am considering adding one parameter to the handler function. Currently it is (target, source_ptr, event_type). I am thinking of extending it to (target, source_type, source_ptr, event_type). The problem it solves is when the state machine owns an unlimited number of source objects, for example a list of sockets. In such a case, checking that source_ptr is one of the objects in the list would require list traversal (an O(n) operation). Any thoughts on that?
Martin, why does the concept of an FSM exist in your code at all? For example, in cipc.c, you do nn_fsm_init_root and pass nn_cipc_handler to that, and by passing that to other inits you set up nn_cipc_handler to handle events from three different sources. Why not just have three callback functions? What benefit does this indirection provide?
Even more, I would use different handlers for different event types, instead of using just one handler per event source. This way you can add event-specific arguments to the handler.
Regarding adding source_type: it's hard to say without knowing more about the number of source_types and whether they could be organized into groups based on behavior. Here are some thoughts…
Reserve a few bytes at the beginning of the source object; the owner could then attach the type once, before the state machine executes. This could also be used for other data, such as the owner setting a flag on the source.
Create a state machine for a single type in the O(n) cases and have the source call into the handler for that machine, for example nn_cipc_handler_spic(…). If possible it would be nice to eliminate the need for "if (source == &cipc->sipc)" without over-complicating the state machine.
If source_type cannot be avoided and the owner cannot write to the source object even once during the wiring/initialization phase, consider some type of context object that the owner hands to the source and that the source must pass with each handler call; that context object could contain the source_type, source_ptr, etc.
Reserving a few bytes in the source seems the most appealing. You could have an issue where a source object might have more than one owner; then you would have to consider how many owners a source object could have, but I could still see this being workable.
If you can explain why some of these options won't work it might provide some more insight.
@Ambroz: The reason is to keep the hierarchy of the handler this way: state=>source=>event_type.
If you have different handler functions for different events the code will be structured (presumably) like this: event_type=>source=>state. That kind of code is extremely hard to follow.
@mike: As for nn_cipc_handler_spic(…), I don't like it for the same reason stated above: having multiple handlers makes the code hard to follow (i.e. code belonging to a single state is suddenly scattered among different places in the source file).
"Reserving the few bytes" thing seems more resonable. I thought of just reserving a single integer. You would initialise it when intialising the source and get it back once the source fires an event.
For example:
nn_timer_init (&self->retry_timer, NN_REQ_RETRY_TIMER);
And then, in the handler:
if (source_type == NN_REQ_RETRY_TIMER && event_type == NN_TIMER_TIMEOUT) {
    timer = (struct nn_timer*) source;
    ….
}
Four bytes seems reasonable due to alignment. You could even use just one byte if, for example, there would never be more than 255 types, or two bytes, etc., and still have a couple of free bytes for edge cases or anything else.
I'm not familiar enough with the code base and how initialization works yet, so put another way, the requirement could be that the first four bytes of any object participating in the state machine belong to the owner so I think we are on the same page.
Can a source object ever have more than one owner?
Another option is to require all state machine objects to use a special allocator that over allocates allowing room for source_type and returns an offset pointer.
Sure. But there's already a base class for all state machines (nn_fsm). Storing the integer there seems to be a cleaner solution.
No. Not allowed.
It's actually a pretty crucial requirement. The idea is to have objects arranged in a tree (thus one owner for each one). Then we have two ways of communication: the owner calls the objects it owns directly (calls go down the tree), and owned objects send events up to their owner.
That makes interactions between the components relatively easy to grasp. With multiple owners (i.e. a graph instead of a tree) it would be much harder.
That makes sense, thanks for the clarification, looking forward to the state machine alpha.
I was looking over some of the new state machine code and wanted to make a clarification on the naming. Keep in mind I'm not a network programmer so these might not be the best examples.
STATE:
The internal state of the machine
NN_USOCK_STATE_CONNECTED
ACTION:
A command placing the machine in a new state and possibly raising an event
NN_USOCK_ACTION_CONNECT
EVENT:
Raised when the machine enters a new state
NN_USOCK_EVENT_CONNECTED
The distinction between ACTION and EVENT may or may not be necessary. I don't yet know enough about the internals, and you might just be using STATE rather than having an EVENT fire; an event, I believe, is your up-stream function call.
I hope this makes sense.
Yes.
The only thing that's different is that EVENTs have no EVENT prefix, i.e. NN_USOCK_CONNECTED rather than NN_USOCK_EVENT_CONNECTED. The reason is that outgoing events are the only entities visible to the user of the object — states and actions are private to the state machine. Thus, as far as the state machine API goes, an EVENT prefix would serve no meaningful purpose and would just make the identifiers longer.
(This comment talks about callbacks among separate processes)
I can totally see the point of callback hell. Just last week I made an audit of our complete codebase to search for cycles. In my case it was a combination of HTTP calls and ZeroMQ REQ/RESP. Found one. It can come back any time. And I still have it easy, since I have complete control over the codebase. Imagine you have to prepare for situations where you don't have control over parts of the system, which I suppose is the case for nanomsg.
Anyway if I had a list of communication protocols (state machines) that is explicit and testable I would feel much better. Especially the internal HTTP calls are dangerous since I don't control the URL. Stable connections from A to B are audit-able. HTTP calls are not.
Agreed. State machines would alleviate the problem. However, it's hard to enforce consistent usage of state machines, especially if the development team is large and distributed. There are no widely used software tools to enforce the rules (except maybe for Erlang) and, after all, it takes just one person to screw it up.
So, Martin, can you provide a reference or example of good (and/or bad) state machine implementations?
I thought I knew what I was doing with them. Now I'm not so sure.
I don't really see a problem with state machines per se. Consistent usage of rules and control of source code are problems with software development in general.
Nice one Ambroz, really enjoyed reading your comments!
Callbacks and after-effects are a common problem, and their separation is essential for any reactive system.
And that was one of the clearest explanations of the LIFO in this case.
Thanks, sla!
Have you seen any other project using a LIFO event queue like that? For all I know, I'm the first one to invent that design pattern ;)
Funny how this stuff goes around and around to be reinvented over and over. 20 years ago, I implemented a TCP stack and rapidly came to the conclusion it needed to be an FSM if it was going to be maintainable and extensible.
I was fascinated by the discussion of events versus 'safe' callback handling. The former puts event handlers in a (single) queue; the latter puts the event handlers in two distinct queues: the LIFO and the stack. They are both fundamentally the same paradigm; they differ only in how the queue(s) are managed.
I would argue that the 'safe' callback method is actually more dangerous for two reasons:
a) it splits event handlers between the LIFO and the stack, and
b) it depends on a high level of dev discipline to keep track.
Point b) violates a fundamental rule: despite all best intentions, everyone is fallible.
I have not reviewed the code, but you will find that the 'cleaner' (i.e. more cleanly abstracted) your event paradigm, the easier it will be for devs to use the pattern, and thus the less likely they will be to spend energy implementing something else. 'Clean' abstraction means easy setup, simple APIs, transparent architecture, event queue management (typically at least 3 levels) and queue visibility (debugging and system handling).
I have since taken the (event) paradigm to the ultimate conclusion of eliminating the need for threads (RTOS) altogether in many systems.
fwiw…
+1 for not using threads at all.
I wonder how the stuff even got so widespread in the first place. We had a perfectly viable model (processes & pipes) even before threads, so there was no obvious reason… except maybe that, when you are coding a boring CRUD application, using threads generates a lot of fun. Not even speaking of job security :)