My view is that the Intel Smoke framework is somewhat over-generic (a universal scene extended with system scenes, plus change management), and some of its concepts are dubious, such as conflict resolution when several systems change the same property, e.g. an object's transform (IMO that should not happen in the first place). It doesn't really address the ordering of tasks within a frame, which doesn't hurt in the demo, but in an actual game it could lead to nasty non-determinism if, for example, AI sometimes runs one frame ahead of physics and sometimes one frame behind. I'd consider it potentially dangerous if used directly as a basis for learning or for your own engine.
I'd recommend, at least in the beginning, being very explicit (even hardcoded) about which systems or tasks actually run simultaneously, and what changes they propagate to each other. For example, physics updates object transforms, which the renderer then picks up. Then profile constantly to discover the bottlenecks and to be able to decide whether threading will actually benefit them.
Some more resources (in case you're not familiar with them yet):
I tend to agree with AgentC that Smoke is over-generic and also has some really bad bits which should not exist. Having said that, though, let me rephrase it as my preferred question: does the complication outweigh the benefit? Not everyone is a threading guru, nor in reality should they be. A person writing gameplay code should not have to worry about threading beyond a very few rules; if they do have to worry about threading a lot, it is an architecture failure in terms of the Knuth saying "premature optimization is the root of all evil", and threading is most definitely an optimization. A threading system doesn't have to execute at 99+% of Amdahl's-law efficiency to be a benefit; 90% is good enough for most things, as it scales to 8-10 cores before diminishing returns prevent further benefit. Better than that, most of the performance loss is pre/post-frame organization; the internal per-frame work, when done well but non-intrusively, can average closer to 97-98%, which scales beyond 10 cores.
I am a fan of staged execution myself. It is similar in concept to the discussions of threading in the entity frameworks folks are talking about. You break the execution up into several pieces (components) using a set of simple rules that even a junior programmer can follow. Take for example a flocking system; I won't detail the algo, just the breakdown and how the rules apply:
// Read positions of flock.
// Calculate center point, avoidance etc.
// Generate new velocity.
mVelocity = CalculateFlocking();
mPosition += mVelocity * Time::DeltaTime();
The above won't work in the way I do threading because it breaks the prime rule of my system: in a single stage of update you cannot both read from and write to the same variable. The same applies the other way around: you can't write a variable and then use it for further calculation in the same stage. In the case of flocking, the update function breaks the rule because the calculation reads multiple objects' positions and then writes to this object's position. The reason is simple: without inserting a mutex (against another rule) to protect the position member, the assignment is non-atomic, and you could get partially updated (potentially invalid) vectors used by other simultaneous executions of this function. Additionally, there is no consistency as to whether reading a position gets you the old or the new version, which can throw off a number of calculations.
Fixing the update function is simple: break it into two pieces (note: velocity is also read by the flocking calculation, so I fix that as well):
Now, iterating Update1 and Update2 with multiple threads is completely safe without any per-object synchronization; you just have to make sure you have iterated all Update1's completely before starting on Update2's. Performance-wise, even with the temporaries, this runs exceptionally fast, with only a single synchronization point between Update1 and Update2. I posted an example video at one time which shows flocking implemented in this manner: it is not fancy (and takes a second up front to stabilize), but 2000 objects updating multi-threaded in the above manner is not exactly trivial. The more important number, which I didn't show, is a near-complete lack of locking/kernel time taking place except between the various update calls, of which there were 12 stages in that example.
Finally, to keep with the basic idea of staying out of people's way, you can write the first example during initial development and simply mark the "stage" as single-threaded. Once you get things working, you can turn on a debug helper which flags any same-stage data accesses and start the process of decomposing the update into different pieces.
With only 3 rules and better than 95% Amdahl's-law efficiency, I've never seen any real reason to worry about this solution much further. It stays out of the way of getting things done, provides plenty of performance benefit, and is very simple to implement. I'll take the minor losses for the gain of simplicity in this case; not sure how you will feel, of course, but it is worth keeping in mind.