I am not sure how this approach fits in with your design, but it may spark some new ideas.
Doesn't that architecture only allow for two primary threads?
In the example, there are two update loops running in different threads: a game loop and a physics loop. I don't see why you couldn't have more.
Why not just implement a simple job graph at that point? They're not particularly difficult to write.
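For illustration, here is a minimal sketch of what a simple job graph could look like. It is not taken from any particular engine: `run_job_graph`, the job names, and the dependency layout are all hypothetical, and it leans on Python's standard thread pool rather than a hand-written scheduler.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_job_graph(jobs, deps, max_workers=4):
    """Run every job once, starting each only after its dependencies finish.

    jobs: dict mapping a job name to a zero-argument callable (hypothetical names).
    deps: dict mapping a job name to the set of job names it must wait for.
    Returns the job names in the order they completed.
    """
    # Unfinished dependencies per job, and the reverse edges (who waits on whom).
    remaining = {name: set(deps.get(name, ())) for name in jobs}
    dependents = {name: [] for name in jobs}
    for name, needed in remaining.items():
        for dep in needed:
            dependents[dep].append(name)

    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Start with the jobs that have no dependencies at all.
        running = {pool.submit(jobs[n]): n
                   for n, needed in remaining.items() if not needed}
        while running:
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the job
                finished.append(name)
                # Release every job whose last dependency just completed.
                for child in dependents[name]:
                    remaining[child].discard(name)
                    if not remaining[child]:
                        running[pool.submit(jobs[child])] = child

    if len(finished) != len(jobs):
        raise ValueError("dependency cycle: some jobs could never start")
    return finished


if __name__ == "__main__":
    # Assumed frame layout: physics and animation both need input, render needs both.
    jobs = {n: (lambda n=n: print("running", n))
            for n in ("input", "physics", "animation", "render")}
    deps = {"physics": {"input"}, "animation": {"input"},
            "render": {"physics", "animation"}}
    print(run_job_graph(jobs, deps))
```

With this layout, physics and animation are free to run in parallel once input is done, and nothing is tied to a fixed number of dedicated threads.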
In that particular architecture, every additional thread that communicates this way adds another stage to the pipeline, so the more threads you have, the more latency you end up with.
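A toy model of that latency growth, under some loudly simplified assumptions: all stages tick in lockstep, and each stage only sees what its upstream neighbour published on the previous tick. The function name `simulate_pipeline` and the model itself are illustrative, not a description of the actual architecture.

```python
def simulate_pipeline(num_stages, num_ticks):
    """Return, per tick, how many ticks old the data leaving the last stage is.

    Assumed model: stage 0 stamps fresh input with the current tick number;
    every later stage reads what the stage before it published last tick.
    Ticks where the last stage has no data yet are skipped.
    """
    published = [None] * num_stages  # what each stage published last tick
    delays = []
    for tick in range(num_ticks):
        current = [None] * num_stages
        current[0] = tick  # first stage consumes fresh input
        for stage in range(1, num_stages):
            # One-tick-stale handoff from the upstream stage.
            current[stage] = published[stage - 1]
        published = current
        if published[-1] is not None:
            delays.append(tick - published[-1])
    return delays


if __name__ == "__main__":
    for stages in (1, 2, 4):
        print(stages, "stage(s):", simulate_pipeline(stages, 6))
```

In this model the steady-state delay comes out to the number of stages minus one, so the two-loop setup (game plus physics) is one tick behind, and each extra thread chained on the same way adds another tick.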