@MJP: I tried some CopySubresourceRegion calls on dummy buffers between dispatches, but the driver is too smart for that. Inserting artificial synchronization points might work, but I won't do it until every other option fails. It seems like a nightmare to keep the code understandable after that.
@ATEFred: Does it happen only when the GPU must sync between them, or does it always happen? It would be strange if the GPU stalled on each dispatch with idle shading units.