Weirdly enough, Intel's GPA Monitor is showing me that CPU load is about 30%... while GPU Busy % is around 93%, but with Execution Units active only 50% of the time...
So, if it's not being stalled by the CPU, then something in the GPU must be stalling all the EUs half of the time! :S
The difference is merely conceptual... the only change is that your application itself would be the "root", instead of a well-defined struct/class..
Programmatically there would be no difference, but it's cleaner and more intuitive for other programmers to have a root node from which everything "spawns"...
Most of the time you will have to decide between two options that both work OK (with no performance gain/loss between them), so the question becomes which one gives a better experience to the people who might be using your code.. ;)
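To make the root-node idea a bit more concrete, here is a rough C++ sketch of the two options. All the names (`SceneNode`, `addChild`, `countNodes`, `buildSceneWithRoot`, `Application`) are made up for illustration, not taken from any particular engine, so treat it as one possible layout rather than the way to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type; every name here is illustrative only.
struct SceneNode {
    std::string name;
    std::vector<std::unique_ptr<SceneNode>> children;

    explicit SceneNode(std::string n) : name(std::move(n)) {}

    // Takes ownership of the child and returns a non-owning pointer to it.
    SceneNode* addChild(std::unique_ptr<SceneNode> child) {
        assert(child != nullptr);
        children.push_back(std::move(child));
        return children.back().get();
    }
};

// Counts a node plus all of its descendants.
std::size_t countNodes(const SceneNode& node) {
    std::size_t total = 1;
    for (const auto& child : node.children) {
        total += countNodes(*child);
    }
    return total;
}

// Option A: an explicit root node from which everything "spawns".
std::unique_ptr<SceneNode> buildSceneWithRoot() {
    auto root = std::make_unique<SceneNode>("root");
    SceneNode* world = root->addChild(std::make_unique<SceneNode>("world"));
    world->addChild(std::make_unique<SceneNode>("player"));
    root->addChild(std::make_unique<SceneNode>("ui"));
    return root;
}

// Option B: no root node; the application itself plays the role of the
// "root" by holding the top-level nodes directly.
struct Application {
    std::vector<std::unique_ptr<SceneNode>> topLevelNodes;
};
```

Both versions hold the same data. With option A, anything that walks the tree (like `countNodes` here) only needs one entry point, while with option B the caller has to know to loop over `topLevelNodes` first, which is the "less intuitive for other programmers" part.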