//.. join all 4 threads before moving to the next batch.
But like Hogman mentioned, I am still spawning 64 threads in total even with this approach, so I'll create some kind of thread pool of 4 permanent threads instead.
Side question though: assuming I get this working properly with the expected FPS boost, would I get better performance with OpenMP (so the code is still executed on the CPU), or should I jump directly to an OpenCL implementation?
Thanks for your help!
Take 4 threads (from a pool, don't create new ones every frame / update) and split the work between those 4 threads.
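To make that concrete, here is a minimal sketch of what such a pool could look like in standard C++11. The names `WorkerPool` and `parallel_for` are made up for illustration and are not from the code posted in this thread. The sketch assumes a single caller (for example the main/update thread) and that the work can be described as an index range to be cut into contiguous chunks.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical fixed-size pool: threads are created once and reused for every
// batch, instead of being spawned and joined each frame / update.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t num_workers) {
        assert(num_workers > 0);
        for (std::size_t i = 0; i < num_workers; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }

    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        work_ready_.notify_all();
        for (auto& t : workers_) t.join();
    }

    WorkerPool(const WorkerPool&) = delete;
    WorkerPool& operator=(const WorkerPool&) = delete;

    // Splits [0, count) into at most one contiguous chunk per worker and
    // blocks until every chunk has been processed.
    // Assumption: body(begin, end) does not throw and is called by one thread at a time.
    void parallel_for(std::size_t count,
                      const std::function<void(std::size_t, std::size_t)>& body) {
        const std::size_t n = workers_.size();
        const std::size_t chunk = (count + n - 1) / n;  // ceil(count / n)
        std::unique_lock<std::mutex> lock(mutex_);
        for (std::size_t begin = 0; begin < count; begin += chunk) {
            const std::size_t end = std::min(begin + chunk, count);
            tasks_.push([&body, begin, end] { body(begin, end); });
            ++pending_;
        }
        work_ready_.notify_all();
        all_done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                work_ready_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run the chunk outside the lock
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            all_done_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable work_ready_;
    std::condition_variable all_done_;
    std::queue<std::function<void()>> tasks_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

You would create one `WorkerPool pool(4);` at startup and then, each update, call something like `pool.parallel_for(items.size(), [&](std::size_t begin, std::size_t end) { /* process items[begin..end) */ });`. The call returns only once all chunks are done, which plays the same role as joining the 4 threads before moving on to the next batch, just without the thread creation cost.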