On my GTX 970 I have 16 queues with full support, and 1 transfer-only queue. I read somewhere that AMD has only 1 queue with full support, 1-4 compute-only queues, and 2 transfer queues, and that Intel GPUs only have 1 queue for everything.
AMD has 1 graphics/compute queue and 3 compute-only queues + 2 transfer queues - you can look up this stuff for various GPUs here: http://vulkan.gpuinfo.org/displayreport.php?id=700#queuefamilies
Early drivers supported 8 compute queues, so this may change.
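If you want to check what your own GPU exposes rather than relying on that database, the query is cheap. A minimal sketch, assuming you already picked a VkPhysicalDevice during instance setup:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

// Print every queue family with its queue count and capability flags.
void printQueueFamilies(VkPhysicalDevice gpu)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);

    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

    for (uint32_t i = 0; i < count; ++i)
    {
        const VkQueueFamilyProperties& f = families[i];
        std::printf("family %u: %u queue(s)%s%s%s\n", i, f.queueCount,
                    (f.queueFlags & VK_QUEUE_GRAPHICS_BIT) ? " graphics" : "",
                    (f.queueFlags & VK_QUEUE_COMPUTE_BIT)  ? " compute"  : "",
                    (f.queueFlags & VK_QUEUE_TRANSFER_BIT) ? " transfer" : "");
    }
}
```

On the GPUs discussed above, this is where the 16+1 and 1+3+2 splits come from.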
A big number like 16 does not mean there is native hardware support for, e.g., processing 16 different compute tasks simultaneously.
AFAIK AMD's async compute is still better (GCN can execute work from different dispatches on one CU, or at least on different CUs; Pascal can only do preemption? Not sure about that).
I did some experiments with async compute using multiple queues on AMD, but it was just a small loss. Probably because my shaders were totally saturating the GPU, so there was no point in doing them async.
But I noticed that their execution start and end timestamps overlap, so at least it works.
I will repeat this with less demanding shaders later - I guess it will become a win then.
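For anyone wanting to reproduce the measurement: a rough sketch of taking per-dispatch timestamps with a query pool (`device`, `cmd` and the dispatch size `groupsX` are assumed from the surrounding setup; the second command buffer would use queries 2 and 3):

```cpp
// Query pool with begin/end slots for two command buffers.
VkQueryPoolCreateInfo qpInfo = {};
qpInfo.sType      = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
qpInfo.queryType  = VK_QUERY_TYPE_TIMESTAMP;
qpInfo.queryCount = 4;

VkQueryPool queryPool;
vkCreateQueryPool(device, &qpInfo, nullptr, &queryPool);

// Inside the first command buffer (pipeline + descriptors assumed bound):
vkCmdResetQueryPool(cmd, queryPool, 0, 2);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
vkCmdDispatch(cmd, groupsX, 1, 1);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);

// After both queues have finished, pull all four values back:
uint64_t ticks[4];
vkGetQueryPoolResults(device, queryPool, 0, 4, sizeof(ticks), ticks,
                      sizeof(uint64_t),
                      VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);
// Multiply by VkPhysicalDeviceLimits::timestampPeriod for nanoseconds; if
// [ticks[0], ticks[1]] and [ticks[2], ticks[3]] overlap, the two dispatches
// really did run concurrently.
```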
What I hear everywhere is that we should do compute work while rendering shadow maps or early Z - at least that's one example of where we need to use multiple queues (besides the obvious data transfer).
Personally I don't think using one queue per render thread makes a lot of sense. After multiple command buffers have been created, we can just use one thread to submit them.
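On the submit side that's as simple as batching all the recorded buffers into one call (a sketch; the buffers are assumed to have been recorded by worker threads already):

```cpp
// One thread, one submit, any number of command buffers in order.
VkCommandBuffer recordedCmds[4] = { /* filled in by worker threads */ };

VkSubmitInfo submit = {};
submit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.commandBufferCount = 4;
submit.pCommandBuffers    = recordedCmds;

vkQueueSubmit(graphicsQueue, 1, &submit, fence);
```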
Welcome to the world of hardware differences and the lies they tell :)
So, NV.. ugh.. NV are basically a massive black box: unless you are an ISV of standing (I guess?) they pretty much don't tell you anything about how the important bits of their hardware work, which is a pain and leads to people trying to figure out wtf is going on when their hardware starts to run slowly.
The first thing we learn is that not all queues are created equal; even if you can create 16 queues which can consume all the commands, this doesn't indicate how well they will execute. The point of contention doesn't in fact seem to be the CUDA or CU cores (NV and AMD respectively) but the front-end dispatcher logic, at least in NV's case.
A bit of simplified GPU theory: your commands, when submitted to the GPU, are dispatched in the order the front-end command processor sees them. This command processor can only keep so many work packets in flight before it has to wait for resources. So, for example, it might be able to keep 10 'draw' packets in flight, but if you submit an 11th, even if you have a CUDA/CU unit free which could do the work, the command processor can't dispatch the work to it until it has free resources to track said work. Also, iirc, work is retired in order: so if 'draw' packet 3 finishes before packet 1, the command processor can still be blocked from dispatching more work out to the ALU segment of the GPU.
On anything pre-Pascal I would just avoid trying to interleave gfx and compute queue work at all; just go with a single queue, as the hardware seems to have problems keeping work sanely in flight when mixing. By all accounts Pascal seems to do better, but I've not seen many net wins in benchmarks from it (at least, not by a significant amount), so even with Pascal you might want to default to a single queue.
(Pre-emption doesn't really help with this either; that deals with the ability to swap state out so that other work can take over, and it is a heavy operation; CPU-wise it is closer to switching processes in the amount of overhead. Pascal's tweak here is the ability to suspend work at instruction boundaries.)
AMD are a lot more open with how their stuff works, which is good for us :)
Basically all AMD GCN based cards in the wild today will have hardware for 1 'Graphics queue' (which can consume gfx, compute and copy commands) and 2 DMA queues which can consume copy commands only.
The compute queues are likely both driver and hardware dependent, however. When Vulkan first appeared my R290 reported back only 2 compute queues; it now reports 7. However, I'm currently not sure how that maps to hardware; while the hardware has 8 'async compute engines' (ACEs), this doesn't mean it is one queue per ACE, as each ACE can service 8 'queues' of instructions itself. (So, in theory, Vulkan on my AMD hardware could report 64 compute-only queues, 8*8.) If it is one queue per ACE, then life gets even more fun, because each ACE can maintain two pieces of work at once, meaning you could launch, resources allowing, 14 'compute' jobs + N gfx jobs and not have anything block.
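To actually get a gfx + compute-only queue pair you ask for both at device creation. A minimal sketch, assuming family 0 is the graphics/compute family and family 1 is compute-only as GCN typically reports (enumerate instead of hardcoding in real code):

```cpp
float priority = 1.0f;

VkDeviceQueueCreateInfo queueInfos[2] = {};
queueInfos[0].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfos[0].queueFamilyIndex = 0; // graphics + compute + transfer
queueInfos[0].queueCount       = 1;
queueInfos[0].pQueuePriorities = &priority;
queueInfos[1].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfos[1].queueFamilyIndex = 1; // compute-only
queueInfos[1].queueCount       = 1;
queueInfos[1].pQueuePriorities = &priority;

VkDeviceCreateInfo deviceInfo = {};
deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.queueCreateInfoCount = 2;
deviceInfo.pQueueCreateInfos    = queueInfos;

VkDevice device;
vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

VkQueue gfxQueue, asyncComputeQueue;
vkGetDeviceQueue(device, 0, 0, &gfxQueue);
vkGetDeviceQueue(device, 1, 0, &asyncComputeQueue);
```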
When it comes to using this stuff, certainly in an async manner, it is important to consider what else is going on at the same time.
If your gfx task is bandwidth-heavy but ALU-light then you might have some ALU spare for some compute work to take advantage of - but if you pair ALU-heavy with ALU-heavy, or bandwidth-heavy with bandwidth-heavy, you might see a performance dip.
Ultimately the best thing you can do is probably to make your graphics/compute setup data driven in some way so you can reconfigure things based on hardware type and factors like resolution of the screen etc.
I certainly, however, wouldn't try to drive the hardware from multiple threads into the same window/view - that feels like a terrible idea and problems waiting to happen.
- Jobs to build command lists
- Master job(s) to submit built commands in the correct order to the correct queue
That would be my take on the setup.
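As a rough sketch of that split (each recording thread gets its own VkCommandPool since pools are externally synchronised; recordScenePart is a hypothetical per-job recording function):

```cpp
#include <vulkan/vulkan.h>
#include <thread>
#include <vector>

// Jobs record in parallel; a single 'master' thread submits in order.
// Each cmds[i] is assumed to come from its own thread-local command pool.
void buildAndSubmit(VkQueue queue, std::vector<VkCommandBuffer>& cmds)
{
    std::vector<std::thread> workers;
    for (size_t i = 0; i < cmds.size(); ++i)
    {
        workers.emplace_back([&cmds, i] {
            VkCommandBufferBeginInfo begin = {};
            begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
            vkBeginCommandBuffer(cmds[i], &begin);
            // recordScenePart(cmds[i], i); // hypothetical per-job recording
            vkEndCommandBuffer(cmds[i]);
        });
    }
    for (auto& w : workers) w.join();

    VkSubmitInfo submit = {};
    submit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = static_cast<uint32_t>(cmds.size());
    submit.pCommandBuffers    = cmds.data();
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
}
```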
A third option would be to write separate rendering pipelines for each card type. Write one solution for NVidia, one for AMD, etc... but this isn't something I'd like to do.
I think you should. I'm still on AMD only at the moment, but in the past I've had some differences just in compute shaders between NV / AMD resulting in +/-50% performance. I expect the same applies to almost anything else, especially with a more low-level API :(
To a degree you'll have to do this if you want maximal performance; if you don't want max performance I would question your Vulkan usage, more so if you don't want to deal with the hardware awareness which comes with it :)
However, it doesn't have to be too bad if you can design your rendering/compute system in such a way that it is flexible enough to cover the differences - at the simplest level you have a graph which describes your gfx/compute work, and on AMD you dispatch to 2 queues while on NV you serialise into a single queue in the correct order. (Also, don't forget Intel in all this.)
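At its crudest that reconfiguration can key off the PCI vendor ID (the IDs below are the real ones; the strategy enum is made up for illustration):

```cpp
enum class QueueStrategy { SingleQueue, AsyncCompute }; // hypothetical names

QueueStrategy pickQueueStrategy(VkPhysicalDevice gpu)
{
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(gpu, &props);

    switch (props.vendorID)
    {
        case 0x1002: return QueueStrategy::AsyncCompute; // AMD: gfx + compute queues
        case 0x10DE: return QueueStrategy::SingleQueue;  // NVIDIA: serialise into one
        default:     return QueueStrategy::SingleQueue;  // Intel (0x8086) etc: play safe
    }
}
```

In a real engine you'd probably want this in data rather than code, tuned per GPU generation, but the principle is the same.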
Your shaders and the parameters to them will require tweaking too; AMD prefer small footprints in the 'root' data because they have limited register space to preload. NV, on the other hand, are good with lots of stuff in the 'root' signature of the shaders. You'll also likely want to take advantage of shader extensions for maximal performance on the respective hardware.
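In Vulkan terms that 'root' data maps roughly to push constants, so the AMD-friendly version is a small push-constant block with the bulk parameters moved into a UBO/SSBO. A sketch (the 16-byte struct is an arbitrary example):

```cpp
// Keep the per-draw 'root' footprint tiny: indices into bigger buffers.
struct DrawPush { uint32_t objectIndex; uint32_t materialIndex; float lodBias; uint32_t pad; };

VkPushConstantRange range = {};
range.stageFlags = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT;
range.offset     = 0;
range.size       = sizeof(DrawPush); // 16 bytes, cheap to preload

// Per draw, with objIdx/matIdx looked up elsewhere:
DrawPush push = { objIdx, matIdx, 0.0f, 0 };
vkCmdPushConstants(cmd, pipelineLayout, range.stageFlags, 0, sizeof(push), &push);
```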