AFAIK there are four kinds of "support" for the "compute queue plus graphics queue" use case.
- The GPU has one HW queue. The OS shares it between multiple SW queues. Switching SW queues requires a full GPU pipeline flush / stall.
- As above, but there's no need to flush the GPU when switching between SW queues.
- The GPU actually has many HW queues and the switching is done by the GPU itself.
- As above, but commands from different queues can actually be executed in parallel within different "shader cores".
NV has been in category #1 in the past, I'm not sure if they've reached #3/4 yet or are just at #2.
AMD has been at #4 for a while now. Their hardware queues are themselves made up of multiple pipes -- e.g. 8 queues with 8 internal pipes each gives 64 HW command lists in flight at once. Each HW queue can be given a different priority, and the GPU then round-robin services the pipes within that queue.
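To make the pipe-servicing idea concrete, here's a tiny CPU-side sketch of a command processor round-robining between the pipes of one HW queue. All the names (`Pipe`, `HwQueue`, `ServiceOne`) are invented for illustration -- this is a model of the scheduling policy, not real driver/firmware code.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// One pipe holds a FIFO of submitted command lists.
struct Pipe { std::deque<std::string> commandLists; };

struct HwQueue {
    std::vector<Pipe> pipes;
    std::size_t nextPipe = 0;

    // Round-robin service: each call pops one command list from the
    // next non-empty pipe, then advances past it, so no single pipe
    // can starve the others.
    bool ServiceOne(std::string* out) {
        for (std::size_t i = 0; i < pipes.size(); ++i) {
            Pipe& p = pipes[(nextPipe + i) % pipes.size()];
            if (!p.commandLists.empty()) {
                *out = p.commandLists.front();
                p.commandLists.pop_front();
                nextPipe = (nextPipe + i + 1) % pipes.size();
                return true;
            }
        }
        return false; // all pipes drained
    }
};
```

With two pipes holding {A1, A2} and {B1}, servicing interleaves them as A1, B1, A2 rather than draining pipe 0 first.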
As well as taking turns to read commands out of the different pipes and schedule them for execution, the GPU can actually overlap the execution of those commands -- e.g. a post-process compute job can execute simultaneously with a shadowmap rendering job, or a high-priority compute task can start executing half-way through a long draw call. This is just for compute queues though -- I'd usually assume that a single graphics queue is a safe bet.
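A toy CPU-side analogy of that overlap-plus-synchronization pattern, with `std::thread` standing in for HW queues and a small `Fence` object standing in for a GPU fence: the compute queue overlaps the graphics queue freely, but any pass that depends on the graphics output waits on the fence first. All names here are invented for illustration.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Minimal fence: one monotonically-increasing value, waiters block
// until the signaled value reaches what they asked for.
struct Fence {
    std::mutex m;
    std::condition_variable cv;
    std::uint64_t value = 0;
    void Signal(std::uint64_t v) {
        { std::lock_guard<std::mutex> lk(m); value = v; }
        cv.notify_all();
    }
    void Wait(std::uint64_t v) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return value >= v; });
    }
};

std::vector<std::string> RunAsyncExample() {
    std::vector<std::string> log;
    std::mutex logMutex;
    auto note = [&](const char* s) {
        std::lock_guard<std::mutex> lk(logMutex);
        log.push_back(s);
    };
    Fence fence;
    std::thread graphics([&] {
        note("shadowmap pass");        // raster-bound work
        fence.Signal(1);               // shadowmap results are now ready
    });
    std::thread compute([&] {
        note("independent compute");   // free to overlap the shadowmap pass
        fence.Wait(1);                 // dependent pass must wait on the fence
        note("lighting using shadowmap");
    });
    graphics.join();
    compute.join();
    return log;
}
```

The "independent compute" entry can land before or after "shadowmap pass" (that's the point -- they overlap), but "lighting using shadowmap" is always last.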
Some GPUs might actually support two (or more) graphics queues (now or in the future)...
Async compute + graphics is very useful because it allows you to make use of HW resources that would otherwise be sitting idle. Shadowmap rendering is bottlenecked by the fixed-function rasterization hardware and the DB cache / ROP / OM stage, which means all the "shader cores" are sitting around idle. Procedural texturing will keep the "shader cores'" ALU units busy, but leaves the memory controllers idle. Voxel ray-marching is pretty light on ALU cost, but completely stresses the memory controllers. Finding tasks like these with opposite bottlenecks and then running them in parallel lets you take advantage of those otherwise-idle HW resources.
This is actually a pretty hard thing to pull off though.
Most examples that I've seen where people have found a decent win are of the form "do some compute work while rendering your shadow maps"... and having two graphics contexts doesn't fit into that use case -- they'd both be competing for the fixed-function rasterization hardware.
In the future it may become useful though -- e.g. running ALU heavy pixel shaders in parallel with memory-bound pixel shaders... Or running a multi-frame, ultra-low-priority light-baking shader in parallel with your actual realtime graphics, for lightmaps that are updated every few seconds as the time of day progresses. I've shipped a game that implemented that last feature, but it was a PITA, as we had to implement it very carefully so the lightbaker only did exactly 1ms of work every frame -- basically cooperative multitasking. Having a second graphics context of a different priority level might've made it much easier to implement.
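The cooperative-multitasking baker described above can be sketched roughly like this: split the bake into tiny steps, and each frame run steps only until a per-frame time budget (~1ms in our case) is exhausted, then yield back to the realtime renderer. This is a hedged reconstruction with invented names (`LightBaker`, `TickWithBudget`), not the shipped code -- and in the real system each step would be a GPU dispatch, not a CPU function.

```cpp
#include <chrono>

struct LightBaker {
    int nextTexel = 0;
    int totalTexels;
    explicit LightBaker(int total) : totalTexels(total) {}

    // One tiny, bounded unit of bake work. Keeping each step small is
    // what makes the time-slicing accurate.
    void BakeStep() { ++nextTexel; }

    bool Finished() const { return nextTexel >= totalTexels; }

    // Called once per frame: bake until the budget is spent, then
    // return control to the renderer. Progress persists across frames,
    // so the full bake completes over many frames.
    void TickWithBudget(std::chrono::microseconds budget) {
        auto start = std::chrono::steady_clock::now();
        while (!Finished() &&
               std::chrono::steady_clock::now() - start < budget) {
            BakeStep();
        }
    }
};
```

A lower-priority graphics queue would let the GPU do this preemption for you instead of the engine metering it by hand.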
Or another use case is simply when multiple apps are running at once. Windows uses D3D to composite its own desktop/windows now, so a windowed game is sharing the GPU with the OS and other apps too, which have their own graphics contexts!
As for copy queues, I'd assume that every vendor can benefit from you having one of these. Some GPUs may support more than one...