
Vulkan Question concerning internal queue organisation


Hello,

my first post here :-)

About half a year ago I started with C++ (I did a little C before) and began poking into graphics programming. Right now I am digging through the various Vulkan tutorials.

A probably naive question that arose is:

If I have a device (in my case a GTX 970 clone) that exposes, on each of two GPUs, two queue families (one with 16 queues for graphics, compute, etc. and another with a single transfer queue), do I lose potential performance if I only use 1 of the 16 graphics queues? Or, in other words, are these queues hardware entities or purely logical ones?

And how is that handled across different vendors? Do Intel and AMD handle this similarly, or would a program have to account for different behaviour across different hardware?

Cheers

gb


Yes, this is very vendor specific.

On AMD you can use multiple queues to do async compute (e.g. doing compute shader work and shadow map rendering at the same time).

You can also run multiple compute shaders at the same time, but it's likely that's slower than doing them in order in a single queue.

 

On NV the first option is possible on recent cards, but the second option is not - they will serialize internally (AFAIK - not sure).

 

On both vendors it makes sense to use a different queue for data transfer, e.g. a streaming system running while rendering.

Not sure about Intel; AFAIK they recommend just using a single queue for everything.

 

In practice you need a good reason to use multiple queues, you have to test on each piece of hardware, and you may end up with different settings for different hardware.

E.g. for multithreaded command buffer generation you don't need multiple queues, and a queue per thread would be a bad idea.
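
Roughly what requesting a graphics queue plus a dedicated transfer queue looks like at device creation - just a sketch; physicalDevice and the 0/1 family indices are placeholders for whatever vkGetPhysicalDeviceQueueFamilyProperties reports on your hardware:

    float priority = 1.0f;

    VkDeviceQueueCreateInfo queueInfos[2] = {};
    queueInfos[0].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[0].queueFamilyIndex = 0;            // GRAPHICS | COMPUTE | TRANSFER family (assumed index)
    queueInfos[0].queueCount       = 1;            // one graphics queue is usually enough
    queueInfos[0].pQueuePriorities = &priority;
    queueInfos[1].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[1].queueFamilyIndex = 1;            // dedicated TRANSFER family (assumed index)
    queueInfos[1].queueCount       = 1;
    queueInfos[1].pQueuePriorities = &priority;

    VkDeviceCreateInfo deviceInfo = {};
    deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 2;
    deviceInfo.pQueueCreateInfos    = queueInfos;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

    VkQueue graphicsQueue, transferQueue;
    vkGetDeviceQueue(device, 0, 0, &graphicsQueue);   // family 0, queue index 0
    vkGetDeviceQueue(device, 1, 0, &transferQueue);   // family 1, queue index 0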


Thanks. So I understand that a single graphics queue is the best solution.

Yeah, I could split the 2*16 queues freely among graphics, compute, transfer and sparse, and the family with the single queue is transfer only. Like this, but twice, once per device:

VkQueueFamilyProperties[0]:
===========================
        queueFlags         = GRAPHICS | COMPUTE | TRANSFER | SPARSE
        queueCount         = 16
        timestampValidBits = 64
        minImageTransferGranularity = (1, 1, 1)

VkQueueFamilyProperties[1]:
===========================
        queueFlags         = TRANSFER
        queueCount         = 1
        timestampValidBits = 64
        minImageTransferGranularity = (1, 1, 1)
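
(For reference, the same information can be queried directly in code; a minimal sketch only, assuming a valid physicalDevice and the usual headers:)

    // assumes <vulkan/vulkan.h>, <vector>, <cstdio>
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, nullptr);

    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, families.data());

    for (uint32_t i = 0; i < count; ++i)
    {
        printf("VkQueueFamilyProperties[%u]: flags=0x%x queueCount=%u timestampValidBits=%u\n",
            i, families[i].queueFlags, families[i].queueCount, families[i].timestampValidBits);
    }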

 

I am not far enough along to test anything on different platforms/devices. My "training" PC is a Debian Linux box. But in principle, if one day I write a basic framework of my own, I would of course aim for a solution that is robust and works across different platforms/manufacturers. That would probably be a compromise and not the ideal setup for every case.

 

1 hour ago, Green_Baron said:

Thanks. So I understand that a single graphics queue is the best solution.

 

Probably. I'm no graphics pipeline expert, but I'm not aware of a case where using two graphics queues makes sense. (Interested to hear if anybody else knows of one.)

It also makes sense to use 1 graphics queue, 1 upload queue and 1 download queue on each GPU for communication (although you don't have this option, because you have only one separate transfer queue).

And it makes sense to use multiple compute queues on some hardware.

I verified that GCN can overlap small compute workloads perfectly, but the need to use multiple queues, and therefore multiple command buffers and synchronization between them, destroyed the advantage in my case.

Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage individual CUs at a much lower level. The hardware can do it but we have no access - VK/DX12 is just a start... :(

 


I haven't quite figured out the point/idea behind queue families. It's clear that all queues of a given family share hardware. Also, a GPU is allowed a lot of leeway to rearrange commands, both within command buffers and across command buffers within the same queue. So queues from separate queue families are most likely separate pieces of hardware, but are queues from the same family? I've never been able to get a straight answer on this, but my gut feeling is no.

For example, AMD has 3 queue families, so if you create one queue for each family (one for graphics, one for compute, and one for transfer) you can probably get better performance. But is it possible to get significantly better performance with multiple queues from the same queue family? From what I've been able to gather online so far, probably not.

While I do agree with JoeJ that queues are poorly designed in Vulkan, I don't think direct control of CUs makes sense IMHO. I think queues should essentially be driver/software entities. When you create a device you select how many queues of which capabilities you need, and the driver gives them to you and maps them to hardware entities however it sees fit. Sort of like how, on the CPU, we create threads and the OS maps them to cores. No notion of queue families. No need to query what queues exist and try to map them to what you want.

TBH, until they clean this part of the spec up, or at least provide some documentation on what they had in mind, I feel like most people are just going to create 1 graphics queue and 1 transfer queue, and ignore the rest.

Edited by Ryan_001


I'm just beginning to understand how this works; I'm far from asking "why". For a newcomer Vulkan is a little steep in the beginning, and some things seem highly theoretical (like graphics without presentation and so on).

Thanks for the answers, seems like I'm on the right track :-)

On 7/10/2017 at 6:05 PM, Green_Baron said:

do I lose potential performance if I only use 1 of the 16 graphics queues? Or, in other words, are these queues hardware entities or purely logical ones?

No, you probably only need 1 queue. They're (hopefully) hardware entities. If you want different bits of your GPU work to be able to run in parallel, then you could use different queues, but you probably have no need for that.

For example, if you launch two different apps at the same time, Windows may make sure that each of them is running on a different hardware queue, which could make them more responsive / less likely to get in each other's way.

On 7/11/2017 at 1:41 AM, JoeJ said:

I'm no graphics pipeline expert, but I'm not aware of a case where using two graphics queues makes sense. (Interested to hear if anybody else knows of one.)

In the future, when vendors start making GPUs that can actually run multiple command buffers in parallel with each other, you could use it in the same way that AMD's async compute works.

On 7/11/2017 at 2:57 AM, Ryan_001 said:

I haven't quite figured out the point/idea behind queue families.

To use OOP as an analogy, a family is a class and a queue is an instance (object) of that class.

On 7/11/2017 at 1:41 AM, JoeJ said:

Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage individual CUs at a much lower level. The hardware can do it but we have no access - VK/DX12 is just a start...

Are you sure about that? AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.

2 hours ago, Hodgman said:

 

On 10.7.2017 at 5:41 PM, JoeJ said:

Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage individual CUs at a much lower level. The hardware can do it but we have no access - VK/DX12 is just a start...

Are you sure about that? AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.

I would have nothing against the queue concept, if only it worked.

You can look at the test project I submitted to AMD: https://github.com/JoeJGit/OpenCL_Fiji_Bug_Report/blob/master/async_test_project.rar

...if you are bored, but here is what I found:

You can run 3 small tasks without synchronization perfectly in parallel, yeah - awesome.

As soon as you add sync, which is only possible by using semaphores, the advantage gets lost due to bubbles. (Maybe semaphores sync with the CPU as well? If so we have a terrible situation here! We need GPU-only sync between queues.)

And here comes the best part: if you try larger workloads, e.g. 3 tasks with runtimes of 0.2ms, 1ms and 1ms without async, then going async the first and second task run in parallel as expected, although 1ms becomes 2ms, so there is no win. But the third task rises to 2ms as well, even though it runs alone with nothing else - its runtime is doubled for nothing.

It seems there is no dynamic work balancing happening here - it looks like the GPU gets divided somehow and refuses to merge back when possible.

2 hours ago, Hodgman said:

AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.

Guess not, the numbers don't match. A Fiji has 8 ACEs (if that's the correct name), but I see only 4 compute queues (1 gfx/CS + 3 CS). Nobody knows what happens under the hood, but it needs more work, at least in the drivers.

 

 

Access to individual CUs should not be necessary, you're right guys. But I would be willing to tackle it if it were an improvement.

There are two situations where async compute makes sense:

1. Doing compute while doing ALU-light rendering work (not yet tried - all my hope goes into this, but not everyone has rendering work).

2. Parallelizing and synchronizing small compute tasks - extremely important if we look towards more complex algorithms that reduce work instead of brute-forcing everything. And sadly this fails so far.

Edited by JoeJ


I think part of your disappointment is the assumption that the GPU won't already be running computations async in parallel in the first place, which means that you expect "async compute" to give a huge boost, when you've actually gotten that boost already.

In a regular situation, if you submit two dispatch calls "A" and "B" sequentially, which each contain 8 wavefronts, the GPU's timeline will hopefully look like this:
[timeline diagram: the front-end feeds A's and B's wavefronts into the shader core together, so A and B are in flight at the same time]

Where it's working on both A and B concurrently.

If you go and add any kind of resource transition or sync between those two dispatches, then you end up with a timeline that looks more like:
[timeline diagram: all of A's wavefronts drain before B's begin, leaving a bubble between them]

If you simply want the GPU to work on as many compute tasks as possible back to back without any bubbles, then the new tool in Vulkan for optimizing that situation is manual control over barriers. D3D11/GL will use barriers all over the place where they aren't required (which creates these bubbles and disables concurrent processing of multiple dispatch calls), but Vulkan gives you the power to specify exactly when they're required.
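
As a rough illustration (placeholder names, not real engine code): independent dispatches are simply recorded back to back, and a barrier only goes where a real read-after-write dependency exists:

    // A and B are independent: no barrier, the GPU is free to overlap their wavefronts.
    vkCmdDispatch(cmd, groupsA, 1, 1);
    vkCmdDispatch(cmd, groupsB, 1, 1);

    // C reads what A/B wrote, so insert a barrier only here.
    VkMemoryBarrier barrier = {};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // producing stage
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // consuming stage
        0,
        1, &barrier,
        0, nullptr,
        0, nullptr);

    vkCmdDispatch(cmd, groupsC, 1, 1);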

Using multiple queues is not required for this optimization. The use of multiple queues requires extra barriers and synchronisation, which is the opposite of what you want. As you mention, a good use for a separate compute queue is to keep the CUs fed while a rasterizer-heavy draw command list is being processed.

Also take note that the structure of these timelines makes profiling the time taken by your stages quite difficult. Note that the front-end processes "B" in between "A" and "A - end of pipe"! If you time from when A reaches the front of the pipe to when it reaches the end of the pipe, you'll also be counting some time taken by the "B" command! If you count the time from when "A" enters the pipe until when "B" enters the pipe, then your timings will be much shorter than reality. The more internal parallelism that you're getting out of the GPU, the more incorrect your timings of individual draws/dispatches will be. Remember to keep that in mind when analyzing any timing data that you collect.


Whooo! I had already thought the driver might figure out a dependency graph and do things async automatically, but I also thought that would be wishful thinking.

This is too good to be true, so I'm still not ready to believe it :)

(Actually I have too many barriers, but soon I'll be able to push more independent work to the queue and I'm curious whether I'll get a lot of it for free...)

Awesome! Thanks, Hodgman :)

 

 

 

 


The following quote is from the spec:

Quote

Command buffer boundaries, both between primary command buffers of the same or different batches or submissions as well as between primary and secondary command buffers, do not introduce any additional ordering constraints. In other words, submitting the set of command buffers (which can include executing secondary command buffers) between any semaphore or fence operations execute the recorded commands as if they had all been recorded into a single primary command buffer, except that the current state is reset on each boundary. Explicit ordering constraints can be expressed with explicit synchronization primitives.

I read this as meaning that for a single queue, command buffer boundaries don't really matter, at least where pipeline barriers are concerned. A pipeline barrier in one command buffer will halt execution not only of commands within that command buffer, but also of any subsequent commands in subsequent command buffers (of course only for those stages the barrier applies to), even if those subsequent command buffers are independent and could be executed simultaneously. So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

That said, I feel this was an error/oversight IMHO. It seems clear (at least to me : ) that semaphores are the go-to primitive for synchronization between separate command buffers, and hence it would make sense to me for pipeline barriers to operate only within a particular command buffer. This way independent command buffers on the same queue could be fully independent, and there would be no need for other queues. Alas, this is not the case. Perhaps they had their reasons for not doing that; it's often hard to read between the lines and understand the 'why' from a specification. I guess to me that's why queues feel like such a mess. Vulkan has multiple ways of doing essentially the same thing; it feels like the spec is stepping on its own toes. But perhaps it's just my OCD wanting a more orthonormal API.


I tried to verify with my test...

I have 3 shaders:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With a memory barrier after each dispatch I get 0.462 ms

Without: 0.012 ms (a speedup of 38.5x)

To verify, I use 1 dispatch of 5500 wavefronts (the same work): 0.013 ms

 

So yes, not only is the GPU capable of doing async compute perfectly with a single queue, we also see that the API overhead of multiple dispatches is essentially zero :)

Finally I understand why memory barriers appeared so expensive to me. Shame on me, and all disappointment gone :D

7 hours ago, Ryan_001 said:

So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

If the work is truly independent, you won't have any barriers and could use one queue just fine.

1 hour ago, Hodgman said:
9 hours ago, Ryan_001 said:

So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

If the work is truly independent, you won't have any barriers and could use one queue just fine.

Yes, using one queue is faster even if there are no memory barriers/semaphores. Submitting a command buffer has a noticeable cost, so putting all the work in one command buffer on one queue is the fastest way.
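
(If you do end up with several command buffers anyway, you can at least hand them all to a single vkQueueSubmit instead of submitting one by one - a sketch only; cmdBufs, queue and fence are placeholders:)

    // Pay the per-submit overhead once for N pre-recorded command buffers.
    VkSubmitInfo submitInfo = {};
    submitInfo.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.commandBufferCount = (uint32_t)cmdBufs.size();
    submitInfo.pCommandBuffers    = cmdBufs.data();

    vkQueueSubmit(queue, 1, &submitInfo, fence);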

I also tested using queues from different families vs. all from one family, which had no effect on performance.

All tests with only compute shaders.

 

Now I don't see a need to use multiple queues other than for uploading/downloading data. Maybe using 2 queues makes sense if we want to do compute and graphics, but I guess 1 queue is better here too.

 

Edit: Maybe using multiple queues results in dividing work strictly between CUs, while using one queue can distribute multiple dispatches to the same CU. If so, maybe we could avoid some cache thrashing by grouping work with similar memory access together. But I guess cases where this wins would be extremely rare.

 

Edited by JoeJ

3 hours ago, Hodgman said:

If the work is truly independent, you won't have any barriers and could use one queue just fine.

With all due respect, not necessarily.  Assuming I'm reading the spec correctly...

Imagine you had 3 units of work, each in its own command buffer, and each fully independent of the other 2. Within each command buffer there is still a pipeline barrier, because while each command buffer is completely independent from the others, there are dependencies within the commands of each individual command buffer. You could submit these 3 command buffers to 3 different queues and they could (theoretically) run asynchronously/in any order.

Now if pipeline barriers were restricted to a given command buffer, then submitting these 3 command buffers to a single queue would also yield asynchronous performance. But as it stands, submitting these 3 command buffers to a single queue will cause stalls/bubbles, because pipeline barriers work across command buffer boundaries. The pipeline barrier in command buffer 1 will cause not only commands in buffer 1 to wait but also commands in buffers 2 and 3, even though those commands are independent and need not wait on that barrier.

This change would also give a bit of purpose to secondary command buffers, for which (at this time) I see little use.

Now I just need to convince the Vulkan committee of what a great idea retroactively changing the spec is, and that breaking everyone's code is no big deal.  /sarcasm

Edited by Ryan_001

6 minutes ago, Ryan_001 said:

Imagine you had 3 units of work, each in its own command buffer, and each fully independent of the other 2. Within each command buffer there is still a pipeline barrier, because while each command buffer is completely independent from the others, there are dependencies within the commands of each individual command buffer.

Ah ok, I didn't get the last assumption -- that the big-picture work of each buffer is independent, but contains internal dependencies.

Yes, in theory multiple queues could be used to allow commands from another queue to be serviced while one queue is blocked doing some kind of barrier work. In practice on current hardware I don't know if this makes any difference though -- the "barrier work" will usually be made up of a command along the lines of "halt the front-end from processing any command from any queue until all write-through traffic has actually been flushed from the L2 cache to RAM"... In the future there may be a use for this though.

I don't know if the Vulkan spec allows for it, but another use of multiple queues is prioritization. If a background app is using the GPU at the same time as a game, it would be wise for the driver to set the game's queues as high priority and the background app's queues as low priority. Likewise, if your gameplay code itself uses GPU compute, you could issue its commands via a "highest/realtime" priority queue which is configured to immediately interrupt any graphics work and do the compute work immediately -- which would allow you to perform GPGPU calculations without the typical one-frame delay. Again, I don't know if this is possible (yet) on PCs either.

12 minutes ago, Ryan_001 said:

This change would also give a bit of purpose to secondary command buffers, for which (at this time) I see little use.

 

AFAIK, they're similar to "bundles" in D3D12 or display lists in GL, which are meant for saving the CPU cost of repeatedly re-recording the draw commands for a particular model every frame, and instead re-using a micro command buffer over many frames.


Well, barriers in the spec are a little more fine-grained; you can pick the actual pipeline stages to halt on. For example, if you wrote to a buffer from the fragment shader and then read it from the vertex shader, you would put a pipeline barrier which would halt all subsequent vertex shader (and later) stages from executing prior to the fragment shader completing. But I have the feeling you were talking about what the hardware actually does? In which case you are probably right; I have no idea how fine-grained the hardware really is.

The spec does support queue priority, sort of:

Quote

4.3.4. Queue Priority

Each queue is assigned a priority, as set in the VkDeviceQueueCreateInfo structures when creating the device. The priority of each queue is a normalized floating point value between 0.0 and 1.0, which is then translated to a discrete priority level by the implementation. Higher values indicate a higher priority, with 0.0 being the lowest priority and 1.0 being the highest.

Within the same device, queues with higher priority may be allotted more processing time than queues with lower priority. The implementation makes no guarantees with regards to ordering or scheduling among queues with the same priority, other than the constraints defined by any explicit synchronization primitives. The implementation makes no guarantees with regards to queues across different devices.

An implementation may allow a higher-priority queue to starve a lower-priority queue on the same VkDevice until the higher-priority queue has no further commands to execute. The relationship of queue priorities must not cause queues on one VkDevice to starve queues on another VkDevice.

No specific guarantees are made about higher priority queues receiving more processing time or better quality of service than lower priority queues.

As I read it, this doesn't allow one app to prioritize itself above another, and it only affects queues created on the same VkDevice. Now whether any hardware actually does this... you would know better than I, I imagine.
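
For what it's worth, requesting the priorities looks something like this (a sketch; familyIndex is a placeholder, and whether the driver actually honors the values is another matter):

    // Ask for two queues from the same family, one high and one low priority.
    float priorities[2] = { 1.0f, 0.2f };    // normalized 0.0 .. 1.0, per the spec quote above

    VkDeviceQueueCreateInfo queueInfo = {};
    queueInfo.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = familyIndex;
    queueInfo.queueCount       = 2;
    queueInfo.pQueuePriorities = priorities;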

As far as secondary command buffers go, I've seen that suggested. I don't disagree; it's just that I don't see it being faster than just recording a bunch of primary command buffers in most circumstances. The only 2 situations I could come up with were:

1) The small command buffers are all within the same render pass, in which case you would need secondary command buffers.

2) You have way too many (thousands? millions?) small primary command buffers, and that might cause some performance issues on submit, so recording them as secondary and using another thread to bundle them into a single primary might make the submit faster.


Some interesting points. I made this test now:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With a memory barrier after each dispatch and 1 queue: 0.46 ms

With a memory barrier after each dispatch and 3 queues, one per shader: 0.21 ms

 

So we can use multiple queues to keep working while another queue is stalled.

I'll modify my test to see if I could still use one queue for the same purpose by restricting each barrier to a per-shader memory range within the same buffer, or by using separate buffers per shader...

EDIT1:

...but first I tried making the first shader do 5 times more work than shaders 2 & 3. Previously all shaders did the same calculations, so I couldn't be sure a barrier on queue 1 doesn't stall queue 0 as well, because the barriers happen at the same time. Now I see shader 1 still completes first and is slightly faster than the other two, so it is not affected by their barriers :)

Runtime with 3 queues: 0.18 ms, 1 queue: 0.44 ms (not the first time I've seen that doing more work is faster on small loads).

 

 

 

 

 

Edited by JoeJ

23 minutes ago, JoeJ said:

Some interesting points. I made this test now:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With a memory barrier after each dispatch and 1 queue: 0.46 ms

With a memory barrier after each dispatch and 3 queues, one per shader: 0.21 ms

So we can use multiple queues to keep working while another queue is stalled.

I'll modify my test to see if I could still use one queue for the same purpose by restricting each barrier to a per-shader memory range within the same buffer, or by using separate buffers per shader...

It's good to know that theory and practice align, at least for this :) Nice work. I'm curious, what sort of barrier parameters are you using?

12 minutes ago, Ryan_001 said:

It's good to know that theory and practice align, at least for this :) Nice work. I'm curious, what sort of barrier parameters are you using?

BufferMemoryBarriers - here's the code.

I leave the commented-out variants in to illustrate how much the spec leaves us to trial and error - or would you have guessed that you need to set VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT for an indirect compute dispatch? :)

(Of course I could remove that here, as I'm only writing some prefix sum results and no dispatch count, but offset and size become interesting now...)

 

    // One vkCmdPipelineBarrier covering a VkBufferMemoryBarrier per listed buffer:
    // compute writes are made visible to later compute reads and indirect-argument reads.
    void MemoryBarriers (VkCommandBuffer commandBuffer, int *bufferList, const int numBarriers)
    {
        int const maxBarriers = 16;
        assert (numBarriers <= maxBarriers);

        VkBufferMemoryBarrier bufferMemoryBarriers[maxBarriers] = {};
        //VkMemoryBarrier memoryBarriers[maxBarriers] = {};

        for (int i=0; i<numBarriers; i++)
        {
            bufferMemoryBarriers[i].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
            //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
            bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
            bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            bufferMemoryBarriers[i].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
            bufferMemoryBarriers[i].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
            bufferMemoryBarriers[i].buffer = buffers[bufferList[i]].deviceBuffer;
            bufferMemoryBarriers[i].offset = 0;
            bufferMemoryBarriers[i].size = VK_WHOLE_SIZE;

            //memoryBarriers[i].sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
            //memoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;// | VK_ACCESS_SHADER_WRITE_BIT;
            //memoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;// | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
        }

        vkCmdPipelineBarrier(
            commandBuffer,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
            0,//VkDependencyFlags
            0, NULL,//numBarriers, memoryBarriers,
            numBarriers, bufferMemoryBarriers,
            0, NULL);
    }
        
    void Record (VkCommandBuffer commandBuffer, const uint32_t taskFlags,
        int profilerStartID, int profilerStopID, bool profilePerTask = true, bool use_barriers = true)
    {
        VkCommandBufferBeginInfo commandBufferBeginInfo = {};
        commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        commandBufferBeginInfo.flags = 0;//VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;

        vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo);

#ifdef USE_GPU_PROFILER
        if (profilerStartID>=0) profiler.Start (profilerStartID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
#endif

        if (taskFlags & (1<<tTEST0))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST0], 0, 1, &descriptorSets[tTEST0], 0, nullptr);
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST0]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST0};
            for (int i=0; i<TASK_COUNT_0; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (0 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

        if (taskFlags & (1<<tTEST1))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST1], 0, 1, &descriptorSets[tTEST1], 0, nullptr);
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST1]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST1};
            for (int i=0; i<TASK_COUNT_1; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (200 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

        if (taskFlags & (1<<tTEST2))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST2], 0, 1, &descriptorSets[tTEST2], 0, nullptr);
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST2]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST2};
            for (int i=0; i<TASK_COUNT_2; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (400 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

#ifdef USE_GPU_PROFILER
        if (profilerStopID>=0) profiler.Stop (profilerStopID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
#endif

        vkEndCommandBuffer(commandBuffer);
    }


Ok, so finally and as expected it makes no difference with these options:

Use barriers for unique buffers per task.

Use barriers for nonoverlapping memory regions per task but the same buffer for all.

 

The driver could figure out that it can still run things async with 1 queue in both cases, but it does not. Just like the spec says.

I hope I've set up everything correctly (I'm still unsure about the difference between VK_ACCESS_MEMORY_WRITE_BIT and VK_ACCESS_SHADER_WRITE_BIT, but it did not matter here).

So the conclusion is:

We have to use multiple queues to keep busy on pipeline barriers.

We should reduce sync between queues to a minimum.

 

A bit more challenging than initially thought, and I hope 2 saturating tasks in 2 queues don't slow each other down too much. If they do, we need more sync to prevent it and it becomes a hardware-dependent balancing act. But I'm optimistic and it all makes sense now.
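
(For reference, the sync between queues I mean is the semaphore handoff at submit time - roughly like this, with all handles being placeholders:)

    // Queue A signals a semaphore when its batch finishes...
    VkSubmitInfo submitA = {};
    submitA.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitA.commandBufferCount   = 1;
    submitA.pCommandBuffers      = &cmdA;
    submitA.signalSemaphoreCount = 1;
    submitA.pSignalSemaphores    = &semaphore;
    vkQueueSubmit(queueA, 1, &submitA, VK_NULL_HANDLE);

    // ...and queue B waits on it before its compute work is allowed to start.
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
    VkSubmitInfo submitB = {};
    submitB.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitB.waitSemaphoreCount = 1;
    submitB.pWaitSemaphores    = &semaphore;
    submitB.pWaitDstStageMask  = &waitStage;
    submitB.commandBufferCount = 1;
    submitB.pCommandBuffers    = &cmdB;
    vkQueueSubmit(queueB, 1, &submitB, VK_NULL_HANDLE);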

 


Interesting, I don't know if you need VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT there.

Quote
  • VK_ACCESS_MEMORY_READ_BIT specifies read access via non-specific entities. These entities include the Vulkan device and host, but may also include entities external to the Vulkan device or otherwise not part of the core Vulkan pipeline. When included in a destination access mask, makes all available writes visible to all future read accesses on entities known to the Vulkan device.

  • VK_ACCESS_MEMORY_WRITE_BIT specifies write access via non-specific entities. These entities include the Vulkan device and host, but may also include entities external to the Vulkan device or otherwise not part of the core Vulkan pipeline. When included in a source access mask, all writes that are performed by entities known to the Vulkan device are made available. When included in a destination access mask, makes all available writes visible to all future write accesses on entities known to the Vulkan device.

I read that as meaning the memory read/write bits are for things outside the normal Vulkan scope, like the presentation/windowing system. The demos/examples I looked at also never included those bits. I agree with you completely that the spec leaves a lot of things ambiguously defined. What surprised me a bit was that image layout transitions are considered both a read and a write, so you have to include access/stage masks for the hidden read/write that occurs during transitions.
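
For example (a sketch only; image and cmd are placeholders): transitioning a render target so a later pass can sample it spells out both the write side and the read side, because the transition itself does that hidden read/write:

    // Transition a color attachment so a fragment shader can sample it.
    VkImageMemoryBarrier barrier = {};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;   // writes that produced the image
    barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;              // reads that will consume it
    barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
        0,
        0, nullptr,
        0, nullptr,
        1, &barrier);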

This thread has helped clarify a lot of these things.

I wrote my own pipeline barrier wrapper, which I found made a lot more sense (apart from not really understanding what VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT mean). The whole thing isn't important, but you might find the flag enumeration interesting.

enum class MemoryDependencyFlags : uint64_t {

	none											= 0,

	indirect_read								= (1ull << 0),				// VK_ACCESS_INDIRECT_COMMAND_READ_BIT + VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
	index_read								= (1ull << 1),				// VK_ACCESS_INDEX_READ_BIT + VK_PIPELINE_STAGE_VERTEX_INPUT_BIT
	attribute_vertex_read				= (1ull << 2),				// VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT + VK_PIPELINE_STAGE_VERTEX_INPUT_BIT

	uniform_vertex_read					= (1ull << 3),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	uniform_tess_control_read		= (1ull << 4),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	uniform_tess_eval_read			= (1ull << 5),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	uniform_geometry_read			= (1ull << 6),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	uniform_fragment_read				= (1ull << 7),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	uniform_compute_read				= (1ull << 8),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT

	shader_vertex_read					= (1ull << 9),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	shader_vertex_write					= (1ull << 10),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	shader_tess_control_read			= (1ull << 11),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	shader_tess_control_write		= (1ull << 12),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	shader_tess_eval_read				= (1ull << 13),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	shader_tess_eval_write				= (1ull << 14),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	shader_geometry_read				= (1ull << 15),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	shader_geometry_write			= (1ull << 16),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	shader_fragment_read				= (1ull << 17),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	shader_fragment_write				= (1ull << 18),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	shader_compute_read				= (1ull << 19),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
	shader_compute_write				= (1ull << 20),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT

	attachment_fragment_read		= (1ull << 21),				// VK_ACCESS_INPUT_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	attachment_color_read				= (1ull << 22),				// VK_ACCESS_COLOR_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
	attachment_color_write			= (1ull << 23),				// VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
	attachment_depth_read_early	= (1ull << 24),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT
	attachment_depth_read_late		= (1ull << 25),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT
	attachment_depth_write_early	= (1ull << 26),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT
	attachment_depth_write_late	= (1ull << 27),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT

	transfer_read							= (1ull << 28),				// VK_ACCESS_TRANSFER_READ_BIT + VK_PIPELINE_STAGE_TRANSFER_BIT
	transfer_write							= (1ull << 29),				// VK_ACCESS_TRANSFER_WRITE_BIT + VK_PIPELINE_STAGE_TRANSFER_BIT

	host_read									= (1ull << 30),				// VK_ACCESS_HOST_READ_BIT + VK_PIPELINE_STAGE_HOST_BIT
	host_write								= (1ull << 31),				// VK_ACCESS_HOST_WRITE_BIT + VK_PIPELINE_STAGE_HOST_BIT

	memory_read							= (1ull << 32),				// VK_ACCESS_MEMORY_READ_BIT
	memory_write							= (1ull << 33),				// VK_ACCESS_MEMORY_WRITE_BIT
	};

The formatting is a mess, but you get the idea. Only certain combinations of stage + access are allowed by the spec, and by enumerating them it became far clearer which to pick. I can then convert these directly to the associated stage + access masks without any loss in expressiveness/performance (or at least there shouldn't be, if I understand things correctly).

Edited by Ryan_001


Copy that - it's good to compare your own guessing against the guessing of others :D

(I spot that you don't cover the case of a compute shader writing an indirect dispatch count.)

 

I wonder if Events could help here: http://vulkan-spec-chunked.ahcox.com/ch06s03.html

I have not used them yet. Could I do something like triggering a memory barrier, processing some other work, then waiting on the barrier with a high chance it has already completed?
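
What I imagine would look something like this (not tried, just from reading the spec; handles and group counts are placeholders):

    // Producer finished writing; set the event after the compute stage...
    vkCmdDispatch(cmd, producerGroups, 1, 1);
    vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

    // ...record independent work in between, which is free to overlap...
    vkCmdDispatch(cmd, unrelatedGroups, 1, 1);

    // ...then wait on the event right before the consumer that needs the data.
    VkMemoryBarrier barrier = {};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

    vkCmdWaitEvents(cmd, 1, &event,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // stage that signaled
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // stage that waits
        1, &barrier,
        0, nullptr,
        0, nullptr);

    vkCmdDispatch(cmd, consumerGroups, 1, 1);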

I really need a Vulkan for Dummies that tells me some use cases for such things...

 

5 minutes ago, JoeJ said:

(I spot that you don't cover the case of a compute shader writing an indirect dispatch count.)

I'm not sure exactly what you mean.  The flags are pretty much taken verbatim from Table 4 of the spec: https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VkPipelineStageFlagBits (scroll down a screen or two).

I haven't played around with indirect stuff yet. I'm assuming you write the commands to a buffer (either through memmap/staging buffer/copy, or through a compute shader or similar), then use that buffer as the source for the indirect command, correct? If I was transferring from the host then I'd use host_write or transfer_write as my source flags (depending on whether or not I used a staging buffer), and then I'd use indirect_read as my dest flags. If I were computing the buffer on the fly, would you not use shader_compute_write as src and indirect_read as dest?

24 minutes ago, JoeJ said:

I really need a Vulkan for Dummies that tells me some use cases for such things...

Isn't that an oxymoron :)

11 minutes ago, Ryan_001 said:

If I were computing the buffer on the fly, would you not use shader_compute_write as src and indirect_read as dest?

Oh sorry, yes. I confused this with the dstStageMask from vkCmdPipelineBarrier().

