Green_Baron

Vulkan Question concerning internal queue organisation

Recommended Posts

Hello,

my first post here :-)

About half a year ago I started with C++ (did a little C before) and began poking into graphics programming. Right now I am digging through the various Vulkan tutorials.

A probably naive question that arose is:

If I have a device (in my case a GTX 970 clone) that exposes two families on each of two GPUs, one with 16 queues for graphics, compute, etc., and another one with a single transfer queue, do I lose potential performance if I only use 1 of the 16 graphics queues? Or, in other words, are these queues hardware or logical entities?

And how is that handled across different vendors? Do Intel and AMD handle this similarly, or would a program have to take care of different behaviour across different hardware?

Cheers

gb


Yes, this is very vendor specific.

On AMD you can use multiple queues to do async compute (e.g. doing compute shader work and shadow map rendering at the same time).

You can also run multiple compute shaders at the same time, but it's also likely that's slower than doing them in order in a single queue.

 

On NV the first option is possible on recent cards, but the second option is not possible - they will serialize internally (AFAIK, not sure).

 

On both vendors it makes sense to use a different queue for data transfer, e.g. a streaming system running while rendering.
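A minimal sketch of that transfer-queue idea (placeholder handles, untested): a staging copy is recorded and submitted on the dedicated transfer queue, and the graphics submit waits on a semaphore only at the stage that consumes the data. The buffers are assumed to use VK_SHARING_MODE_CONCURRENT, so no queue family ownership transfer is shown.

    // Record the upload on a command buffer allocated from the transfer family's pool.
    VkCommandBufferBeginInfo beginInfo = {};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    vkBeginCommandBuffer(transferCmd, &beginInfo);
    VkBufferCopy region = { 0, 0, dataSize };
    vkCmdCopyBuffer(transferCmd, stagingBuffer, deviceLocalBuffer, 1, &region);
    vkEndCommandBuffer(transferCmd);

    // Submit the copy to the dedicated transfer queue; signal a semaphore when it's done.
    VkSubmitInfo transferSubmit = {};
    transferSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    transferSubmit.commandBufferCount = 1;
    transferSubmit.pCommandBuffers = &transferCmd;
    transferSubmit.signalSemaphoreCount = 1;
    transferSubmit.pSignalSemaphores = &uploadDone;
    vkQueueSubmit(transferQueue, 1, &transferSubmit, VK_NULL_HANDLE);

    // The graphics submit only waits at the stage that actually consumes the new data.
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
    VkSubmitInfo graphicsSubmit = {};
    graphicsSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    graphicsSubmit.waitSemaphoreCount = 1;
    graphicsSubmit.pWaitSemaphores = &uploadDone;
    graphicsSubmit.pWaitDstStageMask = &waitStage;
    graphicsSubmit.commandBufferCount = 1;
    graphicsSubmit.pCommandBuffers = &graphicsCmd;
    vkQueueSubmit(graphicsQueue, 1, &graphicsSubmit, VK_NULL_HANDLE);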

Not sure about Intel, AFAIK they recommend just using a single queue for everything.

 

In practice you need a good reason to use multiple queues, to test on each kind of hardware, and to use different settings for different hardware.

E.g. for multithreaded command buffer generation you don't need multiple queues, and a queue per thread would be a bad idea.


Thanks. So I understand that a single graphics queue is the best solution.

Yeah, I could split the 2*16 queues freely among graphics, compute, transfer and sparse, and the family with the single queue is transfer only. Like this, but twice, for two devices:

VkQueueFamilyProperties[0]:
===========================
        queueFlags         = GRAPHICS | COMPUTE | TRANSFER | SPARSE
        queueCount         = 16
        timestampValidBits = 64
        minImageTransferGranularity = (1, 1, 1)

VkQueueFamilyProperties[1]:
===========================
        queueFlags         = TRANSFER
        queueCount         = 1
        timestampValidBits = 64
        minImageTransferGranularity = (1, 1, 1)

 

I am not far enough along to test anything on different platforms/devices yet. My "training" PC is a Debian Linux one. But in principle, if one day I write a basic framework of my own, I would of course aim for a solution that is robust and works across different platforms/manufacturers. That would probably be a compromise and not the ideal one for every case.
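For reference, a minimal device-creation sketch matching that layout (family indices hard-coded to the vulkaninfo output above, names are placeholders, untested):

    // Family 0: GRAPHICS | COMPUTE | TRANSFER | SPARSE, family 1: dedicated TRANSFER,
    // as reported above (normally discovered via vkGetPhysicalDeviceQueueFamilyProperties).
    float priority = 1.0f;
    VkDeviceQueueCreateInfo queueInfos[2] = {};
    queueInfos[0].sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[0].queueFamilyIndex = 0;           // graphics/compute family
    queueInfos[0].queueCount = 1;                 // one queue is usually enough
    queueInfos[0].pQueuePriorities = &priority;
    queueInfos[1].sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[1].queueFamilyIndex = 1;           // dedicated transfer family
    queueInfos[1].queueCount = 1;
    queueInfos[1].pQueuePriorities = &priority;

    VkDeviceCreateInfo deviceInfo = {};
    deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 2;
    deviceInfo.pQueueCreateInfos = queueInfos;
    VkDevice device;
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

    // Retrieve the queues afterwards.
    VkQueue graphicsQueue, transferQueue;
    vkGetDeviceQueue(device, 0, 0, &graphicsQueue);   // family 0, queue 0
    vkGetDeviceQueue(device, 1, 0, &transferQueue);   // family 1, queue 0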

 

1 hour ago, Green_Baron said:

Thanks. So I understand that a single graphics queue is the best solution.

 

Probably. I'm no graphics pipeline expert, but I'm not aware of a case where using two graphics queues can make sense. (Interested if anybody else knows of one.)

It also makes sense to use 1 graphics queue, 1 upload queue and 1 download queue on each GPU to communicate (although you don't have this option, because you have only one separate transfer queue).

And it makes sense to use multiple compute queues on some hardware.

I verified that GCN can perfectly overlap small compute workloads, but the need to use multiple queues, and therefore multiple command buffers and synchronization between them, destroyed the advantage in my case.

Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage individual CUs at a much lower level. The hardware can do it but we have no access - VK/DX12 is just a start... :(

 


I haven't quite figured out the point/idea behind queue families.  It's clear that all queues of a given family share hardware.  Also, a GPU is allowed a lot of leeway to rearrange commands both within command buffers and across command buffers within the same queue.  So queues from separate queue families are most likely separate pieces of hardware, but are queues from the same family?  I've never been able to get a straight answer on this, but my gut feeling is no.

For example, AMD has 3 queue families, so if you create one queue for each family (one for graphics, one for compute, and one for transfer) you can probably get better performance.  But is it possible to get significantly better performance with multiple queues from the same queue family?  So far, from what I've been able to gather online, probably not.

While I do agree with JoeJ that queues are poorly designed in Vulkan, I don't think direct control of CUs makes sense IMHO.  I think queues should be essentially driver/software entities.  So when you create a device you select how many queues of what capabilities you need, and the driver gives them to you and maps them to hardware entities however it feels is best.  Sort of like how on the CPU we create threads and the OS maps them to cores.  No notion of queue families.  No need to query what queues exist and try to map them to what you want.

TBH, until they clean this part of the spec up, or at least provide some documentation on what they had in mind, I feel like most people are just going to create 1 graphics queue and 1 transfer queue, and ignore the rest.

Edited by Ryan_001


I'm just beginning to understand how this works; I am far from asking "why". For a newcomer, Vulkan is a little steep in the beginning, and some things seem highly theoretical (like graphics without presentation and so on).

Thanks for the answers, seems like I'm on the right track :-)

On 7/10/2017 at 6:05 PM, Green_Baron said:

do I lose potential performance if I only use 1 of the 16 graphics queues? Or, in other words, are these queues hardware or logical entities?

No, you probably only need 1 queue. They're (hopefully) hardware entities. If you want different bits of your GPU work to be able to run in parallel, then you could use different queues, but you probably have no need for that.

For example, if you launch two different apps at the same time, Windows may make sure that each of them is running on a different hardware queue, which could make them more responsive / less likely to get in each other's way.

On 7/11/2017 at 1:41 AM, JoeJ said:

I'm no graphics pipeline expert, but I'm not aware of a case where using two graphics queues can make sense. (Interested if anybody else knows of one.)

In the future, when vendors start making GPUs that can actually run multiple command buffers in parallel with each other, you could use it in the same way that AMD's async compute works.

On 7/11/2017 at 2:57 AM, Ryan_001 said:

I haven't quite figured out the point/idea behind queue families.

To use OOP as an analogy, a family is a class and a queue is an instance (object) of that class.
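(In API terms, with hypothetical indices: the family index selects the "class", the queue index selects the "instance". The second instance only exists if queueCount >= 2 was requested for that family at device creation.)

    VkQueue q0, q1;
    vkGetDeviceQueue(device, /*queueFamilyIndex*/ 0, /*queueIndex*/ 0, &q0);
    vkGetDeviceQueue(device, /*queueFamilyIndex*/ 0, /*queueIndex*/ 1, &q1);
    // q0 and q1 come from the same family, so they have identical capabilities.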

On 7/11/2017 at 1:41 AM, JoeJ said:

Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage individual CUs at a much lower level. The hardware can do it but we have no access - VK/DX12 is just a start...

Are you sure about that? AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.

2 hours ago, Hodgman said:

 

On 10.7.2017 at 5:41 PM, JoeJ said:

Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage individual CUs at a much lower level. The hardware can do it but we have no access - VK/DX12 is just a start...

Are you sure about that? AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.

I would have nothing against the queue concept, if only it worked.

You can look at the test project I submitted to AMD: https://github.com/JoeJGit/OpenCL_Fiji_Bug_Report/blob/master/async_test_project.rar

...if you are bored, but here is what I found:

You can run 3 small tasks without synchronization perfectly in parallel, yeah - awesome.

As soon as you add sync, which is only possible by using semaphores, the advantage gets lost due to bubbles. (Maybe semaphores sync with the CPU as well? If so we have a terrible situation here! We need GPU-only sync between queues.)

And here comes the best part: if you try larger workloads, e.g. 3 tasks with runtimes of 0.2 ms, 1 ms and 1 ms without async, then going async the first and second tasks run in parallel as expected, although 1 ms becomes 2 ms, so there is no win. But the third task rises to 2 ms as well, even though it runs alone with nothing else - its runtime is doubled for nothing.

It seems there is no dynamic work balancing happening here - looks like the GPU gets divided somehow and refuses to merge back when possible.

2 hours ago, Hodgman said:

AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.

Guess not, the numbers don't match. A Fiji has 8 ACEs (if that's the correct name), but I see only 4 compute queues (1 gfx/CS + 3 CS). Nobody knows what happens under the hood, but it needs more work, at least on the driver side.

 

 

Access to individual CUs should not be necessary, you're right guys. But I would be willing to tackle it if it were an improvement.

There are two situations where async compute makes sense:

1. Doing compute while doing ALU-light rendering work. (Not yet tried - all my hope goes into this, but not everyone has rendering work.)

2. Parallelizing and synchronizing small compute tasks - extremely important if we look towards more complex algorithms that reduce work instead of brute forcing everything. And sadly this fails so far.

Edited by JoeJ


I think part of your disappointment is the assumption that the GPU won't already be running computations async in parallel in the first place, which means that you expect "async compute" to give a huge boost, when you've actually gotten that boost already.

In a regular situation, if you submit two dispatch calls "A" and "B" sequentially, which each contain 8 wavefronts, the GPU's timeline will hopefully look like this:
[Timeline image: the GPU works on the wavefronts of A and B concurrently.]

Where it's working on both A and B concurrently.

If you go and add any kind of resource transition or sync between those two dispatches, then you end up with a timeline that looks more like:
[Timeline image: with a sync between the two dispatches, all of A's wavefronts must drain before B starts, leaving a bubble.]

If you simply want the GPU to work on as many compute tasks back to back without any bubbles, then the new tool in Vulkan for optimizing that situation is manual control over barriers. D3D11/GL will use barriers all over the place where they aren't required (which creates these bubbles and disables concurrent processing of multiple dispatch calls), but Vulkan gives you the power to specify exactly when they're required.
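To make that concrete, a minimal sketch (untested, descriptor set binding omitted, all handles are placeholders): two independent dispatches recorded back to back with no barrier between them, and a single buffer barrier only where a third dispatch actually reads their results.

    // A and B are independent: no barrier between them, so the GPU is free to overlap them.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineA);
    vkCmdDispatch(cmd, groupsA, 1, 1);
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineB);
    vkCmdDispatch(cmd, groupsB, 1, 1);

    // C reads what A and B wrote, so one barrier is placed exactly here and nowhere else.
    VkBufferMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer = resultsBuffer;
    barrier.offset = 0;
    barrier.size = VK_WHOLE_SIZE;
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
        0, 0, nullptr, 1, &barrier, 0, nullptr);

    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineC);
    vkCmdDispatch(cmd, groupsC, 1, 1);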

Using multiple queues is not required for this optimization. The use of multiple queues requires extra barriers and synchronisation, which is the opposite of what you want. As you mention, a good use for a separate compute queue is to keep the CUs fed while a rasterizer-heavy draw command list is being processed.

Also take note that the structure of these timelines makes profiling the time taken by your stages quite difficult. Note that the front-end processes "B" in between "A" and "A - end of pipe"! If you time from when A reaches the front of the pipe to when it reaches the end of the pipe, you'll also be counting some time taken by the "B" command! If you count the time from when "A" enters the pipe until when "B" enters the pipe, then your timings will be much shorter than reality. The more internal parallelism that you're getting out of the GPU, the more incorrect your timings of individual draws/dispatches will be. Remember to keep that in mind when analyzing any timing data that you collect.


Whooo! I already thought the driver could figure out a dependency graph and do things async automatically, but I also thought that this being reality would be wishful thinking.

This is too good to be true, so I'm still not ready to believe it :)

(Actually I have too many barriers, but soon I'll be able to push more independent work to the queue, and I'm curious if I'll get a lot of it for free...)

Awesome! Thanks, Hodgman :)

 

 

 

 


The following is a quote from the spec:

Quote

Command buffer boundaries, both between primary command buffers of the same or different batches or submissions as well as between primary and secondary command buffers, do not introduce any additional ordering constraints. In other words, submitting the set of command buffers (which can include executing secondary command buffers) between any semaphore or fence operations execute the recorded commands as if they had all been recorded into a single primary command buffer, except that the current state is reset on each boundary. Explicit ordering constraints can be expressed with explicit synchronization primitives.

I read this as meaning that for a single queue, command buffer boundaries don't really matter, at least where pipeline barriers are concerned.  A pipeline barrier in one command buffer will halt execution, not only of commands within that command buffer, but also of any subsequent commands in subsequent command buffers (of course only for those stages that the barrier applies to), even if those subsequent command buffers are independent and could be executed simultaneously.  So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

That said, I feel this was an error/oversight IMHO.  It seems clear (at least to me : ) that semaphores are the go-to primitive for synchronization between separate command buffers, and hence it would make sense to me that pipeline barriers operate only within a particular command buffer.  This way independent command buffers on the same queue could be fully independent, and there would be no need for other queues.  Alas, this is not the case.  Perhaps they had their reasons for not doing that; it's often hard to read between the lines and understand the 'why' from a specification.  I guess to me, that's why queues feel like such a mess.  Vulkan has multiple ways of doing essentially the same thing, and it feels like the spec is stepping on its own toes.  But perhaps it's just my OCD wanting a more orthonormal API.


I tried to verify with my test...

I have 3 shaders:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With a memory barrier after each dispatch I get 0.462 ms.

Without: 0.012 ms (a speedup of 38.5x).

To verify, I use 1 dispatch of 5500 wavefronts (the same total work): 0.013 ms

 

So yes, not only is the GPU capable of doing async compute perfectly with a single queue, we also see the API overhead of multiple dispatches is zero :)

Finally I understand why memory barriers appeared so expensive to me. Shame on me, and all disappointment gone :D

7 hours ago, Ryan_001 said:

So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

If the work is truly independent, you won't have any barriers and could use one queue just fine.

1 hour ago, Hodgman said:
9 hours ago, Ryan_001 said:

So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

If the work is truly independent, you won't have any barriers and could use one queue just fine.

Yes, using one queue is faster even if there are no memory barriers / semaphores. Submitting a command buffer has a noticeable cost, so putting all work in one command buffer and one queue is the fastest way.

I also tested using queues from different families vs. all from one family, which had no effect on performance.

All tests with only compute shaders.

 

Now I don't see a need to use multiple queues other than for uploading/downloading data. Maybe using 2 queues makes sense if we want to do compute and graphics, but I guess 1 queue is better here too.

 

Edit: Maybe using multiple queues results in dividing work strictly between CUs, while using one queue can distribute multiple dispatches on the same CU. If so, maybe we could avoid some cache thrashing by grouping work with similar memory access together. But I guess cases where this wins would be extremely rare.

 

Edited by JoeJ

3 hours ago, Hodgman said:

If the work is truly independent, you won't have any barriers and could use one queue just fine.

With all due respect, not necessarily.  Assuming I'm reading the spec correctly...

Imagine you had 3 units of work, each in their own command buffer, and each fully independent from the other 2.  Now within each command buffer there is still a pipeline barrier, because while each command buffer is completely independent from the others, there are dependencies within the commands of each individual command buffer.  You could submit these 3 command buffers to 3 different queues and they could run (theoretically) asynchronously/in any order.

Now if pipeline barriers were restricted to a given command buffer, then submitting these 3 command buffers to a single queue would also yield asynchronous performance.  But as it stands, submitting these 3 command buffers to a single queue will cause stalls/bubbles, because pipeline barriers work across command buffer boundaries.  The pipeline barrier in command buffer 1 will not only cause commands in buffer 1 to wait, but also commands in buffers 2 and 3, even though those commands are independent and need not wait on the pipeline barrier.

This change would also give a bit of purpose to secondary command buffers, for which (at this time) I see little use.

Now I just need to convince the Vulkan committee of what a great idea retroactively changing the spec is, and that breaking everyone's code is no big deal.  /sarcasm

Edited by Ryan_001

6 minutes ago, Ryan_001 said:

Imagine you had 3 units of work, each in their own command buffer, and each fully independent from the other 2.  Now within each command buffer there is still a pipeline barrier because while each command buffer is completely independent from the others, there are dependencies within the commands of each individual command buffer.  

Ah ok, I didn't get the last assumption -- that the big-picture work of each buffer is independent, but each contains internal dependencies.

Yes, in theory multiple queues could be used to allow commands from another queue to be serviced while one queue is blocked doing some kind of barrier work. In practice on current hardware I don't know if this makes any difference though -- the "barrier work" will usually be made up of a command along the lines of "halt the front-end from processing any command from any queue until all write-through traffic has actually been flushed from the L2 cache to RAM"... In the future there may be a use for this though.

I don't know if the Vulkan spec allows for it, but another use of multiple queues is prioritization. If a background app is using the GPU at the same time as a game, it would be wise for the driver to set the game's queues as high priority and the background app's queues as low priority. Likewise, if your gameplay code itself uses GPU compute, you could issue its commands via a "highest/realtime" priority queue which is configured to immediately interrupt any graphics work and run the compute -- which would allow you to perform GPGPU calculations without the typical one-frame delay. Again, I don't know if this is possible (yet) on PCs either.

12 minutes ago, Ryan_001 said:

This change would also give a bit of purpose to secondary command buffers, for which (at this time) I see little use.

 

AFAIK, they're similar to "bundles" in D3D12 or display lists in GL, which are meant for saving on the CPU cost of repeatedly re-recording draw commands for a particular model every frame, and instead re-using a micro command buffer over many frames.
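A rough sketch of that reuse pattern (untested, all handles are placeholders): record the secondary command buffer once, then replay it from the primary inside the render pass each frame.

    // Record once: a secondary command buffer holding the draw commands for one model.
    VkCommandBufferInheritanceInfo inherit = {};
    inherit.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
    inherit.renderPass = renderPass;
    inherit.subpass = 0;

    VkCommandBufferBeginInfo beginSecondary = {};
    beginSecondary.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    beginSecondary.flags = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT |
                           VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT;
    beginSecondary.pInheritanceInfo = &inherit;

    vkBeginCommandBuffer(secondaryCmd, &beginSecondary);
    vkCmdBindPipeline(secondaryCmd, VK_PIPELINE_BIND_POINT_GRAPHICS, modelPipeline);
    VkDeviceSize offset = 0;
    vkCmdBindVertexBuffers(secondaryCmd, 0, 1, &vertexBuffer, &offset);
    vkCmdDraw(secondaryCmd, vertexCount, 1, 0, 0);
    vkEndCommandBuffer(secondaryCmd);

    // Every frame: replay it from the primary command buffer.
    vkCmdBeginRenderPass(primaryCmd, &renderPassBegin, VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
    vkCmdExecuteCommands(primaryCmd, 1, &secondaryCmd);
    vkCmdEndRenderPass(primaryCmd);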


Well, barriers in the spec are a little more fine-grained; you can pick the actual pipeline stages to halt on.  For example, if you wrote to a buffer from the fragment shader and then read it from the vertex shader, you would put a pipeline barrier which would halt all subsequent vertex shader (and later) stages from executing until the fragment shader completes.  But I have the feeling you were talking about what the hardware actually does?  In which case you are probably right; I have no idea how fine-grained the hardware really is.

The spec does support queue priority, sort of:

Quote

4.3.4. Queue Priority

Each queue is assigned a priority, as set in the VkDeviceQueueCreateInfo structures when creating the device. The priority of each queue is a normalized floating point value between 0.0 and 1.0, which is then translated to a discrete priority level by the implementation. Higher values indicate a higher priority, with 0.0 being the lowest priority and 1.0 being the highest.

Within the same device, queues with higher priority may be allotted more processing time than queues with lower priority. The implementation makes no guarantees with regards to ordering or scheduling among queues with the same priority, other than the constraints defined by any explicit synchronization primitives. The implementation make no guarantees with regards to queues across different devices.

An implementation may allow a higher-priority queue to starve a lower-priority queue on the same VkDevice until the higher-priority queue has no further commands to execute. The relationship of queue priorities must not cause queues on one VkDevice to starve queues on another VkDevice.

No specific guarantees are made about higher priority queues receiving more processing time or better quality of service than lower priority queues.

As I read it, this doesn't allow one app to prioritize itself higher than another, and only affects queues created on a single VkDevice.  Now whether any hardware actually does this... you would know better than I, I imagine.
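For what it's worth, the priorities are requested at device creation time; a minimal sketch (placeholder names, and as the quoted text says, what the implementation does with them is another matter):

    float priorities[2] = { 1.0f, 0.0f };   // queue 0 high priority, queue 1 low priority

    VkDeviceQueueCreateInfo queueInfo = {};
    queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = graphicsFamilyIndex;
    queueInfo.queueCount = 2;
    queueInfo.pQueuePriorities = priorities;

    VkDeviceCreateInfo deviceInfo = {};
    deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 1;
    deviceInfo.pQueueCreateInfos = &queueInfo;
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);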

As for secondary command buffers, I've seen that suggested.  I don't disagree, it's just that I don't see that being faster than recording a bunch of primary command buffers in most circumstances.  The only 2 situations I could come up with were:

1) The small command buffers are all within the same render pass, in which case you would need secondary command buffers.

2) You have way too many (thousands? millions?) small primary command buffers, and that might cause some performance issues on submit, so recording them as secondary and using another thread to bundle them into a single primary might make the submit faster.


Some interesting points; I made this test now:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With memory barrier after each dispatch and 1 queue: 0.46 ms

With memory barrier after each dispatch and 3 queues, one per shader: 0.21 ms

 

So we can use multiple queues to keep working while another queue is stalled.

I'll modify my test to see if I could still use one queue for the same purpose, either by restricting the barriers to non-overlapping memory ranges within the same buffer per shader, or by using multiple buffers per shader...

EDIT1:

...but first I tried giving the first shader 5 times more work than shaders 2 & 3. Before, all shaders did the same calculations, so I couldn't be sure that a barrier on queue 1 doesn't stall queue 0 as well, because the barriers happen at the same time. Now I see shader 1 still completes first and is slightly faster than the other two, so it is not affected by their barriers :)

Runtime with 3 queues: 0.18 ms, with 1 queue: 0.44 ms (not the first time I've seen that doing more work is faster on small loads).

 

 

 

 

 

Edited by JoeJ

23 minutes ago, JoeJ said:

Some interesting points; I made this test now:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With memory barrier after each dispatch and 1 queue: 0.46 ms

With memory barrier after each dispatch and 3 queues, one per shader: 0.21 ms

 

So we can use multiple queues to keep working while another queue is stalled.

I'll modify my test to see if I could still use one queue for the same purpose, either by restricting the barriers to non-overlapping memory ranges within the same buffer per shader, or by using multiple buffers per shader...

It's good to know that theory and practice align, at least for this : ) Nice work.  I'm curious, what sort of barrier parameters are you using?

12 minutes ago, Ryan_001 said:

It's good to know that theory and practice align, at least for this : ) Nice work.  I'm curious, what sort of barrier parameters are you using?

BufferMemoryBarriers; here's the code.

I leave the commented-out lines in to illustrate how much the spec leaves us to trial and error - or would you have guessed that you need to set VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT for an indirect compute dispatch? :)

(Of course I could remove this here, as I'm only writing some prefix sum results and no dispatch count, but offset and size become interesting now...)

 

    void MemoryBarriers (VkCommandBuffer commandBuffer, int *bufferList, const int numBarriers)
    {
        int const maxBarriers = 16;
        assert (numBarriers <= maxBarriers);

        VkBufferMemoryBarrier bufferMemoryBarriers[maxBarriers] = {};
        //VkMemoryBarrier memoryBarriers[maxBarriers] = {};

        for (int i=0; i<numBarriers; i++)
        {
            bufferMemoryBarriers[i].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
            //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
            bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
            bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            bufferMemoryBarriers[i].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
            bufferMemoryBarriers[i].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
            bufferMemoryBarriers[i].buffer = buffers[bufferList[i]].deviceBuffer;
            bufferMemoryBarriers[i].offset = 0;
            bufferMemoryBarriers[i].size = VK_WHOLE_SIZE;

            //memoryBarriers[i].sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
            //memoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;// | VK_ACCESS_SHADER_WRITE_BIT;
            //memoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;// | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
        }

        vkCmdPipelineBarrier(
            commandBuffer,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
            0,//VkDependencyFlags
            0, NULL,//numBarriers, memoryBarriers,//
            numBarriers, bufferMemoryBarriers,
            0, NULL);
    }

    void Record (VkCommandBuffer commandBuffer, const uint32_t taskFlags,
        int profilerStartID, int profilerStopID, bool profilePerTask = true, bool use_barriers = true)
    {
        VkCommandBufferBeginInfo commandBufferBeginInfo = {};
        commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        commandBufferBeginInfo.flags = 0;//VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;

        vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo);

#ifdef USE_GPU_PROFILER
        if (profilerStartID>=0) profiler.Start (profilerStartID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
#endif

        if (taskFlags & (1<<tTEST0))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST0], 0, 1, &descriptorSets[tTEST0], 0, nullptr);
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST0]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST0};
            for (int i=0; i<TASK_COUNT_0; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (0 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

        if (taskFlags & (1<<tTEST1))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST1], 0, 1, &descriptorSets[tTEST1], 0, nullptr);
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST1]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST1};
            for (int i=0; i<TASK_COUNT_1; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (200 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

        if (taskFlags & (1<<tTEST2))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST2], 0, 1, &descriptorSets[tTEST2], 0, nullptr);
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST2]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST2};
            for (int i=0; i<TASK_COUNT_2; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (400 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

#ifdef USE_GPU_PROFILER
        if (profilerStopID>=0) profiler.Stop (profilerStopID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
#endif

        vkEndCommandBuffer(commandBuffer);
    }


OK, so finally, and as expected, it makes no difference with these options:

Use barriers for unique buffers per task.

Use barriers for non-overlapping memory regions per task, but the same buffer for all.

 

The driver could figure out that it could still run things async with 1 queue in both cases, but it does not. Just like the spec says.

I hope I've set up everything correctly (still unsure about the difference between VK_ACCESS_MEMORY_WRITE_BIT and VK_ACCESS_SHADER_WRITE_BIT, but it did not matter here).

So the conclusion is:

We have to use multiple queues to keep the GPU busy across pipeline barriers.

We should reduce sync between queues to a minimum.

 

A bit more challenging than initially thought, and I hope 2 saturating tasks in 2 queues don't slow each other down too much. If they do, we need more sync to prevent this and it becomes a hardware-dependent balancing act. But I'm optimistic and it all makes sense now.

 


Interesting, I don't know if you need VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT there.

Quote
  • VK_ACCESS_MEMORY_READ_BIT specifies read access via non-specific entities. These entities include the Vulkan device and host, but may also include entities external to the Vulkan device or otherwise not part of the core Vulkan pipeline. When included in a destination access mask, makes all available writes visible to all future read accesses on entities known to the Vulkan device.

  • VK_ACCESS_MEMORY_WRITE_BIT specifies write access via non-specific entities. These entities include the Vulkan device and host, but may also include entities external to the Vulkan device or otherwise not part of the core Vulkan pipeline. When included in a source access mask, all writes that are performed by entities known to the Vulkan device are made available. When included in a destination access mask, makes all available writes visible to all future write accesses on entities known to the Vulkan device.

I read that as meaning the memory read/write bits are for things outside the normal Vulkan scope, like the presentation/windowing system.  The demos/examples I looked at also never included those bits.  I agree with you completely that the spec leaves a lot of things ambiguously defined.  What surprised me a bit was that image layout transitions are considered both a read and a write, so you have to include access/stage masks for the hidden read/write that occurs during transitions.

This thread has helped clarify a lot of these things.

I wrote my own pipeline barrier wrapper, which I found made a lot more sense (apart from not really understanding what VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT mean).  The whole thing isn't important but you might find the flag enumeration interesting.

enum class MemoryDependencyFlags : uint64_t {

	none											= 0,

	indirect_read								= (1ull << 0),				// VK_ACCESS_INDIRECT_COMMAND_READ_BIT + VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
	index_read								= (1ull << 1),				// VK_ACCESS_INDEX_READ_BIT + VK_PIPELINE_STAGE_VERTEX_INPUT_BIT
	attribute_vertex_read				= (1ull << 2),				// VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT + VK_PIPELINE_STAGE_VERTEX_INPUT_BIT

	uniform_vertex_read					= (1ull << 3),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	uniform_tess_control_read		= (1ull << 4),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	uniform_tess_eval_read			= (1ull << 5),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	uniform_geometry_read			= (1ull << 6),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	uniform_fragment_read				= (1ull << 7),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	uniform_compute_read				= (1ull << 8),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT

	shader_vertex_read					= (1ull << 9),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	shader_vertex_write					= (1ull << 10),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	shader_tess_control_read			= (1ull << 11),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	shader_tess_control_write		= (1ull << 12),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	shader_tess_eval_read				= (1ull << 13),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	shader_tess_eval_write				= (1ull << 14),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	shader_geometry_read				= (1ull << 15),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	shader_geometry_write			= (1ull << 16),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	shader_fragment_read				= (1ull << 17),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	shader_fragment_write				= (1ull << 18),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	shader_compute_read				= (1ull << 19),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
	shader_compute_write				= (1ull << 20),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT

	attachment_fragment_read		= (1ull << 21),				// VK_ACCESS_INPUT_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	attachment_color_read				= (1ull << 22),				// VK_ACCESS_COLOR_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
	attachment_color_write			= (1ull << 23),				// VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
	attachment_depth_read_early	= (1ull << 24),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT
	attachment_depth_read_late		= (1ull << 25),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT
	attachment_depth_write_early	= (1ull << 26),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT
	attachment_depth_write_late	= (1ull << 27),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT

	transfer_read							= (1ull << 28),				// VK_ACCESS_TRANSFER_READ_BIT + VK_PIPELINE_STAGE_TRANSFER_BIT
	transfer_write							= (1ull << 29),				// VK_ACCESS_TRANSFER_WRITE_BIT + VK_PIPELINE_STAGE_TRANSFER_BIT

	host_read									= (1ull << 30),				// VK_ACCESS_HOST_READ_BIT + VK_PIPELINE_STAGE_HOST_BIT
	host_write								= (1ull << 31),				// VK_ACCESS_HOST_WRITE_BIT + VK_PIPELINE_STAGE_HOST_BIT

	memory_read							= (1ull << 32),				// VK_ACCESS_MEMORY_READ_BIT
	memory_write							= (1ull << 33),				// VK_ACCESS_MEMORY_WRITE_BIT
	};

The formatting is a mess, but you get the idea.  Only certain combinations of stage + access are allowed by the spec; enumerating them made it far clearer which to pick.  I can then directly convert these to the associated stage + access masks without any loss in expressiveness/performance (or at least there shouldn't be one, if I understand things correctly).

Edited by Ryan_001


Copy that - it's good to compare your own guessing against the guessing of others :D

(I notice you don't cover the case of a compute shader writing an indirect dispatch count.)

 

I wonder if Events could help here: http://vulkan-spec-chunked.ahcox.com/ch06s03.html

I have not used them yet. Could I do something like triggering a memory barrier, processing some other work, then waiting on the barrier, with a good chance it has already completed?
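(From the spec, that pattern would presumably look something like this rough, untested sketch - the pipelines, buffers and the VkEvent are placeholders, and whether it actually beats a plain barrier would have to be measured:)

    // Producer writes, then signals the event (the first half of a "split barrier").
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, producerPipeline);
    vkCmdDispatch(cmd, producerGroups, 1, 1);
    vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

    // Unrelated work that does not touch the producer's output goes in between.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, independentPipeline);
    vkCmdDispatch(cmd, independentGroups, 1, 1);

    // Consumer waits on the event; with luck the producer finished during the independent work.
    VkBufferMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer = producerOutputBuffer;
    barrier.offset = 0;
    barrier.size = VK_WHOLE_SIZE;
    vkCmdWaitEvents(cmd, 1, &event,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
        0, nullptr, 1, &barrier, 0, nullptr);

    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, consumerPipeline);
    vkCmdDispatch(cmd, consumerGroups, 1, 1);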

I really need a 'Vulkan for Dummies' that tells me some use cases for such things...

 

5 minutes ago, JoeJ said:

(I notice you don't cover the case of a compute shader writing an indirect dispatch count.)

I'm not sure exactly what you mean.  The flags are pretty much taken verbatim from Table 4 of the spec: https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VkPipelineStageFlagBits (scroll down a screen or two).

I haven't played around with indirect stuff yet.  I'm assuming you write the commands to a buffer (either through memory mapping / a staging buffer copy, or through a compute shader or similar), then use that buffer as the source for the indirect command, correct?  If I were transferring from the host, then I'd use host_write or transfer_write as my source flags (depending on whether or not I used a staging buffer), and then I'd use indirect_read as my dest flags.  If I were computing the buffer on the fly would you not use shader_compute_write as src, and indirect_read as dest?
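In raw Vulkan terms, I think that mapping would look roughly like this (a sketch only; argBuffer is a placeholder for the buffer holding the VkDispatchIndirectCommand values):

    // A compute shader has written VkDispatchIndirectCommand values into argBuffer.
    VkBufferMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;            // the compute write
    barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;   // the indirect argument read
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer = argBuffer;
    barrier.offset = 0;
    barrier.size = VK_WHOLE_SIZE;

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,      // stage that produced the arguments
        VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,       // indirect arguments are read at this stage
        0, 0, nullptr, 1, &barrier, 0, nullptr);

    vkCmdDispatchIndirect(cmd, argBuffer, 0);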

24 minutes ago, JoeJ said:

I really need a 'Vulkan for Dummies' that tells me some use cases for such things...

Isn't that an oxymoron :)

11 minutes ago, Ryan_001 said:

If I were computing the buffer on the fly would you not use shader_compute_write as src, and indirect_read as dest?

Oh sorry, yes.  I confused this with the dstStageMask from vkCmdPipelineBarrier().

