Green_Baron

Vulkan
Question concerning internal queue organisation


Hello,

My first post here :-)

About half a year ago I started with C++ (I did a little C before) and began poking into graphics programming. Right now I am digging through the various Vulkan tutorials.

A probably naive question that arose is:

If I have a device (in my case a GTX 970 clone) that exposes two queue families on each of its two GPUs, one with 16 queues for graphics, compute, etc. and another with a single transfer queue, do I lose potential performance if I only use 1 of the 16 graphics queues? Or, in other words, are these queues backed by hardware, or are they logical entities?

And how is this handled across vendors? Do Intel and AMD handle it similarly, or does a program have to account for different behaviour on different hardware?

Cheers

gb


Yes, this is very vendor-specific.

On AMD you can use multiple queues to do async compute (e.g. running a compute shader and shadow map rendering at the same time).

You can also run multiple compute shaders at the same time, but it's likely that this is slower than doing them in order in a single queue.

On NV the first option is possible on recent cards, but the second is not; they will serialize internally (AFAIK, not sure).

On both vendors it makes sense to use a different queue for data transfer, e.g. a streaming system running while rendering.

Not sure about Intel; AFAIK they recommend just using a single queue for everything.

In practice you need a good reason to use multiple queues, you have to test on each piece of hardware, and you may need different settings for different hardware.

E.g. for multithreaded command buffer generation you don't need multiple queues, and a queue per thread would be a bad idea.
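To make that concrete, here is a minimal sketch (illustrative only, not code from this thread): command buffers recorded on several worker threads can all be handed to one graphics queue in a single submit. The command pools, buffers and the actual draw recording are assumed to exist elsewhere, one pool per thread, since pools are not externally synchronized.

#include <vulkan/vulkan.h>
#include <thread>
#include <vector>

// Record one command buffer per worker thread (each thread must use its own
// VkCommandPool), then hand all of them to the single graphics queue in one submit.
void RecordOnThreadsAndSubmit (VkQueue graphicsQueue, std::vector<VkCommandBuffer> &perThreadCmd)
{
    std::vector<std::thread> workers;
    for (VkCommandBuffer cmd : perThreadCmd)
        workers.emplace_back ([cmd] ()
        {
            // vkBeginCommandBuffer / draws / vkEndCommandBuffer recorded here,
            // independently on this thread.
        });
    for (std::thread &t : workers)
        t.join ();

    VkSubmitInfo submit = {};
    submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = (uint32_t) perThreadCmd.size ();
    submit.pCommandBuffers = perThreadCmd.data ();
    vkQueueSubmit (graphicsQueue, 1, &submit, VK_NULL_HANDLE);
}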


Thanks. So I understand that a single graphics queue is the best solution.

Yeah, I could split the 2*16 queues freely among graphics, compute, transfer and sparse, and the family with the single queue is transfer-only. Like this, but twice, for two devices:

VkQueueFamilyProperties[0]:
===========================
        queueFlags         = GRAPHICS | COMPUTE | TRANSFER | SPARSE
        queueCount         = 16
        timestampValidBits = 64
        minImageTransferGranularity = (1, 1, 1)

VkQueueFamilyProperties[1]:
===========================
        queueFlags         = TRANSFER
        queueCount         = 1
        timestampValidBits = 64
        minImageTransferGranularity = (1, 1, 1)

 

I am not yet far enough along to test anything on different platforms/devices. My "training" PC is a Debian Linux machine. But in principle, if one day I build a basic framework of my own, I would of course aim for a solution that is robust and works across different platforms/manufacturers. That would probably be a compromise rather than the ideal setup for every case.
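For illustration, a minimal sketch of requesting one queue from each of the two families listed above at device-creation time. Error handling, layers and features are omitted, and physicalDevice is assumed to be the enumerated GPU.

// One queue from family 0 (graphics/compute/transfer/sparse) and one from
// family 1 (the dedicated transfer family).
float priority = 1.0f;

VkDeviceQueueCreateInfo queueInfos[2] = {};
queueInfos[0].sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfos[0].queueFamilyIndex = 0;
queueInfos[0].queueCount = 1;
queueInfos[0].pQueuePriorities = &priority;
queueInfos[1].sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfos[1].queueFamilyIndex = 1;
queueInfos[1].queueCount = 1;
queueInfos[1].pQueuePriorities = &priority;

VkDeviceCreateInfo deviceInfo = {};
deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.queueCreateInfoCount = 2;
deviceInfo.pQueueCreateInfos = queueInfos;

VkDevice device = VK_NULL_HANDLE;
vkCreateDevice (physicalDevice, &deviceInfo, nullptr, &device); // physicalDevice: assumed

VkQueue graphicsQueue = VK_NULL_HANDLE, transferQueue = VK_NULL_HANDLE;
vkGetDeviceQueue (device, 0, 0, &graphicsQueue);
vkGetDeviceQueue (device, 1, 0, &transferQueue);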

 

1 hour ago, Green_Baron said:

Thanks. So I understand that a single graphics queue is the best solution.

 

Probably. I'm no graphics pipeline expert, but I'm not aware of a case where using two graphics queues makes sense. (Interested to hear if anybody else is.)

It also makes sense to use 1 graphics queue, 1 upload queue and 1 download queue on each GPU to communicate (although you don't have this option, because you have only one separate transfer queue).

And it makes sense to use multiple compute queues on some hardware.

I verified that GCN can perfectly overlap low-work compute workloads, but the need to use multiple queues, hence multiple command buffers, and to sync between them destroyed the advantage in my case.

Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage individual CUs at a much lower level. The hardware can do it, but we have no access; VK/DX12 is just a start... :(

 


Posted (edited)

I haven't quite figured out the point/idea behind queue families. It's clear that all queues of a given family share hardware. Also, a GPU is allowed a lot of leeway to rearrange commands both within command buffers and across command buffers within the same queue. So queues from separate queue families are most likely separate pieces of hardware, but are queues from the same family? I've never been able to get a straight answer on this, but my gut feeling is no.

For example AMD has 3 queue families, so if you create one queue for each family (one for graphics, one for compute, and one for transfer) you can probably get better performance. But is it possible to get significantly better performance with multiple queues from the same queue family? So far, from what I've been able to gather online, probably not.

While I do agree with JoeJ that queues are poorly designed in Vulkan, I don't think direct control of CUs makes sense IMHO. I think queues should essentially be driver/software entities. So when you create a device you select how many queues of what capabilities you need, and the driver gives them to you and maps them to hardware entities however it sees fit. Sort of like how on the CPU we create threads and the OS maps them to cores. No notion of queue families. No need to query what queues exist and try to map them to what you want.

TBH, until they clean this part of the spec up, or at least provide some documentation on what they had in mind, I feel like most people are just going to create 1 graphics queue and 1 transfer queue, and ignore the rest.

Edited by Ryan_001

I'm just beginning to understand how this works, and am far from asking "why". For a newcomer, Vulkan is a little steep in the beginning, and some things seem highly theoretical (like graphics without presentation and so on).

Thanks for the answers; it seems like I'm on the right track :-)

On 7/10/2017 at 6:05 PM, Green_Baron said:

do I lose potential performance if I only use 1 of the 16 graphics queues? Or, in other words, are these queues backed by hardware, or are they logical entities?

No, you probably only need 1 queue. They're (hopefully) hardware entities. If you want different bits of your GPU work to be able to run in parallel, then you could use different queues, but you probably have no need for that.

For example, if you launch two different apps at the same time, Windows may make sure that each of them is running on a different hardware queue, which could make them more responsive / less likely to get in each other's way.

On 7/11/2017 at 1:41 AM, JoeJ said:

I'm no graphics pipeline expert, but I'm not aware of a case where using two graphics queues makes sense. (Interested to hear if anybody else is.)

In the future, when vendors start making GPUs that can actually run multiple command buffers in parallel with each other, you could use it in the same way that AMD's async compute works.

On 7/11/2017 at 2:57 AM, Ryan_001 said:

I haven't quite figured out the point/idea behind queue families.

To use OOP as an analogy, a family is a class and a queue is an instance (object) of that class.
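Continuing the analogy, a small sketch (device is assumed to have been created with at least two queues requested for family 0): the family index picks the "class", and each "instance" is fetched by its index within that family.

// Family index = which "class" of queue; queue index = which "instance" of it.
VkQueue graphicsQueue0 = VK_NULL_HANDLE;
VkQueue graphicsQueue1 = VK_NULL_HANDLE;
vkGetDeviceQueue (device, 0 /*queueFamilyIndex*/, 0 /*queueIndex*/, &graphicsQueue0);
vkGetDeviceQueue (device, 0 /*queueFamilyIndex*/, 1 /*queueIndex*/, &graphicsQueue1);
// Both have the same capabilities (same family); they are just separate submission points.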

On 7/11/2017 at 1:41 AM, JoeJ said:

Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage individual CUs at a much lower level. The hardware can do it, but we have no access; VK/DX12 is just a start...

Are you sure about that? AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.


Posted (edited)

2 hours ago, Hodgman said:

 

On 10.7.2017 at 5:41 PM, JoeJ said:

Personally I think the concept of queues is much too high level and totally sucks. It would be great if we could manage individual CUs at a much lower level. The hardware can do it, but we have no access; VK/DX12 is just a start...

Are you sure about that? AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.

I would have nothing against the queue concept, if only it worked.

You can have a look at the test project I submitted to AMD: https://github.com/JoeJGit/OpenCL_Fiji_Bug_Report/blob/master/async_test_project.rar

...if you are bored, but here is what I found:

You can run 3 low-work tasks without synchronization perfectly in parallel. Yeah, awesome.

As soon as you add sync, which is only possible by using semaphores, the advantage gets lost due to bubbles. (Maybe semaphores sync with the CPU as well? If so, we have a terrible situation here! We need GPU-only sync between queues.)

And here comes the best part: if you try larger workloads, e.g. 3 tasks with runtimes of 0.2 ms, 1 ms and 1 ms without async, then going async the first and second task run in parallel as expected, although 1 ms becomes 2 ms, so there is no win. But the third task rises to 2 ms as well, even though it runs alone with nothing else; its runtime is doubled for nothing.

It seems there is no dynamic work balancing happening here; it looks like the GPU gets divided somehow and refuses to merge back when possible.

2 hours ago, Hodgman said:

AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.

I guess not; the numbers don't match. A Fiji has 8 ACEs (if that's the correct name), but I see only 4 compute queues (1 gfx/CS + 3 CS). Nobody knows what happens under the hood, but it needs more work, at least on the drivers.

 

 

Access to individual CUs should not be necessary, you're right, guys. But I would be willing to tackle this if it were an improvement.

There are two situations where async compute makes sense:

1. Doing compute while doing ALU-light rendering work. (Not yet tried; all my hope goes into this, but not everyone has rendering work.)

2. Parallelizing and synchronizing low-work compute tasks; extremely important if we look towards more complex algorithms that reduce work instead of brute-forcing everything. And sadly this fails so far.

Edited by JoeJ

I think part of your disappointment comes from assuming that the GPU won't already be running computations asynchronously, in parallel, in the first place, which means you expect "async compute" to give a huge boost when you've actually gotten that boost already.

In a regular situation, if you submit two dispatch calls "A" and "B" sequentially, which each contain 8 wavefronts, the GPU's timeline will hopefully look like this:
[timeline diagram: the front-end feeds wavefronts from both A and B, so the two dispatches are in flight at the same time]

Where it's working on both A and B concurrently.

If you go and add any kind of resource transition or sync between those two dispatches, then you end up with a timeline that looks more like:
[timeline diagram: A drains completely before B starts, leaving an idle bubble around the sync point]

If you simply want the GPU to work on as many compute tasks as possible, back to back, without any bubbles, then the new tool in Vulkan for optimizing that situation is manual control over barriers. D3D11/GL will use barriers all over the place where they aren't required (which creates these bubbles and disables concurrent processing of multiple dispatch calls), but Vulkan gives you the power to specify exactly when they're required.

Using multiple queues is not required for this optimization. The use of multiple queues requires extra barriers and synchronisation, which is the opposite of what you want. As you mention, a good use for a separate compute queue is so that you can keep the CUs fed while a rasterizer-heavy draw command list is being processed.
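To make that barrier-placement point concrete, a minimal sketch (cmd and the dispatch sizes are assumed): no barrier is recorded between the independent dispatches A and B, and a single barrier is placed only in front of C, which consumes B's results.

vkCmdDispatch (cmd, groupsA, 1, 1); // A: independent
vkCmdDispatch (cmd, groupsB, 1, 1); // B: independent of A, free to overlap with it

// C reads B's output, so one barrier goes here and only here.
VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
vkCmdPipelineBarrier (cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    0, 1, &barrier, 0, nullptr, 0, nullptr);

vkCmdDispatch (cmd, groupsC, 1, 1); // C: waits behind the barrier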

Also take note that the structure of these timelines makes profiling the time taken by your stages quite difficult. Note that the front-end processes "B" in between "A" and "A - end of pipe"! If you time from when A reaches the front of the pipe to when it reaches the end of the pipe, you'll also be counting some time taken by the "B" command! If you count the time from when "A" enters the pipe until when "B" enters the pipe, then your timings will be much shorter than reality. The more internal parallelism that you're getting out of the GPU, the more incorrect your timings of individual draws/dispatches will be. Remember to keep that in mind when analyzing any timing data that you collect.


Whooo! I already thought the driver could figure out a dependency graph and do things async automatically, but I also thought that this being reality would be wishful thinking.

This is too good to be true, so I'm still not ready to believe it :)

(Actually I have too many barriers, but soon I'll be able to push more independent work to the queue, and I'm curious whether I'll get a lot of it for free...)

Awesome! Thanks, Hodgman :)

 

 

 

 


The following quote is from the spec:

Quote

Command buffer boundaries, both between primary command buffers of the same or different batches or submissions as well as between primary and secondary command buffers, do not introduce any additional ordering constraints. In other words, submitting the set of command buffers (which can include executing secondary command buffers) between any semaphore or fence operations execute the recorded commands as if they had all been recorded into a single primary command buffer, except that the current state is reset on each boundary. Explicit ordering constraints can be expressed with explicit synchronization primitives.

I read this as meaning that for a single queue, command buffer boundaries don't really matter, at least where pipeline barriers are concerned. A pipeline barrier in one command buffer will halt execution not only of commands within that command buffer, but also of any subsequent commands in subsequent command buffers (of course only for those stages that the barrier applies to), even if those subsequent command buffers are independent and could be executed simultaneously. So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

That said, I feel this was an error/oversight IMHO. It seems clear (at least to me :) that semaphores are the go-to primitive for synchronization between separate command buffers, and hence it would make sense to me that pipeline barriers operate only within a particular command buffer. This way independent command buffers on the same queue could be fully independent, and there would be no need for other queues. Alas, this is not the case. Perhaps they had their reasons for not doing that; it's often hard to read between the lines and understand the 'why' from a specification. I guess to me, that's why queues feel like such a mess. Vulkan has multiple ways of doing essentially the same thing, and it feels like the spec is stepping on its own toes. But perhaps it's just my OCD wanting a more orthonormal API.


I tried to verify with my test...

I have 3 shaders:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With a memory barrier after each dispatch I get 0.462 ms.

Without: 0.012 ms (a speedup of 38.5x).

To verify, I use 1 dispatch of 5500 wavefronts (the same work): 0.013 ms.

So yes, not only is the GPU capable of doing async compute perfectly with a single queue, we also see that the API overhead of multiple dispatches is practically zero :)

Finally I understand why memory barriers appeared so expensive to me. Shame on me, and all disappointment gone :D

7 hours ago, Ryan_001 said:

So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

If the work is truly independent, you won't have any barriers and could use one queue just fine.


Posted (edited)

1 hour ago, Hodgman said:
9 hours ago, Ryan_001 said:

So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

If the work is truly independent, you won't have any barriers and could use one queue just fine.

Yes, using one queue is faster even if there are no memory barriers / semaphores. Submitting a command buffer has a noticeable cost, so putting all work in one command buffer and one queue is the fastest way.

I also tested using queues from different families vs. all from one family, which had no effect on performance.

All tests used only compute shaders.

Now I don't see a need to use multiple queues other than for uploading/downloading data. Maybe using 2 queues makes sense if we want to do compute and graphics, but I guess 1 queue is better here too.

Edit: Maybe using multiple queues results in dividing work strictly between CUs, while using one queue can distribute multiple dispatches on the same CU. If so, maybe we could avoid some cache thrashing by grouping work with similar memory access together. But I guess cases where this wins would be extremely rare.

 

Edited by JoeJ

Posted (edited)

3 hours ago, Hodgman said:

If the work is truly independent, you won't have any barriers and could use one queue just fine.

With all due respect, not necessarily. Assuming I'm reading the spec correctly...

Imagine you had 3 units of work, each in its own command buffer, and each fully independent from the other 2. Now within each command buffer there is still a pipeline barrier, because while each command buffer is completely independent from the others, there are dependencies within the commands of each individual command buffer. You could submit these 3 command buffers to 3 different queues and they could run (theoretically) asynchronously/in any order.

Now if pipeline barriers were restricted to a given command buffer, then submitting these 3 command buffers to a single queue would also yield asynchronous performance. But as it stands, submitting these 3 command buffers to a single queue will cause stalls/bubbles, because pipeline barriers work across command buffer boundaries. The pipeline barrier in command buffer 1 will not only cause commands in buffer 1 to wait but also commands in buffers 2 and 3, even though those commands are independent and need not wait on the pipeline barrier.

This change would also give a bit of purpose to secondary command buffers, for which (at this time) I see little use.

Now I just need to convince the Vulkan committee of what a great idea retroactively changing the spec is, and that breaking everyone's code is no big deal.  /sarcasm

Edited by Ryan_001
6 minutes ago, Ryan_001 said:

Imagine you had 3 units of work, each in their own command buffer, and each fully independent from the other 2.  Now within each command buffer there is still a pipeline barrier because while each command buffer is completely independent from the others, there are dependencies within the commands of each individual command buffer.  

Ah ok, I didn't get the last assumption: that the big-picture work of each buffer is independent, but contains internal dependencies.

Yes, in theory multiple queues could be used to allow commands from another queue to be serviced while one queue is blocked doing some kind of barrier work. In practice, on current hardware, I don't know if this makes any difference though; the "barrier work" will usually be made up of a command along the lines of "halt the front-end from processing any command from any queue until all write-through traffic has actually been flushed from the L2 cache to RAM"... In the future there may be a use for this though.

I don't know if the Vulkan spec allows for it, but another use of multiple queues is prioritization. If a background app is using the GPU at the same time as a game, it would be wise for the driver to set the game's queues as high priority and the background app's queues as low priority. Likewise, if your gameplay code itself uses GPU compute, you could issue its commands via a "highest/realtime" priority queue which is configured to immediately interrupt any graphics work and do the compute work immediately, which would allow you to perform GPGPU calculations without the typical one-frame delay. Again, I don't know if this is possible (yet) on PCs either.

12 minutes ago, Ryan_001 said:

This change would also give a bit of purpose to secondary command buffers, for which (at this time) I see little use.

 

AFAIK, they're similar to "bundles" in D3D12 or display lists in GL, which are meant for saving on the CPU cost of repeatedly re-recording draw commands for a particular model every frame, and instead re-using a micro command buffer over many frames.


Well, barriers in the spec are a little more fine-grained: you can pick the actual pipeline stages to halt on. For example, if you wrote to a buffer from the fragment shader and then read it from the vertex shader, you would put a pipeline barrier which would halt all subsequent vertex shader work (and later stages) from executing prior to the fragment shader completing. But I have the feeling you were talking about what the hardware actually does? In which case you are probably right; I have no idea how fine-grained the hardware really is.
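A hedged sketch of that fragment-write then vertex-read case (someBuffer and cmd are assumed to exist elsewhere); only the named stages are involved, so earlier stages are left alone:

VkBufferMemoryBarrier bufBarrier = {};
bufBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
bufBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // written by the fragment shader
bufBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;  // read by the vertex shader
bufBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bufBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bufBarrier.buffer = someBuffer; // assumed: created elsewhere
bufBarrier.offset = 0;
bufBarrier.size = VK_WHOLE_SIZE;

vkCmdPipelineBarrier (cmd,
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, // wait for prior fragment work...
    VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,   // ...before later vertex work runs
    0, 0, nullptr, 1, &bufBarrier, 0, nullptr);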

The spec does support queue priority, sort of:

Quote

4.3.4. Queue Priority

Each queue is assigned a priority, as set in the VkDeviceQueueCreateInfo structures when creating the device. The priority of each queue is a normalized floating point value between 0.0 and 1.0, which is then translated to a discrete priority level by the implementation. Higher values indicate a higher priority, with 0.0 being the lowest priority and 1.0 being the highest.

Within the same device, queues with higher priority may be allotted more processing time than queues with lower priority. The implementation makes no guarantees with regards to ordering or scheduling among queues with the same priority, other than the constraints defined by any explicit synchronization primitives. The implementation make no guarantees with regards to queues across different devices.

An implementation may allow a higher-priority queue to starve a lower-priority queue on the same VkDevice until the higher-priority queue has no further commands to execute. The relationship of queue priorities must not cause queues on one VkDevice to starve queues on another VkDevice.

No specific guarantees are made about higher priority queues receiving more processing time or better quality of service than lower priority queues.

As I read it, this doesn't allow one app to queue itself higher than another, and only affects queues created on a single VkDevice. Now whether any hardware actually does this... you would know better than I, I imagine.
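For reference, a minimal sketch of how those priorities are requested at device-creation time (illustrative only; whether the driver honours them is implementation-defined):

float priorities[2] = { 1.0f, 0.5f }; // queue 0: high, queue 1: low

VkDeviceQueueCreateInfo queueInfo = {};
queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfo.queueFamilyIndex = 0; // assumed: the graphics/compute family
queueInfo.queueCount = 2;
queueInfo.pQueuePriorities = priorities;
// ...passed via VkDeviceCreateInfo::pQueueCreateInfos to vkCreateDevice.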

As far as secondary command buffers go, I've seen that suggested. I don't disagree; it's just that I don't see it being faster than just recording a bunch of primary command buffers in most circumstances. The only 2 situations I could come up with were (a sketch of the first follows below):

1) The small command buffers are all within the same render pass, in which case you would need secondary command buffers.

2) You have way too many (thousands? millions?) small primary command buffers, and that might cause some performance issues on submit, so recording them as secondary buffers and using another thread to bundle them into a single primary one might make the submit faster.
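A minimal sketch of case 1 (renderPass, framebuffer and the command buffers are assumed to exist elsewhere): the secondary buffer is recorded once with inheritance info and replayed from the primary inside the render pass, which must be begun with VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS.

// Record the secondary buffer once, with inheritance info for the render pass.
VkCommandBufferInheritanceInfo inherit = {};
inherit.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
inherit.renderPass = renderPass;   // assumed: created elsewhere
inherit.subpass = 0;
inherit.framebuffer = framebuffer; // optional, but can help the driver

VkCommandBufferBeginInfo beginSecondary = {};
beginSecondary.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginSecondary.flags = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
beginSecondary.pInheritanceInfo = &inherit;
vkBeginCommandBuffer (secondaryCmd, &beginSecondary);
// ...per-model draws recorded once, reused over many frames...
vkEndCommandBuffer (secondaryCmd);

// Later, inside the primary command buffer's render pass:
vkCmdExecuteCommands (primaryCmd, 1, &secondaryCmd);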


Posted (edited)

Some interesting points. I made this test now:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With memory barrier after each dispatch and 1 queue: 0.46 ms

With memory barrier after each dispatch and 3 queues, one per shader: 0.21 ms

 

So we can use multiple queues to keep working while another queue is stalled.

I'll modify my test to see if I could still use one queue for the same purpose, by setting the barriers' memory ranges per shader within the same buffer, or by using multiple buffers per shader...

EDIT1:

...but first I tried to give the first shader 5 times more work than shaders 2 & 3. Previously all shaders did the same calculations, so I couldn't be sure that a barrier on queue 1 does not stall queue 0 as well, because the barriers happen at the same time. Now I see shader 1 still completes first and is slightly faster than the other two, so it is not affected by their barriers :)

Runtime with 3 queues: 0.18 ms, with 1 queue: 0.44 ms (not the first time I've seen that doing more work is faster on small loads).
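For reference, a sketch of the queue setup a test like this would use (assuming queueCount >= 3 was requested for family 0 at device creation, and the three command buffers were recorded elsewhere); fences/semaphores are omitted:

VkQueue queues[3];
for (uint32_t i = 0; i < 3; i++)
    vkGetDeviceQueue (device, 0 /*queueFamilyIndex*/, i, &queues[i]);

VkCommandBuffer cmds[3] = { cmdShader1, cmdShader2, cmdShader3 }; // assumed: recorded elsewhere
for (uint32_t i = 0; i < 3; i++)
{
    VkSubmitInfo submit = {};
    submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmds[i];
    vkQueueSubmit (queues[i], 1, &submit, VK_NULL_HANDLE);
}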

 

 

 

 

 

Edited by JoeJ
23 minutes ago, JoeJ said:

Some interesting points. I made this test now:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With memory barrier after each dispatch and 1 queue: 0.46 ms

With memory barrier after each dispatch and 3 queues, one per shader: 0.21 ms

 

So we can use multiple queues to keep working while another queue is stalled.

I'll modify my test to see if I could still use one queue for the same purpose, by setting the barriers' memory ranges per shader within the same buffer, or by using multiple buffers per shader...

It's good to know that theory and practice align, at least for this :) Nice work. I'm curious, what sort of barrier parameters are you using?

12 minutes ago, Ryan_001 said:

It's good to know that theory and practice align, at least for this :) Nice work. I'm curious, what sort of barrier parameters are you using?

BufferMemoryBarriers; here's the code.

I've left the commented-out attempts in to illustrate how much trial and error the spec leaves us with; would you have guessed that you need to set VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT for an indirect compute dispatch? :)

(Of course I could remove it here, as I'm only writing some prefix sum results and no dispatch count, but offset and size become interesting now...)

 

    void MemoryBarriers (VkCommandBuffer commandBuffer, int *bufferList, const int numBarriers)
    {
        int const maxBarriers = 16;
        assert (numBarriers <= maxBarriers);

        VkBufferMemoryBarrier bufferMemoryBarriers[maxBarriers] = {};
        //VkMemoryBarrier memoryBarriers[maxBarriers] = {};

        for (int i=0; i<numBarriers; i++)
        {
            bufferMemoryBarriers[i].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
            //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
            bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
            bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            bufferMemoryBarriers[i].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
            bufferMemoryBarriers[i].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
            bufferMemoryBarriers[i].buffer = buffers[bufferList[i]].deviceBuffer;
            bufferMemoryBarriers[i].offset = 0;
            bufferMemoryBarriers[i].size = VK_WHOLE_SIZE;

            //memoryBarriers[i].sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
            //memoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;// | VK_ACCESS_SHADER_WRITE_BIT;
            //memoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;// | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
        }

        vkCmdPipelineBarrier(
            commandBuffer,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
            0,//VkDependencyFlags
            0, NULL,//numBarriers, memoryBarriers,//
            numBarriers, bufferMemoryBarriers,
            0, NULL);
    }

    void Record (VkCommandBuffer commandBuffer, const uint32_t taskFlags,
        int profilerStartID, int profilerStopID, bool profilePerTask = true, bool use_barriers = true)
    {
        VkCommandBufferBeginInfo commandBufferBeginInfo = {};
        commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        commandBufferBeginInfo.flags = 0;//VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;

        vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo);

#ifdef USE_GPU_PROFILER
        if (profilerStartID>=0) profiler.Start (profilerStartID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
#endif

        if (taskFlags & (1<<tTEST0))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST0], 0, 1, &descriptorSets[tTEST0], 0, nullptr);

            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST0]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST0};
            for (int i=0; i<TASK_COUNT_0; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (0 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

        if (taskFlags & (1<<tTEST1))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST1], 0, 1, &descriptorSets[tTEST1], 0, nullptr);

            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST1]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST1};
            for (int i=0; i<TASK_COUNT_1; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (200 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

        if (taskFlags & (1<<tTEST2))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST2], 0, 1, &descriptorSets[tTEST2], 0, nullptr);

            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST2]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST2};
            for (int i=0; i<TASK_COUNT_2; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (400 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

#ifdef USE_GPU_PROFILER
        if (profilerStopID>=0) profiler.Stop (profilerStopID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
#endif

        vkEndCommandBuffer(commandBuffer);
    }


OK, so finally, and as expected, it makes no difference between these options:

Use barriers for unique buffers per task.

Use barriers for non-overlapping memory regions per task, but the same buffer for all.

The driver could figure out that it can still go async with 1 queue in both cases, but it does not. Just like the spec says.

I hope I've set up everything correctly (still unsure about the difference between VK_ACCESS_MEMORY_WRITE_BIT and VK_ACCESS_SHADER_WRITE_BIT, but it did not matter).

So the conclusion is:

We have to use multiple queues to keep the GPU busy while pipeline barriers stall one of them.

We should reduce sync between queues to a minimum.

A bit more challenging than initially thought, and I hope 2 saturating tasks in 2 queues don't slow each other down too much. If they do, we need more sync to prevent this and it becomes a hardware-dependent balancing act. But I'm optimistic and it all makes sense now.

 


Posted (edited)

Interesting, I don't know if you need VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT there.

Quote
  • VK_ACCESS_MEMORY_READ_BIT specifies read access via non-specific entities. These entities include the Vulkan device and host, but may also include entities external to the Vulkan device or otherwise not part of the core Vulkan pipeline. When included in a destination access mask, makes all available writes visible to all future read accesses on entities known to the Vulkan device.

  • VK_ACCESS_MEMORY_WRITE_BIT specifies write access via non-specific entities. These entities include the Vulkan device and host, but may also include entities external to the Vulkan device or otherwise not part of the core Vulkan pipeline. When included in a source access mask, all writes that are performed by entities known to the Vulkan device are made available. When included in a destination access mask, makes all available writes visible to all future write accesses on entities known to the Vulkan device.

I read that as meaning the memory read/write bits are for things outside the normal Vulkan scope, like the presentation/windowing system. The demos/examples I looked at also never included those bits. I agree with you completely that the spec leaves a lot of things ambiguously defined. What surprised me a bit was that image layout transitions are considered both a read and a write, so you have to include access/stage masks for the hidden read/write that occurs during transitions.

This thread has helped clarify a lot of these things.

I wrote my own pipeline barrier wrapper, which I found made a lot more sense (apart from not really understanding what VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT mean). The whole thing isn't important, but you might find the flag enumeration interesting.

enum class MemoryDependencyFlags : uint64_t {

	none											= 0,

	indirect_read								= (1ull << 0),				// VK_ACCESS_INDIRECT_COMMAND_READ_BIT + VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
	index_read								= (1ull << 1),				// VK_ACCESS_INDEX_READ_BIT + VK_PIPELINE_STAGE_VERTEX_INPUT_BIT
	attribute_vertex_read				= (1ull << 2),				// VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT + VK_PIPELINE_STAGE_VERTEX_INPUT_BIT

	uniform_vertex_read					= (1ull << 3),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	uniform_tess_control_read		= (1ull << 4),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	uniform_tess_eval_read			= (1ull << 5),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	uniform_geometry_read			= (1ull << 6),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	uniform_fragment_read				= (1ull << 7),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	uniform_compute_read				= (1ull << 8),				// VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT

	shader_vertex_read					= (1ull << 9),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	shader_vertex_write					= (1ull << 10),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	shader_tess_control_read			= (1ull << 11),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	shader_tess_control_write		= (1ull << 12),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	shader_tess_eval_read				= (1ull << 13),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	shader_tess_eval_write				= (1ull << 14),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	shader_geometry_read				= (1ull << 15),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	shader_geometry_write			= (1ull << 16),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	shader_fragment_read				= (1ull << 17),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	shader_fragment_write				= (1ull << 18),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	shader_compute_read				= (1ull << 19),				// VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
	shader_compute_write				= (1ull << 20),				// VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT

	attachment_fragment_read		= (1ull << 21),				// VK_ACCESS_INPUT_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	attachment_color_read				= (1ull << 22),				// VK_ACCESS_COLOR_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
	attachment_color_write			= (1ull << 23),				// VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
	attachment_depth_read_early	= (1ull << 24),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT
	attachment_depth_read_late		= (1ull << 25),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT
	attachment_depth_write_early	= (1ull << 26),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT
	attachment_depth_write_late	= (1ull << 27),				// VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT

	transfer_read							= (1ull << 28),				// VK_ACCESS_TRANSFER_READ_BIT + VK_PIPELINE_STAGE_TRANSFER_BIT
	transfer_write							= (1ull << 29),				// VK_ACCESS_TRANSFER_WRITE_BIT + VK_PIPELINE_STAGE_TRANSFER_BIT

	host_read									= (1ull << 30),				// VK_ACCESS_HOST_READ_BIT + VK_PIPELINE_STAGE_HOST_BIT
	host_write								= (1ull << 31),				// VK_ACCESS_HOST_WRITE_BIT + VK_PIPELINE_STAGE_HOST_BIT

	memory_read							= (1ull << 32),				// VK_ACCESS_MEMORY_READ_BIT
	memory_write							= (1ull << 33),				// VK_ACCESS_MEMORY_WRITE_BIT
	};

The formatting is a mess, but you get the idea. Only certain combinations of stage + access are allowed by the spec; by enumerating them, it became far clearer which to pick. I can then directly convert these to the associated stage + access masks without any loss in expressiveness/performance (or at least there shouldn't be, if I understand things correctly).
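A hypothetical sketch of that conversion (only a few bits shown; the helper name and structure are illustrative, not part of the posted wrapper):

struct VkMasks { VkAccessFlags access; VkPipelineStageFlags stages; };

inline VkMasks ToVulkanMasks (MemoryDependencyFlags f)
{
    VkMasks m = { 0, 0 };
    auto has = [&] (MemoryDependencyFlags bit) { return (uint64_t (f) & uint64_t (bit)) != 0; };

    if (has (MemoryDependencyFlags::indirect_read))
    { m.access |= VK_ACCESS_INDIRECT_COMMAND_READ_BIT; m.stages |= VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT; }
    if (has (MemoryDependencyFlags::shader_compute_read))
    { m.access |= VK_ACCESS_SHADER_READ_BIT; m.stages |= VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT; }
    if (has (MemoryDependencyFlags::shader_compute_write))
    { m.access |= VK_ACCESS_SHADER_WRITE_BIT; m.stages |= VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT; }
    // ...remaining bits follow the comments in the enum above...
    return m;
}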

Edited by Ryan_001

Copy that; it's good to compare your own guessing against the guessing of others :D

(I see you don't cover the case of a compute shader writing an indirect dispatch count.)

I wonder if Events could help here: http://vulkan-spec-chunked.ahcox.com/ch06s03.html

I have not used them yet. Could I do something like triggering a memory barrier, processing some other work, then waiting on the barrier with a high chance it has already been satisfied?

I really need a Vulkan for Dummies that tells me some use cases for such things...
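That pattern is roughly what events are for; a hedged sketch (event, buffer and cmd are assumed to have been created elsewhere): signal after the producer, record unrelated work, then wait with the same barrier contents right before the consumer.

vkCmdDispatch (cmd, producerGroups, 1, 1); // writes `buffer`
vkCmdSetEvent (cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

// ...record independent dispatches here; they are not held up by the event...

VkBufferMemoryBarrier bb = {};
bb.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
bb.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
bb.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
bb.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bb.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bb.buffer = buffer;
bb.offset = 0;
bb.size = VK_WHOLE_SIZE;

vkCmdWaitEvents (cmd, 1, &event,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // srcStageMask: where the event was set
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // dstStageMask: the consumer
    0, nullptr, 1, &bb, 0, nullptr);

vkCmdDispatch (cmd, consumerGroups, 1, 1); // reads `buffer`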

 

5 minutes ago, JoeJ said:

(I see you don't cover the case of a compute shader writing an indirect dispatch count.)

I'm not sure exactly what you mean. The flags are pretty much taken verbatim from Table 4 of the spec: https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VkPipelineStageFlagBits (scroll down a screen or two).

I haven't played around with indirect stuff yet. I'm assuming you write the commands to a buffer (either through memmap/staging buffer/copy, or through a compute shader or similar), then use that buffer as the source for the indirect command, correct? If I were transferring from the host, I'd use host_write or transfer_write as my source flags (depending on whether or not I used a staging buffer), and then I'd use indirect_read as my dest flags. If I were computing the buffer on the fly, would you not use shader_compute_write as src and indirect_read as dest?
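A small sketch of that compute-write then indirect-read case (indirectArgsBuffer and cmd are assumed): the barrier pairs VK_ACCESS_SHADER_WRITE_BIT at the compute stage with VK_ACCESS_INDIRECT_COMMAND_READ_BIT at the indirect stage.

VkBufferMemoryBarrier indirectBarrier = {};
indirectBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
indirectBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;          // shader_compute_write
indirectBarrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT; // indirect_read
indirectBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
indirectBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
indirectBarrier.buffer = indirectArgsBuffer;
indirectBarrier.offset = 0;
indirectBarrier.size = VK_WHOLE_SIZE;

vkCmdPipelineBarrier (cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
    0, 0, nullptr, 1, &indirectBarrier, 0, nullptr);

vkCmdDispatchIndirect (cmd, indirectArgsBuffer, 0);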

24 minutes ago, JoeJ said:

I really need a Vulkan for Dummies that tells me some use cases for such things...

Isn't that an oxymoron? :)

11 minutes ago, Ryan_001 said:

If I were computing the buffer on the fly would you not use shader_compute_write as src, and indirect_read as dest?

Oh sorry, yes. I confused this with the dstStageMask from vkCmdPipelineBarrier().


