• ### Similar Content

• I have pretty good experience with multi-GPU programming in D3D12. Now looking at Vulkan, although there are a few similarities, I cannot wrap my head around a few things due to the extremely sparse documentation (typical Khronos...)
In D3D12 -> You create a resource on GPU0 that is visible to GPU1 by setting the VisibleNodeMask (e.g. 00000011, where the last two bits set mean it is visible to GPU0 and GPU1).
In Vulkan -> I can see there is the VkBindImageMemoryDeviceGroupInfoKHR struct, which you add to the pNext chain of VkBindImageMemoryInfoKHR before calling vkBindImageMemory2KHR. You also set the device indices, which I assume serve the same purpose as the VisibleNodeMask, except that instead of a mask it is an array of indices. So far, so good.
Let's look at a typical SFR scenario: render the left eye using GPU0 and the right eye using GPU1.
You have two textures. pTextureLeft is exclusive to GPU0, and pTextureRight is created on GPU1 but is visible to GPU0 so it can be sampled from GPU0 when we want to draw it to the swapchain. This is in the D3D12 world. How do I map this to Vulkan? Do I just set the device indices for pTextureRight to { 0, 1 }?
Now comes the command buffer submission part that is even more confusing.
There is the struct VkDeviceGroupCommandBufferBeginInfoKHR. It accepts a device mask, which I understand is similar to creating a command list with a certain NodeMask in D3D12.
So for GPU1 -> Since I am only rendering to pTextureRight, I need to set the device mask to 2? (00000010)
For GPU0 -> Since I only render to pTextureLeft and finally sample pTextureLeft and pTextureRight to render to the swapchain, I need to set the device mask to 1? (00000001)
The same applies to VkDeviceGroupSubmitInfoKHR?
Now the fun part: it does not work. Both command buffers render to their textures correctly; I verified this by reading back the textures and storing them as PNGs. The left texture is sampled correctly in the final composite pass, but I get black in the area where the right texture should appear. Is there something I am missing? Here is a code snippet too:
void Init()
{
    RenderTargetInfo info = {};
    info.pDeviceIndices = { 0, 0 };
    CreateRenderTarget(&info, &pTextureLeft);

    // Need to share this on both GPUs
    info.pDeviceIndices = { 0, 1 };
    CreateRenderTarget(&info, &pTextureRight);
}

void DrawEye(CommandBuffer* pCmd, uint32_t eye)
{
    // Begin with device mask depending on eye
    pCmd->Open(1 << eye);

    // Do the draw

    // If eye is 0, we need to do some extra work to composite pTextureLeft and pTextureRight
    if (eye == 0)
    {
        DrawTexture(0, 0, width * 0.5, height, pTextureLeft);
        DrawTexture(width * 0.5, 0, width * 0.5, height, pTextureRight);
    }

    // Submit to the correct GPU
    pQueue->Submit(pCmd, 1 << eye);
}

void Draw()
{
    DrawEye(pRightCmd, 1);
    DrawEye(pLeftCmd, 0);
}

• Hi,
I finally managed to get the DX11-emulating Vulkan device working, but everything is flipped vertically now because Vulkan has a different clip space. What are the best practices out there to keep these implementations consistent? I tried using a vertically flipped viewport, and while it works on an Nvidia 1050, the Vulkan debug layer throws errors saying this is not supported by the spec, so it might not work on other hardware. There is also the possibility of flipping the clip-space Y coordinate in the vertex shader before writing it out, but that requires changing and recompiling every shader. I could also bake it into the camera projection matrices, though I want to avoid that because then I'd need to track down every place in the engine where matrices are uploaded... Any chance of an easy extension or something? If not, I will probably go with changing the vertex shaders.

• I am publishing our ray tracing engine and products built on graphics APIs (C++, Vulkan API, GLSL 460, SPIR-V): https://github.com/world8th/satellite-oem
For end users I have no other products or test builds yet. I also have one simple glTF viewer example (source code only).
In 2016 I had the idea of replacing screen-space reflections, but in 2018 we finally decided to re-profile the project as the "basis of a render engine". In Q3 2017 it was finally ported to the Vulkan API.

• vkQueuePresentKHR is busy-waiting - i.e. wasting CPU cycles while waiting for vsync. Expected, sane behavior would of course be something akin to Sleep(0) until it can finish.
Windows 7, GeForce GTX 660.
Is this a common problem? Is there anything I can do to make it behave properly?

• I am working on reusing as many command buffers as I can by pre-recording them at load time. This gives a significant CPU boost, although now I cannot get GPU timestamps since there is no way to read them back. I map the readback buffer before reading and unmap it after reading is done. Does this mean I need a persistently mapped readback buffer?
void Init()
{
    beginCmd(cmd);
    cmdBeginQuery(cmd);
    // Do a bunch of stuff
    cmdEndQuery(cmd);
    endCmd(cmd);
}

void Draw()
{
    CommandBuffer* cmd = commands[frameIdx];
    submit(cmd);
}

The begin and end query do exactly what the names say.

# Vulkan Branching in compute kernels


## Recommended Posts

Hi,

I'm porting a D3D11 program I wrote to Vulkan; in this program I do volume ray casting.

It is a simple fog simulator I'm experimenting with.
At each ray step I need to do some operations that require a dynamic for loop, as the number of iterations is evaluated on the fly at each step along the ray.
This means no loop unrolling or compiler optimizations.
On the other hand, the HLSL code works just fine, keeping real-time speed.
In my Vulkan implementation, it just crashes after the compute fence times out because of resource locking.
What "fixed" the problem is using a constant value for the loop bound, of course, but that kind of kills my algorithm.

Has anybody tried anything like this? Any info on this matter? I would like to dig into this further.

Here's pseudo-code of my ray-marching algorithm:

vec3 ForEachStep()
{
    vec3 retVal = vec3(0.0);
    uint numIter = FunctionCallToDetermine();
    for (uint i = 0u; i < numIter; ++i)
    {
        retVal += FuncCall();
    }
    return retVal;
}

const uint numSteps = someValue;

void main()
{
    // do other unrelated stuff
    vec3 color = vec3(0.0);
    for (uint i = 0u; i < numSteps; ++i)
    {
        // do other unrelated stuff

        color += ForEachStep();

        // do other unrelated stuff
    }

    // store color to texture target
}


Cheers!

##### Share on other sites

In my Vulkan implementation, it just crashes after the compute fence times out because of resource locking.

So the shader finishes, and you get the crash afterwards?

Try a vkQueueWaitIdle() after vkQueueSubmit(), to quickly ensure all work has finished and see if it prevents the crash.

Or does the shader never finish? Usually this causes a bluescreen or an unstable system. (Do NOT save your source files in this case - reboot instantly. I've lost some work this way.)

Probably an infinite loop - implement an additional max counter to prevent this and see if the crash goes away.

Can you post some performance comparison if you get it to work?

##### Share on other sites

Fixed it! I had overlooked a uint underflow when calculating the number of iterations. When the underflow occurred, it generated an infinite loop.

Can you post some performance comparison if you get it to work?

I can only access my laptop in the weekend, I'll post some comparisons tomorrow!

##### Share on other sites

I can only access my laptop in the weekend, I'll post some comparisons tomorrow!

Hey, sorry for the late reply!

The GPU I've used for comparison is a GTX 970 3GB.

SPIR-V renders at 4-5 fps @ 720p, whether ported from GLSL or compiled directly from HLSL.

HLSL renders at 25-27 fps @ 720p.

I did a second test with a bunch of spheres and simple Lambertian reflectors/metal materials, and the results are:

SPIR-V renders at ~2500 fps @ 720p

HLSL renders at ~400 fps @ 720p

Seems like that dynamic loop is killing SPIR-V!

##### Share on other sites

Thanks. Such varying results indicate they have some work to do on their drivers, and it's still worth trying different APIs. :(

I may add a DX12 code path sometime...

Seems like that dynamic loop is killing SPIR-V!

Maybe in this case, but in general dynamic loops should be no problem really (I have them everywhere).

You would need many more test cases to verify an assumption like this.

If we could compare generated machine code, we could at least find out where the driver sucks.

I'm working on a large project, and here I can compare Vulkan vs. OpenCL on AMD. Performance varies by about 20-30%; mostly Vulkan is faster.

I've had only one extreme case where VK was 4 times faster: when adding fp16->fp32 conversion, the CL compiler started wasting registers and occupancy went down.

Things like that are mostly the reason when 'good' code performs badly. Thus it's important to have a profiler tool showing register usage, occupancy, etc.

Unfortunately AMD's CodeXL can't do this yet for VK (just DX, GL & CL), so coding feels like being blindfolded at the moment.

##### Share on other sites

You would need to make much more test cases to verify an assumption like this.

Yes of course! Totally agreed.

If you are interested we can compare the machine code!

##### Share on other sites

NSight can show mapping between high level and machine code: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-view-assembly-code-correlation-nsight-visual-studio-edition/

Very nice. Try to see if this works for HLSL and Vulkan (although it's unlikely to work with GLSL->SPIR-V->PTX...)

Are you sure you upload the data properly to GPU memory using a staging buffer in Vulkan (data -> host-visible buffer, host-visible buffer -> device-local)?

Fetching the data from main memory would explain the bad performance.

##### Share on other sites

I'll try with NSight!

I've got a uniform buffer with a world matrix for the camera transform and a float4 filled with rendering-mode data and volume info (this could probably be arranged better using push constants for the volume info and render mode).
Volume data is a 3D texture loaded with staging.
Raytracer output is a 2D texture that is not staged, created with STORAGE | SAMPLED usage flags, accessed as storage in the compute stage and through a sampler in the fragment stage.

Would you do this differently?

##### Share on other sites

Seems like there's nothing wrong with that.

With NSight you should at least get exact timings for the various stages - maybe it's not the compute shader causing the low fps.

(This is also possible using vkCmdWriteTimestamp, but that's some extra work.)