JoeJ

Vulkan: How Do You Deal With Errors On GPUs? Do You At All?


I have had a Fury X since launch, and it has worked for every game without issues and still does.

 

Now, after starting to use it for development, I have to realize the card is broken.

It calculates wrong results in one of 60000 cases, causing blue screens in my case (Vulkan -> accidentally setting a huge loop count -> blue screen).

An older 5850 works without issues, and the test case is very simple (a prefix sum), so it's probably not a driver issue but a hardware failure.

 

I'll port to OpenCL to see if it's reproducible, but if I'm right, I really wonder why the card works in games.

 

Is it common practice to expect errors and handle them?

Is it true that we cannot expect GPUs to be as robust as CPUs? Some people say so, but to be honest, I thought they simply tend to forget barriers.

 

I've worked seriously with only 4-5 GPUs up to now, but they all did exactly what I told them - always, and at least for hours.

Please share your experience.

 

 

 

 

I'm in the camp that says yes, (consumer) GPUs are not as reliable as CPUs. They take a lot of shortcuts and optimizations that generate nearly-correct results. That said, a blue-screen failure due to a specific operation is definitely out of the ordinary, though using Vulkan in these early days definitely exposes you to such failures more than other APIs would. This actually sounds like a case where you should build a test case and submit it to AMD, as a lot of these weird corner cases can show up depending on the GPU, and they may not have noticed.

 

As far as handling errors - it is common to have workarounds for certain hardware configurations that are known to break. It is also common to have workarounds for particular drivers, particular operating systems, etc. These are all derived from testing before release (or afterwards...) to find out what works and what doesn't. What you don't have, though, is the ability to detect and handle GPU or driver errors in any sensible way. A blue screen is a kernel mode unhandled exception, and there is not a damn thing you can do about it after the bug has been invoked.


Thanks - I agree, but I already knew that. I'll describe the problem in more detail:

 

I use the prefix sum on uints to fill an acceleration structure.

The hardware bug (if that's what it is) seems to ignore my barriers, so it can happen that array[i+1] is smaller than array[i] - normally array[i+1] MUST be >= array[i].

(This also happens with a work group size of 64, but less often.)

 

Later, when processing for (uint i = array[j]; i < array[j+1]; i++), the difference overflows because of unsigned numbers and gives a huge number close to 0xFFFFFFFFu.

Now I don't know whether the long runtime or out-of-buffer writes cause the blue screen, but I know why it happens.

 

 

So you say "GPUs are not as reliable as CPUs" - OK, but this is not about accuracy; it's a complete malfunction.

 

I mean - nobody is going to check array[i+1] >= array[i] after a prefix sum, no matter if on GPU or CPU. Do you agree?

 

 

Hopefully it's just a Fury-related driver bug, but I'll double-check in OpenCL before wasting AMD's time...


Later, when processing for (uint i = array[j]; i < array[j+1]; i++), the difference overflows because of unsigned numbers and gives a huge number close to 0xFFFFFFFFu.
Now I don't know whether the long runtime or out-of-buffer writes cause the blue screen, but I know why it happens.

I'm 99% sure what's happening is that because you get a bad loop count, your GPU takes too long to respond, and thus you run into TDR (Windows' Timeout Detection and Recovery).
 

The hardware bug (if that's what it is) seems to ignore my barriers, so it can happen that array[i+1] is smaller than array[i] - normally array[i+1] MUST be >= array[i].
(This also happens with a work group size of 64, but less often.)

GPU threading is hard.

How are you issuing your barrier? Beware: in GLSL, memoryBarrier does only half of the job. You also need a barrier:

//memoryBarrierShared ensures our write is visible to everyone else (must be done BEFORE the barrier)
//barrier ensures every thread's execution reached here.
memoryBarrierShared();
barrier();

Share this post


Link to post
Share on other sites

No - I even replaced my own code with this code from the OpenGL SuperBible and added additional barriers. Still the same bugs.

 

layout (local_size_x = 128) in;
#define lID gl_LocalInvocationID.x
shared uint _texCount[257];

for (uint step = 0; step < 8; step++)
{
    uint mask  = (1 << step) - 1;
    uint rd_id = ((lID >> step) << (step + 1)) + mask;
    uint wr_id = rd_id + 1 + (lID & mask);
    uint r = _texCount[rd_id + 1];
    barrier(); memoryBarrierShared(); // paranoia
    _texCount[wr_id + 1] += r;
    barrier(); memoryBarrierShared();
}


The code you posted starts with _texCount uninitialized, which won't work as intended. It doesn't start out as zeros unless you fill it - and if you do fill it in your actual code, you need to synchronize that as well.


Good point! Of course I did initialize the data, but it could be that I upload wrong, huge numbers, causing an overflow on input.

I added this to ensure small numbers:

 

if (lID == 0)
{
    for (uint i = 0; i <= NUM_TEXELS; i++) _texCount[i] &= 0xFF;
}
memoryBarrierShared();
barrier();

 

But damn - it still happens. F....

I also remember that I already tried to fill it with all 1s. The bug mostly happens at array index 192, or 193 if I offset by one as in the code above.

I run the shader each frame, but on constant input data (I even stop uploading input data after some frames).

The bug happens at random indices across all 60000 work groups; the next frame it's correct there, only to fail somewhere else.

 

It's some work to port everything to OpenCL; after that, I'll post stripped-down complete shader code tomorrow.

I need to be sure, because I already gave slightly wrong information to AMD yesterday...


A day later, there's just more confusion.

In the OpenCL version, the exclusive prefix sum (1-256) always fails, with a very different error pattern, but the inclusive (0-255) version works.

In Vulkan, the exclusive version fails more often than the inclusive one, but still only rarely - no errors for a few seconds, then more and more...

The stripped-down version I made based on AMD's Vulkan GCN extension sample code works without issues - tested for half an hour.

All use the same code.

Also, the Doom Vulkan demo works - no pixel errors or anything.

 

So I'm pretty sure the GPU is OK, and a driver bug in both Vulkan AND OpenCL is unlikely.

Most probably I don't know what I'm doing... :)
