Problems when moving from an Nvidia to an ATI card / GPGPU performance comparison


Hi,

I have a complex compute shader project that refuses to work since I replaced my GTX 670/480 with an R9 280X.

Hopefully at the end I can give a useful GPGPU performance comparison without having to compare OpenCL against CUDA.

The first issue: I'm unable to modify a Shader Storage Buffer from the shader - maybe I'm missing some stupid little thing...

The setup code is this:

int sizeLists = sizeof(int) * 4096;
gpuData.dataDbgOut = (int*)_aligned_malloc (sizeLists, 16);

gpuData.dataDbgOut[0] = 10;
gpuData.dataDbgOut[1] = 20;
gpuData.dataDbgOut[2] = 30;
gpuData.dataDbgOut[3] = 40;

glGenBuffers (1, &gpuData.ssbDbgOut);
glBindBuffer (GL_SHADER_STORAGE_BUFFER, gpuData.ssbDbgOut);
glBufferData (GL_SHADER_STORAGE_BUFFER, sizeLists, gpuData.dataDbgOut, GL_DYNAMIC_COPY);
glBindBufferBase (GL_SHADER_STORAGE_BUFFER, 1, gpuData.ssbDbgOut);

gpuData.computeShaderTestATI = GL_Helper::CompileShaderFile ("..\\Engine\\shader\\gi_TestATI.glsl", GL_COMPUTE_SHADER, 1, includeAll);
gpuData.computeProgramHandleTestATI = glCreateProgram();
if (!gpuData.computeProgramHandleTestATI) { SystemTools::Log ("Error creating compute program object.\n"); return 0; }
glAttachShader (gpuData.computeProgramHandleTestATI, gpuData.computeShaderTestATI);
if (!GL_Helper::LinkProgram (gpuData.computeProgramHandleTestATI)) return 0;

Per frame code:

glBegin (GL_POINTS); glVertex3f (0,0,0); glEnd (); // <- remove this and it works

glUseProgram (gpuData.computeProgramHandleTestATI);
glDispatchCompute (1, 1, 1);
glMemoryBarrier (GL_ALL_BARRIER_BITS);

glBindBuffer (GL_SHADER_STORAGE_BUFFER, gpuData.ssbDbgOut);
int* result = (int*) glMapBuffer (GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
for (int i=0; i<4; i++) base_debug::logF->Print ("dbg: ", float(result[i]));
glUnmapBuffer (GL_SHADER_STORAGE_BUFFER);

Shader:

layout (local_size_x = 1) in;

layout (binding = 1, std430) buffer dbg_block
{
    uint dbgout[];
};

void main (void)
{
    dbgout[0] = 0;
    dbgout[1] = 1;
    dbgout[2] = 2;
    dbgout[3] = 3;
}

For the output I expect 0,1,2,3 as written by the shader, but it is still 10,20,30,40.

I've tried GL error checking but there are no errors, the shader program is definitely executed, and there are no shader compiler errors.

Any idea what's wrong? The GL version looks fine too:

OpenGL ok
GL Vendor : ATI Technologies Inc.
GL Renderer : AMD Radeon R9 200 Series
GL Version (string) : 4.3.12967 Compatibility Profile Context 14.200.1004.0
GL Version (integer) : 4.3
GLSL Version : 4.40

EDIT: Added the stupid little thing :)


OK, I found the reason: the problem happens if I do any immediate-mode draw call before executing the compute shader.

I tried to fix it with glGetProgramResourceIndex and glShaderStorageBlockBinding, but it did not help.
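
Roughly like this - a minimal sketch, reusing the program handle and block name from the setup code above:

// Re-query the storage block index after linking and force it onto binding point 1.
GLuint blockIndex = glGetProgramResourceIndex (gpuData.computeProgramHandleTestATI, GL_SHADER_STORAGE_BLOCK, "dbg_block");
if (blockIndex != GL_INVALID_INDEX)
    glShaderStorageBlockBinding (gpuData.computeProgramHandleTestATI, blockIndex, 1);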

I can't get rid of the old draw calls because I still use them to render the GUI in my testbed.

Maybe I'm doing something wrong with the context setup?

GLint attribs[] =
{
    WGL_CONTEXT_MAJOR_VERSION_ARB, 4,
    WGL_CONTEXT_MINOR_VERSION_ARB, 3,
    WGL_CONTEXT_PROFILE_MASK_ARB, WGL_CONTEXT_COMPATIBILITY_PROFILE_BIT_ARB,
    0
};
if(wglewIsSupported("WGL_ARB_create_context") == 1) hGLRC = wglCreateContextAttribsARB (hGLDC, 0, attribs);

So this might be super obvious, but since there is no indication in the code above: do you call glUseProgram(0) before doing immediate-mode rendering (and disable all the other stuff that needs to be disabled, like VAOs...)? I had no real problem getting this to work in a simple demo on an AMD HD 6950. From my experience, Nvidia cards are more forgiving when it comes to stuff like this.
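
Something along these lines right before the glBegin() call - just a sketch of what I mean by resetting state, extend it for whatever else you have bound:

// Unbind what the compute pass may have left active before going back to immediate mode.
glUseProgram (0);
glBindVertexArray (0);
glBindBuffer (GL_ARRAY_BUFFER, 0);
glBindBuffer (GL_SHADER_STORAGE_BUFFER, 0);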

Great idea! Unfortunately it did not help; I tried adding this:

glBindBuffer (GL_SHADER_STORAGE_BUFFER, 0);
glUseProgram (0);

... same bad result. I've also disabled everything else (no VAOs, textures, or render shaders in use - just that single immediate-mode point, the compute shader, and the SSB).

Super obvious things are most probably the ones I get wrong. I'm moving from GL 1.4 to 4.3, everything has changed, and I don't have a collection of solved pitfalls in my head yet.

You should honestly just remove everything that you don't need and slowly move away from the 1.4 stuff, as nothing will work as it should.

Having had immediate-mode code in my projects back when I was learning modern OpenGL, I can say everything can crash even if it's correct. glBegin() is a recipe for disaster.

Nothing helps, except moving to straight VAOs. Just make a VAO container and recreate the functionality you had before.

If it helps, here is my container:

https://github.com/fwsGonzo/library/blob/master/include/library/opengl/vao.hpp

https://github.com/fwsGonzo/library/blob/master/library/opengl/vao.cpp
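
At its core it is nothing more than this kind of wrapper - a heavily trimmed sketch, not the actual class from the repo:

// Minimal VAO container: one VBO of packed xyz positions, attribute 0 set up once,
// then a single bind + draw replaces the old glBegin/glEnd block.
struct SimpleVAO
{
    GLuint vao = 0, vbo = 0;
    GLsizei count = 0;

    void upload (const float* positions, GLsizei vertexCount)
    {
        if (vao == 0) { glGenVertexArrays (1, &vao); glGenBuffers (1, &vbo); }
        glBindVertexArray (vao);
        glBindBuffer (GL_ARRAY_BUFFER, vbo);
        glBufferData (GL_ARRAY_BUFFER, vertexCount * 3 * sizeof(float), positions, GL_STATIC_DRAW);
        glEnableVertexAttribArray (0);
        glVertexAttribPointer (0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
        glBindVertexArray (0);
        count = vertexCount;
    }

    void render (GLenum mode) const
    {
        glBindVertexArray (vao);
        glDrawArrays (mode, 0, count);
        glBindVertexArray (0);
    }
};

Your single test point then becomes one upload() of three floats and a render(GL_POINTS) per frame, with a trivial vertex/fragment program bound instead of the fixed-function pipeline.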

Yes, I came to the same conclusion and have already started replacing the immediate-mode code. Then I had an HD crash, reinstalled everything, and also tried an R9 285 beta driver that someone modified to work with older cards (Guru3D forum). Now the simple sample above works - but the GI algorithm I'm working on does not.

I guess I'll waste another few hours trying to make it work, but I'll probably end up plugging the NV card back in and finishing the algorithm completely before doing that boring work.

I had high hopes that AMD would be faster - maybe it is, but the comparison will have to wait a few weeks...

Thanks!

Curiosity won, and I got the two most important shaders to work, running on the core profile now.

Results:

GTX 480: 11.8 ms (4.5, 7.2)
R9 280X: 6.17 ms (4.0, 2.1)

I get those timings from GL timer queries and tried to limit the effect of GPU <-> CPU data transfers.

(Those queries are still affected a bit by such transfers, even though they shouldn't be, because the transfers happen before or after the measured range.)
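
Roughly like this - a minimal sketch of one GL_TIME_ELAPSED query around a single dispatch; groupsX is just a placeholder for the real dispatch size, and in practice the result should be read a frame later to avoid stalling:

// Time one dispatch with a timer query (reading the result right away blocks the CPU).
GLuint query;
glGenQueries (1, &query);
glBeginQuery (GL_TIME_ELAPSED, query);
glDispatchCompute (groupsX, 1, 1);
glMemoryBarrier (GL_SHADER_STORAGE_BARRIER_BIT);
glEndQuery (GL_TIME_ELAPSED);

GLuint64 elapsedNs = 0;
glGetQueryObjectui64v (query, GL_QUERY_RESULT, &elapsedNs); // waits until the result is available
double elapsedMs = double(elapsedNs) * 1.0e-6;
glDeleteQueries (1, &query);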

The first number is the total runtime in ms, averaged over multiple frames; the other two numbers show how long each shader runs.

The 1st shader is a tree traversal to find a LOD cut, one sample per thread; no synchronisation, and atomics usage is negligible.

The 2nd shader computes the interreflection of a given receiver sample with the emitter samples found in the 1st step.

Building an acceleration structure on the fly to speed up the visibility test means heavy use of atomics and synchronisation here.

All shaders suffer from fairly random memory access; I can't say anything yet about simpler brute-force problems.

I have no numbers for the GTX 670, but I remember the total was about the same as the 480.

The 670 is faster on shader 2 but slower on shader 1 compared to the 480 (the 480 seems generally about 20% faster when no atomics are used).

Tweaking the work group size for best results on each card is worth it; it might be good to give users the option to do that themselves, or to run an automatic test (see the sketch below).
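
One simple way to do the automatic test - a sketch only, where CompileComputeProgram and TimeDispatchMs are hypothetical helpers standing in for whatever the engine already has: inject the group size as a #define before compiling (the shader then uses layout (local_size_x = WG_SIZE) in; and must not contain its own #version line), time a few candidates, and keep the best.

#include <string>

// Hypothetical helpers: CompileComputeProgram() builds and links a compute program
// from source, TimeDispatchMs() runs it once and measures it with a timer query.
GLuint BuildWithGroupSize (const std::string& source, int wgSize)
{
    std::string patched = "#version 430\n#define WG_SIZE " + std::to_string (wgSize) + "\n" + source;
    return CompileComputeProgram (patched.c_str ());
}

int PickBestGroupSize (const std::string& source)
{
    int best = 64;
    double bestMs = 1.0e30;
    for (int wg : { 32, 64, 128, 256 })
    {
        GLuint prog = BuildWithGroupSize (source, wg);
        double ms = TimeDispatchMs (prog);   // e.g. averaged GL_TIME_ELAPSED queries
        if (ms < bestMs) { bestMs = ms; best = wg; }
        glDeleteProgram (prog);
    }
    return best;
}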

Hope this information is useful for some of you :)

