Suspiciously slow compute shader... Mem write bottleneck?

18 comments, last by galop1n 7 years, 3 months ago

Yes, gather only works with normalized coordinates in HLSL, but if you are memory bound, gather + ALU usually outperforms 4 loads with offsets. Once you have the result of the 1x1 test, you should be able to pin down the bottleneck (or at least rule out that culprit).
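Roughly, the two fetch variants being compared look like this. This is a minimal sketch only, assuming a hypothetical Texture2D<float> srcTex and a point-clamp sampler pointSampler; it is not the actual shader from this thread:

Texture2D<float> srcTex       : register(t0);
SamplerState     pointSampler : register(s0);

// One gather instruction fetches the 2x2 footprint around a texel corner.
// Gather takes normalized coordinates, so sample at (pixel + 1) * invSize,
// the corner shared by texels pixel..pixel+1 in x and y.
float4 FetchQuad_Gather(int2 pixel, float2 invSize)
{
    float2 uv = (float2(pixel) + 1.0) * invSize;
    return srcTex.GatherRed(pointSampler, uv); // one memory op returns four texels
}

// The same four texels fetched as four separate load instructions.
// (Component order here follows the documented Gather order; verify on your target.)
float4 FetchQuad_Loads(int2 pixel)
{
    float4 r;
    r.x = srcTex.Load(int3(pixel + int2(0, 1), 0));
    r.y = srcTex.Load(int3(pixel + int2(1, 1), 0));
    r.z = srcTex.Load(int3(pixel + int2(1, 0), 0));
    r.w = srcTex.Load(int3(pixel + int2(0, 0), 0));
    return r;
}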


gather + ALU usually outperforms 4 loads with offsets
Is that because gather uses special hardware, or does it somehow issue fewer memory reads?

Thanks

JoeJ, on 07 Jan 2017 - 08:37 AM, said: On GCN this can get extremely slow if the various buf_uavDataX buffers have an exact offset of a power of 2 in memory, which is likely to happen for 512*512 images.

Thanks JoeJ, but could you elaborate on that a little more? Are you talking about something related to bank conflicts?

See here, starting at 'Channel Conflicts':

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_18986

You could calculate your bandwidth requirements and compare them with the GPU specs to estimate the time the shader should take (I guess the shader is totally memory bound).

If you are far off, such conflicts may be the reason.

Maybe you could also fix this at the time you allocate the memory for the buf_uavData buffers, by putting them all in one allocation and inserting a small offset (1 unused float4) between them to change their stride.
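A minimal sketch of that padded-stride idea, assuming hypothetical names (buf_uavAll, NUM_PIXELS, PAD) and float elements; the exact element type and counts would have to match the real shader:

#define NUM_PIXELS (512 * 424)
#define PAD        4                       // one unused float4 worth of floats
#define SUB_STRIDE (NUM_PIXELS + PAD)

// All outputs share one allocation; PAD shifts each sub-buffer off a power-of-two offset.
RWStructuredBuffer<float> buf_uavAll : register(u0);

void WriteOutput(uint subBuffer, uint pixelIdx, float value)
{
    buf_uavAll[subBuffer * SUB_STRIDE + pixelIdx] = value;
}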

Thanks JoeJ. It feels so intimidating that there are so many things to consider if you want your shader to run at full speed :(

Yep, I wish I knew more about how this relates to graphics, where powers of two are everywhere.

There's also no public discussion, blog posts, etc. about it.

Let me try to calculate your bandwidth needs:

512 * 424 * 28 * 4 = 24 MB written

Each thread reads 28 bytes; I guess you have one thread per pixel, so

512 * 424 * 28 = 6 MB read

Total: 30 MB - that's nothing. Did I sum up correctly?

The 680M has a bandwidth of 115.2 GB/s, so almost 2 GB per 60 Hz frame.

It can't be a bandwidth limit; register usage doesn't seem that bad, and there's no LDS. This should be 10 times faster.

Conclusion: I'm pretty sure it's about bank or channel conflicts.

For a quick test, you can try:

buf_uavData0[uIdx] = 0.f;
buf_uavData1[uIdx+1] = 0.f;
buf_uavData2[uIdx+2] = 0.f;
buf_uavData3[uIdx+3] = 0.f;
buf_uavData4[uIdx+4] = 0.f;
buf_uavData5[uIdx+5] = 0.f;
buf_uavData6[uIdx+6] = 0.f;

and use the same pattern for the other writes and also the reads.

Does this change anything?

I would still like to hear about the 1x1 texture read test. And a dumb question while I am at it: did you remember to divide the resolution by 8 in the dispatch call?

Also, from my understanding of bank conflicts, none of what was said applies; they happen at the warp level and are local to a single instruction. Because your writes use the dispatch ID, every single write already interacts with all the banks, and there is no conflict.

To get a conflict, you would need a line like: uav[dtid.x * 32] = foo;

To get a conflict, you would need a line like: uav[dtid.x * 32] = foo;

My assumption is that the various UAV buffers have a power-of-two offset from each other,

like &buf_uavData0[0] + 32 == &buf_uavData1[0], of course with a larger offset.

If that's the case, conflicts may happen - do you agree?

512 x 424 does not give a power of two, but a multiple of one, so is it possible?

(I assume the target buf_uavData buffers all have the same size.)

Edit: I read too quickly.

What you mean is

mem[threadID * powerOf2] = 0

for a single parallel instruction;

what I mean is

mem[index + powerOf2*1] = 0

mem[index + powerOf2*2] = 0

for successive instructions.

I think both are problems, but I need to make sure I remember correctly...
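To illustrate the distinction in shader terms (a sketch with made-up buffers bufA/bufB, not the code from this thread):

RWStructuredBuffer<float> bufA : register(u0);
RWStructuredBuffer<float> bufB : register(u1); // assume its base address sits a large power of two past bufA's

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Pattern 1 (galop1n): one instruction whose lanes stride by a power of two.
    // All lanes of the wave land on the same memory channel, so the access serializes.
    bufA[dtid.x * 32] = 0.0f;

    // Pattern 2 (JoeJ): successive instructions. Each wave-wide write is fine on its
    // own, but if the buffers' base addresses differ by a power of two, one write
    // after the other keeps hitting the same channel.
    bufA[dtid.x] = 0.0f;
    bufB[dtid.x] = 0.0f;
}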

After looking through JoeJ's bandwidth calculation, I decided to look deeper into my dispatch call (some of you already suggested that, but since my dispatch wrapper takes a thread count and does the round-up divide internally, I didn't believe I could make such a dumb mistake... it turned out I am even dumber......)

I struggled a lot over whether to post this or not since it's so embarrassing......

Lesson learned:

do not use default params


void Dispatch2D(
        size_t ThreadCountX, size_t ThreadCountY,
        size_t GroupSizeX = 8, size_t GroupSizeY = 8);
void Dispatch3D(
        size_t ThreadCountX, size_t ThreadCountY, size_t ThreadCountZ,
        size_t GroupSizeX = 8, size_t GroupSizeY = 8, size_t GroupSizeZ = 8);

the dispatch call I made before: Dispatch2D(512, 424, 1);

the correct dispatch call: Dispatch3D(512, 424, 1); // the previous one was a typo: with Dispatch2D, the 1 is taken as GroupSizeX rather than a Z thread count

Also, my THREAD=8 define clashed with another one in one of the shader's include files, so it got redefined to 128.......
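One way to make that kind of macro clash fail loudly instead of silently changing the group size; the name THREAD and the 8x8 group shape are just illustrative, not the actual shader:

// Fail the compile if an include already defined THREAD to a different value.
#ifdef THREAD
    #if THREAD != 8
        #error "THREAD already defined with a different value"
    #endif
#else
    #define THREAD 8
#endif

[numthreads(THREAD, THREAD, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // ... shader body ...
}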

After fixing all my dumbness, it now takes 0.35 ms.... (0.25 ms is the theoretical limit based on 115.2 GB/s)

Though this is super embarrassing, I really appreciate all you guys' help. And I will double-check everything before I post questions here in the future.

Ha ha :)

But that's a very common mistake. To catch it, I atomically increment a debug buffer value from each workgroup and display the value on screen when bringing up a new shader.

This way, strangely huge or tiny numbers catch my eye faster :)
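A minimal sketch of that debug counter; the register slot, names, and group size are placeholders, and the buffer is assumed to be cleared to 0 before the dispatch:

RWStructuredBuffer<uint> debugCounter : register(u7);

[numthreads(8, 8, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    // One lane per workgroup bumps the counter; reading it back (or drawing it on
    // screen) immediately shows whether the expected number of groups was launched.
    if (all(gtid == 0))
        InterlockedAdd(debugCounter[0], 1);

    // ... rest of the shader ...
}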

I made sure my perf drop was indeed caused by successive instructions and the offsets between them, NOT by offsets within a single parallel instruction.

(The writes were even arranged in an inner loop that was impossible to unroll, so they weren't even tightly successive.)

But the OpenCL optimization guide I linked talks only about the parallel-instruction case galop1n mentioned.

I have never heard of something like this elsewhere. Maybe it's caused by writes being bundled together somewhere, or (unlikely) the compiler managed to convert them to a parallel write.

So keep it in mind, guys - maybe it hits you as well some time soon... :)

Haha, I also felt embarrassed asking the dispatch parameter question; it's such a dumb mistake that it's usually caught quicker :)

Glad you solved your problem, which had nothing to do with bandwidth, bank conflicts, or whatever crazy theory :)

This topic is closed to new replies.
