
[Compute Shader] Groupshared memory as slow as VRAM


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

2 replies to this topic

#1 CryZe   Members   -  Reputation: 768


Posted 10 September 2012 - 12:40 AM

As far as I know, I have one of the earlier and cheaper mobile graphics cards that supported DX11. It's an AMD Radeon HD 5730M.

I've started optimizing graphics algorithms by porting them to compute shaders and improving them by sharing memory and synchronizing the threads. This way I could improve the runtime of my Bloom from O(n) to O(log n) per pixel.

But that was only the theoretical runtime. In reality, the algorithm performed much worse than the original linear algorithm. I'm pretty sure I know the reason. Instead of, let's say, 32 read operations and 1 write operation, the algorithm now needs 1 read operation from VRAM, 5 read operations from groupshared memory, 5 write operations to groupshared memory, and 1 write operation to VRAM.
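
To make the operation counts above concrete, here is a minimal Python sketch of one way a 32-wide window can be summed in log2(32) = 5 passes. It is an assumption for illustration, not the poster's actual Bloom shader: it computes a plain box sum, the list `shared` stands in for groupshared memory, and the names `window_sum_linear` and `window_sum_doubling` are hypothetical.

```python
def window_sum_linear(src, width):
    """Direct version: every output element reads `width` inputs."""
    return [sum(src[i:i + width]) for i in range(len(src))]


def window_sum_doubling(src, width):
    """Log-step version: each pass doubles the span covered per element.

    Returns the sums plus a per-thread operation tally (an illustrative
    model, not a measurement).
    """
    steps = width.bit_length() - 1
    if width < 1 or (1 << steps) != width:
        raise ValueError("width must be a power of two")

    n = len(src)
    shared = list(src)  # each "thread" loads one value and stores it once
    ops = {
        "vram_reads": 1,
        "shared_reads": 0,
        "shared_writes": 1 if steps else 0,  # the initial store
        "vram_writes": 1,
    }

    for k in range(steps):
        offset = 1 << k
        # A new list is built so that every "thread" reads before any
        # "thread" writes, which is what a group barrier guarantees on a GPU.
        shared = [
            shared[i] + (shared[i + offset] if i + offset < n else 0)
            for i in range(n)
        ]
        ops["shared_reads"] += 1      # one neighbour read per pass
        if k < steps - 1:
            ops["shared_writes"] += 1  # the last pass goes straight to VRAM

    return shared, ops


src = list(range(64))
result, ops = window_sum_doubling(src, 32)
assert result == window_sum_linear(src, 32)
print(ops)
```

For a width of 32 the tally comes out as 1 VRAM read, 5 shared reads, 5 shared writes, and 1 VRAM write per thread, which matches the counts quoted in the post.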

Overall, groupshared memory, being L1 cache, should be way faster than 32 read operations from VRAM, and the algorithm even needs far fewer operations because of its logarithmic runtime, but it's way slower (8 ms instead of 0.5 ms). The slowdown could be because of memory bank conflicts. But could they really cause such an enormous slowdown?

To me it looks like my graphics card might not have an actual L1 cache residing on the Wavefront as groupshared memory at all. It performs just as badly as a UAV residing in VRAM would. So maybe they simply wrote a driver that uses 32 KB of reserved memory in VRAM as groupshared memory. Could that be the case, or is it the bank conflicts?

I wish there were tools that could shine more light on such problems. Graphics cards and the tools should be more transparent in what's actually going on, so that the developers could improve the algorithms even further.

Update: After reading through NVIDIA's CUDA documentation, it looks like my shaders don't cause any bank conflicts at all. Each half-warp (16 threads) always accesses 16 different memory banks. Only a whole block (1024 threads) accesses the same banks multiple times, which is normal and has nothing to do with bank conflicts.
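
The bank-conflict check described in the update can be sketched in a few lines of Python. This is a simplified model under stated assumptions: 16 banks with successive 32-bit words mapped to successive banks, evaluated per group of 16 threads as in the half-warp description above. The bank count and group size on other hardware (including the poster's AMD card) may differ, so both are parameters. The function names are hypothetical, and threads reading the same word are treated as a broadcast rather than a conflict.

```python
def conflict_degree(word_indices, num_banks=16):
    """Worst-case number of distinct words that land in the same bank.

    1 means conflict-free; N means the access is serialized N ways
    under this simplified model.
    """
    banks = {}
    for word in word_indices:
        banks.setdefault(word % num_banks, set()).add(word)
    return max((len(words) for words in banks.values()), default=0)


def strided_access(stride, num_threads=16):
    """Word index touched by each thread when thread t accesses t * stride."""
    return [t * stride for t in range(num_threads)]


for stride in (1, 2, 16, 17):
    print(stride, conflict_degree(strided_access(stride)))
```

Under this model a stride of 1 or 17 is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 16 puts all 16 threads in the same bank.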

Edited by CryZe, 10 September 2012 - 08:19 AM.



#2 MJP   Moderators   -  Reputation: 10033


Posted 10 September 2012 - 02:41 PM

I suppose it's possible that your hardware doesn't actually have on-chip shared memory and just uses global memory instead, but I've not heard of that ever being the case. Mobile hardware isn't usually well-documented, though, so who knows. You could try using GPU PerfStudio or AMD's APP profiling suite, but I'm not sure if either of those will give you enough information to narrow down the problem. Perhaps you might want to try running some samples that make use of shared memory to see if they also perform poorly on your hardware.

Also, just so you know, shared memory isn't L1. On AMD and Nvidia hardware it's its own special type of on-chip memory, and it's separate from the caches.

#3 CryZe   Members   -  Reputation: 768


Posted 10 September 2012 - 04:25 PM

Also, just so you know, shared memory isn't L1. On AMD and Nvidia hardware it's its own special type of on-chip memory, and it's separate from the caches.

"As mentioned in Section F.4.1, for devices of compute capability 2.x and higher, the same on-chip memory is used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared memory is configurable for each kernel call." Source: CUDA Programming Guide

I've tried both the PerfStudio and AMD's APP Profiler. But they didn't work at all. PerfStudio wasn't able to catch a frame (endlessly trying to connect, even though it was already connected) and the APP Profiler showed me an error message in both of its modes. I'll probably try it again tomorrow.

Perhaps you might want to try running some samples that make use of shared memory to see if they also perform poorly on your hardware.

Oh, that's a good idea. I remember that the OIT11 sample from the DirectX Sample Browser performs incredibly badly on my hardware (9 FPS at 320x240). I don't know if it performs badly in comparison to the other samples on other hardware as well, though. I'll take a look at its source to find out why it might perform that badly.

I'll also try to implement a bandwidth-heavy compute shader that performs an enormous number of write operations in one of three ways: to shared memory, to shared memory while causing as many bank conflicts as possible, or to global memory. If the performance is the same in all three cases, the chances that my graphics card uses on-chip memory as shared memory are pretty much zero.
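
The decision rule for that experiment could be written down roughly as follows. This is only a sketch of the reasoning: the function name, the verdict labels, and the 25% tolerance are hypothetical choices, not something from the thread or from any profiling tool.

```python
def interpret_timings(shared_ms, shared_conflict_ms, global_ms, tolerance=0.25):
    """Coarse verdict from three measured timings of the same write workload.

    `tolerance` is an arbitrary relative threshold for calling two timings
    "the same"; pick it according to how noisy the measurements are.
    """
    def similar(a, b):
        return abs(a - b) <= tolerance * max(a, b)

    if similar(shared_ms, global_ms) and similar(shared_conflict_ms, global_ms):
        # All three variants cost about the same: consistent with shared
        # memory not being served from on-chip storage.
        return "same_as_global"
    if not similar(shared_ms, shared_conflict_ms):
        # Conflicts change the cost: consistent with banked on-chip memory.
        return "bank_sensitive"
    # Shared differs from global, but conflicts made no measurable difference.
    return "differs_from_global"
```

Only the first verdict would support the suspicion that groupshared memory is backed by VRAM; the other two point toward some other cause for the slowdown.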

Edited by CryZe, 11 September 2012 - 03:27 AM.




