#ActualCryZe

Posted 11 September 2012 - 03:27 AM

Also just so you know, shared memory isn't L1. On AMD and Nvidia hardware it's its own special type of on-chip memory, and it's separate from the caches.

"As mentioned in Section F.4.1, for devices of compute capability 2.x and higher, the same on-chip memory is used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared memory is configurable for each kernel call." Source: CUDA Programming Guide

I've tried both PerfStudio and AMD's APP Profiler, but neither worked at all. PerfStudio wasn't able to capture a frame (it kept trying to connect endlessly, even though it was already connected), and the APP Profiler showed me an error message in both of its modes. I'll probably try again tomorrow.

Perhaps you might want to try running some samples that make use of shared memory to see if they also perform poorly on your hardware.

Oh, that's a good idea. I remember that the OIT11 sample from the DirectX Sample Browser performs incredibly badly on my hardware (9 FPS at 320x240). I don't know whether it also performs badly on other hardware compared to the other samples, though. I'll take a look at its source to find out why it might perform that badly.

I'll also try to implement a bandwidth-heavy compute shader in three variants: one that performs an enormous number of writes to shared memory, one that does the same while causing as many bank conflicts as possible, and one that writes to global memory instead. If all three perform the same, the chances that my graphics card uses on-chip memory for shared memory are pretty much zero.
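
To decide which access pattern the "as many bank conflicts as possible" variant should use, a small model of the bank mapping can help. This is only a sketch: it assumes 32 banks of 4-byte words and that consecutive words map to consecutive banks, which may not match the card in question. `NUM_BANKS` and `conflict_degree` are names made up for this illustration.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def conflict_degree(stride_words, threads=32):
    """Worst-case number of distinct words that fall into the same bank
    when thread t accesses word index t * stride_words.

    1 means conflict-free; N means the access is serialized N ways.
    Threads reading the very same word are not counted as a conflict
    (assumption: the hardware broadcasts that word).
    """
    words_per_bank = defaultdict(set)
    for t in range(threads):
        word = t * stride_words
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    for stride in (1, 2, 16, 32, 33):
        print(f"stride {stride:2d}: {conflict_degree(stride)}-way")
```

Under these assumptions a stride of 32 words puts every thread on the same bank, so that is the pattern to use for the worst case, while a stride of 1 (or any odd stride such as 33) is conflict-free. If shared memory really is banked on-chip memory, the two patterns should differ clearly in throughput.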
