"As mentioned in Section F.4.1, for devices of compute capability 2.x and higher, the same on-chip memory is used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared memory is configurable for each kernel call." Source: CUDA Programming Guide
Also, just so you know, shared memory isn't L1. On AMD and Nvidia hardware it's its own special type of on-chip memory, and it's separate from the caches.
I've tried both PerfStudio and AMD's APP Profiler, but neither worked at all. PerfStudio wasn't able to capture a frame (it kept trying to connect endlessly, even though it was already connected), and the APP Profiler showed me an error message in both of its modes. I'll probably try again tomorrow.
Oh, that's a good idea. I remember that the OIT11 sample from the DirectX Sample Browser performs incredibly badly on my hardware (9 FPS at 320x240). I don't know whether it also performs badly compared to the other samples on other hardware, though. I'll take a look at its source to find out why it might perform that badly.
Perhaps you might want to try running some samples that make use of shared memory to see if they also perform poorly on your hardware.
I'll also try to implement a bandwidth-heavy compute shader that performs an enormous number of write operations to one of three targets: shared memory, shared memory with as many bank conflicts as possible, or global memory. If the performance is the same in all three cases, the chances that my graphics card uses on-chip memory as shared memory are pretty much zero.
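For the bank-conflict variant, a rough model can help with picking the stride. This is only a sketch in Python, assuming 32 banks of 4-byte words and 32 threads issuing an access together (the layout Nvidia documents for compute capability 2.x; other hardware may well differ). `conflict_degree` is a made-up helper for illustration, not part of any API.

```python
from collections import defaultdict

def conflict_degree(stride_words, num_threads=32, num_banks=32):
    """Estimate how many serialized passes one group of threads needs.

    Assumed model: thread t touches the 32-bit word at index
    t * stride_words, and word w lives in bank w % num_banks.
    Threads that hit the same word are served by one broadcast,
    so only distinct words within a bank are counted.
    """
    words_per_bank = defaultdict(set)
    for t in range(num_threads):
        word = t * stride_words
        words_per_bank[word % num_banks].add(word)
    # The busiest bank decides how often the access is replayed.
    return max(len(words) for words in words_per_bank.values())

if __name__ == "__main__":
    for stride in (1, 2, 16, 32, 33):
        print(f"stride {stride:2d}: {conflict_degree(stride)}-way")
```

Under this model a stride of 32 words is the worst case and a stride of 1 (or any odd stride such as 33) is conflict-free, so those two would make a reasonable pair to time against the global-memory version.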