[Compute Shader] Groupshared memory as slow as VRAM

2 posts in this topic

As far as I know, I have one of the earlier and cheaper mobile graphics cards that support DX11. It's an AMD Radeon HD 5730M.

I've started optimizing graphics algorithms by porting them to compute shaders, where threads can share memory and synchronize with each other. This way I could improve the runtime of my Bloom from [eqn]O(n)[/eqn] to [eqn]O(log(n))[/eqn] per pixel.

But that was only the theoretical runtime. In reality, the algorithm performed far worse than the original linear algorithm. I'm pretty sure I know the reason. Instead of, let's say, 32 read operations and 1 write operation, the algorithm now needs 1 read operation from VRAM, 5 read operations from groupshared memory, 5 write operations to groupshared memory and 1 write operation to VRAM.
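To make the operation counts above concrete, here is a minimal CPU-side sketch in Python of one way a 32-wide window can be summed in 5 doubling steps. It is an assumption, not the poster's actual shader: the function names are hypothetical, the list `shared` only stands in for groupshared memory, the copy `previous` stands in for a group barrier, and a plain box sum is used instead of real bloom weights.

```python
def window_sum_doubling(values, log2_width=5):
    """Sum of values[i : i + 2**log2_width] for every i (zero past the end)."""
    n = len(values)
    shared = list(values)  # one "VRAM" read per thread into shared storage
    for step in range(log2_width):
        offset = 1 << step
        # Snapshot plays the role of a barrier: every read in this step
        # sees the data written in the previous step.
        previous = list(shared)
        for i in range(n):  # each i plays the role of one thread
            neighbour = previous[i + offset] if i + offset < n else 0
            shared[i] = previous[i] + neighbour  # one shared read, one shared write
    return shared  # one "VRAM" write per thread


def window_sum_linear(values, width=32):
    """Reference version: `width` reads per output element."""
    return [sum(values[i:i + width]) for i in range(len(values))]


data = list(range(100))
assert window_sum_doubling(data) == window_sum_linear(data)
```

After step k every slot holds the sum of a window of width 2^(k+1), so 5 steps cover 32 elements. That matches the 1 + 5 + 5 + 1 operation count in the post, provided each thread keeps its own running value in a register.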

Since groupshared memory is L1 cache, these accesses should be much faster than 32 read operations from VRAM, and the logarithmic runtime means there are far fewer of them. Yet the new version is much slower (8 ms instead of 0.5 ms). The slowdown could be caused by memory bank conflicts. But could they really cause such an enormous slowdown?

To me it looks like my graphics card might not have an actual on-chip L1 cache serving as groupshared memory at all. It performs just as badly as a UAV residing in VRAM would. So maybe they simply wrote a driver that uses 32 KB of reserved VRAM as groupshared memory. Could that be the case, or is it the bank conflicts?

I wish there were tools that could shine more light on such problems. Graphics cards and the tools should be more transparent in what's actually going on, so that the developers could improve the algorithms even further.

Update: After reading through NVIDIA's CUDA documentation, it turns out my shaders don't cause any bank conflicts at all. Each half warp (16 threads) always accesses 16 different memory banks. Only a whole block (1024 threads) accesses the same banks multiple times, which is normal and has nothing to do with bank conflicts. Edited by CryZe
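The half-warp check described in the update can be sketched with a simplified bank model. The assumptions here are mine: banks are 32 bits wide, consecutive words map to consecutive banks, and there are 16 banks, as in the older CUDA model the update refers to. The bank count on the poster's AMD part may differ, so it is a parameter, and `conflict_degree` is a hypothetical helper rather than part of any API.

```python
from collections import defaultdict


def conflict_degree(word_indices, num_banks=16):
    """Worst-case number of distinct words that land in the same bank.

    1 means conflict-free; k means the access is serialized into k passes.
    Threads reading the very same word are counted once (broadcast case).
    """
    banks = defaultdict(set)
    for word in word_indices:
        banks[word % num_banks].add(word)
    return max(len(words) for words in banks.values())


half_warp = range(16)
assert conflict_degree([t * 1 for t in half_warp]) == 1    # stride 1: no conflict
assert conflict_degree([t * 2 for t in half_warp]) == 2    # stride 2: 2-way conflict
assert conflict_degree([t * 16 for t in half_warp]) == 16  # stride 16: fully serialized
assert conflict_degree([t * 17 for t in half_warp]) == 1   # odd stride: conflict-free
```

Under this model, even the worst case serializes a half warp's access into 16 passes of on-chip memory, which is one way to judge whether bank conflicts alone could plausibly account for a slowdown from 0.5 ms to 8 ms.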

I suppose it's possible that your hardware doesn't actually have on-chip shared memory and just uses global memory instead, but I've never heard of that being the case. Then again, mobile hardware isn't usually well-documented, so who knows. You could try using GPU PerfStudio or AMD's APP profiling suite, but I'm not sure if either of those will give you enough information to narrow down the problem. You might also want to try running some samples that make use of shared memory to see if they also perform poorly on your hardware.

Also, just so you know, shared memory isn't L1. On AMD and Nvidia hardware it's its own special type of on-chip memory, and it's separate from the caches.

[quote name='MJP' timestamp='1347309681' post='4978690']
Also, just so you know, shared memory isn't L1. On AMD and Nvidia hardware it's its own special type of on-chip memory, and it's separate from the caches.
[/quote]

"As mentioned in Section F.4.1, for devices of compute capability 2.x and higher, the same on-chip memory is used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared memory is configurable for each kernel call." Source: [url="http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf"]CUDA Programming Guide[/url]

I've tried both PerfStudio and AMD's APP Profiler, but neither worked at all. PerfStudio wasn't able to capture a frame (it kept trying to connect endlessly, even though it was already connected), and the APP Profiler showed me an error message in both of its modes. I'll probably try again tomorrow.

[quote name='MJP' timestamp='1347309681' post='4978690']
Perhaps you might want to try running some samples that make use of shared memory to see if they also perform poorly on your hardware.
[/quote]

Oh, that's a good idea. I remember that the OIT11 sample from the DirectX Sample Browser performs incredibly badly on my hardware (9 FPS at 320x240). I don't know whether it also performs badly relative to the other samples on other hardware, though. I'll take a look at its source to find out why it might perform that badly.

I'll also try to implement a bandwidth-heavy compute shader in three variants: one that performs an enormous number of write operations to shared memory, one that writes to shared memory while causing as many bank conflicts as possible, and one that writes to global memory. If all three perform the same, the chances that my graphics card uses on-chip memory as shared memory are pretty much zero. Edited by CryZe
