Are there some memory limits to consider when writing a shader?

Started by NikiTo
18 comments, last by JoeJ 5 years, 10 months ago

Would it be a problem to create in HLSL ~50 uninitialized arrays of ~300,000 cells each and then use them for my algorithm? That is what I currently do in C++ (where I had stack overflow problems because of the large arrays).
It is something internal to the shader. The shader would create the arrays at the beginning, use them, and then not need them anymore. It would not take data for the arrays from the outside world, and it would not give data from the arrays back to the outside world either. Nothing shared.
My question is not very specific; it is about memory consumption considerations when writing shaders in general, because my algorithm still has to be polished. I will leave the HLSL for when the algorithm is totally finished and working (because I expect writing HLSL to be just as unpleasant as GLSL). Still, it is useful for me to know beforehand what problems to consider.


If I'm understanding you correctly, what you're suggesting is probably the worst thing you could be trying to do. Shaders don't have a stack or any general local memory; they have a few (~256) registers to play with and that's it. Compute Shaders can use LDS / Group Shared Memory, but I don't think that's what you're using here.

If you're writing a Pixel Shader that uses large arrays local to each thread then you're in for a world of hurt I'm afraid. The compilers will probably find a way to run what you've written by spilling back to main memory a lot, but performance will be dire.

Do you have an example of the types of situations where you'd want a thread on the GPU to have access to 300,000 'cells' of data and for that data to not live in a Resource such as a Buffer or a Texture?
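For reference, "living in a Resource" would look something like the sketch below on the C++ side. This assumes D3D11 (the thread doesn't say which API you're targeting), and the Cell struct and function name are just placeholders: the ~300,000 cells go into a structured buffer with an unordered access view, and the compute shader then reads and writes them through an RWStructuredBuffer<Cell> instead of through per-thread local arrays.

    #include <d3d11.h>

    // Placeholder layout for one "cell"; the real layout depends on the algorithm.
    struct Cell
    {
        float value;
    };

    // Hypothetical helper: puts the cells into a GPU resource (a structured buffer)
    // so a compute shader can access them through a UAV instead of declaring huge
    // per-thread local arrays.
    HRESULT CreateCellBuffer(ID3D11Device* device,
                             UINT cellCount,               // e.g. ~300,000
                             ID3D11Buffer** outBuffer,
                             ID3D11UnorderedAccessView** outUAV)
    {
        D3D11_BUFFER_DESC desc = {};
        desc.ByteWidth           = sizeof(Cell) * cellCount;
        desc.Usage               = D3D11_USAGE_DEFAULT;
        desc.BindFlags           = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
        desc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
        desc.StructureByteStride = sizeof(Cell);

        HRESULT hr = device->CreateBuffer(&desc, nullptr, outBuffer);
        if (FAILED(hr))
            return hr;

        // A null view description creates a UAV covering the whole structured buffer.
        return device->CreateUnorderedAccessView(*outBuffer, nullptr, outUAV);
    }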

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

If you're trying to use the GPU to circumvent your stack overflow problem, the solution would be to allocate heap memory using new/malloc.
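As a minimal C++ sketch of that suggestion (the element type and count are just placeholders for whatever a "cell" actually is):

    #include <vector>

    void RunAlgorithm()
    {
        // A local array like this lives on the stack and can overflow it:
        // float cells[300000];                  // ~1.2 MB, and ~60 MB for 50 such arrays

        // Heap allocation instead: same element count, no stack usage.
        std::vector<float> cells(300000, 0.0f);  // zero-initialized, freed automatically

        // ... use cells[i] exactly like the plain array ...
    }

Increasing the executable's stack size (as suggested further down the thread) is the other option, but heap allocation scales better to 50 arrays of this size.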

If you're trying to speed up an algorithm, you could simply bounce data between image buffers - 300000 is less than a 600*600 buffer - depending on what you define as a 'cell'.

Without more information it's hard to understand what it is you're trying to achieve.

 

I am designing my algorithm in C++ now. Later I want to adapt it to HLSL. It is working, but it is huge and I have to optimize it more, because I know the processor each shader runs on is not as fast as the main CPU where the C++ runs. I could even redesign it completely if needed, but I don't think there is a way to do it with only 256 registers.

I could say I wrote a genetic algorithm, but I don't want to enter discussions about what counts as a genetic algorithm/ML/big data/etc. I know for sure that my data grows iteratively and then shrinks iteratively. I think it is a genetic algorithm, but I don't want to argue.

Should I use RWBuffer (or one of its variants), or something else that is more appropriate than an array?

(a "cell" meaning "a small room in which a prisoner data is locked up", not "the smallest structural and functional unit of an organism")

16 minutes ago, NikiTo said:

I know the processor each shader runs on is not as fast as the main CPU where the C++ runs

It's not just that they're slower - they're different. Shader processors use SIMD to run on multiple pixels/vertices/compute-threads at the same time.

On the CPU we have SSE instructions, which operate on four values at the same time. GPU processors operate on up to 64 values at the same time! You need to be aware of this when structuring your algorithms. For example, branching on a CPU is fine, but branching on the GPU can come at a huge cost (if only 1 pixel needs to take the branch, you'll be running instructions where only 1/64th of the computational power is being utilized!).
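To make the four-wide CPU case concrete, here is a small SSE intrinsics sketch (a made-up example, not code from this thread). Both sides of the "branch" are computed for all four lanes and then selected per lane with a mask, which is essentially what a GPU does across its much wider vector when threads diverge:

    #include <xmmintrin.h>   // SSE intrinsics

    // Computes out[i] = (a[i] > 0) ? a[i] * 2 : -a[i] for four floats at once.
    // Both "branches" are evaluated for all four lanes, then blended with a mask:
    // the same pattern a GPU applies across its 32/64 lanes.
    void SelectTimesTwoOrNegate(const float a[4], float out[4])
    {
        __m128 va      = _mm_loadu_ps(a);
        __m128 mask    = _mm_cmpgt_ps(va, _mm_setzero_ps());   // lane-wise a > 0
        __m128 ifTrue  = _mm_mul_ps(va, _mm_set1_ps(2.0f));    // taken path
        __m128 ifFalse = _mm_mul_ps(va, _mm_set1_ps(-1.0f));   // not-taken path

        // Per-lane select: (mask & ifTrue) | (~mask & ifFalse)
        __m128 result  = _mm_or_ps(_mm_and_ps(mask, ifTrue),
                                   _mm_andnot_ps(mask, ifFalse));
        _mm_storeu_ps(out, result);
    }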

I used to use SIMD in assembler a lot; I'm used to it. For some reason it is harder for me to explore the GPU the way I did the CPU. Intel has tons of very explicit documentation; in ASM I know what goes where and where everything is at any moment. On the GPU, although it should be very similar, it is not so clear to me. I will try to redesign the algorithm to run on Intel's SIMD first. It should help me orient myself, and it should be easier to port it from Intel SIMD to the GPU than from my C++ code full of branches. (For some reason I totally forgot about branches; if you asked me "are branches good for the GPU?" I would say no, but I still totally forgot about them.)

@Hodgman Why do you refer to it as SIMD instead of SIMT?  To me SIMT is more accurate.

As far as your C++ version goes, either (as mentioned before) don't allocate the arrays on the stack (use the heap or, dare I say it, globals) or increase the size of the stack.

Also, as alluded to in my reply to Hodgman, GPUs are SIMT, not SIMD; it's similar but not the same. And the number of registers should have no bearing on whether you can or cannot implement your algorithm, although it will affect performance.

-potential energy is easily made kinetic-

10 hours ago, Infinisearch said:

@Hodgman Why do you refer to it as SIMD instead of SIMT?  To me SIMT is more accurate.

Because I'm imprecise when I talk most of the time :D There's not too much difference between them. You're right though.
Shader code is SIMD if your thread-group size matches the hardware vector instruction width. If your thread-group size is bigger, then it's SIMT, as the GPU is juggling multiple thread-groups at the same time (something akin to "hyperthreading").
i.e. the instruction set is SIMD, but the shader code environment built on top of that is SIMT. 

3 hours ago, Hodgman said:

Shader code is SIMD if your thread-group size matches the hardware vector instruction width.

I don't agree with this either... SIMD doesn't allow for branch divergence, SIMT does. You can have branch divergence within a single thread group whose size matches the hardware vector instruction width. At least that's my understanding of it.

-potential energy is easily made kinetic-

22 minutes ago, Infinisearch said:

I don't agree with this either... SIMD doesn't allow for branch divergence, SIMT does. You can have branch divergence within a single thread group whose size matches the hardware vector instruction width. At least that's my understanding of it.

Many vector instruction sets offer masked operations nowadays, and with instructions like movemask you can make sure that only the branches that are actually needed get evaluated. So it really is more of a programming-model thing than an actual difference in hardware.
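A small SSE sketch of that point (again a made-up example): _mm_movemask_ps collapses the per-lane comparison mask into an integer, so the expensive path can be skipped entirely when no lane needs it and blended in per lane otherwise.

    #include <xmmintrin.h>   // SSE intrinsics

    // Stand-in for some expensive per-lane computation (here just a square root).
    static __m128 ExpensivePath(__m128 v)
    {
        return _mm_sqrt_ps(v);
    }

    // Only lanes where v > 100 take the "expensive" path. movemask lets us skip
    // the work when no lane qualifies: masking done in the programming model
    // rather than by special divergence hardware.
    __m128 ConditionalExpensive(__m128 v)
    {
        __m128 mask = _mm_cmpgt_ps(v, _mm_set1_ps(100.0f));  // per-lane condition
        int bits    = _mm_movemask_ps(mask);                 // one bit per lane

        if (bits == 0)
            return v;                                        // no lane needs it: skip entirely

        __m128 slow = ExpensivePath(v);
        // Blend: take 'slow' where the mask bit is set, keep 'v' elsewhere.
        return _mm_or_ps(_mm_and_ps(mask, slow), _mm_andnot_ps(mask, v));
    }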
