
## Recommended Posts

I'm getting so frustrated here at work. We're not doing anything graphics related anymore so no opportunity to expand my knowledge and there is no real resident graphics expert to learn from. So I'm reaching out to the experts on Gamedev to set me straight.

I want to get some experience/knowledge on optimizing shaders. I know that's a bit general, but I know it's a big weak point of mine. Other than avoiding expensive instructions or dependent texture reads, I really don't have much of an idea of how to make any given shader run faster. I have some ideas, but honestly I have not done any profiling to see if any of them are true. Honestly, I don't even know what tools to use so that I can profile. What do people use out there?

I would like to ask the experts out there...

1) Are dependent texture reads a real problem anymore on the newer cards? (I never really understood why they were so slow on the old cards other than maybe stalling the pipeline due to poor/non-existent thread scheduling??)
2) I remember reading somewhere that I should issue texture lookups asap so that there is work to do while we wait, is this true?
3) I also remember reading about how the newer cards aren't really 4-way SIMD processors (as in operate on float4's) and that really they execute four instructions on four pieces of data in one clock cycle. Something about how NVIDIA analyzed shader usage and found that typical operations are not on float4s but rather on floats. Does that mean I should not try and pack data into float4's in the shader? I'm not talking about packing data into a vertex attribute, I mean temp registers in the shader. Speaking of temp regs...
4) Now that everything is unified architecture I read that I should keep GPR usage to a minimum such that more threads can be executing in parallel. This seems to conflict with 3), although I would think 4) would outweigh 3).
5) I had a guy tell me once that every keyword in HLSL takes 1 clock cycle (ex: dot( N, L) is 1 cycle, exp( X ) is 1 cycle, pow( X, Y), clamp(), max(), sin(), atan() etc). Is this true? He's a pretty senior guy but not really up on the new tech. I'm skeptical. I wouldn't even be so sure that one assembly instruction is equal to 1 cycle since different GPUs will decode the general assembly into something else, right? But I could be wrong (which is why I'm asking you).
6) I was in an interview once and a graphics guy told me that sometimes interpolators can be a bottleneck in the pipeline. I can understand interpolators (color) causing some precision issues, but speed!? How? I believe interpolators are separate hardware units so maybe they can't really handle a full load when all of the vertex attributes need interpolating. (just guessing)

I'm deeply interested in these kinds of things to keep in mind when writing my shaders (or interviews). I understand some of these will depend on hardware, but I'm not content with that. I want to understand WHY something is the way it is. I would much rather hear something like "Well on the PS3 it will do X, on the 360 it will do Y, and on ATI it will do Z vs NVIDIA it will do W." Documentation or an article would be the best so that I can read about these things myself.

##### Share on other sites
1) Dependent reads should be fine; I have DX9 code that does a lot of dependent reads and it runs quite well even on integrated Intel stuff. Obviously don't overdo it, but otherwise just do it and wait until you've definitely confirmed it's a problem before considering alternatives.

2) Sounds like a semi-dubious micro-optimization to me.

3) I never bother packing and don't notice any performance issues. The GPU will expand your position vectors to float4 anyway so I'm skeptical. If someone can prove me wrong and back it up with a good example that's relevant in the real world (rather than just a techdemo that emphasises the point) I'd be interested.

4) No idea.

5) Sounds highly dubious. Even the old OpenGL ARB asm specs were pretty explicit that one asm instruction != one GPU instruction (I don't know about DX8 asm).

6) If interpolators are a bottleneck in the pipeline then it's a very strange program indeed. The real pipeline bottlenecks are going to be shader complexity and fillrate; interpolators are small fish compared to this and it sounds like another dubious micro-optimization.

##### Share on other sites

1) Are dependent texture reads a real problem anymore on the newer cards? (I never really understood why they were so slow on the old cards other than maybe stalling the pipeline due to poor/non-existent thread scheduling??)
This depends on how many arithmetic instructions you have between two reads: you'll be either texture-read bound or ALU bound. GPU vendors suggest a 6:1 ALU:tex ratio.
2) I remember reading somewhere that I should issue texture lookups asap so that there is work to do while we wait, is this true?
There are several reasons for this. Some hardware has a special path where texture reads based on interpolators are scheduled before the shader even executes, so a texture read costs nearly nothing in that case. Another reason is that pixel shaders are like a reduction: they gather a lot of math and data to output, usually, one float4 value. The later in a pixel shader you issue a texture read, the smaller the chance that the compiler can insert independent ALU instructions before the result of that fetch is used, and that may stall the pipe (although the newest hardware does its best to hide that by using a lot of threads).
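As a rough illustration of the 6:1 ALU:tex guideline mentioned above, here is a minimal Python sketch. The function name and the idea of reducing a shader to two instruction counts are assumptions made for illustration; a real shader's limit depends on the hardware, the driver compiler, and cache behaviour, so treat this as a first guess to verify with a profiler.

```python
def limiting_unit(alu_instructions, texture_reads, target_ratio=6.0):
    """Guess which unit limits a shader from its ALU:tex instruction ratio.

    target_ratio defaults to the 6:1 guideline quoted in the thread;
    actual hardware will have a different balance point.
    """
    if texture_reads == 0:
        return "ALU"  # no texture work at all
    ratio = alu_instructions / texture_reads
    return "ALU" if ratio > target_ratio else "texture"


# Hypothetical shaders: 40 ALU ops per 10 reads (4:1), 90 ALU ops per 10 reads (9:1).
print(limiting_unit(40, 10))  # texture
print(limiting_unit(90, 10))  # ALU
```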

3) I also remember reading about how the newer cards aren't really 4-way SIMD processors (as in operate on float4's) and that really they execute four instructions on four pieces of data in one clock cycle. Something about how NVIDIA analyzed shader usage and found that typical operations are not on float4s but rather on floats. Does that mean I should not try and pack data into float4's in the shader? I'm not talking about packing data into a vertex attribute, I mean temp registers in the shader. Speaking of temp regs...
That really depends on hardware. You are right, NVIDIA GPUs are scalar nowadays, and some vector optimizations like

 luminance = dot(color, float4(1.f, 1.f, 1.f, 1.f));

might make your shader slower rather than faster. Although ATI also does not have a dot4 anymore, those are still vector4 and vector4+1 GPUs (the newer ones tend to be float4), and if you vectorize your code by hand it might be better than what the compiler creates. So it's really up to the target hardware what you do and, sadly, optimizing for one usually hurts the other. (As far as my sample goes, the NVIDIA driver detects that case and reorders it to 3 adds; it's just an example.)
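To make the "3 adds" remark concrete, here is a small Python sketch that counts scalar operations for the dot-with-ones trick versus summing the components directly. The op counts assume a naive scalar expansion with no multiply-add fusion, which is an assumption for illustration only; a real compiler or driver may emit something quite different.

```python
def dot4_scalar(a, b):
    """Naive scalar expansion of a 4-component dot product: 4 muls, 3 adds."""
    products = [a[i] * b[i] for i in range(4)]  # 4 multiplies
    total = products[0]
    for p in products[1:]:                      # 3 adds
        total += p
    return total, 7                             # (result, scalar op count)


def sum4_scalar(a):
    """Summing the components directly: 3 adds, no multiplies."""
    return a[0] + a[1] + a[2] + a[3], 3


color = (0.25, 0.5, 0.125, 1.0)  # made-up RGBA value
ones = (1.0, 1.0, 1.0, 1.0)
print(dot4_scalar(color, ones))  # (1.875, 7)
print(sum4_scalar(color))        # (1.875, 3)
```

Both give the same value; on hardware that executes one scalar op per thread per step, the vector-looking form only pays off if the compiler recognises the pattern.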

4) Now that everything is unified architecture I read that I should keep GPR usage to a minimum such that more threads can be executing in parallel. This seems to conflict with 3), although I would think 4) would outweigh 3).
GPR usage has had a big influence on final performance since long before unified architectures. The GeForce FX (GeForce 5) really suffered if you used more than 4 registers. So keeping register usage low is usually a good thing, but that's up to the compiler in most cases. On ATI the registers seem to be SIMD organized, so using vectors might actually lower the GPR pressure, but yes, on NVIDIA, if they don't detect what you're intending to do, it might cost you something.

5) I had a guy tell me once that every keyword in HLSL takes 1 clock cycle (ex: dot( N, L) is 1 cycle, exp( X ) is 1 cycle, pow( X, Y), clamp(), max(), sin(), atan() etc). Is this true? He's a pretty senior guy but not really up on the new tech. I'm skeptical. I wouldn't even be so sure that one assembly instruction is equal to 1 cycle since different GPUs will decode the general assembly into something else, right? But I could be wrong (which is why I'm asking you).
In Microsoft's HLSL intrinsic documentation you can read how many "slots" instructions cost; e.g., off the top of my head, "sign" costs 3 slots. On the other hand, if the GPU detects that you want the sign, it might cost you 1 instruction, or in some cases even be free if the compiler (I mean the one inside the driver, not fxc) figures out that, in combination, your "sign" can substitute for e.g. some redundant mov. If you want to see some performance metrics, try FX Composer; it has some cycle output for shaders (at least that was the case when I tried it some years ago).

6) I was in an interview once and a graphics guy told me that sometimes interpolators can be a bottleneck in the pipeline. I can understand interpolators (color) causing some precision issues, but speed!? How? I believe interpolators are separate hardware units so maybe they can't really handle a full load when all of the vertex attributes need interpolating. (just guessing)
Interpolators are not purely separate hardware anymore. Nowadays it makes no sense to have special-purpose hardware that would idle most of the time (you might have 16 interpolator accesses in a shader that has 200 instructions; that's 184 idle loops), which is why shader units at least assist the interpolator hardware, and none of it is free. If you have a shader that, e.g., blends 8 terrain layers and all you do per pixel is read 10 textures and blend them, interpolators might be the limit. That's also the case if you use tessellation with a lot of tiny polygons: just providing the interpolators for every micro triangle that covers 1 pixel on average might make the whole rendering interpolator bound.

I'm deeply interested in these kinds of things to keep in mind when writing my shaders (or interviews). I understand some of these will depend on hardware, but I'm not content with that. I want to understand WHY something is the way it is. I would much rather hear something like "Well on the PS3 it will do X, on the 360 it will do Y, and on ATI it will do Z vs NVIDIA it will do W." Documentation or an article would be the best so that I can read about these things myself.
All hardware is designed for some typical workload; if you move away from what is expected, you will be limited by some particular part that was not designed for that purpose. For typical rendering, the RSX of the PS3 is well-designed hardware: if you doubled some part of it, you would not gain any performance, as you'd either be limited by the previous stage, which was not designed to provide as much data as is now needed, or by the next stage, which was not designed to consume as much data as you now produce, or you'd simply lower the performance of the whole architecture (e.g. worse timing, fewer MHz) due to too many transistors.

Always accepting this is key. Too many times I see people start optimizing something by assuming "hey, this is expensive, I read this somewhere, I have to optimize it" and later on they ask "just why is it not faster?". You need to understand what each stage really does and how you can validate whether it's the limit. E.g. if you want to figure out whether you are bandwidth limited, reduce the texture size to 1x1. If you find that doesn't gain anything, then all optimization in that direction won't help. It's the most basic thing to first profile and find the issue and then optimize, but somehow most people avoid it and waste time.
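The 1x1 texture experiment boils down to comparing two frame times. A minimal Python sketch of that comparison follows; the function names, the timings, and the 10% noise threshold are all hypothetical and not taken from any tool or vendor guideline.

```python
def speedup_fraction(baseline_ms, experiment_ms):
    """Fraction of frame time saved by the experiment (0.0 means no change)."""
    return (baseline_ms - experiment_ms) / baseline_ms


def is_likely_bottleneck(baseline_ms, experiment_ms, threshold=0.10):
    """True if removing the suspected cost sped the frame up noticeably.

    threshold is an arbitrary noise margin chosen for this sketch.
    """
    return speedup_fraction(baseline_ms, experiment_ms) > threshold


# Hypothetical timings: 16.0 ms normally, then with every texture forced to 1x1.
print(is_likely_bottleneck(16.0, 15.8))  # False: texture bandwidth is not the limit
print(is_likely_bottleneck(16.0, 11.0))  # True: texture bandwidth matters here
```

The same comparison works for other experiments, such as shrinking the render target to test fillrate or dropping interpolated attributes to test interpolator pressure.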

Once you've found the bottleneck, it's quite easy to find optimization suggestions.

##### Share on other sites
1. Dependent texture fetches are no longer a problem. I believe they are still an issue on PowerVR mobile GPUs, if that matters to you.
2. Texture fetches have latency that can be hidden with ALU instructions, but not in terms of a single shader thread. The latency is hidden by constantly switching execution to a different thread. Compilers are also free to reorder your instructions.
3. In terms of a single shader thread, on Nvidia DX10+ hardware the threads are indeed executed in a scalar fashion. So a float4 multiplied by a float4 will be executed as 4 serial multiplies. This means that there is no benefit from explicitly vectorizing your shader code on Nvidia hardware. ATI/AMD hardware is different, and does use vector ALU instructions for each thread. Up to their HD 5000 series it was 4D + 1 scalar/transcendental/integer, and on the latest it's 3D + 1 scalar. On that hardware vectorization can be key to performance.
4. Generally you want to reduce GPR usage, but optimizing it is highly specific to the hardware + driver so it's largely out of your control.
5. Absolutely not. There is no direct correlation between the HLSL intrinsics and the actual microcode instructions executed on the GPU, and even if there were there is no direct relation between instructions and cycle counts. A lot of HLSL intrinsics don't even map directly to HLSL assembly instructions, and merely act as math helper functions.
6. It's certainly possible. Newer GPU's use less special-case hardware and will use their shader ALU cores as much as possible.
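Points 2 and 4 above interact: latency is hidden by switching threads, and register usage limits how many threads can be resident. The Python sketch below is a deliberately simplified model of that trade-off. Every number in it (fetch latency, ALU cycles, register file size) is made up for illustration, and real GPUs schedule in groups of threads with many more constraints.

```python
import math


def threads_needed(fetch_latency_cycles, alu_cycles_per_fetch):
    """Threads required so the other threads' ALU work covers one fetch's latency.

    While one thread waits, the remaining (n - 1) threads each contribute
    alu_cycles_per_fetch cycles of work, so (n - 1) * alu >= latency.
    """
    return math.ceil(fetch_latency_cycles / alu_cycles_per_fetch) + 1


def threads_resident(register_file_size, registers_per_thread):
    """Threads that fit at once when the register file is shared between them."""
    return register_file_size // registers_per_thread


# Made-up numbers: 400-cycle fetch, 20 ALU cycles per fetch, 1024 registers.
needed = threads_needed(400, 20)  # 21
for regs in (16, 32, 64):
    resident = threads_resident(1024, regs)
    print(regs, resident, resident >= needed)
```

In this toy model a shader using 64 registers per thread fits only 16 threads, fewer than the 21 needed, so fetch latency would start to show.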

PS3 specifics are under NDA so I can't talk about that. You can look up some details on older 7-series Nvidia GPUs if you want to get a general idea. Either way the hardware is old, and not terribly interesting if you want to understand performance on modern GPUs. There are a lot of details on the Xbox 360 GPU floating around but I'm not sure what is and isn't under NDA, so I'm not going to comment on the specifics of that either. However it is definitely more of a modern GPU, and a lot of concepts from newer AMD GPUs will apply.

Anyway, if you really want to understand shader performance I would make sure that you have a solid understanding of how GPUs work. This presentation has a great introduction to GPU architecture: http://bps10.idav.uc...IGGRAPH2010.pdf. You can also look through past GDC and SIGGRAPH presentations from AMD and Nvidia for some good insights and performance tips. AMD has their papers listed here, which includes a few presentations that have both AMD + Nvidia information.

Beyond that, it's a bit difficult to give any general performance tips because it can vary so much depending on the hardware and what you're doing. Probably the most general concepts to keep in mind are that on modern GPU's you have way more ALU resources available than bandwidth, and also that flow control efficiency is dependent on coherence.

##### Share on other sites

6) If interpolators are a bottleneck in the pipeline then it's a very strange program indeed. The real pipeline bottlenecks are going to be shader complexity and fillrate; interpolators are small fish compared to this and it sounds like another dubious micro-optimization.

I've seen this as a problem on X360 and PS3 shaders to the point where we had to pack values into different 'slots' in order to reduce our interpolator pressure. It does tend to show up more on short/already optimised shaders however.
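To show why packing reduces interpolator pressure, here is a Python sketch that counts float4 slots with and without packing. The attribute names and sizes are hypothetical, and the first-fit-decreasing strategy is just one simple way to pack; in a real shader you would do the packing by hand with swizzles in the vertex and pixel shader.

```python
def slots_unpacked(attributes):
    """One float4 interpolator slot per attribute, however few components it has."""
    return len(attributes)


def slots_packed(attributes, slot_width=4):
    """First-fit-decreasing packing of whole attributes into float4 slots.

    Assumes every attribute has at most slot_width components.
    """
    free = []  # remaining components in each slot opened so far
    for size in sorted(attributes.values(), reverse=True):
        for i, room in enumerate(free):
            if size <= room:
                free[i] -= size
                break
        else:
            free.append(slot_width - size)  # open a new slot
    return len(free)


# Hypothetical vertex outputs: name -> number of float components.
attrs = {"uv0": 2, "uv1": 2, "normal": 3, "fog": 1}
print(slots_unpacked(attrs), slots_packed(attrs))  # 4 2
```

Here the two UV sets share one slot and the fog factor rides in the spare component next to the normal, halving the number of interpolated float4s.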

The OP's second point (regarding moving texture fetches as early as possible) can also help, however this is a very specific thing and not a hard and fast rule. Mostly it allows some hardware to get the request sent and then if they get back to your thread before the memory has arrived you can get on with some ALU work instead of stalling. But it does rely on having ALU work to do 'in the mean time' as it were.

Finally, a note on AMD hardware: while it was 4D+1 and is now 3D+1, they will shortly be launching a new high-end core which is going scalar as well.

##### Share on other sites
I've seen this as a problem on X360 and PS3 shaders to the point where we had to pack values into different 'slots' in order to reduce our interpolator pressure. It does tend to show up more on short/already optimised shaders however.
How do you even come to the conclusion that the interpolators are the bottleneck? Reduce the number of interpolated attributes and see if it's faster, maybe?

##### Share on other sites
No, we have tools on the consoles which let us profile the GPU and get detailed information on what the performance is for any given draw call and what parts of the GPU are the bottleneck.

I assume using vendor specific tools on the PC you can get the same information, I've not looked close enough to find out.

##### Share on other sites
My own gut feeling boils down to this: don't sweat the details. Much of it is outside your control, rules that work for the CPU may or may not work for the GPU (or vice-versa), newer generations of hardware can turn previous received wisdom on its head, trusting the driver to do the right thing is a necessary evil, and wait until you've confirmed for certain that something is a problem in your program before doing something about it.

Much the same as anything else, really. ;)

##### Share on other sites
I haven't ever really needed to optimize a shader, other than really obvious things like fewer texture lookups and less math making you go faster, of course. Really, the optimization I am doing involves taking things off the CPU and performing them on the GPU. Once I have the shader written I'm running about a thousand times faster already, so maybe further optimization is a little fruitless. Often a GPU alternative involves doing MORE work, but the parallel power will still come through as a speed increase. You seem to be a lot more knowledgeable than me, or you are definitely asking some questions I would never think to ask... but to date I haven't done much shader optimization; the simple fact that I'm shifting work from CPU to GPU is the main thing, and you get huge speed benefits already.
I'm right into GPGPU and compute shaders, as you can probably tell.

My point is, there are other bottlenecks, like sending too much through the PCI Express bus. How the shader works WITH the CPU is what I call optimizing a shader solution; the system code is just as important as the shader code to really shave the seconds off those computations.

But I appreciate the questions you are asking, what if you can optimize shader code? I just never have yet, not up to that yet.

##### Share on other sites
Wow, thanks everyone. This certainly got more response than I was expecting. I have much reading to do (Giddy excitement ensues!!)

I hear several people say something along the lines of "Don't worry about performance until it becomes a problem.", "Early optimization is futile", etc. And I wholeheartedly agree. But the exercise is what to do when you do need to optimize. The interpolator question is a good example, I had no idea this could even be a problem, I guess I just assumed the hardware was designed to cope with the load, so now I know. But the overall idea is to gain a better understanding of what is happening under the hood so when things get slow I have a good model in my head to help problem solve.
