How do Nvidia's SPUs affect me as a shader developer?

Quote:Original post by AndyTX
That's the only thing that I do somewhat differently when targeting 8000 series hardware, and it's fairly minor in most cases (the exception being when a vector processor would require a non-trivial data rotation to perform efficiently, while on the G80 that would just waste time).


That's really interesting, do you know why this was done? Optimizing towards vector processors sometimes makes things look silly, but I assumed that it was just "the fast way" to do things in GPU land. Was there some sort of development that sped up scalar processors to the point that vector operations run at the same speed on either type of processor?
Quote:Original post by stanlo
That's really interesting, do you know why this was done?

By using scalar processors instead of vector ones, they are able to get 100% utilization out of them and reach peak performance. With vec4 processors, a good number of instructions will effectively not fully utilize the ALUs (e.g. most lighting calculations are vec3 or even scalar). Also, since the fragments going through the GPU already form a massively parallel workload, it makes sense to parallelize only at that level, rather than at that *and* the instruction level. The G80's design avoids wasted ALU cycles and needless code restructuring while maintaining peak performance even on fully (instruction-level) scalar code.
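For illustration, a minimal hypothetical HLSL fragment (function and parameter names made up here, not from the thread) showing why vec4 ALUs go underused: every operation in a typical diffuse term is vec3 or scalar, so at least one lane of a vec4 unit would sit idle, while scalar hardware can stay fully busy.

```hlsl
// Sketch of a typical diffuse lighting term: every operation is vec3 or
// scalar, so a vec4 ALU would leave at least one lane's worth of hardware
// idle, whereas scalar processors can stay fully utilized.
float3 DiffuseTerm(float3 n, float3 l, float3 lightColor, float3 albedo)
{
    float ndotl = saturate(dot(n, l));   // scalar result
    return albedo * lightColor * ndotl;  // vec3 * vec3 * scalar
}
```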

Quote:Original post by stanlo
Was there some sort of development that sped up scalar processors to the point that vector operations run at the same speed on either type of processor?

Parallelism is certainly how speed is achieved on GPUs, but as I mentioned above, there's really no need for two levels of parallelism. With respect to the hardware design, I can't comment as I don't know the complexity of the two designs, but it seems that NVIDIA figures that 128 scalar processors can pull better numbers overall than the equivalent transistor budget of vec4 processors. We'll see what R600 does and how it compares.
Quote:
Technically the hardware is not doing any compiling, the D3D API delivers standardised byte code to the driver which then translates it to hardware-specific micro code.


So how much optimization (if any) happens when the driver translates the byte code to hardware microcode?

Like in the case of the 8000 series, is there any attempt to optimize code that is unnecessarily vectorized?
Quote:Original post by Unfadable
Quote:
Technically the hardware is not doing any compiling, the D3D API delivers standardised byte code to the driver which then translates it to hardware-specific micro code.


So how much optimization (if any) happens when the driver translates the byte code to hardware microcode?

Like in the case of the 8000 series, is there any attempt to optimize code that is unnecessarily vectorized?


I think it's a safe call to say that the driver will optimize the shaders. The driver quite probably uses an SSA form to represent the code at some point, and at that stage, at least, unnecessary copies can easily be eliminated.
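As a hedged sketch of the kind of cleanup this enables (a hypothetical snippet, not from the thread): once the code is in SSA form, the unused lanes and the redundant copy below are trivially dead, leaving purely scalar math.

```hlsl
// Hypothetical "unnecessarily vectorized" code: the attenuation value is
// packed into a float4 even though only .x is ever read. In SSA form the
// .yzw lanes and the extra copy are obviously dead and can be eliminated,
// leaving plain scalar code for scalar hardware.
float Attenuation(float dist, float range)
{
    float4 att4   = float4(saturate(1.0 - dist / range), 0.0, 0.0, 0.0);
    float4 copyOf = att4;        // redundant copy
    return copyOf.x;             // only the scalar lane survives
}
```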
Quote:Original post by Unfadable
Quote:
Technically the hardware is not doing any compiling, the D3D API delivers standardised byte code to the driver which then translates it to hardware-specific micro code.


So how much optimization (if any) happens when the driver translates the byte code to hardware microcode?
Difficult to say for definite, but they'll do some optimization.

The big problem is that the driver (under D3D8/9) only sees the assembly byte code, so it can lack a lot of context (such as what the HLSL form would give) with regard to what the shader is really trying to do. That makes it a lot harder for the driver to aggressively optimize the code. There's relatively little stopping you from throwing complete rubbish at the driver via the D3D8/9 API...
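For illustration (a hypothetical shader with made-up names): under D3D8/9 the driver never sees HLSL like the snippet below; it receives only the compiled instruction stream (roughly a dcl/texld/mul/mov sequence), so the author's intent of "tint a texture" is already gone by the time the code arrives.

```hlsl
// Hypothetical tint shader. Under D3D9 the driver receives only the compiled
// byte code for this (roughly: dcl_2d s0 / dcl t0.xy / texld / mul / mov oC0),
// not the HLSL source, so most of the higher-level context is lost.
sampler2D baseMap   : register(s0);
float4    tintColor : register(c0);

float4 main(float2 uv : TEXCOORD0) : COLOR
{
    return tex2D(baseMap, uv) * tintColor;
}
```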

My understanding is that one motivation behind locking out ASM shaders in D3D10 (you can only author in HLSL now) is to improve this - the driver now has a lot more guarantees about the quality of the incoming byte code. I'd imagine the IHVs will get a lot of info from MS regarding typical compiler output for HLSL inputs.

hth
Jack

Jack Hoxley [ Forum FAQ | Revised FAQ | MVP Profile | Developer Journal ]

Nvidia's decision to move to a scalar architecture came in response to profiling hundreds of in-game shaders, only to find that each shader operation was using, on average, roughly 2 vector components. Roughly speaking, modern shaders were only using the standard vec4 implementation at 50% efficiency.
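As a hypothetical example of the kind of two-component work that profiling keeps turning up (names made up here): only the .xy lanes do anything useful, so a vec4 ALU runs at about half efficiency on it.

```hlsl
// Hypothetical example of a common two-component operation: UV scrolling.
// Only the .xy lanes carry useful work, so a vec4 ALU is ~50% utilized here.
float2 ScrollUV(float2 uv, float2 scrollRate, float time)
{
    return uv + scrollRate * time;
}
```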

The cost of the scalar architecture is that operations must be broken up and then re-assembled at the far end of the process. In concept, it's a bit like a modern CPU design, where instructions are broken into simpler micro-ops and multiple macro-ops may have portions of their micro-ops in flight on the CPU concurrently, only to be re-assembled before the results of the operation are written back.
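Conceptually (a sketch only, not Nvidia's actual micro-op encoding), a single vec4 multiply-add decomposes into four independent scalar operations that the hardware can issue and retire separately:

```hlsl
// A vec4 multiply-add as authored...
float4 MadVector(float4 a, float4 b, float4 c)
{
    return a * b + c;
}

// ...is, conceptually, four independent scalar MADs that scalar hardware can
// schedule on its own terms (a sketch, not the real micro-op stream):
float4 MadScalarized(float4 a, float4 b, float4 c)
{
    float4 r;
    r.x = a.x * b.x + c.x;
    r.y = a.y * b.y + c.y;
    r.z = a.z * b.z + c.z;
    r.w = a.w * b.w + c.w;
    return r;
}
```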

I think that eventually we'll see GPUs move to a hybrid double/packed-float architecture - in other words, an ALU which can perform a single 64-bit float operation or two 32-bit float operations. As we trend towards higher-quality rendering and graphics manufacturers begin to cater to scientific computing, having a 64-bit scalar ALU will become more important. Once we have that, it only makes sense to re-use the available resources when operating on 32-bit floats. It won't make for 100% utilization, but it will be the most efficient use of a 64-bit pipeline.

throw table_exception("(╯°□°)╯︵ ┻━┻");

Quote:Original post by ravyne2001
Nvidia's decision to move to a scalar architecture came in response to profiling hundreds of in-game shaders, only to find that each shader operation was using, on average, roughly 2 vector components. Roughly speaking, modern shaders were only using the standard vec4 implementation at 50% efficiency.


I have to say this discussion is very interesting... and completely new to me; I had assumed that GPUs would naturally become more and more optimized for vector processing. But this argument seems very convincing to me.

I can confirm that drivers do indeed optimize the shaders. It happens when a shader is first used.

One reason for this is that the driver may have to or want to recompile the shader based on renderstates, texture formats, etc.

For instance, if your shader reads from a signed or unsigned texture, this may appear to you as the same shader, but the driver may change or re-order the instructions based on this.

The same might be true if you mask off writing to dest alpha with a render state - the driver can eliminate any instructions that only affect the dest alpha result.
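A hypothetical case of that dest-alpha point (shader and names made up for illustration): if the application masks out the alpha channel via the color-write render state, everything that only feeds .a is dead, and the driver's render-state-aware recompile can strip it.

```hlsl
// Hypothetical shader: if D3DRS_COLORWRITEENABLE masks out alpha, the
// fog-factor math below feeds a channel that is never written, so the
// driver can eliminate it when it recompiles for those render states.
sampler2D diffuseMap : register(s0);
float     fogDensity : register(c0);   // assumed constant, illustration only

float4 main(float2 uv : TEXCOORD0, float depth : TEXCOORD1) : COLOR
{
    float3 rgb        = tex2D(diffuseMap, uv).rgb;
    float  fogToAlpha = saturate(exp(-fogDensity * depth));  // alpha-only work
    return float4(rgb, fogToAlpha);
}
```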

The first versions of the HLSL compiler would optimize for the R300 architecture, which was a mistake, imo. Architecture-specific changes belong in the driver. Other IHVs had to do extra driver work to undo some optimizations. I guess that's one of the benefits of being first...

