#ActualHodgman

Posted 09 October 2012 - 06:41 AM

A branching version is quite likely to run even slower than just calling pow directly

True that, however, the OP is only planning on using that function with constant literal arguments, and hoping that the HLSL compiler then goes ahead and evaluates the branches at compile-time.

If you compile with fxc you can output the intermediate directx assembly text file, and just check what pow(x,2) was transformed to.

The OP mentioned that they did that, and it wasn't optimised. They're wondering if the D3D ASM -> GPU ASM compilation process done by your driver will actually perform this optimisation or not.
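For anyone following along, here's roughly the kind of helper being discussed. The OP's actual code isn't quoted here, so the name `PowLiteral`, the set of exponents it handles and the `main` entry point are all made up for illustration. The idea is that when `n` is a literal at the call site, the compiler can in principle fold the branches away and leave only the multiplies; whether it really does is something you have to check in the generated assembly.

```hlsl
// Hypothetical helper (not the OP's code): replaces pow() with plain
// multiplies for small integer exponents.
float PowLiteral(float x, int n)
{
    if (n == 1) return x;
    if (n == 2) return x * x;
    if (n == 3) return x * x * x;
    if (n == 4) { float x2 = x * x; return x2 * x2; }
    return pow(x, (float)n); // fallback for everything else
}

// Minimal pixel shader that calls the helper with a literal exponent,
// so every branch condition above is known at compile time.
float4 main(float4 color : COLOR0) : COLOR0
{
    float s = PowLiteral(color.r, 2);
    return float4(s, s, s, 1.0);
}

// To see what the compiler emitted, ask fxc for an assembly listing, e.g.:
//   fxc /T ps_3_0 /E main /Fc shader.asm shader.hlsl
// (the file names are placeholders)
```

That listing only shows the D3D assembly. What the driver does when it turns that into native GPU code can't be seen this way.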

The general consensus on this topic is that the compiler will most likely be better than your handwritten optimization; maybe not today, but tomorrow.

I actually trust my HLSL compiler more than my C++ compiler for a lot of things, but they're still very dumb sometimes.
I've gotten a full 2x speed-up by 'massaging' my HLSL code into something uglier that the compiler could digest more easily... Actually, we wouldn't have been able to ship our last game running at 30Hz on the min specs if we hadn't hand-optimized all the HLSL code to do things that I originally assumed the compiler would be smart enough to do.

That said, the rules for "fast HLSL" change from time to time. If you're targeting a DX9-era card, you want to hand-vectorize your code to use all 4 components of a float4 wherever possible to reduce instruction counts (FXC does actually do a good job of auto-vectorizing, but not as good as a human). If you're targeting a DX10-era card, you want to operate on only the components you actually need (e.g. if you only need xyz, make sure to use a float3, or a float4 with .xyz on the end) and not be afraid of scalar math.
So, you can help the compiler to produce much faster code, but you do also need to know which kind of GPU architecture you're targeting while micro-optimizing your HLSL code.
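A rough sketch of the two styles, in case it helps. The function names and the scale/bias and tint operations are invented for the example, and which form is actually faster on a given card is something to measure rather than assume.

```hlsl
// Scalar-style code: four independent multiply-adds on separate floats.
float4 ScaleBiasScalar(float a, float b, float c, float d, float scale, float bias)
{
    float ra = a * scale + bias;
    float rb = b * scale + bias;
    float rc = c * scale + bias;
    float rd = d * scale + bias;
    return float4(ra, rb, rc, rd);
}

// Hand-vectorized for DX9-era vector hardware: pack the values into one
// float4 so a single vector multiply-add can cover all four lanes.
float4 ScaleBiasVector(float4 v, float scale, float bias)
{
    return v * scale + bias;
}

// DX10-era style: only the components that are needed take part in the math.
// The .xyz swizzle makes it explicit that alpha is unused here.
float3 TintRgb(float4 color, float3 tint)
{
    return color.xyz * tint;
}
```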

Also, keep in mind that 1080p has ~2M pixels (1920 x 1080 = 2,073,600), making your pixel shaders the most intensely burdened tight loop in your entire code base, which means a small inefficiency can have a very large impact.
