Performance comparison: HLSL Texture::GetDimensions
When writing a shader for some technique, you quite often need the resolution of a texture the shader uses (for example a render target bound via an RTV). Since there are two ways to obtain this resolution, I was wondering which one is actually faster. Consider the following:
1) I simply query the dimensions of the texture inside the shader (let's make it a pixel shader, executed many times) using texture.GetDimensions(w, h);
2) I could also pass the proper dimensions via a cbuffer. This needs an additional float2. Assume the cbuffer is updated every frame anyway (whether or not the texture dimensions are added), so only the additional bandwidth for the transfer and the load instruction inside the shader hit performance.
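As a rough sketch of option 2 (all names here are illustrative, not from the original post), the extra float2 simply rides along in a constant buffer that is already uploaded each frame:

```hlsl
// Hypothetical per-frame constant buffer; the float2 piggybacks on data
// that is updated every frame anyway.
cbuffer PerFrame : register(b0)
{
    float4x4 ViewProjection;   // example of data already uploaded per frame
    float2   RenderTargetSize; // texture width/height, filled in on the CPU
};

Texture2D    SceneTexture  : register(t0);
SamplerState LinearSampler : register(s0);

float4 PS(float2 uv : TEXCOORD) : SV_TARGET
{
    float2 texel = 1.0 / RenderTargetSize; // one texel in UV space
    return SceneTexture.Sample(LinearSampler, uv + texel);
}
```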
Now, which do you think is the faster way to do this? Or is the performance difference negligible (in which case one simply uses the more convenient method)?
Considering you can have 14 buffers of 4096 float4's each, that's nearly a megabyte of data, not counting the internal buffers we don't know about. That number is large enough that the hardware probably treats all the buffers as normal memory (as opposed to loading every value into on-chip registers as earlier shader hardware did), and merely relies on the cache to keep things fast. Under this assumption, two possibilities make sense to me: either there is a giant table of texture metadata somewhere that is updated whenever you change texture bindings, or the shader internally holds a pointer to the texture and knows how to dereference it to pull out the metadata. Considering you would need 5 sets of tables (one for each shader stage in the pipeline), I'm going to guess it's the latter.
There are up to 6 fields you can access: width, height, depth, numMips, arraySize, multisampleCount.
How many reads it takes to get these values is a mystery; I would expect the first four to be grouped together, but who knows.
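For reference, those six fields map onto the GetDimensions overloads of the different texture types roughly like this (signatures per the D3D11 HLSL docs; the variable names are my own):

```hlsl
Texture2D           tex2d;
Texture3D           tex3d;
Texture2DArray      texArray;
Texture2DMS<float4> texMS;

void QueryAll()
{
    uint w, h, d, mips, elements, samples;
    tex2d.GetDimensions(w, h);                       // width, height
    tex2d.GetDimensions(0, w, h, mips);              // + numMips (mip 0 queried)
    tex3d.GetDimensions(0, w, h, d, mips);           // + depth
    texArray.GetDimensions(0, w, h, elements, mips); // + arraySize
    texMS.GetDimensions(w, h, samples);              // + multisampleCount
}
```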
I would expect the constant buffer version to perform slightly better, especially since a lot of code needing this information immediately turns around and computes 1/dimension or 0.5/dimension before it can use the data in an equation, and you can precompute that on the CPU side. The constant buffer approach is also more flexible: you frequently need this information in one of the earlier shader stages, which don't have the texture bound directly.
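A minimal sketch of that CPU-side precomputation idea (names are illustrative): pack the size and its reciprocal into a single float4 so the shader never has to divide.

```hlsl
// Hypothetical layout: xy = (width, height), zw = (1/width, 1/height),
// both computed once on the CPU when the buffer is filled.
cbuffer PerFrame
{
    float4 TextureSize;
};

float2 PixelToUV(float2 pixel)
{
    // texel-center UV, using the precomputed reciprocal instead of a divide
    return (pixel + 0.5) * TextureSize.zw;
}
```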
However, the API method handles all the texture types and texture arrays correctly, which I imagine would be a rather painful mess to support fully in a general manner (short of forcing all texture array slices to the same dimensions in your engine, or banning texture arrays for the same reason).
My apologies for waking up this old thread, but I ran a quick test to verify the cost of Texture.GetDimensions. The code used for the micro-benchmark is as follows (as usual, it is a micro-benchmark with potentially some flaws...):
float2 Texture1Size;
float2 Texture2Size;
float2 Texture3Size;
float2 Texture4Size;
float2 Texture5Size;
float2 Texture6Size;
float2 Texture7Size;
float2 Texture8Size;
Texture2D Texture1;
Texture2D Texture2;
Texture2D Texture3;
Texture2D Texture4;
Texture2D Texture5;
Texture2D Texture6;
Texture2D Texture7;
Texture2D Texture8;
SamplerState PointSampler;
float2 dim(Texture2D textureObj)
{
uint width;
uint height;
textureObj.GetDimensions(width, height);
return float2(width, height);
}
static const float2 MaxSize = float2(1024, 1024);
float4 PSRaw(float2 tex : TEXCOORD) : SV_TARGET
{
float3 color = float3(tex, 0);
color += Texture1.Sample(PointSampler, float2(0,0)).rgb;
color += Texture2.Sample(PointSampler, float2(0,0)).rgb;
color += Texture3.Sample(PointSampler, float2(0,0)).rgb;
color += Texture4.Sample(PointSampler, float2(0,0)).rgb;
color += Texture5.Sample(PointSampler, float2(0,0)).rgb;
color += Texture6.Sample(PointSampler, float2(0,0)).rgb;
color += Texture7.Sample(PointSampler, float2(0,0)).rgb;
color += Texture8.Sample(PointSampler, float2(0,0)).rgb;
return float4(color, 1);
}
float4 PSUsingGetDimension(float2 tex : TEXCOORD) : SV_TARGET
{
float2 size = dim(Texture1);
size += dim(Texture2);
size += dim(Texture3);
size += dim(Texture4);
size += dim(Texture5);
size += dim(Texture6);
size += dim(Texture7);
size += dim(Texture8);
size /= 8;
size /= MaxSize;
float3 color = float3(size * tex, 0);
color += Texture1.Sample(PointSampler, float2(0,0)).rgb;
color += Texture2.Sample(PointSampler, float2(0,0)).rgb;
color += Texture3.Sample(PointSampler, float2(0,0)).rgb;
color += Texture4.Sample(PointSampler, float2(0,0)).rgb;
color += Texture5.Sample(PointSampler, float2(0,0)).rgb;
color += Texture6.Sample(PointSampler, float2(0,0)).rgb;
color += Texture7.Sample(PointSampler, float2(0,0)).rgb;
color += Texture8.Sample(PointSampler, float2(0,0)).rgb;
return float4(color, 1);
}
float4 PSUsingCBuffer(float2 tex : TEXCOORD) : SV_TARGET
{
float2 size = Texture1Size;
size += Texture2Size;
size += Texture3Size;
size += Texture4Size;
size += Texture5Size;
size += Texture6Size;
size += Texture7Size;
size += Texture8Size;
size /= 8;
size /= MaxSize;
float3 color = float3(size * tex, 0);
color += Texture1.Sample(PointSampler, float2(0,0)).rgb;
color += Texture2.Sample(PointSampler, float2(0,0)).rgb;
color += Texture3.Sample(PointSampler, float2(0,0)).rgb;
color += Texture4.Sample(PointSampler, float2(0,0)).rgb;
color += Texture5.Sample(PointSampler, float2(0,0)).rgb;
color += Texture6.Sample(PointSampler, float2(0,0)).rgb;
color += Texture7.Sample(PointSampler, float2(0,0)).rgb;
color += Texture8.Sample(PointSampler, float2(0,0)).rgb;
return float4(color, 1);
}
All Texture# are distinct texture instances, and the bindings are cycled by one each frame to avoid any driver optimization.
All Texture#Size values are uploaded each frame.
And the results are:
- PSRaw (sampling the textures directly) and PSUsingCBuffer (sampling the textures + using each texture's size) give similar results. Take a basis of 1 ms (measured from the CPU); both show an AvgCycle of 3 in ShaderAnalyzer (I took one card as reference).
- PSUsingGetDimension (sampling the textures + Texture.GetDimensions) adds significant overhead, at 2 ms (compared to the previous 1 ms basis), confirmed by an AvgCycle of 6 (double the result).
This seems quite normal considering the number of instructions added when using Texture.GetDimensions (fetching the size from the texture and converting the uints to floats).
So when we are trying to save a few GPU cycles, we had better avoid Texture.GetDimensions and prefer passing the texture dimensions in a cbuffer directly.
Real-world PS usage may well mask the overhead; the int-to-float conversions are (afaik) done in the transcendental ALU units, and if you have "ordinary" (i.e. simple math) PS instructions that can run in parallel with them, you may not notice the overhead at all. The same goes for parallel data fetch and ALU ops.
I believe that the actual GetDimensions operation is equivalent to reading a structure out of a constant buffer, because it practically is just that.
The compiler can optimize the memory access patterns by rearranging instructions so that ops run as much in parallel as possible, and the driver can further optimize the intermediate shader bytecode to cater to particular hardware characteristics and internal high-performance operations not exposed to D3D. This is why such simple benchmarks may not actually hold water when tested in real-world scenarios, and why performance should be profiled on as broad a range of system configurations as possible.
Indeed, as it is a micro-benchmark, it will probably not make a huge difference in a real-world scenario, and yes, GPU instruction scheduling/latency could hide this kind of thing.
FYI, I ran the test without texture sampling and without the float conversion, basically comparing raw access through GetDimensions against a constant buffer: GetDimensions is 80% slower than a cbuffer access (on my particular config, etc.) for the same data (width, height). So GetDimensions is not equivalent to a constant buffer access. I don't know how it is implemented, but there are lots of signatures to handle mip levels etc., so it is certainly not as fast as a raw constant buffer for the simple case of accessing width/height.
But yeah, this is of course not something that is going to significantly boost any HLSL code out there... ;)