
Performance comparison: HLSL Texture::GetDimensions



#1 Julian Mautner   Members   -  Reputation: 157


Posted 05 July 2011 - 01:59 AM

Hi folks!

If you have a shader for some technique, it happens quite often that you need the resolution of a certain texture the shader uses (an RTV, say). Since there are two methods to get this resolution, I was wondering which one is actually faster. Consider the following:

1) I simply query the dimensions of the texture inside the shader (let's make it a pixel shader, executed many times) using texture.GetDimensions(w, h);

2) I could also pass the proper dimensions via a cbuffer. This would need an additional float2. Take the case where the cbuffer is updated every frame anyway (whether or not the texture dimensions are added), so that only the additional bandwidth for the transfer and the load instruction inside the shader hit performance.

Now which do you think is the faster way to do this? Or is the performance difference negligible (in which case one would simply use the more convenient method)?
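
For concreteness, here is a minimal sketch of the two options (the resource and cbuffer names are made up):

// Option 1: query the bound texture directly in the shader.
Texture2D RenderTarget;

// Option 2: the application writes the size into a per-frame cbuffer.
cbuffer PerFrame
{
	float2 RenderTargetSize; // filled on the CPU from the texture description
};

float2 SizeViaQuery()
{
	uint w, h;
	RenderTarget.GetDimensions(w, h);
	return float2(w, h);
}

float2 SizeViaCBuffer()
{
	return RenderTargetSize;
}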


Team leader of stillalive|studios
our current project: Son of Nor (facebook, @sasGames, website)


#2 Zoner   Members   -  Reputation: 232


Posted 05 July 2011 - 03:07 AM

Interesting question; I don't know the answer, but it should be possible to extrapolate:

Considering you can have 14 buffers at 4096 float4s each, that's nearly a megabyte of data, not counting the internal buffers we don't know about. This number is large enough that the hardware probably treats all the buffers as normal memory (as opposed to loading all the values into on-chip registers, as the earlier shader hardware did) and merely relies on the cache to keep things fast. Under this assumption, there are two possibilities that make sense to me: either there is a giant table containing texture metadata somewhere which is updated whenever you change texture bindings, or the shader internally has a pointer to the texture and knows how to dereference it to pull out the metadata. Considering you would need five sets of tables (one for each shader stage in the pipeline), I'm going to have to guess it's the latter case.

There are up to 6 fields you can access: width, height, depth, numMips, arraySize, multisampleCount.

How many reads it takes to get these values is a mystery; I would expect the first four to be grouped together, but who knows.

I would expect the constant buffer version to perform slightly better, especially since a lot of code needing this information immediately turns around and computes 1/dimension or 0.5/dimension before being able to use the data in an equation, and you can precompute that on the CPU side. The constant buffer approach is also more flexible, in that you frequently need access to this information in one of the earlier shader stages, which don't directly have the texture bound.
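
A sketch of that precomputation idea, assuming a hypothetical cbuffer layout: the CPU packs the dimensions and their reciprocals into one float4, so the shader never has to divide.

cbuffer PerFrame
{
	// x = width, y = height, z = 1/width, w = 1/height,
	// all precomputed on the CPU once per frame.
	float4 TexDims;
};

float2 TexelToUV(float2 texel)
{
	// One multiply instead of a divide per use.
	return texel * TexDims.zw;
}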

However, the API method handles all the texture types and texture arrays correctly, which I imagine would be a rather painful mess to fully support in a general manner (short of forcing all texture array slices to be the same dimensions in your engine, or banning texture arrays for the same reason).
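
For instance, the array overload of GetDimensions reports the slice count alongside width and height (a minimal sketch; the resource name is hypothetical):

Texture2DArray ShadowSlices;

float3 ArrayDims()
{
	uint w, h, slices;
	ShadowSlices.GetDimensions(w, h, slices);
	return float3(w, h, slices);
}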

#3 Julian Mautner   Members   -  Reputation: 157


Posted 05 July 2011 - 11:30 AM

It would be interesting to get a response from one of the graphics-hardware companies ;-) Any chance there is someone here?
Team leader of stillalive|studios
our current project: Son of Nor (facebook, @sasGames, website)

#4 MJP   Moderators   -  Reputation: 10632


Posted 05 July 2011 - 03:52 PM

I don't think anyone from either of the IHVs frequents this forum. There is someone here who works on the DirectX team at Microsoft, who may have more information. If I had to guess, I would say that both of them probably implement GetDimensions by sticking the data in whatever their low-level representation of the immediate constant buffer is. Should be pretty easy to set up a performance test, though. :P

#5 xoofx   Members   -  Reputation: 862


Posted 17 January 2013 - 09:09 AM

My apologies for waking up this old thread :) but I ran a quick test to verify the cost of Texture.GetDimensions. The code used for the micro-benchmark is shown below (as usual, it is a micro-benchmark, with potentially some flaws...):

// Per-texture sizes, uploaded by the CPU every frame.
float2 Texture1Size;
float2 Texture2Size;
float2 Texture3Size;
float2 Texture4Size;
float2 Texture5Size;
float2 Texture6Size;
float2 Texture7Size;
float2 Texture8Size;

// Eight distinct textures; the bindings are cycled by one each frame.
Texture2D Texture1;
Texture2D Texture2;
Texture2D Texture3;
Texture2D Texture4;
Texture2D Texture5;
Texture2D Texture6;
Texture2D Texture7;
Texture2D Texture8;
SamplerState PointSampler;

// Helper: returns a texture's width/height as a float2 via GetDimensions.
float2 dim(Texture2D textureObj)
{
	uint width;
	uint height;
	textureObj.GetDimensions(width, height);
	return float2(width, height);
}

static const float2 MaxSize = float2(1024, 1024);

// Baseline: sample all eight textures, with no dimension queries.
float4 PSRaw(float2 tex : TEXCOORD) : SV_TARGET
{
	float3 color = float3(tex, 0);
	color += Texture1.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture2.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture3.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture4.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture5.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture6.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture7.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture8.Sample(PointSampler, float2(0,0)).rgb;

	return float4(color, 1);
}

// Variant 1: fetch each texture's size in the shader with GetDimensions.
float4 PSUsingGetDimension(float2 tex : TEXCOORD) : SV_TARGET
{
	float2 size = dim(Texture1);
	size += dim(Texture2);
	size += dim(Texture3);
	size += dim(Texture4);
	size += dim(Texture5);
	size += dim(Texture6);
	size += dim(Texture7);
	size += dim(Texture8);
	size /= 8;
	size /= MaxSize;

	float3 color = float3(size * tex, 0);
	color += Texture1.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture2.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture3.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture4.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture5.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture6.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture7.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture8.Sample(PointSampler, float2(0,0)).rgb;

	return float4(color, 1);
}

// Variant 2: read each texture's size from the constant buffer.
float4 PSUsingCBuffer(float2 tex : TEXCOORD) : SV_TARGET
{
	float2 size = Texture1Size;
	size += Texture2Size;
	size += Texture3Size;
	size += Texture4Size;
	size += Texture5Size;
	size += Texture6Size;
	size += Texture7Size;
	size += Texture8Size;
	size /= 8;
	size /= MaxSize;

	float3 color = float3(size * tex, 0);
	color += Texture1.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture2.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture3.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture4.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture5.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture6.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture7.Sample(PointSampler, float2(0,0)).rgb;
	color += Texture8.Sample(PointSampler, float2(0,0)).rgb;

	return float4(color, 1);
}

All Texture# are distinct texture instances, and their bindings are cycled by one each frame in order to avoid any driver optimization.

All Texture#Size values are uploaded every frame.

 

And the results are:

  • PSRaw (sampling the textures directly) and PSUsingCBuffer (sampling the textures + using each texture's size from the cbuffer) have similar results. Let's take a baseline of 1ms (measured from the CPU); both show an AvgCycle of 3 in ShaderAnalyzer (I took one card as reference).
  • PSUsingGetDimension (sampling the textures + Texture.GetDimensions) adds significant overhead, at 2ms (compared to the previous 1ms baseline), confirmed by an AvgCycle of 6 (double the result).

This seems quite normal considering the number of instructions added when using Texture.GetDimensions (fetch the size from the texture and convert uint to float).

 

So when we are seeking to save a few GPU cycles, we had better not use Texture.GetDimensions and instead pass the texture dimensions in a cbuffer directly.



#6 Nik02   Crossbones+   -  Reputation: 2739


Posted 18 January 2013 - 04:22 AM

Real-world PS usage may well mask the overhead; the float-int conversions are (AFAIK) done in the transcendental ALU units, and if you have "ordinary" (i.e. simple math) PS instructions that can run in parallel with those, you may not notice the overhead at all. The same goes for parallel data fetches and ALU ops.

 

I believe that the actual GetDimensions operation is equivalent to reading a structure out of a constant buffer, because in practice that is just what it is.

 

The compiler can optimize the memory access patterns by rearranging the instructions so that ops run in parallel as much as possible, and the driver can further optimize the intermediate shader bytecode to cater to particular hardware characteristics and internal high-performance operations not exposed to D3D. This is why such simple benchmarks may not actually hold water when tested in real-world scenarios, and why performance should be profiled on as broad a range of system configurations as possible.


Niko Suni


#7 xoofx   Members   -  Reputation: 862


Posted 18 January 2013 - 06:26 AM

Real-world PS usage may well mask the overhead; the float-int conversions are (AFAIK) done in the transcendental ALU units, and if you have "ordinary" (i.e. simple math) PS instructions that can run in parallel with those, you may not notice the overhead at all. The same goes for parallel data fetches and ALU ops.

 

I believe that the actual GetDimensions operation is equivalent to reading a structure out of a constant buffer, because in practice that is just what it is.

Indeed; as it is a micro-benchmark, it will probably not make a huge difference in a real-world scenario, and yes, GPU instruction scheduling/latency could hide this kind of thing.

 

FYI, I ran the test without texture sampling and without the float conversion, so basically comparing raw access through GetDimensions against raw access to a constant buffer: GetDimensions is 80% slower than an access to a cbuffer (on my particular config, etc.) for the same data (width, height). So GetDimensions is not equivalent to a constant buffer access. I don't know how it is implemented, but there are lots of signatures to handle mip levels, etc., so it is certainly not as fast as a raw constant buffer read for the simple case of accessing width/height.
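
A stripped-down pair of shaders along those lines might look like this (a reconstruction of the idea, not the exact code from the test):

float2 Texture1Size;	// uploaded by the CPU every frame
Texture2D Texture1;

// Raw GetDimensions access; the uint-to-float conversion happens only at the output.
float4 PSDimOnly(float2 tex : TEXCOORD) : SV_TARGET
{
	uint w, h;
	Texture1.GetDimensions(w, h);
	return float4(w, h, 0, 1);
}

// Raw constant buffer access to the same data.
float4 PSCBufferOnly(float2 tex : TEXCOORD) : SV_TARGET
{
	return float4(Texture1Size, 0, 1);
}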

 

But yeah, this is of course not something that is going to significantly boost any HLSL code out there... ;)





