HLSL pack two values into one component of a 4x16_UNORM target


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

12 replies to this topic

#1 B_old   Members   -  Reputation: 668

Posted 08 May 2009 - 05:23 AM

Hi, I have a DXGI_FORMAT_R16G16B16A16_FLOAT render target (could switch to float if it helps) and want to encode two different values in the last component. One value will be in the range [0, 1] in the pixel shader, and the other is typically in a range like [10, 128]. Is this possible? I tried this, with limited success:
float packSpecularCoefAndPow(float coef, float pow)
{
	return float((uint(coef * 255.0f)) << 16 | uint(pow));
}

float2 unpackSpecularCoefAndPow(float coefAndPow)
{
	uint tmp = uint(coefAndPow);
	return float2(float(tmp >> 16) / 255.0f, tmp & 0x0000ffff);
}

Is something like this possible?

#2 B_old   Members   -  Reputation: 668

Posted 09 May 2009 - 01:43 AM

OK, so I came up with this solution.

float packSpecularCoefAndPow(float coef, float pow)
{
	uint sc = uint(coef * 255.f);
	uint sp = uint(pow);

	return float(sc << 8 | sp) / 65536.f;
}

float2 unpackSpecularCoefAndPow(float coefAndPow)
{
	uint tmp = uint(coefAndPow * 65536.f);

	return float2(float(tmp >> 8) / 255.f, float(tmp & 0x000000ff));
}


The results are good. Does anybody see a faster way to do this?

#3 Nik02   Crossbones+   -  Reputation: 2918

Posted 09 May 2009 - 01:59 AM

Bit shifts are usually executed on so-called "transcendental ALUs", which are scarcer than basic ALUs. If you can refactor the shifts into floating-point multiplies, the hardware can execute the logic on basic ALUs. This generally gives better overall arithmetic performance and frees the transcendental units for tasks that actually require them, reducing potential bottlenecks that can cause latency.

Also, runtime conversions between ints and floats are relatively expensive so it might be worthwhile to try to eliminate them as much as you can.

Your current code has some room for optimization in these areas, but it will provide correct results as-is.
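As a rough illustration of the refactoring suggested here, the shift and mask in the pack/unpack pair posted above can be written with floor, multiply and subtract only. This is a sketch, not a tuned implementation. It assumes a 16-bit UNORM channel (as in the thread title), coef in [0, 1] and an integer specular power in [0, 255]; the function names and the specPow parameter are made up for the example.

```hlsl
// Sketch only: float-only replacement for the shift/mask version.
// Assumes the value is written to a 16-bit UNORM channel without blending.
float packSpecularCoefAndPowFloat(float coef, float specPow)
{
    // Quantize both inputs to 8-bit integers, still stored as floats.
    float hi = floor(saturate(coef) * 255.0f + 0.5f);          // 0..255
    float lo = clamp(floor(specPow + 0.5f), 0.0f, 255.0f);     // 0..255

    // hi * 256 plays the role of (sc << 8); adding lo plays the role of (| sp).
    // Dividing by 65535 maps the 16-bit integer onto the UNORM range [0, 1].
    return (hi * 256.0f + lo) / 65535.0f;
}

float2 unpackSpecularCoefAndPowFloat(float coefAndPow)
{
    // Recover the 16-bit integer; the +0.5 guards against rounding error.
    float v  = floor(coefAndPow * 65535.0f + 0.5f);
    float hi = floor(v / 256.0f);      // same as (v >> 8)
    float lo = v - hi * 256.0f;        // same as (v & 0xff)

    return float2(hi / 255.0f, lo);
}
```

The round trip is only exact if the render target stores all 16 bits, which holds for UNORM but not for a 16-bit float channel. Whether this is actually faster than the integer version depends on the hardware and should be measured.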

#4 B_old   Members   -  Reputation: 668

Posted 10 May 2009 - 02:56 AM

Hi Nik02, thanks for the answer.
The best I can come up with is this:

float packSpecularCoefAndPow(float coef, float pow)
{
	return float(uint(coef * 255.f) * 256 | uint(pow));
	//return float(uint(coef * 65280.f)| uint(pow)); doesn't work so good...
}


float2 unpackSpecularCoefAndPow(float coefAndPow)
{
	return float2(coefAndPow / 65280.f, float(uint(coefAndPow) & 0x000000ff));
}




It is slightly faster than my first method. Another thing I noticed is that pow should be somewhere between 32 and 196 or so for both versions, but that is OK as I don't need a broader range.
I still have one integer-div and some conversions though...

[Edited by - B_old on May 10, 2009 11:56:47 AM]

#5 Nik02   Crossbones+   -  Reputation: 2918

Posted 10 May 2009 - 04:13 AM

I don't see an integer division in that code. Integer division is a transcendental operation, so it is best to avoid it when you don't need it.

As it happens, float division is also transcendental, but in your case the compiler will convert the division to a multiply (which is cheaper) if optimizations are on, because the divisor is a constant.

With regard to conversions, it is best to keep the data in the same data type across the entire pipeline, if at all possible. If not, the conversion ops are always available but will reduce performance on both CPU and GPU.

If you want more background info on these optimizations, I recommend reading the Radeon programming guide (AMD/ATI) as well as the NVidia GPU programming guide, both available for free at their respective developer sites. While the ALU architecture is slightly different between these platforms, the general concepts are mostly the same because most of their internal algorithms are the same.

#6 B_old   Members   -  Reputation: 668

Posted 10 May 2009 - 07:09 AM

You are right, I meant an integer mul...

I now have this version that relies only on floating-point arithmetic. I hope.

float packSpecularCoefAndPow(float coef, float pow)
{
	return (pow + clamp(coef, 0.f, 0.999f)) * 10.f; // * 10.f, because I would lose the coef sometimes. pow = 128 and coef = 0.1 were bad for example
}

float2 unpackSpecularCoefAndPow(float coefAndPow)
{
	coefAndPow *= 0.1f;

	float coef = frac(coefAndPow);
	float pow = coefAndPow - coef;

	return float2(coef, pow);
}



I can't really say that it is faster though.
At least the code is more readable and I should be able to use lower values for pow without a problem.
Any more comments on this one, Nik02? :) Thanks for the help!

[Edited by - B_old on May 10, 2009 2:09:50 PM]

#7 Nik02   Crossbones+   -  Reputation: 2918

Posted 10 May 2009 - 09:37 AM

I don't remember, off the top of my head, whether or not "frac" was one of the expensive instructions.

However:

Your current packing code has a signal leak between pow and coef. That said, if it works with your input values, go ahead and use it. I would use integer maths so as to precisely combine the bits of the values.

When I hinted about avoiding conversions, I had in mind an implementation that would take in integers during pack, and float2 during unpack or vice versa, depending on your needs.

It is worth considering that sometimes float-int conversions really are the best tool for a problem. Also, ints are easier (and more precise) to handle when you want to cram some raw bits together. BUT, if you can use float arithmetic to arrive at the same conclusion, it will usually suit the current GPUs better.

Performance-wise, I think you have relatively tight code now. To suggest more optimizations, I would need to do a more complete analysis of the actual usage of the functions.

Also, an important fact is that with real games, the CPU->GPU calls are often the bottleneck in modern machines. Thus, if your GPU is not the limiting factor to begin with, it is not worth it to spend your time to micro-optimize the shaders - it is enough to make them work fast enough so as to not limit the whole system performance. Usually that time is more wisely spent on efficient scene management code that minimizes device state changes and draw calls. This is the reason why you should do your profiling and performance-tuning in as close to real-world conditions as possible.

[Edited by - Nik02 on May 10, 2009 3:37:07 PM]

#8 Nik02   Crossbones+   -  Reputation: 2918

Posted 10 May 2009 - 10:04 AM

Don't forget that with D3D10 and up, it is possible to interpret data in groovy ways. The shader intrinsics starting with "as" can be used to reinterpret the representation of data from one data type as another. This means that you could carry integer data in your buffer's (texture's) channels that are typed as float. If you can wrap your head around this concept, I think it would be very appropriate for your scenario, and it would effectively avoid unnecessary conversions.

#9 B_old   Members   -  Reputation: 668

Posted 10 May 2009 - 08:15 PM

I suppose there is nothing I can do about the signal leak between those values if I stick to float arithmetic?

I took a look at the as*-functions and I don't immediately see an obvious benefit. It won't change my bit-pattern, so I can't say all my coef-parts are in the first x bits and all the other parts are in the last x bits, or something like that. Maybe I just haven't thought about this for long enough though.

While I have your attention I'd like to ask another question: Is there a best practice to distribute a 32bit float to two 16bit components? I am having the problem that 16bit depth in my deferred shading is not enough when I also use shadowmaps. I'm not really sure yet, whether I should bother, because in order to free up another 16bits I would have to start packing albedo-colors for example. Somehow it starts getting messy then. Also, I don't know when all the packing/unpacking will start to become as much of a performance penalty as the higher bandwidth of an extra rendertarget would be.


#10 Nik02   Crossbones+   -  Reputation: 2918

Posted 10 May 2009 - 09:12 PM

Quote:
Original post by B_old
I suppose there is nothing I can do about the signal leak between those values if I stick to float arithmetic?


It is difficult to emulate bit-level operations on floats, so you'd need considerably more complex code. It is not impossible, but may be impractical.
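One way to limit the leak while staying in float arithmetic is to quantize both values to integers first and keep the combined value small enough to be stored exactly. The sketch below assumes a 16-bit FLOAT channel, which has 11 significant bits and therefore stores every integer up to 2048 exactly. It also assumes an integer specular power in [0, 255] and accepts only 8 levels for coef; the function names and the specPow parameter are invented for the example.

```hlsl
// Sketch only: float-only packing that stays within the exact integer
// range of a 16-bit float (0..2047), so the two fields cannot bleed
// into each other.
float packQuantized(float coef, float specPow)
{
    float p = clamp(floor(specPow + 0.5f), 0.0f, 255.0f);   // 8 bits: 0..255
    float c = floor(saturate(coef) * 7.0f + 0.5f);          // 3 bits: 0..7

    return p * 8.0f + c;                                     // integer in [0, 2047]
}

float2 unpackQuantized(float coefAndPow)
{
    float p = floor(coefAndPow / 8.0f);   // upper field
    float c = coefAndPow - p * 8.0f;      // lower field, 0..7

    return float2(c / 7.0f, p);
}
```

The split between the two fields is a trade-off. If the power never exceeds 127, multiplying by 16 instead of 8 leaves 16 levels for coef while the maximum (127 * 16 + 15 = 2047) still fits.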

Quote:

I can't say all my coef-parts are in the first x bits and all the other parts are in the last x bits


Oh but you can... ;)

Note that you can both output and input different representations of the values, regardless of where they come from or go to.

Quote:

While I have your attention I'd like to ask another question: Is there a best practice to distribute a 32bit float to two 16bit components? I am having the problem that 16bit depth in my deferred shading is not enough when I also use shadowmaps. I'm not really sure yet, whether I should bother, because in order to free up another 16bits I would have to start packing albedo-colors for example. Somehow it starts getting messy then. Also, I don't know when all the packing/unpacking will start to become as much of a performance penalty as the higher bandwidth of an extra rendertarget would be.


I would do all necessary packing in integer space if at all possible. I don't remember if there was a general best practice for this scenario.

While modern cards have a lot of processing power, memory bandwidth hasn't evolved at the same rate. Therefore, you can write quite complex shader logic before becoming ALU-bound.

The performance will ultimately depend on what else you're doing with the hardware, so it is impossible to say what the best approach is going to be in your particular application.

My ultimate recommendation is not to over-optimize first; just make it work and also write your other scene code so you actually draw all the stuff (or dummies of comparable complexity) that you would draw in the final version. Then, run the stuff on PIX and begin to see where the actual bottlenecks are. Now, you are in the position to make informed decisions as to where to actually optimize your code. Rinse and repeat :)
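For the question about distributing a 32-bit depth value over two 16-bit components, one common approach is to store a coarse part and a fine remainder. The following is a sketch under stated assumptions rather than an established best practice: both channels are 16-bit UNORM, depth is normalized to [0, 1], and the target is written without blending and read back without filtering. The function names are hypothetical.

```hlsl
// Sketch only: split a normalized depth into a coarse and a fine part,
// each stored in its own 16-bit UNORM channel.
float2 packDepth16x2(float depth)
{
    float scaled = saturate(depth) * 65535.0f;
    float coarse = floor(scaled);          // integer steps of 1/65535
    float fine   = scaled - coarse;        // remainder in [0, 1)

    return float2(coarse / 65535.0f, fine);
}

float unpackDepth16x2(float2 packedDepth)
{
    // coarse is already in depth units; fine is a fraction of one coarse step.
    return packedDepth.x + packedDepth.y / 65535.0f;
}
```

The reconstructed value is limited by the 24 significant bits of the 32-bit float used in the shader, which is still far finer than a single 16-bit channel.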

#11 B_old   Members   -  Reputation: 668

Posted 11 May 2009 - 03:23 AM

Quote:
Original post by Nik02
Quote:
Original post by B_old

I can't say all my coef-parts are in the first x bits and all the other parts are in the last x bits



Oh but you can... ;)

Note that you can both output and input different representations of the values, regardless of where they come from or go to.

The way I understood it, as*() won't change the way the bits are laid out. Isn't it problematic when the shader variable is 32bit but is gonna be output to 16bit?
I guess I have to see a practical use of as*() in order to get some inspiration first.


#12 Nik02   Crossbones+   -  Reputation: 2918

Posted 11 May 2009 - 07:27 AM

The very point here is that the as* functions won't change how the bits are laid out. Remember, even though floats have more complex representation than ints, they are still constructed entirely from bits with well-defined algorithms.

Ask yourself, what stops you from constructing floats and halfs from bits yourself? And following that, what stops you from writing them out in any format, provided adequate space is available in the destination? When you can answer these, the solution may well become obvious [smile]
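A possible reading of this hint, sketched under several assumptions that are not from the thread: Shader Model 5 is available (for the f16tof32 and f32tof16 intrinsics), the channel is a 16-bit FLOAT written without blending, coef is in [0, 1], and the specular power is an integer in [0, 127]. The idea is to build the 16 bits of a half directly and hand the shader a float that converts to exactly that bit pattern. The function names and the bit layout are invented for illustration, and whether a given driver preserves the pattern exactly should be verified.

```hlsl
// Sketch only: assemble the bits of a half-precision float by hand.
float packBitsAsHalf(float coef, float specPow)
{
    uint c = (uint)(saturate(coef) * 255.0f + 0.5f);   // 8 bits
    uint p = (uint)clamp(specPow, 0.0f, 127.0f);       // 7 bits
    uint n = (p << 8) | c;                             // 15-bit payload

    // Keep the 5-bit exponent field away from 0 (denormals) and 31 (Inf/NaN):
    // the low 14 payload bits plus 1024 give an exponent field of 1..16,
    // and the remaining payload bit is stored in the sign bit.
    uint halfBits = ((n & 0x3FFF) + 1024) | ((n >> 14) << 15);

    // f16tof32 yields the float whose half representation is halfBits.
    return f16tof32(halfBits);
}

float2 unpackBitsAsHalf(float stored)
{
    uint halfBits = f32tof16(stored) & 0xFFFF;
    uint n = ((halfBits & 0x7FFF) - 1024) | ((halfBits >> 15) << 14);

    return float2((n & 0xFF) / 255.0f, (float)(n >> 8));
}
```

Because every generated pattern is a normal half-precision number, the conversion on write and the conversion on read should both be exact, so the two fields stay separated at the bit level.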

#13 B_old   Members   -  Reputation: 668

Posted 12 May 2009 - 02:07 AM

I guess you are right. You gave me something to think about.

Regarding my original problem I changed tactics and rearranged my rendertargets, so right now there is no need to pack any values.
Thanks for the help!



