HLSL (SM2/3): How do I transfer less-than-32-bit variables?

5 comments, last by QNAN 11 years, 2 months ago

I wish to transfer an instancing vertex that, beyond the matrix, holds four very small integers, two of which are indexes into a texture map shared by several other objects - a texture map holding many different kinds of grass and flowers. I call each of them "frames" and the texture a "frame map". The last two integers define the dimensions (x/y) of the frame map that the indexes point into.

From the maximum dimensions and the indexes, the shader will be able to calculate the offset into the frame map. I figured this should be cheaper to transfer than plain float offsets, if I used suitable types.
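To sketch what I mean (all names hypothetical, and assuming the four values are already available as plain numbers in the shader):

    // Sketch: top-left UV offset of a frame, given its indexes and the
    // frame map dimensions.
    float2 FrameOffset(float indexX, float indexY, float dimX, float dimY)
    {
        float2 frameSize = float2(1.0 / dimX, 1.0 / dimY); // size of one frame in UV space
        return float2(indexX, indexY) * frameSize;         // offset of the chosen frame
    }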

The integers are so small that I can get away with only 4 bits for each, as I allow the frame map a maximum of 16x16 frames. Combined, the index and dimension variables will occupy 4x4 = 16 bits.

However, in the list of vertex declaration types (http://msdn.microsoft.com/en-us/library/windows/desktop/bb172533%28v=vs.85%29.aspx) I cannot find a data type that takes only 4 bits - in fact, nothing that takes less than 32 bits. Is that really so?

I could live with packing the variables together and unpacking them on the other end, but if I am stuck with a minimum unit of 32 bits, then I'm not sure it is worth the price.

Is there a solution to this? Is there any way I can transfer variables of less than 32 bits?


You could go up to 32 bits exactly and transfer them as D3DDECLTYPE_UBYTE4, but this seriously smells of premature optimization. Try just transferring the full, normal texcoords as floats anyway - you'll probably find that you're not really bound at this stage of the pipeline at all, and that any attempt to reduce the data size doesn't make a blind bit of difference.
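For reference, a UBYTE4 element arrives in a SM2/3 vertex shader as a float4 with each component holding the raw byte value in the 0..255 range (UBYTE4N would give you 0..1 instead). Something like this hypothetical input struct:

    // Hypothetical instance data input; D3DDECLTYPE_UBYTE4 shows up as a
    // float4 whose components are in the 0..255 range.
    struct VS_INPUT
    {
        float4 pos       : POSITION;
        float2 uv        : TEXCOORD0;
        float4 frameData : TEXCOORD1; // e.g. x = indexX, y = indexY, z = dimX, w = dimY
    };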

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Bandwidth is usually a problem when rendering massive amounts of objects (which foliage can easily be), so I assumed it would be here too, although I have not tested yet.

If there is no elegant way to do it, I guess I will bump the variables (indexes/boundaries) to 8 bits and use UBYTE4. I'm just a bit disappointed that no solution exists for transferring custom-sized pieces of data, as it can be a problem when transferring millions of data packets.

Even if this is premature optimization, I thought I would benefit from knowing the transfer to the card in detail. And I think it is an interesting problem.

I assume the context to be PC games. It might look different when you're running on a console.

On DX9-level hardware, UBYTE4 is indeed the best you can get. You can do some bit swizzling in the shader to combine 2x 4-bit values into one of those 4 bytes, but you can't specify anything less than 32 bits. Thus this compression scheme is only useful if you can make other use of the remaining 3 bytes.
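For illustration, that swizzling could look roughly like this (a minimal, untested sketch; SM2/3 has no bitwise operators, so the nibbles have to be separated with arithmetic):

    // Sketch: unpack two 4-bit values from one byte of a UBYTE4 element,
    // which arrives in the shader as a float in the 0..255 range.
    // The CPU side would pack them as high * 16 + low.
    float2 UnpackNibbles(float packedByte)
    {
        float high = floor(packedByte / 16.0);  // upper 4 bits
        float low  = packedByte - high * 16.0;  // lower 4 bits
        return float2(high, low);
    }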

A warning from experience: it won't help you much.

- It can save a lot of GPU memory, but this is only useful if you're handling literally millions of instances. If you're doing just a few tens of thousands of instances, I wouldn't waste my time on it.

- It can save quite some memory bandwidth, but I never found this to be the limiting factor. The best I got from compressing my instancing vertex structure from 56 bytes down to 20 bytes was a 30% performance gain.

- It can save you transfer bandwidth in case you're updating the instancing data every frame. In that case I'd say it's worth the hassle, but I'd wager you have other problems then.

- It won't help you in any other case.

A few months ago I wrote a voxel renderer that splatted millions of textured quads. I first tried to use instancing, but it was slow as hell: 4 million quads resulted in ~15fps on my GeForce GTX 460. When trying to find the bottleneck I noticed that all the counters of NVPerfHUD together only accounted for 30% of the frame time, and 70% went to "somewhere". Then I tried Visual NSight, which was buggy as fuck but could at least show me the real cause: the Input Assembler. So I removed all instancing and stored four unique vertex structures per quad, for a total of 80 bytes per quad, and I got to 55fps. For the very same geometry, and four times the GPU memory bandwidth. Something is happening on those modern cards that I can't explain. An ATI GPU showed the same behaviour.

----------
Gonna try that "Indie" stuff I keep hearing about. Let's start with Splatter.

It's also the case that for huge numbers of objects you're more likely to bottleneck on fillrate (and potentially overdraw, depending on the type of object) than on vertices. This can be observed with particle systems and would be true of foliage too.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

I'm just a bit disappointed that no solution exists for transferring custom-sized pieces of data, as it can be a problem when transferring millions of data packets.

Well, graphics cards can't deal too well with data smaller than 32 bits, especially unaligned data, so what you'd save in memory bandwidth, you would lose in processing efficiency. Really, you should get everything working first, and then benchmark (if you still notice a slowdown once everything is in place).

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”

Thanks for the excellent input, guys. Knowing that it is impossible to have less-than-32-bit types is a big plus - at least I don't have to bang my head against an unbreakable wall :). It was also nice to hear people's performance stories, as from them it sounds like I'm not gonna run into a bandwidth problem as the first thing.

Thanks for the feedback, guys.

