
HLSL (SM2/3): How do I transfer less-than-32 bit variables?



#1 QNAN   Members   -  Reputation: 223

Posted 29 January 2013 - 04:36 PM

I wish to transfer an instancing vertex that, beyond the matrix, carries four very small integers. Two of them are indexes into a texture map shared by several other objects - a texture map holding many different kinds of grass and flowers. I call each sub-image a "frame" and the texture a "frame map". The last two integers define the dimensions (x/y) of the frame map that the indexes point into.

From the maximum dimensions and the indexes, the shader will be able to calculate the offset into the frame map. I figured this should be cheaper to transfer than the raw float offsets, if I could use suitably small types.
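
For illustration, the offset calculation could look roughly like this in HLSL (a minimal sketch; the function and parameter names are made up, and it assumes frames are addressed left-to-right, top-to-bottom):

// frameIndex = the two indexes (ix, iy); mapDims = the frame map
// dimensions (dx, dy). Each frame covers 1/dx x 1/dy of the texture,
// so frame (ix, iy) starts at (ix/dx, iy/dy) in texture space.
float2 FrameUVOffset(float2 frameIndex, float2 mapDims)
{
    return frameIndex / mapDims;
}

The per-vertex texcoord would then be scaled by 1/mapDims and shifted by this offset.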

 

The integers are so small that I can get away with only 4 bits for each, since I allow the frame map a maximum of 16x16 frames. Combined, the index and dimension variables occupy 4x4 = 16 bits.

 

However, in the list of vertex declaration types (http://msdn.microsoft.com/en-us/library/windows/desktop/bb172533%28v=vs.85%29.aspx), I cannot find a data type that takes only 4 bits - in fact, nothing that takes less than 32 bits. Is that really so?

I could live with packing the variables together and unpacking them on the other end, but if I am stuck with a minimum unit of 32 bits, then I'm not sure it is worth the price.

 

Is there a solution to this? Is there any way I can transfer variables of less than 32 bits?




#2 mhagain   Crossbones+   -  Reputation: 7413


Posted 29 January 2013 - 04:56 PM

You could go up to exactly 32 bits and transfer them as D3DDECLTYPE_UBYTE4, but this seriously does smell of premature optimization.  Try just transferring the full texcoords as plain floats anyway - you'll probably find that you're not really bound at this stage of the pipeline at all, and that any attempt to reduce the data size doesn't make a blind bit of difference.
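
As a rough sketch of what the shader side might look like if you did go the UBYTE4 route (the semantics and field names here are illustrative, not from the original post): on the C++ side the element would be declared as D3DDECLTYPE_UBYTE4, and each component then arrives in the vertex shader as a float in the 0..255 range.

struct VS_INPUT
{
    float4 position  : POSITION;
    float2 texCoord  : TEXCOORD0;
    // Per-instance world matrix, one row per stream element.
    float4 world0    : TEXCOORD1;
    float4 world1    : TEXCOORD2;
    float4 world2    : TEXCOORD3;
    float4 world3    : TEXCOORD4;
    // D3DDECLTYPE_UBYTE4: indexX, indexY, dimX, dimY as 0..255 floats.
    float4 frameData : TEXCOORD5;
};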


It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#3 QNAN   Members   -  Reputation: 223


Posted 30 January 2013 - 04:02 AM

Bandwidth is usually a problem when rendering massive numbers of objects (which foliage can easily be), so I assumed it would be here too, although I have not tested that yet.

If there is no elegant way to do it, I guess I will bump the variables (indexes/boundaries) to 8 bits each and use the UBYTE4 type. I'm just a bit disappointed that no solution exists for transferring custom-sized pieces of data, as it can become a problem when transferring millions of data packets.

 

Even if this may be premature optimization, I thought I would benefit from understanding the transfer to the card in detail. And I think it is an interesting problem.



#4 Schrompf   Prime Members   -  Reputation: 950


Posted 30 January 2013 - 06:51 AM

I assume the context to be PC games. It might look different when you're running on a console.

 

On DX9-level hardware, UBYTE4 is indeed the best you can get. You can do some bit swizzling in the shader to combine 2x 4 bits into one of those 4 bytes, but you can't declare anything smaller than 32 bits per element. This compression scheme is therefore only useful if you can make other use of the remaining 3 bytes.
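
To make the bit swizzling concrete: SM2/3 HLSL has no integer bitwise operators, so packing 2x 4 bits into a byte has to be undone with float arithmetic. A minimal sketch, assuming the byte arrives as a 0..255 float via UBYTE4 (the function name is made up; packing on the CPU is just the inverse, packedByte = high * 16 + low):

// Split one 0..255 value into its high and low 4-bit halves.
float2 UnpackNibbles(float packedByte)
{
    float high = floor(packedByte / 16.0);  // upper 4 bits: floor(b / 16)
    float low  = packedByte - high * 16.0;  // lower 4 bits: b mod 16
    return float2(high, low);
}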

 

A warning from experience: it won't help you much.

 

- It can save a lot of GPU memory, but this is only useful if you're handling literally millions of instances. If you're doing just a few tens of thousands of instances, I wouldn't waste my time on it. 

- It can save quite some memory bandwidth, but I never found this to be the limiting factor. The best I got from compressing my instancing vertex structure from 56 bytes down to 20 bytes was a 30% performance gain.

- It can save you transfer bandwidth in case you're updating the instancing data every frame. In that case I'd say it's worth the hassle, but I'd wager you have other problems then.

- It won't help you in any other case. 

 

A few months ago I wrote a voxel renderer that splatted millions of textured quads. I first tried to use instancing, but it was slow as hell: 4 million quads resulted in ~15 fps on my GeForce GTX 460. When trying to find the bottleneck I noticed that all counters of NVPerfHUD together only accounted for 30% of the frame time, and the other 70% went to "somewhere". Then I tried Nsight, which was buggy as hell but at least could show me the real cause: the input assembler. So I removed all instancing and stored four unique vertex structures per quad, for a total of 80 bytes per quad, and got to 55 fps - for the very same geometry, and four times the GPU memory bandwidth. Something is happening on those modern cards that I can't explain. An ATI GPU showed the same behaviour.


----------
Gonna try that "Indie" stuff I keep hearing about. Let's start with Splatter.

#5 mhagain   Crossbones+   -  Reputation: 7413


Posted 30 January 2013 - 07:32 AM

It's also the case that for huge numbers of objects you're more likely to bottleneck on fillrate (and potentially overdraw, depending on the type of object) than on vertices.  This can be observed with particle systems and would be true of foliage too.


It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#6 Bacterius   Crossbones+   -  Reputation: 8135


Posted 30 January 2013 - 08:17 AM

QNAN wrote:
"I'm just a bit disappointed that no solution exists for transferring custom-sized pieces of data, as it can become a problem when transferring millions of data packets."

Well, graphics cards don't deal too well with data smaller than 32 bits, especially unaligned data, so what you'd save in memory bandwidth you would lose in processing efficiency. Really, you should get everything working first, and then benchmark (if you still notice a slowdown once everything is in place).


The slowsort algorithm is a perfect illustration of the multiply and surrender paradigm, which is perhaps the single most important paradigm in the development of reluctant algorithms. The basic multiply and surrender strategy consists in replacing the problem at hand by two or more subproblems, each slightly simpler than the original, and continue multiplying subproblems and subsubproblems recursively in this fashion as long as possible. At some point the subproblems will all become so simple that their solution can no longer be postponed, and we will have to surrender. Experience shows that, in most cases, by the time this point is reached the total work will be substantially higher than what could have been wasted by a more direct approach.

 

- Pessimal Algorithms and Simplexity Analysis


#7 QNAN   Members   -  Reputation: 223


Posted 30 January 2013 - 12:06 PM

Thanks for the excellent input, guys. Knowing that types smaller than 32 bits are impossible is a big plus - at least I don't have to bang my head against an unbreakable wall :). It was also nice to hear people's performance stories; from them, it sounds like bandwidth is not going to be my first problem.

Thanks for the feedback, guys.

Edited by QNAN, 30 January 2013 - 12:08 PM.




