Quote:Original post by JoeJ
Stop thinking about too hardware specific stuff - improve the algorithm first! :)
Stop attempting to improve the algorithm - implement the standard graphics pipeline first! ;-)
Seriously, hundreds of researchers and engineers have shaped the graphics pipeline as we know it today. The z-buffer has proven to be very valuable. Not only does it perform well, both in theory and in practice, it's accurate in any situation.
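To make it concrete, here's a minimal sketch of the z-buffer idea (the buffer size and names are just illustrative): each pixel stores the depth of the nearest surface drawn so far, and a new fragment is only written if it's closer.

```cpp
#include <array>
#include <cstdint>
#include <limits>

// Hypothetical tiny 4x4 framebuffer, just for illustration.
constexpr int W = 4, H = 4;
std::array<float, W * H>    zbuf;   // depth of nearest surface per pixel
std::array<uint32_t, W * H> color;  // color of that surface

void clearBuffers() {
    zbuf.fill(std::numeric_limits<float>::infinity());  // "nothing drawn yet"
    color.fill(0);
}

// Write a pixel only if it is closer than what is already stored.
// Returns false when the fragment is hidden (shading can be skipped).
bool plot(int x, int y, float z, uint32_t c) {
    int i = y * W + x;
    if (z < zbuf[i]) {   // depth test: smaller z means closer to the eye
        zbuf[i] = z;
        color[i] = c;
        return true;
    }
    return false;
}
```

Note that this works regardless of the order in which polygons are drawn, which is exactly why it's accurate in any situation.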
Quote:This is still the best VSD algo I know, nothing I've ever read seems better. Or am I wrong?
It totally depends on the situation. If you're rendering a static low-polygon scene and you have access to the BSP, it can be very efficient. But under other conditions this approach either simply doesn't work or exhibits bad worst-case behavior. Furthermore, if I'm not mistaken, this doesn't work with intersecting polygons.
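For reference, the BSP approach boils down to a back-to-front traversal (the painter's algorithm): at each node, recurse into the half-space the eye is *not* in first, so nearer polygons get drawn last. A rough sketch, with made-up node and plane types for illustration:

```cpp
#include <functional>
#include <vector>

// Hypothetical BSP node: a splitting plane, the polygons lying on it,
// and the two child subtrees.
struct Plane { float a, b, c, d; };   // plane equation ax + by + cz + d = 0
struct Vec3  { float x, y, z; };
struct BspNode {
    Plane plane;
    std::vector<int> polys;           // indices of polygons on this plane
    BspNode* front = nullptr;
    BspNode* back  = nullptr;
};

// Positive result: the eye is in the front half-space of the plane.
float side(const Plane& p, const Vec3& eye) {
    return p.a * eye.x + p.b * eye.y + p.c * eye.z + p.d;
}

// Painter's algorithm: draw the far subtree, then the node's polygons,
// then the near subtree, so closer polygons overwrite farther ones.
void drawBackToFront(const BspNode* n, const Vec3& eye,
                     const std::function<void(int)>& drawPoly) {
    if (!n) return;
    if (side(n->plane, eye) >= 0.0f) {          // eye in front half-space
        drawBackToFront(n->back, eye, drawPoly);
        for (int p : n->polys) drawPoly(p);
        drawBackToFront(n->front, eye, drawPoly);
    } else {                                    // eye in back half-space
        drawBackToFront(n->front, eye, drawPoly);
        for (int p : n->polys) drawPoly(p);
        drawBackToFront(n->back, eye, drawPoly);
    }
}
```

This only yields a correct ordering because the BSP build step splits polygons across the planes; for a dynamic scene you'd have to rebuild or merge the tree, which is where the worst-case behavior comes from.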
Quote:Hierarchical Z-Buffers looks like a very inefficient solution to me, acceptable only for hardware rendering.
It's not inefficient at all in its context. It reduces bandwidth significantly and avoids rendering whole tiles of pixels.
So if hierarchical z-buffers are so perfect for hardware rendering, why don't they apply to software rendering?

First of all, there's a totally different balance between processing power and bandwidth. On a CPU, for every 32 bits of bandwidth you only have about 1.5 clock cycles to do your processing. So any attempt at reducing bandwidth is rather futile; you're processing-limited, so just use the damn bandwidth. On a GPU, you're sharing the bandwidth with tens of processing units. So if you don't limit bandwidth usage, many of those units won't receive any data and your expensive hardware can use only a fraction of its processing power.

Secondly, on a CPU you do have the ability to pull pixels 'out of the pipeline'. By doing the z-test early you can skip a pixel if it's hidden, and start processing the next pixel right away. In hardware, pixel processing is 'all or nothing'. Once pixels enter the pipeline, you can't pull them out again. You can only make decisions on batches (tiles) of pixels. So it's natural to use at least one level of hierarchical z-buffer to perform the z-test for a whole tile before it enters the pixel pipeline.
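The tile-level test is easy to sketch. In this assumed one-level version (a real hierarchical z-buffer has more levels), each tile caches the *farthest* depth among its pixels: if an incoming primitive's *nearest* z is farther than that, every pixel in the tile would fail the depth test, so the whole tile can be rejected without ever touching the fine buffer.

```cpp
#include <algorithm>
#include <array>
#include <limits>

// One-level hierarchical z-buffer sketch: 2x2 tiles of 4x4 pixels.
constexpr int TILE = 4;
constexpr int W = 8, H = 8;
constexpr int TW = W / TILE, TH = H / TILE;

std::array<float, W * H>   zbuf;      // fine per-pixel z-buffer
std::array<float, TW * TH> tileFarZ;  // coarse level: max z per tile

void clearHzb() {
    zbuf.fill(std::numeric_limits<float>::infinity());
    tileFarZ.fill(std::numeric_limits<float>::infinity());
}

// Coarse test: could anything at depth >= nearZ still be visible here?
// If not, the whole tile is skipped and its pixels never enter the pipeline.
bool tileMightBeVisible(int tx, int ty, float nearZ) {
    return nearZ < tileFarZ[ty * TW + tx];
}

// After pixels in a tile are written, refresh its cached farthest z.
void updateTile(int tx, int ty) {
    float farZ = 0.0f;
    for (int y = 0; y < TILE; ++y)
        for (int x = 0; x < TILE; ++x)
            farZ = std::max(farZ, zbuf[(ty * TILE + y) * W + tx * TILE + x]);
    tileFarZ[ty * TW + tx] = farZ;
}
```

One coarse comparison stands in for sixteen per-pixel tests and sixteen z-buffer reads, which is exactly the bandwidth saving that matters on a GPU but buys little on a processing-limited CPU.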
So should we "stop thinking about too hardware specific stuff"? I don't think so. The choice of algorithm heavily depends on the architecture. On the other hand, "premature optimization is the root of all evil". So even if you think you know what's faster on the hardware you're working with, it's safer to implement the straightforward approach and get it fully functional first, then profile for the real bottlenecks, and only then start worrying about how to improve performance (if necessary). It's so situation-dependent that any guess made without prior experience is almost always going to be wrong.
So, applying that to this thread's topic, I really advise clapton to first try the standard z-buffer. It's simple, quite efficient in most situations, and produces perfect results. After every other aspect of his application is finished, he can profile performance and determine whether the z-buffer is a bottleneck or not. If not, fantastic. If it is, see you all again in a few months... ;-)