Depth-sorting or Z-buffer?

Hello! I am writing a software renderer (purely for pleasure). I am stuck at the point where I should decide which surface visibility determination method to go for. A couple of years ago there was only one choice (depth-sorting by averaged Z coordinate per triangle), but these days CPUs are probably strong enough to handle Z-buffering.

The truth is that Z-sorting is awful in terms of renderer design. First of all, it cannot guarantee correct results (most noticeable when objects are close to the camera). The second thing is that it takes away a whole bunch of graphics pipeline optimizations (i.e. sorting triangles by material / minimizing renderer state changes). But Z-sorting would be FAST anyway (the work is per triangle, not per pixel).

The Z-buffer is great but it introduces a lot of per-pixel computation. I am afraid that it can be too much even for today's CPUs. :/ Sure, perspective texture mapping requires 1/Z interpolation as well, but that's only a minor problem. There would be heavy memory traffic between the CPU and the Z-buffer, which surely won't fit in the CPU cache (reading, writing ...). OMG.

I thought about making a kind of hybrid solution - use the Z-buffer for close objects and Z-sorting for distant ones. But it doesn't hold together (what happens if there is a huge object which uses the Z-buffer while it still covers distant meshes that don't?). I should make one choice.

What do you think I should choose? The Z-buffer seems to be a blessing, but won't it kill all my effort to optimize the code? :F Thanks for your help, I'd like to know your opinion.
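To make it concrete, the depth-sorting variant I mean is just this - a minimal sketch, where the Triangle fields and the "larger Z = farther from the camera" convention are my own assumptions:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical triangle with view-space vertex depths z0..z2.
struct Triangle {
    float z0, z1, z2;
    // ... positions, material, etc.
};

// Painter's algorithm: sort back to front by averaged view-space Z,
// then draw in that order. No per-pixel work, but it can sort wrongly
// for large or interpenetrating triangles.
void sortBackToFront(std::vector<Triangle>& tris)
{
    std::sort(tris.begin(), tris.end(),
              [](const Triangle& a, const Triangle& b) {
                  float za = (a.z0 + a.z1 + a.z2) * (1.0f / 3.0f);
                  float zb = (b.z0 + b.z1 + b.z2) * (1.0f / 3.0f);
                  return za > zb; // farther triangles first
              });
}
```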
You might want to look into hierarchical Z buffers. They're a useful trick to speed up Z comparisons, and the idea is essentially a quadtree, but applied to the depth comparison.
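Very roughly, each level stores the farthest depth of the pixels it covers, so whole blocks can be rejected with a handful of compares. A minimal sketch of the coarse test (all names and the "larger z = farther" convention are assumptions, not any particular paper's layout):

```cpp
#include <vector>

// One level of the hierarchical Z pyramid: each cell stores the
// farthest (maximum) depth of the pixels it covers, so a primitive
// whose nearest depth is still behind that value cannot be visible
// in that cell.
struct HiZLevel {
    int width, height;
    std::vector<float> farthest; // max depth per cell
};

// Coarse test: can a primitive covering cells [x0,x1]x[y0,y1] with
// nearest depth zNear possibly be visible at this level?
bool maybeVisible(const HiZLevel& level, int x0, int y0, int x1, int y1, float zNear)
{
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (zNear < level.farthest[y * level.width + x])
                return true;  // in front of something in this cell
    return false;             // behind everything already drawn here
}
```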
Ditto on hierarchical Z buffers. They're used heavily in graphics hardware, precisely because of the bandwidth problems that you describe.

You could also think about tiling the screen, binning triangles based on intersections with tiles, and rasterizing one tile at a time. This means potentially rasterizing some tris more than once, but this way you can work with small chunks of the Z and colors buffers that will fit nicely in cache.
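Something along these lines, as a minimal sketch (the tile size, the ScreenTri fields and the precomputed bounding boxes are assumptions):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical screen-space triangle with a precomputed bounding box.
struct ScreenTri {
    float minX, minY, maxX, maxY;
    int id; // index into the original triangle list
};

// Bin triangles into fixed-size screen tiles by bounding-box overlap.
// A triangle may land in several bins; each tile is then rasterized on
// its own so its Z and color chunks stay cache-resident.
std::vector<std::vector<int>> binTriangles(const std::vector<ScreenTri>& tris,
                                           int screenW, int screenH, int tileSize)
{
    int tilesX = (screenW + tileSize - 1) / tileSize;
    int tilesY = (screenH + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> bins(tilesX * tilesY);

    for (const ScreenTri& t : tris) {
        int tx0 = std::max(0, (int)t.minX / tileSize);
        int ty0 = std::max(0, (int)t.minY / tileSize);
        int tx1 = std::min(tilesX - 1, (int)t.maxX / tileSize);
        int ty1 = std::min(tilesY - 1, (int)t.maxY / tileSize);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(t.id);
    }
    return bins;
}
```

Note that the triangles don't need geometric clipping against the tiles; the rasterizer can simply clamp its scan to the tile rectangle.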
Joshua Barczak, 3D Application Research Group, AMD
Hierarchical Z buffers seems very interesting. I will check that out!

Quote:You could also think about tiling the screen, binning triangles based on intersections with tiles, and rasterizing one tile at a time. This means potentially rasterizing some tris more than once, but this way you can work with small chunks of the Z and colors buffers that will fit nicely in cache.

It's an interesting idea. But it would require clipping triangles against tiles, which could result in some overhead. I'll think about it, though. Btw, is there any way to check which portions of memory are stored in the CPU cache? Hm.

Thanks
Hierarchical z-buffers are yay.

However, your original suspicions are quite correct. With only a minor bit of common sense, you can write a decently performant software rasterizer that uses a depth buffer on modern CPUs. It won't be maximally performant, obviously, but it won't crawl.
I've investigated the idea of the hierarchical z-buffer. :) While it's a totally awesome technique, I am afraid it won't be suitable for my purposes. The reason is that my scene complexity is not going to be big (4k-5k tris, small depth complexity). The algorithm proves its value when it comes to huge data sets (millions of triangles, perhaps), but for smaller problems standard scan-conversion appears to be slightly faster (at least that's what the authors say).

Implementing the hierarchical z-buffer in pure software could cause overhead due to scan-converting the bounding boxes. Anyway, I am thinking about using the Z-pyramid on its own, as it can be used independently of the spatial occlusion tests.

I know what I'll do. First I will code classical scan-conversion with Z testing, and if the result comes out bad, I will worry about some other solution. ;)
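For reference, the per-pixel part of that classic approach is just a compare and store inside the span loop. A minimal sketch (the buffer layout, the linearly interpolated z and the "smaller z = nearer" convention are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Classic per-pixel depth test inside a horizontal span.
// zBuffer and colorBuffer are width*height arrays; z and dz come from
// the triangle setup (interpolated linearly in screen space here;
// a perspective-correct setup would interpolate 1/z instead).
void drawSpan(std::vector<float>& zBuffer, std::vector<uint32_t>& colorBuffer,
              int width, int y, int x0, int x1,
              float z, float dz, uint32_t color)
{
    int base = y * width;
    for (int x = x0; x < x1; ++x, z += dz) {
        if (z < zBuffer[base + x]) {   // depth test
            zBuffer[base + x] = z;     // depth write
            colorBuffer[base + x] = color; // flat color for brevity
        }
    }
}
```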
Memory access is way more expensive than computation on CPUs these days; you really don't want to access memory too much.

So you could do a beam tree approach, producing exact clipping of all your polys before rasterizing them. You could use portals with convex cells (use a face-aligned BSP to generate them, perhaps?) and do geometry clipping against the frustums when traversing the portal graph, which also gives you zero overdraw. One fairly simple method is to use sorted spans, i.e. just have a list of spans per scanline, scan-convert triangles, stick them in the correct list, and clip them against the spans already in there (the list should be sorted to make this easier). There are only a few cases to worry about for the spans, so it's quite nice.
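Not exactly the Quake scheme, but a minimal sketch of the "clip against the spans already in the list" step, assuming spans arrive roughly front to back and each scanline keeps a sorted, non-overlapping list of already-covered intervals (the Span struct and the emit callback are made up):

```cpp
#include <algorithm>
#include <list>

// One already-covered interval [x0, x1) on a scanline.
struct Span { int x0, x1; };

// For a new span [x0, x1): emit only the parts not yet covered and
// record them as covered. emit(a, b) is a hypothetical callback that
// actually shades pixels a..b-1. The list stays sorted and disjoint.
template <typename EmitFn>
void insertSpan(std::list<Span>& line, int x0, int x1, EmitFn emit)
{
    auto it = line.begin();
    while (x0 < x1) {
        if (it == line.end() || x1 <= it->x0) {
            // Entirely inside a gap: emit it, record it, done.
            emit(x0, x1);
            line.insert(it, {x0, x1});
            return;
        }
        if (x0 < it->x0) {
            // Leading uncovered part before this existing span.
            int gapEnd = it->x0;
            emit(x0, gapEnd);
            it = line.insert(it, {x0, gapEnd});
            ++it; // back to the existing span
        }
        // Skip the part hidden by the existing span, move on.
        x0 = std::max(x0, it->x1);
        ++it;
    }
}
```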

Another approach would be to use some higher-level visibility culling and store your geometry so you can draw it in back-to-front order (probably a BSP), and then just not worry about any further visibility determination. That's going to hurt you because of overdraw, though probably not as much as doing a read from a z-buffer, comparing, and then writing.
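A minimal sketch of that back-to-front BSP walk (the node layout and the draw callback are assumptions):

```cpp
#include <vector>

// Hypothetical BSP node: a splitting plane plus the triangles lying on it.
struct BspNode {
    float a, b, c, d;            // plane: a*x + b*y + c*z + d = 0
    std::vector<int> triangles;  // triangle indices stored at this node
    BspNode* front = nullptr;
    BspNode* back  = nullptr;
};

// Back-to-front traversal: recurse into the child on the far side of the
// camera first, then draw this node's triangles, then the near child.
// Painter's order falls out of the tree, so no per-pixel depth test is
// needed, at the price of overdraw.
template <typename DrawFn>
void drawBackToFront(const BspNode* node,
                     float camX, float camY, float camZ, DrawFn draw)
{
    if (!node) return;
    float side = node->a * camX + node->b * camY + node->c * camZ + node->d;
    const BspNode* nearChild = (side >= 0.0f) ? node->front : node->back;
    const BspNode* farChild  = (side >= 0.0f) ? node->back  : node->front;

    drawBackToFront(farChild, camX, camY, camZ, draw);
    for (int tri : node->triangles) draw(tri);
    drawBackToFront(nearChild, camX, camY, camZ, draw);
}
```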

I'd recommend sorted spans. It works for everything if you handle all cases (Quake uses it, but draws front to back using the BSP tree and thus doesn't have to worry about intersecting spans, which makes the logic a bit easier). If you just handle all possible span intersections you can use this for arbitrary geometry without any special static data structures.
SwiftShader can render Unreal Tournament 2004 and Max Payne in real-time, and uses a floating-point z-buffer.
Quote:I am afraid that it can be too much even for today's CPUs.

A modern CPU can do tens of billions of operations per second. Say for example we have a 2.4 GHz CPU and we'd like to render at 800x600 resolution with 25 frames per second. That means you have 200 clock cycles per pixel, per frame. That's plenty. You'll only need a fraction of that for z-buffering.
Quote:Memory access is way more expensive than computation on CPUs these days; you really don't want to access memory too much.

It all depends on the access pattern. Accessing data that resides in the L1 cache is incredibly fast. This is how x86 processors compensate for having very few registers. Accessing big blocks of memory sequentially also isn't a problem, because the processor will automatically prefetch it from RAM into the caches. Accessing complex data structures with lots of pointer indirections can kill performance though. But that certainly doesn't mean memory is slow in general. In the case of a z-buffer it is accessed mostly linearly, so it's really not a problem.
Quote:A modern CPU can do tens of billions of operations per second. Say for example we have a 2.4 GHz CPU and we'd like to render at 800x600 resolution with 25 frames per second. That means you have 200 clock cycles per pixel, per frame. That's plenty. You'll only need a fraction of that for z-buffering.

Nice example! :) Anyway you assumed that there is no depth complexity in the scene.

I am thinking about an integer z-buffer since all my per-fragment operations are fixed-point (is it good or bad?). Then I could try to optimize the scan conversion with MMX.

Quote:But that certainly doesn't mean memory is slow in general. In the case of a z-buffer it is accessed mostly linearly, so it's really not a problem.

Well, my doubts appeared after watching a few software renderers which work totally crappy on my PC (I won't name any, though ;). My aim is to write something simple but highly interactive. :)

Quote:Original post by clapton
Anyway you assumed that there is no depth complexity in the scene.

True, high overdraw can really kill performance for a software renderer. If you have control over the application side as well, make sure you render front-to-back as much as possible.
Quote:I am thinking about an integer z-buffer since all my per-fragment operations are fixed-point (is it good or bad?).

Normally you can keep z interpolation separate from the rest, so you can use floating-point there and fixed-point for the rest. It really depends on the precision you want. For a z-buffer 16-bit integer is enough, but you need to be quite careful there is no unnecessary precision loss. With floating-point it's a whole lot more straightforward and no slower in practice.
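For illustration, a minimal sketch of keeping the interpolation in floating point while storing 16-bit integers (the layout, the [0,1) depth range and the "smaller = nearer" convention are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Depth is interpolated in floating point (z, dz from triangle setup,
// assumed to stay within [0, 1); clamping omitted for brevity) and only
// quantized to 16 bits at the point of the compare and store.
void depthTestSpan16(std::vector<uint16_t>& zBuffer16,
                     std::vector<uint32_t>& colorBuffer,
                     int width, int y, int x0, int x1,
                     float z, float dz, uint32_t color)
{
    int base = y * width;
    for (int x = x0; x < x1; ++x, z += dz) {
        uint16_t zi = (uint16_t)(z * 65535.0f); // quantize once per pixel
        if (zi < zBuffer16[base + x]) {
            zBuffer16[base + x] = zi;
            colorBuffer[base + x] = color;
        }
    }
}
```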
Quote:Well, my doubts appeared after watching a few software renderers which work totally crappy on my PC (I won't name any, though ;). My aim is to write something simple but highly interactive. :)

Performance is always a concern with software rendering, but on today's CPUs, with proper optimizations (assembly is a must), you can really achieve a lot. The original Unreal (Tournament) software renderer was meant to run on 100 MHz CPUs with only a fraction of the memory bandwidth of today's systems.

Anyway, my advice is not to worry about performance too early. Do things as straightforward and simple as possible. That's already complicated enough. Follow the example of hardware rendering. Once you have all the functionality you need, and only then, start optimizing. MMX and SSE can really speed things up a lot.

The ultimate approach is dynamic code generation though...
