Is software rasterisation processor-heavy?

9 comments, last by Krypt0n 10 years, 10 months ago
As part of my theory for occlusion, I'm thinking about incorporating a software rasteriser which will just render simple objects (mostly cubes I guess) in plain colours into a simulated bitmap (basically just a 2d array of pixels/numbers).

I kind of get the theory (or can find implementations of it), but I was wondering whether it can be a quick process or whether it is processor-intensive. I'm thinking it'll just go into its own thread.

The bit I'm concerned about is filling the shapes with colour, i.e. the pixels in between the Bresenham lines.

It depends on the occlusion buffer dimensions.

On a 2GHz Core i7, I take less than 1 millisecond to rasterize 2000 triangles of terrain geometry into a 256 pixel wide occlusion buffer. Increasing the resolution to 1024 pixels raises the time taken to 2.5 ms. My routine is depth-only rasterization and is not optimized to the max: it is plain C++ with no SIMD instructions.

I've found it hard to thread the occlusion rendering though, as the way I use it is a sequential operation: find the visible occluders, rasterize them, then use the result to check object AABBs for visibility.
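For readers wondering what a depth-only routine like this can look like, here is a minimal sketch. It is not the poster's code: it assumes a bounding-box loop with edge functions rather than Bresenham-style edge walking, a depth convention where smaller means nearer, and the names DepthBuffer, Vec3 and rasterizeDepth are made up for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical depth-only target: one float per pixel, smaller = nearer.
struct DepthBuffer {
    int width, height;
    std::vector<float> depth;
    DepthBuffer(int w, int h)
        : width(w), height(h), depth(std::size_t(w) * h, 1.0f) {}
};

struct Vec3 { float x, y, z; };  // screen-space x/y in pixels, z = depth

// Twice the signed area of triangle (a, b, p).
inline float edge(const Vec3& a, const Vec3& b, float px, float py) {
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

// Fills the triangle's pixels, keeping the nearest depth per pixel.
void rasterizeDepth(DepthBuffer& buf, Vec3 a, Vec3 b, Vec3 c) {
    float area = edge(a, b, c.x, c.y);
    if (area == 0.0f) return;                            // degenerate triangle
    if (area < 0.0f) { std::swap(b, c); area = -area; }  // force one winding

    // Clamp the triangle's bounding box to the buffer.
    const int minX = std::max(0, int(std::floor(std::min({a.x, b.x, c.x}))));
    const int maxX = std::min(buf.width - 1, int(std::ceil(std::max({a.x, b.x, c.x}))));
    const int minY = std::max(0, int(std::floor(std::min({a.y, b.y, c.y}))));
    const int maxY = std::min(buf.height - 1, int(std::ceil(std::max({a.y, b.y, c.y}))));

    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            const float px = x + 0.5f, py = y + 0.5f;    // sample at pixel centre
            const float w0 = edge(b, c, px, py);         // weight of vertex a
            const float w1 = edge(c, a, px, py);         // weight of vertex b
            const float w2 = edge(a, b, px, py);         // weight of vertex c
            if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) continue;  // outside
            const float z = (w0 * a.z + w1 * b.z + w2 * c.z) / area;
            float& d = buf.depth[std::size_t(y) * buf.width + x];
            if (z < d) d = z;
        }
    }
}
```

The inner loop is where the "filling" cost lives, so the work scales with the number of covered pixels, which fits the observation that a wider buffer takes longer.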

That sounds fine, it would only need to be 256 wide or so.

Thanks
http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/

Sixteen blog posts about how to write an occlusion culling system and then optimize it. Pure gold.

In the occlusion rasterizer that I'm using, my frame-buffer is only 1 bit per pixel (drawn to yet, or not drawn to). I render all the triangles for occluders and occludees from front-to-back, with occluders filling pixels, and occludees testing if all their pixels are filled or not (if any of their pixels aren't filled, the occludee is visible).

Using SSE, you can fill 128 pixels at a time with this algorithm, so the actual rasterization is not at all a bottleneck. I can run at pretty much any resolution without much difference in speed. The real bottleneck for me is actually transforming all of the vertices from model space into screen space (i.e. the 'vertex shader').

I break the frame-buffer into 'tiles', each of them 128 pixels wide and <height> pixels tall. Each of these tiles can then be independently rasterized by a different thread.

How are you sorting the triangles? What about intersections? Or do you sort by max z and rasterize the whole triangle with that? It sounds like a great technique, but I'd like to hear more details.

Yeah, you use a single z value for each triangle, which means that large triangles at glancing angles don't act as effective occluders.
To ensure conservative results (no false occlusion), you use the maximum z value for occluder triangles and the minimum z value for occludee triangles, which allows intersecting triangles to work without errors.
First you project all the triangles into screen space, determine their z values as above, bucket them into the "tiles", then sort the triangle list in each tile according to z value. And then you rasterize the tiles.
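The binning and sorting steps above could be sketched roughly as follows. This is only an illustration under assumptions: larger z means farther away, tiles are vertical strips, and ScreenTri, TileEntry, conservativeDepth and binAndSort are hypothetical names, not taken from Vadim's demo.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical screen-space triangle: x/y in pixels, z is depth (larger = farther).
struct ScreenTri {
    float x[3], y[3], z[3];
    bool isOccluder;  // false means occludee
};

// One depth per triangle: farthest vertex for occluders, nearest vertex for
// occludees, so an occluder can never hide more than it really does.
inline float conservativeDepth(const ScreenTri& t) {
    return t.isOccluder ? std::max({t.z[0], t.z[1], t.z[2]})
                        : std::min({t.z[0], t.z[1], t.z[2]});
}

struct TileEntry {
    float depth;
    const ScreenTri* tri;
};

// Buckets triangles into vertical tiles of tileWidth pixels, then sorts each
// bucket front to back by conservative depth.
std::vector<std::vector<TileEntry>> binAndSort(const std::vector<ScreenTri>& tris,
                                               int screenWidth, int tileWidth) {
    const int tileCount = (screenWidth + tileWidth - 1) / tileWidth;
    std::vector<std::vector<TileEntry>> tiles(tileCount);
    for (const ScreenTri& t : tris) {
        const float minX = std::min({t.x[0], t.x[1], t.x[2]});
        const float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        if (maxX < 0.0f || minX >= float(screenWidth)) continue;  // off screen
        const int first = std::max(0, int(minX) / tileWidth);
        const int last = std::min(tileCount - 1, int(maxX) / tileWidth);
        const float d = conservativeDepth(t);
        for (int i = first; i <= last; ++i) tiles[i].push_back({d, &t});
    }
    for (auto& bucket : tiles) {
        std::stable_sort(bucket.begin(), bucket.end(),
                         [](const TileEntry& a, const TileEntry& b) {
                             return a.depth < b.depth;
                         });
    }
    return tiles;
}
```

Since each bucket only refers to triangles overlapping its own tile, the buckets can be sorted and rasterized independently of each other.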
When rasterizing a triangle, you iterate through the scanlines, generating a bitmask of the pixels covered by the triangle on that line. Occluders then simply OR this mask with the framebuffer. Occludees AND this mask with "NOT framebuffer", and if the result is non-zero, they write a non-zero value into some address indicating that this object is visible. (When submitting a group of occludee triangles, you also pass an int*, etc., where this value will be written if the object is visible. The int at this address is initialized to zero beforehand.)
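The per-scanline mask logic described above can be sketched like this. To stay portable it assumes a 64-pixel-wide tile with one uint64_t per scanline instead of a 128-bit SSE register, and it starts from a span [x0, x1) that some triangle scan converter (not shown) has already produced. OcclusionTile, spanMask, drawOccluderSpan and testOccludeeSpan are illustrative names.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

constexpr int kTileWidth = 64;   // one uint64_t per scanline (assumption)
constexpr int kTileHeight = 64;

struct OcclusionTile {
    std::array<std::uint64_t, kTileHeight> rows{};  // bit set = pixel covered
};

// Bitmask with bits [x0, x1) set, clamped to the tile width.
inline std::uint64_t spanMask(int x0, int x1) {
    x0 = std::max(x0, 0);
    x1 = std::min(x1, kTileWidth);
    if (x0 >= x1) return 0;
    const std::uint64_t upToX1 =
        (x1 == kTileWidth) ? ~std::uint64_t{0} : ((std::uint64_t{1} << x1) - 1);
    const std::uint64_t upToX0 = (std::uint64_t{1} << x0) - 1;
    return upToX1 & ~upToX0;
}

// Occluder: OR the coverage mask into the framebuffer row.
inline void drawOccluderSpan(OcclusionTile& tile, int y, int x0, int x1) {
    if (y < 0 || y >= kTileHeight) return;
    tile.rows[y] |= spanMask(x0, x1);
}

// Occludee: AND the mask with NOT framebuffer; any surviving bit is a pixel
// that nothing nearer has covered, so the object is flagged visible.
inline void testOccludeeSpan(const OcclusionTile& tile, int y, int x0, int x1,
                             int* visibleFlag) {
    if (y < 0 || y >= kTileHeight) return;
    if ((spanMask(x0, x1) & ~tile.rows[y]) != 0) *visibleFlag = 1;
}
```

Each span costs a couple of bitwise operations per scanline regardless of how many pixels it covers, which is why the fill itself tends not to be the bottleneck.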

P.S. I didn't come up with this, I've shamelessly taken the idea from Vadim Shcherbakov, who got it from another guy, IronPeter. There's a full explanation and a demo with source code on his blog. His code uses SSE intrinsics so it's a bit unreadable in places, and it contains a few bugs, but it's very fast. Eventually, I'd like to release my own open source version of this algorithm, but I've got other things to be working on at the moment.

[edit]
P.P.S. I contacted Vadim about the copyright on his demo, because there is no explicit license contained in the ZIP, and got this response:

On 05/10/13 8:11 AM, Vadim Shcherbakov wrote:
--------------------
Hey, you can use the code as you like, there is no license or any limitations.
Regards, Vadim

... and occludees testing if all their pixels are filled or not (if any of their pixels aren't filled, the occludee is visible).

If occluder triangles take the maximum Z and occludees the minimum Z, could I then test just the 4 corner pixels of the occludee's axis-aligned 2D rectangle?

Or even 1 might be enough, I think? I don't see how this could fail.

I am gonna try to implement this myself, since Vadim S.'s demo crashes for me at certain camera positions/angles. His code is also hard to follow and unreadable, so I cannot fix it.

I think the problem is in the clipping algorithm: it sometimes fails when triangles intersect the near plane (an assert in there is the cause of the crash)... but yeah, his code is very hard to follow.

I could test just 4 pixels/corners of the occludee's axis-aligned 2D rectangle?

I'm not sure that would work, e.g. if the occludee is just beyond a window but is larger than the window. Its four corners will be occluded by the wall, but its centre will be visible through the window.
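The window case can be reproduced on a toy 1-bit buffer. The sketch below assumes a 16x16 buffer stored as one uint16_t per row; Buffer16, wallWithWindow, cornersCovered and allPixelsCovered are made-up helper names for the illustration.

```cpp
#include <array>
#include <cstdint>

using Buffer16 = std::array<std::uint16_t, 16>;  // 16x16 pixels, bit set = covered

inline bool covered(const Buffer16& b, int x, int y) {
    return ((b[y] >> x) & 1u) != 0;
}

// A wall covering the whole buffer except a window at [wx0, wx1) x [wy0, wy1).
Buffer16 wallWithWindow(int wx0, int wy0, int wx1, int wy1) {
    Buffer16 b;
    b.fill(0xFFFF);
    for (int y = wy0; y < wy1; ++y)
        for (int x = wx0; x < wx1; ++x)
            b[y] &= std::uint16_t(~(1u << x));  // punch a hole for the window
    return b;
}

// The shortcut being asked about: look only at the four corners (inclusive).
bool cornersCovered(const Buffer16& b, int x0, int y0, int x1, int y1) {
    return covered(b, x0, y0) && covered(b, x1, y0) &&
           covered(b, x0, y1) && covered(b, x1, y1);
}

// The full test: every pixel of the rectangle (inclusive) must be covered.
bool allPixelsCovered(const Buffer16& b, int x0, int y0, int x1, int y1) {
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (!covered(b, x, y)) return false;
    return true;
}
```

With a window at pixels 6..9 and an occludee rectangle spanning 4..11 in both axes, cornersCovered reports the rectangle as hidden while allPixelsCovered correctly reports that part of it shows through the window.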

If you're rasterizing a traditional depth buffer instead of one of these 1-bit occlusion buffers, then you can create a hierarchical Z-buffer from it, which does allow you to test any occludee with just 4 samples. This is a very old technique (1993?), but it was used very recently in a Splinter Cell game.
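A rough sketch of the hierarchical Z idea follows, assuming a square power-of-two depth buffer where larger values are farther away. Each coarser level keeps the farthest depth of its 2x2 children, and the test picks the level at which the occludee's rectangle spans at most 2x2 texels. HiZ, buildHiZ and isOccluded are hypothetical names, and this is not the implementation from any particular game.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct HiZ {
    // levels[0] is the full-resolution depth buffer (size x size, row-major);
    // each further level halves the resolution and keeps the farthest depth.
    std::vector<std::vector<float>> levels;
    int size = 0;  // assumed to be a power of two
};

HiZ buildHiZ(const std::vector<float>& depth, int size) {
    HiZ h;
    h.size = size;
    h.levels.push_back(depth);
    for (int s = size / 2; s >= 1; s /= 2) {
        const std::vector<float>& prev = h.levels.back();
        const int ps = s * 2;  // row length of the previous level
        std::vector<float> next(std::size_t(s) * s);
        for (int y = 0; y < s; ++y) {
            for (int x = 0; x < s; ++x) {
                next[y * s + x] = std::max({prev[(2 * y) * ps + 2 * x],
                                            prev[(2 * y) * ps + 2 * x + 1],
                                            prev[(2 * y + 1) * ps + 2 * x],
                                            prev[(2 * y + 1) * ps + 2 * x + 1]});
            }
        }
        h.levels.push_back(std::move(next));
    }
    return h;
}

// Tests an occludee's screen rectangle (inclusive pixel bounds) against the
// pyramid using four samples. nearestZ is the occludee's minimum depth.
bool isOccluded(const HiZ& h, int x0, int y0, int x1, int y1, float nearestZ) {
    x0 = std::max(x0, 0);
    y0 = std::max(y0, 0);
    x1 = std::min(x1, h.size - 1);
    y1 = std::min(y1, h.size - 1);
    if (x0 > x1 || y0 > y1) return false;  // off screen: not proven hidden

    // Pick the level whose texels are at least as large as the rectangle, so
    // the rectangle touches at most 2 texels along each axis.
    const int extent = std::max(x1 - x0, y1 - y0) + 1;
    int level = 0;
    while ((1 << level) < extent) ++level;
    level = std::min(level, int(h.levels.size()) - 1);

    const int s = h.size >> level;
    const std::vector<float>& l = h.levels[level];
    const int tx0 = x0 >> level, ty0 = y0 >> level;
    const int tx1 = x1 >> level, ty1 = y1 >> level;
    const float farthest = std::max({l[ty0 * s + tx0], l[ty0 * s + tx1],
                                     l[ty1 * s + tx0], l[ty1 * s + tx1]});
    return nearestZ > farthest;  // hidden only if behind everything sampled
}
```

Because every coarse texel stores the farthest depth in its region, a window behind a wall raises the stored value for the texels that contain it, so the four-sample test stays conservative in the case where a four-corner test on a 1-bit buffer would not.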

I always found that memory bandwidth was the biggest issue with software rasterization, which links the performance to the frame buffer size. However, using one bit per pixel like Hodgman does would probably alleviate that issue nicely...

