# [Software Rasterizer] Triangle Binning Question


## Recommended Posts

Hey guys,

I've been thinking about some optimizations for my rasterizer and I've come across the concept of "binning" triangles into tiles of the framebuffer, which are then processed in parallel by several threads.

But there are still some open questions for me on how to do this efficiently...

1. I've seen a sample where they use a tile size that is small enough to fit into the cache, e.g. 320x90. What I don't quite get is why divide the height by 8 but the width only by 4. My pixels are stored contiguously in an std::vector, so doesn't that mean one full scanline is much faster to access in memory than two half scanlines? So why not the other way around?

2. And these binning containers, are they vectors that I push_back into? Wouldn't that be extremely slow per frame?

3. What happens if a triangle overlaps several tiles? Do I just push the triangle into all overlapping containers? Doesn't that produce a lot of unneeded processing?

4. In the case of 320x90 tiles at a screen resolution of 1280x720, that would mean I've divided my buffer into 4 * 8 = 32 tiles. That's probably not as effective as a 1:1 ratio with CPU threads. So would I run 4 threads and then, when they're done, the next 4?

Edited by lipsryme

##### Share on other sites

1. 320 * 90 * 8 bytes (4 for RGB and 4 for depth) -> 225KB, probably targeting an architecture with 256KB of L2 cache. Dividing the height only by 4 would exceed the L2 and destroy most of the optimization's intent; dividing the width by 8 would waste half of the L2 while forcing you into twice as many passes. In my rasterizer, I run a simple benchmark on startup doing simple matrix operations (not quite linear, not quite random access patterns), starting with 4KB and doubling from there. It takes a few hundred ms, but it gives you a pretty decent picture of what tile sizes you should use (based on the timings). Some CPUs like the Q9450 have 12MB of L2, some like the i7-4770K have just 256KB of L2, and with future CPUs you cannot assume those will grow; it really depends on how the vendor balances size vs. latency*bandwidth, so it's good to just benchmark the tile size.
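The arithmetic behind those tile dimensions can be checked directly. A minimal sketch, assuming 4 bytes of color and 4 bytes of depth per pixel as in the numbers above:

```cpp
#include <cstddef>

// Bytes needed to keep one tile resident in cache, assuming 4 bytes of
// color (RGBA) and 4 bytes of depth per pixel, as stated above.
constexpr std::size_t tileFootprintBytes(std::size_t w, std::size_t h)
{
    return w * h * (4 /*color*/ + 4 /*depth*/);
}

// 320x90 -> 225KB: fits a 256KB L2 with a little room to spare.
static_assert(tileFootprintBytes(320, 90) == 225 * 1024, "320x90 tile is 225KB");

// Doubling the tile height (i.e. only dividing 720 by 4) would blow the L2.
static_assert(tileFootprintBytes(320, 180) > 256 * 1024, "320x180 exceeds 256KB");
```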

Regarding cache lines: the question you should ask yourself is what would lead to more cache hits if you drew _random_ primitives. If you look at 'most' primitives in real-world objects, they have a rather square shape instead of a flat one. The chances are higher that the 512 pixels you access will be somewhere within a 32x32 pixel area than along some 512x1 strip. So if you start by accessing pixel(x,y), you will probably access pixel(x,y+1) just as likely as pixel(x+1,y), but it's already less likely that you'll access pixel(x+2,y). Organizing your memory in tiles just continues that idea: accessing pixel(x+16,y+16) in a tile while drawing tons of triangles is more likely than accessing pixel(x+256,y). And if you do access that one, you're probably drawing a big triangle, so accessing pixel(x+256,y+1) is again more likely than accessing pixel(x+512,y) in a line.

Aligning pixel processing to tiles also benefits texture access. Texels sampled along a scanline can be widely distributed across a texture, depending on the rotation of the UV set, but texels accessed while drawing the pixels of a tile will probably also lie in a small area; not least because with bilinear filtering you always access two texel rows.

2. Usually you have a kind of 'worst case' assumption of how many triangles will probably be drawn, plus a fallback plan.

Estimation: this can be done statically, e.g. with 320x90-pixel tiles you could assume it's unlikely that anyone will draw more than 320x90 triangles into one tile in a common game. Or you can estimate based on previous frames: if you exceeded the previous bucket size, take the previous triangle count, round it up to the next power of two, and preallocate that buffer for the tiles.

Fallback: that sounds like something bad, but globally it actually isn't. If you reach the limit for a tile, there is probably already enough work that gathering more would make it barely more efficient, so you can 'flush' the tile: process all triangles in it, write it out, and reset it. An alternative is to have some preallocated fallback buffers; once you reach the limit, you 'link' one of those fallback buffers in, and if you run out of fallback buffers you can allocate more. Again, while both cases sound suboptimal, by that point you have exceeded every estimate of what would be 'fast' anyway, and rendering won't be fast regardless. Fallbacks are there to keep things working (and not break); the real problem is the sheer amount of work, which would have been too much even without falling into fallback mode.
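The estimate-plus-flush idea can be sketched roughly like this. The names (TriangleBin, flush) and the fixed-capacity-with-callback design are illustrative assumptions, not the poster's actual code; the capacity would come from whatever estimation scheme you use:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// One possible shape for a per-tile bin: preallocate to an estimated
// capacity and, when the estimate is exceeded mid-frame, 'flush' the tile
// (process and reset) instead of reallocating.
struct TriangleBin
{
    std::size_t cap;                                  // estimated worst case
    std::vector<std::uint32_t> ids;                   // indices into a global triangle buffer
    std::function<void(std::vector<std::uint32_t>&)> flush;

    explicit TriangleBin(std::size_t capacity) : cap(capacity)
    {
        ids.reserve(cap);                             // one allocation up front
    }

    void push(std::uint32_t triId)
    {
        if (ids.size() == cap)                        // estimate exceeded:
        {
            flush(ids);                               // process what we have...
            ids.clear();                              // ...and reuse the storage
        }
        ids.push_back(triId);
    }
};
```

With a capacity of 4, pushing 10 IDs triggers two flushes of 4 and leaves 2 IDs pending for the end-of-frame pass.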

3. There are two cases to consider:

a) a lot of tiny triangles

b) a few big triangles

In case a), 'most' triangles will fall within only one tile, so the number of triangles processed more than once is not worth worrying about. In case b), those triangles produce so much work per pixel that the overhead is again not worth bothering about. Imagine you had to set up a triangle again and again for every tile on the screen: you'd still set it up just once for every 320x90 pixels you process, a 28800:1 ratio. That's why I store every triangle just once in a global buffer and refer to it from the tiles by ID. It keeps the buckets per tile relatively small; small triangles are set up once without pushing redundant data, and in case b) you don't duplicate tons of data, which again is faster.
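The store-once-and-index scheme can be sketched with bounding-box binning, using the 1280x720 / 320x90 split from the question as an assumed layout (the flat tile-grid array and the function name are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed layout: 1280x720 screen split into 320x90 tiles -> 4x8 grid.
constexpr int kTileW = 320, kTileH = 90;
constexpr int kScreenW = 1280, kScreenH = 720;
constexpr int kTilesX = kScreenW / kTileW, kTilesY = kScreenH / kTileH;

// Bin a triangle by its screen-space bounding box: the triangle itself
// lives once in a global buffer; only its index goes into each
// overlapping tile's bin.
void binTriangle(std::uint32_t triId,
                 int minX, int minY, int maxX, int maxY,   // screen-space bbox
                 std::vector<std::uint32_t> bins[/*kTilesX * kTilesY*/])
{
    // Reject triangles entirely off-screen.
    if (maxX < 0 || maxY < 0 || minX >= kScreenW || minY >= kScreenH)
        return;

    // Clamp the bbox to the screen, then convert it to a tile range.
    int tx0 = std::max(minX, 0) / kTileW;
    int ty0 = std::max(minY, 0) / kTileH;
    int tx1 = std::min(maxX, kScreenW - 1) / kTileW;
    int ty1 = std::min(maxY, kScreenH - 1) / kTileH;

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            bins[ty * kTilesX + tx].push_back(triId);   // ID only, no vertex data
}
```

A triangle whose bbox straddles a tile corner, e.g. (300,80)-(340,100), lands in all four neighboring bins, but each bin only pays for one 32-bit index.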

Keep in mind that, in general, tile-based rendering would be completely stupid: you create tons more work, and the only reason it's faster is 'smarter' memory access. So try to keep memory usage 'smart' by storing and loading as few bytes as possible in every stage, and when you do, make it as cache-friendly as possible.

4. Your answer to 'how inefficiently can I handle threading?' should always be 'there must be a smarter way'. You don't want to do all that work just to waste time in some suboptimal area. Tiles won't be balanced anyway; you cannot assume every tile will have exactly n triangles. Some will be the skybox with just 2 triangles, some will have the player character holding a torch spawning thousands of alpha-blended particles. And someone might even have a machine with 64 hardware threads on the CPU; do you really want to waste that? ;)

In my rasterizer, every 'worker' thread consumes tasks, whatever they are; every time a worker is done with one, it takes the next task from the task queue. I spawn a task that processes all vertices in the scene and bins the triangles; once that is done, the same thread queues up the tasks for every tile and, as the last one, queues up the vertex processing for the next frame. Whichever thread finishes its tiles first picks up the vertex processing task and creates tasks for all the other threads. (It's actually a bit more complicated, as vertex processing can take longer than the other workers need to get their tiles done, but you get the idea ;) )

##### Share on other sites

Damn that sounds quite complicated...

I'm still not quite sure how the actual binning process would work in detail (and efficiently).

Adding items (even something like an unsigned int) to a dynamic array of size e.g. framebuffer size / 8 is extremely slow per frame and per vertex in my tests.

Especially if a single tile contains a lot of vertices.

Edited by lipsryme

##### Share on other sites

> SW rasterizer was my day job for 4 years.

I wrote only a simple rasterizer in C. It's not too efficient, I think, and I would like to improve it. Could you or some other person give some hints on the topic? Some good links or something? What does an efficient triangle drawing routine look like, is this a scanline approach? How many triangles per second could you rasterize in software rasterization?

##### Share on other sites

> Could you or some other person give some hints on the topic?

You need a very good understanding of the graphics pipeline, graphics algorithms, and CPU architecture and optimization. The next step is to define your SW architecture; that really depends on what features you want to implement. A sort-middle architecture (AKA binning) seems like the best choice for a high-performance SW rasterizer, for the reasons stated above. Then you need extremely good low-level CPU skills: the ability to run VTune, read generated assembly, and then identify and implement low-level optimizations is crucial if you want good performance.

> What does an efficient triangle drawing routine look like, is this a scanline approach?

I actually spent quite some time investigating and optimizing just the rasterizer part. The 2 main performance concerns are:

1. Good cache locality for the rest of the pipeline, especially the TXS.
2. Rasterizers tend to have unpredictable branch patterns. x86 branch misprediction is costly, so you want to minimize the number of branches, even at the cost of more ALU ops.

I tried a bunch of things: hierarchical rasterization, normal scanline, assisted scanline. Regular scanline gave the best results in both required aspects, plus it's very straightforward to set up and run, resulting in very short code and very good performance.

> How many triangles per second could you rasterize in software rasterization?

It depends on the workload. In general, we were able to run DX10 AAA games at 15-20 FPS on a 4-year-old Core i5 CPU. A CPU rasterizer is very sensitive to the number of TXS operations, even more so than a GPU. We never really cared about the number of triangles; TXS and PS were all that really mattered, so our entire architecture was targeted at speeding up those units.

I've seen some interest in SW rasterizers on this site, perhaps I should write an article about my experience.

Edited by N.I.B.

##### Share on other sites


For me, the most interesting thing would be to know what an optimized triangle rasterization routine looks like. I only have somewhat plain C routines, not touching the assembly level or any conscious cache optimization (only flat shading).

In my simple code it looks like this: first the camera transformation and perspective projection to 2D (I can also compute the triangle normal and dot it with camera_dir to cull back-facing triangles). Then I sort (with some ifs) the triangle vertices to know which one is the top, which is the left/middle, and which is the bottom. Then I check whether all 3 vertices lie inside the screen rectangle (fully off-screen triangles are rejected, and edge-touching ones use a slower drawing routine). I then compute the dx/dy values for the edges of the triangle and, starting from the top vertex, walk down the triangle scanline by scanline, drawing each scanline (setting the color and also filling the depth buffer).

It is not too elaborate, I'm afraid :/ and I am searching for ways to improve it, preferably some code or a tutorial on how to do so.
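The routine described above (sort the vertices by y, step the edge x-coordinates per scanline, fill each span) can be sketched roughly like this. This is a bare flat-shaded version with hypothetical names; clipping and the depth buffer are left out, so coordinates are assumed to lie on-screen:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Minimal flat-shaded scanline fill: for each scanline, intersect the long
// edge (v0->v2) and the active short edge, then fill the span between them.
void fillTriangle(float x0, float y0, float x1, float y1, float x2, float y2,
                  std::uint32_t colour, std::uint32_t* fb, int fbWidth)
{
    // Sort vertices so that y0 <= y1 <= y2 (the 'some ifs' sorting step).
    if (y0 > y1) { std::swap(x0, x1); std::swap(y0, y1); }
    if (y1 > y2) { std::swap(x1, x2); std::swap(y1, y2); }
    if (y0 > y1) { std::swap(x0, x1); std::swap(y0, y1); }

    for (int y = (int)y0; y < (int)y2; ++y)
    {
        // Pick the short edge that spans this scanline.
        bool upper = y < (int)y1;
        float ya = upper ? y0 : y1, yb = upper ? y1 : y2;
        float xa = upper ? x0 : x1, xb = upper ? x1 : x2;

        // x on the short edge and on the long edge at this scanline.
        float xs = xa + (xb - xa) * ((y - ya) / (yb - ya));
        float xl = x0 + (x2 - x0) * ((y - y0) / (y2 - y0));

        int left  = (int)std::min(xs, xl);
        int right = (int)std::max(xs, xl);
        for (int x = left; x < right; ++x)
            fb[y * fbWidth + x] = colour;   // flat color; a depth write would go here
    }
}
```

The loop bounds guarantee the edge divisions never see a zero height: a flat-top or flat-bottom triangle simply skips the corresponding half.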

Edited by fir

##### Share on other sites

Let me ask a simple question that might be obvious to you but I guess not so much for me....

Do I rasterize the triangles before I put them into the "bins"? Which would mean that my bins contain the pixels that need to be filled?

If not, and I store vertices or a reference to the triangle's vertices, how is there any relationship to pixel cache coherency? The pixels that fill the triangle could correspond to completely arbitrary locations in the vertex buffer.

edit: wait, I think I get it... the framebuffer is not just logically partitioned, it is split into 4 (or whatever the partition count is) actual buffers.

Edited by lipsryme

##### Share on other sites

> Do I rasterize the triangles before I put them into the "bins"? Which would mean that my bins contain the pixels that need to be filled?

No. Depending on triangle sizes and the bins-per-triangle ratio, you might want to do bin-level rasterization: instead of binning based on the triangle's bounding box, check whether the triangle actually covers the tile. This 'optimization' cost us some performance, so we ended up not doing it.

> If not, and I store vertices or a reference to the triangle's vertices, how is there any relationship to pixel cache coherency?

It's not directly related. We didn't store vertices in the bins; we stored the triangle interpolants. Couple that with a smart memory pool and some cache-centric optimizations, and fetching triangle data into the rasterizer was very efficient.

> edit: wait, I think I get it... the framebuffer is not just logically partitioned, it is split into 4 (or whatever the partition count is) actual buffers.

If by partitions you mean tiles, then kind of. The framebuffer itself is logically partitioned. As I mentioned above, when processing bins we used temporary buffers that stored the tile's pixels in contiguous memory, in a tiled SoA layout. That's where the cache coherency comes from.
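One way to picture that tile-local layout: a temporary buffer holds the tile's color and depth in separate contiguous (SoA) arrays, and is copied back into the linear framebuffer once the tile is finished. The struct below is a sketch under that assumption, using the thread's 320x90 tile size; it is not the answerer's actual code:

```cpp
#include <cstdint>
#include <vector>

// Temporary per-tile buffer: color and depth live in separate contiguous
// arrays (SoA), addressed with tile-local coordinates.
struct TileBuffer
{
    static constexpr int kW = 320, kH = 90;

    std::vector<std::uint32_t> color;
    std::vector<float> depth;

    TileBuffer() : color(kW * kH, 0), depth(kW * kH, 1.0f) {}

    // Map tile-local (x, y) to the contiguous index.
    static int index(int x, int y) { return y * kW + x; }

    // Copy the finished tile back to its rectangle in the full framebuffer.
    void resolve(std::uint32_t* framebuffer, int fbWidth,
                 int tileX0, int tileY0) const
    {
        for (int y = 0; y < kH; ++y)
            for (int x = 0; x < kW; ++x)
                framebuffer[(tileY0 + y) * fbWidth + (tileX0 + x)] =
                    color[index(x, y)];
    }
};
```

While a tile is being rasterized, every read and write stays inside these two small arrays, which is exactly where the cache coherency comes from; only the final resolve touches the full-size framebuffer.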

##### Share on other sites

But calculating the barycentric weights/coords is part of the rasterization process and done (partly) per pixel. Even more so, interpolated data like texture coordinates also has to be interpolated per pixel. Which part of that process do you actually store?

This is how I do the rasterization part in my renderer:

```cpp
// Triangle setup
int A01, A12, A20, B01, B12, B20;
int w0_row, w1_row, w2_row;

A01 = v0.y - v1.y; B01 = v1.x - v0.x;
A12 = v1.y - v2.y; B12 = v2.x - v1.x;
A20 = v2.y - v0.y; B20 = v0.x - v2.x;

// Barycentric weights at the top-left corner of the bounding box
p.x = tri.boundingBox.minX;
p.y = tri.boundingBox.minY;
w0_row = orient2D(v1, v2, p);
w1_row = orient2D(v2, v0, p);
w2_row = orient2D(v0, v1, p);

for (p.y = tri.boundingBox.minY; p.y <= tri.boundingBox.maxY; ++p.y)
{
    // Barycentric coordinates at the start of the row
    int w0 = w0_row;
    int w1 = w1_row;
    int w2 = w2_row;

    for (p.x = tri.boundingBox.minX; p.x <= tri.boundingBox.maxX; ++p.x)
    {
        // If p is on or inside all edges, render the pixel
        if ((w0 | w1 | w2) >= 0)
        {
            // Skip the degenerate case where all weights are zero
            // (a 'continue' here would skip the edge-function steps below)
            if (w0 != 0 || w1 != 0 || w2 != 0)
            {
                unsigned int currentPixel = p.x + p.y * viewportWidth;

                // Calc final color
                this->OM->RT->texture[currentPixel] =
                    OutputMerger::ToUnsigned(Vector4(1, 1, 1, 1));
            }
        }

        // One step to the right
        w0 += A12;
        w1 += A20;
        w2 += A01;
    }

    // One row step
    w0_row += B12;
    w1_row += B20;
    w2_row += B01;
}
```

Edited by lipsryme
