[Software Rasterizer] Triangle Binning Question




Find the min/max bin for x and y based on the projected x/y extents,

then loop through those and at every bin write out the triangle ID, kinda:


Bins[x+y*BinWidth][BinDepth++]=TriangleID;
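
Spelled out, a minimal single-threaded version of that idea could look like this. All names here (SCREEN_W, TILE, BIN_W, bins) are hypothetical, just to show how the projected bounding box maps to a bin range:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical setup: a 1280x720 screen split into 64x64-pixel bins.
constexpr int SCREEN_W = 1280, SCREEN_H = 720, TILE = 64;
constexpr int BIN_W = (SCREEN_W + TILE - 1) / TILE;   // bins per row
constexpr int BIN_H = (SCREEN_H + TILE - 1) / TILE;   // bins per column

std::vector<uint32_t> bins[BIN_W * BIN_H];            // one triangle-ID list per bin

// Assumes the triangle has already been clipped/culled to the screen.
void BinTriangle(uint32_t triangleID, float minX, float minY, float maxX, float maxY)
{
    // Convert the projected bounding box into a bin range, clamped to the screen.
    int bx0 = std::max(0,         (int)minX / TILE);
    int by0 = std::max(0,         (int)minY / TILE);
    int bx1 = std::min(BIN_W - 1, (int)maxX / TILE);
    int by1 = std::min(BIN_H - 1, (int)maxY / TILE);

    // Write the triangle ID into every bin the bounding box touches.
    for (int y = by0; y <= by1; ++y)
        for (int x = bx0; x <= bx1; ++x)
            bins[x + y * BIN_W].push_back(triangleID);
}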

This part is what's still giving me a headache. I can't really figure out how this works. Especially in a multithreaded environment where you can't do something like:


// Bin Triangle
ScreenTile triangleTile(minX, minY, maxX, maxY);
if(Rectangle2DContains(this->tile[0], triangleTile))
	this->triangle_bin[0].emplace_back(v0, v1, v2);
if(Rectangle2DContains(this->tile[1], triangleTile))
	this->triangle_bin[1].emplace_back(v0, v1, v2);
if(Rectangle2DContains(this->tile[2], triangleTile))
	this->triangle_bin[2].emplace_back(v0, v1, v2);
if(Rectangle2DContains(this->tile[3], triangleTile))
	this->triangle_bin[3].emplace_back(v0, v1, v2);

Maybe I'm also getting the concept wrong, but when I think of triangle binning I think of partitioning the screen into, let's say, 4 parts, then checking the bounding box of the triangle against all 4 parts, and if the triangle is inside a part, adding it to that tile's "bin" (?).

You should get it working first, then multithread it. Multithreading is an optimization; you do an optimization after you've gathered profiling data to pick the right places to optimize, and you profile after you have a working solution. So I'm afraid that, technically, you cannot have a multithreading headache while you're still trying to understand the basic concept ;)

And yes, that's how binning works, but usually the screen is divided into more than 4 parts. With only 4, the tiles would for most resolutions exceed the cache size for framebuffer tiles. That's why I wrote it as a loop.

And I suggest again to just add an index of the triangle to your bins, not the full vertices. An ID can be 16, 24 or 32 bits, while your 3x vec4/float4 are 48 bytes; that's wasteful, especially once you've created more bins.

There is no point in spending more memory on bins than you spend on pixels per tile, as the whole idea of binning is to save memory bandwidth by being cache-friendly.
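
To put a number on that, a quick hypothetical size comparison (a vec4 position per vertex versus a 32-bit ID):

#include <cstdint>
#include <cstdio>

struct Vec4 { float x, y, z, w; };

// Two hypothetical bin-entry layouts: full vertex positions vs. a triangle index.
struct BinnedTriangleVerts { Vec4 v0, v1, v2; };   // 48 bytes per entry, before VS outputs
using  BinnedTriangleID = uint32_t;                // 4 bytes per entry

int main()
{
    std::printf("%zu bytes vs %zu bytes per binned triangle\n",
                sizeof(BinnedTriangleVerts), sizeof(BinnedTriangleID));
}

With many bins the difference multiplies, since a triangle overlapping several tiles gets stored once per bin it touches.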

Okay so I think I've got the binning process working (single-threaded and slow).

Looks something like:


ScreenTile triangleTile(minX, minY, maxX, maxY);
for(unsigned short i = 0; i != NUM_TILES; ++i)
{
	int edge0 = TriangleEdge_Rectangle_HalfSpaceIntersection(this->screenTiles[i], v0.x, v0.y, v1.x, v1.y);
	int edge1 = TriangleEdge_Rectangle_HalfSpaceIntersection(this->screenTiles[i], v1.x, v1.y, v2.x, v2.y);
	int edge2 = TriangleEdge_Rectangle_HalfSpaceIntersection(this->screenTiles[i], v2.x, v2.y, v0.x, v0.y);

	// Skip tile when outside an edge
	if(edge0 == 0x0 || edge1 == 0x0 || edge2 == 0x0)
		continue;

	// Add triangle to bin
	bin[i].push_back(tri.index);

} // for each bin

Then for rasterization I loop through each tile and rasterize its triangles.
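
For reference, a minimal sketch of that per-tile loop; RasterizeTriangleClipped, triangles[], colorBuffer and depthBuffer are hypothetical stand-ins for whatever does the actual edge walking:

// For each tile, rasterize only the triangles binned into it, and clamp the
// raster loop to the tile's pixel rectangle so all buffer accesses stay local.
for (unsigned short t = 0; t != NUM_TILES; ++t)
{
    const ScreenTile& tileRect = screenTiles[t];

    for (unsigned int triIndex : bin[t])
    {
        // Hypothetical helper: scans only the pixels inside tileRect.
        RasterizeTriangleClipped(triangles[triIndex], tileRect, colorBuffer, depthBuffer);
    }
}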

Now I understand that I need to split each "tile" of the framebuffer & depth buffer array into logical parts (or actual separate arrays that I have to copy back together at the end?) so it fits into the L2 cache, right? (I'm using 8x4 splits at 1280x720, which makes each tile 320x90.)

If I process one tile at a time and I know that I'm only touching this tile's pixels, does that already do the trick when accessing the buffers as

[x + y * bufferWidth] ?

But what about the pixel shading part? Doesn't that also create variables and other data that go into the L2 cache? How do I make sure that just the framebuffer and depth buffer parts are in there..?

Grats, now you've got the most abstract part of it working :)

"Tiles" don't have to be separate memory areas (although they can), you can save the copy operation by just pointing at the right offset into the framebuffer (color and depth) per tile. The cpu cache will cache it no matter where the memory is really placed in DRam.

And yes, if you process just the pixels of that tile, those are all going to be in L2. You can improve the situation further by delaying the 'clear' that you usually do to the point where you start rasterizing a tile. That way you don't even have to load the pixels from DRAM, as you write to those cache lines during the clear of the tile. (The CPU will combine several stores to the same cache line into one write, so you don't get a cache miss on writing them as long as you fill a cache line completely.)
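
And a sketch of that deferred clear, reusing the hypothetical constants above: the tile is cleared right before its triangles are rasterized, so the freshly written cache lines are immediately reused.

// Clear this tile's color and depth just before rasterizing it, instead of
// clearing the whole framebuffer up front.
void ClearTile(uint32_t* colorBase, float* depthBase, uint32_t clearColor)
{
    for (int y = 0; y < TILE_H; ++y)
    {
        for (int x = 0; x < TILE_W; ++x)
        {
            colorBase[x + y * FB_WIDTH] = clearColor;
            depthBase[x + y * FB_WIDTH] = 1.0f;   // far plane
        }
    }
}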

Regarding cache pollution by other operations:

1. Shading in general doesn't pollute caches; the data you access are mostly interpolators that you re-use on every pixel of the triangle. Of course, they occupy the L1, but it should be by far less than 1 KB per triangle.

2. Texture fetches "can" pollute the cache, but that's actually even wanted: nearby pixels will most likely access nearby texels. If you write a regular rasterizer, you write out pixels to random locations on the screen, from random objects, and even if it's the same object (e.g. terrain), it can access vastly different parts of the textures in a continuous stream of triangles.

But if you rasterize tile by tile (and you properly utilize mipmapping), you'll be as coherent for texels as you are for pixels.

3. Usually, most of the data you move are framebuffer pixels. In an average scene (people usually claim) you have an overdraw of 3.5; that means, of all rasterized pixels, you actually just see 1/3.5. So for every pixel in your tile, you'll probably access the z-buffer about 4 times for reading; if you render front-to-back, you won't write much, but e.g. for transparent objects you might write several times per pixel. So you'll end up with a lot of framebuffer data movement (16-42 bytes if you read z 3.5 times, write z, and write color with no blending).

On the other hand, you access textures (with a mipmap that is <= 1:1 texel:pixel ratio), so at most you'll have as many texel reads as you have color writes, and on top you have at least as many z-buffer reads AND writes.

So yes, textures can take up cache, but most of that data will be accessed/cached once in L1 and then not accessed again. Due to tiling, even different polys will likely hit already cached texels, and the total texture bandwidth should be less than the framebuffer bandwidth.

4. That's why I always say "get it running", "profile", "optimize". Maybe textures won't be critical because you spend 90% of the time doing shading math? Maybe textures will be very critical because you're not doing any math at all, but instead combine 16 texture layers per terrain pixel. When it comes to that, you can adjust your tile size (why would you want the cache exclusively for the framebuffer if texture reading is the bottleneck, right?). If textures and framebuffer tiles are fighting for cache, reduce the tile size; you'll have less fighting and more locality for texture lookups.

5. Consider doing some compression. No, not the GPU-friendly S3TC or something, but some CPU-friendly solution, e.g. http://www.gamasutra.com/view/feature/3090/image_compression_with_vector_.php

"When we put the VQ compression described above in our renderer, we expected to get slightly worse performance but were ready to live with it, because it took about 20MB off our memory footprint. However, profiles showed that rendering has actually gained about 15 to 20 percent performance, probably due to decreased memory traffic."

From my experience, texture filtering actually takes more time than the memory accesses on a hyperthreaded CPU.

My naive approach would be to skip "bin tiles" and just do something like pixel rasterizer threads. Say you want 8 threads: every pixel gets dumped into one of them. So you have one triangle at a time and you rasterize it horizontally, line by line. Since only one triangle is rendered at a time, you have all its attributes stored globally somewhere that each of the 8 pixel threads can access. For each pixel you could dump it into the next thread's queue and continue on. Each thread would say "I know the position, I can calculate the deltas for texture coords / normals / whatever interpolation" and compute that from the current triangle's data.

To each thread you simply pass the pixel's 2D position and whatever else you need.
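
A minimal sketch of how that could look, distributing scanlines round-robin over a fixed set of workers (TriangleSetup and the elided span math are hypothetical placeholders, not anything from this thread):

#include <functional>
#include <thread>
#include <vector>

// Shared, read-only per-triangle setup that every pixel/scanline worker reads.
struct TriangleSetup
{
    float vx[3], vy[3];        // screen-space positions
    float attrib[3][8];        // per-vertex attributes to interpolate
    int   minY, maxY;          // vertical extent of the triangle
};

// Each worker rasterizes every numThreads-th scanline of the current triangle.
void RasterizeRows(const TriangleSetup& tri, int threadIdx, int numThreads)
{
    for (int y = tri.minY + threadIdx; y <= tri.maxY; y += numThreads)
    {
        // ... compute the span [xStart, xEnd] for scanline y,
        // ... interpolate the attributes across it and shade the pixels.
    }
}

void RasterizeTriangleMT(const TriangleSetup& tri, int numThreads)
{
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back(RasterizeRows, std::cref(tri), t, numThreads);
    for (auto& w : workers)
        w.join();
}

(Spawning threads per triangle like this would be expensive in practice; a persistent thread pool would be the obvious refinement.)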


Now for the multithreaded approach.....

To put this into some perspective here's my test setup:

Rendering of a mesh with ~70k polygons (fairly close to the camera) http://d.pr/i/2lml

CPU: Intel Core i5 2.8ghz (4 physical cores, no hyperthreading / 256kb L2 cache each)

I've not yet converted everything to SIMD / SSE processing (only the most important bits, e.g. the VS).

And I'm only writing out a color value (so no pixel shading or depth buffer test/write).

VERTEX PROCESSING:

-----------------------------------

What I tried to do was give each thread a split of the index buffer (so for 4 threads -> 4 splits), so that each thread works on its own batch of vertices.

What I'm then doing is filling an IntermediateTriangle buffer local to each thread, like so:


this->intermediateTriangles[threadIdx][triangleNr] = IntermediateTriangle(v0, v1, v2,
                                                     v0_out, v1_out, v2_out, triangleNr);
triangleNr++;

The IntermediateTriangle includes the transformed vertices that make up the triangle plus its additional VS outputs and its ID inside this buffer.
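
A minimal sketch of that index-buffer split; indexCount, NUM_THREADS, threads[] and ProcessVertexRange are hypothetical names for whatever drives the vertex stage:

#include <algorithm>
#include <thread>

// Give each thread one contiguous range of triangles. Each thread transforms
// its range into its own intermediate buffer, so no synchronization is needed
// during vertex processing.
const unsigned int triangleCount      = indexCount / 3;
const unsigned int trianglesPerThread = (triangleCount + NUM_THREADS - 1) / NUM_THREADS;

for (unsigned int threadIdx = 0; threadIdx < NUM_THREADS; ++threadIdx)
{
    unsigned int firstTri = threadIdx * trianglesPerThread;
    unsigned int lastTri  = std::min(firstTri + trianglesPerThread, triangleCount);

    // Hypothetical worker: runs the VS on indices [firstTri * 3, lastTri * 3)
    // and writes IntermediateTriangles into intermediateTriangles[threadIdx].
    threads[threadIdx] = std::thread(ProcessVertexRange, threadIdx, firstTri, lastTri);
}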

This process takes around 3.4ms

BINNING:

-----------------------------------

Next up in the same thread I'm doing the Binning process.

I loop through each IntermediateTriangle inside this thread's local buffer and do the binning as shown in my last post:


// Add triangle to bin
bin[ii][binCountPerTile[ii]++] = std::make_pair(tri.index, threadIdx);

I store an std::pair of two unsigned integers because I not only need to know the triangle's index into the intermediate buffer, but also which thread's local buffer these vertices are located in.

binCountPerTile is an atomic<unsigned int>.

Note: I switched everything from push_back to direct writes, which gave me a little speedup.
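
For clarity, this is roughly what the atomic counter buys you when several threads bin into the same tile at once (MAX_TRIS_PER_BIN is a hypothetical fixed capacity):

#include <atomic>
#include <cstdint>
#include <utility>

constexpr unsigned int NUM_TILES        = 32;
constexpr unsigned int MAX_TRIS_PER_BIN = 16384;   // hypothetical capacity per bin

std::pair<uint32_t, uint32_t> bin[NUM_TILES][MAX_TRIS_PER_BIN];
std::atomic<unsigned int>     binCountPerTile[NUM_TILES];

// fetch_add reserves a unique slot, so two threads binning into the same tile
// never write to the same index; only the counter itself needs to be atomic.
void AddToBin(unsigned int tileIdx, uint32_t triIndex, uint32_t threadIdx)
{
    unsigned int slot = binCountPerTile[tileIdx].fetch_add(1, std::memory_order_relaxed);
    bin[tileIdx][slot] = std::make_pair(triIndex, threadIdx);
}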

This process takes around 11.4ms (probably the culprit?)

Then I rejoin my threads and prepare a quick queue that holds the indices of the non-empty bins:


// Prepare tile queue
for(unsigned short i = 0; i < NUM_TILES; ++i)
{
	if(this->binCountPerTile[i] > 0)
		this->tileQueue.push(i);
}


PIXEL PROCESSING:

-----------------------------------

Now I launch another thread group for rasterization / pixel processing.

In here I run a while loop that checks if the queue is empty.

I take the first tile from the queue like so:


// Keep taking tiles until the queue runs dry. The empty() check happens under
// the lock, otherwise another thread could empty the queue between the check
// and the pop().
while(true)
{
	m.lock();
	if(this->tileQueue.empty())
	{
		m.unlock();
		break;
	}
	unsigned int tile = this->tileQueue.front();
	this->tileQueue.pop();
	m.unlock();

	...
}

And then process each triangle inside the tile.
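
Since the list of non-empty tiles is built single-threaded up front, a lock-free alternative to the mutex + queue (not something from this thread, just a common pattern) is a single atomic cursor into that list:

#include <atomic>
#include <vector>

std::vector<unsigned int> tileList;          // filled single-threaded, as above
std::atomic<unsigned int> nextTile{0};

// Each rasterization worker repeatedly grabs the next unprocessed tile.
void RasterizeWorker()
{
    for (;;)
    {
        unsigned int i = nextTile.fetch_add(1, std::memory_order_relaxed);
        if (i >= tileList.size())
            break;

        unsigned int tile = tileList[i];
        // ... rasterize every triangle binned into 'tile' ...
    }
}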

This process takes around 4.7ms

So as you can see, my frame time adds up to around 19-20ms, which definitely needs improvement.

Tests without binning run the whole process in around 9-10ms.

I'd love to get your inputs :)

edit: um...what? why is this getting voted down ?


I'm afraid it's because of my fat fingers:

-was on the way home

-watched this interesting thread on my phone

-wanted to vote up

-but somehow it was detected as down vote

-I cannot correct that anymore. I'm sorry.

But your results are very interesting; it's rare to see actual profiling numbers here. Yet your test case is not very friendly to tile-based rendering. It's just one object at a specific place; if you zoomed out slightly, it could actually fit into a single tile. And that's what your results report: you are about 10ms slower than without the optimization, and your binning takes 11.4ms.

A better test case would be to render 32x32x32 boxes covering most of the screen; then you'd actually become framebuffer bound, and your tile-based framebuffer optimization might kick in.

My 2 cents.

No harm done :)

Back on topic:

It seems like I had a little oversight in the binning process: I also need to make sure that the tiles I'm running the half-space test against actually contain or intersect the triangle's bounding box. So in my test before, it was filling every one of the 32 tiles with triangles that didn't actually touch them!

Here's what I'm doing now:


// Only loop through the tiles that our bounding box overlaps
ScreenTile triangleTile(minX, minY, maxX, maxY);
for(unsigned short ii = 0; ii != NUM_TILES; ++ii)
{
	// Skip tiles that don't contain our triangle's bounding box
	if(!Rectangle2DIntersects(this->screenTiles[ii], triangleTile))
		continue;

	int edge0 = TriangleEdge_Rectangle_HalfSpaceIntersection(this->screenTiles[ii], v0.x, v0.y, v1.x, v1.y);
	int edge1 = TriangleEdge_Rectangle_HalfSpaceIntersection(this->screenTiles[ii], v1.x, v1.y, v2.x, v2.y);
	int edge2 = TriangleEdge_Rectangle_HalfSpaceIntersection(this->screenTiles[ii], v2.x, v2.y, v0.x, v0.y);

	// Skip tile when outside an edge
	if(edge0 == 0x0 || edge1 == 0x0 || edge2 == 0x0)
		continue;

	// Add triangle to bin
	bin[ii][binCountPerTile[ii]++] = std::make_pair(tri.index, threadIdx);

} // for each bin

bool Rectangle2DIntersects(const ScreenTile &tile, const ScreenTile &triangleBB)
{
	if(tile.minX <= triangleBB.maxX &&
	   triangleBB.minX <= tile.maxX &&
	   tile.minY <= triangleBB.maxY &&
	   triangleBB.minY <= tile.maxY)
	{
	    return true;
	}

	return false;
}

I feel like the bounding box test could be done faster than a full rectangle/rectangle intersection test.

What I really need is to loop only over the tiles that my triangle's bounding box intersects. Does anyone know an efficient way?
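
One common way (and essentially what the very first reply described) is to turn the bounding box directly into a tile index range, so you never even visit tiles outside the box. A sketch, assuming the 320x90 tiles in a 4x8 grid and a box already clamped to the screen:

constexpr int TILE_W = 320, TILE_H = 90, TILES_X = 4, TILES_Y = 8;

// Convert the clamped bounding box straight into a tile range; only these
// tiles need the three half-space tests.
int tx0 = std::max(0,           (int)minX / TILE_W);
int ty0 = std::max(0,           (int)minY / TILE_H);
int tx1 = std::min(TILES_X - 1, (int)maxX / TILE_W);
int ty1 = std::min(TILES_Y - 1, (int)maxY / TILE_H);

for (int ty = ty0; ty <= ty1; ++ty)
{
    for (int tx = tx0; tx <= tx1; ++tx)
    {
        unsigned int ii = tx + ty * TILES_X;

        // ... run the three TriangleEdge_Rectangle_HalfSpaceIntersection tests
        // ... against screenTiles[ii] and bin the triangle on success.
    }
}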

My frame time for the close-up now went down to around 10ms (4ms vertex / 3ms binning / 3ms rasterization and filling pixels)!

