I've been thinking about some optimizations for my rasterizer and came across the concept of "binning" triangles into tiles of the framebuffer, which are then processed in parallel by several threads.
But I still have some open questions on how to do this efficiently:
1. I've seen a sample that uses a tile size small enough to fit into the cache, e.g. 320x90. What I don't quite get is why the screen height is divided by 8 but the width only by 4. My pixels are stored contiguously in a std::vector, so isn't one full scanline much faster to access in memory than two half scanlines? Why not split it the other way around?
2. What are the binning containers? Are they vectors that I push_back the triangles into every frame? Wouldn't that be extremely slow?
3. What happens if a triangle overlaps several tiles? Do I just push it into the containers of all the tiles it overlaps? Doesn't that produce a lot of unneeded processing?
4. With 320x90 tiles at a screen resolution of 1280x720, I've divided my buffer into 4 * 8 = 32 tiles. Mapping tiles to CPU threads 1:1 is probably not efficient. So would I run 4 threads and, when they're done, start the next 4?
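
To make questions 2 to 4 concrete, here is a rough sketch of what I imagine the binning and scheduling could look like. All the names (Tri, Bins, binTriangles, forEachTileParallel) are my own placeholders, not taken from the sample. I'm also assuming that triangles are binned by their screen-space bounding box and that the threads grab tiles from a shared atomic counter instead of running in fixed batches of 4. I don't know if that's how it's usually done.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical screen-space triangle: three x and three y coordinates.
struct Tri { float x[3]; float y[3]; };

constexpr int kScreenW = 1280, kScreenH = 720;
constexpr int kTileW = 320, kTileH = 90;
constexpr int kTilesX = kScreenW / kTileW;   // 4
constexpr int kTilesY = kScreenH / kTileH;   // 8
constexpr int kTileCount = kTilesX * kTilesY; // 32

// One list of triangle indices per tile.
using Bins = std::vector<std::vector<std::size_t>>;

// Push each triangle's index into every tile its bounding box touches.
// A triangle that straddles a tile border ends up in several bins.
void binTriangles(const std::vector<Tri>& tris, Bins& bins) {
    bins.resize(kTileCount);
    for (auto& b : bins) b.clear(); // clear() keeps the capacity from the last frame

    for (std::size_t i = 0; i < tris.size(); ++i) {
        const Tri& t = tris[i];
        const float minX = std::min({t.x[0], t.x[1], t.x[2]});
        const float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        const float minY = std::min({t.y[0], t.y[1], t.y[2]});
        const float maxY = std::max({t.y[0], t.y[1], t.y[2]});

        // Skip triangles whose bounding box is completely off-screen.
        if (maxX < 0 || maxY < 0 || minX >= kScreenW || minY >= kScreenH) continue;

        const int tx0 = std::clamp(static_cast<int>(minX) / kTileW, 0, kTilesX - 1);
        const int tx1 = std::clamp(static_cast<int>(maxX) / kTileW, 0, kTilesX - 1);
        const int ty0 = std::clamp(static_cast<int>(minY) / kTileH, 0, kTilesY - 1);
        const int ty1 = std::clamp(static_cast<int>(maxY) / kTileH, 0, kTilesY - 1);

        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * kTilesX + tx].push_back(i);
    }
}

// Run processTile(tileIndex) once for every tile, spread over numThreads workers.
// Each worker fetches the next unprocessed tile index from a shared counter.
template <class F>
void forEachTileParallel(int numThreads, F&& processTile) {
    assert(numThreads > 0);
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < numThreads; ++w) {
        workers.emplace_back([&] {
            for (int tile; (tile = next.fetch_add(1)) < kTileCount; )
                processTile(tile);
        });
    }
    for (auto& w : workers) w.join();
}
```

The idea would be to call binTriangles(tris, bins) once per frame and then something like forEachTileParallel(4, [&](int tile) { /* rasterize bins[tile] into that tile */ }). Is this roughly the right direction, or is re-filling the vectors and spawning the threads every frame already too expensive?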