Multithreading a software renderer
Currently I have a software renderer that is single-threaded. I want to make it multithreaded, so I whipped up some test code that draws a single triangle that takes up half the screen.
I divide the triangle into as many blocks as I have threads, and each thread renders one block of the triangle.
This is giving me a more than 50% improvement. However, I'm not sure how this would work in the real world, especially if there are lots of small triangles. I guess I could dynamically adjust the number of divisions per triangle based on its size, but then I would have wasted threads doing nothing.
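For reference, the per-triangle split described above could be sketched roughly like this. It is only an illustration, not the actual test code: `Vec2`, `Framebuffer`, `fillRows` and `drawTriangleBanded` are made-up names, the fill is a plain edge-function test, and a thread is spawned per band purely to keep the sketch short.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical types for illustration; not taken from the poster's renderer.
struct Vec2 { float x, y; };

struct Framebuffer {
    int width, height;
    std::vector<std::uint32_t> pixels;
    Framebuffer(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {}
};

// Edge function: its sign tells which side of the edge a->b the point lies on.
static float edgeFn(const Vec2& a, const Vec2& b, float px, float py) {
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

// Fill rows [y0, y1) of one triangle. Callers get disjoint rows, so no locks.
static void fillRows(Framebuffer& fb, Vec2 a, Vec2 b, Vec2 c,
                     int y0, int y1, std::uint32_t color) {
    for (int y = y0; y < y1; ++y) {
        for (int x = 0; x < fb.width; ++x) {
            float px = x + 0.5f, py = y + 0.5f;  // sample at the pixel centre
            float w0 = edgeFn(a, b, px, py);
            float w1 = edgeFn(b, c, px, py);
            float w2 = edgeFn(c, a, px, py);
            bool inside = (w0 >= 0 && w1 >= 0 && w2 >= 0) ||
                          (w0 <= 0 && w1 <= 0 && w2 <= 0);  // either winding
            if (inside) {
                fb.pixels[static_cast<std::size_t>(y) * fb.width + x] = color;
            }
        }
    }
}

// Split the triangle's row span into up to threadCount bands, one thread each.
void drawTriangleBanded(Framebuffer& fb, Vec2 a, Vec2 b, Vec2 c,
                        std::uint32_t color, int threadCount) {
    assert(threadCount > 0);
    int minY = std::max(0, static_cast<int>(std::floor(std::min({a.y, b.y, c.y}))));
    int maxY = std::min(fb.height,
                        static_cast<int>(std::ceil(std::max({a.y, b.y, c.y}))));
    if (minY >= maxY) return;  // nothing on screen
    int band = (maxY - minY + threadCount - 1) / threadCount;  // rows per band, rounded up
    std::vector<std::thread> workers;
    for (int y0 = minY; y0 < maxY; y0 += band) {
        int y1 = std::min(maxY, y0 + band);
        workers.emplace_back([&fb, a, b, c, y0, y1, color] {
            fillRows(fb, a, b, c, y0, y1, color);
        });
    }
    for (std::thread& w : workers) w.join();
}
```

Calling `drawTriangleBanded(fb, a, b, c, color, 4)` should produce the same pixels as a single-threaded fill, since each band owns its rows. Creating threads per triangle is costly, which is one reason small triangles are a concern here; a long-lived pool of workers would be the likely next step.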
I've read that Larrabee puts triangles into bins based on screen location so that multiple triangles can be drawn at once. I'm not sure whether I should try this or some other method.
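As a rough idea of what binning by screen location means, here is a minimal sketch that sorts triangles into fixed-size screen tiles by bounding box. It is not Larrabee's actual scheme; `Tri`, `TileGrid`, `toTile` and `binTriangles` are hypothetical names, and the bounding-box test is conservative (a triangle can land in a tile it does not really touch).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical screen-space triangle: x[i], y[i] are the vertex positions.
struct Tri { float x[3], y[3]; };

struct TileGrid {
    int tileSize, tilesX, tilesY;
    // bins[tile] holds the indices of triangles whose bounding box touches it.
    std::vector<std::vector<int>> bins;
};

// Map a screen coordinate to a tile index along one axis, clamped to the screen.
static int toTile(float coord, int limit, int tileSize) {
    int pixel = static_cast<int>(std::floor(coord));
    pixel = std::max(0, std::min(limit - 1, pixel));
    return pixel / tileSize;
}

TileGrid binTriangles(const std::vector<Tri>& tris,
                      int width, int height, int tileSize) {
    assert(width > 0 && height > 0 && tileSize > 0);
    TileGrid grid;
    grid.tileSize = tileSize;
    grid.tilesX = (width + tileSize - 1) / tileSize;
    grid.tilesY = (height + tileSize - 1) / tileSize;
    grid.bins.resize(static_cast<std::size_t>(grid.tilesX) * grid.tilesY);

    for (std::size_t i = 0; i < tris.size(); ++i) {
        const Tri& t = tris[i];
        float minX = std::min({t.x[0], t.x[1], t.x[2]});
        float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        float minY = std::min({t.y[0], t.y[1], t.y[2]});
        float maxY = std::max({t.y[0], t.y[1], t.y[2]});
        // Skip triangles whose bounding box is entirely off screen.
        if (maxX < 0 || maxY < 0 || minX >= width || minY >= height) continue;

        int tx0 = toTile(minX, width, tileSize);
        int tx1 = toTile(maxX, width, tileSize);
        int ty0 = toTile(minY, height, tileSize);
        int ty1 = toTile(maxY, height, tileSize);
        for (int ty = ty0; ty <= ty1; ++ty) {
            for (int tx = tx0; tx <= tx1; ++tx) {
                grid.bins[static_cast<std::size_t>(ty) * grid.tilesX + tx]
                    .push_back(static_cast<int>(i));
            }
        }
    }
    return grid;
}
```

Because triangles are appended in submission order, each bin keeps the original draw order, and different tiles can then be rasterised by different threads without touching the same pixels.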
Ideas, thoughts?
I've implemented dual-CPU software rendering before.
I did it by splitting the screen down the middle and having 2 clipped scenes.
Then I rendered half the screen on the first CPU and half on the other.
I did get framerate improvements, but it still wasn't as fast as rendering with the GPU, so I abandoned the project.
It was the first time I ever handled threads. The way you've proposed is something I could probably attempt now, but back then I probably couldn't, so the way I did it was probably easier thread-programming-wise.
But I think what you've said should work too.
[EDIT] It gives you a gnarly clipping line down the centre of your screen :) [/EDIT]
(I am in no way a software rendering pro, just someone who had the same idea during a school project.)
There can be multiple problems.
If you split the screen in half:
- lots of small triangles give a lot of useless clipping tests (a triangle that isn't in the frustum gets tested twice).
- lots of small triangles make for not-so-threadable code (a lot of memory reads for small areas).
- lots of big triangles give you clipping problems: you have to clip them twice, possibly doubling the amount of output vertices.
You could consider giving each thread its own render buffer and merging the buffers afterwards. You could divide the triangles over the two threads, have each thread render to its own buffer, then merge the buffers (which could also be done multithreaded). It would be almost lockless (just two syncs: both threads have to be done rendering, then both threads have to be done merging). I don't know what the performance gain would be in that case; it was just something I came up with but never got the time to implement.
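A minimal sketch of the merge step, assuming each per-thread buffer also keeps a depth value per pixel so the merge can pick the nearer fragment (the depth buffer is my assumption, it isn't spelled out above). `RenderTarget` and `mergeRows` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical per-thread render target: colour plus depth for every pixel.
struct RenderTarget {
    int width, height;
    std::vector<std::uint32_t> color;
    std::vector<float> depth;  // smaller value = closer to the camera
    RenderTarget(int w, int h)
        : width(w), height(h),
          color(static_cast<std::size_t>(w) * h, 0),
          depth(static_cast<std::size_t>(w) * h,
                std::numeric_limits<float>::infinity()) {}
};

// Merge rows [y0, y1) of src into dst, keeping whichever fragment is nearer.
// Different row ranges touch different pixels, so several threads could each
// merge their own range at the same time without locking.
void mergeRows(RenderTarget& dst, const RenderTarget& src, int y0, int y1) {
    assert(dst.width == src.width && dst.height == src.height);
    assert(0 <= y0 && y0 <= y1 && y1 <= dst.height);
    std::size_t begin = static_cast<std::size_t>(y0) * dst.width;
    std::size_t end = static_cast<std::size_t>(y1) * dst.width;
    for (std::size_t i = begin; i < end; ++i) {
        if (src.depth[i] < dst.depth[i]) {
            dst.depth[i] = src.depth[i];
            dst.color[i] = src.color[i];
        }
    }
}
```

This only works for opaque geometry, where the depth test decides the winner regardless of draw order. The merge also reads and writes every pixel of every buffer each frame, so its cost has to be weighed against whatever the parallel rendering saves.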
--edit--
Also worth a read:
http://www.devmaster.net/forums/showthread.php?t=1884
It could be used for implementing this with threads.
I've sort of thought of doing a multi-buffer thing where the buffers get merged, but I'm not sure whether this would cause issues with sorting/blending.
I suspect it would cause problems.
You would have to do the transparency single-threaded at the end. For the rest it wouldn't matter much, I think, unless you are going to do anti-aliasing.
Have you considered using threads for different purposes instead of multiple similar threads? For example, one thread could do the transformation, one could rasterize, one could do texture lookups, etc. You could also dynamically choose each thread's current job.
I tried something similar, but started to run into memory bandwidth bottlenecks and so suspended the project. Even so, if you could balance things correctly it could pay off.
I'm thinking of combining my method of splitting the poly into multiple divisions with the multi-buffering.
The problem is how to do this in a lock-less fashion, if possible.
Say I have 2 threads.
So I have 2 render buffers, which means I can render up to 2 triangles at once.
But each triangle gets chopped into multiple blocks.
So a thread gets to render part of a triangle to a certain buffer.
I need to work out which buffer to render to, i.e. some way to say that triangle index 3 needs to go to buffer 1 and triangle index 4 goes to buffer 0.
This needs to be decided when I start rendering the first block of a new poly, so as to automatically balance rendering between the 2 buffers.
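One lock-free way to hand out the work could look like the sketch below. It swaps in a simpler rule than deciding at render time: a single atomic counter hands out (triangle, block) pairs, and the buffer is derived from the triangle index with a modulo, so with 2 buffers triangle 3 goes to buffer 1 and triangle 4 goes to buffer 0. `WorkItem` and `WorkDispenser` are made-up names, and a fixed number of blocks per triangle is an assumption.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical description of one unit of work: one block of one triangle.
struct WorkItem {
    int triangle;  // index of the triangle in the frame's list
    int block;     // which division of that triangle to render
    int buffer;    // which render buffer the block goes to
    bool valid;    // false once all work has been handed out
};

class WorkDispenser {
public:
    WorkDispenser(int triangleCount, int blocksPerTriangle, int bufferCount)
        : total_(triangleCount * blocksPerTriangle),
          blocks_(blocksPerTriangle),
          buffers_(bufferCount) {
        assert(triangleCount >= 0 && blocksPerTriangle > 0 && bufferCount > 0);
    }

    // Safe to call from any number of threads; the only shared state is one
    // atomic counter, so no mutex is involved.
    WorkItem next() {
        int item = next_.fetch_add(1, std::memory_order_relaxed);
        if (item >= total_) return WorkItem{-1, -1, -1, false};
        int triangle = item / blocks_;
        // Every block of a triangle maps to the same buffer.
        return WorkItem{triangle, item % blocks_, triangle % buffers_, true};
    }

private:
    std::atomic<int> next_{0};
    int total_;
    int blocks_;
    int buffers_;
};
```

Each worker would loop on `next()` until it gets an invalid item. Note that this only picks the buffer. It does not by itself stop two threads from writing overlapping pixels into the same buffer if they are working on different triangles that share it, so that still needs handling.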
Hrmm, it seems the multi-buffering might be more hassle than it's worth. Any savings you make could be offset by having to combine the buffers each frame.
Multi-threading in general is a complex beast; multi-threading a software renderer is even more complex.
One thing I'm noticing in this thread is that people are only talking about 2 threads. Most PCs these days have 4-8 HW threads, and consoles similarly have many threads.
Let's assume a conservative 4 HW thread model. There are many processes that need to go on for rendering a single mesh:
1. Transform verts into screen space
2. Clip verts to "Tile" extents
3. Interpolate vertices across the triangle and generate raster "quads", 1 per "Tile" touched
4. Run the <shader> portion per pixel
If I were writing a software renderer I would be doing the following:
1. Set up thread(s) to do <1 + 2 + 3> in one swoop, pushing a number of "quads" into a circular thread-safe buffer (per Tile), each quad fully describing itself in terms of what to rasterize.
2. Set up a second set of threads to READ from the circular buffers and perform the rasterisation. Each thread here would represent only 1 "Tile" in the final buffer, thus there would be ZERO contention on the final buffer itself. Transparency would be handled by normal render-order methods.
This method would allow expansion to many threads, depending entirely on how many tiles you wanted to split rendering into.
Note - this is off the cuff, I haven't implemented a software renderer in almost 15 years.
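The per-tile circular buffer could be sketched as a single-producer/single-consumer ring like the one below. This assumes exactly one setup thread pushes to a given tile's ring and exactly one rasteriser thread pops from it; with several setup threads per tile it would need a different queue. `Quad` and `SpscRing` are hypothetical names, and the `Quad` fields are placeholders for whatever a real quad would carry.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical work item: the part of one triangle that falls inside one tile.
struct Quad {
    int triangle;        // index of the source triangle
    int x0, y0, x1, y1;  // pixel rectangle to rasterise, clipped to the tile
};

// Single-producer/single-consumer ring. One slot is kept empty to tell a full
// ring from an empty one, so it holds at most Capacity - 1 items.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity >= 2, "need at least one usable slot");
public:
    // Called only by the producer (setup) thread.
    bool push(const T& value) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        items_[head] = value;
        head_.store(next, std::memory_order_release);  // publish the item
        return true;
    }

    // Called only by the consumer (rasteriser) thread.
    bool pop(T& out) {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = items_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);  // free slot
        return true;
    }

private:
    std::array<T, Capacity> items_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

Since items come out in the order they went in, each tile sees its quads in submission order, which is what lets transparency fall back on normal render order. When `push` returns false the setup thread has to wait or do other work until the rasteriser catches up.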
Can you fill me in with more info on the quad/tile idea? Are you talking about dividing up the screen?
The problem I see with this idea is that most polys will end up in the same tiles on the screen when you render a model anyway. If you deferred all rasterisation until the end, once all meshes had been processed, then it might be a different story.
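A sketch of what the deferred version might look like: once every mesh has been binned, worker threads grab whole tiles from a shared atomic counter, so a thread that finishes a nearly empty tile just moves on to the next one. `processTilesDeferred` is a made-up name, `bins` is assumed to hold triangle indices per tile, and the per-tile work is a stand-in that only counts triangles instead of rasterising them.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// bins[tile] = indices of the triangles that touch that tile, for the whole frame.
// Returns, per tile, how many triangles were "rasterised" (a stand-in for real work).
std::vector<int> processTilesDeferred(const std::vector<std::vector<int>>& bins,
                                      int threadCount) {
    assert(threadCount > 0);
    std::vector<int> rasterised(bins.size(), 0);
    std::atomic<std::size_t> nextTile{0};

    auto worker = [&bins, &rasterised, &nextTile] {
        for (;;) {
            // Claim the next unprocessed tile; each tile goes to exactly one thread.
            std::size_t tile = nextTile.fetch_add(1, std::memory_order_relaxed);
            if (tile >= bins.size()) return;
            for (int tri : bins[tile]) {
                (void)tri;           // a real renderer would rasterise tri into the tile
                ++rasterised[tile];  // only this thread ever writes this element
            }
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < threadCount; ++i) pool.emplace_back(worker);
    for (std::thread& t : pool) t.join();
    return rasterised;
}
```

With the whole frame binned first, a model crowded into a few tiles only keeps a few threads busy for longer while the others drain the remaining tiles, instead of stalling everyone per mesh. Smaller tiles would spread a single model over more workers, at the cost of more binning overhead.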