
multithreading software renderer

This topic is 3226 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Currently I have a single-threaded software renderer. I want to make it multithreaded, so I whipped up some test code that draws a single triangle covering half the screen. I divide the triangle into as many blocks as I have threads, and each thread renders one block. This gives me a more than 50% improvement. However, I'm not sure how this would work in the real world, especially if there are lots of small triangles. I guess I could dynamically adjust the number of divisions based on the triangle's size, but then I'd have threads sitting idle. I've read that Larrabee puts triangles into bins based on screen location so that multiple triangles can be drawn at once. Should I try this, or some other method? Ideas, thoughts?
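A minimal sketch of the split described above, assuming a scanline rasterizer: the triangle's rows are divided into contiguous blocks and each block is filled by its own thread. The names `RowRange`, `splitRows`, `fillRows` and `drawTriangleThreaded` are hypothetical, and the "triangle" is a hard-coded half-screen shape rather than the poster's actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

struct RowRange { int begin; int end; };  // half-open scanline range [begin, end)

// Split scanlines [yMin, yMax) into at most `parts` contiguous blocks of
// near-equal height. Empty blocks are dropped, so a small triangle simply
// produces fewer blocks than there are threads.
std::vector<RowRange> splitRows(int yMin, int yMax, int parts) {
    std::vector<RowRange> out;
    const long long rows = std::max(0, yMax - yMin);
    parts = std::max(1, parts);
    for (int i = 0; i < parts; ++i) {
        const int b = yMin + static_cast<int>(rows * i / parts);
        const int e = yMin + static_cast<int>(rows * (i + 1) / parts);
        if (b < e) out.push_back({b, e});
    }
    return out;
}

// Hypothetical span fill for a triangle covering the lower-left half of the
// screen: on row y it covers x in [0, (y + 1) * width / height).
void fillRows(std::vector<std::uint32_t>& fb, int width, int height,
              RowRange r, std::uint32_t color) {
    for (int y = r.begin; y < r.end; ++y) {
        const int xEnd = (y + 1) * width / height;
        for (int x = 0; x < xEnd; ++x)
            fb[static_cast<std::size_t>(y) * width + x] = color;
    }
}

// One thread per block. Blocks cover disjoint rows, so no locking is needed.
void drawTriangleThreaded(std::vector<std::uint32_t>& fb, int width, int height,
                          int numThreads, std::uint32_t color) {
    std::vector<std::thread> workers;
    for (RowRange r : splitRows(0, height, numThreads))
        workers.emplace_back(fillRows, std::ref(fb), width, height, r, color);
    for (auto& t : workers) t.join();
}
```

Spawning threads per triangle is only reasonable for a test like this one. With many small triangles the thread start-up cost would likely dominate, which is the concern raised in the post.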



I've implemented dual-CPU software rendering before.

I did it by splitting the screen down the middle and having 2 clipped scenes.
Then I rendered half the screen on the first CPU and half on the other.

I did get framerate improvements, but it still wasn't as fast as rendering on the GPU, so I abandoned the project.

It was the first time I had ever handled threads. The way you've proposed I could probably attempt now, but back then I probably couldn't, so the way I did it was probably easier thread-programming-wise.

But I think what you've said should work too.

[EDIT] It gives you a gnarly clipping line down the centre of your screen :) [/EDIT]



(I am in no way a software rendering pro, just someone who had the same idea during a school project.)

There can be multiple problems if you split the screen in half:
- Lots of small triangles give a lot of useless clipping tests (a triangle that isn't in the frustum gets tested twice).
- Lots of small triangles make for code that doesn't thread well (a lot of memory reads for small areas).
- Lots of big triangles give you clipping problems: you have to clip them twice, possibly doubling the number of output vertices.

You could consider having multiple threads, each with its own render buffer, and merging the buffers later. You would divide the triangles over the two threads, render them to their own buffers, then merge the buffers (which could also be done multithreaded). This would be almost lockless, with just two syncs: both threads have to be done rendering, then both threads have to be done merging. I don't know what the performance gain would be in that case; it was just something I came up with but never got the time to implement.
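The merge step could look roughly like the sketch below. It assumes each private buffer stores a depth value next to every colour (smaller meaning nearer) and that all geometry is opaque; the post does not specify either, and `RenderBuffer` and `mergeNearest` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One private render target per thread: colour plus depth (smaller = nearer).
struct RenderBuffer {
    std::vector<std::uint32_t> color;
    std::vector<float> depth;
    explicit RenderBuffer(std::size_t pixels, float farDepth = 1.0f)
        : color(pixels, 0), depth(pixels, farDepth) {}
};

// Merge `src` into `dst` for pixels [begin, end), keeping the nearer sample.
// Different threads can merge disjoint [begin, end) ranges without locks.
void mergeNearest(RenderBuffer& dst, const RenderBuffer& src,
                  std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        if (src.depth[i] < dst.depth[i]) {
            dst.depth[i] = src.depth[i];
            dst.color[i] = src.color[i];
        }
    }
}
```

A depth compare like this is order-independent only for opaque pixels. Blended geometry depends on draw order, so it would need separate handling.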

--edit--

Also worth a read:
http://www.devmaster.net/forums/showthread.php?t=1884
It could be used for implementing this with threads.



I've sort of thought of doing a multi-buffer thing where the buffers get merged, but I'm not sure if this would cause issues with sorting/blending.
I suspect it would cause problems.



You would have to do the transparency single-threaded at the end. For the rest it wouldn't matter much, I think, unless you are going to do anti-aliasing.



Have you considered using threads for different purposes instead of multiple similar threads? For example, one thread could do the transformation, one could rasterize, one could do texture lookup, etc. You could also dynamically choose each thread's current job.

I tried something similar, but started to run into memory bandwidth bottlenecks and so suspended the project. Even so, if you could balance things correctly it could pay off.



I'm thinking of combining my method of splitting the poly into multiple divisions with the multi-buffering.

The problem is how to do this in a lock-less fashion, if possible.

Say I have 2 threads.

So I have 2 render buffers, and I can render up to 2 triangles at once.
But each triangle gets chopped into multiple blocks.

So a thread gets to render part of a triangle to a certain buffer.

I need to work out how to decide which buffer to render to, i.e. some way to say that triangle index 3 needs to go to buffer 1 and triangle index 4 goes to buffer 0.

This needs to be decided when I start rendering the first block of a new poly, so as to automatically balance rendering between the 2 buffers.
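One possible lock-free answer is to tie each buffer to a thread rather than to a triangle, and let every thread claim the next unprocessed triangle from a shared atomic counter. This is an assumption made here for illustration, not something the poster settled on. The function name `distributeTriangles` is hypothetical, and the sketch only records which worker claimed which index instead of rasterizing.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker owns buffer `id` and pulls the next triangle index from a shared
// atomic counter, so no triangle-to-buffer table and no mutex are needed.
// Returns, per worker, the triangle indices that worker ended up with.
std::vector<std::vector<std::size_t>> distributeTriangles(std::size_t triangleCount,
                                                          unsigned workerCount) {
    std::atomic<std::size_t> next{0};
    std::vector<std::vector<std::size_t>> claimed(workerCount);
    std::vector<std::thread> workers;
    for (unsigned id = 0; id < workerCount; ++id) {
        workers.emplace_back([&next, &claimed, triangleCount, id] {
            for (;;) {
                // fetch_add hands out every index exactly once.
                const std::size_t tri = next.fetch_add(1, std::memory_order_relaxed);
                if (tri >= triangleCount) break;
                // A real renderer would rasterize triangle `tri` into buffer `id` here.
                claimed[id].push_back(tri);
            }
        });
    }
    for (auto& t : workers) t.join();
    return claimed;
}
```

Because a thread that finishes early simply claims more work, the load balances itself without deciding up front which triangle goes to which buffer.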



Hmm, with the multi-buffering it seems it might be more hassle than it's worth: any savings you make could be offset by having to combine the buffers each frame.



Multithreading in general is a complex beast; multithreading a software renderer is even more complex.

One thing I'm noticing in this thread is that people are only talking about 2 threads. Most PCs these days have 4-8 hardware threads, and consoles similarly have many threads.

Let's assume a conservative model with 4 hardware threads. There are many processes that need to go on for rendering a single mesh:

1. Transform verts into screen space.
2. Clip verts to "Tile" extents.
3. Interpolate the vertices across the triangle and generate raster "quads", 1 per "Tile" touched.
4. Run the <shader> portion per pixel.

If I were writing a software renderer I would do the following:

1. Set up thread(s) to do <1 + 2 + 3> in one swoop, pushing a number of "quads" into a thread-safe circular buffer (one per Tile), each quad fully describing itself in terms of what to rasterize.
2. Set up a second set of threads to READ from the circular buffers and perform the rasterisation. Each thread here would represent only 1 "Tile" in the final buffer, so there would be ZERO contention on the final buffer itself. Transparency would be handled by normal render-order methods.

This method would allow expansion to many threads, depending entirely on how many tiles you wanted to split rendering into.
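The tile assignment in the first stage could be sketched as below. This version bins whole triangles by their screen-space bounding box, which is a simplification assumed here (the post pushes quads, and a bounding box can touch tiles the triangle itself misses). `ScreenTri` and `binTriangles` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct ScreenTri { float x[3]; float y[3]; };  // vertices already in screen space

// Append each triangle's index to every tile its bounding box overlaps.
// Tiles are tileSize x tileSize pixels and stored row-major in the result,
// so one thread can later own one bin (and one tile) with no contention.
std::vector<std::vector<std::size_t>> binTriangles(const std::vector<ScreenTri>& tris,
                                                   int width, int height, int tileSize) {
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<std::size_t>> bins(static_cast<std::size_t>(tilesX) * tilesY);
    for (std::size_t i = 0; i < tris.size(); ++i) {
        const ScreenTri& t = tris[i];
        const float minX = std::min({t.x[0], t.x[1], t.x[2]});
        const float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        const float minY = std::min({t.y[0], t.y[1], t.y[2]});
        const float maxY = std::max({t.y[0], t.y[1], t.y[2]});
        // Skip triangles whose bounding box is entirely off screen.
        if (maxX < 0.0f || maxY < 0.0f || minX >= width || minY >= height) continue;
        // Clamp to the screen before converting to tile coordinates.
        const int tx0 = static_cast<int>(std::max(minX, 0.0f)) / tileSize;
        const int ty0 = static_cast<int>(std::max(minY, 0.0f)) / tileSize;
        const int tx1 = static_cast<int>(std::min(maxX, width - 1.0f)) / tileSize;
        const int ty1 = static_cast<int>(std::min(maxY, height - 1.0f)) / tileSize;
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[static_cast<std::size_t>(ty) * tilesX + tx].push_back(i);
    }
    return bins;
}
```

Since indices are appended in submission order, each bin preserves the original draw order, which is what the render-order handling of transparency relies on.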

Note - this is off the cuff, I haven't implemented a software renderer in almost 15 years.



Can you fill me in with more info on the quad/tile idea? Are you talking about dividing up the screen?

The problem I see with this idea is that most polys will end up in the same tiles on the screen when you render a model anyway. If you deferred all rasterisation until the end, once all meshes had been processed, then it might be a different story.



A "quad" would simply be an area that needs rasterising: a 2x2/4x4 area of pixels on screen, with specific details regarding what and where to render within itself. Most quads will be a full 16-pixel draw.

A "tile" would simply be a split-up area of the screen. It is important that you do NOT clip to this split but simply let the quads handle that.

The deferred nature of this pipeline is obvious: you don't render anything until the triangles are all processed into quads.

So at the end of drawing a single model you have a list of quads per tile. You define the tiles in terms of quads (so they align) and then fire off each processor to process its quads.

While those processes are running, generate the quads for the next draw.
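One way to picture such a quad is a 4x4 pixel block carrying a 16-bit coverage mask, built with edge functions evaluated at pixel centres. The layout below is an assumption for illustration only; the post does not define the fields. `Vec2`, `Quad`, `edge` and `makeQuad` are hypothetical names, and the sketch omits a top-left fill rule, so pixels exactly on an edge count as covered.

```cpp
#include <cstdint>

struct Vec2 { float x, y; };

// Edge function: (b - a) x (p - a). With the vertex order used below,
// a point inside the triangle gives a value >= 0 for all three edges.
inline float edge(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Hypothetical work item: one 4x4 pixel block plus a bit per covered pixel.
struct Quad {
    int x0, y0;              // top-left pixel of the block
    std::uint16_t coverage;  // bit (y * 4 + x) set if pixel (x0 + x, y0 + y) is covered
    std::uint32_t triangle;  // index of the source triangle
};

// Build the quad for the block at (x0, y0) against triangle v0, v1, v2.
// Assumes a winding for which interior points make every edge() non-negative.
Quad makeQuad(int x0, int y0, Vec2 v0, Vec2 v1, Vec2 v2, std::uint32_t tri) {
    Quad q{x0, y0, 0, tri};
    for (int y = 0; y < 4; ++y) {
        for (int x = 0; x < 4; ++x) {
            const Vec2 p{x0 + x + 0.5f, y0 + y + 0.5f};  // pixel centre
            if (edge(v0, v1, p) >= 0.0f && edge(v1, v2, p) >= 0.0f &&
                edge(v2, v0, p) >= 0.0f) {
                q.coverage = static_cast<std::uint16_t>(q.coverage | (1u << (y * 4 + x)));
            }
        }
    }
    return q;
}
```

A block in the interior of a large triangle ends up with coverage `0xFFFF`, the "full 16-pixel draw" case, and only blocks along the edges carry partial masks. That is why no clipping against tile borders is needed.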



It seems like only doing 4x4 blocks would result in a lot of wasted time generating this data, moving it around and storing it?

I'm doing old-school rasterisation where I do a whole line of a triangle at a time. However, I did read up on another method that used SSE or something to do blocks at a time, but the article I read didn't explain it well enough for me to understand. Is this perhaps the method you're talking about?



I would advise you to profile in real detail, in advance, what your main load is, whether it's transform, triangle setup, fillrate or something else. This should be the basis for choosing your multithreading strategy.

As an example, I was rendering shadow buffers at a relatively low resolution, with no need for a lot of pixel work or expensive triangle setup with interpolators. So my way to parallelize it was to just set up n threads rasterizing into n buffers, each of them grabbing draw calls from a simple command buffer. At the end of drawing, I merged all buffers into one (also with n threads); it was a simple vectorized Min(..).
With 4 cores it was 3.9 times faster: quite good scaling for quite little work, and very scalable for future hardware.
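As a rough sanity check on the 3.9x figure, Amdahl's law can be inverted to estimate how much of the frame time ran in parallel. This is only a back-of-the-envelope sketch: it assumes the parallel part scales ideally and ignores cache and bandwidth effects, and the symbols S (speedup), n (cores) and p (parallel fraction) are introduced here, not taken from the post.

```latex
S = \frac{1}{(1 - p) + \frac{p}{n}}
\quad\Longrightarrow\quad
p = \frac{1 - \frac{1}{S}}{1 - \frac{1}{n}}
  = \frac{1 - \frac{1}{3.9}}{1 - \frac{1}{4}}
  \approx \frac{0.7436}{0.75}
  \approx 0.991
```

Under those assumptions, less than 1% of the work (synchronisation plus anything left serial) was not parallelized.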

It's not the way to do it in all situations, but that's why you should first figure out what bottlenecks you really have. Writing a complex deferred, tile-based rasterizer without estimating the result (based on profiling) could be a big disappointment.

Just a friendly piece of advice ;)



Quote:
Original post by Jason Z
Have you considered using threads for different purposes instead of multiple similar threads? For example, one thread could do the transformation, one could rasterize, one could do texture lookup, etc. You could also dynamically choose each thread's current job.

I tried something similar, but started to run into memory bandwidth bottlenecks and so suspended the project. Even so, if you could balance things correctly it could pay off.


I think it's very likely that it wasn't really memory bandwidth there, but a false sharing issue (i.e., one thread writes a lot to a certain area while another thread reads data on the same cache lines at the same time, which means those lines keep getting invalidated and re-fetched).
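A common mitigation, sketched here under the assumption of 64-byte cache lines (typical on x86 but not guaranteed), is to pad per-thread data so that no two threads write to the same line. `kCacheLine`, `PaddedCounter` and `countInParallel` are hypothetical names, and the counter loop merely stands in for real per-thread renderer state.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache line size in bytes.
constexpr std::size_t kCacheLine = 64;

// alignas pads the struct to a full cache line, so neighbouring counters in an
// array sit 64 bytes apart and never share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

// Every thread increments only its own padded counter; totals are summed
// after all threads have joined.
std::uint64_t countInParallel(unsigned threads, std::uint64_t perThread) {
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < threads; ++i) {
        workers.emplace_back([&counters, i, perThread] {
            for (std::uint64_t n = 0; n < perThread; ++n) ++counters[i].value;
        });
    }
    for (auto& t : workers) t.join();
    std::uint64_t total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Swapping `PaddedCounter` for a plain `std::uint64_t` array and timing both versions is a quick way to test whether a slowdown comes from false sharing or from genuine bandwidth limits.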




Quote:
Original post by Krypt0n
I would advise you to profile in real detail, in advance, what your main load is, whether it's transform, triangle setup, fillrate or something else. This should be the basis for choosing your multithreading strategy.


Sound advice in any situation.

Quote:
Original post by Krypt0n
Writing a complex deferred, tile-based rasterizer without estimating the result (based on profiling) could be a big disappointment.
Just a friendly piece of advice ;)


You are correct. The only issue I have is that the system I detailed is not complex at all by normal rendering standards. It's incredibly simple with very few moving parts, i.e. robust as hell and massively open to parallelism.



Quote:
Original post by AndyFirth
Quote:
Original post by Krypt0n
I would advise you to profile in real detail, in advance, what your main load is, whether it's transform, triangle setup, fillrate or something else. This should be the basis for choosing your multithreading strategy.


Sound advice in any situation.
Sadly, a lot of people ignore it.

Quote:
Original post by AndyFirth
Quote:
Original post by Krypt0n
Writing a complex deferred, tile-based rasterizer without estimating the result (based on profiling) could be a big disappointment.
Just a friendly piece of advice ;)


You are correct. The only issue I have is that the system I detailed is not complex at all by normal rendering standards
Don't take this as an offense, but maybe that's because you haven't implemented it yet.
Quote:
It's incredibly simple with very few moving parts, i.e. robust as hell and massively open to parallelism.
If it were that way, we would have a lot of those in hardware, as GPUs are made for parallelism.
It might be a big win in some cases and a big slowdown in others. Having one thread prepare all the geometry and using the other threads/cores for fragments assumes that the bottleneck can't be the vertex transform or triangle setup, so this is not as robust and open to parallelism as you claim. Implementing multithreaded vertex transform, triangle setup and sorting into tiles is far less simple if you want it to be a win, e.g. handling the memory for the tiles:
- How? Fixed allocation? Dynamic? Some memory pool?
- How to handle multithreaded allocations? A pool per thread? A pool per tile group?
- How to handle 'out of memory'? Render all incoming geometry the simple forward way? Flush existing tiles?

And even if you overcome such a beast, in the end you have an architecture designed for massive fragment-shader work, one that scales well with a computation load beyond what an 8-core can handle. It is probably not a win for current realtime needs on the CPU, where you can maybe get away far better with a dedicated solution.

People shouldn't see the Larrabee software architecture as a statement of how to do it best. It's more likely designed to avoid common problems that AMD and NVIDIA handle with dedicated hardware, like load balancing between different kinds of shader work (VS, GS, FS) on the same ALUs, scheduling of threads with respect to geometry-setup-unit, texture-unit and ROP load, etc.

And that's why I suggest profiling first. Without knowing the bottlenecks and usage, you won't get satisfying results. Intel did tons of profiling of various (software) architectures before they decided on the current solution. They know the fight will be decided at insane resolutions like 2560*1920 with 8x AA as well as at 800*600; that's why they made dedicated hardware with tons of registers and hyperthreading and did not just take an 8-core or 16-core CPU. You can expect that the software architecture won't scale as well on CPUs.

Again, I'm not bashing you. I just suggest being careful: it's not as easy as it seems and might not give as much as one might expect, especially without detailed profiling numbers.



Quote:
Original post by Promit
Here's the latest presentation on the Larrabee architecture. It has some info on the binning and subdivision they use.


Let me add some Dr. Dobb's articles (first on LRB, second on CUDA):



Quote:
Original post by Krypt0n
good stuff.


Lol, I don't get offended by tech discussions at all, so feel free to say whatever. More inventive methods come out of healthy discussion than out of introverted genius.

All your points are perfectly valid, and until I implement it (which I will likely never get time to; too busy writing work's console engine and putting out another game) I won't know the solutions.

As I explained earlier though, I have ZERO experience with PCs in this day and age. The last time I did any serious graphics stuff on PC was DirectX 5; the last time I did software rendering was 1998. I've been working on consoles for a VERY long time, and were I writing this on a console (say PS3) I would most definitely be able to solve all the issues you mention relatively easily, and the renderer I detailed would work perfectly fine. I know this because it's been done already and my input was used in that design.

Re: Larrabee, I've not talked to the guys over there in a while now, but the project seems interesting. I should drop them a line sometime and get more info. Too busy now though; we just hit alpha.

Anyway, good luck. This brings back memories of writing perspective-correct texture mapping on a Pentium 1 with 6 cycles per pixel. Fun times.



One really simple and functional suggestion: chop up your image into horizontal stripes instead of vertical ones. A frame buffer is laid out in memory line by line, which means that, depending on where your vertical split falls, you will have up to 1200 cache lines shared between the halves (one per scanline), versus at most 1 with a horizontal split. Also draw with horizontal lines from left to right (for write-combining and prefetching) and top to bottom (again, cache line order).
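The cache line counts can be checked by brute force. The sketch below assumes a tightly packed, row-major frame buffer that starts on a cache line boundary, 64-byte cache lines, and a pixel size that divides 64; the 1600x1200 resolution in the usage note is also an assumption, chosen to match the 1200 figure. `Split` and `sharedCacheLines` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLineBytes = 64;  // assumed cache line size

enum class Split { Vertical, Horizontal };

// Count cache lines that hold pixels from both halves of the frame buffer.
// `at` is the first column (Vertical) or first row (Horizontal) of the
// second half. Assumes bytesPerPixel divides kLineBytes, so no pixel
// straddles two cache lines.
std::size_t sharedCacheLines(std::size_t width, std::size_t height,
                             std::size_t bytesPerPixel, Split split, std::size_t at) {
    const std::size_t totalBytes = width * height * bytesPerPixel;
    // One owner mask per cache line: bit 0 = first half, bit 1 = second half.
    std::vector<std::uint8_t> owners((totalBytes + kLineBytes - 1) / kLineBytes, 0);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const bool second = (split == Split::Vertical) ? (x >= at) : (y >= at);
            const std::size_t line = (y * width + x) * bytesPerPixel / kLineBytes;
            owners[line] = static_cast<std::uint8_t>(owners[line] | (second ? 2 : 1));
        }
    }
    std::size_t shared = 0;
    for (std::uint8_t o : owners)
        if (o == 3) ++shared;
    return shared;
}
```

For a 1600x1200 buffer with 4 bytes per pixel, `sharedCacheLines(1600, 1200, 4, Split::Vertical, 801)` counts one shared line per scanline, 1200 in total, while a horizontal split at row 600 shares none. With an unaligned row size such as a width of 1601, the horizontal split shares exactly one line.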


