
## Recommended Posts

Okay, I've written a software renderer. I have a single draw function which calls 'rasterize' with the data to be rasterized on the CPU. I thought it would be cool to make this support multithreading, so my draw function tells 2 threads (as I have a dual-core CPU) to resume and draw half the data each. Not multithreaded, on 1 core, I was getting 30fps. Now that I've made it multithreaded, I get 20fps with both cores at nearly 100% usage according to Windows Task Manager. Any ideas how I can work out why I'm getting worse performance?

##### Share on other sites
Is the data properly split up, so the threads will not have to synchronize/lock it (or at least not too often)?
I am mostly thinking of data which relies on shared resources (and which therefore has to be locked in order to be used).

##### Share on other sites
Are they drawing to the same buffer? In that case the synchronization might make the program completely serial, since only one thread can work at a time, and you get the synchronization overhead on top. But without some code or more detailed description of the system, we can only guess.

##### Share on other sites
Adding to the issue of synchronization: be aware that there can be implicit hardware synchronization when writing to (and reading from) the same cache line from different threads. That is, any non-shared cache level will have to re-fetch the line to see the other thread's writes.
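To make the cache-line point concrete, here's a minimal sketch of the padding idea (`PaddedCounter` and `sum_in_parallel` are illustrative names for a toy per-thread sum, not code from the renderer; the 64-byte line size is an assumption about the target CPU):

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Per-thread accumulator, padded so that two threads never write to the
// same 64-byte cache line (avoids false sharing).
struct alignas(64) PaddedCounter {
    std::uint64_t value = 0;
};

std::uint64_t sum_in_parallel(const std::vector<int>& data, int num_threads) {
    std::vector<PaddedCounter> partial(num_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += num_threads)
                partial[t].value += data[i];  // each thread writes only its own line
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (auto& p : partial) total += p.value;
    return total;
}
```

Without the `alignas(64)`, adjacent counters can land on one cache line and the threads ping-pong that line between cores even though they never touch the same variable.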

##### Share on other sites
In general you cannot add synchronisation to an algorithm that was designed to run on a single thread and expect it to run faster on multiple threads. More often you will need to design a new algorithm.

Look to see if what you're doing can be broken down into task-parallel or data-parallel chunks and design your algorithm around that. And remember even if each thread does more than half the work of the original thread, you can still expect a speed up -- even if it's not 2x your initial speed, it has a better chance of scaling as you throw more cores at it.

Another tip is not to design your algorithms for a specific number of threads. Instead, try to design for an arbitrary number. Then you will be able to reap the benefits of a machine with more cores (just today I see that Apple Power Macs now come with 8 cores as standard -- and 12MB of L2 cache per core!).

By designing for two cores you're probably thinking along the lines of "this thread does this, then this thread has to wait for this to happen, then it can go for a while, then the other one can start again...". Inherent in this is a lot of special-case locking (and hence waiting), which is very hard to make fast, let alone correct.

##### Share on other sites
Here is a very simplified algorithm:

```
for each poly
{
    for each vert
    {
        run vertex shader
    }
    calculate top of poly to begin rendering from (call this t)
    calculate bottom of poly to end rendering at (call this b)
    for t to b
    {
        calculate left edge to begin rendering from (call this l)
        calculate right edge to end rendering at (call this r)
        for l to r
        {
            run pixel shader
            write resulting pixel to texture (memory)
        }
    }
}
```

What I'm doing is, if I'm given say 1000 polys to render, to split this into 2 batches of 500 and call the above algorithm on each batch from its own thread.
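That kind of split might look something like this sketch (`Poly`, `rasterize`, and `draw` are stand-ins for the real renderer's types and functions; `hardware_concurrency` picks the batch count instead of hard-coding two threads):

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

struct Poly { /* vertex indices, attributes, etc. */ };

// Stub standing in for the single-threaded rasterizer; here it just
// counts how many polys it was handed.
std::atomic<std::size_t> polys_drawn{0};
void rasterize(const Poly* begin, const Poly* end) {
    polys_drawn += static_cast<std::size_t>(end - begin);
}

// Split the poly list into one contiguous batch per hardware thread and
// rasterize the batches concurrently.
void draw(const std::vector<Poly>& polys) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t per = (polys.size() + n - 1) / n;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t lo = std::min<std::size_t>(t * per, polys.size());
        std::size_t hi = std::min(lo + per, polys.size());
        if (lo == hi) continue;
        workers.emplace_back([&polys, lo, hi] {
            rasterize(polys.data() + lo, polys.data() + hi);
        });
    }
    for (auto& w : workers) w.join();
}
```

Note the batches are read-only and contiguous, so the threads need no locking on the input side; the contention the other replies describe comes from the shared output buffer.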

##### Share on other sites
The steps:
'run pixel shader'
'write resulting pixel to texture (memory)'

require a lock on the texture data, don't they?

##### Share on other sites
One issue you could be having is that you haven't split the work up very well. For example if most of the triangles for one thread are back facing or small then it will finish much earlier than the other one.

You'll also have the problem that both threads are writing to the same frame buffer, and you'll get cache synchronization slowdown because of that.

You might also get a slowdown from rendering order changes, due to the z-buffer eliminating less of the work.

I'd also split the vertex and pixel processing into separate loops, as verts are likely to be shared between triangles.

A design something like this should go quicker, and work well with any reasonable number of cores.

Step1:

Split vertex list into N pieces, and let one thread process each section. You could split into more sections if the vertex shader complexity varies significantly, and assign work dynamically.

This will process vertices which aren't used by triangles, but unless there's a lot of that it should go significantly quicker.

Step2:

Split the screen up into rectangles (either horizontal, vertical or both). Make sure the splits are done on cache line boundaries. Give each thread one rectangle to work on to start with. When a thread runs out of work give it the next rectangle that needs rendering, until you run out.
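Step 2's "give each thread the next rectangle" idea can be sketched with an atomic work index (`Tile`, `render_tiles`, and the callback are illustrative names, not the renderer's real API; the real callback would run the inner scanline/pixel loop for that rectangle):

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct Tile { int x0, y0, x1, y1; };  // screen-space rectangle

// Demo render callback (assumption: the real one rasterizes the polys
// that overlap this tile); here it just counts completed tiles.
std::atomic<int> tiles_done{0};
void count_tile(const Tile&) { ++tiles_done; }

// Threads grab the next unrendered tile with a single atomic increment,
// so faster threads naturally pick up the slack from slower ones.
void render_tiles(const std::vector<Tile>& tiles, unsigned num_threads,
                  void (*render_tile)(const Tile&)) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (std::size_t i = next++; i < tiles.size(); i = next++)
                render_tile(tiles[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```

Because each tile is owned by exactly one thread at a time and tiles don't overlap, no locking is needed on the frame buffer itself -- only the cheap atomic fetch of the work index.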

##### Share on other sites
Quote:
 Any ideas how I can work out why i'm getting worse performance?

I've noticed that multithreading can make performance worse if you compile in Debug mode.

##### Share on other sites
Quote:
 Original post by Nitage
 The steps:
 'run pixel shader'
 'write resulting pixel to texture (memory)'
 require a lock on the texture data, don't they?

Nope, my *texture* is just a chunk of memory.
I've also tried this in a release build and the problem is the same as in debug.

My code actually supports an arbitrary number of cores; it would be interesting to test it out on a quad core and see if there is any improvement.

Adam, your step 2 sounds, hmm, difficult to implement. I assume that for each rectangle you need to iterate over all the polys and see which ones need to be rendered into it?
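One way to avoid scanning every poly once per rectangle is to walk the poly list a single time and bin each poly into every tile its screen-space bounding box overlaps. A sketch (all names, the grid layout, and the use of precomputed bounding boxes are assumptions for illustration, not necessarily what Adam had in mind):

```cpp
#include <algorithm>
#include <vector>

struct Poly { int min_x, min_y, max_x, max_y; };  // screen-space bounds

// Returns one list of poly indices per tile; tiles are laid out row-major
// in a (screen_w / tile_w) x (screen_h / tile_h) grid, rounded up.
std::vector<std::vector<int>> bin_polys(const std::vector<Poly>& polys,
                                        int screen_w, int screen_h,
                                        int tile_w, int tile_h) {
    int cols = (screen_w + tile_w - 1) / tile_w;
    int rows = (screen_h + tile_h - 1) / tile_h;
    std::vector<std::vector<int>> bins(cols * rows);
    for (int p = 0; p < (int)polys.size(); ++p) {
        // Clamp the poly's bounding box to the tile grid.
        int tx0 = std::max(0, polys[p].min_x / tile_w);
        int ty0 = std::max(0, polys[p].min_y / tile_h);
        int tx1 = std::min(cols - 1, polys[p].max_x / tile_w);
        int ty1 = std::min(rows - 1, polys[p].max_y / tile_h);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * cols + tx].push_back(p);
    }
    return bins;
}
```

Each thread rendering a tile then only touches `bins[tile_index]`, so the per-tile cost is proportional to the polys that can actually cover it rather than the whole scene.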