Performance updates

posted in Chronicles of the Hieroglyph

Published May 02, 2007

Well, I was starting to get a little worried about my software rendering project. In a simple test scene, I was getting about 27 fps using a single-threaded version of my pipeline. This is on a 320x240 frame buffer, so 27 fps is pretty slow even to start out with.

Then I switched over to a multithreaded architecture instead. Since my machine only has one processor, I expected a small performance hit from all of the context switching. Basically, each stage of the pipeline runs continually on its own thread. Each pipeline stage has an input buffer and an output buffer, with the output buffer of one stage being the input buffer of the next. To ensure that one stage didn't muck with the next stage while both were reading from or writing to the same buffer, I implemented a mutex scheme, where the buffers could be locked.

This completely crapped out all performance - I was down to 5 fps! This was clearly not going to be usable at all. I still had not optimized the individual stages, but 5 fps was a little silly. So I started reading up on multithreading techniques, and found out that on Windows, mutexes are typically overkill for normal threading within a single process. Instead, critical sections are used for thread synchronization within a single application. Since a mutex is a kernel object and a critical section is a user mode object that only drops into the kernel when there is contention, the critical section can be 10-20 (or more) times faster than a mutex.

In my current design, there is a whole lot of locking going on, so the performance delta between the two was quite astonishing. By directly replacing the mutex with a critical section (and the appropriate function calls for acquiring/releasing the critical section) I was right back up to 20 fps. That's more like it.
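For anyone curious what the swap looks like, here is a minimal sketch, not my actual pipeline code. The `StageLock`, `ScopedStageLock` and `locked_increment` names are made up for illustration. On Windows it wraps the Win32 critical section calls; on other platforms it falls back to `std::mutex` purely so the sketch still builds there.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <mutex>
#endif

// Hypothetical wrapper: a user-mode CRITICAL_SECTION on Windows,
// std::mutex elsewhere (an assumption made only for portability).
class StageLock {
public:
#ifdef _WIN32
    StageLock()  { InitializeCriticalSection(&cs_); }
    ~StageLock() { DeleteCriticalSection(&cs_); }
    void acquire() { EnterCriticalSection(&cs_); }
    void release() { LeaveCriticalSection(&cs_); }
#else
    StageLock() = default;
    void acquire() { m_.lock(); }
    void release() { m_.unlock(); }
#endif
    StageLock(const StageLock&) = delete;
    StageLock& operator=(const StageLock&) = delete;

private:
#ifdef _WIN32
    CRITICAL_SECTION cs_;
#else
    std::mutex m_;
#endif
};

// Acquires on construction and releases on destruction, so a lock
// cannot be left held by an early return.
class ScopedStageLock {
public:
    explicit ScopedStageLock(StageLock& lock) : lock_(lock) { lock_.acquire(); }
    ~ScopedStageLock() { lock_.release(); }
    ScopedStageLock(const ScopedStageLock&) = delete;
    ScopedStageLock& operator=(const ScopedStageLock&) = delete;

private:
    StageLock& lock_;
};

// Example use: bump a shared counter while holding the lock.
int locked_increment(StageLock& lock, int& counter) {
    ScopedStageLock guard(lock);
    return ++counter;
}
```

Keeping the lock behind a small wrapper like this is what makes a "direct replacement" cheap: the calling code only ever sees acquire and release.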

Next up is to evaluate how my pipeline stage processors are currently accessing the buffers. More or less, they follow this procedure:

1. Check input buffer for sufficient data to process
2. Lock input buffer
3. Read out data elements to process
4. Unlock input buffer
5. Process data elements
6. Lock output buffer
7. Write out processed data elements
8. Unlock output buffer
9. Repeat forever...
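
The steps above can be sketched roughly as follows. This is an illustration under assumptions rather than the real stage code: `LockedBuffer` and `run_stage_once` are hypothetical names, the elements are plain `int`s, the "processing" is just a doubling, and `std::mutex` stands in for whatever lock the buffer actually uses. One deliberate difference: the sketch does the step 1 check while holding the lock, since peeking at a `std::deque` from one thread while another modifies it is not safe.

```cpp
#include <deque>
#include <mutex>

// Hypothetical buffer shared by two neighbouring stages.
struct LockedBuffer {
    std::mutex lock;
    std::deque<int> items;
};

// One pass through steps 1-8 for a stage that doubles each element.
// Returns false if the input buffer had nothing to process.
bool run_stage_once(LockedBuffer& in, LockedBuffer& out) {
    int element = 0;
    {
        std::lock_guard<std::mutex> guard(in.lock);   // step 2: lock input
        if (in.items.empty()) {                       // step 1, done under the lock
            return false;
        }
        element = in.items.front();                   // step 3: read one element
        in.items.pop_front();
    }                                                 // step 4: unlock input

    const int result = element * 2;                   // step 5: process, no lock held

    {
        std::lock_guard<std::mutex> guard(out.lock);  // step 6: lock output
        out.items.push_back(result);                  // step 7: write result
    }                                                 // step 8: unlock output
    return true;
}
```

Step 9 is then just the stage's thread calling `run_stage_once` in a loop. Note that every single element costs two lock acquisitions, one on each side.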

The current pipeline has six stages, with seven corresponding buffers (one to start with, one in between each stage, and one output buffer). The problem with this is that the six threads are constantly synchronizing with each other. I need to evaluate how to reduce the number of locks between processing elements - this is likely going to be the most difficult design aspect to tackle since I want to make sure that the pipeline scales linearly (or as close to linearly as possible) with the number of processors/cores installed on a machine.

This is likely going to require quite a bit of thought. I am considering giving each processor a smaller secondary buffer that can hold more than one processing element, which would allow it to read several elements out of the shared buffer each time it locks it. This should reduce the number of locks substantially without really increasing the memory usage too much.
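
To make the batching idea concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: `CountingBuffer`, `read_batch` and `locks_needed_to_drain` are hypothetical names, the elements are `int`s, and the lock counter exists only so the effect can be measured.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical shared buffer that also counts how often it gets locked.
struct CountingBuffer {
    std::mutex lock;
    std::deque<int> items;
    std::size_t lock_count = 0;  // only modified while the lock is held
};

// Moves up to max_batch elements into the stage's private batch
// under a single lock acquisition. Returns how many were taken.
std::size_t read_batch(CountingBuffer& in, std::vector<int>& batch,
                       std::size_t max_batch) {
    std::lock_guard<std::mutex> guard(in.lock);
    ++in.lock_count;
    std::size_t taken = 0;
    while (taken < max_batch && !in.items.empty()) {
        batch.push_back(in.items.front());
        in.items.pop_front();
        ++taken;
    }
    return taken;
}

// Fills a buffer with element_count items, drains it in batches, and
// reports how many lock acquisitions that took (including the final
// poll that finds the buffer empty).
std::size_t locks_needed_to_drain(std::size_t element_count,
                                  std::size_t max_batch) {
    CountingBuffer in;
    for (std::size_t i = 0; i < element_count; ++i) {
        in.items.push_back(static_cast<int>(i));
    }
    std::vector<int> batch;
    while (read_batch(in, batch, max_batch) > 0) {
        batch.clear();  // a real stage would process the batch here, lock released
    }
    return in.lock_count;
}
```

In this toy setup, draining 10 elements one at a time takes 11 lock acquisitions, while a batch size of 4 takes 4. The memory cost is one small private vector per stage.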

Does anyone else have a suggestion for this? It seems like this must be a common scenario when working with multiple threads - what have you guys/gals done in the past? More to come...


Comments

sirob

Sounds like the perfect place to use a lockless queue. [smile]

May 03, 2007 03:35 AM

Jason Z

I actually did check them out after your last post. The problem is that CPU and compiler read re-ordering makes them much more complicated to get working properly. I will eventually go this route, at the very least for a comparison. But for now I am looking for something that won't require me to spend a month researching how processor caches are used and when a CPU would reorder instructions.

Thanks for the link though!

May 03, 2007 06:52 AM

Deyja

I've written a software rasterizer myself, and I can tell you why your textures are coming out oddly jagged. You aren't accounting for sub-pixel accuracy.

May 04, 2007 08:20 PM

Jason Z

What exactly do you mean by sub-pixel accuracy? If you are willing to discuss it, I'd love to hear how to improve the quality of the renderings.

May 04, 2007 08:24 PM

Deyja

When you're tracing the side of the triangle, you aren't stepping exactly on pixel centers. There is an error that must be accounted for. All of your U, V and Z values must be shifted slightly to account for the difference between where the edge of the triangle is, and where the center of the pixel is.

U += U_X_GRADIENT * ( X - floor(X) );

May 05, 2007 06:14 AM

Jason Z

Author
