Performance

Well, I was starting to get a little worried about my software rendering project. In a simple test scene, I was getting about 27 fps using a single-threaded version of my pipeline. This is on a 320x240 frame buffer, so 27 fps is pretty slow even to start out with.
Then I switched over to a multithreaded architecture. Since my machine only has one processor, I expected a small performance hit from all of the context switching. Basically, each stage of the pipeline runs continually on its own thread. Each stage has an input buffer and an output buffer, with one stage's output buffer being the next stage's input buffer. To ensure that one stage didn't muck with another by reading and writing the same buffer simultaneously, I implemented a mutex scheme where the buffers could be locked.
This completely crapped out performance - I was down to 5 fps! That was clearly not going to be usable at all. I still had not optimized the individual stages, but 5 fps was a little silly. So I started reading up on multithreading techniques, and found out that mutexes are typically overkill for normal threading within a single process. Instead, critical sections are used for thread synchronization within a single application. Since a mutex is a kernel object and a critical section is a user-mode object, a critical section can be 10-20 (or more) times faster than a mutex.
In my current design, there is a whole lot of locking going on, so the performance delta between the two was quite astonishing. By directly replacing the mutex with a critical section (and the appropriate function calls for acquiring/releasing it), I was right back up to 20 fps. That's more like it.
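To make the swap concrete, here is a hedged sketch of a lockable stage buffer - hypothetical names, not my actual code. On Win32 the change is `CreateMutex`/`WaitForSingleObject`/`ReleaseMutex` to `InitializeCriticalSection`/`EnterCriticalSection`/`LeaveCriticalSection`; the portable sketch below uses `std::mutex`, which behaves like the critical section in that the uncontended path stays in user mode:

```cpp
// Hypothetical StageBuffer sketch: a shared buffer guarded by a user-mode
// lock (previously a kernel mutex). Element type int is a placeholder.
#include <mutex>
#include <vector>

struct StageBuffer {
    std::mutex lock;            // user-mode lock, analogous to a CRITICAL_SECTION
    std::vector<int> elements;  // placeholder for real pipeline elements

    void write(int e) {
        std::lock_guard<std::mutex> guard(lock);  // ~EnterCriticalSection
        elements.push_back(e);
    }                                             // ~LeaveCriticalSection at scope exit

    bool read(int& out) {
        std::lock_guard<std::mutex> guard(lock);
        if (elements.empty()) return false;
        out = elements.back();
        elements.pop_back();
        return true;
    }
};
```

The key point is that acquiring an uncontended user-mode lock is a handful of instructions, while a kernel mutex pays for a system call every time.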
Next up is to evaluate how my processors are currently accessing the buffers. More or less, they follow this procedure:
1. Check input buffer for sufficient data to process
2. Lock input buffer
3. Read out data elements to process
4. Unlock input buffer
5. Process data elements
6. Lock output buffer
7. Write out processed data elements
8. Unlock output buffer
9. Repeat forever...
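The loop above can be sketched roughly like this - again a hypothetical illustration (names like `run_stage_once` are mine, not from the project), showing where the locks are held and, just as importantly, where they are not:

```cpp
// One iteration of steps 1-8 for a single pipeline stage. Element type int
// and the function-pointer process step are placeholders.
#include <deque>
#include <mutex>

struct Buffer {
    std::mutex lock;
    std::deque<int> items;  // placeholder element type
};

// Returns false when the input buffer had nothing to process (step 1).
bool run_stage_once(Buffer& in, Buffer& out, int (*process)(int)) {
    int item;
    {   // steps 2-4: lock input, read one element, unlock
        std::lock_guard<std::mutex> g(in.lock);
        if (in.items.empty()) return false;
        item = in.items.front();
        in.items.pop_front();
    }
    int result = process(item);  // step 5: no locks held during processing
    {   // steps 6-8: lock output, write result, unlock
        std::lock_guard<std::mutex> g(out.lock);
        out.items.push_back(result);
    }
    return true;
}
```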
The current pipeline has six stages, with seven corresponding buffers (one to start with, one between each pair of stages, and one output buffer). The problem with this is that the six threads are constantly synchronizing with each other. I need to evaluate how to reduce the number of locks between processing elements - this is likely going to be the most difficult design aspect to tackle, since I want to make sure that the pipeline scales linearly (or as close to linearly as possible) with the number of processors/cores installed in a machine.
This is likely going to require quite a bit of thought. I am considering giving each processor a smaller secondary buffer that can hold more than one processing element - basically allowing it to read several elements out of the shared buffer each time it does lock it. This should reduce the number of locks substantially without really increasing the memory usage too much.
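A hedged sketch of that batching idea (the names `take_batch` and `batch_size` are illustrative, not from my code): lock the shared buffer once, drain up to N elements into a small local buffer, unlock, and then process the whole batch without touching the lock again.

```cpp
// Hypothetical batched-read sketch: one lock acquisition pulls up to
// batch_size elements instead of one. Element type int is a placeholder.
#include <deque>
#include <mutex>
#include <vector>

struct SharedBuffer {
    std::mutex lock;
    std::deque<int> items;  // placeholder element type
};

std::vector<int> take_batch(SharedBuffer& in, std::size_t batch_size) {
    std::vector<int> local;
    local.reserve(batch_size);
    std::lock_guard<std::mutex> g(in.lock);  // single lock for the whole batch
    while (!in.items.empty() && local.size() < batch_size) {
        local.push_back(in.items.front());
        in.items.pop_front();
    }
    return local;  // caller processes these with no lock held
}
```

With a batch size of, say, 16, a stage hits the lock roughly one sixteenth as often, at the cost of `batch_size` extra elements of storage per stage.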
Does anyone else have a suggestion for this? It seems like this must be a common scenario when working with multiple threads - what have you guys/gals done in the past? More to come...