Software Rendering with SSE

1 comment, last by NickW 16 years, 1 month ago
So I'm working on this software renderer and I've got it SSE-optimized for the most part, but there are a few problems.

First of all, it's not very fast. On my Core 2 Duo, I'm getting about 90-100 fps just rendering a simple teapot (about 6300 polys). I was expecting the SSE optimizations to give much better performance than that.

Second, there are some weird artifacts in the output: thin one-pixel lines that are either darker or lighter than the surrounding pixels. They seem to be near the edges of the polys, but I can't figure out what's causing them.

To rasterize the polys, I'm basically splitting each poly in half and drawing the top half and bottom half independently (it's easier to intersect the scanline with the triangle if you only have to worry about 2 of the 3 edges). It seems like a decent algorithm, but if there is a more efficient way to do it, I would be very interested.

If anyone's interested, the code can be downloaded here: http://nick.weihs.googlepages.com/rasterizer.zip
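For readers following along, here is a minimal scalar sketch of the top-half/bottom-half approach described above. It is not the code from the linked archive: `Vec2`, `fillHalf`, `rasterizeTriangle` and the `plot` callback are hypothetical names, and the rounding rule (a pixel is covered when its centre lies inside the span, with half-open intervals) is an assumption. Inconsistent rounding of span endpoints is one common source of one-pixel seams between adjacent triangles, though that may or may not be what is happening in the posted renderer.

```cpp
#include <cmath>
#include <utility>

struct Vec2 { float x, y; };

// Fill the scanlines whose centres fall in [yTop, yBottom), between the
// edge a0->a1 and the edge b0->b1. Both edges must span that y range and
// must not be horizontal. Spans are half-open, so a pixel centre that lies
// exactly on a shared edge is drawn by only one of the two triangles.
template <typename Plot>
void fillHalf(Vec2 a0, Vec2 a1, Vec2 b0, Vec2 b1,
              float yTop, float yBottom, Plot plot)
{
    const int yStart = static_cast<int>(std::ceil(yTop - 0.5f));
    const int yEnd   = static_cast<int>(std::ceil(yBottom - 0.5f));
    const float slopeA = (a1.x - a0.x) / (a1.y - a0.y);
    const float slopeB = (b1.x - b0.x) / (b1.y - b0.y);

    for (int y = yStart; y < yEnd; ++y) {
        const float cy = y + 0.5f;                 // sample at the pixel centre
        float xa = a0.x + (cy - a0.y) * slopeA;
        float xb = b0.x + (cy - b0.y) * slopeB;
        if (xa > xb) std::swap(xa, xb);
        const int xStart = static_cast<int>(std::ceil(xa - 0.5f));
        const int xEnd   = static_cast<int>(std::ceil(xb - 0.5f));
        for (int x = xStart; x < xEnd; ++x)
            plot(x, y);
    }
}

// Sort the vertices by y, then draw the part above the middle vertex and
// the part below it separately. Each part is bounded by the long edge
// (v0->v2) and one of the two short edges.
template <typename Plot>
void rasterizeTriangle(Vec2 v0, Vec2 v1, Vec2 v2, Plot plot)
{
    if (v1.y < v0.y) std::swap(v0, v1);
    if (v2.y < v0.y) std::swap(v0, v2);
    if (v2.y < v1.y) std::swap(v1, v2);
    if (v0.y == v2.y) return;                      // zero height, nothing to draw

    if (v1.y > v0.y) fillHalf(v0, v2, v0, v1, v0.y, v1.y, plot);  // top half
    if (v2.y > v1.y) fillHalf(v0, v2, v1, v2, v1.y, v2.y, plot);  // bottom half
}
```

A quick way to check a rasterizer for seams is to draw two triangles that share an edge into a coverage counter and verify that no pixel is hit twice or missed.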
I've only had a quick glance, but it looks like you have a lot of copies/temporaries, which should be avoided.

One example is your Multiply function, which uses a total of 15 XMM registers (2 matrices in, 1 matrix out, and 3 vectors). While most modern CPUs do have that many registers, they are only visible when the processor is running in 64-bit (AMD64) mode, which exposes 16 XMM registers.
So, unless the compiler knows for sure that it is producing code which will only run on a 64-bit operating system, it has to assume that there are only 8 XMM registers, and will therefore keep juggling values between registers and memory for the temporaries.
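To illustrate the register-pressure point, here is a sketch of a 4x4 multiply that keeps only a handful of values live at once. It is not the Multiply function from the posted code: `Mat4` and `multiply` are hypothetical names, and the row-major layout is an assumption. The operands are passed by reference and the result is stored one row at a time, so roughly six XMM values are live in the loop (the four rows of `b`, one accumulator, one broadcast temporary), which fits within the 8 registers available to 32-bit code.

```cpp
#include <xmmintrin.h>

// Hypothetical row-major 4x4 matrix, 16-byte aligned for _mm_load_ps.
struct alignas(16) Mat4 {
    float m[4][4];
};

// out = a * b. Each output row is a linear combination of the rows of b,
// weighted by the four elements of the matching row of a.
void multiply(const Mat4& a, const Mat4& b, Mat4& out)
{
    const __m128 b0 = _mm_load_ps(b.m[0]);
    const __m128 b1 = _mm_load_ps(b.m[1]);
    const __m128 b2 = _mm_load_ps(b.m[2]);
    const __m128 b3 = _mm_load_ps(b.m[3]);

    for (int i = 0; i < 4; ++i) {
        // Broadcast one element of a at a time and accumulate.
        __m128 r = _mm_mul_ps(_mm_set1_ps(a.m[i][0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[i][1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[i][2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a.m[i][3]), b3));
        _mm_store_ps(out.m[i], r);   // write the finished row straight to memory
    }
}
```

Whether the compiler actually avoids spills still depends on the compiler and its optimization settings, so it is worth checking the generated assembly rather than assuming.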

Another example (which is not used by your code, but still) is the Identity/TranslationMatrix/ScaleMatrix functions. You generate an aligned array of floats on the stack, fill it one by one, and then do a copy using SSE. If SSERasterVector were a union, this extra copy could be avoided by writing the values directly where they belong.
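A rough sketch of the union idea follows. The names `RasterVec4`, `RasterMat4` and `makeTranslation` are hypothetical and do not come from the posted code, and putting the translation in the last column of a row-major matrix is an assumption. Note that writing through the `float` member and later reading through the `__m128` member is type punning via a union, which ISO C++ does not strictly define; GCC documents it as supported and it is common practice with SSE types, but treat it as a compiler-specific assumption.

```cpp
#include <xmmintrin.h>

// One storage location, viewable either as a SIMD register value or as
// four individual floats. The __m128 member gives it 16-byte alignment.
union RasterVec4 {
    __m128 simd;
    float  f[4];
};

struct RasterMat4 {
    RasterVec4 row[4];
};

// Write one row in place through the float view.
static void setRow(RasterVec4& r, float a, float b, float c, float d)
{
    r.f[0] = a; r.f[1] = b; r.f[2] = c; r.f[3] = d;
}

// Build the matrix directly in its final storage: there is no temporary
// float array on the stack and no SSE copy afterwards.
RasterMat4 makeTranslation(float tx, float ty, float tz)
{
    RasterMat4 m;
    setRow(m.row[0], 1.0f, 0.0f, 0.0f, tx);
    setRow(m.row[1], 0.0f, 1.0f, 0.0f, ty);
    setRow(m.row[2], 0.0f, 0.0f, 1.0f, tz);
    setRow(m.row[3], 0.0f, 0.0f, 0.0f, 1.0f);
    return m;
}
```

SIMD code can then use `m.row[i].simd` directly. An alternative that avoids the union altogether is to build each row with `_mm_setr_ps`, which stays within standard-defined behaviour.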
Thanks for the feedback, samoth; those are some very good points. Looking at the code, that might be one thing that's slowing it down, though not specifically the things you mentioned (Matrix x Matrix is only done a few times at the very beginning, as is the matrix setup).

It's possible that there are a lot more such instances in the code where it has to swap out registers unnecessarily, although I might have to resort to inline assembly if I want to fix that.

