MarkS

Optimizing a software rasterizer?


I know that this question is extremely general, but that is actually what I'm after. I've written a software rasterizer. It was strangely easy, but, of course, the frame rate sucks. Drawing two large perspective-correct texture-mapped (no filtering) triangles in a 1024x768 window gives me a frame rate of about 30. If I do bilinear filtering, the frame rate drops to around 3.

I'm not doing much right now to optimize the code. Obvious things like loop unrolling and assembly language aside, what is typically done to optimize a software rasterizer? Also, is there a way to blit more than one 32-bit pixel at a time?

One of the biggest performance hits is texture access. How is this typically optimized? I've tried interpolating the texture coordinates after perspective division, but that leaves me with affine (linear) texture mapping.

Sorry, I'm at work and the code is on my home computer. I'll post some code for more direct critiques when I get home.
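For reference, here is a minimal sketch of the usual way to get perspective correction without falling back to affine mapping: interpolate u/w, v/w, and 1/w linearly across the span (those really are affine in screen space) and do one divide per pixel to recover u and v. All names are illustrative, not from the poster's code, and it assumes u/v in [0,1] and x1 > x0:

```cpp
#include <cstdint>

// One scanline of perspective-correct texturing. uOverW, vOverW and oneOverW
// are linear in screen space, so they can be stepped; the per-pixel divide
// restores perspective. A common speedup is to do the true divide only every
// 8 or 16 pixels and interpolate affinely in between.
struct SpanVertex { float uOverW, vOverW, oneOverW; };

void DrawSpan(uint32_t* dest, int x0, int x1,
              SpanVertex left, SpanVertex right,
              const uint32_t* texture, int texW, int texH)
{
    const float invLen = 1.0f / float(x1 - x0);
    const float duw = (right.uOverW   - left.uOverW)   * invLen;
    const float dvw = (right.vOverW   - left.vOverW)   * invLen;
    const float dw  = (right.oneOverW - left.oneOverW) * invLen;

    float uw = left.uOverW, vw = left.vOverW, w = left.oneOverW;
    for (int x = x0; x < x1; ++x)
    {
        const float invW = 1.0f / w;       // the one divide per pixel
        const float u = uw * invW;
        const float v = vw * invW;

        int tx = int(u * (texW - 1));      // nearest-neighbour fetch for brevity
        int ty = int(v * (texH - 1));
        dest[x] = texture[ty * texW + tx];

        uw += duw; vw += dvw; w += dw;
    }
}
```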

Quote:
Original post by maspeir
One of the biggest performance hits is texture access. How is this typically optimized?


By building specialized fast hardware for exactly this purpose. Really. Larrabee, which was essentially a complete software renderer, couldn't get by without hardware texture samplers.

Quote:
Original post by maspeir
I'm not doing much right now to optimize the code. Obvious things like loop unrolling and assembly language aside, what is typically done to optimize a software rasterizer?


If you want to optimize, you'll need to profile to figure out where the bottlenecks are in your code. Couple that with a deep understanding of the hardware you're running on, so you can play nice with it and avoid stalling on memory or otherwise wasting computing resources on unnecessary work.

[wink] OK, now let's assume that this is done just for fun and my own purposes and won't be released. I want it fast enough to display simple scenes, maybe on the order of the original Doom, at 30 FPS or so in a 1024x768 window.

My original purpose in making this was educational (I've never made one before). However, now that it works, I want to see how fast I can make it.

ASM is really the way to go. Not only will the SSE instruction set give you access to more ops/cycle, it also lets you really fine-tune memory access. Preloading data, cache hints, etc. can all go a long way toward speeding things up.
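As a hedged illustration of what SSE buys you at the pixel level (and of the earlier question about writing more than one 32-bit pixel at a time), here is a sketch of filling a span four pixels per store; the function name and layout are made up for the example:

```cpp
#include <emmintrin.h>  // SSE2: _mm_set1_epi32, _mm_stream_si128
#include <cstdint>

// Fill a span with a solid 32-bit colour four pixels at a time using one
// 128-bit store per iteration. _mm_stream_si128 bypasses the cache, which is
// useful for a framebuffer you won't read back; dest16 must be 16-byte
// aligned for this variant (use _mm_storeu_si128 otherwise).
void FillSpan4x(uint32_t* dest16, int count, uint32_t color)
{
    __m128i c4 = _mm_set1_epi32(static_cast<int>(color));
    int i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(dest16 + i), c4);
    for (; i < count; ++i)      // leftover pixels at the end of the span
        dest16[i] = color;
    _mm_sfence();               // make the streaming stores visible
}
```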

Also, this might be of interest to you:
http://www.radgametools.com/pixomain.htm

Michael Abrash has been in the industry a long time and has written a lot of articles (just Google him, you'll get all sorts of stuff). They also have demos on that site, so you can get an idea of what sort of speed you can get if you're really good ;)

Quote:
Original post by maspeir
I apologize. That came across as rather dismissive. You threw me off with the specialized hardware comment.


I only meant that the way people have optimized software renderers in the past is by inventing the GPU.

The wait for memory is usually the performance killer when it comes to texturing with a software rasteriser. You can check by giving it very small textures.
If this is the case, there are two common ways to optimise: one is to swizzle the texture addressing, the other is mipmapping.
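For what swizzled addressing typically means in practice, here is a rough sketch using Morton (Z-order) interleaving, one common scheme; it assumes a square power-of-two texture stored in Morton order, and the helper names are mine:

```cpp
#include <cstdint>

// Interleave the bits of x and y (Morton / Z-order). Texels that are close in
// 2D end up close in memory, so a bilinear 2x2 fetch usually touches one or
// two cache lines instead of two widely separated rows.
static inline uint32_t Part1By1(uint32_t v)   // spread the low 16 bits apart
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

static inline uint32_t MortonIndex(uint32_t x, uint32_t y)
{
    return Part1By1(x) | (Part1By1(y) << 1);
}

// The texture must have been rearranged into Morton order at load time.
static inline uint32_t FetchSwizzled(const uint32_t* texture, uint32_t x, uint32_t y)
{
    return texture[MortonIndex(x, y)];
}
```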

Quote:
Original post by maspeir
I want it fast enough to be able to display simple scenes, maybe on the order of the original Doom at 30 FPS or so in a 1024x768 window.


This shouldn't be too hard to do. It would help if you posted the code for your triangle filler, as it's hard to say without knowing what techniques you're currently using. Do you use only simple texture mapping? Lightmapping? Perspective correction every pixel? Floating point or fixed point? Function calls in the scanline loop? Barycentric coordinates or simpler interpolation?

I'll post code shortly. I want to clean it up first. I did a lot of things to get it working, but they would make me look like an idiot.

To answer some of Erik's questions: floating point, perspective correction per pixel, simple interpolation, no lighting as of yet and one function call in the scanline loop (calculates the bilinear filtering; easily inlined).

Float to integer conversions can be pretty slow, so not using floating point in the inner loop could speed it up a lot, if that's acceptable. As the bilinear filtering seems to be taking up pretty much your entire frame time, I would investigate exactly what part of it is the bottleneck. Texture accesses alone shouldn't be cutting your framerate from 30 to 3, so there has to be something else too. If you haven't turned on SSE2 optimizations in the compiler, try doing so, as float->int conversions get faster, but eliminating them is preferable.
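As an aside, one way to sidestep the slow default conversion (older compilers emit an x87 control-word change for a plain (int) cast) is to truncate through an SSE intrinsic. A one-line sketch, assuming SSE is available:

```cpp
#include <xmmintrin.h>  // SSE: _mm_set_ss, _mm_cvttss_si32

// Truncating float->int without the x87 control-word dance that a default
// cast can generate; same result as (int)f for values in range.
static inline int FastFloatToInt(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```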

Quote:
Original post by maspeir
To answer some of Erik's questions: floating point, perspective correction per pixel, simple interpolation, no lighting as of yet and one function call in the scanline loop (calculates the bilinear filtering; easily inlined).


I assume that you are not using SSE functions. In the past, fixed-point math was used to speed up inner loops instead of floating point; it's hard to say whether it's faster or slower nowadays. The problem may come at the point where you need to convert your floating-point values to integers (for memory access, etc.). Check this link about the matter.
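Roughly, the old fixed-point approach looks like this for stepping a texture coordinate across a span in 16.16 format; purely illustrative, with u/v given in texel units rather than 0..1:

```cpp
#include <cstdint>

// 16.16 fixed point: top 16 bits integer part, low 16 bits fraction.
// Stepping a texture coordinate becomes one integer add per pixel, and the
// texel index is just a shift, with no float->int conversion in the loop.
typedef int32_t fixed16;

static inline fixed16 FloatToFixed(float f) { return (fixed16)(f * 65536.0f); }
static inline int32_t FixedToInt(fixed16 f) { return f >> 16; }

void TextureSpanFixed(uint32_t* dest, int count,
                      float u0, float v0, float u1, float v1,
                      const uint32_t* texture, int texW)
{
    fixed16 u  = FloatToFixed(u0);
    fixed16 v  = FloatToFixed(v0);
    fixed16 du = FloatToFixed((u1 - u0) / count);
    fixed16 dv = FloatToFixed((v1 - v0) / count);

    for (int i = 0; i < count; ++i)
    {
        dest[i] = texture[FixedToInt(v) * texW + FixedToInt(u)];
        u += du;
        v += dv;
    }
}
```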

Just my 5 cents.

Cheers!

That link is awesome! I knew that the conversion was causing a performance hit, but other than switching to fixed point, I did not know how to resolve this. I also didn't realize what the compiler was doing to convert floating point to integers.

Using fixed point, you need to be careful with the range. Rasterizer hardware needs 48-bit ints for that; you won't get away with just 32 bits in all cases, and using 64-bit ints is slower and causes more headaches when using SSE.
My non-vectorized rasterizer was about 5% faster using 32-bit fixed point than using float (making it switchable by a typedef is quite a nice way to benchmark), but I had to switch to 64-bit and it got way slower.
So I decided to use float in my vectorized version, although it's not as accurate as it would be using int.

In my rasterizer the texture access is also the slowest part; rendering a scene at 1280x720 takes 22 ms (including clear and blit to screen). When I assign the U+V coordinate as the color instead of reading from the texture array, I get away with about 16 ms.

The problem is that for memory access you cannot stay vectorized; reads can be quite random (with Larrabee you could read from random positions with a vector instruction, but it won't be that much faster).
One optimization I found was to always store the 4 pixels needed for bilinear interpolation as one SIMD-sized texel (yes, that makes the texture 4x the size). With some more shuffles you get quite a speedup.
Another is to defer texture fetching and use prefetch to hide latency: I loop over a bunch of pixels, evaluate the texture fetch index, issue a prefetch and store that texture coordinate in a queue, then I read an older position from the other end of the queue and fetch those texels. Sometimes that gives a few percent boost, but in other cases it slows things down (cases where the texture mipmap is small enough to stay in cache anyway).
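A rough sketch of what the "4 pixels as one SIMD-sized texel" layout could look like; the names and exact layout are my guesses, not the poster's actual code, and packedTex is assumed to be 16-byte aligned with texW * texH entries:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Build a packed texture once at load time: at every (x, y) store the 2x2
// block {(x,y), (x+1,y), (x,y+1), (x+1,y+1)} as four consecutive 32-bit
// pixels, clamping at the edges. Costs 4x the memory.
void BuildPackedTexture(__m128i* packedTex, const uint32_t* tex, int texW, int texH)
{
    for (int y = 0; y < texH; ++y)
    {
        int y1 = (y + 1 < texH) ? y + 1 : y;
        for (int x = 0; x < texW; ++x)
        {
            int x1 = (x + 1 < texW) ? x + 1 : x;
            __m128i block = _mm_set_epi32((int)tex[y1 * texW + x1],
                                          (int)tex[y1 * texW + x ],
                                          (int)tex[y  * texW + x1],
                                          (int)tex[y  * texW + x ]);
            _mm_store_si128(packedTex + (y * texW + x), block);
        }
    }
}

// The whole bilinear footprint is then a single aligned 128-bit load instead
// of four scattered reads; the weighting is done with shuffles afterwards.
static inline __m128i FetchBilinearFootprint(const __m128i* packedTex,
                                             int texW, int x, int y)
{
    return _mm_load_si128(packedTex + (y * texW + x));
}
```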

I'm not overly familiar with the concepts behind software rendering, but the game whose source code I'm currently converting to Direct3D9 (so it's playable on most modern PCs) has a software renderer that was written back in 1997 (IIRC). Back then there was also a game known as Jurassic Park: Trespasser, made by DreamWorks Interactive. It was programmed to run on PCs without 3D hardware capability (hardware acceleration was still in its infancy in home computers), and its engine is capable of rendering hundreds of triangles with close to hundreds of textures per scene in an outdoor environment, with perspective correction and other amazing things.

If I could get my hands on the source-code to that game, I'd share it out for learning experiences but alas EAGames are greedy bastards.

Anyway, this other game, the one I do have the source code to: some of the things I've noticed are that the game uses a buffer typed as void*, which can hold either unsigned short (565 color mode) or unsigned int (888) colors. There is a RenderASM.cpp file that handles all the copying of data to this buffer, and a final call at the end of the scene loop Blts the buffer to the screen via DirectDraw, which IIRC is considerably faster than GDI.

From what I understand of the engine so far, to increase speed and make the game playable, the devs swizzled textures and made use of mipmapping. Since most textures were at most 256x256 and all of them were 16-bit color, the mip chain only went down about three levels (giving 256x256 textures a minimum mip of 32x32, AFAIK). They also didn't mess around with data conversion if they could help it; most of the game resources were stored in the raw format required by the software renderer, which made memory copies easier and faster.

And finally, the very last thing I've noticed is that they ordered the rendering in a way that made a depth buffer unnecessary for a proper 3D scene, so they weren't doing per-pixel depth testing every frame, which helped performance considerably.

Because of this, on today's hardware (if the game boots up at all, given its use of deprecated features) we get approximately 40-70 fps, which is more than enough to be playable. By comparison, the hardware renderers run at up to 120 fps or more depending on how new your machine is (we've tested it on machines that are around 5-6 years old now).

I hope that some of this, which has come from simply observing the code and sifting through ASM commands, might be of use to someone.

@RexHunter99
Thanks for sharing those interesting bits :)

I have some questions, though.

Does it support texture filtering? If yes, also trilinear between mips?
At what resolutions does it reach 40-70 fps? Did you test on DX9 with filtering? (Funnily enough, some older hardware gets slower if you disable filtering.)

What CPU did you use for testing?



The reason old-school rasterizers got away with fixed-point rasterization is that they clipped triangles against the viewport. Nowadays, with those vectorized versions, you usually evaluate unclipped triangles, because clipping would put even more pressure on the triangle setup, which in most cases is limited to 1 tri/cycle on modern hardware.

I haven't lost interest in this topic, but I have decided to leave the code alone for now. It was for education and I really do not have the time right now to dedicate to something that I will not use. However, I do wish for this topic to remain active for as long as possible and I'm not marking it solved.

I find the replies so far to be very interesting.

