foolish_mortal

Why is my software renderer slow?


I know software renderers are pretty redundant nowadays, but I've written one for my game anyway as a way of learning and understanding the concepts. When I compare my renderer to Half-Life running in software mode, mine is very much slower. On my machine, Half-Life can handle 1280x1024 at acceptable framerates, whereas with my renderer you pretty much have to go down to 320x200. So I'm wondering what the professionals do (or did, before hardware acceleration took over) to make their renderers so much faster.

My software renderer is written in C++. There's no assembler in there yet, and my experience with assembler is very limited. I've timed the various stages of rendering, and the biggest time user is the rasterization stage, by a factor of at least 10, so that seems like the place to start. I figured that writing the inner loops of the rasteriser in assembler might speed it up. But I tried a simple test a while back (without texturing and shading), writing the inner loops in inline assembler, and it didn't seem to make much difference. If assembler is the answer, what exactly would assembler do that a bit of compiled C code wouldn't?

Here's an example of my C code for a scanline of a flat colour triangle. pDepth is a pointer to the Z buffer and pBmp is a pointer to the screen buffer; both are initialised to the start of the scanline. x0 is the x co-ordinate of the start of the scanline and x1 is the end. c0z is the z co-ordinate of the point at the start of the scanline, and c1z is the amount z changes per pixel, so adding c1z to c0z interpolates z across the scanline. In more complex rendering modes, other attributes like colour and texture co-ordinates are interpolated in the same way as z.

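for ( x = x0; x < x1; x++ )
{
	if ( c0z > *pDepth ) // Depth test passed for this pixel
	{
		*(pDepth++) = c0z; // Update the Z buffer

		*(pBmp++) = ccol[0]; // Write the three colour components
		*(pBmp++) = ccol[1];
		*(pBmp++) = ccol[2];
	}
	else
	{
		// Depth test failed: skip this pixel in both buffers
		++pDepth;
		pBmp += 3;
	}

	c0z += c1z; // Interpolate z
}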

How could this possibly be more efficient?

Find a profiling tool and run it on your code. This will tell you where the bottlenecks are. At that point you will know specifically which parts of your renderer are eating up all your CPU cycles. Start by optimizing the areas that are the worst. If at all possible, attack the problem by trying to come up with a better _algorithm_. Going to assembler is really only necessary as a last resort, when you get to a point at which you are unable or unwilling to find a different algorithm and now have to optimize very small and specific portions of your code.
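
Short of a full profiler, coarse per-stage timing can already show which stage dominates. The following is only a minimal sketch, assuming a compiler with C++11 `std::chrono`; the names `StageTimer` and `ScopedTiming` are hypothetical helpers made up for illustration, not part of any profiling tool.

```cpp
#include <chrono>

// Hypothetical helper: accumulates the wall-clock time spent in one named stage.
struct StageTimer {
    const char* name;
    double totalMs;
    explicit StageTimer(const char* n) : name(n), totalMs(0.0) {}
};

// RAII guard: on destruction, adds the lifetime of the enclosing scope
// (in milliseconds) to the StageTimer it was constructed with.
class ScopedTiming {
public:
    explicit ScopedTiming(StageTimer& timer)
        : timer_(timer), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTiming() {
        const std::chrono::steady_clock::time_point end =
            std::chrono::steady_clock::now();
        timer_.totalMs +=
            std::chrono::duration<double, std::milli>(end - start_).count();
    }

private:
    StageTimer& timer_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping each stage (transform, rasterise, blit) in a block with its own `ScopedTiming` and printing the totals every few hundred frames gives a rough breakdown. A real profiler goes further and attributes time to individual functions or lines.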

-me

Software rendering is quite slow, but it is a great exercise to go through if you really want to understand how DX/OGL do stuff and to appreciate how much 3D cards really do for you these days. Another area you may want to look at is clearing the screen (if you do so). It can REALLY eat cycles if it's not done right. At the very least use memset or something similar; there are faster ways using MMX/SSE assembler. Google for "fast screen clear". A good place to look for material on software rasterizers is old computer programming books, DOS-era stuff. They may be old but the theory still applies. For a newer book, you could look at LaMothe's newer 3D rendering book; though I personally hate the coding style, it has a lot of good info. Finally, read anything you can get your hands on by Michael Abrash. He's pretty much the god of software rendering. He co-wrote UT2k3 and 2k4's software renderer (probably HL's as well) if I remember correctly (it's actually middleware, but was licensed for use in Unreal).
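
As a rough illustration of the memset suggestion, here is a minimal sketch of clearing a colour buffer and a depth buffer. The `FrameBuffers` struct and both function names are hypothetical. It assumes 32-bit pixels, a float depth buffer, and a depth convention where a larger value wins the depth test (as in the original post's `c0z > *pDepth`), so an all-zero depth buffer is a usable clear value.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical frame storage: one 32-bit pixel and one float depth per pixel.
struct FrameBuffers {
    int width;
    int height;
    std::vector<std::uint32_t> color;
    std::vector<float> depth;

    FrameBuffers(int w, int h)
        : width(w), height(h),
          color(static_cast<std::size_t>(w) * h),
          depth(static_cast<std::size_t>(w) * h) {}
};

// memset works byte by byte, so it can only produce patterns such as
// all-zero bytes. That is black for the colour buffer and 0.0f for
// IEEE-754 floats in the depth buffer.
void clearToZero(FrameBuffers& fb) {
    std::memset(fb.color.data(), 0, fb.color.size() * sizeof(std::uint32_t));
    std::memset(fb.depth.data(), 0, fb.depth.size() * sizeof(float));
}

// For any other clear value, std::fill writes whole elements; optimising
// compilers commonly turn this into wide stores.
void clearTo(FrameBuffers& fb, std::uint32_t rgb, float clearDepth) {
    std::fill(fb.color.begin(), fb.color.end(), rgb);
    std::fill(fb.depth.begin(), fb.depth.end(), clearDepth);
}
```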

There's a good pipeline article here, though it doesn't have much to do with rasterization: http://www.cbloom.com/3d/techdocs/pipeline.txt

and there is some good info here: http://www.theteahouse.com.au/gba/index.html

Sorry I can't help you with specifics; I can't help much without knowing more. Do you have optimizations on? Building in Release mode?

Ravyne, NYN Interactive Entertainment
[My Site][My School][My Group]

Is this scanline code by itself in a function? I.e. is there code like

for (y=y0; y <= y1; ++y) RenderScanLine (x0[y], x1[y]);

If so, you might be paying the overhead of a function call on every scanline.

Other items:
* change your for loop to :
--- "for (x = x1-x0; x; --x)"

* make sure c0z is of the same type as *pDepth

* same with ccol and *pBmp

* you could use a temp pointer to ccol:
--- "tmpColor = ccol;"
--- "*(pBmp++) = *tmpColor++;"
--- "*(pBmp++) = *tmpColor++;"
--- "*(pBmp++) = *tmpColor;"

* or you could break ccol into ccolR, ccolG, ccolB

* changing the order of your if statement might speed it up, if the second case happens more often than the first.

* make sure compiler optimization is turned on, and maybe declare some of your variables (like x) as register.
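
Put together, a few of the suggestions above might look like the sketch below. It is only an illustration, not a measured improvement. The function name `fillFlatScanline` is hypothetical, and the float types are an assumption based on the later statement in this thread that the screen buffer holds three floats per pixel. The loop counts down but tests `n > 0`, so a span with x1 < x0 is simply skipped.

```cpp
// Sketch: count-down loop, colour components copied into locals once,
// and pointer increments shared by both branches of the depth test.
inline void fillFlatScanline(float* pDepth, float* pBmp, int x0, int x1,
                             float c0z, float c1z, const float ccol[3]) {
    const float r = ccol[0];
    const float g = ccol[1];
    const float b = ccol[2];

    for (int n = x1 - x0; n > 0; --n) {
        if (c0z > *pDepth) {   // same depth test as the original loop
            *pDepth = c0z;
            pBmp[0] = r;
            pBmp[1] = g;
            pBmp[2] = b;
        }
        ++pDepth;              // advance to the next pixel either way
        pBmp += 3;
        c0z += c1z;            // interpolate z
    }
}
```

Marking the function `inline` is only a hint; whether the call overhead actually disappears depends on the compiler and its optimisation settings.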

well, that's all I've got. good luck.


lonesock

Piranha are people too.

quote:
Original post by lonesock
* change your for loop to :
--- "for (x = x1-x0; x; --x)"


This shouldn't matter. It might make some difference with non-trivial iterators, but with basic types there should be no measurable difference at all.

Anyway, ensuring that the function is getting inlined might help. You might also want to look into unrolling the loop - it's going to be run an awful lot of times. Maybe Duff's Device would be useful?
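
For reference, here is a minimal sketch of manual loop unrolling on a simple span fill. Note that this is plain unrolling by four with a separate remainder loop, not Duff's Device itself, which folds the remainder into the unrolled body with a switch. The function name `fillSpanUnrolled` and the 32-bit pixel type are assumptions for illustration; an optimising compiler may well do this on its own.

```cpp
#include <cstddef>
#include <cstdint>

// Writes `count` copies of `value` starting at `dst`, four per iteration.
void fillSpanUnrolled(std::uint32_t* dst, std::size_t count, std::uint32_t value) {
    // Main body: four stores per loop test instead of one.
    for (std::size_t blocks = count / 4; blocks > 0; --blocks) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst += 4;
    }
    // Remainder: the zero to three pixels left over.
    for (std::size_t rest = count % 4; rest > 0; --rest) {
        *dst++ = value;
    }
}
```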

Of course, the best way is always to attempt to improve algorithms (do you rasterise more than you really need? Fail to cull occluded surfaces?); micro-optimisations may be necessary for something like a software rasteriser, but should still be the last phase.

Are you sure that the draw triangle function is what's really slowing you down? Perhaps you're doing something else, like not blitting to the screen smartly, or there's some problem with your geometry transformation code. I wrote a software renderer in Java, and it runs pretty comfortably even at 1280x1024. You can try commenting out different parts of the code to see what's wrong. In general, stay away from assembler, or even from worrying about putting your statements in a different order, unless you're sure that it's the only thing left to do and you realize that it'll only help by 10-20 percent at most. Compilers are pretty smart these days. If you're getting bad performance at a screen resolution that people were using 5-10 years ago, it's likely to be an algorithm problem.

Just from the example you gave, it looks like you're storing the 3 different color components in different indices in the image. I can't tell what pBmp is (maybe it's a byte pointer?), but if you want fast blitting, you should make sure your surface is of type 32-bit RGB.
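
To make the 32-bit suggestion concrete, here is a minimal sketch in C++. The names `packXRGB` and `putPixel` are hypothetical, and the layout (red in bits 16-23, green in bits 8-15, blue in bits 0-7, top byte unused) is an assumption that matches a common 32-bit RGB surface format. The point is that each pixel becomes one aligned 32-bit store instead of three separate writes.

```cpp
#include <cstddef>
#include <cstdint>

// Packs three 8-bit components into one 32-bit pixel (0x00RRGGBB).
inline std::uint32_t packXRGB(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return (static_cast<std::uint32_t>(r) << 16) |
           (static_cast<std::uint32_t>(g) << 8) |
            static_cast<std::uint32_t>(b);
}

// Writes one pixel with a single store. pitchInPixels is the number of
// pixels per row of the surface, which may be larger than the visible width.
inline void putPixel(std::uint32_t* surface, std::size_t pitchInPixels,
                     std::size_t x, std::size_t y, std::uint32_t colour) {
    surface[y * pitchInPixels + x] = colour;
}
```

For a flat-shaded triangle the colour can be packed once per triangle, so the inner loop only has to do the depth test and a single store.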

[edited by - Matei on June 9, 2004 9:20:41 PM]

quote:
Original post by Palidine
Find a profiling tool and run it on your code. This will tell you where the bottlenecks are. At that point you will know specifically which parts of your renderer are eating up all your CPU cycles. Start by optimizing the areas that are the worst. If at all possible, attack the problem by trying to come up with a better _algorithm_. Going to assembler is really only necessary as a last resort, when you get to a point at which you are unable or unwilling to find a different algorithm and now have to optimize very small and specific portions of your code.


Listen to this person, for they speak the wisest truths.

It is possible to make a very impressive software renderer. Check out swShader on SourceForge (its author is frequently on flipcode). That's impressive for software. But yeah, run the profiler.

tj963

Hi foolish_mortal,

I'm the developer of swShader. The key to high-performance software rendering is assembly. There is no way around it. C++ gives horribly little control over the advanced instructions offered by a CPU. But don't let assembly scare you. It takes a while to learn, but it pays off in everything you do, even C++ programming (like for debugging)!

Since you asked about Half-Life: it uses the same software renderer as Quake I. How it works, also at the assembly level, is explained in great detail in parts 4 and 5 of Chris Hecker's Perspective Texture Mapping articles. Start with parts 1, 2 and 3 first; they are invaluable. Beware, it's quite math heavy, but I highly advise reading them again and again until you understand all of it. Take pencil and paper to work out the formulas yourself and you will know how to write an accurate and fast software renderer.

Good luck, and if you have further questions, don't hesitate to ask!

[edited by - c0d1f1ed on June 10, 2004 7:07:24 PM]

Thanks for all the replies to this post :-)

quote:
find a profiling tool & run it on your code.

I haven't used one of these before. How do they work? Are there any free, easy to use ones that you'd recommend? I am using MS Visual C++ .NET standard edition to compile my code.

quote:
Another area you may want to look at is clearing the screen (if you do so) it can REALLY eat cycles if its not done right, at the very least use memset or something similar

I've got timers in my code and they say the biggest chunk is being used up by the rasterising. Clearing the screen and depth buffer does take a significant amount of time, but it's not the major thing I'm worried about. I use memset. I wasn't aware of any faster way, so I will look up MMX/SSE screen clearing methods.

quote:
Is this scanline code by itself in a function?

No

quote:
make sure c0z is of the same type as *pDepth

It is, as are ccol and pBmp

quote:
make sure compiler optimization is turned on

Well it is, I'm in Release Mode, not Debug mode, but this is VC++ Standard Edition, which means it is a non-optimising compiler. Is that likely to make a big difference in this kind of situation?

quote:
maybe declare some of your variables (like x) as register

How do I do that and how does it work?

quote:
change your for loop to :
--- "for (x = x1-x0; x; --x)"

quote:
This shouldn't matter. It might make some difference with non-trivial iterators, but with basic types there should be no measurable difference at all.

This is the kind of thing I'm not sure about. Does it make a difference or not? I guess it couldn't hurt to do it.

quote:
the best way is always to attempt to improve algorithms (do you rasterise more than you really need? Fail to cull occluded surfaces?

I do use a Bsp tree and visibility culling, but of course there is still some overdraw.

quote:
Perhaps you're doing something else, like not blitting to the screen smartly

How would you blit to the screen smartly? I use the function SetDIBitsToDevice from the Win32 API. The FillScanLines function that I showed you fills a screen buffer (pBmp) which is an array of floats, three floats per pixel (for the RGB values respectively). Before I blit it, I have to go through all these floats, convert them to chars, and copy them to the bitmap buffer which is blitted to the screen. This step might be open to optimisation. Also, my bitmap isn't actually 4-byte aligned, it's 3-byte aligned. Maybe switching to 4-byte alignment would help; I will give that a go.
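
The conversion step described above could be sketched as follows. This is a guess at one possible shape, not the poster's actual code. The function names are hypothetical, the float components are assumed to lie in the range 0 to 1 (the post does not state their range), and the output is a 32-bit pixel whose in-memory byte order on a little-endian machine is blue, green, red, unused, as a 32-bit uncompressed DIB expects. `dibStrideBytes` shows the usual rule that each DIB row is padded to a multiple of 4 bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Bytes per row of an uncompressed DIB: rows are padded to a 4-byte multiple.
inline int dibStrideBytes(int width, int bitsPerPixel) {
    return ((width * bitsPerPixel + 31) / 32) * 4;
}

// Clamps a component (assumed to be in 0..1) and rounds it to 0..255.
inline std::uint8_t toByte(float v) {
    v = std::min(1.0f, std::max(0.0f, v));
    return static_cast<std::uint8_t>(v * 255.0f + 0.5f);
}

// Converts pixelCount pixels from three floats (R, G, B) each into
// one 32-bit 0x00RRGGBB value each.
void convertFloatRgbToXrgb(const float* src, std::uint32_t* dst,
                           std::size_t pixelCount) {
    for (std::size_t i = 0; i < pixelCount; ++i, src += 3) {
        const std::uint32_t r = toByte(src[0]);
        const std::uint32_t g = toByte(src[1]);
        const std::uint32_t b = toByte(src[2]);
        dst[i] = (r << 16) | (g << 8) | b;
    }
}
```

With 32 bits per pixel every row is automatically a multiple of 4 bytes, so no padding logic is needed. With 24 bits per pixel a 5-pixel row occupies 16 bytes rather than 15.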

But as I said, when I time it, the most time is spent in the triangle rasterising step.

I have to admit, my FillScanLines function can get more complicated than what I've shown here, i.e. when there's texture mapping and so on. I use a different function for each set of options, so it doesn't have to keep testing which options are enabled when it's in the inner loops. But with the textures you can have 0 to 8 layered textures, so there's another for loop to loop through all the textures. Also, there could be caching issues? Would a profiler help identify these?
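
One way to get "a different function for each set of options" without writing each variant by hand is to make the options template parameters. The sketch below is hypothetical and covers only a single option (depth testing on or off), with assumed 32-bit pixels and float depth. The names `fillSpan` and `selectSpanFn` are made up for illustration. Because `DepthTest` is a compile-time constant, the compiler can drop the test entirely from the variant that does not need it.

```cpp
#include <cstdint>

// One span filler per option set, generated from a single template.
// In the DepthTest == false variant the condition is always true, so every
// pixel and its depth value are written unconditionally.
template <bool DepthTest>
void fillSpan(float* depth, std::uint32_t* pixels, int count,
              float z, float dz, std::uint32_t colour) {
    for (int i = 0; i < count; ++i, z += dz) {
        if (!DepthTest || z > depth[i]) {
            depth[i] = z;
            pixels[i] = colour;
        }
    }
}

typedef void (*SpanFn)(float*, std::uint32_t*, int, float, float, std::uint32_t);

// The option is tested once, outside the inner loop, to pick the variant.
inline SpanFn selectSpanFn(bool depthTest) {
    return depthTest ? &fillSpan<true> : &fillSpan<false>;
}
```

The same pattern could in principle take the number of texture layers as an integer template parameter, so the per-pixel loop over layers has a fixed trip count that the compiler can unroll. Whether that helps in practice would have to be measured.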

quote:
The key to high-performance software rendering is assembly. There is no way around it. C++ gives horribly little control over the advanced instructions offered by a CPU. But don't let assembly scare you. It takes a while to learn, but it pays off in everything you do, even C++ programming (like for debugging)!

I would quite like to learn assembler better, even though it does scare me a bit, just to prove I can.

Is it bad to use inline assembler? I've heard that it can be 'slow', although I have no idea why it would be. If I can't use inline assembler, how else would I integrate it with my c++ code? (see I told you I'm a newbie with assembler...)

Thanks also to everyone for the links you posted.

[edited by - foolish_mortal on June 10, 2004 8:23:12 PM]
