
Average pixels per triangle?


Ravyne    14300
Hi, I am trying to find out what the average number of pixels per triangle is for benchmarking purposes. I realize that this is a rather arbitrary question as it depends on the scene, resolution, etc. I guess what I am seeking is what has been considered an average pixel count when others have benchmarked their performance, so I have something concrete to compare to. I've googled for this information for about an hour, both on gamedev and web-wide. Can anyone give me an answer or a reasonable estimate? My software renderer currently draws 88k flat-shaded triangles per second in 16-bit color with vertices (10, 10)-(128, 12)-(12, 128). I will be implementing a new edge-walking algorithm which should push it over 100k, possibly as high as 120k.

Guest Anonymous Poster
Depending on what you want to measure (setup or fill rate), there are different things you can do. Back in the early CAD days (SGI Indigo, anyone?), 25-pixel triangles were pretty much the norm (a 10x5 triangle). To be really confident about your rate, you need to rotate the triangle through 360 degrees to be sure you aren't hitting only the fast cases, such as long scanlines.
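For instance, something like this (just a sketch; draw_triangle stands in for whatever fill routine you're benchmarking):

#include <math.h>
#include <stdio.h>
#include <time.h>

/* Spin a ~25-pixel (10x5) test triangle through 360 degrees so every
   edge orientation gets hit, not just long horizontal scanlines.
   draw_triangle is a placeholder for your own fill routine. */
void bench_rotating(void (*draw_triangle)(float, float, float, float, float, float))
{
    const float cx = 160.0f, cy = 100.0f;        /* centre of rotation */
    const float bx[3] = { -5.0f, 5.0f, 0.0f };   /* base verts, 10 wide */
    const float by[3] = { -2.5f, -2.5f, 2.5f };  /* 5 tall => ~25 px area */
    long tris = 0;
    clock_t t0 = clock();

    for (int deg = 0; deg < 360; ++deg) {
        float a = (float)deg * 3.14159265f / 180.0f;
        float c = cosf(a), s = sinf(a);
        float x[3], y[3];
        for (int i = 0; i < 3; ++i) {            /* rotate about the centre */
            x[i] = cx + bx[i] * c - by[i] * s;
            y[i] = cy + bx[i] * s + by[i] * c;
        }
        for (int n = 0; n < 1000; ++n) {         /* many fills per angle */
            draw_triangle(x[0], y[0], x[1], y[1], x[2], y[2]);
            ++tris;
        }
    }
    printf("%.0f triangles/sec\n",
           tris / ((double)(clock() - t0) / CLOCKS_PER_SEC));
}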

I do remember back in the day my perspective-correct texture mapping routine got about 5 MP/s on a Pentium-90. Heh, that was fast for the time (I believe the first 3dfx Voodoo card was ~80 MP/s and didn't have much CPU overhead, whereas my 5 MP/s used all the CPU :)....

Guest Anonymous Poster
The problem is pretty simple. Just store the area in pixels of each rendered triangle (see the well-known Graphics FAQ for info on that), store how many triangles are rendered, and then find the ratio between the two.
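For example, a sketch of the bookkeeping (count_triangle is my own name; hook it into your draw call):

/* Twice a triangle's area is the |cross product| of two of its edges;
   accumulate the areas and divide by the triangle count at the end. */
static double area_sum  = 0.0;
static long   tri_count = 0;

void count_triangle(double x0, double y0, double x1, double y1,
                    double x2, double y2)
{
    double cross = (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0);
    area_sum += 0.5 * (cross < 0.0 ? -cross : cross);
    tri_count++;
}

double average_pixels_per_triangle(void)
{
    return tri_count ? area_sum / tri_count : 0.0;
}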

d000hg    1199
Well, for games it's a normal scene, which probably means quite small polys. You could tile the screen with 5x5 triangles in a grid of quads and, while it's running, gradually increase the poly size, storing poly size and fps data each frame for later. To take overdraw into account, maybe draw the grid two or three times with each poly randomly chosen from three z locations?
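A grid generator for that might look like this (a sketch; emit_triangle is a made-up stand-in for the renderer call):

/* Tile a w x h screen with square cells of 'size' pixels; each cell is
   split into two triangles, so each poly covers size*size/2 pixels.
   Call with a growing 'size' each frame and log (size, fps) pairs. */
void tile_screen(int w, int h, int size,
                 void (*emit_triangle)(int, int, int, int, int, int))
{
    for (int y = 0; y + size <= h; y += size)
        for (int x = 0; x + size <= w; x += size) {
            emit_triangle(x, y, x + size, y, x, y + size);
            emit_triangle(x + size, y, x + size, y + size, x, y + size);
        }
}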

Charles B    863
Hi Ravyne, long time.

For a soft renderer, surely 25-50 pixels is a decent norm. You should certainly run three tests:
small triangles (10 pix?)
=> highlights transforms + edge walking
medium triangles (50 pix?)
=> edge walking
large triangles (1000 pix?)
=> pixel rate

The quickest edge walking exploits the fact that there are only two different kinds of increments when you follow a Bresenham line down (y++).

For instance, if dx/dy = 3.26, then the sequence is something like this:
x+=3; y++;
x+=3; y++;
x+=4; y++;
x+=3; y++;

Whether it's 3 or 4 (== 3+1) depends on a carry bit between the fractional part and the integer part of your x coordinate.

This becomes very powerful for quickly updating the values of the linear gradients (for instance the screen pointer, the rgb, or the homogeneous texture coords).

This ends up with something like this :
carry = (xnew^xold)>>31; /* top-bit change flags the carry */
ptr += ptr_step[carry];
r += r_step[carry];
g += g_step[carry];
b += b_step[carry];

This saves the multiplications you might have in :
r = rx*x + ry*y + r00;
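Put together in C, one workable variant looks like this (a sketch with my own naming: the fraction lives in a full 32-bit register so the unsigned wrap is exactly the carry, and the step tables are indexed 0/1):

#include <stddef.h>
#include <stdint.h>

/* Walk one edge downward. Each gradient has two precomputed steps:
   step[0] for the short x increment, step[1] for the long one (+1). */
void walk_edge(int y0, int y1,
               uint32_t frac, uint32_t dfrac,      /* 0.32 fixed point */
               uint32_t *ptr, const ptrdiff_t ptr_step[2],
               uint32_t r, const uint32_t r_step[2])
{
    for (int y = y0; y < y1; ++y) {
        uint32_t f_old = frac;
        frac += dfrac;                    /* wraps exactly on carry */
        unsigned carry = (frac < f_old);  /* 0 = short step, 1 = long */
        ptr += ptr_step[carry];           /* row step, or row step + 1 */
        r   += r_step[carry];             /* same trick for any gradient */
        /* ... fill the span starting at ptr with the current r ... */
    }
}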

Share this post


Link to post
Share on other sites
Ravyne    14300
Yes Charles, it has been a while, hasn't it? I've seen you make a few posts around gamedev and you're always on target. I've also been keeping an eye on the SBGFRC (that was the acronym, wasn't it?) group. I'm very impressed by what I hear of your math lib, as well as the other members' contributions.

You are on target once again, Charles, about using Bresenham's to walk the edges (currently I am using float math, just to get something up quick and dirty ;)). My span-line filler takes three parameters: a pointer to the beginning of the line segment in the framebuffer, a color value, and a pixel count. So I intend to calculate the pointer and count while walking the left edge, and only the count while walking the right edge. This, of course, leaves me with exactly the information I need, while eliminating floats and making the algo purely incremental.
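In code, the interface is essentially this (simplified sketch, 16-bit pixels assumed):

#include <stdint.h>

/* The whole rasterizer boils down to feeding this three values per line:
   where the span starts, what color, and how many pixels. */
static inline void fill_span(uint16_t *dst, uint16_t color, int count)
{
    while (count-- > 0)
        *dst++ = color;
}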

I should note that this triangle filler is part of a 2D software rendering system; perhaps rasterization would have been a better choice of words. I do hope to apply a modified version in my 3D software graphics lib, though of course I will then have to track texels, normals, and lighting, among other things. If you have any additional advice I would be glad to hear it.

Thanks everyone for your help, I appreciate it all very much.

Charles B    863
Quote:
Original post by Ravyne
...I intend to calculate the pointer and count while walking the left edge, and only the count while walking the right edge.
Right, that's the way to do it. Just beware of precision losses on long spans, or shared edges between two triangles will reveal cracks and distortions in texture continuity or lighting.

Quote:
Original post by Ravyne
If you have any additional advice I would be glad to hear it.

Maybe this. I remember some implementation details that make it more efficient in asm. Since you already know the math trick you probably know them too, but anyway, it was something like this:

; use memory for constants (read only)
; => more read/write registers free
add eax, Dx_lo
; or this for better scheduling
;mov edx, Dx_lo
;add eax, edx
sbb ecx, ecx ; ecx = -carry (0 or -1)
mov edi, prev_ptr
add edi, inc_ptr[4+4*ecx] ; picks inc_ptr[4] (no carry) or inc_ptr[0] (carry)
; etc...

Beyond that I knew a multitude of tricks combining floating point, integers, or MMX. But that's kind of outdated... just like software rendering is :)

For instance, true bilinear filtering plus magnification in less than 10 cycles per pixel, with floating point only. It worked on a Pentium 90 (non-MMX). But it was very tricky and a real headache to code: the kind of 100-line asm routine that takes many days of chess playing to produce. My sincere advice here is: don't try to do that again. It was only useful because the very first 3D cards were such a pity ;)

Now there could be more interesting challenges for fun. With SIMD capabilities and high frequencies, I think a more than decent 3D renderer could be done. I suppose one could make surprisingly good things in pure software compared to average-quality code written for 3D hardware.

Ravyne    14300
Quote:
Original post by Charles B
For a soft renderer, surely 25-50 pixels is a decent norm. You should certainly run three tests:
small triangles (10 pix?)
=> highlights transforms + edge walking
medium triangles (50 pix?)
=> edge walking
large triangles (1000 pix?)
=> pixel rate


I wanted to post again for comparison purposes before I implement the new algo. (Note: these measurements are as close as I can easily manage, but are not dead-accurate.)

Small Triangle (10 pix) = 2.35 million triangles per second.
Medium Triangle (50 pix) = 1.36 million
Large Triangle (1000 pix) = 386 thousand

Charles B    863
What would be most interesting for you now is to see where you can improve things:
- transforms + projection
- clipping
- rasterizing (edge walking)
- filling

1000-pix triangles mostly give you the fill rate:
386 million pixels per second; if your machine is 2 GHz, then it's 2000/386, about 5 cycles per pixel. Rather correct, though you can certainly improve it (write 8 aligned bytes at once with MMX, for instance).
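For instance, a sketch of such a filler with MMX intrinsics (assuming 32-bit pixels and a compiler that provides <mmintrin.h>):

#include <mmintrin.h>
#include <stdint.h>

/* Solid fill writing 8 aligned bytes (two 32-bit pixels) per store. */
static void fill_span32(uint32_t *dst, uint32_t color, int count)
{
    if (((uintptr_t)dst & 7) && count > 0) {  /* preamble: align to 8 */
        *dst++ = color;
        count--;
    }
    __m64 two = _mm_set1_pi32((int)color);    /* pixel pair in one register */
    for (; count >= 2; count -= 2, dst += 2)
        *(__m64 *)dst = two;
    if (count)                                /* postamble: odd pixel */
        *dst = color;
    _mm_empty();                              /* restore the FPU state */
}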

Now, getting the numbers for smaller triangles reveals the contributions of the earlier pipeline stages of rendering.

50 pix : 68 M pix/sec (29 cycles/pix)
10 pix : 23 M pix/sec (86 cycles/pix)

860 cycles per 10-pix triangle looks quite big. This means that filling a one-megapixel offscreen buffer with such triangles would get you 23 FPS. I think it can be improved so that your software renderer can offer an alternative to 3D cards. I know it's probably a pedagogical study case; still, 100 FPS would open more opportunities for this software.

- The three projections should cost around 50-75 cycles. Less with SIMD. And I don't even mention strips or indexed arrays, which let only about one vertex be transformed per triangle on average.

- Then maybe vectorial clipping is a bottleneck. It could be disabled by the user when the bounding volume of a mesh is known to be fully visible.

- Getting the projected edge slopes and the gradients (color, texture, etc.) is where I had to focus most of my energy for small triangles.

- Edge walking can also be time consuming. I remember that I focused on that point nearly as much as on the inner loop. (*)

- The inner loop becomes a bit more complicated when you unroll the loops or treat 4 pixels at once.

(*) Very small triangles could be drawn very fast with some SIMD features. Basically, imagine you draw a 4X- or 8X-width rectangle and apply a triangle mask to it. Compute 4 aligned columns at once and use write masks to take the edge borders into account. Doing this instead of classical edge walking removes the time-consuming overhead of the inner loops. It also helps pack the pixels for 4X or 8X fill rate, even on the edges.
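A scalar stand-in for the idea (sketch, my naming; with real SIMD the four lanes and the masked store happen in one instruction each; I assume a winding that makes all three edge values non-negative inside, and no clipping):

#include <math.h>
#include <stdint.h>

typedef struct { float x, y; } Pt;

/* Edge function: cross product of edge a->b with point p. */
static float edge_fn(Pt a, Pt b, float px, float py)
{
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

/* Cover the bounding box in columns of 4 pixels and write only the
   lanes whose mask says "inside"; no per-scanline edge walking.
   pitch is in pixels; the caller keeps the triangle on screen. */
void fill_masked(uint32_t *fb, int pitch, Pt v0, Pt v1, Pt v2, uint32_t color)
{
    int minx = (int)floorf(fminf(fminf(v0.x, v1.x), v2.x)) & ~3; /* align */
    int maxx = (int)ceilf (fmaxf(fmaxf(v0.x, v1.x), v2.x));
    int miny = (int)floorf(fminf(fminf(v0.y, v1.y), v2.y));
    int maxy = (int)ceilf (fmaxf(fmaxf(v0.y, v1.y), v2.y));

    for (int y = miny; y < maxy; ++y)
        for (int x = minx; x < maxx; x += 4)
            for (int lane = 0; lane < 4; ++lane) {  /* one SIMD op really */
                float px = x + lane + 0.5f, py = y + 0.5f;
                if (edge_fn(v0, v1, px, py) >= 0.0f &&
                    edge_fn(v1, v2, px, py) >= 0.0f &&
                    edge_fn(v2, v0, px, py) >= 0.0f)
                    fb[y * pitch + x + lane] = color;  /* masked write */
            }
}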

My intuition is that current machines should allow a peak of 50-100 mega 10-pix untextured triangles per second. But that would certainly not be a piece of cake to code. Your results are already very decent.

Ravyne    14300
UPDATE:

I have implemented the new algorithm, and while there are a few cases not working currently, my initial results have been very promising. In tests of a static triangle rendered many times, I achieved the following performance in 32-bit color at 640x480:

9 pixel - 5.40m (320, 32) - (318, 35) - (323, 35)
25 pixel - 4.40m (320, 32) - (315, 37) - (325, 37)
100 pixel - 2.50m (320, 32) - (310, 42) - (330, 42)
400 pixel - 1.25m (320, 32) - (300, 52) - (340, 52)
900 pixel - 740k (320, 32) - (290, 62) - (350, 62)

Again, these are 2D solid-shaded triangles, so there's nothing fancy going on. The performance gain was much greater than expected. I'm sure performance will drop when the algo works for every case, but not by much if my understanding of the failure is accurate. I would also like to note that I currently only use asm in the span-filler, and while I am aiming for a minimum of MMX support, I haven't yet implemented any MMX techniques.

For those interested, my relevant specs are listed below:
Intel 875P mobo (Dell 400sc)
Pentium 4, 800 MHz FSB, 3.0 GHz
Dual-channel DDR400, 1 GB

Charles B    863
Yes, a 2X speedup from just one element out of the 8 or so that can each significantly improve perfs. ;) You see, in the end 2^8 = 256; there is surely still a good margin. 6 mega triangles already looks promising. You are in the area of a Dreamcast, not bad at all. I suppose that if your edge walking were in assembly too, with the span rendering inlined (a trapezoid routine replacing the span filler), you would gain another 2X for small triangles. Function call overhead costs a lot on small loops.

Can you explain concisely what the buggy case is? Is it related to clipping? Or to the way you compute the starting x? That's a typical source of graphical bugs (missing pixels on edges, etc.). You must be really precise about how you take the fractional part into account, how you jump to the exact next line under the vertex, how you deal with fractional precision, how you change the xstart when a new left edge is treated, etc.
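For reference, the sub-pixel prestep I mean looks roughly like this (a sketch in float for clarity; names are mine):

#include <math.h>

/* Step x to the first scanline centre at or below the top vertex, so the
   fractional position of the vertex is accounted for exactly. */
void edge_prestep(float x_top, float y_top, float slope /* dx/dy */,
                  int *y_start, float *x_start)
{
    float y_first = ceilf(y_top - 0.5f) + 0.5f;   /* first sample row */
    *y_start = (int)(y_first - 0.5f);             /* integer row index */
    *x_start = x_top + (y_first - y_top) * slope; /* sub-pixel correction */
}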

But be sure it's worth the pain to keep improving your code, because reaching the 'perfect code' for triangle rendering is a nearly bottomless job.

Ravyne    14300
Yes, I'm sure asm would benefit the edge walking, although I don't know if I want to go quite that far. I'll be able to recycle this code for next year's CG class, but they discourage using lots of asm; I suppose I'll look into it for my own benefit, though. The span-filler is small and inlined, so there shouldn't be a function call. I suppose I should examine the compiler's output just to be sure.

The failing cases seem to do with the order of the vertices. It should be a matter of ensuring order at the top of the function or reworking the algo to accept unordered vertices; I have a feeling that the second option will be faster and eliminate some conditional jumps. In either case, the inner loop would remain the same.

I have also confirmed my suspicions about why 16-bit color is slower than 32-bit in triangle rendering. It was indeed an alignment issue. Other functions did not exhibit this behavior as they were always aligned. Aligning to 32- or 64-bit boundaries should increase that performance by around 2-3 times.
That would put 16-bit triangles in the neighborhood of 12-18m. In any case, I plan to switch to an MMX span-filler in the near future.

As for whether this continued improvement is worth it, I believe it is for my own education. Unfortunately, software rendering is not a very viable skill these days. With all the new engines coming out with (sometimes requiring) pixel-shader support, software becomes unusable even as a fall-back renderer. UT2k4 contains a software mode written by Michael Abrash, which actually pieces together hand-coded asm fragments as features are turned on or off, and even he states that pixel-shader support could not be done at satisfactory quality and speed. Maybe this could change when multi-core CPUs/SMP systems become common: using one core (or CPU) to simply do rasterization, leaving the other to perform geometry calculations (building a display list for the other core/CPU), gameplay, etc. Anyhow, I'm wandering off topic.


I believe it will be worthwhile when I one day get an interview and the man behind the desk says "We were very impressed with your use of Direct3D in your demos," to which I can reply "Actually, that's all my own software rendering routines," hopefully sealing the deal :D I suppose they'll know when they look through the source, but I'm allowed to have occasional delusions of grandeur, right?

Thanks again for your input Charles.

Charles B    863
Quote:
Original post by Ravyne
The failing cases seem to do with the order of the vertices. It should be a matter of ensuring order at the top of the function or reworking the algo to accept unordered vertices; I have a feeling that the second option will be faster and eliminate some conditional jumps. In either case, the inner loop would remain the same.

I'd start by handling:
- backface culling: always
- CCW winding: always
Then you can start making your API compatible with the OpenGL philosophy, letting the user choose each option.

So now this leaves you a cross product (if the verts are already projected) and a sign evaluation. If the sign is positive, your rasterizer won't have to cope with inversely ordered triangles. Even so, spans of negative length should render nothing: for (x = a; x < b; x++) renders nothing when a >= b. I don't remember any difficulty with this problem, unless you want to accept any winding order; then you'd probably better swap vertices 2 and 3 at the top of the function, after the backfacing test.
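As a sketch, the top-of-function test could look like this (my naming):

typedef struct { float x, y; } P2;

/* Returns 0 if the triangle should not be rasterized. The cross product
   of the projected edges gives twice the signed area; its sign is the
   winding. Swap v1/v2 to accept either order when culling is off. */
static int prepare_triangle(P2 *v0, P2 *v1, P2 *v2, int backcull)
{
    float area2 = (v1->x - v0->x) * (v2->y - v0->y)
                - (v1->y - v0->y) * (v2->x - v0->x);
    if (area2 == 0.0f)
        return 0;                        /* degenerate: nothing to draw */
    if (area2 < 0.0f) {
        if (backcull)
            return 0;                    /* back-facing: culled */
        P2 t = *v1; *v1 = *v2; *v2 = t;  /* normalize the winding */
    }
    return 1;
}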


Quote:

It was indeed an alignment issue.

Sure, I forgot to answer that question. Misalignment and even 16-bit accesses are renowned performance killers; I remember writing 2 bytes was often faster on the first Pentiums. The best approach is to group 16-bit pixels two by two (or/shift them together) and write aligned 32-bit data at once. Same thing with 64-bit data if you use MMX. This means you have to deal with some nasty loop preambles and postambles. Then the idea of the masks comes back again. The loop is complete ;)
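In C the pairing looks roughly like this (sketch; the same shape extends to 64-bit MMX stores):

#include <stdint.h>

/* 16-bit solid span fill that pairs pixels into aligned 32-bit writes. */
static void fill_span16(uint16_t *dst, uint16_t color, int count)
{
    uint32_t pair = ((uint32_t)color << 16) | color;

    if (((uintptr_t)dst & 3) && count > 0) {  /* preamble: align to 4 */
        *dst++ = color;
        count--;
    }
    uint32_t *dst32 = (uint32_t *)dst;
    for (; count >= 2; count -= 2)            /* two pixels per write */
        *dst32++ = pair;
    if (count)                                /* postamble: odd pixel */
        *(uint16_t *)dst32 = color;
}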


Quote:

Anyhow, I'm wandering off topic.

Software rendering is fundamental to understanding 3D hardware far more in depth. Having tackled the performance issues in software rendering gives you an intuitive understanding of how the undocumented parts of the hardware may work, and how to help them speed up. I also consider that, after the dull years of the first hardware, the development of shaders tends to let coders reappropriate the lower levels of rendering code. So it's not as obsolete a concern as it seems. Maybe in the longer term, GPUs will enable even more layers of coding, possibly letting a coder define a completely customized rendering pipeline, for instance rendering complex shapes without tessellation (example: direct NURBS rasterization). The architecture of the PS2, for instance, went in this direction.


Quote:

I suppose they'll know when they look through the source, but I'm allowed to have occasional delusions of grandeur, right?

Sure, it's one way to go. The other is hard work and producing things. It's all a question of balance, with some clear objectives in mind.
