# Graphs. Lots of graphs.

This is a continuation of a previous journal entry.

I've made a few changes to the source code, so you might want to download it again. I've added some better user-interface features, and Eyal 'ET3D' Teler has contributed a fourth, D3DX-based, rendering method...

If you're running on ATI hardware I'd be interested to know what text is reported in the 'Cache Size:' label. It seems that some (or possibly all) ATI GPUs don't support the query used to retrieve that information. Without it the cache optimization falls back to a default value and will almost certainly underperform [sad]

One of the subtle, but interesting, changes I made was to the timing code. In particular I set it up to force (as well as time) a pipeline flush when generating the throughput statistics. I realised that it would be quite possible for the application to be counting triangles in the wrong time period depending on when the Draw**() call was submitted. Accurate timing and a flush should mean that the triangles/second statistic is more reliable.
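The idea can be sketched as a small timing harness. This is an illustrative version, not the article's actual code: the flush is an injectable callback (in D3D9 this would typically be an event query polled with `D3DGETDATA_FLUSH`), so the structure of "submit, flush, then stop the clock" is visible without any API dependency.

```cpp
#include <chrono>
#include <functional>

// Hypothetical sketch: time a batch of draw calls, forcing the GPU pipeline
// to drain before stopping the clock. Without the flush, the timer could stop
// while previously submitted work is still in flight, attributing triangles
// to the wrong time period.
double timeWithFlush(const std::function<void()>& submitWork,
                     const std::function<void()>& flushPipeline)
{
    auto start = std::chrono::high_resolution_clock::now();
    submitWork();      // e.g. the Draw**() calls for this batch
    flushPipeline();   // block until the GPU has consumed the submitted work
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

The key design point is simply that the flush sits *inside* the timed region, so the measured interval covers all of the submitted work and nothing else.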

So, a few graphs showing the results...

(All results are from my Pentium 4 @ 3.06GHz, 1GB RAM, Nvidia GeForce 6800 AGP (512MB VRAM), Windows XP Pro machine)

The advantages of the different methods are quite clear in this diagram. Each invocation of the VS requires 91 instructions - thus any saving due to caching is going to be quite noticeable.

Key points:
• Below 40,000 triangles/batch there's not much difference; almost certainly due to the overhead of submitting the work.
• There is no particular benefit to the cache-optimized techniques until you get over 175,000 triangles/batch.
• It's not until you get over 400,000 triangles/batch that the custom optimization method starts to shine.
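For reference, the triangles/second statistic plotted in these graphs is just the product of batch size, batches per frame, and frame rate. A trivial sketch (the numbers in the comment are illustrative, not taken from the graphs):

```cpp
// Throughput statistic as plotted: triangles per second derived from the
// batch size and the measured frame rate.
double trianglesPerSecond(int trianglesPerBatch, int batchesPerFrame, double fps)
{
    return static_cast<double>(trianglesPerBatch) * batchesPerFrame * fps;
}
// e.g. 400,000 tri/batch, 1 batch/frame, 100fps -> 40 MTri/s
```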

This graph shows the frame rates for each of the tests in the previous graph. Whilst the triangle throughput tends to converge at a maximum, the frame rate still drops off as the batch size gets bigger. Nothing hugely surprising about that [smile]

There's a strange little 'bump' at the 403,202 mark. Not sure why that happens, but it's not a one-off on my system - completely reproducible...

The next two graphs are from a new mode that I implemented using only an N·L ('NoL') lighting model. This weighs in at a mere 15 instructions - roughly a sixth of the shader previously used.
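For anyone unfamiliar with the term, "NoL" is just the Lambertian diffuse term: the dot product of the surface normal and the light direction, clamped to [0, 1]. A CPU-side sketch of what the lightweight shader evaluates per vertex (both vectors assumed pre-normalised):

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };

float dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// The N.L ('NoL') diffuse term: full brightness when the normal faces the
// light, falling to zero when it faces away. With N and L normalised this is
// all the simple lighting mode computes.
float lambert(const Vec3& n, const Vec3& l)
{
    return std::clamp(dot(n, l), 0.0f, 1.0f);
}
```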

Interestingly, all of the indexed approaches perform nearly identically. I'm not too sure, but my working theory is that this shows a bottleneck elsewhere in the GPU. It was for this reason that I originally used a heavyweight shader.

It's also interesting to note that the non-indexed method is actually faster for batches in the 125,000 to 320,000 range. This suggests that some sort of triangle setup or memory latency/bandwidth issue is the aforementioned bottleneck - data need only be pulled from a single, sequentially accessed, buffer.
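The caching trade-off discussed here can be illustrated with a toy FIFO model of the post-transform vertex cache (real hardware cache behaviour varies, so treat this purely as a sketch). A non-indexed list always costs one shader invocation per index, while an indexed list skips the shader on a cache hit - which is exactly why the saving scales with shader cost:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative FIFO post-transform cache model. Returns the number of
// vertex-shader invocations needed for an indexed triangle list: a hit in
// the FIFO reuses the transformed vertex, a miss runs the shader and evicts
// the oldest entry. A non-indexed list would cost indices.size() invocations.
std::size_t shaderInvocations(const std::vector<int>& indices,
                              std::size_t cacheSize)
{
    std::deque<int> fifo;
    std::size_t misses = 0;
    for (int idx : indices) {
        if (std::find(fifo.begin(), fifo.end(), idx) == fifo.end()) {
            ++misses;               // cache miss: transform the vertex
            fifo.push_back(idx);
            if (fifo.size() > cacheSize)
                fifo.pop_front();   // evict the oldest entry
        }
    }
    return misses;
}
```

For a quad drawn as two triangles sharing an edge, the indexed list needs only 4 transforms instead of 6 - and with the 91-instruction shader above, each avoided transform saves 91 instructions, which is why the heavyweight shader makes the caching benefit so visible.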

This shows the frame rates for the previous tests and also backs up the observation about the non-indexed lists.

Based on the results from this sample, the indication is that indexed and cache-optimized methods are generally better - but only substantially so at large batch sizes with complex shaders.

With small batch counts and simple shaders the chances are that other areas of the system are the limiting factor. This fits with the general consensus that most GPU-intensive applications are fill-rate limited rather than transform/vertex limited.

I wonder if this will change with the unified shaders under ATI's R6xx hardware and Direct3D10? In theory the system should dedicate more of its resources to the vertex shader in this sample...

So, I ran it on my system, no vcache query supported and I'm now running the Cat6.4 drivers (newest ones out at the time of writing).

I only tested at 400x400 as I wanted a direct compare with the OGL version, which needs to be adjusted slightly I think.

Anyways, first numbers are with the complex shader, second with the simple:

| Method | Complex shader | Simple shader |
|--------|----------------|---------------|
| Simple | 13.9 MTri @ 46fps | 47.95 MTri @ 153fps |
| Index | 42.0 MTri @ 142fps | 107.0 MTri @ 333fps |
| Opti | 39.0 MTri @ 134fps | 105.0 MTri @ 333fps |
| D3DX | 68.7 MTri @ 226fps | 103.0 MTri @ 333fps |


The Opti with complex is the interesting one, as before it was coming in at around the same speed as the D3DX version [oh]

Quote:
 So, I ran it on my system, no vcache query supported and I'm now running the Cat6.4 drivers (newest ones out at the time of writing).
Okay, so next trick is to try and find out how you get the vertex cache size from an ATI card [oh]

Quote:
 The Opti with complex is the intresting one, as before it was coming it at around the same speed as the D3DX version
I changed the default value from '10' to '16' for the case where it can't auto-detect the cache size. Reason being that 10 struck me as a bit small and 16 is the same as the older GeForce cards... Other than that it should be the same.

Cheers,
Jack

Ah, from my own OGL tests performance went south above 14, so that could well be the problem...

Jack, regarding cache size, change it to 14 by default. That's the cache size on ATI cards.

Also, I don't like your use of "batch size". My code that used small (D3DX optimised) batches performed about the same as the D3DX optimised code. It took considerably more CPU due to batch overhead, but the throughput was very good. I prefer a measure of "triangles per frame" rather than "batch size", which I see as "triangles per call".

Quote:
 Jack, regarding cache size, change it to 14 by default. That's the cache size on ATI cards.

That explains my finding then with OGL (10, 12 and 14 it worked fine, 16 and up MTri/sec dropped off)

Quote:
 regarding cache size, change it to 14 by default. That's the cache size on ATI cards.
Okay, may I ask where you found that out? I've been looking through all the public information but couldn't see any reference to it at all [headshake]

Quote:
 I don't like your use of "batch size". <-- snip --> I prefer a measure of "triangles per frame" rather than "batch size", which I see as "triangles per call".
Yup, I agree. Not entirely sure why I called it 'batch size' to be honest. Just threw it in at the start and never bothered to change it [grin]

I'll correct both of these and re-upload it.

Cheers,
Jack

Quote:
 I'll correct both of these and re-upload it.
Done [smile]

Quote:
 I wonder if this will change with the unified shaders under ATI's R6xx hardware and Direct3D10? In theory the system should dedicate more of its resources to the vertex shader in this sample...

In a while (late today?) we might have some indication. In their docs, ATi has implied that pure R2VB usage in a D3D9 app gets close to having a unified shader architecture. One thing I've wanted to try out for a while but haven't gotten around to is applying R2VB to this application. That is, precalculate all of the (post-transform) positions and colours of the terrain using R2VB, and then when rendering the verts the data is merely copied through to the PS.

It would be *very* interesting to see how R2VB stacks up with this. Be sure to let me know if/when you get anything from that [smile]

Cheers,
Jack

You could fake it by generating the data with a render target - compute the colours of the terrain from the lighting, then feed that back to the VS through VTF.
