I've made a few changes to the source code, so you might want to download it again. I've added some better user-interface features, and Eyal 'ET3D' Teler has contributed a fourth, D3DX-based, rendering method...
If you're running on ATI hardware I'd be interested to know what text is reported in the 'Cache Size:' label. It seems that some (or possibly all) ATI GPUs don't support the query used to retrieve that information. Without it the cache optimization falls back to a default value and will almost certainly underperform [sad]
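For anyone curious, the query in question is presumably the D3DQUERYTYPE_VCACHE query - it's the only mechanism in Direct3D 9 that reports this information. The sketch below shows roughly how retrieving the cache size might look; the function name and fallback handling are mine, not necessarily what the sample actually does:

[code]
#include <d3d9.h>

// Hypothetical helper (not the sample's actual code) showing how the
// post-transform vertex cache size can be queried under Direct3D 9.
UINT GetVertexCacheSize( IDirect3DDevice9* pDevice, UINT defaultSize )
{
    IDirect3DQuery9* pQuery = NULL;

    // CreateQuery() fails on devices that don't expose this information,
    // which appears to be the case on (at least some) ATI hardware.
    if( FAILED( pDevice->CreateQuery( D3DQUERYTYPE_VCACHE, &pQuery ) ) )
        return defaultSize;

    D3DDEVINFO_VCACHE info = { 0 };

    pQuery->Issue( D3DISSUE_END );

    // Spin until the driver returns the data, flushing the command buffer.
    while( pQuery->GetData( &info, sizeof( info ), D3DGETDATA_FLUSH ) == S_FALSE )
        ;

    pQuery->Release();

    // CacheSize == 0 usually means the driver didn't report anything useful.
    return ( info.CacheSize > 0 ) ? info.CacheSize : defaultSize;
}
[/code]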
One of the subtle, but interesting, changes I made was to the timing code. In particular, I set it up to force (as well as time) a pipeline flush when generating the throughput statistics. I realised it would be quite possible for the application to be counting triangles in the wrong time period, depending on when the Draw*() call was submitted. Accurate timing and a flush should make the triangles/second statistic more reliable.
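The flush itself is nothing exotic - an event query spun on with D3DGETDATA_FLUSH does the job. Something along these lines (a sketch with made-up names, assuming a QueryPerformanceCounter-based timer, rather than a copy of the sample's code):

[code]
#include <d3d9.h>
#include <windows.h>

// Blocks until the GPU has consumed everything submitted so far.
// pEventQuery is assumed to have been created earlier with
// pDevice->CreateQuery( D3DQUERYTYPE_EVENT, &pEventQuery ).
void FlushAndWait( IDirect3DQuery9* pEventQuery )
{
    BOOL done = FALSE;

    // The event completes only once all prior commands have been processed.
    pEventQuery->Issue( D3DISSUE_END );

    while( pEventQuery->GetData( &done, sizeof( done ), D3DGETDATA_FLUSH ) == S_FALSE )
        ;   // busy-wait; D3DGETDATA_FLUSH keeps the command buffer moving
}

// Usage inside the benchmark loop:
//
//    QueryPerformanceCounter( &start );
//    pDevice->DrawIndexedPrimitive( ... );   // or whichever Draw*() variant is being tested
//    FlushAndWait( pEventQuery );            // the triangles are definitely drawn by now
//    QueryPerformanceCounter( &stop );
[/code]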
So, a few graphs showing the results...
(All results are from my Pentium 4 @ 3.06GHz, 1GB RAM, Nvidia GeForce 6800 AGP (512MB VRAM), Windows XP Pro machine)
The advantages of the different methods are quite clear in this diagram. Each invocation of the VS requires 91 instructions - thus any saving due to caching is going to be quite noticeable.
This graph shows the frame rates for each of the tests in the previous graph. Whilst the triangle throughput tends to converge at a maximum, the frame rate still drops off as the batch size gets bigger. Nothing hugely surprising about that [smile]
There's a strange little 'bump' at the 403,202 mark. Not sure why that happens, but it's not a one-off on my system - it's completely reproducible...
The next two graphs are from a new mode that I implemented using only a simple N·L lighting model. This weighs in at a mere 15 instructions - roughly a sixth of the 91-instruction shader used previously.
Interestingly, all of the indexed approaches give nearly identical performance. I'm not too sure, but my working theory is that it's showing a bottleneck elsewhere in the GPU. It was for this reason that I originally used a heavy-weight shader.
It's also interesting to note that the non-indexed method is actually faster for the 125,000 to 320,000 sized batches. This suggests that some sort of triangle setup or memory latency/bandwidth issue is the aforementioned bottleneck - with a non-indexed list, data need only be pulled from a single, sequentially accessed buffer.
This shows the frame rates for the previous tests and also backs up the observation about the non-indexed lists.
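For anyone who hasn't compared the two submission paths directly, the difference at the API level boils down to something like this (a simplified sketch - the Vertex layout and function names are placeholders, not the sample's actual code):

[code]
#include <d3d9.h>

struct Vertex { float x, y, z, nx, ny, nz; };   // placeholder layout

// Non-indexed: every triangle's three vertices sit consecutively in one
// vertex buffer, so the GPU streams through a single buffer sequentially -
// no post-transform cache re-use, but very predictable memory access.
void DrawNonIndexed( IDirect3DDevice9* pDevice, IDirect3DVertexBuffer9* pVB, UINT numTriangles )
{
    pDevice->SetStreamSource( 0, pVB, 0, sizeof( Vertex ) );
    pDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 0, numTriangles );
}

// Indexed: vertices are stored once and referenced through an index buffer,
// so shared vertices can be re-used from the post-transform cache - at the
// cost of fetching from two buffers, one of them semi-randomly.
void DrawIndexed( IDirect3DDevice9* pDevice, IDirect3DVertexBuffer9* pVB,
                  IDirect3DIndexBuffer9* pIB, UINT numVertices, UINT numTriangles )
{
    pDevice->SetStreamSource( 0, pVB, 0, sizeof( Vertex ) );
    pDevice->SetIndices( pIB );
    pDevice->DrawIndexedPrimitive( D3DPT_TRIANGLELIST, 0, 0, numVertices, 0, numTriangles );
}
[/code]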
Based on the results from this sample, the indication is that indexed and cache-optimized methods are generally better - but only substantially so with large batches and complex shaders.
With small batches and simple shaders the chances are that other areas of the system are the limiting factor. This fits with the general consensus that most GPU-intensive applications are fill-rate limited rather than transform/vertex limited.
I wonder if this will change with the unified shaders under ATI's R6xx hardware and Direct3D10? In theory the system should dedicate more of its resources to the vertex shader in this sample...