I just got a book in the mail, "Real-Time Rendering, Second Edition", and skimmed through a ton of material until I reached the chapter on pipeline optimizations. I read that part thoroughly and figured that, since I was already overhauling the way the CPU feeds geometry to the GPU, I would apply the analysis, optimization and performance tips I learned from the book.
Here I have over 180,000 polygons rendering at 80+ FPS in the IDE using the D3DUSAGE_WRITEONLY flag and triangle lists. There is clearly a bottleneck in the geometry stage and I expect to boost performance with index buffers. I'm hoping that using multiple textures (thus multiple batches) and index buffers together will keep the FPS around what it is now.
I might be able to get additional performance from normal approximation and from hardware-manufacturer-specific optimizations, should I research them later. There's massive headroom in the application stage.
I wrote a benchmark function that times other functions. This made me realize that as my face pool got larger, searching the pool for a free index took longer and longer, since every insertion scanned the pool from the start. I optimized this so that instead of searching the pool for a free index, the pool is expanded and the new face is always added to the end. This made certain functions run over 200 seconds faster, at the cost of some wasted memory. An even better approach would be to keep a separate list of the empty spots in the face pool (an empty spot would be recorded whenever a face is deleted). Adding a new face would then simply take the first element from that list, and the list would shrink by one element.
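The free-list idea described above could look roughly like the sketch below. It is written in C++ rather than VB6, and the `Face` layout, the `FacePool` class and its method names are all hypothetical, since the engine's real structures aren't shown here. It also takes the last recorded free slot instead of the first, simply because popping from the back of a vector is constant time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical face record; the engine's real layout is not shown in the post.
struct Face {
    int v[3] = {0, 0, 0};  // vertex indices (placeholder data)
    bool alive = false;    // false once the face has been deleted
};

class FacePool {
public:
    // Reuse a recorded empty slot if there is one; otherwise grow the pool by one.
    std::size_t add(const Face& f) {
        std::size_t idx;
        if (!free_.empty()) {
            idx = free_.back();   // no searching: take a known empty slot
            free_.pop_back();
        } else {
            idx = faces_.size();
            faces_.emplace_back();
        }
        faces_[idx] = f;
        faces_[idx].alive = true;
        return idx;
    }

    // Mark the slot as empty and remember its index for later reuse.
    void remove(std::size_t idx) {
        assert(idx < faces_.size() && faces_[idx].alive);
        faces_[idx].alive = false;
        free_.push_back(idx);
    }

    std::size_t capacity() const { return faces_.size(); }
    std::size_t freeCount() const { return free_.size(); }

private:
    std::vector<Face> faces_;        // the pool itself
    std::vector<std::size_t> free_;  // indices of empty spots in faces_
};
```

With this layout, adding a face never scans the pool, and deleted slots get reused instead of being left as wasted memory.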
Everything is being tested in the map editor of my engine on System A. "Real-Time Rendering, Second Edition" outdoes every highly rated book in my collection two-fold.
I worked around an automation error I've had for a while when calling CreateVertexBuffer. When specifying the byte size, the general method is to multiply the size of your FVF vertex (in this case I'm using the legacy 32 byte standard) by the vertex count. It turns out that 3145728 works but 24576 * 32 does not, although the two are not actually the same size: 24576 * 32 is 786432, while 3145728 is 98304 * 32. I've seen things like this in VB6 before when working with Longs. Furthermore, I switched to hardware vertex processing instead of software (software was faster when rendering face by face, but now with batches hardware is faster).
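The buffer size arithmetic is easy to sanity-check outside the engine. The sketch below (C++, with hypothetical helper names) computes stride times vertex count in 64-bit and also reports whether a value fits in a 16-bit signed integer, which is the range of VB6's Integer type. Whether 16-bit Integer arithmetic had anything to do with the error above is only a guess on my part; the sketch just makes the numbers visible.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if value fits in a 16-bit signed integer (-32768..32767),
// the range of VB6's Integer type.
bool fitsInInt16(std::int64_t value) {
    return value >= std::numeric_limits<std::int16_t>::min() &&
           value <= std::numeric_limits<std::int16_t>::max();
}

// Byte size of a vertex buffer: bytes per vertex multiplied by vertex count.
// Computed in 64-bit so the product itself cannot wrap for realistic inputs.
std::int64_t vertexBufferBytes(std::int64_t vertexCount, std::int64_t strideBytes) {
    assert(vertexCount >= 0 && strideBytes > 0);
    return vertexCount * strideBytes;
}
```

For example, `vertexBufferBytes(24576, 32)` gives 786432 and `vertexBufferBytes(98304, 32)` gives 3145728. Both operands 24576 and 32 fit in 16 bits, but neither product does.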
My next test consisted of using a 64x64 quadpatch (single 32768 primitive batch), testing regions and native textures of 8x8, 512x512 and 2048x2048. Quad sizes of 8 and 16 were tested (doubling the quad edge length, which roughly quadruples the number of pixels to be drawn on screen).
On System A: 174 FPS in all tests. On System C: 127 FPS in all tests.
I'm doing a comparison test for: 1) drawing a native 512x512 texture to each quad 2) drawing a 512x512 region of a 2048x2048 texture to each quad 3) drawing a 2048x2048 texture to each quad.
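For case 2, the only thing that changes on the geometry side is the texture coordinates, since a region of a larger texture is selected by scaling pixel positions into the 0..1 range. A minimal sketch of that mapping in C++ follows; `UVRect` and `regionToUV` are hypothetical names, and the sketch ignores any half-texel adjustment that might be needed to avoid bleeding at region edges.

```cpp
#include <cassert>

// Normalized texture coordinates of a rectangular region.
struct UVRect {
    float u0, v0;  // top-left corner
    float u1, v1;  // bottom-right corner
};

// Convert a pixel-space region (x, y, w, h) of a texW x texH texture
// into normalized texture coordinates.
UVRect regionToUV(int x, int y, int w, int h, int texW, int texH) {
    assert(texW > 0 && texH > 0);
    assert(x >= 0 && y >= 0 && w > 0 && h > 0);
    assert(x + w <= texW && y + h <= texH);
    return UVRect{
        static_cast<float>(x) / texW,
        static_cast<float>(y) / texH,
        static_cast<float>(x + w) / texW,
        static_cast<float>(y + h) / texH
    };
}
```

A 512x512 region at the origin of a 2048x2048 texture maps to (0, 0) through (0.25, 0.25), while a native 512x512 texture used whole maps to (0, 0) through (1, 1).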
I've written a function to set up a quad patch with a specified number of columns, rows and quad size to test this. For this test, a quad patch of 10 columns by 10 rows is made, which is 100 quads, or one batch of 200 primitives. The project was compiled to an exe. The results below are average frames per second over 20 seconds.
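A quad patch builder along these lines might look like the C++ sketch below. It is not the engine's actual function: the `Vertex` layout is simplified, the names are hypothetical, and it builds an indexed patch (shared corner vertices plus an index list) to show what index buffers would save compared to a plain triangle list. The texture coordinates assume one full texture repeat per quad with wrap addressing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified, hypothetical vertex: position plus one set of texture coordinates.
struct Vertex {
    float x, y, z;
    float u, v;
};

struct QuadPatch {
    std::vector<Vertex> vertices;       // (cols + 1) * (rows + 1) shared corners
    std::vector<std::uint32_t> indices; // 6 per quad (two triangles)
};

QuadPatch makeQuadPatch(int cols, int rows, float quadSize) {
    assert(cols > 0 && rows > 0 && quadSize > 0.0f);
    QuadPatch patch;
    const int vx = cols + 1;  // vertices per row
    const int vy = rows + 1;  // vertices per column

    // One vertex per grid corner, laid out on the XZ plane.
    patch.vertices.reserve(static_cast<std::size_t>(vx) * vy);
    for (int r = 0; r < vy; ++r) {
        for (int c = 0; c < vx; ++c) {
            patch.vertices.push_back({c * quadSize, 0.0f, r * quadSize,
                                      static_cast<float>(c), static_cast<float>(r)});
        }
    }

    // Two triangles per quad, referencing the shared corner vertices.
    patch.indices.reserve(static_cast<std::size_t>(cols) * rows * 6);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            const std::uint32_t i0 = static_cast<std::uint32_t>(r * vx + c);
            const std::uint32_t i1 = i0 + 1;
            const std::uint32_t i2 = i0 + static_cast<std::uint32_t>(vx);
            const std::uint32_t i3 = i2 + 1;
            patch.indices.insert(patch.indices.end(), {i0, i2, i1, i1, i2, i3});
        }
    }
    return patch;
}
```

For the 10 by 10 patch this gives 121 vertices and 600 indices (200 triangles), whereas a non-indexed triangle list needs 600 full vertices for the same 200 primitives.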
Test System A
1st run: 1) 1042 2) 1042
2nd run: 1) 1052 2) 1050
3rd run: 1) 1047 2) 1055
4th run: 3) 1040

Test System C
1st run: 1) 560 2) 560 3) 560
No final conclusion yet. There is almost no difference between the three cases. I will run a few more tests with much larger and multiple quad patches and various textures.