A few optimizations I've made:
- Use GBA-specific, faster memcpy functions
- Do two pixels at a time when possible to take advantage of the GBA's 32-bit bus width (1 pixel is 16 bits)
- Moved code into IWRAM (which is the fastest memory on the GBA, but at 32 KB it is fairly limited in size)
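To illustrate the two-pixels-at-a-time idea, here is a minimal sketch of a span fill that uses 32-bit stores wherever it can. It is not the actual renderer code: the names `fill_span16` and `u32_alias` are made up for this example, and it assumes a GCC-style compiler (such as the ones normally used for GBA development) for the `may_alias` attribute.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: a 32-bit word that is allowed to alias the
   16-bit pixel buffer (GCC/Clang extension, assumed available). */
typedef uint32_t __attribute__((may_alias)) u32_alias;

/* Fill `count` 16-bit pixels starting at `dst` with `color`,
   writing two pixels per store whenever the address is 4-byte aligned. */
static void fill_span16(uint16_t *dst, size_t count, uint16_t color)
{
    if (count == 0) {
        return;
    }

    /* Leading pixel: dst is 2-byte aligned but not 4-byte aligned. */
    if (((uintptr_t)dst & 2u) != 0) {
        *dst++ = color;
        count--;
    }

    /* Pack the same color into both halves of a 32-bit word. */
    uint32_t pair = (uint32_t)color | ((uint32_t)color << 16);
    u32_alias *dst32 = (u32_alias *)dst;

    /* Main loop: one 32-bit store covers two pixels. */
    for (size_t pairs = count / 2; pairs > 0; pairs--) {
        *dst32++ = pair;
    }

    /* Trailing pixel if the remaining count was odd. */
    if (count & 1u) {
        *(uint16_t *)dst32 = color;
    }
}
```

The leading and trailing single-pixel writes are what "when possible" costs: very short spans gain little or nothing from the 32-bit path.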
I've now got the same image as in the previous post running at twice the speed (38fps). A simple cube of the same size is now running at 70-80fps.
Unfortunately, while testing, I found out my renderer isn't quite correct: some pixels were being plotted twice. I had broken the Windows build of the renderer when switching from floating point to fixed point, so this weekend I spent some time cleaning up the fixed-point code to the point that I can now run it in Windows again. To make sure I was getting the right results, I compared my output with OpenGL output. This gave me some initial problems because I couldn't get OpenGL to disable anti-aliasing. After some help from Sages, I changed some settings in my Display Settings. Although this didn't seem to help at the time, it did work the next time I worked on this (after a restart). So if anyone has problems like that: a restart might help you out.
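A comparison like this can be automated with a simple pixel diff. The sketch below is only an illustration, not the code used here: `count_pixel_diffs` is a hypothetical name, and it assumes both images have already been converted to the same 16-bit pixel format and size.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the pixels that differ between two images of `n` 16-bit pixels.
   If `first_diff` is not NULL, it receives the index of the first
   mismatch, or `n` when the images are identical. */
static size_t count_pixel_diffs(const uint16_t *a, const uint16_t *b,
                                size_t n, size_t *first_diff)
{
    size_t diffs = 0;
    size_t first = n;

    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            if (diffs == 0) {
                first = i;   /* remember where the images first diverge */
            }
            diffs++;
        }
    }

    if (first_diff != NULL) {
        *first_diff = first;
    }
    return diffs;
}
```

Dividing the returned index by the image width gives the row of the first mismatch, which is usually enough to find the offending triangle edge.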
The next step will obviously be to fix my renderer (sigh), which is what I'm doing at the moment.
A number of optimizations I'm still planning to do are:
- Matrix functions. These seem awfully slow at the moment. Unfortunately, I'll need to dig into ASM for this, something I'm not looking forward to.
- Switch to indexed colors. The copy from backbuffer to framebuffer will go twice as fast (or maybe won't be necessary, since the 256-color screens natively support double buffering). I'm currently planning a game that should look fine with 256 colors. Unfortunately, doing something like this will probably break the general usability of my renderer, so I'll have to do some thinking on how to implement this within the current code.
- After I've made the switch to 256 colors, I'll need to change the renderer to be able to plot 4 pixels at once (to use the full 32 bits of the bus). This should give some increase in speed in my test cases, although I'm a bit worried that my game will not really benefit from this, as the triangles I'll be rasterizing will only be a few pixels each.
- T&L cache. The current test cases are flat-shaded, and will therefore benefit not one iota from this, as each vertex needs to be different anyway to accommodate the separate normal. However, if I'm willing to calculate the normal for each triangle inside the renderer, I can drop this, and plot a cube using only 8 points instead of 24 (using quads). This should be a nice boost, but again I'm holding off on this one because it will break the generality of the renderer.
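Calculating the normal per triangle inside the renderer comes down to a cross product of two edges. Below is a minimal sketch of that step, not the renderer's actual code: the 16.16 fixed-point format and the names `fixed`, `Vec3`, `fix_mul` and `face_normal` are assumptions made for this example, and the result is left unnormalized.

```c
#include <stdint.h>

typedef int32_t fixed;                 /* assumed 16.16 fixed-point format */
#define FIX_SHIFT 16
#define INT_TO_FIX(i) ((fixed)((i) * (1 << FIX_SHIFT)))

typedef struct { fixed x, y, z; } Vec3;

/* Multiply two 16.16 values using a 64-bit intermediate.
   Relies on >> being an arithmetic shift for negative values,
   which is the case with GCC. */
static fixed fix_mul(fixed a, fixed b)
{
    return (fixed)(((int64_t)a * b) >> FIX_SHIFT);
}

/* Unnormalized face normal of triangle (a, b, c): (b - a) x (c - a).
   Its direction depends on the winding order of the vertices. */
static Vec3 face_normal(Vec3 a, Vec3 b, Vec3 c)
{
    Vec3 e1 = { b.x - a.x, b.y - a.y, b.z - a.z };
    Vec3 e2 = { c.x - a.x, c.y - a.y, c.z - a.z };
    Vec3 n = {
        fix_mul(e1.y, e2.z) - fix_mul(e1.z, e2.y),
        fix_mul(e1.z, e2.x) - fix_mul(e1.x, e2.z),
        fix_mul(e1.x, e2.y) - fix_mul(e1.y, e2.x)
    };
    return n;
}
```

With the normal derived from the positions like this, the three vertices of a cube corner no longer need separate copies just to carry different normals, which is where the 24-to-8 reduction comes from. The price is six multiplies per triangle, plus a normalization if the lighting needs unit length.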
As the situation is right now, I think it's almost time to start the game code as well, so I can get a good feel for how fast it will run in actual conditions as opposed to test cases.