It's mainly GPU-based. E.g. I kept moving color/shadow calculations up the chain, and ended up pre-calculating a color-lookup texture, so that the vertex shader only does simple normal calculations and passes the resulting brightness value to the fragment shader, which uses it as the color texture coordinate plus a simple step-based shadow. I may be sacrificing some physical realism, but since realistic rendering is not my goal, that isn't an issue.
For the discard optimization, I keep two vertex buffers: one for triangles that have fragments to be discarded, and one for fully visible triangles. Here it's the CPU that decides, for each triangle, which buffer it goes into. The buffers are static, so this is done at level load time.
As for the edge-filter pass, it's too much for the original iPad to handle, so it's only enabled when running on a newer iPad.