I would go for the pre-calculated solution. That lets you use as complicated an algorithm as you want to generate the normal vectors from your noise function, and then store the noise value and the normal vector together in your noise structure. Then, when you do the Perlin lookup, just look up the 4-component value and normalize the normal-vector portion of the result.
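To make that concrete, here is a rough sketch in Python of what I mean. It is only an illustration under my own assumptions, not your actual setup: the names `build_table` and `lookup`, the 16x16 wrapping lattice, the random height values, and the central-difference normals are all made up for the example, and plain bilinear interpolation stands in for whatever your real Perlin lookup does.

```python
import math
import random

SIZE = 16  # hypothetical lattice resolution


def build_table(size=SIZE, seed=0):
    """Precompute (nx, ny, nz, value) for every lattice point."""
    rng = random.Random(seed)
    height = [[rng.random() for _ in range(size)] for _ in range(size)]
    table = [[None] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            # Central differences with wrap-around; any more elaborate
            # normal-generation scheme could be substituted here.
            dx = (height[y][(x + 1) % size] - height[y][(x - 1) % size]) * 0.5
            dy = (height[(y + 1) % size][x] - height[(y - 1) % size][x]) * 0.5
            n = (-dx, -dy, 1.0)  # z is always 1, so this is never <0,0,0>
            length = math.sqrt(n[0] ** 2 + n[1] ** 2 + n[2] ** 2)
            table[y][x] = (n[0] / length, n[1] / length, n[2] / length,
                           height[y][x])
    return table


def lerp4(a, b, t):
    """Linearly interpolate two 4-component entries."""
    return tuple(p + (q - p) * t for p, q in zip(a, b))


def lookup(table, u, v):
    """One lookup returns both the unit normal and the noise value."""
    size = len(table)
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0
    x0, y0 = x0 % size, y0 % size
    x1, y1 = (x0 + 1) % size, (y0 + 1) % size
    row0 = lerp4(table[y0][x0], table[y0][x1], fx)
    row1 = lerp4(table[y1][x0], table[y1][x1], fx)
    nx, ny, nz, value = lerp4(row0, row1, fy)
    # Interpolated normals are generally shorter than unit length,
    # so renormalize the first three components.
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length), value
```

You would call `build_table()` once up front and then `lookup(table, u, v)` per sample, so each sample costs a single 4-component fetch instead of several noise evaluations.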
I think that would simplify the process, keep the number of noise lookups down, and let you work in a single pass, which should handle all of your requirements! The only work is pre-calculating the normal vectors and making sure your routine can never produce a <0,0,0> value, since a zero vector cannot be normalized.
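For the <0,0,0> case, one option is to guard the normalization itself. This is just a sketch: `safe_normalize`, the `eps` threshold, and the choice of straight-up `(0, 0, 1)` as the fallback are my own assumptions, so pick whatever default direction makes sense for your surface.

```python
import math


def safe_normalize(v, fallback=(0.0, 0.0, 1.0), eps=1e-12):
    """Return v scaled to unit length, or `fallback` if v is (near) zero."""
    length = math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])
    if length < eps:
        # Dividing by a zero length would fail, so use the default direction.
        return fallback
    return (v[0] / length, v[1] / length, v[2] / length)
```

If you run this check while pre-calculating, the stored normals are always valid and the per-lookup code does not need to branch.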