For example, it's my understanding that multiplication is slightly faster than division, and saves a few CPU cycles here and there.

Is this correct/true, and should I be doing it this way? And what other optimizations might I use in general to make my math code blazing fast and efficient?

Assuming that the SSE instruction set is used instead of the old x87 FPU stack, a single-precision scalar float division (the DIVSS instruction) has a latency of 14-32 cycles and a throughput of one instruction per 14-32 cycles, depending on the architecture. Double-precision scalar float division (DIVSD) has a latency of 22-39 cycles and a throughput of one instruction per 20-39 cycles.

Compare that to multiplication: a single-precision scalar float multiplication (MULSS) has a latency of 4-7 cycles and a throughput of one instruction per 1-2 cycles, and a double-precision scalar multiplication (MULSD) has a latency of 5-7 cycles and a throughput of one instruction per 1-2 cycles.

The figures were taken from the Intel Intrinsics Guide.

So, in terms of throughput, multiplication is roughly 10 to 20 times faster (assuming perfectly pipelined instructions); in terms of latency alone, the gap is closer to 3 to 5 times.

I'm ignoring here the fact that you're using C# rather than C/C++ with direct SSE asm/intrinsics, but the point is that yes, division is considerably slower *for the CPU* to execute than multiplication, even on modern CPUs. Whether that difference is visible in the C# execution environment is then a matter of profiling.

MathGeoLib uses this 'multiplication by inverse' form, as do most of the game math libraries I've seen. Note that x / s and x * (1/s) are not arithmetically identical: computing the inverse as a float first introduces one rounding error, and multiplying by it introduces a second, so the result can differ slightly from the directly divided one.
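To make the trade-off concrete, here is a minimal C++ sketch of the two forms for a three-component vector. The names `Vec3`, `DivideDirect`, `DivideViaReciprocal` and `MaxAbsDifference` are made up for this illustration and are not MathGeoLib's API; the same pattern should carry over to C# more or less verbatim.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal vector type, for illustration only.
struct Vec3 { float x, y, z; };

// Straightforward form: three divisions.
inline Vec3 DivideDirect(const Vec3 &v, float s) {
    assert(s != 0.0f);
    return { v.x / s, v.y / s, v.z / s };
}

// 'Multiplication by inverse' form: one division, three multiplications.
inline Vec3 DivideViaReciprocal(const Vec3 &v, float s) {
    assert(s != 0.0f);
    const float inv = 1.0f / s; // rounded once here...
    return { v.x * inv, v.y * inv, v.z * inv }; // ...and once more per multiply
}

// Largest componentwise absolute difference between two vectors,
// handy for checking how far the two forms drift apart.
inline float MaxAbsDifference(const Vec3 &a, const Vec3 &b) {
    const float dx = std::fabs(a.x - b.x);
    const float dy = std::fabs(a.y - b.y);
    const float dz = std::fabs(a.z - b.z);
    return std::fmax(dx, std::fmax(dy, dz));
}
```

When `s` is a power of two, `1.0f / s` is exact and both forms give the same result; for other divisors the results may differ in the last bit or so. As far as I know, this is also why C/C++ compilers generally will not perform this rewrite for you unless you opt in to relaxed floating-point flags.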

> And what other optimizations might I use in general to make my math code blazing fast and efficient?

Note that in C/C++, a single non-inlined function call or a mispredicted 'if' branch can cost as much as, or more than, a single division. However, again, in the context of C#, I recommend profiling the real hotspots of your application to see how large these effects actually are, since that's quite a different context from low-level C code at the assembly/intrinsics level.
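As a rough starting point for that kind of measurement, here is a small C++ timing sketch. The helper names (`TimeMs`, `ScaleByDivision`, `ScaleByReciprocal`) are hypothetical, and a micro-benchmark like this is only indicative: the numbers depend heavily on the compiler, optimization flags, vectorization and CPU, so it is no substitute for profiling the real code. In C#, `System.Diagnostics.Stopwatch` would play the role of `std::chrono` here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs f once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double TimeMs(F &&f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Scales every element with one division per element.
void ScaleByDivision(const std::vector<float> &in, std::vector<float> &out, float s) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] / s;
}

// Scales every element with one division up front and one multiplication per element.
void ScaleByReciprocal(const std::vector<float> &in, std::vector<float> &out, float s) {
    assert(in.size() == out.size());
    const float inv = 1.0f / s;
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] * inv;
}
```

A call such as `TimeMs([&] { ScaleByDivision(in, out, 3.0f); })` on a large array, repeated several times and compared against the `ScaleByReciprocal` variant, gives a first impression of whether the difference matters on your machine.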