march and mtune look similar, and they're closely related, but they're not the same.
march specifies the minimum compatibility level. This means that the compiler won't generate an instruction that the given architecture can't execute. E.g. if you specify march=pentium2 then SSE can't be used.
With march=pentium3, SSE2 won't be used.
(The exception is when you explicitly ask for it, e.g. by using SSE2 intrinsics: depending on the compiler, SSE2 code will be generated despite your march option.)
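If you want to check what a given march actually allows, GCC and Clang predefine macros such as __SSE__, __SSE2__ and __SSE4_2__ for the enabled instruction sets. A minimal sketch (the function names are made up for illustration, and other compilers may not define these macros):

```c
#include <stdio.h>

/* Hypothetical helpers: report which SSE levels the compiler was allowed
   to use for this translation unit. GCC and Clang predefine __SSE__,
   __SSE2__ and __SSE4_2__ according to the -march / -msse* options. */
int built_with_sse(void) {
#if defined(__SSE__)
    return 1;
#else
    return 0;
#endif
}

int built_with_sse2(void) {
#if defined(__SSE2__)
    return 1;
#else
    return 0;
#endif
}

int built_with_sse42(void) {
#if defined(__SSE4_2__)
    return 1;
#else
    return 0;
#endif
}

/* Prints one line summarizing the instruction sets enabled at compile time. */
void print_isa_level(void) {
    printf("SSE: %d, SSE2: %d, SSE4.2: %d\n",
           built_with_sse(), built_with_sse2(), built_with_sse42());
}
```

Calling print_isa_level() from a 32-bit build with -march=pentium3 should report SSE but not SSE2, while -march=nehalem should report all three.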
mtune optimizes for the given architecture without changing which instructions are allowed. Understanding it requires a fairly low-level knowledge of how CPUs work, so it's easier to explain with an example:
In the Yorkfield architecture, xorps, xorpd and pxor are three SSE instructions that all perform a bitwise XOR on xmm registers. They all do the same thing, and they're executed by the same execution unit (which afaik lives in the integer unit). The only difference is that xorps takes one less byte to encode. If you tune for Yorkfield, the compiler should always use xorps and never (or almost never) generate xorpd or pxor.
In the Nehalem architecture, xorps and pxor are executed by different execution units (I don't know about xorpd). When working with floating point instructions (e.g. movaps, addps, etc.) the compiler should use xorps. If it uses pxor, the CPU internally has to move the register data from the floating point unit to the integer unit, and then back again if another floating point SSE instruction is used afterwards. There is a penalty of around 1 cycle for each move between units, so using pxor here could end up adding 2-3 cycles of latency.
But when working with integer instructions (e.g. movdqa, paddd, etc.), the compiler should use pxor (despite it needing more bytes to encode). If it uses xorps, the data will be moved between execution units and add cycles of latency, just as in the floating point case.
So, in summary, tuning for Yorkfield should always use xorps because it takes fewer bytes to encode (the penalty from moving to and from the integer unit is always there and can't be avoided), and tuning for Nehalem should select between xorps, xorpd and pxor depending on the type of instructions being used on the registers before and after the XOR.
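To make the Nehalem case concrete, here's a minimal sketch with SSE intrinsics that keeps the XOR in the same domain as the surrounding instructions. The function names are made up for illustration, and the compiler is still free to pick a different encoding than the intrinsic suggests, so treat the intrinsic as a hint rather than a guarantee:

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */
#endif

/* Float domain: out[i] = -(a[i] + b[i]) for 4 floats.
   The sign flip is a XOR with the sign bit, written with _mm_xor_ps
   (xorps) so it matches the floating point add before it. */
void negate_sum_f32(const float *a, const float *b, float *out) {
#if defined(__SSE2__)
    __m128 sum  = _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 sign = _mm_set1_ps(-0.0f);          /* only the sign bit set */
    _mm_storeu_ps(out, _mm_xor_ps(sum, sign));
#else
    /* Scalar fallback when SSE2 is not enabled at compile time. */
    for (int i = 0; i < 4; ++i) out[i] = -(a[i] + b[i]);
#endif
}

/* Integer domain: out[i] = (a[i] + b[i]) ^ mask for 4 int32 values.
   The XOR is written with _mm_xor_si128 (pxor) so it matches the
   integer add (paddd) before it. */
void add_xor_i32(const int32_t *a, const int32_t *b, int32_t mask,
                 int32_t *out) {
#if defined(__SSE2__)
    __m128i va  = _mm_loadu_si128((const __m128i *)a);
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sum = _mm_add_epi32(va, vb);
    _mm_storeu_si128((__m128i *)out,
                     _mm_xor_si128(sum, _mm_set1_epi32(mask)));
#else
    /* Scalar fallback when SSE2 is not enabled at compile time. */
    for (int i = 0; i < 4; ++i) out[i] = (a[i] + b[i]) ^ mask;
#endif
}
```

Comparing the assembly output (e.g. with -S) under different mtune values is the way to see which XOR variant the compiler really chose.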
Both architectures support these instructions so march doesn't have a big effect in this case. But one architecture prefers one way of doing things, the other prefers the opposite way.
The same snippet tuned for Nehalem performs slower on Yorkfield CPUs, and likewise code tuned for Yorkfield performs slower on Nehalem. But both of them can run the two versions.
Nehalem supports SSE 4.2, while Yorkfield only supports up to SSE 4.1. Setting march for Yorkfield guarantees that no SSE 4.2 instructions are generated, whereas march=nehalem might generate code that can't be executed by Yorkfield.
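If one binary has to run on both CPUs, a common approach is to compile with the lower march and check for the extra instruction set at run time. A minimal sketch, assuming GCC or Clang on x86 (__builtin_cpu_supports is their builtin; the function names and the returned labels are hypothetical):

```c
/* Returns 1 if the CPU we are running on reports SSE 4.2 support,
   0 otherwise (including on compilers/targets without the builtin). */
int cpu_has_sse42(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* safe to call more than once */
    return __builtin_cpu_supports("sse4.2") ? 1 : 0;
#else
    return 0;  /* assumption: treat unknown targets as "no SSE 4.2" */
#endif
}

/* Hypothetical dispatcher: picks a label for the code path to use. */
const char *pick_code_path(void) {
    return cpu_has_sse42() ? "nehalem" : "baseline";
}
```

The SSE 4.2 path itself would still have to live in a function or source file compiled with SSE 4.2 enabled, since the rest of the program is built for the lower march.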
Another example: AMD K10 CPUs execute shufps faster than a pair of movhlps/movlhps, but the opposite is true for pre-K10 CPUs.
Of course, if your march is too far apart from your mtune, many tuning opportunities will be missed. E.g. march=pentium3 removes SSE2, and thus choosing between mtune=yorkfield and mtune=nehalem is quite pointless (not completely though, there could be some minor differences in usage patterns regarding general purpose registers, etc.).
Is it clear now?
Edited by Matias Goldberg, 22 June 2014 - 08:39 AM.