-march=pentium3 -mtune=generic -mfpmath=both ?


I'm still confused about what flags to use in MinGW (GCC 4.7.1) to optimize a binary as far as possible -

especially since parts of the GCC documentation are weakly written.

For example, I understand (hopefully correctly)

1) that -march says what instruction set the compiler is restricted to (for example, setting -march=pentium3 makes my binary use only instructions available on a Pentium 3)

2) also I understand that -mtune says which target those instructions are optimized for, for example I can get P3 instructions and optimize them for a Core 2

confusingly, the docs say

"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "

I doubt this is true - does this mean that when choosing -march=pentium3 -mtune=generic the mtune setting is discarded and this is equivalent to

-march=pentium3 -mtune=pentium3? I don't think so (this is confusing)

other questions

1.

I would like to choose a reasonable instruction set that would work on older machines but also work OK on more modern ones. I chose -march=pentium3 as I doubt anyone uses something older than a P3, and I didn't notice a noticeable change when putting something newer here (like -march=core2 - I didn't notice any speedup)

2.

what in general can I still add to this command line to speed things up?

(or throw away some runtime or exception bytes or something like that)

c:\mingw\bin\g++ -O2 -Ofast -w -c transform_triangle_3d.c -funsafe-math-optimizations -mrecip -ffast-math -fno-rtti -fno-exceptions -march=pentium3 -mtune=generic -mfpmath=both
(some flags may be redundant here, but I added them to be sure; I haven't had time to carefully check which are redundant)
I'm using -O2 here as I didn't notice a difference with -O3
I noticed that "-mfpmath=both" sped things up (though the docs say something about it being risky - I didn't understand why), and -ffast-math / -funsafe-math-optimizations also sped things up

1) that -march says what instruction set the compiler is restricted to (for example, setting -march=pentium3 makes my binary use only instructions available on a Pentium 3)

-march sets the minimum compatibility level... in this case it means Pentium III or later.
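
A quick way to see for yourself what a given -march enables is to check the macros GCC predefines for each instruction set; a minimal sketch (build it with different -march values and compare the output):

// check_march.cpp - shows which SIMD instruction sets the chosen -march enables.
// Try:  g++ -march=pentium3 check_march.cpp -o check_march
// vs.:  g++ -march=core2    check_march.cpp -o check_march
#include <cstdio>

int main()
{
#ifdef __MMX__
    std::puts("MMX enabled");
#endif
#ifdef __SSE__
    std::puts("SSE enabled");    // defined with -march=pentium3 and newer
#endif
#ifdef __SSE2__
    std::puts("SSE2 enabled");   // not defined with -march=pentium3
#endif
#ifdef __SSE3__
    std::puts("SSE3 enabled");
#endif
    return 0;
}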


2) also I understand that -mtune says which target those instructions are optimized for, for example I can get P3 instructions and optimize them for a Core 2

confusingly, the docs say

"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "

I doubt this is true - does this mean that when choosing -march=pentium3 -mtune=generic the mtune setting is discarded and this is equivalent to
-march=pentium3 -mtune=pentium3? I don't think so (this is confusing)

Why do you doubt it? It makes perfect sense: -march has priority. If you choose to set the minimum compatibility level, the optimizer will use that when making choices.

1. I would like to choose a reasonable instruction set that would work on older machines but also work OK on more modern ones. I chose -march=pentium3 as I doubt anyone uses something older than a P3, and I didn't notice a noticeable change when putting something newer here (like -march=core2 - I didn't notice any speedup)

While there are millions of pre-PIII machines still going into production, it's unlikely that your game will be running on them (they're things like disk controllers, routers, refrigerators, toasters, and so on). PIII is probably good enough, since it has PAE by default and other improvements like fast DIV, better interlocking, and extended prefetch.

It's also likely that newer architectures don't introduce new abilities that your picooptimization can take advantage of when it comes to something not CPU-bound, like a game.

2. what in general can I still add to this command line to speed things up?
(or throw away some runtime or exception bytes or something like that)

In general, such picooptimization is not going to make one whit of difference in a typical game. What you really need to do is hand-tune some very specific targeted benchmark programs so they show a significant difference between the settings (by not really running the same code), like the magazines and websites do when they're trying to sell you something.

I'm using -O2 here as I didn't notice a difference with -O3

Hardly surprising, since most picooptimizations don't provide much noticeable difference in non-CPU-bound code. -O2 is likely good enough (and definitely better than -O1 or -O0), but -O3 has been known to introduce bad code from time to time, so I always stay away from it.

i noticed that "-mfpmath=both " speeded things up (though docs say something that its dangerous didnt understand why) also (-ffast-math /
-funsafe-math-optimizations also speeded things)

Those switches end up altering the floating-point results. You may lose accuracy, and some results may deviate from strict IEEE-conforming results. If you're doing a lot of repeated floating-point calculations in which such error can propagate quickly, you will not want to choose those options. For the purposes of most games, they're probably OK. Don't enable them when calculating missile trajectories for real-life nuclear warheads. Don't forget GCC has other uses with much stricter requirements than casual game development.
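
The root of the issue is simply that floating-point arithmetic is not associative, and flags like -ffast-math / -funsafe-math-optimizations allow the compiler to reorder it. A tiny sketch of the effect (the reordering here is done by hand just to show it; the flags let GCC do similar things on its own):

// fp_assoc.cpp - floating-point addition is not associative, which is why
// letting the compiler reorder it (-ffast-math, -funsafe-math-optimizations)
// can change results.
#include <cstdio>

int main()
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    float left  = (a + b) + c;   // 0 + 1 -> 1.0
    float right = a + (b + c);   // (b + c) rounds back to -1e8, the 1 is lost -> 0.0

    std::printf("(a+b)+c = %g\n", left);
    std::printf("a+(b+c) = %g\n", right);
    return 0;
}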

I'd say that while it's fun to play with the GCC command-line options and it's a good idea to understand them, they're not really going to give you a lot of optimization oomph. You will get far more bang for your buck playing with algorithms and structuring your code and data to take advantage of on-core caching.
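
As a concrete (made-up) illustration of the data-layout point: if a hot loop only touches positions, storing each attribute in its own contiguous array keeps the cache lines it pulls in full of useful data.

// layout_sketch.cpp - hypothetical sketch of array-of-structs vs struct-of-arrays.
#include <cstddef>
#include <vector>

struct VertexAoS {             // array-of-structs: position, colour and uv interleaved
    float x, y, z;
    float r, g, b, a;
    float u, v;
};

struct VerticesSoA {           // struct-of-arrays: each attribute contiguous in memory
    std::vector<float> x, y, z;
    std::vector<float> r, g, b, a;
    std::vector<float> u, v;
};

// Translating every position with the SoA layout streams only the x/y/z arrays
// through the cache; the AoS layout would drag the colours and uvs along too.
void translate(VerticesSoA& verts, float dx, float dy, float dz)
{
    const std::size_t n = verts.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        verts.x[i] += dx;
        verts.y[i] += dy;
        verts.z[i] += dz;
    }
}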

Also, if you haven't already, you might want to read about the GCC internals to understand more of what's going on under the hood.

Stephen M. Webb
Professional Free Software Developer

Thanks for the answer; I don't know what you mean by pico-optimizations.

When testing my program (a rasterizer), removing all the other optimization switches except -O3 slowed execution from 20 ms to about 31 ms,

then, curiously, changing -O3 to -O1 slowed it only to about 32-33 ms

- so it shows some optimization flags can have a significant effect

(I will answer more a bit later)


Why do you doubt it? It makes perfect sense: -march has priority. If you choose to set the minimum compatibility level, the optimizer will use that when making choices.

The question is: if I set march to pentium3 and then mtune to core2, will it optimize for Pentium 3 or for Core 2? That sentence in the docs can be understood as saying it will optimize for Pentium 3 - and I doubt that, because if so, mtune would be useless there.

As to the rest: I noticed a speedup when hand-crafting procedures, but some of these switches also gave a visible speedup (like I was saying, from 30 -> 20 ms).

march and mtune look similar, and they're quite related. But they're not the same.

march specifies the minimum compatibility level. This means that the compiler won't generate an instruction that is incompatible with that architecture. i.e. if you specify march=pentium2 then SSE can't be used.

With march=pentium3; SSE2 won't be used.

(the exception is if you explicitly use SSE2 intrinsics, for example; then SSE2 code will be generated despite your march option).
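
A tiny made-up example of that exception: with explicit intrinsics, the SSE2 instructions are emitted regardless of march, so the binary needs an SSE2-capable CPU (note that GCC also wants SSE2 enabled for the intrinsics to compile, e.g. -msse2 on top of -march=pentium3):

// sse2_force.cpp - explicit SSE2 intrinsics emit SSE2 code no matter what -march says.
// Build e.g.:  g++ -O2 -march=pentium3 -msse2 sse2_force.cpp -o sse2_force
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdio>

int main()
{
    __m128i a = _mm_set_epi32(1, 2, 3, 4);
    __m128i b = _mm_set_epi32(10, 20, 30, 40);
    __m128i c = _mm_add_epi32(a, b);          // paddd - an SSE2 integer add

    int out[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), c);   // unaligned store, also SSE2
    std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}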

mtune optimizes for the given architecture. To see what that means you need a very low-level understanding of how CPUs work, so it's better to use an example:

In the Yorkfield architecture, xorps, xorpd and pxor are three SSE instructions that all perform a bitwise XOR on xmm registers. They all do the same thing and they're executed by the same execution unit (which afaik lives in the integer unit). The only difference is that xorps takes one less byte to encode. If you tune for Yorkfield, the compiler should always use xorps and never (or almost never) generate xorpd or pxor.

In the Nehalem architecture, xorps and pxor are executed by different execution units (I don't know about xorpd). When working with floating-point instructions (i.e. movaps, addps, etc.) the compiler should use xorps. If it uses pxor, the CPU internally has to move the register data from the floating-point unit to the integer unit, and then back (if another floating-point SSE instruction is used afterwards); there is around a ~1 cycle penalty for moving between units, so using pxor here could end up adding 2-3 cycles of latency.

But when working with integer instructions (i.e. movdqa, paddd, etc.), the compiler should use pxor (despite it needing more bytes to encode). If it uses xorps, the data will be moved between execution units and add cycles of latency as in the floating-point case.

So, in summary, tuning for Yorkfield should always use xorps because it takes fewer bytes to encode (the penalty from moving to and from the integer unit is always there and can't be avoided), and tuning for Nehalem should select between xorps, xorpd and pxor depending on the type of instructions being used on the registers before and after the XOR.

Both architectures support these instructions so march doesn't have a big effect in this case. But one architecture prefers one way of doing things, the other prefers the opposite way.

The same snippet tuned for Nehalem performs slower on Yorkfield CPUs, and likewise code tuned for Yorkfield performs slower on Nehalem. But both of them can run the two versions.

Nehalem supports SSE 4.2; Yorkfield supports up to SSE 4.1. A march setting for Yorkfield guarantees no SSE 4.2 instructions are generated; a march setting for Nehalem might generate code that can't be executed by a Yorkfield.

Another example: AMD K10 CPUs execute shufps faster than a pair of movhlps/movlhps, but the opposite is true for pre-K10 CPUs.

Of course, if your march is too far apart from your mtune, many tuning-optimization opportunities will be missed. E.g. march=pentium3 removes SSE2, and thus selecting between tuning for Yorkfield or Nehalem is quite pointless (not completely though; there could be some minor differences in usage patterns regarding general-purpose registers, etc.).
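
The two flavours of that XOR are also visible at the intrinsics level; here's a toy sketch you can feed to g++ -S to see which encoding comes out (the mtune choice described above is essentially the compiler making this pick for you when it generates the XOR itself):

// xor_domains.cpp - the same 128-bit XOR written in the float domain and in the
// integer domain. Inspect the output of:  g++ -O2 -msse2 -S xor_domains.cpp
#include <emmintrin.h>

// Float-domain XOR: typically emitted as xorps.
__m128 xor_float_domain(__m128 a, __m128 b)
{
    return _mm_xor_ps(a, b);
}

// Integer-domain XOR: typically emitted as pxor.
__m128i xor_int_domain(__m128i a, __m128i b)
{
    return _mm_xor_si128(a, b);
}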

Is it clear now?


It is clear, except for this confusing sentence in the documentation:

"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "

This could be understood as: setting march to pentium3 implies setting (overwriting) mtune to pentium3.

It probably means that march implies the corresponding mtune only when you do not state mtune otherwise; when I state mtune=core2, that implication does not occur and it will not overwrite the setting - but the sentence is a bit confusing.

PS. There is still a bit of confusion about setting mfpmath to both

- I noticed a speedup compared with setting it to sse; I noticed

float < sse < both, so both seems to be the best, but the docs state:

`sse,387' `sse+387' `both' Attempt to utilize both instruction sets at once. This effectively double the amount of available registers and on chips with separate execution units for 387 and SSE the execution resources too. Use this option with care, as it is still experimental, because the GCC register allocator does not model separate functional units well resulting in instable performance.

What can that mean? Is it unsafe in some way?
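
One way to see what -mfpmath actually changes (this doesn't answer why "both" is experimental, it just makes the difference visible) is to compile a small float function to assembly with each setting and diff the output:

// fpmath_probe.cpp - compile with different -mfpmath settings and compare the assembly:
//   g++ -O2 -march=pentium3 -mfpmath=387  -S fpmath_probe.cpp -o fpmath_387.s
//   g++ -O2 -march=pentium3 -mfpmath=sse  -S fpmath_probe.cpp -o fpmath_sse.s
//   g++ -O2 -march=pentium3 -mfpmath=both -S fpmath_probe.cpp -o fpmath_both.s
// x87 code shows up as fld/fmul/faddp, SSE scalar code as mulss/addss.
float dot3(const float* a, const float* b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}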

"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "

I doubt this is true - does this mean that when choosing -march=pentium3 -mtune=generic the mtune setting is discarded and this is equivalent to
-march=pentium3 -mtune=pentium3? I don't think so (this is confusing)

-mtune=core2 -march=pentium3 will generate code for Pentium III, and tune for Pentium III (march overrode the mtune)
-march=pentium3 -mtune=core2 will generate code for Pentium III, and tune for Core 2 (mtune was set after march)

Though note that, like I said, march=pentium3 and mtune=core2 are so far apart that it probably won't make much difference, because many tuning chances will be missed due to missing key instructions.

"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "

I doubt if this is true - does this mean that when choicing -march=pentium3 -mtune=generic the mtune setting is discarded and this is equiwalent of
-march=pentium3 -mtune=pentium3 ? dont think so (this is confusing)

-mtune=core2 -march=pentium3 will generate code for Pentium III, and tune for Pentium III (march overrode the mtune)
-march=pentium3 -mtune=core2 will generate code for Pentium III, and tune for Core 2 (mtune was set after march)

Though note that, like I said, march=pentium3 and mtune=core2 are so far apart that there probably won't make much difference; because many tuning chances will be missed due to missing key instructions.

I was testing with core2 & core2 (on a Core 2) but I didn't notice a speedup.


They aren't magic switches. These fall into the micro-optimization category. They're most noticeable when you're extremely ALU bound.
If you're bandwidth bound, it won't make a difference. Profile. Find your bottlenecks and hotspots and optimize those.
Furthermore, algorithmic optimizations are much more important.

Edit: Plus, if you're compiling for x64, march=pentium3 will be ignored, as the minimum x64 CPU is much newer than the P3.
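
Along the lines of the "profile first" advice, even a crude timer around the suspect code is a useful start; a minimal sketch with a dummy workload standing in for the real per-frame work (a real profiler, e.g. building with -pg and running gprof, gives a much fuller picture):

// time_it.cpp - crude wall-clock timing of a hot function.
// Build:  g++ -O2 -std=c++11 time_it.cpp -o time_it
#include <chrono>
#include <cstdio>

volatile float sink = 0.0f;   // keeps the optimizer from deleting the dummy work

// Dummy workload - replace with your own transform/rasterize call.
void do_frame()
{
    float s = 0.0f;
    for (int i = 0; i < 100000; ++i)
        s += i * 0.001f;
    sink = s;
}

int main()
{
    const int frames = 100;

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < frames; ++i)
        do_frame();
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / frames;
    std::printf("average per frame: %.3f ms\n", ms);
    return 0;
}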


I've hit the limit of my optimizing skills. For example, some transformations in my software rasterization "pipeline" go like this (it's weakly written (I'm afraid it looks medieval) but it works):


 
inline void TransformPointByModelMatrix(float* px, float* py, float* pz)
{
  float wx = *px -  modelPos.x;
  float wy = *py -  modelPos.y;
  float wz = *pz -  modelPos.z;
 
 *px   = ((wx*modelRight.x + wy*modelRight.y + wz*modelRight.z));
 *py   = ((wx*modelUp.x    + wy*modelUp.y    + wz*modelUp.z   ));
 *pz   = ((wx*modelDir.x   + wy*modelDir.y   + wz*modelDir.z  ));
 
  *px  +=  modelPos.x;
  *py  +=  modelPos.y;
  *pz  +=  modelPos.z;
 
 
}
 
inline void TransformPointToEyeSpace(float* px, float* py, float* pz)
{
 
 float wx = *px - cameraPos.x;
 float wy = *py - cameraPos.y;
 float wz = *pz - cameraPos.z;
 
 *px   = ((wx*cameraRight.x + wy*cameraRight.y + wz*cameraRight.z));
 *py   = ((wx*cameraUp.x    + wy*cameraUp.y    + wz*cameraUp.z   ));
 *pz   = ((wx*cameraDir.x   + wy*cameraDir.y   + wz*cameraDir.z  ));
 
 *pz   += camera_depth;
 
}
 
 
 int TransformTriangle3d( Triangle* triangle, unsigned color)
{
  /////
 ////// space TRANSFoRmATIONs bOTH World And EYe
///////
 
     float x1m, y1m, z1m, x1, y1, z1;
     float x2m, y2m, z2m, x2, y2, z2;
     float x3m, y3m, z3m, x3, y3, z3;
 
    /////////////////////////////////
     x1m =     (*triangle).a.x;
     y1m =     (*triangle).a.y;
     z1m =     (*triangle).a.z;
    TransformPointByModelMatrix(&x1m,&y1m,&z1m);
     x1 =  x1m;
     y1 =  y1m;
     z1 =  z1m;
    TransformPointToEyeSpace(&x1,&y1,&z1);
    if( z1<=camera_clip_distance)  return 0;
    ////////////////////////////////////
     x2m =     (*triangle).b.x;
     y2m =     (*triangle).b.y;
     z2m =     (*triangle).b.z;
    TransformPointByModelMatrix(&x2m,&y2m,&z2m);
     x2 =  x2m;
     y2 =  y2m;
     z2 =  z2m;
    TransformPointToEyeSpace(&x2,&y2,&z2);
    if( z2<=camera_clip_distance)  return 0;
    /////////////////////////////////////
     x3m =     (*triangle).c.x;
     y3m =     (*triangle).c.y;
     z3m =     (*triangle).c.z;
    TransformPointByModelMatrix(&x3m,&y3m,&z3m);
     x3 =  x3m;
     y3 =  y3m;
     z3 =  z3m;
    TransformPointToEyeSpace(&x3,&y3,&z3);
    if( z3<=camera_clip_distance)  return 0;
    ///////////////////////////////////
 
  /////////////////////////
 // PrOJecTION to 2D
 //////////////////////////////////
 
   int p1x, p2x, p3x, p1y, p2y, p3y;
 
 
    p1x = frame_size_x_DIV_camera_size_x_MUL_camera_depth*x1/z1 + frame_size_x_DIV_2;
    p2x = frame_size_x_DIV_camera_size_x_MUL_camera_depth*x2/z2 + frame_size_x_DIV_2;
    p3x = frame_size_x_DIV_camera_size_x_MUL_camera_depth*x3/z3 + frame_size_x_DIV_2;
 
    p1y = frame_size_y_DIV_camera_size_y_MUL_camera_depth*y1/z1 + frame_size_y_DIV_2;
    p2y = frame_size_y_DIV_camera_size_y_MUL_camera_depth*y2/z2 + frame_size_y_DIV_2;
    p3y = frame_size_y_DIV_camera_size_y_MUL_camera_depth*y3/z3 + frame_size_y_DIV_2;
 
///////////////////
////////////// 2D CLIPPING
///////////////////////
 
    static int min_x, min_y, max_x, max_y;
 
     min_x = min_int((int)p1x, (int)p2x, (int)p3x);
     min_y = min_int((int)p1y, (int)p2y, (int)p3y);
     max_x = max_int((int)p1x, (int)p2x, (int)p3x);
     max_y = max_int((int)p1y, (int)p2y, (int)p3y);
 
    if(! RectangleOverlapsFrame(min_x, min_y, max_x, max_y) )
       return 0;
 
  (....)
      // later is shading the triangles then rasterization routines
 
}
 

Do you think it can be further optimized? I have no idea.
