-march=pentium3 -mtune=generic -mfpmath=both ?


20 replies to this topic

#1 fir   Members   -  Reputation: -460


Posted 22 June 2014 - 04:07 AM

I'm still confused about what flags to use in MinGW (GCC 4.7.1) to optimize a binary as far as possible,

especially since the GCC documentation is partly weakly written.

For example, I understand (hopefully correctly):

1) -march says which instruction set the compiler is restricted to (for example, setting -march=pentium3 builds my binary using only instructions available on a Pentium III).

2) -mtune says which target those instructions are optimized for; for example, I can generate P3 instructions and optimize them for a Core 2.

 

Confusingly, the docs say:

 

"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "

 

I doubt this is true. Does it mean that when choosing -march=pentium3 -mtune=generic, the -mtune setting is discarded and this is equivalent to
-march=pentium3 -mtune=pentium3? I don't think so (this is confusing).

 

Other questions:

1.

I would like to choose a reasonable instruction set that works on older machines but also works well on more modern ones. I chose -march=pentium3, as I doubt anyone uses something older than a P3, and I didn't notice any change when putting something newer here (like -march=core2; I saw no speedup).

2.

What else, in general, can I add to this command line to speed things up?
(Or to throw away some runtime or exception bytes, or something like that.)

 

c:\mingw\bin\g++ -O2 -Ofast -w -c transform_triangle_3d.c   -funsafe-math-optimizations -mrecip -ffast-math -fno-rtti -fno-exceptions -march=pentium3 -mtune=generic -mfpmath=both 
 
(Some flags may be redundant here, but I added them to be sure; I haven't had time to carefully check which ones are redundant.)

I'm using -O2 here, as I noticed no difference with -O3.

I noticed that -mfpmath=both sped things up (though the docs say something about it being dangerous, which I didn't understand). -ffast-math and -funsafe-math-optimizations also sped things up.

Edited by fir, 22 June 2014 - 04:08 AM.



#2 Bregma   Crossbones+   -  Reputation: 5248


Posted 22 June 2014 - 07:37 AM

1) -march says which instruction set the compiler is restricted to (for example, setting -march=pentium3 builds my binary using only instructions available on a Pentium III).

-march sets the minimum compatibility level... in this case it means Pentium III or later.


2) -mtune says which target those instructions are optimized for; for example, I can generate P3 instructions and optimize them for a Core 2.
 
Confusingly, the docs say:
 
"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "
 
I doubt this is true. Does it mean that when choosing -march=pentium3 -mtune=generic, the -mtune setting is discarded and this is equivalent to
-march=pentium3 -mtune=pentium3? I don't think so (this is confusing).

Why do you doubt it? It makes perfect sense: -march has priority. If you choose to set the minimum compatibility level, the optimizer will use that when making choices.

1. I would like to choose a reasonable instruction set that works on older machines but also works well on more modern ones. I chose -march=pentium3, as I doubt anyone uses something older than a P3, and I didn't notice any change when putting something newer here (like -march=core2; I saw no speedup).

While there are millions of pre-PIII machines still going into production, it's unlikely that your game will be running on them (they're things like disk controllers, routers, refrigerators, toasters, and so on). PIII is probably good enough, since it has PAE by default and other improvements like fast DIV, better interlocking, and extended prefetch.

It's also likely that newer architectures don't introduce new abilities that your picooptimization can take advantage of when it comes to something not CPU-bound, like a game.

2. What else, in general, can I add to this command line to speed things up?
(Or to throw away some runtime or exception bytes, or something like that.)

In general, such picooptimization is not going to make one whit of difference in a typical game. What you really need to do is hand-tune some very specific targeted benchmark programs so they show significant difference between the settings (by not really running the same code), like the magazines and websites do when they're trying to sell you something.

I'm using -O2 here, as I noticed no difference with -O3.

Hardly surprising, since most picooptimizations don't provide much noticeable difference in non-CPU-bound code. -O2 is likely good enough (and definitely better than -O1 or -O0), but -O3 has been known to introduce bad code from time to time, so I always stay away from it.

I noticed that -mfpmath=both sped things up (though the docs say something about it being dangerous, which I didn't understand). -ffast-math and -funsafe-math-optimizations also sped things up.

Those switches end up altering the floating-point results. You may lose accuracy, and some results may vary from IEEE standards in their higher-order significant digits. If you're doing a lot of repeated floating-point calculations in which such error can propagate quickly, you will not want to choose those options. For the purposes of most games, they're probably OK. Don't enable them when calculating missile trajectories for real-life nuclear warheads. Don't forget GCC has other uses with much stricter requirements than casual game development.
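
A toy illustration of what "altering the results" can look like (a sketch, not from the original thread; with -ffast-math the compiler may reassociate the expression and print 1 instead of the IEEE-correct 0):

#include <stdio.h>

int main(void)
{
    float big = 1e20f, small = 1.0f;
    /* strict IEEE: (big + small) rounds back to big, so the result is 0  */
    /* -ffast-math: may be reassociated to (big - big) + small, giving 1  */
    volatile float r = (big + small) - big;
    printf("%g\n", r);
    return 0;
}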

I'd say that while it's fun to play with the GCC command-line options and it's a good idea to understand them, they're not really going to give you a lot of optimization oomph. You will get far more bang for your buck playing with algorithms and structuring your code and data to take advantage of on-core caching.
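
For instance (a sketch, not from the original thread), simply arranging hot data contiguously often beats any instruction-selection flag:

#include <cstddef>
#include <vector>

// Array-of-structs: a loop that only needs x still drags mass and age
// through the cache, line by line.
struct ParticleAoS { float x, y, z, mass, age; };

// Struct-of-arrays: a position-only pass streams through contiguous floats,
// wasting no cache-line bandwidth on unused fields.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass, age;
};

void advance(ParticlesSoA& p, float dx)
{
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += dx;   // touches only the x array
}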

Also, if you haven't already, you might want to read about the GCC internals to understand more of what's going on under the hood.


Stephen M. Webb
Professional Free Software Developer

#3 fir   Members   -  Reputation: -460


Posted 22 June 2014 - 08:06 AM

Thanks for the answer. I don't know what you'd call picooptimizations, but:

when testing my program (a rasterizer), removing all the optimization switches except -O3 slowed execution from 20 ms to about 31 ms;

then, curiously, changing -O3 to -O1 slowed it only to about 32-33 ms.

So it shows some optimization flags can have a significant effect.

 

(I will answer more a bit later)


Edited by fir, 22 June 2014 - 08:26 AM.


#4 fir   Members   -  Reputation: -460


Posted 22 June 2014 - 08:33 AM

 

Why do you doubt it? It makes perfect sense: -march has priority. If you choose to set the minimum compatibility level, the optimizer will use that when making choices.

 

The question is: if I set -march to pentium3 and then -mtune to core2, will it optimize for Pentium III or for Core 2? The sentence in the docs can be read as saying it will optimize for Pentium III, and I doubt that, because if so, -mtune would be useless there.

As for the rest: I noticed speedups when hand-crafting procedures, but some of these switches also gave a visible speedup (like I was saying, from 30 ms to 20 ms).


Edited by fir, 22 June 2014 - 08:35 AM.


#5 Matias Goldberg   Crossbones+   -  Reputation: 3570


Posted 22 June 2014 - 08:37 AM

-march and -mtune look similar, and they're quite related. But they're not the same.

-march specifies the minimum compatibility level. This means that the compiler won't generate an instruction that is incompatible with that architecture; e.g. if you specify -march=pentium2, then SSE can't be used.

With -march=pentium3, SSE2 won't be used.

(The exception is if you, e.g., explicitly use SSE2 intrinsics; then SSE2 code will be generated despite your -march option.)

 

-mtune optimizes for the given architecture. This needs a very low-level understanding of how CPUs work, so it's better to use an example:

 

On the Yorkfield architecture, xorps, xorpd and pxor are three SSE instructions that all perform a bitwise XOR on xmm registers. They all do the same thing, and they're executed by the same execution unit (which AFAIK lives in the integer unit). The only difference is that xorps takes one less byte to encode. If you tune for Yorkfield, the compiler should always use xorps and never (or almost never) generate xorpd or pxor.

 

On the Nehalem architecture, xorps and pxor are executed by different execution units (I don't know about xorpd). When working with floating-point instructions (i.e. movaps, addps, etc.) the compiler should use xorps. If it uses pxor, the CPU internally has to move the register data from the floating-point unit to the integer unit, and then back (if another floating-point SSE instruction is used afterwards). There is around a 1-cycle penalty for moving between units, so using pxor here could end up adding 2-3 cycles of latency.

But when working with integer instructions (e.g. movdqa, paddd, etc.) the compiler should use pxor (despite it needing more bytes to encode). If it uses xorps, the data will be moved between execution units and add cycles of latency, as in the floating-point case.

 

So, in summary, tuning for Yorkfield should always use xorps because it takes fewer bytes to encode (the penalty from moving to and from the integer unit is always there and can't be avoided), and tuning for Nehalem should select between xorps, xorpd and pxor depending on the type of instructions being used on the registers before and after the XOR.
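
As an illustration (a simplified listing of the register-clearing idiom, with the tuning preferences above as comments):

xorps xmm0, xmm0   ; FP-domain XOR, shortest encoding - the Yorkfield choice
pxor  xmm0, xmm0   ; integer-domain XOR - the Nehalem choice in integer code
addps xmm0, xmm1   ; on Nehalem, an FP op right after pxor pays the
                   ; unit-to-unit (bypass) latency described above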

 

Both architectures support these instructions, so -march doesn't have a big effect in this case. But one architecture prefers one way of doing things, and the other prefers the opposite way.

The same snippet tuned for Nehalem performs slower on Yorkfield CPUs, and likewise code tuned for Yorkfield performs slower on Nehalem. But both of them can run the two versions.

 

Nehalem supports SSE 4.2; Yorkfield supports up to SSE 4.1. -march for Yorkfield guarantees no SSE 4.2 instructions are emitted; -march=nehalem might generate code that can't be executed by a Yorkfield.

 

Another example: AMD K10 CPUs execute shufps faster than a pair of movhlps/movlhps, but the opposite is true for pre-K10 CPUs.

 

Of course, if your -march is too far apart from your -mtune, many tuning opportunities will be missed; e.g. -march=pentium3 removes SSE2, and thus selecting between -mtune=yorkfield and -mtune=nehalem is quite pointless (not completely, though; there could be some minor differences in usage patterns regarding general-purpose registers, etc.).
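
One way to see what a tuning change actually does is to diff the generated assembly (a sketch; the file name is a placeholder, and which CPU names your GCC accepts depends on its version):

g++ -O2 -march=pentium3 -mtune=generic  -S kernel.c -o kernel_generic.s
g++ -O2 -march=pentium3 -mtune=pentium3 -S kernel.c -o kernel_p3.s
diff kernel_generic.s kernel_p3.s

If the two listings are identical for your hot function, the tuning switch was never going to matter for it.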

 

Is it clear now?


Edited by Matias Goldberg, 22 June 2014 - 08:39 AM.


#6 fir   Members   -  Reputation: -460


Posted 22 June 2014 - 08:46 AM

-march and -mtune look similar, and they're quite related. [...] Is it clear now?

 

It is clear, except for this confusing sentence in the documentation:

 

"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "

 

This could be understood as: setting -march to pentium3 implies setting (overwriting) -mtune to pentium3.

It probably means that -march implies the -mtune setting only when you don't state -mtune otherwise; when I set -mtune to core2, the implication does not occur and it will not overwrite the setting. But the sentence is a bit confusing.

 

PS: There is still a bit of confusion about setting -mfpmath to both. I noticed a speedup over setting it to sse; I saw float < sse < both, so both seems to be the best. But the docs state:

 

"`sse,387' `sse+387' `both' Attempt to utilize both instruction sets at once. This effectively doubles the amount of available registers, and on chips with separate execution units for 387 and SSE the execution resources too. Use this option with care, as it is still experimental, because the GCC register allocator does not model separate functional units well, resulting in unstable performance."

 

What can that mean? Is it unsafe in some way?


Edited by fir, 22 June 2014 - 08:54 AM.


#7 Matias Goldberg   Crossbones+   -  Reputation: 3570


Posted 22 June 2014 - 08:50 AM

"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "
 
I doubt this is true. Does it mean that when choosing -march=pentium3 -mtune=generic, the -mtune setting is discarded and this is equivalent to
-march=pentium3 -mtune=pentium3? I don't think so (this is confusing).

-mtune=core2 -march=pentium3 will generate code for Pentium III, and tune for Pentium III (march overrode the mtune)
-march=pentium3 -mtune=core2 will generate code for Pentium III, and tune for Core 2 (mtune was set after march)

Though note that, like I said, -march=pentium3 and -mtune=core2 are so far apart that it probably won't make much difference, because many tuning chances will be missed due to missing key instructions.
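
You can also ask GCC which values it actually settled on (a sketch; the exact output format varies between GCC versions, and on plain Windows you'd filter with findstr instead of grep):

g++ -Q --help=target -march=pentium3 -mtune=core2 | grep -E "march|mtune"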

#8 fir   Members   -  Reputation: -460


Posted 22 June 2014 - 08:58 AM

 

"-march=cpu-type Generate instructions for the machine type cpu-type. The choices for cpu-type are the same as for -mtune. Moreover, specifying -march=cpu-type implies -mtune=cpu-type. "
 
I doubt this is true. Does it mean that when choosing -march=pentium3 -mtune=generic, the -mtune setting is discarded and this is equivalent to
-march=pentium3 -mtune=pentium3? I don't think so (this is confusing).

-mtune=core2 -march=pentium3 will generate code for Pentium III, and tune for Pentium III (march overrode the mtune)
-march=pentium3 -mtune=core2 will generate code for Pentium III, and tune for Core 2 (mtune was set after march)

Though note that, like I said, -march=pentium3 and -mtune=core2 are so far apart that it probably won't make much difference, because many tuning chances will be missed due to missing key instructions.

 

I was testing with core2 & core2 (on a Core 2), but I noticed no speedup.



#9 Matias Goldberg   Crossbones+   -  Reputation: 3570


Posted 22 June 2014 - 09:22 AM

I was testing with core2 & core2 (on a Core 2), but I noticed no speedup.

They aren't magic switches. These fall into the micro-optimization category. They're most noticeable when you're extremely ALU-bound.
If you're bandwidth-bound, they won't make a difference. Profile. Find your bottlenecks and hotspots and optimize those.
Furthermore, algorithmic optimizations are much more important.

Edit: Plus, if you're compiling for x64, -march=pentium3 will be ignored, as the minimum x64 CPU is much newer than the P3.

Edited by Matias Goldberg, 22 June 2014 - 09:35 AM.


#10 fir   Members   -  Reputation: -460


Posted 22 June 2014 - 09:41 AM

 

They aren't magic switches. These fall into the micro-optimization category. [...] Profile. Find your bottlenecks and hotspots and optimize those.

 

 

I've hit the limit of my optimizing skills. For example, some transformations in my software rasterization "pipeline" go like this (it's weakly written (I'm afraid it looks like the Middle Ages) but it works):

 

 
inline void TransformPointByModelMatrix(float* px, float* py, float* pz)
{
  float wx = *px -  modelPos.x;
  float wy = *py -  modelPos.y;
  float wz = *pz -  modelPos.z;
 
 *px   = ((wx*modelRight.x + wy*modelRight.y + wz*modelRight.z));
 *py   = ((wx*modelUp.x    + wy*modelUp.y    + wz*modelUp.z   ));
 *pz   = ((wx*modelDir.x   + wy*modelDir.y   + wz*modelDir.z  ));
 
  *px  +=  modelPos.x;
  *py  +=  modelPos.y;
  *pz  +=  modelPos.z;
 
 
}
 
inline void TransformPointToEyeSpace(float* px, float* py, float* pz)
{
 
 float wx = *px - cameraPos.x;
 float wy = *py - cameraPos.y;
 float wz = *pz - cameraPos.z;
 
 *px   = ((wx*cameraRight.x + wy*cameraRight.y + wz*cameraRight.z));
 *py   = ((wx*cameraUp.x    + wy*cameraUp.y    + wz*cameraUp.z   ));
 *pz   = ((wx*cameraDir.x   + wy*cameraDir.y   + wz*cameraDir.z  ));
 
 *pz   += camera_depth;
 
}
 
 
 int TransformTriangle3d( Triangle* triangle, unsigned color)
{
  /////
 ////// space TRANSFoRmATIONs bOTH World And EYe
///////
 
     float x1m, y1m, z1m, x1, y1, z1;
     float x2m, y2m, z2m, x2, y2, z2;
     float x3m, y3m, z3m, x3, y3, z3;
 
    /////////////////////////////////
     x1m =     (*triangle).a.x;
     y1m =     (*triangle).a.y;
     z1m =     (*triangle).a.z;
    TransformPointByModelMatrix(&x1m,&y1m,&z1m);
     x1 =  x1m;
     y1 =  y1m;
     z1 =  z1m;
    TransformPointToEyeSpace(&x1,&y1,&z1);
    if( z1<=camera_clip_distance)  return 0;
    ////////////////////////////////////
     x2m =     (*triangle).b.x;
     y2m =     (*triangle).b.y;
     z2m =     (*triangle).b.z;
    TransformPointByModelMatrix(&x2m,&y2m,&z2m);
     x2 =  x2m;
     y2 =  y2m;
     z2 =  z2m;
    TransformPointToEyeSpace(&x2,&y2,&z2);
    if( z2<=camera_clip_distance)  return 0;
    /////////////////////////////////////
     x3m =     (*triangle).c.x;
     y3m =     (*triangle).c.y;
     z3m =     (*triangle).c.z;
    TransformPointByModelMatrix(&x3m,&y3m,&z3m);
     x3 =  x3m;
     y3 =  y3m;
     z3 =  z3m;
    TransformPointToEyeSpace(&x3,&y3,&z3);
    if( z3<=camera_clip_distance)  return 0;
    ///////////////////////////////////
 
  /////////////////////////
 // PrOJecTION to 2D
 //////////////////////////////////
 
   int p1x, p2x, p3x, p1y, p2y, p3y;
 
 
    p1x = frame_size_x_DIV_camera_size_x_MUL_camera_depth*x1/z1 + frame_size_x_DIV_2;
    p2x = frame_size_x_DIV_camera_size_x_MUL_camera_depth*x2/z2 + frame_size_x_DIV_2;
    p3x = frame_size_x_DIV_camera_size_x_MUL_camera_depth*x3/z3 + frame_size_x_DIV_2;
 
    p1y = frame_size_y_DIV_camera_size_y_MUL_camera_depth*y1/z1 + frame_size_y_DIV_2;
    p2y = frame_size_y_DIV_camera_size_y_MUL_camera_depth*y2/z2 + frame_size_y_DIV_2;
    p3y = frame_size_y_DIV_camera_size_y_MUL_camera_depth*y3/z3 + frame_size_y_DIV_2;
 
///////////////////
////////////// 2D CLIPPING
///////////////////////
 
    static int min_x, min_y, max_x, max_y;
 
     min_x = min_int((int)p1x, (int)p2x, (int)p3x);
     min_y = min_int((int)p1y, (int)p2y, (int)p3y);
     max_x = max_int((int)p1x, (int)p2x, (int)p3x);
     max_y = max_int((int)p1y, (int)p2y, (int)p3y);
 
    if(! RectangleOverlapsFrame(min_x, min_y, max_x, max_y) )
       return 0;
 
  (....)
      // later is shading the triangles then rasterization routines
 
}
 

 
Do you think it can be further optimized? I've got no idea.


#11 phantom   Moderators   -  Reputation: 7399


Posted 22 June 2014 - 12:56 PM

Did you profile to find out that those two functions are the bottleneck?

#12 fir   Members   -  Reputation: -460


Posted 22 June 2014 - 01:48 PM

Did you profile to find out that those two functions are the bottleneck?

 

They consume exactly half the frame time, at least at smaller resolutions like 500p; at larger ones, when rasterization spans more pixels, rasterization begins to take over 50%.

 

More exact profiling:

 

time before transformations - 2 ms

transformations - 6 ms

projection - 1 ms

minmax block - 4 ms

vertex depth test - 1 ms

vertex shading - 8 ms

following rasterisation - -2 ms

 

It seems like ifs are expensive, as the stages with them are heavy.

 

This vertex shading is shitty, but I don't know how to improve it:

 
    float s1 = dot(normal, lightDir1);
    float s2 = dot(normal, lightDir2);
    float s3 = dot(normal, lightDir3);
    float s4 = dot(normal, lightDir4);

     if(s1<0) s1=0;
     if(s2<0) s2=0;
     if(s3<0) s3=0;
     if(s4<0) s4=0;
 
    float b = (color&0x000000ff);
    float g = (color&0x0000ff00)>>8;
    float r = (color&0x00ff0000)>>16;
 
     r*= .1 + (s1*lightColor1.x + s2*lightColor2.x + s3*lightColor3.x+ s4*lightColor4.x);
     g*= .1 +(s1*lightColor1.y + s2*lightColor2.y + s3*lightColor3.y+ s4*lightColor4.y);
     b*= .1 + (s1*lightColor1.z + s2*lightColor2.z + s3*lightColor3.z+ s4*lightColor4.z);
 
    int rr = r;
    int gg = g;
    int bb = b;
    if(rr<0) rr=0;
    if(gg<0) gg=0;
    if(bb<0) bb=0;
    if(rr>255) rr=255;
    if(gg>255) gg=255;
    if(bb>255) bb=255;
    color = bb+(gg<<8)+(rr<<16);

Does anyone maybe know how to avoid those ifs?
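
One branchless possibility, as a sketch (fminf/fmaxf are from <math.h>; clamp255 is just an illustrative helper, and with SSE plus the -ffast-math already in your flags these usually compile to minss/maxss rather than branches):

#include <math.h>

/* clamp one color channel to [0,255] without branches */
static inline int clamp255(float v)
{
    return (int)fminf(fmaxf(v, 0.0f), 255.0f);
}

/* usage with the r, g, b computed above:
     color = clamp255(b) + (clamp255(g) << 8) + (clamp255(r) << 16);
   and likewise s1 = fmaxf(s1, 0.0f); etc. for the light terms */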
 
Yet curiously, when I put a return after the shading but before the rasterization, it takes LONGER; this is the first time in my life I've seen code with negative execution time (and there is a lot of it, even a couple of divisions, etc.).

Edited by fir, 22 June 2014 - 04:13 PM.


#13 fir   Members   -  Reputation: -460


Posted 23 June 2014 - 02:58 AM

PS:

I managed to improve this by 1 ms (5%) by rewriting the min/max bounding-rectangle clipping as simple vertex testing (the previous approach was silly of me):

    int v1vis = PixelIsInFrame(p1x, p1y);
    int v2vis = PixelIsInFrame(p2x, p2y);
    int v3vis = PixelIsInFrame(p3x, p3y);
 
    int none_in_frame = !v1vis && !v2vis && !v3vis;
    if(none_in_frame) return 0;
 
    int all_in_frame = v1vis && v2vis && v3vis;
 
    if(all_in_frame)
    {
     //....
    }

Not much, but always nice. Profiling shows this damn lazy shading eats half of the total (10 ms), while rasterization itself takes 0 ms; I don't quite understand this.

Edited by fir, 23 June 2014 - 02:58 AM.


#14 phantom   Moderators   -  Reputation: 7399


Posted 23 June 2014 - 03:45 AM

How are you profiling things?

#15 fir   Members   -  Reputation: -460


Posted 23 June 2014 - 06:12 AM

How are you profiling things?

 

By QueryPerformanceCounter...

 

Yet I did do one thing.

There was a big dilemma: should I store the world-space-transformed triangle (which is immediately transformed again into eye space), as I'm doing above, to use later in vertex shading, or not store it (in 36 bytes of local variables) and recompute it a second time there?

 

I tested it, and it showed the same speed (storing versus recalculating), though it is hard to say, as it is hard for me to observe changes smaller than about 1 ms, or changes across more cases. But it looks like about the same execution time.

 

But not storing the intermediate world-space position gives the opportunity to multiply those two transformations together and hoist the coefficients out of the loop.

 

I doubted it would help, as those operations were not complex (a couple of muls and adds), and it was also a big amount of work editing the coefficients by hand; it took me probably about 3 hours of hard work moving those variables around in a text editor :C

void f()
{
  npx_px =  +  (modelRight.x*cameraRight.x
     +   modelUp.x*cameraRight.y
       +  modelDir.x*cameraRight.z);
 
  npx_py =   (modelRight.y*cameraRight.x
     + modelUp.y*cameraRight.y
      + modelDir.y*cameraRight.z);
 
  npx_pz =   (+  modelRight.z*cameraRight.x
        +  modelUp.z*cameraRight.y
        +  modelDir.z*cameraRight.z);
  npx_tail =
  -  modelPos.x*modelRight.x*cameraRight.x
    -  modelPos.y*modelRight.y*cameraRight.x
      -  modelPos.z*modelRight.z*cameraRight.x
       + modelPos.x*cameraRight.x
        - cameraPos.x*cameraRight.x
           -  modelPos.x*modelUp.x*cameraRight.y
           -  modelPos.y*modelUp.y*cameraRight.y
               -  modelPos.z*modelUp.z*cameraRight.y
               + modelPos.y*cameraRight.y
               - cameraPos.y*cameraRight.y
                  -  modelPos.x*modelDir.x*cameraRight.z
                      -  modelPos.y*modelDir.y*cameraRight.z
                        -  modelPos.z*modelDir.z*cameraRight.z
                       + modelPos.z*cameraRight.z
                       - cameraPos.z*cameraRight.z;
 
 
  npy_px =   modelRight.x*cameraUp.x  + modelUp.x*cameraUp.y  + modelDir.x*cameraUp.z;
  npy_py =  modelRight.y*cameraUp.x + modelUp.y*cameraUp.y + modelDir.y*cameraUp.z;
  npy_pz =   modelRight.z*cameraUp.x +  modelUp.z*cameraUp.y  + modelDir.z*cameraUp.z;
 
   npy_tail =
 
 -  modelPos.x*modelRight.x*cameraUp.x
  -  modelPos.y*modelRight.y*cameraUp.x
   -  modelPos.z*modelRight.z*cameraUp.x
    + modelPos.x*cameraUp.x
     - cameraPos.x*cameraUp.x
        -  modelPos.x*modelUp.x*cameraUp.y
          -  modelPos.y*modelUp.y*cameraUp.y
            -  modelPos.z*modelUp.z*cameraUp.y
              + modelPos.y*cameraUp.y
               - cameraPos.y*cameraUp.y
                -  modelPos.x*modelDir.x*cameraUp.z
                  -  modelPos.y*modelDir.y*cameraUp.z
                   -  modelPos.z*modelDir.z*cameraUp.z
                     + modelPos.z*cameraUp.z
                      - cameraPos.z*cameraUp.z   ;
 
 
  npz_px =  modelRight.x*cameraDir.x
        +  modelUp.x*cameraDir.y
                 +  modelDir.x*cameraDir.z;
 
  npz_py =    modelRight.y*cameraDir.x
         + modelUp.y*cameraDir.y
                 + modelDir.y*cameraDir.z;
 
  npz_pz =       modelRight.z*cameraDir.x
          +  modelUp.z*cameraDir.y
                 + modelDir.z*cameraDir.z;
 
  npz_tail =   -  modelPos.x*modelRight.x*cameraDir.x
    -  modelPos.y*modelRight.y*cameraDir.x
      -  modelPos.z*modelRight.z*cameraDir.x
       + modelPos.x*cameraDir.x
        - cameraPos.x*cameraDir.x
          -  modelPos.x*modelUp.x*cameraDir.y
         -  modelPos.y*modelUp.y*cameraDir.y
           -  modelPos.z*modelUp.z*cameraDir.y
            + modelPos.y*cameraDir.y
            - cameraPos.y*cameraDir.y
                   -  modelPos.x*modelDir.x*cameraDir.z
                  -  modelPos.y*modelDir.y*cameraDir.z
                 -  modelPos.z*modelDir.z*cameraDir.z
                 + modelPos.z*cameraDir.z
                  - cameraPos.z*cameraDir.z  ;
 
}

But finally it showed that I gained 3 ms (that's 15% at low res; at higher res it's only 7%), so with these two changes it dropped from 20 ms to 16 ms.

Not very noticeable to the naked eye, but nice anyway. (Besides, all my code is slow, but this is for education; how to improve it with cache methods is still a mystery to me.)

 

 

 

Edited by fir, 23 June 2014 - 06:13 AM.


#16 Bacterius   Crossbones+   -  Reputation: 9066


Posted 23 June 2014 - 06:27 AM


By QueryPerformanceCounter...

 

Uhh, you're going to have tons of latent cache effects if you just time random portions of code. Just use a sampling profiler, it will tell you exactly which functions your program spends the most time in by directly sampling the instruction pointer, minus the voodoo and uncertainty. You cannot guess performance by eyeballing how many multiplications you're doing or how many variables you're using in your code, hardware doesn't work that way anymore (though perhaps it still might for you, I don't know what you're running on...)

 

 

 


I doubted it would help, as those operations were not complex

 

Then stop doubting - profile. With a real profiler, not a microbenchmark. Main thing is a profiler does a better job isolating actual realistic function runtimes, timing alone is very dependent on context (recently executed instructions and so on) so any minuscule gain you observe is usually illusory (or a cognitive bias) and will likely disappear the next time you refactor some code, or even reboot.
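
On MinGW, a minimal version of that workflow might look like this (a sketch; "rasterizer" and main.cpp are placeholder names, and gprof ships with MinGW's binutils):

g++ -O2 -pg main.cpp -o rasterizer.exe
rasterizer.exe                            (run a representative workload; writes gmon.out)
gprof rasterizer.exe gmon.out > profile.txt

The flat profile at the top of profile.txt ranks functions by the share of samples that landed in them, which is exactly the per-function picture that manual timers struggle to give.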

 

 

 


it took me probably about 3 hours of hard work moving those variables around in a text editor :C

 

You spent three hours renaming and moving variables around? I hope that is just the forum acting up because, no offense, but the indentation is completely incoherent.


The slowsort algorithm is a perfect illustration of the multiply and surrender paradigm, which is perhaps the single most important paradigm in the development of reluctant algorithms. The basic multiply and surrender strategy consists in replacing the problem at hand by two or more subproblems, each slightly simpler than the original, and continue multiplying subproblems and subsubproblems recursively in this fashion as long as possible. At some point the subproblems will all become so simple that their solution can no longer be postponed, and we will have to surrender. Experience shows that, in most cases, by the time this point is reached the total work will be substantially higher than what could have been wasted by a more direct approach.

 

- Pessimal Algorithms and Simplexity Analysis


#17 fir   Members   -  Reputation: -460


Posted 23 June 2014 - 06:59 AM

 


Then stop doubting - profile. With a real profiler, not a microbenchmark. [...] You spent three hours renaming and moving variables around?

 

 

1) To profile you must rewrite, and sometimes that takes 3 hours ;/

2) (The roughly 3 hours were for composing two "rotation + translation" transformations into one with precomputed factors.)

3) I'm not profiling out of context; I'm looking at frame times.

I like to optimize, though I'm not good at it, so any hints or ideas about what could still help here would be very welcome. I still need to do the test with tiled rasterization; more info on that method, or in general on how to use the cache here, would be welcome.


Edited by fir, 23 June 2014 - 07:01 AM.


#18 Madhed   Crossbones+   -  Reputation: 3077


Posted 23 June 2014 - 07:51 AM

inline void TransformPointByModelMatrix(float* px, float* py, float* pz)

 

Um... you supply the vector as three separate pointers?

Something is telling me that this has to be extremely slow.

 

try changing the signature to

void TransformPointByModelMatrix(float* p)

and use the array element operator [] to access the separate elements and see if it makes any difference.
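
For illustration, the rewritten function might look like this (a sketch; it assumes the same modelPos/modelRight/modelUp/modelDir globals as the code above, with p pointing at three contiguous floats x, y, z):

inline void TransformPointByModelMatrix(float* p)
{
    float wx = p[0] - modelPos.x;
    float wy = p[1] - modelPos.y;
    float wz = p[2] - modelPos.z;

    // same math as the three-pointer version: rotate, then translate back
    p[0] = wx*modelRight.x + wy*modelRight.y + wz*modelRight.z + modelPos.x;
    p[1] = wx*modelUp.x    + wy*modelUp.y    + wz*modelUp.z    + modelPos.y;
    p[2] = wx*modelDir.x   + wy*modelDir.y   + wz*modelDir.z   + modelPos.z;
}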



#19 Ohforf sake   Members   -  Reputation: 1832


Posted 23 June 2014 - 09:30 AM

2) (The roughly 3 hours were for composing two "rotation + translation" transformations into one with precomputed factors.)


Let me get this straight, it took you 3 hours to change
glm::mat4 projection = ...;
glm::mat4 view = ...;
glm::mat4 model = ...;

//....
for every vertex:
    glm::vec4 clipSpaceVertex = projection * view * model * modelSpaceVertex;
into
glm::mat4 projection = ...;
glm::mat4 view = ...;
glm::mat4 model = ...;

glm::mat4 MVP = projection * view * model;


//....

for every vertex:
    glm::vec4 clipSpaceVertex = MVP * modelSpaceVertex;
seriously?

#20 fir   Members   -  Reputation: -460


Posted 23 June 2014 - 10:42 AM



 


Um... you supply the vector as three separate pointers? Something is telling me that this has to be extremely slow. [...]

 

 

Those functions are inlined, though shade() and rasterize() are not, and I pass about 9 floats to them. But I'm not expecting much difference from inlining (based on some previous tests I did with such things).

 

The thing is, if I pass the triangle by value, access to it is hardcoded, like [0x12345678], [0x1234567c]; when I pass a pointer to the triangle, or an array index, the access is like [triangle+0], [triangle+4], or [triangle*i+0]. In assembly all this addressing probably looks the same, since those first numbers I gave are in reality also addressed via EBP rather than hardcoded, and such addressing modes probably run at the same speed (maybe a difference of a cycle or two; I'm not sure).

So finally I assume (maybe very slightly wrongly) that the access times for

(*p).x
p[i].x
x // where x is on the stack

are nearly the same. (Maybe array and structure access could be a bit slower, as the i or p addresses must be kept in some register; the access to x is also based on the stack pointer, but that would be kept anyway. I may be slightly wrong here; someone could correct me.)

 

There is also the question of copying data by push. This could probably be a much bigger slowdown than those previous addressing slowdowns (as, I think, something called hundreds of thousands of times per frame adds to the general RAM throughput limit, to the tune of 100k*36 = 3.6 MB pushed on the stack per frame; I am not sure whether this kind of write counts the same way as writing to a regular array, but probably*). But in most cases I need such variable-storage clonings anyway.

 

* A very interesting question; could someone check this? Say I've got two cases:

for(i=0;i<100*1000; i++)
      a36byteStruct[i] = some36byteStruct;   // copying 36 bytes * 100k = 3.6 MB

for(i=0;i<100*1000; i++)
      f(a,b,c,d,e,g,h,k,m);                  // pushing 36 bytes on the stack * 100k = 3.6 MB

Do both of these cases count toward the memory-throughput limit in the same way? I suspect yes, and that calling a function can be counted like copying a structure, but I'm not 100% sure.




