-march=pentium3 -mtune=generic -mfpmath=both ?

Started by
19 comments, last by Tribad 9 years, 9 months ago
Did you profile to find out that those two functions are the bottleneck?


They consume exactly half of the frame time, at least at smaller resolutions like 500p. At larger resolutions, where the triangles span more pixels, rasterisation begins to take over 50%.

More exact profiling:

time before transformations: 2 ms
transformations: 6 ms
projection: 1 ms
minmax block: 4 ms
vertex depth test: 1 ms
vertex shading: 8 ms
following rasterisation: -2 ms

It seems like ifs are expensive, as the stages that contain them are the heavy ones.

This vertex shading is shitty, but I don't know how to improve it:


 
    float s1 = dot(normal, lightDir1);
    float s2 = dot(normal, lightDir2);
    float s3 = dot(normal, lightDir3);
    float s4 = dot(normal, lightDir4);

     if(s1<0) s1*=0;
     if(s2<0) s2*=0;
     if(s3<0) s3*=0;
     if(s4<0) s4*=0;
 
    float b = (color&0x000000ff);
    float g = (color&0x0000ff00)>>8;
    float r = (color&0x00ff0000)>>16;
 
     r*= .1 + (s1*lightColor1.x + s2*lightColor2.x + s3*lightColor3.x+ s4*lightColor4.x);
     g*= .1 +(s1*lightColor1.y + s2*lightColor2.y + s3*lightColor3.y+ s4*lightColor4.y);
     b*= .1 + (s1*lightColor1.z + s2*lightColor2.z + s3*lightColor3.z+ s4*lightColor4.z);
 
    int rr = r;
    int gg = g;
    int bb = b;
    if(rr<0) rr=0;
    if(gg<0) gg=0;
    if(bb<0) bb=0;
    if(rr>255) rr=255;
    if(gg>255) gg=255;
    if(bb>255) bb=255;
    color = bb+(gg<<8)+(rr<<16);

Does anyone know how to avoid those ifs?
Curiously, when I put a return after the shading but before the rasterization, the frame takes LONGER. This is the first time in my life I have seen code with negative execution time (and there is a lot of code there, even a couple of divisions etc.).
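
One way the ifs in the shading code could be avoided is to express each clamp as a min/max select. The following is only a sketch under assumptions: `Vec3`, `dot3`, `max_f`, `min_f`, `clamp_channel` and `shade_vertex` are hypothetical names, the four lights are passed as arrays, and the ambient term is `0.1f` rather than the double `.1` of the original. It clamps the float before truncating instead of after, which gives the same result for finite inputs. Compilers can often turn these ternaries into min/max or conditional-move instructions, but that depends on the flags and the target, so it needs measuring.

```c
#include <stdint.h>

typedef struct { float x, y, z; } Vec3;

static float dot3(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

/* Select-style min/max: no data-dependent jump is required to evaluate them. */
static float max_f(float a, float b) { return a > b ? a : b; }
static float min_f(float a, float b) { return a < b ? a : b; }

/* Clamp to [0, 255] first, then truncate toward zero like the original int cast. */
static uint32_t clamp_channel(float v)
{
    return (uint32_t)min_f(max_f(v, 0.0f), 255.0f);
}

/* Hypothetical wrapper around the shading code from the post. */
uint32_t shade_vertex(uint32_t color, Vec3 normal,
                      const Vec3 lightDir[4], const Vec3 lightColor[4])
{
    /* 0.1 ambient term, as in the original code */
    float lr = 0.1f, lg = 0.1f, lb = 0.1f;

    for (int i = 0; i < 4; i++) {
        /* replaces "if(s<0) s*=0;" */
        float s = max_f(dot3(normal, lightDir[i]), 0.0f);
        lr += s * lightColor[i].x;
        lg += s * lightColor[i].y;
        lb += s * lightColor[i].z;
    }

    float b = (float)(color & 0xffu) * lb;
    float g = (float)((color >> 8) & 0xffu) * lg;
    float r = (float)((color >> 16) & 0xffu) * lr;

    return clamp_channel(b) | (clamp_channel(g) << 8) | (clamp_channel(r) << 16);
}
```

Since the light factors are never negative here, the lower clamp on the channels is redundant and could be dropped; it is kept only to mirror the original code.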

PS

I managed to improve this by 1 ms (5%) by rewriting the min/max bounding-rectangle clipping as plain per-vertex testing (the previous version was silly of me):


    int v1vis = PixelIsInFrame(p1x, p1y);
    int v2vis = PixelIsInFrame(p2x, p2y);
    int v3vis = PixelIsInFrame(p3x, p3y);
 
    int none_in_frame = !v1vis && !v2vis && !v3vis;
    if(none_in_frame) return 0;
 
    int all_in_frame = v1vis && v2vis && v3vis;
 
    if(all_in_frame)
    {
     //....
    }

Not much, but always nice. Profiling shows this damn lazy shading eats half of it all (10 ms) and rasterization itself takes 0 ms. I don't quite understand this.
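
A caveat on the vertex test above: a triangle whose three vertices are all outside the frame can still overlap it (for example a large triangle surrounding the screen), so rejecting on `none_in_frame` alone may drop visible triangles. A common alternative is to reject only when all three vertices lie beyond the same frame edge. The sketch below assumes the frame is the rectangle [0, width) x [0, height); `outcode`, `classify_triangle` and the `TRI_*` constants are hypothetical names, and the body of `PixelIsInFrame` is not shown in the post.

```c
enum { OUT_LEFT = 1, OUT_RIGHT = 2, OUT_TOP = 4, OUT_BOTTOM = 8 };
enum { TRI_REJECT = 0, TRI_INSIDE = 1, TRI_CLIP = 2 };

/* One bit per frame edge that the point lies beyond. */
static int outcode(int x, int y, int width, int height)
{
    int code = 0;
    if (x < 0)       code |= OUT_LEFT;
    if (x >= width)  code |= OUT_RIGHT;
    if (y < 0)       code |= OUT_TOP;
    if (y >= height) code |= OUT_BOTTOM;
    return code;
}

int classify_triangle(int p1x, int p1y, int p2x, int p2y, int p3x, int p3y,
                      int width, int height)
{
    int c1 = outcode(p1x, p1y, width, height);
    int c2 = outcode(p2x, p2y, width, height);
    int c3 = outcode(p3x, p3y, width, height);

    /* All three vertices beyond the same edge: certainly invisible. */
    if (c1 & c2 & c3) return TRI_REJECT;

    /* No vertex beyond any edge: fully inside, no per-pixel bounds checks needed. */
    if ((c1 | c2 | c3) == 0) return TRI_INSIDE;

    /* Anything else may be partly visible and needs clipping or bounds checks. */
    return TRI_CLIP;
}
```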
How are you profiling things?


By QueryPerformanceCounter ...

Yet I did one thing.

There was a big dilemma: should I store the world-space transformed triangle (which is then immediately transformed into eye space), as I am doing above, to use it later in vertex shading, or not store it (in 36 bytes of local variables) and recompute it there a second time?

I tested it, and storing versus recalculating showed the same speed, though it is hard to say, as it is hard for me to observe changes smaller than about 1 ms, and the result also varies between cases. But it looks like about the same execution time.

But not storing the intermediate world-space position gives the opportunity to multiply those two transformations together and move the coefficients out of the loop.

I was doubting whether it would help, as those operations were not complex (a couple of muls and adds). It was also a big amount of work to edit these coefficients by hand: it took me probably about 3 hours of hard work moving those variables in a text editor :C


void f()
{
  npx_px =  +  (modelRight.x*cameraRight.x
     +   modelUp.x*cameraRight.y
       +  modelDir.x*cameraRight.z);
 
  npx_py =   (modelRight.y*cameraRight.x
     + modelUp.y*cameraRight.y
      + modelDir.y*cameraRight.z);
 
  npx_pz =   (+  modelRight.z*cameraRight.x
        +  modelUp.z*cameraRight.y
        +  modelDir.z*cameraRight.z);
  npx_tail =
  -  modelPos.x*modelRight.x*cameraRight.x
    -  modelPos.y*modelRight.y*cameraRight.x
      -  modelPos.z*modelRight.z*cameraRight.x
       + modelPos.x*cameraRight.x
        - cameraPos.x*cameraRight.x
           -  modelPos.x*modelUp.x*cameraRight.y
           -  modelPos.y*modelUp.y*cameraRight.y
               -  modelPos.z*modelUp.z*cameraRight.y
               + modelPos.y*cameraRight.y
               - cameraPos.y*cameraRight.y
                  -  modelPos.x*modelDir.x*cameraRight.z
                      -  modelPos.y*modelDir.y*cameraRight.z
                        -  modelPos.z*modelDir.z*cameraRight.z
                       + modelPos.z*cameraRight.z
                       - cameraPos.z*cameraRight.z;
 
 
  npy_px =   modelRight.x*cameraUp.x  + modelUp.x*cameraUp.y  + modelDir.x*cameraUp.z;
  npy_py =  modelRight.y*cameraUp.x + modelUp.y*cameraUp.y + modelDir.y*cameraUp.z;
  npy_pz =   modelRight.z*cameraUp.x +  modelUp.z*cameraUp.y  + modelDir.z*cameraUp.z;
 
   npy_tail =
 
 -  modelPos.x*modelRight.x*cameraUp.x
  -  modelPos.y*modelRight.y*cameraUp.x
   -  modelPos.z*modelRight.z*cameraUp.x
    + modelPos.x*cameraUp.x
     - cameraPos.x*cameraUp.x
        -  modelPos.x*modelUp.x*cameraUp.y
          -  modelPos.y*modelUp.y*cameraUp.y
            -  modelPos.z*modelUp.z*cameraUp.y
              + modelPos.y*cameraUp.y
               - cameraPos.y*cameraUp.y
                -  modelPos.x*modelDir.x*cameraUp.z
                  -  modelPos.y*modelDir.y*cameraUp.z
                   -  modelPos.z*modelDir.z*cameraUp.z
                     + modelPos.z*cameraUp.z
                      - cameraPos.z*cameraUp.z   ;
 
 
  npz_px =  modelRight.x*cameraDir.x
        +  modelUp.x*cameraDir.y
                 +  modelDir.x*cameraDir.z;
 
  npz_py =    modelRight.y*cameraDir.x
         + modelUp.y*cameraDir.y
                 + modelDir.y*cameraDir.z;
 
  npz_pz =       modelRight.z*cameraDir.x
          +  modelUp.z*cameraDir.y
                 + modelDir.z*cameraDir.z;
 
  npz_tail =   -  modelPos.x*modelRight.x*cameraDir.x
    -  modelPos.y*modelRight.y*cameraDir.x
      -  modelPos.z*modelRight.z*cameraDir.x
       + modelPos.x*cameraDir.x
        - cameraPos.x*cameraDir.x
          -  modelPos.x*modelUp.x*cameraDir.y
         -  modelPos.y*modelUp.y*cameraDir.y
           -  modelPos.z*modelUp.z*cameraDir.y
            + modelPos.y*cameraDir.y
            - cameraPos.y*cameraDir.y
                   -  modelPos.x*modelDir.x*cameraDir.z
                  -  modelPos.y*modelDir.y*cameraDir.z
                 -  modelPos.z*modelDir.z*cameraDir.z
                 + modelPos.z*cameraDir.z
                  - cameraPos.z*cameraDir.z  ;
 
}

But finally it turned out that I gained 3 ms (15% at low res; at higher res it is only 7%), so with these two changes the frame time dropped from 20 ms to 16 ms.

Not very noticeable to the naked eye, but nice anyway (besides, all my code is slow, but this is for education; how to improve it with cache methods is still a mystery to me).
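
The same precomputation can also be done by a small helper instead of expanding every coefficient by hand. This is a sketch under assumptions: each transform is taken to be a 3x3 rotation plus a translation applied as `m * p + t`, and the names `Affine`, `affine_compose` and `affine_apply` are hypothetical. The conventions of the code above (which subtracts `modelPos` and `cameraPos`) may differ, so the matrices would have to be filled in accordingly.

```c
typedef struct {
    float m[3][3]; /* rotation part, row-major */
    float t[3];    /* translation part */
} Affine;

/* Returns "a after b": result(p) = a.m * (b.m * p + b.t) + a.t */
Affine affine_compose(const Affine *a, const Affine *b)
{
    Affine out;
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            out.m[i][j] = a->m[i][0] * b->m[0][j]
                        + a->m[i][1] * b->m[1][j]
                        + a->m[i][2] * b->m[2][j];
        }
        /* the inner translation is rotated by a, then a's translation is added */
        out.t[i] = a->m[i][0] * b->t[0]
                 + a->m[i][1] * b->t[1]
                 + a->m[i][2] * b->t[2]
                 + a->t[i];
    }
    return out;
}

/* out = a.m * in + a.t; in and out must not alias */
void affine_apply(const Affine *a, const float in[3], float out[3])
{
    for (int i = 0; i < 3; i++) {
        out[i] = a->m[i][0] * in[0] + a->m[i][1] * in[1] + a->m[i][2] * in[2] + a->t[i];
    }
}
```

Calling `affine_compose(&view, &model)` once per object before the vertex loop leaves a single `affine_apply` per vertex, which is the same idea as the twelve `np*_p*` and `np*_tail` coefficients computed in `f()`.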


By QueryPerformanceCounter ...

Uhh, you're going to have tons of latent cache effects if you just time random portions of code. Just use a sampling profiler, it will tell you exactly which functions your program spends the most time in by directly sampling the instruction pointer, minus the voodoo and uncertainty. You cannot guess performance by eyeballing how many multiplications you're doing or how many variables you're using in your code, hardware doesn't work that way anymore (though perhaps it still might for you, I don't know what you're running on...)


I was doubting whether it would help, as those operations were not complex

Then stop doubting - profile. With a real profiler, not a microbenchmark. Main thing is a profiler does a better job isolating actual realistic function runtimes, timing alone is very dependent on context (recently executed instructions and so on) so any minuscule gain you observe is usually illusory (or a cognitive bias) and will likely disappear the next time you refactor some code, or even reboot.


it took me probably about 3 hours of hard work moving those variables in a text editor :C

You spent three hours renaming and moving variables around? I hope that is just the forum acting up because, no offense, but the indentation is completely incoherent.

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”


1) To profile you must rewrite, and sometimes that takes 3 hours ;/

2) (About 3 hours for composing two "rotation + translation" transformations into one with precomputed factors.)

3) I'm not profiling out of context, I'm looking at frame times.

I like to optimize, though I'm not good at it. Some hints or ideas about what else could help here would be very welcome. I still need to do the test with tiled rasterization; some more info on this method, or in general on how to use the cache here, would be welcome.


inline void TransformPointByModelMatrix(float* px, float* py, float* pz)

Um... you supply the vector as three separate pointers?

Something is telling me that this has to be extremely slow.

try changing the signature to


void TransformPointByModelMatrix(float* p)

and use the array element operator [] to access the separate elements and see if it makes any difference.
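
A minimal sketch of what the suggested array-based version might look like. The body of the original function is not shown in the thread, so the matrix storage is an assumption: `g_modelRot` and `g_modelPos` are hypothetical file-scope variables standing in for wherever the model matrix actually lives.

```c
/* Hypothetical model matrix storage (identity rotation, zero translation by default). */
float g_modelRot[3][3] = {
    {1.0f, 0.0f, 0.0f},
    {0.0f, 1.0f, 0.0f},
    {0.0f, 0.0f, 1.0f},
};
float g_modelPos[3] = {0.0f, 0.0f, 0.0f};

/* Transforms the point p[0..2] in place: p = rot * p + pos */
void TransformPointByModelMatrix(float *p)
{
    /* Copy the inputs first so the in-place writes do not disturb later rows. */
    const float x = p[0], y = p[1], z = p[2];

    p[0] = g_modelRot[0][0] * x + g_modelRot[0][1] * y + g_modelRot[0][2] * z + g_modelPos[0];
    p[1] = g_modelRot[1][0] * x + g_modelRot[1][1] * y + g_modelRot[1][2] * z + g_modelPos[1];
    p[2] = g_modelRot[2][0] * x + g_modelRot[2][1] * y + g_modelRot[2][2] * z + g_modelPos[2];
}
```

Whether this is measurably faster than the three-pointer version is something only a profile can show, particularly if the function is inlined.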

2) (About 3 hours for composing two "rotation + translation" transformations into one with precomputed factors.)


Let me get this straight, it took you 3 hours to change
glm::mat4 projection = ...;
glm::mat4 view = ...;
glm::mat4 model = ...;

//....
for every vertex:
    glm::vec4 clipSpaceVertex = projection * view * model * modelSpaceVertex;
into
glm::mat4 projection = ...;
glm::mat4 view = ...;
glm::mat4 model = ...;

glm::mat4 MVP = projection * view * model;


//....

for every vertex:
    glm::vec4 clipSpaceVertex = MVP * modelSpaceVertex;
seriously?


Those functions are inlined, though shade() and rasterize() are not, and I pass something like 9 floats to them. But I'm not expecting much difference from inlining (based on some previous tests I did with such things).

The thing is that if I pass the triangle by value, the access to it is hardcoded, like [0x12345678] [0x1234567c]. When I pass a pointer to the triangle or an array index, the access is then like [triangle+0], [triangle+4], or [array+triangle*i+0]. In assembly all this addressing probably looks the same, as those first numbers I gave are in reality also addressed relative to the base/stack pointer rather than hardcoded, and such addressing modes are probably the same speed (maybe a difference of a cycle or two, I'm not sure).

So finally I assume (maybe very slightly wrongly) that the access time to

(*p).x

p.x

x //where x is on the stack

is nearly the same. (Maybe I could suspect that the array and the structure are a bit slower, as the i or p address must be held in some register, whereas the address for x is relative to the stack pointer, which would be kept anyway.)

(I may be slightly wrong here, someone could correct me.)

There is still the question of copying data by push. This could probably be a much bigger slowdown than the addressing differences above (I think that, as this is called hundreds of thousands of times per frame, it adds to the general RAM throughput limit: 100k * 36 bytes = 3.6 MB of throughput per frame pushed on the stack. I am not sure whether this kind of write can be counted the same way as writing to a regular array, but probably*). But in most cases I need such copies of the variables anyway.

* A very interesting question; could someone correct me on that? Say I have two cases:

for(i=0;i<100*1000; i++)
    a36byteStruct=some36byteStruct; //copying 36 bytes * 100k = 3.6MB

for(i=0;i<100*1000; i++)
    f(a,b,c,d,e,f,g,h,k); // pushing 36 bytes on the stack * 100k = 3.6MB

Do both these cases count against the memory throughput limit in the same way?
I suspect yes, and that calling functions can be counted as copying structures, but I'm not 100% sure.
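
For comparison, the copy can be sidestepped by passing a pointer to the 36-byte structure. The sketch below uses hypothetical names (`Triangle36`, `sum_by_value`, `sum_by_pointer`) and a trivial body, only to show the two calling styles side by side. As a general expectation rather than a measurement: repeated pushes or copies to the same stack slots usually stay in L1 cache, so they are unlikely to count against RAM throughput the way streaming through a 3.6 MB array would. Only a profile of the real code can confirm that.

```c
/* 9 floats: 36 bytes, assuming a 4-byte float */
typedef struct { float v[9]; } Triangle36;

/* By value: the caller copies all 36 bytes for every call (unless inlined). */
float sum_by_value(Triangle36 t)
{
    float s = 0.0f;
    for (int i = 0; i < 9; i++) s += t.v[i];
    return s;
}

/* By pointer: only the address is passed; the data is read in place. */
float sum_by_pointer(const Triangle36 *t)
{
    float s = 0.0f;
    for (int i = 0; i < 9; i++) s += t->v[i];
    return s;
}
```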
