Speeding this up with SSE or SSE intrinsics

18 comments, last by fir 9 years, 9 months ago
I have some weak-looking code that flat-shades a triangle on the CPU (with 4 lights).
It consumes a noticeable amount of frame time. It seems to be over 50%, though that is hard to say because some cache effects come into play and some parts of the code show zero execution time; profiling says this code consumes the most execution time in the whole pipeline.
Could it be rewritten using SSE or SSE intrinsics (I'm using MinGW and gcc 4.7), following the rules of the SSE art? How would I do this? (All the small functions called here, like cross, dot, normalize and rgb (which packs separate r, g, b values into one unsigned int), are my own, so I could rewrite those function bodies too.)
If someone could help with that, I could test whether it has an effect on frame time; the gcc SSE intrinsics way would be most welcome.
If not, maybe someone could say where to go with such SSE-related questions, as I know this forum is not too focused on assembly. Maybe there are better places on the net to discuss this?

 
// input:     float x1,  y1,  z1, x2,  y2,  z2, x3,  y3,  z3;
 

   static float3 lightDir1=  {0.2, -1.6,  -1.7 }; 
   static float3 lightDir2 = {0.5, -0.7,  20.3 };
   static float3 lightDir3 = {-0.5,-0.3, -0.6 }; 
   static float3 lightDir4 = {-0.5, 1.3,  0.6 };
 
    static  float3 lightColor1=  {.4,    .414,  .515 };
    static   float3 lightColor2 = {.4145, .451,   .543 };
    static   float3 lightColor3 = {.584,  .51414,  .43 };
    static    float3 lightColor4 = {.41,   .44,    .3414 };
 

   float3 u = {x2-x1, y2-y1, z2-z1 };
   float3 v = {x3-x2, y3-y2, z3-z2 };
 
   float3 normal = cross_(u,v);
 
//  normal.x = (y2-y1)*(z3-z2) - (z2-z1)*(y3-y2);
//  normal.y = (z2-z1)*(x3-x2) - (x2-x1)*(z3-z2);
//  normal.z = (x2-x1)*(y3-y2) - (y2-y1)*(x3-x2);
 
  normalize(&normal);
 
    float s1 = dot(normal, lightDir1);
    float s2 = dot(normal, lightDir2);
    float s3 = dot(normal, lightDir3);
    float s4 = dot(normal, lightDir4);
 
 
     if(s1<0) s1=0;
     if(s2<0) s2=0;
     if(s3<0) s3=0;
     if(s4<0) s4=0;
 
 
   int b = (color&0x000000ff);
   int g = (color&0x0000ff00)>>8;
   int r = (color&0x00ff0000)>>16;
 
  float   lr= .1 + (s1*lightColor1.x + s2*lightColor2.x + s3*lightColor3.x+ s4*lightColor4.x);
  float   lg= .1 +(s1*lightColor1.y + s2*lightColor2.y + s3*lightColor3.y+ s4*lightColor4.y);
  float   lb= .1 + (s1*lightColor1.z + s2*lightColor2.z + s3*lightColor3.z+ s4*lightColor4.z);
 
 
   r*=lr;
   g*=lg;
   b*=lb;
 
   if(r>255) r=255;
   if(g>255) g=255;
   if(b>255) b=255;
 
   return rgb(b,g,r);

Hi.

First of all, have you tried looking at the disassembly? With the right flags (a high -march, -mfpmath=sse, optimizations), gcc is able to produce decent vectorized code, so there's a possibility that it already has most SSE optimizations in place. Looking at the disassembly could also give you ideas for reordering the C code so the compiler can optimize it better.
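
A minimal sketch of how one might check this, assuming a small test file of your own. The file name `scale_add.c`, the function `scale_add` and the exact flag combination are illustrative, not taken from the thread; whether gcc actually vectorizes the loop depends on the compiler version and target.

```c
/* scale_add.c (hypothetical file name)
 *
 * Emit annotated Intel-syntax assembly instead of an object file:
 *
 *   gcc -std=c99 -O3 -msse2 -mfpmath=sse -S -masm=intel -fverbose-asm scale_add.c
 *
 * Then look in scale_add.s for packed instructions (mulps, addps)
 * as opposed to scalar ones (mulss, addss).
 */
#include <stddef.h>

/* restrict promises the arrays do not overlap, which is one of the
 * things the auto-vectorizer needs to know before it can use packed ops. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * k + b[i];
}
```

A simple loop over arrays like this is the shape the vectorizer handles best; a long straight-line function working on individual floats gives it much less to work with.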

That being said, if you want, you could write the SSE intrinsics by hand, especially for the dots, madds and clamps. Bullet's LinearMath library has some nice SSE code; you could get ideas from there.
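
As a rough illustration of the dots, madds and clamps, here is a minimal SSE1-only sketch of the four diffuse terms. It assumes the light directions are already normalized and stored transposed (one register per component), which is a layout choice of this sketch and not something the posted code does. `float3` here is a stand-in for the poster's type, and `LightRig`, `make_light_rig` and `light_factors` are hypothetical names.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics only */

typedef struct { float x, y, z; } float3;  /* stand-in for the poster's float3 */

typedef struct {
    __m128 dir_x, dir_y, dir_z;  /* lane i holds that component of light i */
    __m128 color[4];             /* (r, g, b, 0) factors for each light */
    __m128 ambient;              /* (a, a, a, 0) */
} LightRig;

/* Build the transposed layout once, outside the per-triangle path. */
static void make_light_rig(LightRig *rig, const float3 dir[4],
                           const float3 color[4], float ambient)
{
    int i;
    /* _mm_set_ps takes its arguments from the highest lane to the lowest. */
    rig->dir_x = _mm_set_ps(dir[3].x, dir[2].x, dir[1].x, dir[0].x);
    rig->dir_y = _mm_set_ps(dir[3].y, dir[2].y, dir[1].y, dir[0].y);
    rig->dir_z = _mm_set_ps(dir[3].z, dir[2].z, dir[1].z, dir[0].z);
    for (i = 0; i < 4; ++i)
        rig->color[i] = _mm_set_ps(0.0f, color[i].z, color[i].y, color[i].x);
    rig->ambient = _mm_set_ps(0.0f, ambient, ambient, ambient);
}

/* Returns (lr, lg, lb, 0) for a unit-length normal n. */
static __m128 light_factors(const LightRig *rig, float3 n)
{
    __m128 s, acc;

    /* Four dot products at once: s[i] = dot(n, dir[i]). */
    s = _mm_mul_ps(_mm_set1_ps(n.x), rig->dir_x);
    s = _mm_add_ps(s, _mm_mul_ps(_mm_set1_ps(n.y), rig->dir_y));
    s = _mm_add_ps(s, _mm_mul_ps(_mm_set1_ps(n.z), rig->dir_z));

    /* Clamp all four terms to >= 0 with a single maxps. */
    s = _mm_max_ps(s, _mm_setzero_ps());

    /* Broadcast each s[i] and accumulate s[i] * color[i] onto the ambient term. */
    acc = rig->ambient;
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(s, s, _MM_SHUFFLE(0, 0, 0, 0)), rig->color[0]));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)), rig->color[1]));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(s, s, _MM_SHUFFLE(2, 2, 2, 2)), rig->color[2]));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(s, s, _MM_SHUFFLE(3, 3, 3, 3)), rig->color[3]));
    return acc;
}
```

The result can be written to a `float[4]` with `_mm_storeu_ps`. The point of the transposed layout is that the four dot products need no horizontal adds at all; whether this beats the scalar version in practice would have to be measured.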

The GCC documentation has some very useful examples.

Stephen M. Webb
Professional Free Software Developer

I doubt that this specific code is the cause of the performance problem, and I suspect that using intrinsics will not eliminate the (whole) issue.

Can you try rendering more than a single quad and see how performance is then? (Profile again.)

Crealysm game & engine development: http://www.crealysm.com

Looking for a passionate, disciplined and structured producer? PM me

> I doubt that this specific code is the cause of the performance problem, and I suspect that using intrinsics will not eliminate the (whole) issue.
> Can you try rendering more than a single quad and see how performance is then? (Profile again.)

I was profiling this in the context of my program (you mean disabling this with an if and watching how much the frame rate goes up?). I can provide results in a while.

Edit: example results. For one scene, frame time is 27 ms with this shading and 20-21 ms without it, so it seems this is only 20-25% of CPU consumption (a bit more at low resolution, where it is 30%). Still, if I could improve it by 30% or so I would be happy.

Also, I would just like to try out these SSE intrinsics.

> The GCC documentation has some very useful examples.

Good idea, though it's a bit hard. Maybe someone could help a bit more specifically, if they have nothing to do and want to improve their SSE skills or talk about this?

> Hi.
> First of all, have you tried looking at the disassembly? With the right flags (a high -march, -mfpmath=sse, optimizations), gcc is able to produce decent vectorized code, so there's a possibility that it already has most SSE optimizations in place. Looking at the disassembly could also give you ideas for reordering the C code so the compiler can optimize it better.
> That being said, if you want, you could write the SSE intrinsics by hand, especially for the dots, madds and clamps. Bullet's LinearMath library has some nice SSE code; you could get ideas from there.

(this selective quoting is causing trouble)

Good idea. I can provide the assembly output in a while. (I'm new to gcc, so I am not yet used to how to do that and how to read it. I have only basic assembly skills, but I would like to train them a bit.)

Edit:

Okay, exactly this code


 
   static float3 lightDir1=  {0.2, -1.6,  -1.7 }; 
   static float3 lightDir2 = {0.5, -0.7,  20.3 }; 
   static float3 lightDir3 = {-0.5,-0.3, -0.6 }; 
   static float3 lightDir4 = {-0.5, 1.3,  0.6 };
 
  static  float3 lightColor1_=  {.4,    .414,  .515 };
  static  float3 lightColor2_ = {.4145, .451,   .543 };
  static  float3 lightColor3_ = {.584,  .51414,  .43 };
  static  float3 lightColor4_ = {.41,   .44,    .3414 };
 
 
 unsigned ShadeTriangle3d(  Triangle* triangle,
                           unsigned color)
 {
 
    static int initialized = 0;
    if(!initialized)
    {
     normalize(&lightDir1);
     normalize(&lightDir2);
     normalize(&lightDir3);
     normalize(&lightDir4);
     initialized = 1;
    }
  ///////////
 
     float x1,  y1,  z1, x2,  y2,  z2, x3,  y3,  z3;
 
     x1   = (   ((*triangle).a.x -  modelPos.x)*modelRight.x + ((*triangle).a.y -  modelPos.y)*modelRight.y + ((*triangle).a.z -  modelPos.z)*modelRight.z) +   modelPos.x;
     y1   = (   ((*triangle).a.x -  modelPos.x)*modelUp.x    + ((*triangle).a.y -  modelPos.y)*modelUp.y    + ((*triangle).a.z -  modelPos.z)*modelUp.z   ) +   modelPos.y;
     z1   = (   ((*triangle).a.x -  modelPos.x)*modelDir.x   + ((*triangle).a.y -  modelPos.y)*modelDir.y   + ((*triangle).a.z -  modelPos.z)*modelDir.z  ) +   modelPos.z;
 
     x2   = (   ((*triangle).b.x -  modelPos.x)*modelRight.x + ((*triangle).b.y -  modelPos.y)*modelRight.y + ((*triangle).b.z -  modelPos.z)*modelRight.z) +   modelPos.x;
     y2   = (   ((*triangle).b.x -  modelPos.x)*modelUp.x    + ((*triangle).b.y -  modelPos.y)*modelUp.y    + ((*triangle).b.z -  modelPos.z)*modelUp.z   ) +   modelPos.y;
     z2   = (   ((*triangle).b.x -  modelPos.x)*modelDir.x   + ((*triangle).b.y -  modelPos.y)*modelDir.y   + ((*triangle).b.z -  modelPos.z)*modelDir.z  ) +   modelPos.z;
 
     x3   = (   ((*triangle).c.x -  modelPos.x)*modelRight.x + ((*triangle).c.y -  modelPos.y)*modelRight.y + ((*triangle).c.z -  modelPos.z)*modelRight.z) +   modelPos.x;
     y3   = (   ((*triangle).c.x -  modelPos.x)*modelUp.x    + ((*triangle).c.y -  modelPos.y)*modelUp.y    + ((*triangle).c.z -  modelPos.z)*modelUp.z   ) +   modelPos.y;
     z3   = (   ((*triangle).c.x -  modelPos.x)*modelDir.x   + ((*triangle).c.y -  modelPos.y)*modelDir.y   + ((*triangle).c.z -  modelPos.z)*modelDir.z  ) +   modelPos.z;
 
 
  float3 normal;
 
  normal.x = (y2-y1)*(z3-z2) - (z2-z1)*(y3-y2);
  normal.y = (z2-z1)*(x3-x2) - (x2-x1)*(z3-z2);
  normal.z = (x2-x1)*(y3-y2) - (y2-y1)*(x3-x2);
 
  normalize_length_silent(&normal);
 
 
    float s1 = dot(normal, lightDir1);
    float s2 = dot(normal, lightDir2);
    float s3 = dot(normal, lightDir3);
    float s4 = dot(normal, lightDir4);
 
 
     if(s1<0) s1=0;
     if(s2<0) s2=0;
     if(s3<0) s3=0;
     if(s4<0) s4=0;
 
 
   int b = (color&0x000000ff);
   int g = (color&0x0000ff00)>>8;
   int r = (color&0x00ff0000)>>16;
 
 
 
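  // Note: lightColor2 below has no trailing underscore, so it is not the static lightColor2_ defined above.
  // In the listing it is read from memory (_lightColor2) instead of being folded into constants like the other three.
  float   lr= .1 + (s1*lightColor1_.x + s2*lightColor2.x + s3*lightColor3_.x+ s4*lightColor4_.x);
  float   lg= .1 +(s1*lightColor1_.y + s2*lightColor2.y + s3*lightColor3_.y+ s4*lightColor4_.y);
  float   lb= .1 + (s1*lightColor1_.z + s2*lightColor2.z + s3*lightColor3_.z+ s4*lightColor4_.z);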
 
   r*=lr;
   g*=lg;
   b*=lb;
 
   if(r>255) r=255;
   if(g>255) g=255;
   if(b>255) b=255;
 
   return rgb(b,g,r);
 
 }
 
 
 
 


produces this output:


.file "shade_triangle_3d.c"
.intel_syntax noprefix
 # GNU C++ (tdm-1) version 4.7.1 (mingw32)
 # compiled by GNU C version 4.7.1, GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.2
 # GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
 # options passed:  -I ..\..\..\
 # -iprefix c:\mingw\bin\../lib/gcc/mingw32/4.7.1/ shade_triangle_3d.c
 # -mrecip -march=pentium3 -mtune=generic -mfpmath=both -masm=intel -O3
 # -Ofast -w -funsafe-math-optimizations -ffast-math -fno-rtti
 # -fno-exceptions -fverbose-asm
 # options enabled:  -fassociative-math -fasynchronous-unwind-tables
 # -fauto-inc-dec -fbranch-count-reg -fcaller-saves
 # -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers
 # -fcrossjumping -fcse-follow-jumps -fcx-limited-range
 # -fdebug-types-section -fdefer-pop -fdelete-null-pointer-checks
 # -fdevirtualize -fdwarf2-cfi-asm -fearly-inlining
 # -feliminate-unused-debug-types -fexpensive-optimizations
 # -ffinite-math-only -fforward-propagate -ffunction-cse -fgcse
 # -fgcse-after-reload -fgcse-lm -fgnu-runtime -fguess-branch-probability
 # -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline
 # -finline-atomics -finline-functions -finline-functions-called-once
 # -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-profile
 # -fipa-pure-const -fipa-reference -fipa-sra -fira-share-save-slots
 # -fira-share-spill-slots -fivopts -fkeep-inline-dllexport
 # -fkeep-static-consts -fleading-underscore -fmerge-constants
 # -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
 # -foptimize-register-move -foptimize-sibling-calls -foptimize-strlen
 # -fpartial-inlining -fpeephole -fpeephole2 -fpredictive-commoning
 # -fprefetch-loop-arrays -freciprocal-math -free -freg-struct-return
 # -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fschedule-insns2 -fset-stack-executable
 # -fshow-column -fshrink-wrap -fsplit-ivs-in-unroller -fsplit-wide-types
 # -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields
 # -fthread-jumps -ftoplevel-reorder -ftree-bit-ccp -ftree-builtin-call-dce
 # -ftree-ccp -ftree-ch -ftree-copy-prop -ftree-copyrename -ftree-cselim
 # -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
 # -ftree-loop-distribute-patterns -ftree-loop-if-convert -ftree-loop-im
 # -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 # -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop
 # -ftree-sink -ftree-slp-vectorize -ftree-sra -ftree-switch-conversion
 # -ftree-tail-merge -ftree-ter -ftree-vect-loop-version -ftree-vectorize
 # -ftree-vrp -funit-at-a-time -funsafe-math-optimizations -funswitch-loops
 # -funwind-tables -fvect-cost-model -fverbose-asm
 # -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
 # -maccumulate-outgoing-args -malign-double -malign-stringops
 # -mfancy-math-387 -mfp-ret-in-387 -mmmx -mms-bitfields -mno-red-zone
 # -mno-sse4 -mpush-args -mrecip -msahf -msse -mstack-arg-probe
 
.section .rdata,"dr"
.align 4
LC1:
.ascii "division by zero in normalize vector\0"
.text
.p2align 4,,15
.globl __Z15ShadeTriangle3dP8Trianglej
.def __Z15ShadeTriangle3dP8Trianglej; .scl 2; .type 32; .endef
__Z15ShadeTriangle3dP8Trianglej:
push ebx #
sub esp, 104 #,
mov eax, DWORD PTR __ZZ15ShadeTriangle3dP8TrianglejE11initialized #, initialized
mov ebx, DWORD PTR [esp+112] # triangle, triangle
mov edx, DWORD PTR [esp+116] # color, color
test eax, eax #
je L2 #,
movss xmm5, DWORD PTR __ZL9lightDir4 # prephitmp.90, lightDir4.x
xorps xmm3, xmm3 # tmp1394
movss xmm4, DWORD PTR __ZL9lightDir4+4 # prephitmp.90, lightDir4.y
movss xmm7, DWORD PTR __ZL9lightDir4+8 # prephitmp.90, lightDir4.z
L3:
fld DWORD PTR _modelPos+4 # modelPos.y
movss xmm6, DWORD PTR [ebx] #, triangle_9(D)->a.x
movss xmm0, DWORD PTR [ebx+4] #, triangle_9(D)->a.y
fst DWORD PTR [esp+28] #
fld DWORD PTR _modelPos+8 # modelPos.z
movss xmm1, DWORD PTR [esp+28] #,
movss xmm2, DWORD PTR [ebx+8] #, triangle_9(D)->a.z
fst DWORD PTR [esp+28] #
fld DWORD PTR _modelUp+4 # modelUp.y
subss xmm0, xmm1 #,
fld DWORD PTR _modelPos # modelPos.x
fsubr DWORD PTR [ebx+12] # triangle_9(D)->b.x
subss xmm6, DWORD PTR _modelPos #, modelPos.x
movss DWORD PTR [esp+80], xmm0 # %sfp,
movss DWORD PTR [esp+76], xmm6 # %sfp,
fld DWORD PTR [ebx+16] # triangle_9(D)->b.y
fsub st, st(4) #,
movss xmm6, DWORD PTR [esp+28] #,
fld DWORD PTR [ebx+20] # triangle_9(D)->b.z
fsub st, st(4) #,
fxch st(2) #
subss xmm2, xmm6 #,
fst DWORD PTR [esp+28] #
fld DWORD PTR _modelRight+4 # modelRight.y
fmul st, st(2) #,
movss xmm0, DWORD PTR [esp+28] #,
movss DWORD PTR [esp+84], xmm2 # %sfp,
mulss xmm0, DWORD PTR _modelRight #, modelRight.x
fstp DWORD PTR [esp+28] #
fld DWORD PTR _modelRight+8 # modelRight.z
fmul st, st(3) #,
movss xmm1, DWORD PTR [esp+28] #,
addss xmm1, xmm0 #,
fstp DWORD PTR [esp+28] #
fld DWORD PTR _modelUp # modelUp.x
fmul st, st(1) #,
addss xmm1, DWORD PTR _modelPos #, modelPos.x
movss xmm2, DWORD PTR [esp+28] #,
fld st(2) #
addss xmm2, xmm1 #,
fmul st, st(5) #,
movss DWORD PTR [esp+64], xmm2 # %sfp,
faddp st(1), st #,
fadd st, st(6) #,
fld DWORD PTR _modelUp+8 # modelUp.z
fmul st, st(4) #,
faddp st(1), st #,
fld DWORD PTR _modelDir # modelDir.x
fmulp st(2), st #,
fxch st(2) #
fmul DWORD PTR _modelDir+4 # modelDir.y
faddp st(1), st #,
fadd st, st(4) #,
fxch st(2) #
fmul DWORD PTR _modelDir+8 # modelDir.z
faddp st(2), st #,
fld DWORD PTR _modelPos # modelPos.x
fsubr DWORD PTR [ebx+24] # triangle_9(D)->c.x
fld DWORD PTR [ebx+28] # triangle_9(D)->c.y
fsub st, st(6) #,
fld DWORD PTR [ebx+32] # triangle_9(D)->c.z
fsub st, st(6) #,
fxch st(2) #
fst DWORD PTR [esp+28] #
fxch st(1) #
movss xmm6, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(2) #
movss xmm0, DWORD PTR [esp+28] #,
mulss xmm6, DWORD PTR _modelRight #, modelRight.x
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm0, DWORD PTR _modelRight+4 #, modelRight.y
movss xmm1, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(2) #
movss xmm2, DWORD PTR [esp+28] #,
addss xmm6, xmm0 #,
fst DWORD PTR [esp+28] #
fxch st(5) #
mulss xmm1, DWORD PTR _modelRight+8 #, modelRight.z
mulss xmm2, DWORD PTR _modelUp #, modelUp.x
addss xmm6, DWORD PTR _modelPos #, modelPos.x
addss xmm6, xmm1 #,
movss DWORD PTR [esp+72], xmm6 # %sfp,
movss xmm6, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(7) #
movss xmm0, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(1) #
movss xmm1, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(2) #
fmul DWORD PTR _modelDir # modelDir.x
fxch st(5) #
mulss xmm6, xmm0 #,
fmul DWORD PTR _modelDir+4 # modelDir.y
addss xmm2, xmm6 #,
addss xmm1, xmm2 #,
movss xmm2, DWORD PTR [esp+28] #,
faddp st(5), st #,
fxch st(4) #
mulss xmm2, DWORD PTR _modelUp+8 #, modelUp.z
fadd st, st(5) #,
fxch st(1) #
addss xmm1, xmm2 #,
fmul DWORD PTR _modelDir+8 # modelDir.z
movss DWORD PTR [esp+68], xmm1 # %sfp,
faddp st(1), st #,
fxch st(5) #
fmul DWORD PTR [esp+80] # %sfp
fld DWORD PTR [esp+84] # %sfp
fmul DWORD PTR _modelUp+8 # modelUp.z
faddp st(1), st #,
faddp st(3), st #,
fld DWORD PTR [esp+76] # %sfp
fmul DWORD PTR _modelUp # modelUp.x
faddp st(3), st #,
fsubr st(2), st #,
fld DWORD PTR [esp+84] # %sfp
fmul DWORD PTR _modelDir+8 # modelDir.z
fld DWORD PTR [esp+76] # %sfp
fmul DWORD PTR _modelDir # modelDir.x
faddp st(1), st #,
faddp st(4), st #,
fld DWORD PTR [esp+80] # %sfp
fmul DWORD PTR _modelDir+4 # modelDir.y
faddp st(4), st #,
fxch st(3) #
fsubr st, st(1) #,
fld DWORD PTR [esp+68] # %sfp
fsubr st, st(4) #,
fmul st, st(1) #,
fld st(5) #
fsub st, st(3) #,
fmul st, st(4) #,
faddp st(1), st #,
fld DWORD PTR [esp+76] # %sfp
fmul DWORD PTR _modelRight # modelRight.x
fld DWORD PTR [esp+80] # %sfp
fmul DWORD PTR _modelRight+4 # modelRight.y
faddp st(1), st #,
fadd DWORD PTR _modelPos # modelPos.x
fld DWORD PTR [esp+84] # %sfp
fmul DWORD PTR _modelRight+8 # modelRight.z
faddp st(1), st #,
fsubr DWORD PTR [esp+64] # %sfp
fxch st(6) #
fsubp st(3), st #,
fxch st(2) #
fmul st, st(5) #,
fld DWORD PTR [esp+72] # %sfp
fsub DWORD PTR [esp+64] # %sfp
fmulp st(2), st #,
faddp st(1), st #,
fld DWORD PTR [esp+64] # %sfp
fsub DWORD PTR [esp+72] # %sfp
fmulp st(3), st #,
fld DWORD PTR [esp+68] # %sfp
fsubrp st(4), st #,
fxch st(3) #
movss DWORD PTR [esp+28], xmm3 #, tmp1394
fmulp st(4), st #,
fxch st(1) #
faddp st(3), st #,
fld st(1) #
fmul st, st(2) #,
fld st(1) #
fmul st, st(2) #,
faddp st(1), st #,
fld st(3) #
fmul st, st(4) #,
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L31 #,
fstp DWORD PTR [esp+28] #
movss xmm1, DWORD PTR [esp+28] #,
rcpss xmm0, xmm1 # tmp1298,
mulss xmm1, xmm0 # tmp1299, tmp1298
mulss xmm1, xmm0 # tmp1299, tmp1298
addss xmm0, xmm0 # tmp1301, tmp1298
subss xmm0, xmm1 # tmp1301, tmp1299
movss DWORD PTR [esp+28], xmm0 #, tmp1301
fld DWORD PTR [esp+28] #
fmul st(1), st #,
fmul st(2), st #,
fmulp st(3), st #,
fxch st(1) #
jmp L16 #
.p2align 4,,7
L31:
fstp st(0) #
fxch st(1) #
.p2align 4,,7
L16:
fst DWORD PTR [esp+28] #
fxch st(1) #
mov eax, edx # tmp1347, color
movss xmm1, DWORD PTR [esp+28] # s1,
and eax, 16711680 # tmp1347,
fst DWORD PTR [esp+28] #
fxch st(2) #
movzx ecx, dh # tmp1365, color
movss xmm0, DWORD PTR [esp+28] # tmp1305,
and edx, 255 # tmp1382,
mulss xmm1, DWORD PTR __ZL9lightDir1+4 # s1, lightDir1.y
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm0, DWORD PTR __ZL9lightDir1 # tmp1305, lightDir1.x
shr eax, 16 # tmp1347,
addss xmm1, xmm0 # s1, tmp1305
movss xmm0, DWORD PTR [esp+28] # tmp1308,
fst DWORD PTR [esp+28] #
fxch st(2) #
movss xmm2, DWORD PTR [esp+28] # s2,
mulss xmm0, DWORD PTR __ZL9lightDir1+8 # tmp1308, lightDir1.z
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm2, DWORD PTR __ZL9lightDir2+4 # s2, lightDir2.y
addss xmm1, xmm0 # s1, tmp1308
movss xmm0, DWORD PTR [esp+28] # tmp1312,
fst DWORD PTR [esp+28] #
fxch st(2) #
maxss xmm1, xmm3 # s1, tmp1394
mulss xmm0, DWORD PTR __ZL9lightDir2 # tmp1312, lightDir2.x
addss xmm2, xmm0 # s2, tmp1312
movss xmm0, DWORD PTR [esp+28] # tmp1315,
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm0, DWORD PTR __ZL9lightDir2+8 # tmp1315, lightDir2.z
addss xmm2, xmm0 # s2, tmp1315
movss xmm0, DWORD PTR [esp+28] # s3,
fst DWORD PTR [esp+28] #
fxch st(2) #
movss xmm6, DWORD PTR [esp+28] # tmp1319,
maxss xmm2, xmm3 # s2, tmp1394
mulss xmm0, DWORD PTR __ZL9lightDir3+4 # s3, lightDir3.y
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm6, DWORD PTR __ZL9lightDir3 # tmp1319, lightDir3.x
addss xmm0, xmm6 # s3, tmp1319
movss xmm6, DWORD PTR [esp+28] # tmp1322,
fstp DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm6, DWORD PTR __ZL9lightDir3+8 # tmp1322, lightDir3.z
addss xmm0, xmm6 # s3, tmp1322
movss xmm6, DWORD PTR [esp+28] #,
fstp DWORD PTR [esp+28] #
maxss xmm0, xmm3 # s3, tmp1394
mulss xmm4, xmm6 # s4,
movss xmm6, DWORD PTR [esp+28] #,
fstp DWORD PTR [esp+28] #
mulss xmm5, xmm6 # tmp1324,
addss xmm4, xmm5 # s4, tmp1324
movss xmm5, DWORD PTR [esp+28] #,
movss DWORD PTR [esp+28], xmm2 #, s2
fld DWORD PTR [esp+28] #
movss xmm6, DWORD PTR LC3 #,
mulss xmm7, xmm5 # tmp1326,
fmul DWORD PTR _lightColor2 # lightColor2.x
movss xmm5, DWORD PTR LC5 #,
mulss xmm6, xmm1 #, s1
addss xmm4, xmm7 # s4, tmp1326
maxss xmm4, xmm3 # s4, tmp1394
movss xmm3, DWORD PTR LC4 #,
movss DWORD PTR [esp+64], xmm6 # %sfp,
mulss xmm5, xmm4 #, s4
mulss xmm3, xmm0 #, s3
fadd DWORD PTR [esp+64] # %sfp
movss xmm6, DWORD PTR LC7 #,
movss DWORD PTR [esp+64], xmm3 # %sfp,
mulss xmm6, xmm1 #, s1
mulss xmm1, DWORD PTR LC10 # s1,
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm5 # %sfp,
cvtsi2ss xmm5, eax # tmp1348, tmp1347
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm6 # %sfp,
fld QWORD PTR LC6 #
fadd st(1), st #,
fxch st(1) #
fstp DWORD PTR [esp+28] #
movss xmm3, DWORD PTR [esp+28] # D.17036,
movss DWORD PTR [esp+28], xmm2 #, s2
mulss xmm3, xmm5 # D.17036, tmp1348
fld DWORD PTR [esp+28] #
fmul DWORD PTR _lightColor2+4 # lightColor2.y
movss xmm5, DWORD PTR LC9 #,
cvttss2si eax, xmm3 # r, D.17036
movss xmm3, DWORD PTR LC8 #,
mulss xmm5, xmm4 #, s4
fadd DWORD PTR [esp+64] # %sfp
mulss xmm3, xmm0 #, s3
mulss xmm0, DWORD PTR LC11 # s3,
movss DWORD PTR [esp+64], xmm3 # %sfp,
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm5 # %sfp,
cvtsi2ss xmm5, ecx # tmp1366, tmp1365
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm1 # %sfp, s1
cvtsi2ss xmm1, edx # tmp1383, tmp1382
mov edx, 255 # tmp1386,
fadd st, st(1) #,
fstp DWORD PTR [esp+28] #
movss xmm3, DWORD PTR [esp+28] # D.17038,
movss DWORD PTR [esp+28], xmm2 #, s2
mulss xmm3, xmm5 # D.17038, tmp1366
fld DWORD PTR [esp+28] #
fmul DWORD PTR _lightColor2+8 # lightColor2.z
cvttss2si ecx, xmm3 # g, D.17038
fadd DWORD PTR [esp+64] # %sfp
cmp ecx, 255 # g,
movss DWORD PTR [esp+64], xmm0 # %sfp, s3
cmovg ecx, edx # g,, tmp1387, tmp1386
mulss xmm4, DWORD PTR LC12 # s4,
sal ecx, 8 # tmp1387,
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm4 # %sfp, s4
fadd DWORD PTR [esp+64] # %sfp
faddp st(1), st #,
fstp DWORD PTR [esp+28] #
movss xmm0, DWORD PTR [esp+28] # D.17040,
mulss xmm0, xmm1 # D.17040, tmp1383
cvttss2si ebx, xmm0 # b, D.17040
cmp ebx, 255 # b,
cmovg ebx, edx # b,, tmp1388, tmp1386
add ecx, ebx # tmp1390, tmp1388
cmp eax, 255 # r,
cmovg eax, edx # r,, tmp1393, tmp1386
add esp, 104 #,
sal eax, 16 # tmp1393,
add eax, ecx # tmp1384, tmp1390
pop ebx #
ret
.p2align 4,,7
L2:
fld DWORD PTR __ZL9lightDir1+4 # lightDir1.y
xorps xmm3, xmm3 # tmp1394
movss xmm4, DWORD PTR __ZL9lightDir1 # D.17224, lightDir1.x
fld st(0) #
fmul st, st(1) #,
movss DWORD PTR [esp+28], xmm4 #, D.17224
movss xmm2, DWORD PTR __ZL9lightDir1+8 # D.17219, lightDir1.z
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm2 #, D.17219
faddp st(1), st #,
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm3 #, tmp1394
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L27 #,
fstp DWORD PTR [esp+28] #
movss xmm0, DWORD PTR [esp+28] #,
rcpss xmm1, xmm0 # tmp1128,
movaps xmm5, xmm0 # tmp1129,
mulss xmm5, xmm1 # tmp1129, tmp1128
movaps xmm0, xmm1 # tmp1131, tmp1128
addss xmm0, xmm1 # tmp1131, tmp1128
mulss xmm5, xmm1 # tmp1129, tmp1128
subss xmm0, xmm5 # tmp1131, tmp1129
movss DWORD PTR [esp+28], xmm0 #, tmp1131
mulss xmm4, xmm0 # tmp1133, tmp1131
mulss xmm0, xmm2 # tmp1137, D.17219
fld DWORD PTR [esp+28] #
fmulp st(1), st #,
movss DWORD PTR __ZL9lightDir1, xmm4 # lightDir1.x, tmp1133
movss DWORD PTR __ZL9lightDir1+8, xmm0 # lightDir1.z, tmp1137
fstp DWORD PTR __ZL9lightDir1+4 # lightDir1.y
L6:
fld DWORD PTR __ZL9lightDir2+4 # lightDir2.y
movss xmm4, DWORD PTR __ZL9lightDir2 # D.17246, lightDir2.x
fld st(0) #
fmul st, st(1) #,
movss DWORD PTR [esp+28], xmm4 #, D.17246
movss xmm2, DWORD PTR __ZL9lightDir2+8 # D.17241, lightDir2.z
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm2 #, D.17241
faddp st(1), st #,
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm3 #, tmp1394
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L28 #,
fstp DWORD PTR [esp+28] #
movss xmm0, DWORD PTR [esp+28] #,
rcpss xmm1, xmm0 # tmp1151,
movaps xmm5, xmm0 # tmp1152,
mulss xmm5, xmm1 # tmp1152, tmp1151
movaps xmm0, xmm1 # tmp1154, tmp1151
addss xmm0, xmm1 # tmp1154, tmp1151
mulss xmm5, xmm1 # tmp1152, tmp1151
subss xmm0, xmm5 # tmp1154, tmp1152
movss DWORD PTR [esp+28], xmm0 #, tmp1154
mulss xmm4, xmm0 # tmp1156, tmp1154
mulss xmm0, xmm2 # tmp1160, D.17241
fld DWORD PTR [esp+28] #
fmulp st(1), st #,
movss DWORD PTR __ZL9lightDir2, xmm4 # lightDir2.x, tmp1156
movss DWORD PTR __ZL9lightDir2+8, xmm0 # lightDir2.z, tmp1160
fstp DWORD PTR __ZL9lightDir2+4 # lightDir2.y
L9:
fld DWORD PTR __ZL9lightDir3+4 # lightDir3.y
movss xmm4, DWORD PTR __ZL9lightDir3 # D.17268, lightDir3.x
fld st(0) #
fmul st, st(1) #,
movss DWORD PTR [esp+28], xmm4 #, D.17268
movss xmm2, DWORD PTR __ZL9lightDir3+8 # D.17263, lightDir3.z
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm2 #, D.17263
faddp st(1), st #,
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm3 #, tmp1394
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L29 #,
fstp DWORD PTR [esp+28] #
movss xmm0, DWORD PTR [esp+28] #,
rcpss xmm1, xmm0 # tmp1174,
movaps xmm5, xmm0 # tmp1175,
mulss xmm5, xmm1 # tmp1175, tmp1174
movaps xmm0, xmm1 # tmp1177, tmp1174
addss xmm0, xmm1 # tmp1177, tmp1174
mulss xmm5, xmm1 # tmp1175, tmp1174
subss xmm0, xmm5 # tmp1177, tmp1175
movss DWORD PTR [esp+28], xmm0 #, tmp1177
mulss xmm4, xmm0 # tmp1179, tmp1177
mulss xmm0, xmm2 # tmp1183, D.17263
fld DWORD PTR [esp+28] #
fmulp st(1), st #,
movss DWORD PTR __ZL9lightDir3, xmm4 # lightDir3.x, tmp1179
movss DWORD PTR __ZL9lightDir3+8, xmm0 # lightDir3.z, tmp1183
fstp DWORD PTR __ZL9lightDir3+4 # lightDir3.y
L12:
fld DWORD PTR __ZL9lightDir4+4 # lightDir4.y
movss xmm5, DWORD PTR __ZL9lightDir4 # D.17290, lightDir4.x
fld st(0) #
fmul st, st(1) #,
movss DWORD PTR [esp+28], xmm5 #, D.17290
movss xmm1, DWORD PTR __ZL9lightDir4+8 # D.17285, lightDir4.z
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm1 #, D.17285
faddp st(1), st #,
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm3 #, tmp1394
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L30 #,
fstp DWORD PTR [esp+28] #
movss xmm2, DWORD PTR [esp+28] #,
fstp DWORD PTR [esp+28] #
movss xmm4, DWORD PTR [esp+28] # prephitmp.90,
rcpss xmm0, xmm2 # tmp1200,
mulss xmm2, xmm0 # tmp1201, tmp1200
movaps xmm7, xmm0 # tmp1203, tmp1200
addss xmm7, xmm0 # tmp1203, tmp1200
mulss xmm2, xmm0 # tmp1201, tmp1200
subss xmm7, xmm2 # tmp1203, tmp1201
mulss xmm5, xmm7 # prephitmp.90, tmp1203
mulss xmm4, xmm7 # prephitmp.90, tmp1203
mulss xmm7, xmm1 # prephitmp.90, D.17285
movss DWORD PTR __ZL9lightDir4, xmm5 # lightDir4.x, prephitmp.90
movss DWORD PTR __ZL9lightDir4+4, xmm4 # lightDir4.y, prephitmp.90
movss DWORD PTR __ZL9lightDir4+8, xmm7 # lightDir4.z, prephitmp.90
L15:
mov DWORD PTR __ZZ15ShadeTriangle3dP8TrianglejE11initialized, 1 # initialized,
jmp L3 #
.p2align 4,,7
L27:
fstp st(0) #
fstp st(0) #
mov DWORD PTR [esp], OFFSET FLAT:LC1 #,
mov DWORD PTR [esp+60], edx #,
movss DWORD PTR [esp+32], xmm3 #,
call __Z6ERROR_Pc #
mov edx, DWORD PTR [esp+60] #,
movss xmm3, DWORD PTR [esp+32] #,
jmp L6 #
.p2align 4,,7
L28:
fstp st(0) #
fstp st(0) #
mov DWORD PTR [esp], OFFSET FLAT:LC1 #,
mov DWORD PTR [esp+60], edx #,
movss DWORD PTR [esp+32], xmm3 #,
call __Z6ERROR_Pc #
mov edx, DWORD PTR [esp+60] #,
movss xmm3, DWORD PTR [esp+32] #,
jmp L9 #
.p2align 4,,7
L29:
fstp st(0) #
fstp st(0) #
mov DWORD PTR [esp], OFFSET FLAT:LC1 #,
mov DWORD PTR [esp+60], edx #,
movss DWORD PTR [esp+32], xmm3 #,
call __Z6ERROR_Pc #
mov edx, DWORD PTR [esp+60] #,
movss xmm3, DWORD PTR [esp+32] #,
jmp L12 #
.p2align 4,,7
L30:
fstp st(0) #
fstp st(0) #
mov DWORD PTR [esp], OFFSET FLAT:LC1 #,
mov DWORD PTR [esp+60], edx #,
movss DWORD PTR [esp+32], xmm3 #,
call __Z6ERROR_Pc #
mov edx, DWORD PTR [esp+60] #,
movss xmm5, DWORD PTR __ZL9lightDir4 # prephitmp.90, lightDir4.x
movss xmm4, DWORD PTR __ZL9lightDir4+4 # prephitmp.90, lightDir4.y
movss xmm7, DWORD PTR __ZL9lightDir4+8 # prephitmp.90, lightDir4.z
movss xmm3, DWORD PTR [esp+32] #,
jmp L15 #
.lcomm __ZZ15ShadeTriangle3dP8TrianglejE11initialized,4,16
.data
.align 16
__ZL9lightDir1:
 # x:
.long 1045220557
 # y:
.long -1077097267
 # z:
.long -1076258406
.align 16
__ZL9lightDir2:
 # x:
.long 1056964608
 # y:
.long -1087163597
 # z:
.long 1101162086
.align 16
__ZL9lightDir3:
 # x:
.long -1090519040
 # y:
.long -1097229926
 # z:
.long -1088841318
.align 16
__ZL9lightDir4:
 # x:
.long -1090519040
 # y:
.long 1067869798
 # z:
.long 1058642330
.section .rdata,"dr"
.align 4
LC3:
.long 1053609165
.align 4
LC4:
.long 1058373894
.align 4
LC5:
.long 1053944709
.align 8
LC6:
.long -1717986918
.long 1069128089
.align 4
LC7:
.long 1054078927
.align 4
LC8:
.long 1057201838
.align 4
LC9:
.long 1054951342
.align 4
LC10:
.long 1057216266
.align 4
LC11:
.long 1054615798
.align 4
LC12:
.long 1051642875
.def __Z6ERROR_Pc; .scl 2; .type 32; .endef
 

A bit of a mess, IMO, and I think it could probably be improved with SSE, at least computationally.

In particular it is a mess because gcc has probably inlined all the code for normalize and the dots, and those are repeated 8 times.

<propaganda>

Keep in mind that your code is not hardware accelerated, so you won't get much better performance than that.

Also, your dump shows that your compiler IS using SSE optimizations (no y there...)

</propaganda>

> <propaganda>
> Keep in mind that your code is not hardware accelerated, so you won't get much better performance than that.
> Also, your dump shows that your compiler IS using SSE optimizations (no y there...)
> </propaganda>

It is using SSE, but only scalar mnemonics. I need some advice on how to hand-optimize it with intrinsics.

I know I will not get much speedup (probably that 30% or so is all I can count on), but I would like to try it anyway if possible.
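
One part that maps onto intrinsics fairly directly is the tail of the posted function: scale r, g and b, clamp to 255 and repack. The sketch below is one possible SSE2 version, not a tested drop-in replacement. `modulate_color` is a hypothetical name, and the lane order (b, g, r, unused) for the light factors is an assumption of this sketch. It assumes the factors are non-negative and of ordinary size, relies on the saturating pack instructions for the clamp, and truncates the same way the int conversion in `r*=lr` does. It needs `-msse2`.

```c
#include <emmintrin.h>  /* SSE2 (also pulls in the SSE1 header) */

/* color is 0x00RRGGBB; light_bgr holds the factors in lanes (lb, lg, lr, unused). */
static unsigned modulate_color(unsigned color, __m128 light_bgr)
{
    __m128i zero = _mm_setzero_si128();
    __m128i px   = _mm_cvtsi32_si128((int)color);  /* low bytes: B, G, R, 0 */
    __m128  c;
    __m128i i;

    /* Widen the four bytes to four 32-bit integers, then to floats. */
    px = _mm_unpacklo_epi8(px, zero);   /* 16-bit lanes: B, G, R, 0 */
    px = _mm_unpacklo_epi16(px, zero);  /* 32-bit lanes: B, G, R, 0 */
    c  = _mm_mul_ps(_mm_cvtepi32_ps(px), light_bgr);

    /* Truncate to int, then narrow 32 -> 16 -> 8 bits.
     * The final unsigned pack saturates each channel to 0..255. */
    i = _mm_cvttps_epi32(c);
    i = _mm_packs_epi32(i, i);
    i = _mm_packus_epi16(i, i);

    /* Mask off the top byte so the unused lane cannot leak into the result. */
    return (unsigned)_mm_cvtsi128_si32(i) & 0x00ffffffu;
}
```

For example, `modulate_color(0x00804020, _mm_set_ps(0.0f, 1.5f, 0.5f, 3.0f))` scales red by 1.5, green by 0.5 and blue by 3.0.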

> <propaganda>
> Keep in mind that your code is not hardware accelerated, so you won't get much better performance than that.
> Also, your dump shows that your compiler IS using SSE optimizations (no y there...)
> </propaganda>

Those are the scalar operations (ending in ss), not the vectorized ones (ending in ps). You can actually get quite a lot better than that, even on the CPU, but it involves high-level optimizations (evil propaganda that never got anyone anywhere), profiling and debugging (unhandy), reading up on stuff (too much of an inconvenience) and googling (boring, can't be engaged in a discussion).

Why do I even bother...
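
To illustrate the ss/ps distinction, here is a tiny sketch using the corresponding intrinsics. `add_scalar` and `add_packed` are hypothetical helper names; `_mm_add_ss` compiles to addss and `_mm_add_ps` to addps.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* addss: only lane 0 is added; lanes 1..3 are passed through from a. */
static void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* addps: all four lanes are added in one instruction. */
static void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the scalar version yields {11, 2, 3, 4} and the packed version {11, 22, 33, 44}. A listing full of mulss/addss is therefore doing one float per instruction even though it is technically "using SSE".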

> <propaganda>
> Keep in mind that your code is not hardware accelerated, so you won't get much better performance than that.
> Also, your dump shows that your compiler IS using SSE optimizations (no y there...)
> </propaganda>
> Those are the scalar operations (ending in ss), not the vectorized ones (ending in ps). You can actually get quite a lot better than that, even on the CPU, but it involves high-level optimizations (evil propaganda that never got anyone anywhere), profiling and debugging (unhandy), reading up on stuff (too much of an inconvenience) and googling (boring, can't be engaged in a discussion).

A reasonable opinion at least... (the point about profiling is incorrect, as I do a lot of measurement/profiling). This is an interesting topic after all, so why not discuss it?

I found something in the archive:

http://www.gamedev.net/topic/648061-performance-optimization-sse-vector-dotnormalize/

though it does not compile for me, as it uses SSE 4.1 and I am limited to SSE 1 and 2.
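
If the SSE 4.1 part of that code is the dot-product instruction (`_mm_dp_ps`), which is an assumption about the linked thread, it can be replaced with SSE1 shuffles and adds. A minimal sketch, with hypothetical helper names (`load3`, `hsum_ps`, `dot3_ps`, `normalize3_ps`) and vectors kept as (x, y, z, 0):

```c
#include <xmmintrin.h>  /* SSE1 intrinsics only */

/* Put (x, y, z) into lanes 0..2 and force lane 3 to zero. */
static __m128 load3(float x, float y, float z)
{
    return _mm_set_ps(0.0f, z, y, x);
}

/* Sum of all four lanes, broadcast to every lane. */
static __m128 hsum_ps(__m128 v)
{
    /* t = (v0+v1, v1+v0, v2+v3, v3+v2) */
    __m128 t = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    /* swap the two halves and add: every lane becomes v0+v1+v2+v3 */
    return _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
}

/* 3-component dot product; correct as long as lane 3 of a or b is zero. */
static __m128 dot3_ps(__m128 a, __m128 b)
{
    return hsum_ps(_mm_mul_ps(a, b));
}

/* Normalize with a full-precision sqrt and divide.
 * A zero-length input is not checked here and would produce NaNs. */
static __m128 normalize3_ps(__m128 v)
{
    return _mm_div_ps(v, _mm_sqrt_ps(dot3_ps(v, v)));
}
```

`_mm_rsqrt_ps` followed by a multiply is a faster alternative to the sqrt and divide, at the cost of being only an approximation; whether that precision is acceptable for flat shading is a judgment call.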

This topic is closed to new replies.
