fir

Speeding this up with SSE or SSE intrinsics


I've got some weak-looking code that flat-shades a triangle on the CPU (with 4 lights).
It consumes a noticeable amount of frame time. It seems to be over 50%, though it is hard to say, as some cache effects come into play and some parts of the code show zero execution time; profiling says this code consumes the most execution time in the whole pipeline.
 
Could it be rewritten using SSE or SSE intrinsics (I'm using MinGW and gcc 4.7), following the rules of the SSE art? How would I do this? All the small functions called here, like cross, dot, normalize and rgb (which packs separate r, g, b values into one unsigned int), are my own, so I could rewrite those function bodies too.
 
If someone could help with that, I could test whether it has an effect on frame time. The gcc SSE intrinsics way would be preferred.
 
If not, maybe someone could say where to go with such SSE-related questions? I know this forum is not very focused on assembly, so maybe there are better places on the net to discuss this.
 
 
// input:     float x1,  y1,  z1, x2,  y2,  z2, x3,  y3,  z3;
 

   static float3 lightDir1=  {0.2, -1.6,  -1.7 }; 
   static float3 lightDir2 = {0.5, -0.7,  20.3 };
   static float3 lightDir3 = {-0.5,-0.3, -0.6 }; 
   static float3 lightDir4 = {-0.5, 1.3,  0.6 };
 
    static  float3 lightColor1=  {.4,    .414,  .515 };
    static   float3 lightColor2 = {.4145, .451,   .543 };
    static   float3 lightColor3 = {.584,  .51414,  .43 };
    static    float3 lightColor4 = {.41,   .44,    .3414 };
 

   float3 u = {x2-x1, y2-y1, z2-z1 };
   float3 v = {x3-x2, y3-y2, z3-z2 };
 
   float3 normal = cross_(u,v);
 
//  normal.x = (y2-y1)*(z3-z2) - (z2-z1)*(y3-y2);
//  normal.y = (z2-z1)*(x3-x2) - (x2-x1)*(z3-z2);
//  normal.z = (x2-x1)*(y3-y2) - (y2-y1)*(x3-x2);
 
  normalize(&normal);
 
    float s1 = dot(normal, lightDir1);
    float s2 = dot(normal, lightDir2);
    float s3 = dot(normal, lightDir3);
    float s4 = dot(normal, lightDir4);
 
 
     if(s1<0) s1=0;
     if(s2<0) s2=0;
     if(s3<0) s3=0;
     if(s4<0) s4=0;
 
 
   int b = (color&0x000000ff);
   int g = (color&0x0000ff00)>>8;
   int r = (color&0x00ff0000)>>16;
 
  float   lr= .1 + (s1*lightColor1.x + s2*lightColor2.x + s3*lightColor3.x+ s4*lightColor4.x);
  float   lg= .1 +(s1*lightColor1.y + s2*lightColor2.y + s3*lightColor3.y+ s4*lightColor4.y);
  float   lb= .1 + (s1*lightColor1.z + s2*lightColor2.z + s3*lightColor3.z+ s4*lightColor4.z);
 
 
   r*=lr;
   g*=lg;
   b*=lb;
 
   if(r>255) r=255;
   if(g>255) g=255;
   if(b>255) b=255;
 
   return rgb(b,g,r);


Hi.

First of all, have you tried looking at the disassembly? With the right flags (a high -march, -mfpmath=sse, optimizations enabled), gcc is able to produce decent vectorized code, so it's possible that it already has most SSE optimizations in place. Looking at the disassembly may also give you ideas for reordering the C code so that the compiler can optimize it better.

That being said, if you want, you could write the SSE intrinsics by hand, especially for the dots, madds and clamps. Bullet's LinearMath library has some nice SSE code; you could get ideas from there.
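
As a rough illustration of the "madds and clamps" part, here is a minimal sketch using only SSE1 intrinsics from `<xmmintrin.h>`. The function name `shade_rgb_sse` and its argument layout are assumptions made for this example, not code from the thread or from Bullet; it mirrors the light accumulation and the 0..255 clamp of the scalar code for a single triangle.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper (not from the thread): combines four light terms.
   s[i]         : N.L dot product for light i (may be negative)
   light_rgb[i] : {r, g, b} color of light i
   color        : 0x00RRGGBB base color, as in the original post */
unsigned shade_rgb_sse(const float s[4], const float light_rgb[4][3], unsigned color)
{
    const __m128 zero = _mm_setzero_ps();
    /* ambient term 0.1 in lanes {r, g, b}; the top lane is unused */
    __m128 acc = _mm_set_ps(0.0f, 0.1f, 0.1f, 0.1f);

    for (int i = 0; i < 4; ++i) {
        /* clamp: max(s, 0) replaces "if (s < 0) s = 0" */
        __m128 si = _mm_max_ps(_mm_set1_ps(s[i]), zero);
        __m128 ci = _mm_set_ps(0.0f, light_rgb[i][2], light_rgb[i][1], light_rgb[i][0]);
        /* madd: SSE1 has no fused multiply-add, so this is a mul followed by an add */
        acc = _mm_add_ps(acc, _mm_mul_ps(si, ci));
    }

    /* base color as floats, lanes {r, g, b, unused} */
    __m128 base = _mm_set_ps(0.0f,
                             (float)(color & 0xff),
                             (float)((color >> 8) & 0xff),
                             (float)((color >> 16) & 0xff));
    /* clamp: min(x, 255) replaces "if (r > 255) r = 255" */
    __m128 lit = _mm_min_ps(_mm_mul_ps(base, acc), _mm_set1_ps(255.0f));

    float out[4];
    _mm_storeu_ps(out, lit);
    /* truncating conversion, like the int *= float in the scalar code */
    return ((unsigned)(int)out[0] << 16) | ((unsigned)(int)out[1] << 8) | (unsigned)(int)out[2];
}
```

For one triangle at a time this mostly saves branches; the bigger wins usually come from processing several triangles per call.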


I doubt that this specific code is the cause of the performance problem, and I think using intrinsics will not eliminate the (whole) issue.

Can you try rendering more than a single quad and see what the performance is like then? (Profile again.)

Edited by cozzie


> I doubt that this specific code is the cause of the performance problem, and I think using intrinsics will not eliminate the (whole) issue.
>
> Can you try rendering more than a single quad and see what the performance is like then? (Profile again.)

 

I was profiling this in the context of my program (you mean if-ing this off and watching how much the frame rate goes up?). I can provide results in a while.

 

 

Edit: example results. For one scene, frame time is 27 ms with this shading and 20-21 ms without it, so it seems this is only 20-25% of CPU time (a bit more in low resolution, about 30%). Still, if I could improve it by 30% or so I would be happy.

 

I would also just like to try out these SSE intrinsics.

 

 

> The GCC documentation has some very useful examples.

 

 

Good idea, though it's a bit hard. Maybe someone could help a bit more specifically, if they have nothing to do and want to improve their SSE skills or talk about this?

 

 

 

> Hi.
> First of all, have you tried looking at the disassembly? With the right flags (a high -march, -mfpmath=sse, optimizations enabled), gcc is able to produce decent vectorized code, so it's possible that it already has most SSE optimizations in place. Looking at the disassembly may also give you ideas for reordering the C code so that the compiler can optimize it better.
> That being said, if you want, you could write the SSE intrinsics by hand, especially for the dots, madds and clamps. Bullet's LinearMath library has some nice SSE code; you could get ideas from there.

 
(this selective quoting is causing trouble)
 

Good idea. I can provide the assembly output in a while. (I'm new to gcc, so I'm not yet used to how to do that and how to read it. I've got only some basic assembly skills, but would like to train them a bit.)

 

 

edit 

 

Okay, exactly this code:

 
   static float3 lightDir1=  {0.2, -1.6,  -1.7 }; 
   static float3 lightDir2 = {0.5, -0.7,  20.3 }; 
   static float3 lightDir3 = {-0.5,-0.3, -0.6 }; 
   static float3 lightDir4 = {-0.5, 1.3,  0.6 };
 
  static  float3 lightColor1_=  {.4,    .414,  .515 };
  static  float3 lightColor2_ = {.4145, .451,   .543 };
  static  float3 lightColor3_ = {.584,  .51414,  .43 };
  static  float3 lightColor4_ = {.41,   .44,    .3414 };
 
 
 unsigned ShadeTriangle3d(  Triangle* triangle,
                           unsigned color)
 {
 
    static int initialized = 0;
    if(!initialized)
    {
     normalize(&lightDir1);
     normalize(&lightDir2);
     normalize(&lightDir3);
     normalize(&lightDir4);
     initialized = 1;
    }
  ///////////
 
     float x1,  y1,  z1, x2,  y2,  z2, x3,  y3,  z3;
 
     x1   = (   ((*triangle).a.x -  modelPos.x)*modelRight.x + ((*triangle).a.y -  modelPos.y)*modelRight.y + ((*triangle).a.z -  modelPos.z)*modelRight.z) +   modelPos.x;
     y1   = (   ((*triangle).a.x -  modelPos.x)*modelUp.x    + ((*triangle).a.y -  modelPos.y)*modelUp.y    + ((*triangle).a.z -  modelPos.z)*modelUp.z   ) +   modelPos.y;
     z1   = (   ((*triangle).a.x -  modelPos.x)*modelDir.x   + ((*triangle).a.y -  modelPos.y)*modelDir.y   + ((*triangle).a.z -  modelPos.z)*modelDir.z  ) +   modelPos.z;
 
     x2   = (   ((*triangle).b.x -  modelPos.x)*modelRight.x + ((*triangle).b.y -  modelPos.y)*modelRight.y + ((*triangle).b.z -  modelPos.z)*modelRight.z) +   modelPos.x;
     y2   = (   ((*triangle).b.x -  modelPos.x)*modelUp.x    + ((*triangle).b.y -  modelPos.y)*modelUp.y    + ((*triangle).b.z -  modelPos.z)*modelUp.z   ) +   modelPos.y;
     z2   = (   ((*triangle).b.x -  modelPos.x)*modelDir.x   + ((*triangle).b.y -  modelPos.y)*modelDir.y   + ((*triangle).b.z -  modelPos.z)*modelDir.z  ) +   modelPos.z;
 
     x3   = (   ((*triangle).c.x -  modelPos.x)*modelRight.x + ((*triangle).c.y -  modelPos.y)*modelRight.y + ((*triangle).c.z -  modelPos.z)*modelRight.z) +   modelPos.x;
     y3   = (   ((*triangle).c.x -  modelPos.x)*modelUp.x    + ((*triangle).c.y -  modelPos.y)*modelUp.y    + ((*triangle).c.z -  modelPos.z)*modelUp.z   ) +   modelPos.y;
     z3   = (   ((*triangle).c.x -  modelPos.x)*modelDir.x   + ((*triangle).c.y -  modelPos.y)*modelDir.y   + ((*triangle).c.z -  modelPos.z)*modelDir.z  ) +   modelPos.z;
 
 
  float3 normal;
 
  normal.x = (y2-y1)*(z3-z2) - (z2-z1)*(y3-y2);
  normal.y = (z2-z1)*(x3-x2) - (x2-x1)*(z3-z2);
  normal.z = (x2-x1)*(y3-y2) - (y2-y1)*(x3-x2);
 
  normalize_length_silent(&normal);
 
 
    float s1 = dot(normal, lightDir1);
    float s2 = dot(normal, lightDir2);
    float s3 = dot(normal, lightDir3);
    float s4 = dot(normal, lightDir4);
 
 
     if(s1<0) s1=0;
     if(s2<0) s2=0;
     if(s3<0) s3=0;
     if(s4<0) s4=0;
 
 
   int b = (color&0x000000ff);
   int g = (color&0x0000ff00)>>8;
   int r = (color&0x00ff0000)>>16;
 
 
 
  float   lr= .1 + (s1*lightColor1_.x + s2*lightColor2.x + s3*lightColor3_.x+ s4*lightColor4_.x);
  float   lg= .1 +(s1*lightColor1_.y + s2*lightColor2.y + s3*lightColor3_.y+ s4*lightColor4_.y);
  float   lb= .1 + (s1*lightColor1_.z + s2*lightColor2.z + s3*lightColor3_.z+ s4*lightColor4_.z);
 
   r*=lr;
   g*=lg;
   b*=lb;
 
   if(r>255) r=255;
   if(g>255) g=255;
   if(b>255) b=255;
 
   return rgb(b,g,r);
 
 }
 
 
 
 


 

produces this output:
 
 

.file "shade_triangle_3d.c"
.intel_syntax noprefix
 # GNU C++ (tdm-1) version 4.7.1 (mingw32)
 # compiled by GNU C version 4.7.1, GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.2
 # GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
 # options passed:  -I ..\..\..\
 # -iprefix c:\mingw\bin\../lib/gcc/mingw32/4.7.1/ shade_triangle_3d.c
 # -mrecip -march=pentium3 -mtune=generic -mfpmath=both -masm=intel -O3
 # -Ofast -w -funsafe-math-optimizations -ffast-math -fno-rtti
 # -fno-exceptions -fverbose-asm
 # options enabled:  -fassociative-math -fasynchronous-unwind-tables
 # -fauto-inc-dec -fbranch-count-reg -fcaller-saves
 # -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers
 # -fcrossjumping -fcse-follow-jumps -fcx-limited-range
 # -fdebug-types-section -fdefer-pop -fdelete-null-pointer-checks
 # -fdevirtualize -fdwarf2-cfi-asm -fearly-inlining
 # -feliminate-unused-debug-types -fexpensive-optimizations
 # -ffinite-math-only -fforward-propagate -ffunction-cse -fgcse
 # -fgcse-after-reload -fgcse-lm -fgnu-runtime -fguess-branch-probability
 # -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline
 # -finline-atomics -finline-functions -finline-functions-called-once
 # -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-profile
 # -fipa-pure-const -fipa-reference -fipa-sra -fira-share-save-slots
 # -fira-share-spill-slots -fivopts -fkeep-inline-dllexport
 # -fkeep-static-consts -fleading-underscore -fmerge-constants
 # -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
 # -foptimize-register-move -foptimize-sibling-calls -foptimize-strlen
 # -fpartial-inlining -fpeephole -fpeephole2 -fpredictive-commoning
 # -fprefetch-loop-arrays -freciprocal-math -free -freg-struct-return
 # -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fschedule-insns2 -fset-stack-executable
 # -fshow-column -fshrink-wrap -fsplit-ivs-in-unroller -fsplit-wide-types
 # -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields
 # -fthread-jumps -ftoplevel-reorder -ftree-bit-ccp -ftree-builtin-call-dce
 # -ftree-ccp -ftree-ch -ftree-copy-prop -ftree-copyrename -ftree-cselim
 # -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
 # -ftree-loop-distribute-patterns -ftree-loop-if-convert -ftree-loop-im
 # -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 # -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop
 # -ftree-sink -ftree-slp-vectorize -ftree-sra -ftree-switch-conversion
 # -ftree-tail-merge -ftree-ter -ftree-vect-loop-version -ftree-vectorize
 # -ftree-vrp -funit-at-a-time -funsafe-math-optimizations -funswitch-loops
 # -funwind-tables -fvect-cost-model -fverbose-asm
 # -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
 # -maccumulate-outgoing-args -malign-double -malign-stringops
 # -mfancy-math-387 -mfp-ret-in-387 -mmmx -mms-bitfields -mno-red-zone
 # -mno-sse4 -mpush-args -mrecip -msahf -msse -mstack-arg-probe
 
.section .rdata,"dr"
.align 4
LC1:
.ascii "division by zero in normalize vector\0"
.text
.p2align 4,,15
.globl __Z15ShadeTriangle3dP8Trianglej
.def __Z15ShadeTriangle3dP8Trianglej; .scl 2; .type 32; .endef
__Z15ShadeTriangle3dP8Trianglej:
push ebx #
sub esp, 104 #,
mov eax, DWORD PTR __ZZ15ShadeTriangle3dP8TrianglejE11initialized #, initialized
mov ebx, DWORD PTR [esp+112] # triangle, triangle
mov edx, DWORD PTR [esp+116] # color, color
test eax, eax #
je L2 #,
movss xmm5, DWORD PTR __ZL9lightDir4 # prephitmp.90, lightDir4.x
xorps xmm3, xmm3 # tmp1394
movss xmm4, DWORD PTR __ZL9lightDir4+4 # prephitmp.90, lightDir4.y
movss xmm7, DWORD PTR __ZL9lightDir4+8 # prephitmp.90, lightDir4.z
L3:
fld DWORD PTR _modelPos+4 # modelPos.y
movss xmm6, DWORD PTR [ebx] #, triangle_9(D)->a.x
movss xmm0, DWORD PTR [ebx+4] #, triangle_9(D)->a.y
fst DWORD PTR [esp+28] #
fld DWORD PTR _modelPos+8 # modelPos.z
movss xmm1, DWORD PTR [esp+28] #,
movss xmm2, DWORD PTR [ebx+8] #, triangle_9(D)->a.z
fst DWORD PTR [esp+28] #
fld DWORD PTR _modelUp+4 # modelUp.y
subss xmm0, xmm1 #,
fld DWORD PTR _modelPos # modelPos.x
fsubr DWORD PTR [ebx+12] # triangle_9(D)->b.x
subss xmm6, DWORD PTR _modelPos #, modelPos.x
movss DWORD PTR [esp+80], xmm0 # %sfp,
movss DWORD PTR [esp+76], xmm6 # %sfp,
fld DWORD PTR [ebx+16] # triangle_9(D)->b.y
fsub st, st(4) #,
movss xmm6, DWORD PTR [esp+28] #,
fld DWORD PTR [ebx+20] # triangle_9(D)->b.z
fsub st, st(4) #,
fxch st(2) #
subss xmm2, xmm6 #,
fst DWORD PTR [esp+28] #
fld DWORD PTR _modelRight+4 # modelRight.y
fmul st, st(2) #,
movss xmm0, DWORD PTR [esp+28] #,
movss DWORD PTR [esp+84], xmm2 # %sfp,
mulss xmm0, DWORD PTR _modelRight #, modelRight.x
fstp DWORD PTR [esp+28] #
fld DWORD PTR _modelRight+8 # modelRight.z
fmul st, st(3) #,
movss xmm1, DWORD PTR [esp+28] #,
addss xmm1, xmm0 #,
fstp DWORD PTR [esp+28] #
fld DWORD PTR _modelUp # modelUp.x
fmul st, st(1) #,
addss xmm1, DWORD PTR _modelPos #, modelPos.x
movss xmm2, DWORD PTR [esp+28] #,
fld st(2) #
addss xmm2, xmm1 #,
fmul st, st(5) #,
movss DWORD PTR [esp+64], xmm2 # %sfp,
faddp st(1), st #,
fadd st, st(6) #,
fld DWORD PTR _modelUp+8 # modelUp.z
fmul st, st(4) #,
faddp st(1), st #,
fld DWORD PTR _modelDir # modelDir.x
fmulp st(2), st #,
fxch st(2) #
fmul DWORD PTR _modelDir+4 # modelDir.y
faddp st(1), st #,
fadd st, st(4) #,
fxch st(2) #
fmul DWORD PTR _modelDir+8 # modelDir.z
faddp st(2), st #,
fld DWORD PTR _modelPos # modelPos.x
fsubr DWORD PTR [ebx+24] # triangle_9(D)->c.x
fld DWORD PTR [ebx+28] # triangle_9(D)->c.y
fsub st, st(6) #,
fld DWORD PTR [ebx+32] # triangle_9(D)->c.z
fsub st, st(6) #,
fxch st(2) #
fst DWORD PTR [esp+28] #
fxch st(1) #
movss xmm6, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(2) #
movss xmm0, DWORD PTR [esp+28] #,
mulss xmm6, DWORD PTR _modelRight #, modelRight.x
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm0, DWORD PTR _modelRight+4 #, modelRight.y
movss xmm1, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(2) #
movss xmm2, DWORD PTR [esp+28] #,
addss xmm6, xmm0 #,
fst DWORD PTR [esp+28] #
fxch st(5) #
mulss xmm1, DWORD PTR _modelRight+8 #, modelRight.z
mulss xmm2, DWORD PTR _modelUp #, modelUp.x
addss xmm6, DWORD PTR _modelPos #, modelPos.x
addss xmm6, xmm1 #,
movss DWORD PTR [esp+72], xmm6 # %sfp,
movss xmm6, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(7) #
movss xmm0, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(1) #
movss xmm1, DWORD PTR [esp+28] #,
fst DWORD PTR [esp+28] #
fxch st(2) #
fmul DWORD PTR _modelDir # modelDir.x
fxch st(5) #
mulss xmm6, xmm0 #,
fmul DWORD PTR _modelDir+4 # modelDir.y
addss xmm2, xmm6 #,
addss xmm1, xmm2 #,
movss xmm2, DWORD PTR [esp+28] #,
faddp st(5), st #,
fxch st(4) #
mulss xmm2, DWORD PTR _modelUp+8 #, modelUp.z
fadd st, st(5) #,
fxch st(1) #
addss xmm1, xmm2 #,
fmul DWORD PTR _modelDir+8 # modelDir.z
movss DWORD PTR [esp+68], xmm1 # %sfp,
faddp st(1), st #,
fxch st(5) #
fmul DWORD PTR [esp+80] # %sfp
fld DWORD PTR [esp+84] # %sfp
fmul DWORD PTR _modelUp+8 # modelUp.z
faddp st(1), st #,
faddp st(3), st #,
fld DWORD PTR [esp+76] # %sfp
fmul DWORD PTR _modelUp # modelUp.x
faddp st(3), st #,
fsubr st(2), st #,
fld DWORD PTR [esp+84] # %sfp
fmul DWORD PTR _modelDir+8 # modelDir.z
fld DWORD PTR [esp+76] # %sfp
fmul DWORD PTR _modelDir # modelDir.x
faddp st(1), st #,
faddp st(4), st #,
fld DWORD PTR [esp+80] # %sfp
fmul DWORD PTR _modelDir+4 # modelDir.y
faddp st(4), st #,
fxch st(3) #
fsubr st, st(1) #,
fld DWORD PTR [esp+68] # %sfp
fsubr st, st(4) #,
fmul st, st(1) #,
fld st(5) #
fsub st, st(3) #,
fmul st, st(4) #,
faddp st(1), st #,
fld DWORD PTR [esp+76] # %sfp
fmul DWORD PTR _modelRight # modelRight.x
fld DWORD PTR [esp+80] # %sfp
fmul DWORD PTR _modelRight+4 # modelRight.y
faddp st(1), st #,
fadd DWORD PTR _modelPos # modelPos.x
fld DWORD PTR [esp+84] # %sfp
fmul DWORD PTR _modelRight+8 # modelRight.z
faddp st(1), st #,
fsubr DWORD PTR [esp+64] # %sfp
fxch st(6) #
fsubp st(3), st #,
fxch st(2) #
fmul st, st(5) #,
fld DWORD PTR [esp+72] # %sfp
fsub DWORD PTR [esp+64] # %sfp
fmulp st(2), st #,
faddp st(1), st #,
fld DWORD PTR [esp+64] # %sfp
fsub DWORD PTR [esp+72] # %sfp
fmulp st(3), st #,
fld DWORD PTR [esp+68] # %sfp
fsubrp st(4), st #,
fxch st(3) #
movss DWORD PTR [esp+28], xmm3 #, tmp1394
fmulp st(4), st #,
fxch st(1) #
faddp st(3), st #,
fld st(1) #
fmul st, st(2) #,
fld st(1) #
fmul st, st(2) #,
faddp st(1), st #,
fld st(3) #
fmul st, st(4) #,
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L31 #,
fstp DWORD PTR [esp+28] #
movss xmm1, DWORD PTR [esp+28] #,
rcpss xmm0, xmm1 # tmp1298,
mulss xmm1, xmm0 # tmp1299, tmp1298
mulss xmm1, xmm0 # tmp1299, tmp1298
addss xmm0, xmm0 # tmp1301, tmp1298
subss xmm0, xmm1 # tmp1301, tmp1299
movss DWORD PTR [esp+28], xmm0 #, tmp1301
fld DWORD PTR [esp+28] #
fmul st(1), st #,
fmul st(2), st #,
fmulp st(3), st #,
fxch st(1) #
jmp L16 #
.p2align 4,,7
L31:
fstp st(0) #
fxch st(1) #
.p2align 4,,7
L16:
fst DWORD PTR [esp+28] #
fxch st(1) #
mov eax, edx # tmp1347, color
movss xmm1, DWORD PTR [esp+28] # s1,
and eax, 16711680 # tmp1347,
fst DWORD PTR [esp+28] #
fxch st(2) #
movzx ecx, dh # tmp1365, color
movss xmm0, DWORD PTR [esp+28] # tmp1305,
and edx, 255 # tmp1382,
mulss xmm1, DWORD PTR __ZL9lightDir1+4 # s1, lightDir1.y
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm0, DWORD PTR __ZL9lightDir1 # tmp1305, lightDir1.x
shr eax, 16 # tmp1347,
addss xmm1, xmm0 # s1, tmp1305
movss xmm0, DWORD PTR [esp+28] # tmp1308,
fst DWORD PTR [esp+28] #
fxch st(2) #
movss xmm2, DWORD PTR [esp+28] # s2,
mulss xmm0, DWORD PTR __ZL9lightDir1+8 # tmp1308, lightDir1.z
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm2, DWORD PTR __ZL9lightDir2+4 # s2, lightDir2.y
addss xmm1, xmm0 # s1, tmp1308
movss xmm0, DWORD PTR [esp+28] # tmp1312,
fst DWORD PTR [esp+28] #
fxch st(2) #
maxss xmm1, xmm3 # s1, tmp1394
mulss xmm0, DWORD PTR __ZL9lightDir2 # tmp1312, lightDir2.x
addss xmm2, xmm0 # s2, tmp1312
movss xmm0, DWORD PTR [esp+28] # tmp1315,
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm0, DWORD PTR __ZL9lightDir2+8 # tmp1315, lightDir2.z
addss xmm2, xmm0 # s2, tmp1315
movss xmm0, DWORD PTR [esp+28] # s3,
fst DWORD PTR [esp+28] #
fxch st(2) #
movss xmm6, DWORD PTR [esp+28] # tmp1319,
maxss xmm2, xmm3 # s2, tmp1394
mulss xmm0, DWORD PTR __ZL9lightDir3+4 # s3, lightDir3.y
fst DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm6, DWORD PTR __ZL9lightDir3 # tmp1319, lightDir3.x
addss xmm0, xmm6 # s3, tmp1319
movss xmm6, DWORD PTR [esp+28] # tmp1322,
fstp DWORD PTR [esp+28] #
fxch st(1) #
mulss xmm6, DWORD PTR __ZL9lightDir3+8 # tmp1322, lightDir3.z
addss xmm0, xmm6 # s3, tmp1322
movss xmm6, DWORD PTR [esp+28] #,
fstp DWORD PTR [esp+28] #
maxss xmm0, xmm3 # s3, tmp1394
mulss xmm4, xmm6 # s4,
movss xmm6, DWORD PTR [esp+28] #,
fstp DWORD PTR [esp+28] #
mulss xmm5, xmm6 # tmp1324,
addss xmm4, xmm5 # s4, tmp1324
movss xmm5, DWORD PTR [esp+28] #,
movss DWORD PTR [esp+28], xmm2 #, s2
fld DWORD PTR [esp+28] #
movss xmm6, DWORD PTR LC3 #,
mulss xmm7, xmm5 # tmp1326,
fmul DWORD PTR _lightColor2 # lightColor2.x
movss xmm5, DWORD PTR LC5 #,
mulss xmm6, xmm1 #, s1
addss xmm4, xmm7 # s4, tmp1326
maxss xmm4, xmm3 # s4, tmp1394
movss xmm3, DWORD PTR LC4 #,
movss DWORD PTR [esp+64], xmm6 # %sfp,
mulss xmm5, xmm4 #, s4
mulss xmm3, xmm0 #, s3
fadd DWORD PTR [esp+64] # %sfp
movss xmm6, DWORD PTR LC7 #,
movss DWORD PTR [esp+64], xmm3 # %sfp,
mulss xmm6, xmm1 #, s1
mulss xmm1, DWORD PTR LC10 # s1,
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm5 # %sfp,
cvtsi2ss xmm5, eax # tmp1348, tmp1347
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm6 # %sfp,
fld QWORD PTR LC6 #
fadd st(1), st #,
fxch st(1) #
fstp DWORD PTR [esp+28] #
movss xmm3, DWORD PTR [esp+28] # D.17036,
movss DWORD PTR [esp+28], xmm2 #, s2
mulss xmm3, xmm5 # D.17036, tmp1348
fld DWORD PTR [esp+28] #
fmul DWORD PTR _lightColor2+4 # lightColor2.y
movss xmm5, DWORD PTR LC9 #,
cvttss2si eax, xmm3 # r, D.17036
movss xmm3, DWORD PTR LC8 #,
mulss xmm5, xmm4 #, s4
fadd DWORD PTR [esp+64] # %sfp
mulss xmm3, xmm0 #, s3
mulss xmm0, DWORD PTR LC11 # s3,
movss DWORD PTR [esp+64], xmm3 # %sfp,
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm5 # %sfp,
cvtsi2ss xmm5, ecx # tmp1366, tmp1365
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm1 # %sfp, s1
cvtsi2ss xmm1, edx # tmp1383, tmp1382
mov edx, 255 # tmp1386,
fadd st, st(1) #,
fstp DWORD PTR [esp+28] #
movss xmm3, DWORD PTR [esp+28] # D.17038,
movss DWORD PTR [esp+28], xmm2 #, s2
mulss xmm3, xmm5 # D.17038, tmp1366
fld DWORD PTR [esp+28] #
fmul DWORD PTR _lightColor2+8 # lightColor2.z
cvttss2si ecx, xmm3 # g, D.17038
fadd DWORD PTR [esp+64] # %sfp
cmp ecx, 255 # g,
movss DWORD PTR [esp+64], xmm0 # %sfp, s3
cmovg ecx, edx # g,, tmp1387, tmp1386
mulss xmm4, DWORD PTR LC12 # s4,
sal ecx, 8 # tmp1387,
fadd DWORD PTR [esp+64] # %sfp
movss DWORD PTR [esp+64], xmm4 # %sfp, s4
fadd DWORD PTR [esp+64] # %sfp
faddp st(1), st #,
fstp DWORD PTR [esp+28] #
movss xmm0, DWORD PTR [esp+28] # D.17040,
mulss xmm0, xmm1 # D.17040, tmp1383
cvttss2si ebx, xmm0 # b, D.17040
cmp ebx, 255 # b,
cmovg ebx, edx # b,, tmp1388, tmp1386
add ecx, ebx # tmp1390, tmp1388
cmp eax, 255 # r,
cmovg eax, edx # r,, tmp1393, tmp1386
add esp, 104 #,
sal eax, 16 # tmp1393,
add eax, ecx # tmp1384, tmp1390
pop ebx #
ret
.p2align 4,,7
L2:
fld DWORD PTR __ZL9lightDir1+4 # lightDir1.y
xorps xmm3, xmm3 # tmp1394
movss xmm4, DWORD PTR __ZL9lightDir1 # D.17224, lightDir1.x
fld st(0) #
fmul st, st(1) #,
movss DWORD PTR [esp+28], xmm4 #, D.17224
movss xmm2, DWORD PTR __ZL9lightDir1+8 # D.17219, lightDir1.z
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm2 #, D.17219
faddp st(1), st #,
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm3 #, tmp1394
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L27 #,
fstp DWORD PTR [esp+28] #
movss xmm0, DWORD PTR [esp+28] #,
rcpss xmm1, xmm0 # tmp1128,
movaps xmm5, xmm0 # tmp1129,
mulss xmm5, xmm1 # tmp1129, tmp1128
movaps xmm0, xmm1 # tmp1131, tmp1128
addss xmm0, xmm1 # tmp1131, tmp1128
mulss xmm5, xmm1 # tmp1129, tmp1128
subss xmm0, xmm5 # tmp1131, tmp1129
movss DWORD PTR [esp+28], xmm0 #, tmp1131
mulss xmm4, xmm0 # tmp1133, tmp1131
mulss xmm0, xmm2 # tmp1137, D.17219
fld DWORD PTR [esp+28] #
fmulp st(1), st #,
movss DWORD PTR __ZL9lightDir1, xmm4 # lightDir1.x, tmp1133
movss DWORD PTR __ZL9lightDir1+8, xmm0 # lightDir1.z, tmp1137
fstp DWORD PTR __ZL9lightDir1+4 # lightDir1.y
L6:
fld DWORD PTR __ZL9lightDir2+4 # lightDir2.y
movss xmm4, DWORD PTR __ZL9lightDir2 # D.17246, lightDir2.x
fld st(0) #
fmul st, st(1) #,
movss DWORD PTR [esp+28], xmm4 #, D.17246
movss xmm2, DWORD PTR __ZL9lightDir2+8 # D.17241, lightDir2.z
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm2 #, D.17241
faddp st(1), st #,
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm3 #, tmp1394
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L28 #,
fstp DWORD PTR [esp+28] #
movss xmm0, DWORD PTR [esp+28] #,
rcpss xmm1, xmm0 # tmp1151,
movaps xmm5, xmm0 # tmp1152,
mulss xmm5, xmm1 # tmp1152, tmp1151
movaps xmm0, xmm1 # tmp1154, tmp1151
addss xmm0, xmm1 # tmp1154, tmp1151
mulss xmm5, xmm1 # tmp1152, tmp1151
subss xmm0, xmm5 # tmp1154, tmp1152
movss DWORD PTR [esp+28], xmm0 #, tmp1154
mulss xmm4, xmm0 # tmp1156, tmp1154
mulss xmm0, xmm2 # tmp1160, D.17241
fld DWORD PTR [esp+28] #
fmulp st(1), st #,
movss DWORD PTR __ZL9lightDir2, xmm4 # lightDir2.x, tmp1156
movss DWORD PTR __ZL9lightDir2+8, xmm0 # lightDir2.z, tmp1160
fstp DWORD PTR __ZL9lightDir2+4 # lightDir2.y
L9:
fld DWORD PTR __ZL9lightDir3+4 # lightDir3.y
movss xmm4, DWORD PTR __ZL9lightDir3 # D.17268, lightDir3.x
fld st(0) #
fmul st, st(1) #,
movss DWORD PTR [esp+28], xmm4 #, D.17268
movss xmm2, DWORD PTR __ZL9lightDir3+8 # D.17263, lightDir3.z
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm2 #, D.17263
faddp st(1), st #,
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm3 #, tmp1394
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L29 #,
fstp DWORD PTR [esp+28] #
movss xmm0, DWORD PTR [esp+28] #,
rcpss xmm1, xmm0 # tmp1174,
movaps xmm5, xmm0 # tmp1175,
mulss xmm5, xmm1 # tmp1175, tmp1174
movaps xmm0, xmm1 # tmp1177, tmp1174
addss xmm0, xmm1 # tmp1177, tmp1174
mulss xmm5, xmm1 # tmp1175, tmp1174
subss xmm0, xmm5 # tmp1177, tmp1175
movss DWORD PTR [esp+28], xmm0 #, tmp1177
mulss xmm4, xmm0 # tmp1179, tmp1177
mulss xmm0, xmm2 # tmp1183, D.17263
fld DWORD PTR [esp+28] #
fmulp st(1), st #,
movss DWORD PTR __ZL9lightDir3, xmm4 # lightDir3.x, tmp1179
movss DWORD PTR __ZL9lightDir3+8, xmm0 # lightDir3.z, tmp1183
fstp DWORD PTR __ZL9lightDir3+4 # lightDir3.y
L12:
fld DWORD PTR __ZL9lightDir4+4 # lightDir4.y
movss xmm5, DWORD PTR __ZL9lightDir4 # D.17290, lightDir4.x
fld st(0) #
fmul st, st(1) #,
movss DWORD PTR [esp+28], xmm5 #, D.17290
movss xmm1, DWORD PTR __ZL9lightDir4+8 # D.17285, lightDir4.z
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm1 #, D.17285
faddp st(1), st #,
fld DWORD PTR [esp+28] #
fmul st, st(0) #,
movss DWORD PTR [esp+28], xmm3 #, tmp1394
faddp st(1), st #,
fsqrt
fld DWORD PTR [esp+28] #
fcomip st, st(1) #,
jae L30 #,
fstp DWORD PTR [esp+28] #
movss xmm2, DWORD PTR [esp+28] #,
fstp DWORD PTR [esp+28] #
movss xmm4, DWORD PTR [esp+28] # prephitmp.90,
rcpss xmm0, xmm2 # tmp1200,
mulss xmm2, xmm0 # tmp1201, tmp1200
movaps xmm7, xmm0 # tmp1203, tmp1200
addss xmm7, xmm0 # tmp1203, tmp1200
mulss xmm2, xmm0 # tmp1201, tmp1200
subss xmm7, xmm2 # tmp1203, tmp1201
mulss xmm5, xmm7 # prephitmp.90, tmp1203
mulss xmm4, xmm7 # prephitmp.90, tmp1203
mulss xmm7, xmm1 # prephitmp.90, D.17285
movss DWORD PTR __ZL9lightDir4, xmm5 # lightDir4.x, prephitmp.90
movss DWORD PTR __ZL9lightDir4+4, xmm4 # lightDir4.y, prephitmp.90
movss DWORD PTR __ZL9lightDir4+8, xmm7 # lightDir4.z, prephitmp.90
L15:
mov DWORD PTR __ZZ15ShadeTriangle3dP8TrianglejE11initialized, 1 # initialized,
jmp L3 #
.p2align 4,,7
L27:
fstp st(0) #
fstp st(0) #
mov DWORD PTR [esp], OFFSET FLAT:LC1 #,
mov DWORD PTR [esp+60], edx #,
movss DWORD PTR [esp+32], xmm3 #,
call __Z6ERROR_Pc #
mov edx, DWORD PTR [esp+60] #,
movss xmm3, DWORD PTR [esp+32] #,
jmp L6 #
.p2align 4,,7
L28:
fstp st(0) #
fstp st(0) #
mov DWORD PTR [esp], OFFSET FLAT:LC1 #,
mov DWORD PTR [esp+60], edx #,
movss DWORD PTR [esp+32], xmm3 #,
call __Z6ERROR_Pc #
mov edx, DWORD PTR [esp+60] #,
movss xmm3, DWORD PTR [esp+32] #,
jmp L9 #
.p2align 4,,7
L29:
fstp st(0) #
fstp st(0) #
mov DWORD PTR [esp], OFFSET FLAT:LC1 #,
mov DWORD PTR [esp+60], edx #,
movss DWORD PTR [esp+32], xmm3 #,
call __Z6ERROR_Pc #
mov edx, DWORD PTR [esp+60] #,
movss xmm3, DWORD PTR [esp+32] #,
jmp L12 #
.p2align 4,,7
L30:
fstp st(0) #
fstp st(0) #
mov DWORD PTR [esp], OFFSET FLAT:LC1 #,
mov DWORD PTR [esp+60], edx #,
movss DWORD PTR [esp+32], xmm3 #,
call __Z6ERROR_Pc #
mov edx, DWORD PTR [esp+60] #,
movss xmm5, DWORD PTR __ZL9lightDir4 # prephitmp.90, lightDir4.x
movss xmm4, DWORD PTR __ZL9lightDir4+4 # prephitmp.90, lightDir4.y
movss xmm7, DWORD PTR __ZL9lightDir4+8 # prephitmp.90, lightDir4.z
movss xmm3, DWORD PTR [esp+32] #,
jmp L15 #
.lcomm __ZZ15ShadeTriangle3dP8TrianglejE11initialized,4,16
.data
.align 16
__ZL9lightDir1:
 # x:
.long 1045220557
 # y:
.long -1077097267
 # z:
.long -1076258406
.align 16
__ZL9lightDir2:
 # x:
.long 1056964608
 # y:
.long -1087163597
 # z:
.long 1101162086
.align 16
__ZL9lightDir3:
 # x:
.long -1090519040
 # y:
.long -1097229926
 # z:
.long -1088841318
.align 16
__ZL9lightDir4:
 # x:
.long -1090519040
 # y:
.long 1067869798
 # z:
.long 1058642330
.section .rdata,"dr"
.align 4
LC3:
.long 1053609165
.align 4
LC4:
.long 1058373894
.align 4
LC5:
.long 1053944709
.align 8
LC6:
.long -1717986918
.long 1069128089
.align 4
LC7:
.long 1054078927
.align 4
LC8:
.long 1057201838
.align 4
LC9:
.long 1054951342
.align 4
LC10:
.long 1057216266
.align 4
LC11:
.long 1054615798
.align 4
LC12:
.long 1051642875
.def __Z6ERROR_Pc; .scl 2; .type 32; .endef
 

A bit of a mess, IMO, and I think it could probably be improved with SSE, at least computationally.

 

It is a mess especially because gcc probably inlined all the code for normalize and the dots, and those are repeated 8 times.
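
In the listing above, each inlined normalize shows up as an x87 `fsqrt` followed by `rcpss` and a refinement step. One possible alternative, sketched here under the assumption that about 22 bits of precision is enough for shading, is to get the reciprocal square root directly with `_mm_rsqrt_ss` and refine it once. The name `normalize3_rsqrt` is made up for this example and is not the `normalize` from the post.

```c
#include <xmmintrin.h>  /* SSE1: _mm_rsqrt_ss */

/* Hypothetical helper: normalizes v in place.
   Returns 0 (and leaves v untouched) for a zero-length vector, 1 otherwise. */
int normalize3_rsqrt(float v[3])
{
    float len2 = v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
    if (len2 <= 0.0f)
        return 0;

    /* hardware approximation of 1/sqrt(len2), roughly 12 bits of precision */
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(len2)));

    /* one Newton-Raphson step for 1/sqrt: y = y * (1.5 - 0.5 * len2 * y * y) */
    y = y * (1.5f - 0.5f * len2 * y * y);

    v[0] *= y;
    v[1] *= y;
    v[2] *= y;
    return 1;
}
```

This trades a square root and a division for two cheap approximate steps; whether it is actually faster here would have to be measured.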

Edited by fir


<propaganda>

Keep in mind that your code is not hardware accelerated, so you won't get much better performance than that.

Also, your dump shows that your compiler IS using SSE optimizations (no y there...)

</propaganda>

Edited by Vortez


> <propaganda>
> Keep in mind that your code is not hardware accelerated, so you won't get much better performance than that.
> Also, your dump shows that your compiler IS using SSE optimizations (no y there...)
> </propaganda>

 

It is using SSE, but only scalar mnemonics. I need some advice on how to hand-optimize it with intrinsics.

 

I know I will not get much of a speedup (probably that 30% or so is all I can count on), but I would like to try it anyway if possible.


> <propaganda>
> Keep in mind that your code is not hardware accelerated, so you won't get much better performance than that.
> Also, your dump shows that your compiler IS using SSE optimizations (no y there...)
> </propaganda>

Those are the scalar operations (ending in ss), not the vectorized ones (ending in ps). You can actually do quite a lot better than that, even on the CPU, but it involves high-level optimizations (evil propaganda that never got anyone anywhere), profiling and debugging (unhandy), reading up on stuff (too much of an inconvenience) and googling (boring, can't be engaged in a discussion).
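
To make the ss/ps distinction concrete, here is a small sketch with SSE1 intrinsics. The function names `add4_scalar` and `add4_packed` are invented for the example. Both produce the same four sums, but the first maps to four `addss`-style operations and the second to a single `addps`-style one.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Scalar form: each _mm_add_ss touches lane 0 only, so four sums need four adds. */
void add4_scalar(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; ++i) {
        __m128 r = _mm_add_ss(_mm_load_ss(a + i), _mm_load_ss(b + i));
        _mm_store_ss(out + i, r);
    }
}

/* Packed form: one _mm_add_ps handles all four lanes at once. */
void add4_packed(const float *a, const float *b, float *out)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

The listing earlier in the thread consists almost entirely of the first kind (`movss`, `mulss`, `addss`, `maxss`), which is why it gains little from the SSE registers.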


 

> > <propaganda>
> > Keep in mind that your code is not hardware accelerated, so you won't get much better performance than that.
> > Also, your dump shows that your compiler IS using SSE optimizations (no y there...)
> > </propaganda>
>
> Those are the scalar operations (ending in ss), not the vectorized ones (ending in ps). You can actually do quite a lot better than that, even on the CPU, but it involves high-level optimizations (evil propaganda that never got anyone anywhere), profiling and debugging (unhandy), reading up on stuff (too much of an inconvenience) and googling (boring, can't be engaged in a discussion).

 

 

A reasonable opinion at least. (The point about profiling does not apply, as I do a lot of measurements and profiling.) This is an interesting topic in general, so why not discuss it?

 

I found something in the archive:

 

http://www.gamedev.net/topic/648061-performance-optimization-sse-vector-dotnormalize/

 

It does not compile for me, though, as it uses SSE 4.1 and I am forced to use only SSE 1 and 2.
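
A dot product does not strictly need the SSE 4.1 `_mm_dp_ps` instruction; it can be built from a multiply plus two shuffle-and-add steps that exist in SSE1. The sketch below is one common way to do it. The names `dot4_sse1` and `dot3_sse1` are made up for this example and are not taken from the linked thread.

```c
#include <xmmintrin.h>  /* SSE1 only */

/* Hypothetical helper: sum of the four lane-wise products, SSE1 instructions only. */
float dot4_sse1(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b);                    /* {m0, m1, m2, m3} */
    __m128 t = _mm_add_ps(m, _mm_movehl_ps(m, m));  /* lane0 = m0+m2, lane1 = m1+m3 */
    __m128 s = _mm_add_ss(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(s);                        /* lane0 = m0+m1+m2+m3 */
}

/* 3-component wrapper: the fourth lane is zero, so it adds nothing to the sum. */
float dot3_sse1(const float a[3], const float b[3])
{
    __m128 va = _mm_set_ps(0.0f, a[2], a[1], a[0]);
    __m128 vb = _mm_set_ps(0.0f, b[2], b[1], b[0]);
    return dot4_sse1(va, vb);
}
```

The horizontal add at the end is the expensive part, which is one reason layouts that avoid horizontal operations altogether tend to perform better.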

Edited by fir


Well, I'm only transforming by the model matrix because this lighting is world-space lighting only (it does not depend on the camera at all), so no camera or projection transform is needed.

 

as to transforming the normal maybe i could do that, but forgot a bit about this... - is normal just transformed the same way as the corresponding vertices?

 

what do you mean by "instead of transforming per triangle, why not transform per vertex?" i do not understood that.. later i could test if some of it will speed up - but anyway some sse would not harm and also is not so hard i think (if someone has some ground up in working with it, )

 

edit

 

well i tested the code to transform only the normal 9by simply multipling it by model matrix) but the result is different - need to check up how to transform normals

 

edit

 

this is probably i had something spoiled with the left-rand handnesses 

becouse it is the same but mirrored, probably i can omit this
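On whether a normal is transformed like a vertex, a rough sketch may help. A normal is a direction, so only the rotation part of the model matrix applies and the translation is dropped. This assumes the matrix is a pure rotation (orthonormal rows); with non-uniform scale the inverse transpose would be needed instead. `Vec3`, `Mat3` and `transformNormal` are names made up for this example, with `right`/`up`/`dir` mirroring the modelRight/modelUp/modelDir vectors that appear in this thread.

```cpp
struct Vec3 { float x, y, z; };          // stand-in for the thread's float3

// Rows of the 3x3 rotation part of a model matrix (hypothetical layout).
struct Mat3 { Vec3 right, up, dir; };

static inline float dot3(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Transform a direction: rotation only, no translation added.
static inline Vec3 transformNormal(const Mat3 &m, Vec3 n) {
    Vec3 r = { dot3(n, m.right), dot3(n, m.up), dot3(n, m.dir) };
    return r;
}
```

If the matrix also contains a mirror (determinant -1), the cross product of the transformed edges comes out with the opposite sign to the transformed normal, which might be one source of a mirrored-looking result.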

Edited by fir


P.S. I looked at it, and transforming the normal instead of the whole triangle is OK. Maybe I should even precompute some of this, but I'm not sure, as reading 100k normals also takes time.

But what other code has changed? You only changed the ifs to min; maybe that is a bit of an improvement, but apart from this everything seems to be the same.

To improve this I would need to use some intrinsics, I think.
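For what it is worth, the if-to-min change maps directly onto intrinsics. A minimal sketch, assuming the values being clamped are colour intensities in the 0..255 range (`clamp0to255` is a made-up name):

```cpp
#include <xmmintrin.h>  // SSE1

// Clamp four floats to [0, 255] with no branches:
// max against 0 removes negatives, min against 255 removes overshoot.
static inline __m128 clamp0to255(__m128 v) {
    return _mm_min_ps(_mm_max_ps(v, _mm_setzero_ps()), _mm_set1_ps(255.0f));
}
```

The point of min/max over ifs is that all four lanes take the same instruction path, which is what lets four values be handled together.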


<propaganda>
Keep in mind that your code is not hardware accelerated, so you won't get much better performance than that.
Also, your dump shows that your compiler IS using SSE optimizations (no y there...)
</propaganda>


My apologies, I hit the wrong button and down-voted this by mistake - I think it's an excellent contribution to the topic.


I have started trying some intrinsics, but as it all comes with rules (alignment, loading, storing, etc.) I'm not sure when I will get results. Tomorrow I will probably spend the whole day on this.
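The alignment and load/store rules are fairly short in practice. Here is a sketch, with the struct and function names invented for the example and `alignas` assuming a C++11 compiler (older GCC would use `__attribute__((aligned(16)))` instead):

```cpp
#include <xmmintrin.h>  // SSE1

// Storage that is guaranteed to start on a 16-byte boundary.
// On older GCC: struct AlignedFloat4 { float v[4]; } __attribute__((aligned(16)));
struct alignas(16) AlignedFloat4 { float v[4]; };

// Aligned load/store: the address must be 16-byte aligned, otherwise it faults.
static inline void scaleAligned(AlignedFloat4 &a, float s) {
    _mm_store_ps(a.v, _mm_mul_ps(_mm_load_ps(a.v), _mm_set1_ps(s)));
}

// Unaligned load/store: works for any address, may be slower on older CPUs.
static inline void scaleUnaligned(float *p, float s) {
    _mm_storeu_ps(p, _mm_mul_ps(_mm_loadu_ps(p), _mm_set1_ps(s)));
}
```

A reasonable way to start is with the unaligned versions everywhere, get the results right, and only then switch hot paths to aligned data.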


 

Here's a pretty direct translation of your code to SSE.  The big change is that in order to get good speed without huge changes to the algorithm, this code requires the data for 4 triangles to be packed in a transposed {xxxx, yyyy, zzzz} structure-of-arrays format so that 4 triangles can be processed in a single function call.  All of the functions work in SOA form so that they can resemble scalar math while actually working on 4 values at once.  This function takes 4 packed triangles and returns 4 8-bit colors packed as {rrrr,gggg,bbbb,0000}.

 

I have not checked this code for correctness.  There are probably bugs.  I will not be available to answer further questions about this code.  Sorry.

#include <emmintrin.h>  // SSE2 (includes SSE1); nothing below needs the SSE3 header pmmintrin.h

struct FourVec3s { __m128 x; __m128 y; __m128 z; };
struct FourTris  { FourVec3s a; FourVec3s b; FourVec3s c; __m128i colors; };

// transposed
static FourVec3s lightDirs = {{0.2, 0.5, -0.5, -0.5},
                              {-1.6,-0.7,-0.3, 1.3},
                              {-1.7, 20.3,-0.6, 0.6}};

// transposed
static FourVec3s lightColors = {{.4,   .4145, .584,  .41 },
                                {.414,  .451, .51414,.44},
                                {.515,  .543, .43,   .3414}};

static __m128 modelRight = {1.0, 0.0, 0.0, 0.0};
static __m128 modelUp    = {0.0, 1.0, 0.0, 0.0};
static __m128 modelDir   = {0.0, 0.0, 1.0, 0.0};
static __m128 modelPos   = {0.0, 0.0, 0.0, 1.0};


static inline __m128 splatX(__m128 v) { return _mm_shuffle_ps(v,v,_MM_SHUFFLE(0,0,0,0)); }
static inline __m128 splatY(__m128 v) { return _mm_shuffle_ps(v,v,_MM_SHUFFLE(1,1,1,1)); }
static inline __m128 splatZ(__m128 v) { return _mm_shuffle_ps(v,v,_MM_SHUFFLE(2,2,2,2)); }
static inline __m128 add(__m128 l, __m128 r) { return _mm_add_ps(l, r); }
static inline __m128 sub(__m128 l, __m128 r) { return _mm_sub_ps(l, r); }
static inline __m128 mul(__m128 l, __m128 r) { return _mm_mul_ps(l, r); }
static inline __m128 andBits(__m128 l, __m128 r) { return _mm_and_ps(l, r); } // "and" is a reserved token in C++
static inline __m128 less(__m128 l, __m128 r) { return _mm_cmplt_ps(l, r); }
static inline __m128 dot(const FourVec3s &l, const FourVec3s &r) { return add(add(mul(l.x,r.x), mul(l.y,r.y)), mul(l.z,r.z)); }

// unpack 8-bit RGBA RGBA RGBA RGBA (assuming R is the low byte of each 32-bit pixel) into 32-bit RRRR, GGGG or BBBB
static inline __m128i unpackR(__m128i iv) { return _mm_and_si128(iv, _mm_set1_epi32(0xFF)); }
static inline __m128i unpackG(__m128i iv) { return _mm_and_si128(_mm_srli_epi32(iv, 8), _mm_set1_epi32(0xFF)); }
static inline __m128i unpackB(__m128i iv) { return _mm_and_si128(_mm_srli_epi32(iv, 16), _mm_set1_epi32(0xFF)); }
static inline __m128 intsToFloats(__m128i iv) { return _mm_cvtepi32_ps(iv); }
static inline __m128i floatToInts(__m128 fv) { return _mm_cvttps_epi32(fv); }
// final step uses unsigned saturation (0..255); _mm_packs_epi16 would clamp to -128..127
static inline __m128i packAndSaturate32To8(__m128i r ,__m128i g, __m128i b, __m128i a) { return _mm_packus_epi16(_mm_packs_epi32(r,g),_mm_packs_epi32(b,a)); }
 

static inline FourVec3s normalizeFourVec3s(const FourVec3s &v) {
    __m128 length = _mm_sqrt_ps(add(add( mul(v.x,v.x), mul(v.y,v.y)), mul(v.z,v.z) )); 
    FourVec3s result = { _mm_div_ps(v.x,length), _mm_div_ps(v.y,length), _mm_div_ps(v.z,length) };
    return result;
}

__m128i Shade4Triangles(const FourTris &tris) {
    __m128 x1 = add(add(add( mul(sub(tris.a.x, splatX(modelPos)), splatX(modelRight)),  //  (*triangle).a.x - modelPos.x)*modelRight.x +
                             mul(sub(tris.a.y, splatY(modelPos)), splatY(modelRight))), // ((*triangle).a.y - modelPos.y)*modelRight.y +
                             mul(sub(tris.a.z, splatZ(modelPos)), splatZ(modelRight))), // ((*triangle).a.z - modelPos.z)*modelRight.z) +
                             splatX(modelPos));                                         // modelPos.x
    __m128 y1 = add(add(add( mul(sub(tris.a.x, splatX(modelPos)), splatX(modelUp)),
                             mul(sub(tris.a.y, splatY(modelPos)), splatY(modelUp))),
                             mul(sub(tris.a.z, splatZ(modelPos)), splatZ(modelUp))),
                             splatY(modelPos));
    __m128 z1 = add(add(add( mul(sub(tris.a.x, splatX(modelPos)), splatX(modelDir)),
                             mul(sub(tris.a.y, splatY(modelPos)), splatY(modelDir))),
                             mul(sub(tris.a.z, splatZ(modelPos)), splatZ(modelDir))),
                             splatZ(modelPos));
    __m128 x2 = add(add(add( mul(sub(tris.b.x, splatX(modelPos)), splatX(modelRight)),
                             mul(sub(tris.b.y, splatY(modelPos)), splatY(modelRight))),
                             mul(sub(tris.b.z, splatZ(modelPos)), splatZ(modelRight))),
                             splatX(modelPos));
    __m128 y2 = add(add(add( mul(sub(tris.b.x, splatX(modelPos)), splatX(modelUp)),
                             mul(sub(tris.b.y, splatY(modelPos)), splatY(modelUp))),
                             mul(sub(tris.b.z, splatZ(modelPos)), splatZ(modelUp))),
                             splatY(modelPos));
    __m128 z2 = add(add(add( mul(sub(tris.b.x, splatX(modelPos)), splatX(modelDir)),
                             mul(sub(tris.b.y, splatY(modelPos)), splatY(modelDir))),
                             mul(sub(tris.b.z, splatZ(modelPos)), splatZ(modelDir))),
                             splatZ(modelPos));
    __m128 x3 = add(add(add( mul(sub(tris.c.x, splatX(modelPos)), splatX(modelRight)),
                             mul(sub(tris.c.y, splatY(modelPos)), splatY(modelRight))),
                             mul(sub(tris.c.z, splatZ(modelPos)), splatZ(modelRight))),
                             splatX(modelPos));
    __m128 y3 = add(add(add( mul(sub(tris.c.x, splatX(modelPos)), splatX(modelUp)),
                             mul(sub(tris.c.y, splatY(modelPos)), splatY(modelUp))),
                             mul(sub(tris.c.z, splatZ(modelPos)), splatZ(modelUp))),
                             splatY(modelPos));
    __m128 z3 = add(add(add( mul(sub(tris.c.x, splatX(modelPos)), splatX(modelDir)),
                             mul(sub(tris.c.y, splatY(modelPos)), splatY(modelDir))),
                             mul(sub(tris.c.z, splatZ(modelPos)), splatZ(modelDir))),
                             splatZ(modelPos));

    FourVec3s normal;
    normal.x = sub( mul(sub(y2,y1),sub(z3,z2)), mul(sub(z2,z1),sub(y3,y2)) );
    normal.y = sub( mul(sub(z2,z1),sub(x3,x2)), mul(sub(x2,x1),sub(z3,z2)) );
    normal.z = sub( mul(sub(x2,x1),sub(y3,y2)), mul(sub(y2,y1),sub(x3,x2)) );
    normal = normalizeFourVec3s(normal);

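    // Each lane holds one triangle, so the four lights are applied one at a time:
    // lane i of lightDirs / lightColors is light i, broadcast to all four triangles.
    float dx[4], dy[4], dz[4], cr[4], cg[4], cb[4];
    _mm_storeu_ps(dx, lightDirs.x);   _mm_storeu_ps(dy, lightDirs.y);   _mm_storeu_ps(dz, lightDirs.z);
    _mm_storeu_ps(cr, lightColors.x); _mm_storeu_ps(cg, lightColors.y); _mm_storeu_ps(cb, lightColors.z);

    __m128 lr = _mm_set_ps1(0.1f), lg = lr, lb = lr;   // 0.1 ambient term per channel
    for (int i = 0; i < 4; ++i) {
        FourVec3s dir = { _mm_set_ps1(dx[i]), _mm_set_ps1(dy[i]), _mm_set_ps1(dz[i]) };
        __m128 s = _mm_max_ps(dot(normal, dir), _mm_setzero_ps());   // negative contributions clamp to 0
        lr = add(lr, mul(s, _mm_set_ps1(cr[i])));
        lg = add(lg, mul(s, _mm_set_ps1(cg[i])));
        lb = add(lb, mul(s, _mm_set_ps1(cb[i])));
    }

    __m128i r = floatToInts(mul(lr,intsToFloats(unpackR(tris.colors))));
    __m128i g = floatToInts(mul(lg,intsToFloats(unpackG(tris.colors))));
    __m128i b = floatToInts(mul(lb,intsToFloats(unpackB(tris.colors))));

    return packAndSaturate32To8(r,g,b,_mm_setzero_si128());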
}
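Since the function returns the colors as {rrrr,gggg,bbbb,0000}, the caller still has to turn that back into one pixel value per triangle. A possible sketch follows; the 0x00RRGGBB pixel format and the name `unpackFourColors` are assumptions for illustration, not something the code above specifies.

```cpp
#include <emmintrin.h>  // SSE2
#include <stdint.h>

// Input bytes: r0 r1 r2 r3 | g0 g1 g2 g3 | b0 b1 b2 b3 | 0 0 0 0
// Output: one 0x00RRGGBB value per triangle (pixel format is an assumption).
static inline void unpackFourColors(__m128i packed, uint32_t out[4]) {
    uint8_t bytes[16];
    _mm_storeu_si128(reinterpret_cast<__m128i *>(bytes), packed);
    for (int i = 0; i < 4; ++i) {
        out[i] = (uint32_t(bytes[i]) << 16)      // red of triangle i
               | (uint32_t(bytes[4 + i]) << 8)   // green of triangle i
               |  uint32_t(bytes[8 + i]);        // blue of triangle i
    }
}
```

Going through a small byte buffer keeps this readable; it runs once per four triangles, so it is unlikely to matter next to the shading itself.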

 

GREAT, many thanks, I will check it. I will probably have to change the beginning to transform only the normal, and debug it if it is buggy, but I will probably manage. Thanks a lot!


P.S. By {xxxx, yyyy, zzzz}, do you mean aaaabbbbcccc, where a is a vertex, or xxxxyyyyzzzz, where x is the x of a vertex? This is confusing, because in the second case it would span into another triangle, so I don't know whether to lay it out like xxx0yyy0zzz0 or something else.

edit

All right, I understand; it is like

A: xxxx yyyy zzzz B: xxxx yyyy zzzz C: xxxx yyyy zzzz

as opposed to my previous A: xyz B: xyz C: xyz    A: xyz B: xyz C: xyz     A: xyz B: xyz C: xyz     A: xyz B: xyz C: xyz

Well, that doesn't seem like much of a problem; I will only need to change my geometry loader code slightly.
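The loader change could look roughly like this. It is only a sketch: `Vertex` and `packFour` are invented names, and `FourVec3s` is repeated from the SSE code so the snippet stands alone. Lane i of each register comes from triangle i.

```cpp
#include <xmmintrin.h>  // SSE1

struct Vertex    { float x, y, z; };       // hypothetical per-vertex layout (A: xyz)
struct FourVec3s { __m128 x, y, z; };      // same shape as in the SSE code

// Take the same corner (say, vertex A) of four triangles and
// transpose it into xxxx / yyyy / zzzz registers.
static inline FourVec3s packFour(const Vertex &v0, const Vertex &v1,
                                 const Vertex &v2, const Vertex &v3) {
    FourVec3s r;
    r.x = _mm_setr_ps(v0.x, v1.x, v2.x, v3.x);  // setr: first argument goes to lane 0
    r.y = _mm_setr_ps(v0.y, v1.y, v2.y, v3.y);
    r.z = _mm_setr_ps(v0.z, v1.z, v2.z, v3.z);
    return r;
}
```

Calling it three times (once each for corners A, B and C of the same four triangles) fills one batch. Doing this at load time rather than per frame keeps the transposition cost out of the render loop.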

 

Though, to be thorough, I should then rewrite my whole rasterization pipeline for this "4 triangles at once" layout, and that is quite a bit of rewriting.

edit

As I look at this, I see an additional complication.

The routine above is luckily straightforward, with no divergent branches of execution, so it can just munch through a chunk of 4 triangles in one step, and that's cool (except that I'm not very accustomed to the syntax and some of the semantics).

But in the earlier stages of the rasterizer (I mean between the geometry input array and the shading here) I have clipping code, and some random triangles in the stream are simply clipped out.

So I would probably need to collect the triangles that passed clipping, which involves copying, though it is probably worth it.
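One way to do that collecting step, sketched under the assumption that clipping leaves behind a per-triangle "visible" flag (the flag array, `Batch` and `makeBatches` are all invented for this example). It gathers surviving indices into groups of four rather than copying the triangle data itself:

```cpp
#include <cstddef>
#include <vector>

// Four triangle indices to shade together; count says how many are real.
struct Batch { std::size_t index[4]; int count; };

// Walk the visibility flags and emit groups of four surviving indices.
// A short last group is padded by repeating its last index, so the padded
// lanes just recompute a duplicate that the caller can ignore.
static std::vector<Batch> makeBatches(const std::vector<bool> &visible) {
    std::vector<Batch> batches;
    Batch cur = {{0, 0, 0, 0}, 0};
    for (std::size_t i = 0; i < visible.size(); ++i) {
        if (!visible[i]) continue;            // clipped out, skip
        cur.index[cur.count++] = i;
        if (cur.count == 4) { batches.push_back(cur); cur.count = 0; }
    }
    if (cur.count > 0) {
        for (int k = cur.count; k < 4; ++k) cur.index[k] = cur.index[cur.count - 1];
        batches.push_back(cur);
    }
    return batches;
}
```

Whether gathering by index or copying whole triangles is cheaper would need measuring; the index version at least keeps the extra memory traffic small.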

Edited by fir
