Okay, I'm really a sucker for this kind of stuff, so I decided to write the 4x4 swizzled version. Good news and bad news... it ends up processing 16 points in 120 cycles (not the 80 I'd hoped for). That's still a 100% speed-up (which is very hard to do over already highly-optimized SSE!), but the shortfall comes mostly from the fact that I changed the interface.
Before, I'd assumed that the Vector2s processed were one contiguous chunk of memory, but Martoon clarified that they aren't... so in this version, I opted to pass in pointers to these 4x2 structures. There's a lot of very technical stuff behind why this costs a 20-cycle stall, but it does :)
But also note that at the cost of those 20 cycles we get some really cool flexibility. You can build the entire array of SwizVec2's representing your screen (and outputs) once, up front. Then for each call you just pass in the screen center and an array of 4 pointers to whichever swizzled chunks you want to compute, and they're all independent.
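To show what that one-time setup looks like, here's a sketch of packing four loose points into one swizzled chunk. The `Vector2` type is my assumption (a plain x/y pair standing in for whatever Martoon's engine uses); `SwizVec2` is redeclared here without the alignment decoration so the sketch compiles standalone:

```cpp
#include <cassert>

// Hypothetical plain AoS vector, standing in for the engine's Vector2.
struct Vector2 { float x, y; };

// Same layout as the SwizVec2 in the asm version: four x's, then four y's.
struct SwizVec2 { float x0, x1, x2, x3; float y0, y1, y2, y3; };

// Gather four (not necessarily contiguous) Vector2s into one swizzled chunk.
// This is the one-time cost: build these once, then hand out pointers per call.
void packSwizVec2(SwizVec2* out, const Vector2* p0, const Vector2* p1,
                  const Vector2* p2, const Vector2* p3)
{
    out->x0 = p0->x; out->x1 = p1->x; out->x2 = p2->x; out->x3 = p3->x;
    out->y0 = p0->y; out->y1 = p1->y; out->y2 = p2->y; out->y3 = p3->y;
}
```

The real version would also want the `__declspec(align(16))` on `SwizVec2`, since `movaps` faults on unaligned loads.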
Also, note that we're down to 8 cycles per point now, and that includes calculating an inverse 4th root for each point! Any performance gains beyond this will come from determining what area you're going to compute, then prefetching it into the L1 cache before issuing these calls. There are more dangerous demons than stall cycles, and they live on "The BUS" :)
code:
__declspec(align(16)) struct SwizVec2
{
    float x0, x1, x2, x3;
    float y0, y1, y2, y3;

    friend std::ostream& operator << (std::ostream& out, const SwizVec2& vec);
};

std::ostream& operator << (std::ostream& out, const SwizVec2& vec)
{
    out << "[ ( " << vec.x0 << ",\t" << vec.y0 << " ) \t"
        << " ( " << vec.x1 << ",\t" << vec.y1 << " ) \t"
        << " ( " << vec.x2 << ",\t" << vec.y2 << " ) \t"
        << " ( " << vec.x3 << ",\t" << vec.y3 << " ) ]" << std::endl;
    return out;
}

inline void transform16PointsSwizzled(SwizVec2** out, const SwizVec2** in, const SwizVec2* screenCenter)
{
    SwizVec2 t0;
    SwizVec2 t1;
    SwizVec2 t2;
    SwizVec2 t3;

    __asm
    {
        mov eax, in
        mov ecx, [eax + 04h]
        mov edi, [eax + 08h]
        mov esi, [eax + 0ch]
        mov eax, [eax + 00h]
        mov edx, screenCenter

        // Good Idea #1: Don't load screenCenter into a register. As long as it
        // is aligned on a 16-byte boundary, we can use it directly as the 2nd
        // argument to most *ps instructions. This is good news! Register
        // economy is key.

        // x's first
        movaps xmm0, [eax]
        movaps xmm1, [ecx]
        movaps xmm2, [edi]
        movaps xmm3, [esi]
        subps xmm0, [edx]          // subtract screen center. The same x,y values
                                   // are replicated into all 4 screen center
                                   // coordinates.

        // start moving in y's also.
        movaps xmm4, [eax + 10h]
        subps xmm1, [edx]
        movaps xmm5, [ecx + 10h]
        subps xmm2, [edx]
        movaps xmm6, [edi + 10h]
        subps xmm3, [edx]
        movaps xmm7, [esi + 10h]

        movaps [t0], xmm0
        movaps [t1], xmm1
        movaps [t2], xmm2
        movaps [t3], xmm3

        mulps xmm0, xmm0
        subps xmm4, [edx + 10h]
        mulps xmm1, xmm1
        subps xmm5, [edx + 10h]
        mulps xmm2, xmm2
        subps xmm6, [edx + 10h]
        mulps xmm3, xmm3
        subps xmm7, [edx + 10h]
        mulps xmm4, xmm4
        mulps xmm5, xmm5
        mulps xmm6, xmm6
        mulps xmm7, xmm7

        // now we have x^2 in xmm0-3 and y^2 in xmm4-7, so add.
        addps xmm0, xmm4
        addps xmm1, xmm5
        addps xmm2, xmm6
        addps xmm3, xmm7

        // start reloading y's. In the previous implementation, we saved off
        // the entire (p0 - ScreenCenter) calculation. That requires moving
        // a lot of memory around just to save a simple subtraction! Here,
        // it turns out to be a good idea to save just the x-components and
        // recalculate the y components. The 'why' is somewhat complicated,
        // but involves filling instruction latency and the fact that we start
        // off using all 8 xmm registers, then shrink down to using only 4
        // during the rsqrt/rcp, then expand back to 8 to calculate p1 finally.
        // There's *always* a good way to use 4 vacant registers...
        movaps xmm4, [eax + 10h]
        movaps xmm5, [ecx + 10h]
        movaps xmm6, [edi + 10h]
        movaps xmm7, [esi + 10h]

        // start inverse 4th root. (x = x^(-1/2))
        rsqrtps xmm0, xmm0
        rsqrtps xmm1, xmm1
        rsqrtps xmm2, xmm2
        rsqrtps xmm3, xmm3

        // recomputing (p0 - screenCenter).y here, also filling pipeline stalls.
        subps xmm4, [edx + 10h]
        subps xmm5, [edx + 10h]
        subps xmm6, [edx + 10h]
        subps xmm7, [edx + 10h]

        // AGAIN! (x = x^(1/4))
        rsqrtps xmm0, xmm0
        rsqrtps xmm1, xmm1
        rsqrtps xmm2, xmm2
        rsqrtps xmm3, xmm3

        // and reciprocal to negate the exponent: (x = x^(-1/4))
        rcpps xmm0, xmm0
        rcpps xmm1, xmm1
        rcpps xmm2, xmm2
        rcpps xmm3, xmm3

        // stalls are building up, good time to reassign our addressing regs.
        // note: this doesn't help much, these all dispatch and execute
        // in just 2 cycles!
        mov eax, out
        mov ecx, [eax + 04h]
        mov edi, [eax + 08h]
        mov esi, [eax + 0ch]
        mov eax, [eax + 00h]

        mulps xmm4, xmm0
        mulps xmm5, xmm1
        mulps xmm6, xmm2
        mulps xmm7, xmm3
        mulps xmm0, [t0]
        mulps xmm1, [t1]
        mulps xmm2, [t2]
        mulps xmm3, [t3]
        addps xmm0, [edx]
        addps xmm1, [edx]
        addps xmm2, [edx]
        addps xmm3, [edx]
        addps xmm4, [edx + 10h]
        addps xmm5, [edx + 10h]
        addps xmm6, [edx + 10h]
        addps xmm7, [edx + 10h]

        movaps [eax], xmm0
        movaps [ecx], xmm1
        movaps [edi], xmm2
        movaps [esi], xmm3
        movaps [eax + 10h], xmm4
        movaps [ecx + 10h], xmm5
        movaps [edi + 10h], xmm6
        movaps [esi + 10h], xmm7
    }
}
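To spell out the exponent math in those comments: rsqrtps computes x^(-1/2), so applying it twice gives (x^(-1/2))^(-1/2) = x^(1/4), and rcpps then flips the sign of the exponent to x^(-1/4). Here's a scalar sketch using exact math (keep in mind the hardware instructions are only ~12-bit approximations, so the real results will be slightly off):

```cpp
#include <cassert>
#include <cmath>

// Exact-math stand-ins for the approximate SSE instructions.
float rsqrt(float x) { return 1.0f / std::sqrt(x); } // rsqrtps: x^(-1/2)
float rcp(float x)   { return 1.0f / x; }            // rcpps:   x^(-1)

// The composition used in the asm: x -> x^(-1/2) -> x^(1/4) -> x^(-1/4).
float invFourthRoot(float x) { return rcp(rsqrt(rsqrt(x))); }
```

For example, invFourthRoot(16) = 16^(-1/4) = 0.5.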
called like:
SwizVec2 someVecs[8] =
{
    {  1,  2,  3,  4,  5,  6,  7,  8 },
    { 11, 12, 13, 14, 15, 16, 17, 18 },
    { 21, 22, 23, 24, 25, 26, 27, 28 },
    { 31, 32, 33, 34, 35, 36, 37, 38 },
    { 41, 42, 43, 44, 45, 46, 47, 48 },
    { 51, 52, 53, 54, 55, 56, 57, 58 },
    { 61, 62, 63, 64, 65, 66, 67, 68 },
    { 71, 72, 73, 74, 75, 76, 77, 78 },
};
SwizVec2 someOuts[8];
SwizVec2 screen = { 10, 10, 10, 10, 40, 40, 40, 40 };

const SwizVec2* inputs[4]  = { &someVecs[0], &someVecs[3], &someVecs[1], &someVecs[7] };
SwizVec2*       outputs[4] = { &someOuts[0], &someOuts[3], &someOuts[1], &someOuts[7] };

transform16PointsSwizzled(outputs, inputs, &screen);
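If you want to sanity-check the SIMD output, here's my reading of what the asm computes, as a plain scalar version (one point at a time instead of 16): take the offset from the screen center, scale it by the inverse 4th root of its squared length, and add the center back. The `Point` type is just an illustration:

```cpp
#include <cassert>
#include <cmath>

struct Point { float x, y; };

// Scalar reference for one point; the asm version does 16 of these per call.
// Note it blows up (divide by zero) for a point exactly at the center, same
// as the rsqrt path would.
Point transformPoint(Point p, Point center)
{
    float dx = p.x - center.x;
    float dy = p.y - center.y;
    float f  = std::pow(dx * dx + dy * dy, -0.25f); // (dx^2 + dy^2)^(-1/4)

    Point result;
    result.x = center.x + dx * f;
    result.y = center.y + dy * f;
    return result;
}
```

For example, a point 4 units above the center gets scaled by 16^(-1/4) = 0.5, landing 2 units above it. (The hardware rsqrtps/rcpps approximations mean the asm results will differ from this in the low bits.)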
So the SSE turns out very lovely. I mean, Picasso couldn't paint a pipeline this beautiful (well, he could, but it would have some naked fat chicks lounging about...)
I decided to crop off the nasty pointer-to-pointer stalls in order to make this image fit on the screen. If you refer to the assembly, you should be able to see where it picks up and leaves off...
Martoon, this will only work on processors equipped with SSE. That's every Pentium III and newer, and every Athlon XP and newer (basically 95% of processors in homes today); the original Athlon only had 3DNow!. I never use SSE2+ because it shuts out many processors that folks still have, and there are almost no benefits here. Additionally, while the underlying assembly will work on x86 GCC, it would have to be reformatted into GCC's inline-asm syntax, which is both frustrating and powerful (in this case, mostly frustrating). If you have any questions, ask away.
[Edited by - ajas95 on August 21, 2005 10:56:47 AM]