quote:philscott
Statistics are bollocks and you can always find sets which say the opposite of other sets. I'd have a guess that you are biased towards assembler because you've obviously invested a lot of effort in it. Intel are biased against assembler because they want to sell their compiler. But if you want to compete with them, check out their website and look up optimisation for SSE2.
Yeah, I guess
I'm not too interested in SSE2 (don't need doubles, don't have a P4), but I find their statement "we will beat all asm coders" a bit much.
quote:philscott
When it comes to optimisation for each processor, aren't there a massive number of variables to take into consideration? To be an effective assembly programmer, surely you've got to have a natural knack for being able to work with that many variables and constructs and stuff. Or is it easier than it seems? I can't see how the average person on this forum would be up to it.
Dunno - I've been doing it for a while.
My advice is to ignore the optimization part until you have a good grasp of assembly language, and then go crazy (learn the microarchitecture, read the docs, ...).
quote:*urgh*.. if you would use asm on p4 you would throw every athlon into your garbage collector..
and why oh why is it still the fastest processor?
I'm sure you are aware that that depends on the benchmark/app. You are correct, though, in that newer P4s sometimes outperform Athlons. The main reason I bought my Athlon was performance / price.
quote:does your app not need to run on pc's? actually i know.. uhm.. 3 persons owning an amd here. i know tens of persons running p3 or p4.
In my case, it's the exact opposite, but that's irrelevant. As I said, my primary interest is getting stuff to run fast on /my/ machine.
quote:and beat my vc7 with equivalent (x86) asm. try it. i don't believe you've written optimal c/c++ code at all..
VC7 does not make any significant difference. I'm also sick of hearing 'my C code must be horribly broken'. Here it is:
static inline void tstrip_append(const uint i, const uint p, const VP& v)
{
    if(i == l1 || i == l2)
        return;
    if(parity == p)
    {
        *idx = idx[-2];
        idx++;
    }
    else
    {
        parity = p;
        l2 = l1;
    }
    l1 = i;
    *idx++ = last_idx++;
    vb[0] = (float)v.x; vb[1] = (float)v.y; vb[2] = v.z;
    vb += 4;    /* V3F, stride = 16 */
}

static void refineR(uint i, uint j, VP vl, VP va, VP vr, uint32 l_vis)
{
    {   /* avoid VC++ "local var. init. skipped" error by making locals
           go out of scope after "goto skip" */
start:
        if(((--l_vis) & 0xffff) == 0)   // if(--l == 0)
            goto skip;
        const float e = (float)v[j].e, r = (float)v[j].r, z = (float)v[j].dz;
        VP vm;
        vm.x = (vl.x+vr.x)/2; vm.y = (vl.y+vr.y)/2; vm.z = (vl.z+vr.z)/2;

        /* MORPH / ACTIVE */
        float dx = vm.x-E[0], dy = vm.y-E[1], dz = vm.z-E[2];
        float d = dx*dx + dy*dy + dz*dz;
        float dmax = e*nu[0] + r;
        dmax *= dmax;
        dmax -= d;
        if(*(int32*)&dmax <= 0)     /* IEEE-754 specific: if(dmax <= 0) ... */
            goto skip;              /* faster than fcomp, fnstsw */
        float dmin = e*nu[1] + r;
        dmin *= dmin;
        d -= dmin;
        vm.z += (*(uint32*)&d <= 0x80000000)? z * dmax / (dmax+d) : z; /* (d >= 0)? ... : z */

        /* CULL */
        if(l_vis < 0x001f0000)  /* not completely inside all planes; NOTE: far plane is ignored */
            for(uint n = 0, bit = (1 << 16); n < 5; n++, bit += bit)
                if(!(l_vis & bit))
                {
                    d = vm.x*frustum[n][0] + vm.y*frustum[n][1] + vm.z*frustum[n][2] + frustum[n][3];
                    if(d > r)       /* completely outside */
                        goto skip;
                    if(-d > r)      /* completely inside */
                        l_vis |= bit;
                }

        /* VISIT CHILDREN (IN ORDER) */
        uint a = (i+i)+j, b = (i << 2)-11;
        refineR(j, ((a-10)&3)+b, va, vm, vr, l_vis);
        tstrip_append(i, l_vis&1, va);  /* index, parity, vertex coords */
        /* TRE: refineR(j, ((a-9)&3)+b, vl, vm, va, l, visible); */
        i = j; j = ((a-9)&3)+b;
        vr = va; va = vm;
        goto start;
    }
skip:
    tstrip_append(i, l_vis&1, va);  /* index, parity, vertex coords */
}
I'd love to hear why this is _2.5x slower_ than my asm code.
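For anyone puzzled by the `if(*(int32*)&dmax <= 0)` line in refineR: here's a minimal sketch of that sign-bit trick in isolation. The function name is mine; I've also spelled the reinterpretation with memcpy, which is the aliasing-safe equivalent of the pointer cast (compilers turn it into the same single integer compare).

```c
#include <stdint.h>
#include <string.h>

/* Sign-bit comparison trick from refineR above, IEEE-754 specific.
 * For any non-NaN float f, reading its bits as a signed 32-bit int gives:
 *     bits <= 0   <=>   f <= 0.0f
 * Negative floats have the sign bit set, so bits < 0; +0.0f is all zero
 * bits; -0.0f is INT32_MIN, which is also <= 0. This avoids fcomp/fnstsw
 * on x87 by doing the test in the integer unit. */
static int le_zero_bits(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* aliasing-safe *(int32*)&f */
    return bits <= 0;
}
```

The `*(uint32*)&d <= 0x80000000` test in the same function is the unsigned twin of this: it accepts exactly the bit patterns of non-negative floats plus -0.0, i.e. `d >= 0`.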
Then again, VC7 does not emit 3DNow! Pro code. You lose. Case closed.
quote:knowing asm is good, but using it 99% of the time unneeded with any newer compiler.
If you never write anything time critical, asm optimization won''t be necessary; otherwise, it better be as fast as possible.
quote:and your statement crappy hw p4 simply drops you out of the reliable programmers list. bye
Oh well. I will attempt to enlighten you as to why I call the P4 crappy hardware:
Design decision: insanely long pipes (20 stages) to allow higher clock rates (which is why Intel is hyping GHz). Mispredicted branches will really cost you.
Results of a last minute reduction in die size to save costs:
1) 8 KB L1 data cache. Apps with a tiny working set benefit from faster L1 access (2 clocks vs. 3), but I don't know of any such apps. Baad.
2) Only 1 decoder. I-cache misses are painful; also, you better pray your code is in the trace cache.
3) no barrel shifter. This is just insane.
4) on top of that, partial register stalls are 'avoided' by having 8-bit register accesses go through the shift unit (5-6 clocks!!). Also, no more AGU.
5) insufficient trace cache throughput - 3 µops / clock
6) finally, there''s only 1 FP and 1 MMX unit.
But you knew that, right?
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3