Should I be learning assembly language...

38 comments, last by philscott 21 years, 5 months ago
quote:"I said, 'John, we can't possibly simulate a human programmer with a language - this language - that would produce machine code that would even approach the efficiency of a human programmer like me, for example,'"

Hmm. I understood efficiency to mean 'programming time'. If not, I guess you are correct in pointing out that compilers, even after 50 years, are not yet good enough.

quote:philscott
Surely you can't produce a game which is only optimised for the AMD processor, which sounds like it is all you know. This is bollocks.

Fortunately, at the moment, I do not have to worry about my programs' performance on crappy hardware (including P4s). Still, I know SSE(2), and optimization guidelines for Pentium 1/2/3/4, K6-2/3, and Athlon (XP) CPUs; I have an Athlon XP (the microarchitecture of which I like, and know inside out), so that is where I concentrate my efforts.

quote:philscott
My comment about the Intel compiler is just based on what I've read. From their statistics, which I consider to be no better than yours, they say that their Vectoriser can produce code for the SSE2 extensions better than any assembly programmer.

You are comparing my actual performance data with marketroid 'statistics'? Note: I only care about making my junk run fast, not whether or not I had to resort to asm to do so, so I can afford to be honest. I suggest you think about how this is supposed to be possible - I know SSE(2) and find it hard to believe (do I smell a contest?).

quote:philscott
I can't see how you can be telling prospective game developers to start using assembler when your knowledge is restricted to a single processor. I don't know what the shite is going on here. This is getting more and more confusing.

It is obviously not. Does that help your confusion?

quote:philscott
But the bottom line is that I think I'd rather become proficient in C++ and keep my options open, than learn the full assembly extensions and optimisation options just in case it might be able to speed tiny segments of code up.

Being proficient in C++ is a prerequisite.
BTW, it's pretty neat to speed up tiny segments of code (especially if these are the bottlenecks) by 150%.
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
quote:Original post by Jan Wassenberg
Hmm. I understood efficiency to mean 'programming time'. If not, I guess you are correct in pointing out that compilers, even after 50 years, are not yet good enough.

The fact that FORTRAN was considered a huge success in no way makes the point that FORTRAN was not good enough. Obviously something went wrong with your deductive reasoning in expanding the anecdote.
Oh please. What is it with you?
We are talking about whether or not compilers are "good enough" to make hand coding asm unnecessary, not the success of FORTRAN.
Okay Wassenberg, the comments are fair enough. From what you were saying it sounded a lot like you totally ignored Intel and only knew AMD.

Statistics are bollocks and you can always find sets which say the opposite of other sets. I'd have a guess that you are biased towards assembler because you've obviously invested a lot of effort in it. Intel are biased against assembler because they want to sell their compiler. But if you want to compete with them, check out their website and look up optimisation for SSE2.

When it comes to optimisation for each processor, aren't there a massive number of variables to take into consideration? To be an effective assembly programmer, surely you've got to have a natural knack for being able to work with that many variables and constructs and stuff. Or is it easier than it seems? I can't see how the average person on this forum would be up to it.

quote:Original post by Jan Wassenberg Fortunately, at the moment, I do not have to worry about my programs' performance on crappy hardware (including P4s).

*urgh*.. if you would use asm on p4 you would throw every athlon into your garbage collector..

and why oh why is it still the fastest processor?

does your app not need to run on pc's? actually i know.. uhm.. 3 persons owning an amd here. i know tens of persons running p3 or p4.

and beat my vc7 with equivalent (x86) asm. try it. i don't believe you've written optimal c/c++ code at all..

you're babbling a lot in here, but i don't see much point. knowing asm is good, but using it is unneeded 99% of the time with any newer compiler.

and your 'crappy hw p4' statement simply drops you out of the reliable programmers list. bye

"take a look around" - limp bizkit
www.google.com
If that's not the help you're after then you're going to have to explain the problem better than what you have. - joanusdmentia

My Page davepermen.net | My Music on Bandcamp and on Soundcloud

quote:Original post by Jan Wassenberg
We are talking about whether or not compilers are "good enough" to make hand coding asm unnecessary, not the success of FORTRAN.

Yes, that's correct. I'm not talking about the success of FORTRAN either.
quote:philscott
Statistics are bollocks and you can always find sets which say the opposite of other sets. I'd have a guess that you are biased towards assembler because you've obviously invested a lot of effort in it. Intel are biased against assembler because they want to sell their compiler. But if you want to compete with them, check out their website and look up optimisation for SSE2.

Yeah, I guess I'm not too interested in SSE2 (don't need doubles, don't have a P4), but I find their statement "we will beat all asm coders" a bit much.

quote:philscott
When it comes to optimisation for each processor, aren't there a massive number of variables to take into consideration? To be an effective assembly programmer, surely you've got to have a natural knack for being able to work with that many variables and constructs and stuff. Or is it easier than it seems? I can't see how the average person on this forum would be up to it.

Dunno - I've been doing it for a while.
My advice is to ignore the optimization part until you have a good grasp of assembly language, and then go crazy (learn the microarchitecture, read the docs, ...).


quote:*urgh*.. if you would use asm on p4 you would throw every athlon into your garbage collector..
and why oh why is it still the fastest processor?

I'm sure you are aware that that depends on the benchmark/app. You are correct, though, in that newer P4s sometimes outperform Athlons. The main reason I bought my Athlon was performance / price.

quote:does your app not need to run on pc's? actually i know.. uhm.. 3 persons owning an amd here. i know tens of persons running p3 or p4.

In my case, it's the exact opposite, but that's irrelevant. As I said, my primary interest is getting stuff to run fast on /my/ machine.

quote:and beat my vc7 with equivalent (x86) asm. try it. i don't believe you've written optimal c/c++ code at all..

VC7 does not make any significant difference. I'm also sick of hearing 'my C code must be horribly broken'. Here it is:

  static inline void tstrip_append(const uint i, const uint p, const VP& v)
  {
      if(i == l1 || i == l2)
          return;
      if(parity == p)
      {
          *idx = idx[-2];
          idx++;
      }
      else
      {
          parity = p;
          l2 = l1;
      }
      l1 = i;
      *idx++ = last_idx++;
      vb[0] = (float)v.x; vb[1] = (float)v.y;
      vb[2] = v.z;
      vb += 4;    /* V3F, stride = 16 */
  }

  static void refineR(uint i, uint j, VP vl, VP va, VP vr, uint32 l_vis)
  {
      {   /* avoid VC++ "local var. init. skipped" error by
             making locals go out of scope after "goto skip" */
  start:
      if(((--l_vis)&0xffff) == 0)     // if(--l == 0)
          goto skip;
      const float e = (float)v[j].e, r = (float)v[j].r, z = (float)v[j].dz;
      VP vm; vm.x = (vl.x+vr.x)/2; vm.y = (vl.y+vr.y)/2; vm.z = (vl.z+vr.z)/2;

      /* MORPH / ACTIVE */
      float dx = vm.x-E[0], dy = vm.y-E[1], dz = vm.z-E[2];
      float d = dx*dx + dy*dy + dz*dz;
      float dmax = e*nu[0] + r; dmax *= dmax;
      dmax -= d;
      if(*(int32*)&dmax <= 0)     /* IEEE-754 specific: if(dmax <= 0) ... */
          goto skip;              /*   faster than fcomp, fnstsw */
      float dmin = e*nu[1] + r; dmin *= dmin;
      d -= dmin;
      vm.z += (*(uint32*)&d <= 0x80000000)? z * dmax / (dmax+d) : z; /* (d >= 0)? ... : z */

      /* CULL */
      if(l_vis < 0x001f0000)      /* not completely inside all planes; NOTE: far plane is ignored */
          for(uint n = 0, bit = (1 << 16); n < 5; n++, bit += bit)
              if(!(l_vis & bit))
              {
                  d = vm.x*frustum[n][0] + vm.y*frustum[n][1] + vm.z*frustum[n][2] + frustum[n][3];
                  if(d > r)       /* completely outside */
                      goto skip;
                  if(-d > r)      /* completely inside */
                      l_vis |= bit;
              }

      /* VISIT CHILDREN (IN ORDER) */
      uint a = (i+i)+j, b = (i << 2)-11;
      refineR(j, ((a-10)&3)+b, va, vm, vr, l_vis);
      tstrip_append(i, l_vis&1, va);  /* index, parity, vertex coords */
      /* TRE: refineR(j, ((a-9)&3)+b, vl, vm, va, l, visible); */
      i = j;
      j = ((a-9)&3)+b;
      vr = va;
      va = vm;
      goto start;
      }
  skip:
      tstrip_append(i, l_vis&1, va);  /* index, parity, vertex coords */
  }

I'd love to hear why this is _2.5x slower_ than my asm code.
Then again, VC7 does not emit 3DNow! Pro code. You lose. Case closed.
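
For anyone puzzling over the `*(int32*)&dmax <= 0` and `*(uint32*)&d <= 0x80000000` lines in the code above, here is a minimal standalone sketch of that IEEE-754 trick. It assumes 32-bit IEEE-754 floats. The helper names le_zero and ge_zero are made up for illustration and are not from the posted code, and memcpy is swapped in for the pointer cast so the sketch does not run afoul of aliasing rules. The integer tests agree with the float comparisons for every non-NaN input.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit floats");

// Hypothetical helper: same result as (f <= 0.0f) for non-NaN f.
// Positive floats have the sign bit clear, so their bit pattern read as a
// signed integer is > 0; +0.0 is 0 and anything negative (including -0.0)
// has the sign bit set and therefore reads as a negative integer.
static inline bool le_zero(float f)
{
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i);
    return i <= 0;
}

// Hypothetical helper: same result as (f >= 0.0f) for non-NaN f.
// Read as unsigned, non-negative floats are below 0x80000000 and
// -0.0 is exactly 0x80000000; every other negative float is larger.
static inline bool ge_zero(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u <= 0x80000000u;
}
```

Whether this still beats a plain float compare depends on the compiler and CPU; on the x87 FPU it avoids the fcomp / fnstsw sequence mentioned in the code comment.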

quote:knowing asm is good, but using it is unneeded 99% of the time with any newer compiler.

If you never write anything time critical, asm optimization won't be necessary; otherwise, it had better be as fast as possible.

quote:and your 'crappy hw p4' statement simply drops you out of the reliable programmers list. bye

Oh well. I will attempt to enlighten you as to why I call the P4 crappy hardware:

Design decision: insanely long pipes (20 stages) to allow higher clock rates (which is why Intel is hyping GHz). Mispredicted branches will really cost you.

Results of a last minute reduction in die size to save costs:
1) 8 KB L1 cache. Apps with a tiny working set benefit from faster L1 access (2 clocks vs. 3), but I don't know of any such apps. Baad.
2) Only 1 decoder. I-cache misses are painful; also, you'd better pray your code is in the trace cache.
3) No barrel shifter. This is just insane.
4) On top of that, partial register stalls are 'avoided' by having 8 bit register accesses go through the shift unit (5-6 clocks!!). Also, no more AGU.
5) Insufficient trace cache throughput - 3 µops / clock.
6) Finally, there's only 1 FP and 1 MMX unit.

But you knew that, right?
I'll say it again, I thought this forum was trying to attract prospective and current game developers. If you are trying to get into the industry, you don't just want to be focussing on one chipset, on the basis that that happens to be the chip in *your* computer.

You say that you want to produce code that runs fast on *your* machine. That's not really a commercial standpoint, is it? So surely some of this stuff is going to be bad advice for people actually trying to prepare for the industry?

VC7 doesn't emit 3DNow! code, but I bet you can get hold of libraries that do.

From what you're saying, you've been doing assembler for a while and your main expertise lies with AMD. So you'd have to become as good with the Intel processors to produce commercial code. And then when Intel and AMD release the next generation of processors, you've got to become proficient in both of them all over again. At the same time, the next version of DirectX is being released and games are starting to incorporate ever more advanced 3D graphics which need to be learned. How do you keep up, and which do you give priority to?
quote:Original post by Jan Wassenberg
I'd love to hear why this is _2.5x slower_ than my asm code.
Then again, VC7 does not emit 3DNow! Pro code. You lose. Case closed.

well.. it just means you compare the wrong stuff. try to get vc7 code _FASTER_ with x86 asm, then you'll see that you don't get much out of it most of the time.
and with the intel compilers, and vc7.1 as well, it's the same for sse.
why does none of the biggies support 3dnow? because it's _DEAD_. every new amd can do sse as well.

i don't think you can outperform a new compiler by much if you use the same options it has. of _COURSE_ you are better than it if you use stuff it can't do (3dnow, sse depending on compiler). but that's no fair comparison.

yes i know how the p4 is designed, and i don't think it's a flaw. at least it's fucking fast if used right, do i need more? and i know that my code runs as well on every other processor.

what i think about your code btw is a) unreadable, but that doesn't matter, you know what you code, blahblah.. and b) yes, possibly that thing is not much more optimizable. but possibly your algo is simply slow. i don't know exactly what kind of lodding you do, but to me it sounds overcomplex if it uses that many resources. i've seen fast lodding schemes that looked nice and did _NOT_ need _ANY_ lowlevel optimizations.

for the best proof that asm is not _needed_ anymore most of the time: www.realstorm.com. why is realtime raytracing possible at that speed? because they made an excellent algorithm in their engine. it's plain c code, no asm in there, still it is very fast. it _IS_ much faster on amd cpus, showing the big flaw of the p4: the original floating-point unit. i'm quite interested in how much faster it could be when compiled with intel c++ settings for sse-only, meaning replacing the fpu by the sse unit.

well, i rewrote my raytracer quite some time back, full of fpu asm instead of using the compiler. the result was.. uhm.. _no_ difference. but once i started using sse, woosh, it was quite a bit faster. then again, vc6 did not generate sse on its own (meaning it can't emit sse you haven't written yourself, either in asm or with the intrinsic functions).
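
A minimal sketch of what "writing the sse yourself with the intrinsic functions" can look like, here for a 4-component dot product of the kind a raytracer does all the time. This is an illustration only, not code from any raytracer mentioned in the thread: dot4 and HAVE_SSE are hypothetical names, the intrinsics are the standard ones from <xmmintrin.h>, and the scalar fallback is there only so the sketch also builds on targets without SSE.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Hypothetical helper: dot product of two 4-float vectors.
static inline float dot4(const float* a, const float* b)
{
#if HAVE_SSE
    const __m128 va   = _mm_loadu_ps(a);            // unaligned loads, no alignment assumed
    const __m128 vb   = _mm_loadu_ps(b);
    const __m128 prod = _mm_mul_ps(va, vb);         // {p0, p1, p2, p3}

    // Horizontal add using SSE1 only (no haddps, which is SSE3).
    __m128 shuf = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1)); // {p1, p0, p3, p2}
    __m128 sums = _mm_add_ps(prod, shuf);           // {p0+p1, p0+p1, p2+p3, p2+p3}
    shuf = _mm_movehl_ps(shuf, sums);               // lane 0 now holds p2+p3
    sums = _mm_add_ss(sums, shuf);                  // lane 0 = p0+p1+p2+p3
    return _mm_cvtss_f32(sums);
#else
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
#endif
}
```

For a single dot product the horizontal add eats most of the gain; the usual win comes from processing four rays or four vertices per instruction instead.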


asm helps where you have no other choice (like in your situation, optimizing for amd), and asm helps a lot in understanding how a pc works, so that even while coding highlevel you know what would be slow, what could be done faster, etc.

but you can't get much of a speed boost if you fight against the compiler with the same weapons. otherwise i agree. if you have better weapons than it has, you'll win. logically. but otherwise, possibly yes, possibly not. it depends a lot on the situation, but all in all you'll come out about equal.


quote:philscott
I'll say it again, I thought this forum was trying to attract prospective and current game developers. If you are trying to get into the industry, you don't just want to be focussing on one chipset, on the basis that that happens to be the chip in *your* computer.
You say that you want to produce code that runs fast on *your* machine. That's not really a commercial standpoint, is it? So surely some of this stuff is going to be bad advice for people actually trying to prepare for the industry?

'general' development
And yes, if I were programming commercially, I would do things differently (make sure the game/app runs acceptably on as many computers as possible).
As to 'getting into the industry', I wouldn't know, but I can only imagine being able to produce efficient code to be a plus.

quote:philscott
VC7 doesn't emit 3DNow! code, but I bet you can get hold of libraries that do.

Yes. They will usually accelerate stuff like matrix multiplications. The problem with small 3DNow! / MMX functions, though, is FPU state switching. Also, the only thing in that code that a library might implement is a dot product, which I might as well inline.

quote:philscott
From what you're saying, you've been doing assembler for a while and your main expertise lies with AMD. So you'd have to become as good with the Intel processors to produce commercial code. And then when Intel and AMD release the next generation of processors, you've got to become proficient in both of them all over again.

Yep. Knowledge is there, I'd just hate to have to optimize around the P4's limitations.
True.

quote:philscott
At the same time, the next version of DirectX is being released and games are starting to incorporate ever more advanced 3D graphics which need to be learned. How do you keep up, and which do you give priority to?

When I get new hardware, I read the docs.
As to graphics (more important than knowing the last details of a CPU's performance characteristics), I use OpenGL, which has remained much more stable than DirectX.
Once you know what you're doing, remembering how to code for AMD/Intel CPUs ain't too bad.


quote:davepermen
why does none of the biggies support 3dnow? because it's _DEAD_. every new amd can do sse as well.

3DNow! is not dead. Most people are just too lazy to take advantage of what's there. SSE is emulated on Athlons - packed instructions are vector decode, and take 5 clocks vs. 3DNow! ops' 4.

quote:davepermen
well.. it just means you compare the wrong stuff. try to get vc7 code _FASTER_ with x86 asm, then you'll see that you don't get much out of it most of the time.
and with the intel compilers, and vc7.1 as well, it's the same for sse.
[...]
i don't think you can outperform a new compiler by much if you use the same options it has. of _COURSE_ you are better than it if you use stuff it can't do (3dnow, sse depending on compiler). but that's no fair comparison.
[...]
but you can't get much of a speed boost if you fight against the compiler with the same weapons. otherwise i agree. if you have better weapons than it has, you'll win. logically. but otherwise, possibly yes, possibly not. it depends a lot on the situation, but all in all you'll come out about equal.

1) As to using the same weapons, I refer you to my other example (nbits): 10% speed gain, 10 minutes spent optimizing, standard x86 instructions only (even within the compiler's grasp).
I really don't care if the comparison is fair - if the compiler can't emit 3DNow!, and it would help performance, that's the compiler's problem, and a good reason to write the code in asm.

quote:yes i know how the p4 is designed, and i don't think it's a flaw.
[...]
it _IS_ much faster on amd cpus, showing the big flaw of the p4: the original floating-point unit.

Questionable design decisions aside, that's a bit much. Lack of a barrel shifter and AGU is just plain unbelievable. You are also correct in stating that the FPU is weak: it has only 1 FP/SIMD unit! (vs. the Athlon's 2)

quote:davepermen
what i think about your code btw is a) unreadable, but that doesn't matter, you know what you code, blahblah.. and b) yes, possibly that thing is not much more optimizable.

I can't win. First you say the C code must suck, then you complain it's unreadable (yes, kiddies, I did everything I could to make it fast). Oh well. I can live with #2.

Oh BTW: the LOD algorithm is Lindstrom and Pascucci's "Visualization of Large Terrains Made Easy".
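
The CULL step in the refineR code posted earlier can be read on its own as a sphere-vs-frustum test that remembers which planes a node is already completely inside, so child nodes can skip them. Below is a rough standalone sketch of that idea, not the original implementation: Plane, Sphere, cull_sphere, NUM_PLANES and ALL_INSIDE are hypothetical names, the mask uses bits 0..4 rather than the upper half of l_vis, and plane normals are assumed to point outward, which matches the "d > r means completely outside" test in the posted code.

```cpp
#include <cstdint>

// Hypothetical types for illustration only.
struct Plane  { float a, b, c, d; };   // outward-facing: a*x + b*y + c*z + d > 0 means outside
struct Sphere { float x, y, z, r; };

const unsigned      NUM_PLANES = 5;    // far plane ignored, as in the posted code
const std::uint32_t ALL_INSIDE = (1u << NUM_PLANES) - 1u;

// Returns false if the sphere is completely outside some plane.
// Sets bit n of inside_mask once the sphere is completely inside plane n,
// so that nested (child) spheres tested with the same mask skip that plane.
static bool cull_sphere(const Sphere& s, const Plane* planes, std::uint32_t& inside_mask)
{
    if (inside_mask == ALL_INSIDE)
        return true;                    // already inside every plane, nothing to test

    for (unsigned n = 0; n < NUM_PLANES; n++)
    {
        const std::uint32_t bit = 1u << n;
        if (inside_mask & bit)
            continue;                   // plane n was resolved further up the hierarchy
        const float dist = s.x*planes[n].a + s.y*planes[n].b + s.z*planes[n].c + planes[n].d;
        if (dist > s.r)
            return false;               // completely outside
        if (-dist > s.r)
            inside_mask |= bit;         // completely inside
    }
    return true;
}
```

Passing the mask down by value to each child, as refineR does with l_vis, is what makes the per-node cost drop to a single compare once a subtree is fully visible.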

quote:davepermen
for the best proof that asm is not _needed_ anymore most of the time: www.realstorm.com. why is realtime raytracing possible at that speed? because they made an excellent algorithm in their engine. it's plain c code, no asm in there, still it is very fast.

That's just not valid reasoning. I say use asm when necessary; examples of when it is not are irrelevant.

quote:davepermen
asm helps where you have no other choice (like in your situation, optimizing for amd), and asm helps a lot in understanding how a pc works, so that even while coding highlevel you know what would be slow, what could be done faster, etc.

Agreed.

whew! that was long

This topic is closed to new replies.