Assembly : When is it worth your time?

101 comments, last by OpenGL_Guru 19 years, 11 months ago
quote:Original post by Red Drake
Who uses assembly when it's not necessary?

The earlier comments and comparisons were beginning to imply that you should use assembler even for something as trivial as incrementing elements in an array, when simply restructuring your HLL code would be more than sufficient. I was only trying to steer this back on topic...

quote:Original post by CloudNine
Visual C++ 6 (yes, I know) regularly generates assembly code for a

++variable;  


that amounts to this

mov  eax, DWORD PTR a
inc  eax
mov  DWORD PTR a, eax


when surely:

inc  DWORD PTR a


is faster.

Surely 2 memory accesses are slower than 1? And 3 instructions are definitely slower than 1.

It'll be interesting to see if Visual C++ .NET still generates the same code.

note: I've probably got DWORD PTR and the square brackets mixed up, but you should be able to see my point.


In regards to CloudNine's and tok_junior's posts: CloudNine is spot on, and yes, C++ .NET still generates the same waffle.
Borland's compilers get a little closer (only 2 instructions), but that is the best they manage.

As for tok_junior, even with using a pointer, you are still about 50% slower than the assembly code posted. It's basically like CloudNine put it.

When given consecutive commands, compilers only partially optimise for them. They can't fully optimise for them (at least not yet - it wouldn't be TOO hard I guess - if I can do it when drunk, a computer can do it).
Some (like Borland's) will at least re-arrange the registers to make them faster, but they still leave in a lot of excessive instructions.

I've been recommended to check out a compiler called PGC as it was meant to be a lot better when compiling the blitz library - it may actually remove the redundant instructions, but who knows.
Beer - the love catalyst
good ol' homepage
quote:Original post by CloudNine
Visual C++ 6 (yes, I know) regularly generates assembly code for a

++variable;  


that amounts to this

mov  eax, DWORD PTR a
inc  eax
mov  DWORD PTR a, eax



You can't base the quality of a compiler on one out-of-context example.
If you rewrote it like this:

counter = 0;
++counter;

it compiles to

mov dword ptr [ebp-0x10],0x1

Different situations produce different sets of instructions.
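
For anyone who wants to reproduce the comparison, a minimal test one could feed to a compiler and inspect (cl /O2 /FAs with MSVC, or gcc -O2 -S) might look like the sketch below; the function names are illustrative, not taken from the thread.

int a;                     /* global: the compiler has to keep it in memory */

void bump_global(void)
{
    ++a;                   /* typically load / inc / store, or an inc on memory */
}

int bump_local(void)
{
    int counter = 0;
    ++counter;             /* context lets the compiler fold this to "mov ..., 1" */
    return counter;
}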
What about gaming on mobile phones, Game Boys, N-Gages and things like that - surely asm is still worth the time there?
quote:Original post by Bossk
In Soviet Russia, you STFU WITH THOSE LAME JOKES!
ASM, THE ANTI-PROPAGANDA MANIFESTO.

Today asm is still nearly as necessary as 10 years ago. After all the optimizations - culling, BSP, portals, PVS, s-buffers, efficient lighting in object space, etc. - yesterday the bottlenecks concerned perspective-correct scanline rendering, today it's about coding pixel shaders. Yesterday it was about projecting vertices, now it's all about complex CLOD algos, more physics, more AI. I coded less than 5% in asm; it would still be roughly the same today. Unless I count using SIMD intrinsics as asm, in which case it's even more.

A) The argument "high level optimizations are many orders of magnitude more important" is true, but totally fallacious when the conclusion becomes "why then bother about asm?". There is nothing new here; this issue has always existed in graphics programming, even in 1970 it was true. We're talking of C vs asm, which requires high skill to be discussed seriously. So in this case I assume everyone knows about space partitioning, occlusion, look-up tables, caching, sort algos, etc., which makes this argument a non-issue and recenters the debate on the actual question: C (or higher) versus asm.

B) The argument of exponentially growing GHz is a non-issue in the field of game programming, because there is a never-ending competition. There are fields which can still be infinitely refined - physics, AI - at least for the next decades. It's possible that graphics may reach its final state of the art in the coming years, since after every algorithmic optimization and caching, imagery basically comes down to treating one million pixels at a given frequency, say 100 FPS.

Thus the argument only makes sense for general desktop application products. That's why we get polluted by anti-asm theories. But that's not the subject here, right?


C) The argument that compilers get more and more efficient is also a non-issue in our field.

C-a) It's true in absolute terms, but not relative to the evolution of CPU architectures. I can tell you with absolute certainty, after six months of hard work, that vectorizers and compilers are totally incapable of getting anything close to what I can reach with SIMD asm in many common cases. I doubt anyone here has made as many tests as I have these last 6 months.

C-b) Once again there is a crucial argument people almost never see: comparing C to asm is not simply comparing the machine code generated. That's a really naive vision of the problem.

This forgets that a (real) asm coder has more knowledge of both the lower and the higher levels. Thus there is a far wider feedback loop, in which the coder can make far more meaningful decisions than the compiler.
- He can modify the data structures; the C compiler is not allowed to.
- He can modify some details of the algo; the C compiler is not allowed to, or not capable of it.

- The coder can read the latest docs and find out how to implement new instruction sets efficiently in small crucial routines.
- The compiler is always late, because it's very complex to exploit the "spirit" of a new instruction set in general cases, and behind the compilers there are only human coders. For instance, neither Visual nor gcc implements the intrinsics - which only map to one asm instruction (!!!) - correctly. (*)

(*) Visual is incapable of using reg, mem operands such as pfadd mm?, [eax]. It systematically loads into a register and then operates, which causes much higher register pressure. This results in tons of load and store dependencies, reducing performance in practice to 50% of the optimum. This is backed by serious rdtsc benchmark sessions. The same goes for gcc, which forces MMX output to mm0, killing register strategies. But that one can be solved thanks to gcc's great inline asm.
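
As a rough illustration of the reg, mem point (not Charles B's code, and using SSE addps rather than 3DNow! pfadd for readability), gcc's inline asm lets the instruction take its second operand straight from memory:

#include <xmmintrin.h>

/* Sketch only: the "m" constraint lets addps read its source operand
 * directly from (16-byte aligned) memory, so no separate load ties up
 * an extra register. Function name is illustrative. */
static inline __m128 add4_from_mem(__m128 acc, const __m128 *src)
{
    __asm__("addps %1, %0" : "+x"(acc) : "m"(*src));
    return acc;
}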

As a conclusion, I think the future will add an intermediate level: intrinsics and various low-level extensions to C and C++. Today they are not well implemented, but they certainly will be soon, since I did it myself. Check the "math\horsepower math lib (2)" thread to get an idea. Vectorizers can help a bit, but to me they are just better compilers; don't count on them for the most critical inner loops, for the last reason (C-b) I explained, which will always be true.

[edited by - Charles B on June 8, 2004 7:37:57 AM]

[edited by - Charles B on June 8, 2004 8:12:46 AM]
"Coding math tricks in asm is more fun than Java"
Just have to say this:

Why use 95% more time when I only get a 5% speed enhancement?

I wouldn't bother optimizing a function when it is only called once at startup time of the program.


Important for every improvement:
- Profile. Remember the 10-90 rule (10% of the code will eat 90% of the processor time). There is no need to improve anything outside those 10% of the code.
- Rethink your problem. Using assembler instead of C will not get you the boost you would probably get from finding a better algorithm.

Dredge-Master tried to show how much faster asm was with the increment loop. He found a way to eliminate one for loop, and unrolled the remaining loop. This is the way of "rethinking".
Also knowing about your underlying system may help you a lot:
for (b = MAX_WIDTH - 1; b >= 0; --b)
  for (h = MAX_HEIGHT - 1; h >= 0; --h)
    // copy pixel from screen 1 to screen 2

might be much slower than just exchanging the two for loops, as with a small cache your cache hit rate would be much better.
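
A sketch of that loop exchange, assuming the usual row-major layout where the pixels of one row sit next to each other in memory (the array names and sizes below are made up, not from the post):

#define MAX_WIDTH  640
#define MAX_HEIGHT 480

void copy_screen(unsigned char dst[MAX_HEIGHT][MAX_WIDTH],
                 const unsigned char src[MAX_HEIGHT][MAX_WIDTH])
{
    int h, b;
    for (h = 0; h < MAX_HEIGHT; ++h)      /* rows in the outer loop            */
        for (b = 0; b < MAX_WIDTH; ++b)   /* inner loop walks consecutive      */
            dst[h][b] = src[h][b];        /* addresses, so it stays in cache   */
}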


On the other hand, a state-of-the-art game should scale well with better performance (always use 100% of it), so with faster rendering it would be able to put more performance into AI or other subsystems (or just be able to run more game entities).

In the end it all comes down to how much time you have (both runtime and development time) and how much you want to optimize your code.
I am pretty positive most people will agree you can code way faster in a high-level language than in assembler. So the cut is just when you have to ask yourself: "is it worth this effort?"
-----The scheduled downtime is omitted cause of technical problems.
@DBX & Red Drake
The Quaternion mul challenge :

Take a look here:
here

I get 18 clock cycles (Athlon, gcc, my own intrinsics). By optimizing the loop a bit and scheduling a bit more I could even reach 15-16 clocks, I think. The inline floating point version costs 28 cycles. I use a fastcall function (24 cycles) under Visual, because intrinsics are bad with Visual (30 cycles). I suppose the Intel compiler would be as good as gcc, but my licence has expired.

Can you benchmark your routine, Drake? I am almost sure yours is slower. I would bet around 25-30 cycles, which is equivalent to the floating point C version. But quaternions do not favor SIMD code because of the swizzling required.
"Coding math tricks in asm is more fun than Java"
Why use 95% more time when I only get a 5% speed enhancement?
Of course. No real asm coder would do this, because writing highly scheduled, unrolled asm code is such a pain and sacrifice of time. I should have put this argument before A), but to me it's the same. One has to assume we're talking of really experienced coders.

Say that writing an asm routine takes 10 times longer than writing C. If 5% of the source code is in asm, then it is 50 against 95 in development time. I think that's the maximum bearable for anyone not wanting to go mad.


Dredge-Master ... reordering the loops
Which shows argument C-b. Knowledge of asm by itself is definitely a plus, even if the choice is finally to keep the code in C. The p++ stuff is typically something that can be fairly well optimized today: just unroll x4 in C, it's a good compromise. Possibly add a prefetch intrinsic, and done. Better yet, find a lib that does it at full speed for you.
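
A sketch of that "unroll x4 plus a prefetch intrinsic" suggestion in plain C; the array name, the prefetch distance and the use of SSE's _mm_prefetch are my assumptions, not something given in the post:

#include <xmmintrin.h>          /* _mm_prefetch / _MM_HINT_T0 (SSE) */

void inc_all(int *p, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        _mm_prefetch((const char *)(p + i + 16), _MM_HINT_T0);  /* fetch ahead */
        p[i]     += 1;          /* unrolled x4: fewer branches, more ILP */
        p[i + 1] += 1;
        p[i + 2] += 1;
        p[i + 3] += 1;
    }
    for (; i < n; ++i)          /* handle the leftover elements */
        p[i] += 1;
}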


state-of-the-art game should scale well with better performance (always use 100% performance)
Right. LOD rendering should be used more as an elastic joint, to help always deliver enough power for AI or physics.

So the cut is just when you have to ask yourself "is it worth this effort?"
Yes, development time against CPU time is really a ratio to always keep in mind. A balance has to be found at the whole project scale. I am writing a math lib so that for future users these two constraints are both reduced. It's only worth the effort because it will be open source and used by many, at least I hope. For my own needs the deal does not obviously pay off, as my plan is to renew my next-gen terrain engine with amazing features that exploit only a small part of the CPU. But in the long term (2 years), I suppose that, considering only myself, it will pay off. I have plenty of ideas I'd like to implement in graphics, physics and AI, but that requires a highly powered math lib.
"Coding math tricks in asm is more fun than Java"
quote:Original post by Charles B
@DBX & Red Drake
The Quaternion mul challenge :

Take a look here:
here

I get 18 clock cycles (Athlon, gcc, my own intrinsics). By optimizing the loop a bit and scheduling a bit more I could even reach 15-16 clocks, I think. The inline floating point version costs 28 cycles. I use a fastcall function (24 cycles) under Visual, because intrinsics are bad with Visual (30 cycles). I suppose the Intel compiler would be as good as gcc, but my licence has expired.

Can you benchmark your routine, Drake? I am almost sure yours is slower. I would bet around 25-30 cycles, which is equivalent to the floating point C version. But quaternions do not favor SIMD code because of the swizzling required.



I can't test my code - I don't have any profilers installed, but I have an entire SDK of AMD SIMD (for 3D) assembler code and I am sure that most of that code is faster (why would the guys at AMD bother to write it if it's slower?).
Dude - can you give me an explanation of the word "swizzling"? (I don't know the meaning.)
Red Drake
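
If no profiler is installed, a quick-and-dirty cycle count in the style the thread keeps quoting can be done with rdtsc. This is a rough sketch (GCC inline asm; run the routine many times and take the minimum), not a substitute for a real profiler:

#include <stdint.h>

/* rdtsc returns the CPU's time-stamp counter in EDX:EAX */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* usage sketch:
 *   uint64_t t0 = read_tsc();
 *   routine_under_test();
 *   uint64_t t1 = read_tsc();
 *   cycles is roughly t1 - t0, minus the overhead of read_tsc itself
 */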
"Swizzling" is generally used to mean "shuffling the order of" parts of a vector (using the term in the generic math sense). It''s also closely related with "masking" (only writing/reading from a few components of a vector) and some people use it to mean both.

In SIMD this is still quite undesirable, and wasn't even really possible until SSE (which added the "shuffle" instructions). It's still somewhat awkward, although in my experience it is certainly usable now, as the shuffle operations are quite flexible.
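
As a small illustration of what a swizzle looks like with the shuffle intrinsics (my example, not from the post), here is a reversal of the four lanes of an SSE vector:

#include <xmmintrin.h>

/* _MM_SHUFFLE picks, for each destination lane, which source lane to
 * read; here the four floats come back in reverse order. */
__m128 reverse_lanes(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}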

This topic is closed to new replies.
