Assembly : When is it worth your time?

101 comments, last by OpenGL_Guru 19 years, 10 months ago
I can't test my code
If you can write your code, you can test your code.

I don't have any profilers installed
I don't use profilers for that. I use unrolled loops and the asm instruction rdtsc. Thanks to the loop, my method gives me a benchmarking accuracy of under one cycle. I run the measurement twice and get, for instance, exactly 18.25 cycles both times.
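A minimal sketch of this kind of rdtsc timing, assuming GCC or Clang. The routine under test (`dot4`), the repetition count `REPS` and the helper names are hypothetical and are not the poster's actual harness. On non-x86 targets the sketch falls back to `clock()`, which is far coarser. Note also that on many modern CPUs the time stamp counter ticks at a constant rate rather than counting core cycles, so the result is in TSC ticks.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>               /* __rdtsc() on GCC/Clang */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>                    /* coarse fallback, not cycle accurate */
static uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

#define REPS 1024                    /* hypothetical repetition count */

/* Hypothetical routine under test: a 4-component dot product.
   The volatile reads stop the compiler from hoisting the work
   out of the timing loop. */
static float dot4(const volatile float *a, const volatile float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Returns the average number of ticks per call over REPS calls.
   The loop body is unrolled four times so that the loop overhead
   is spread over several calls. */
static double ticks_per_call(const float *a, const float *b, float *sink)
{
    assert(REPS % 4 == 0);
    float acc = 0.0f;
    uint64_t start = read_ticks();
    for (int i = 0; i < REPS; i += 4) {
        acc += dot4(a, b);
        acc += dot4(a, b);
        acc += dot4(a, b);
        acc += dot4(a, b);
    }
    uint64_t stop = read_ticks();
    *sink = acc;                     /* keep the result live */
    return (double)(stop - start) / REPS;
}
```

Calling `ticks_per_call` several times and comparing the results is a cheap way to check that a measurement is stable.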

But I have an entire SDK of AMD SIMD (for 3D) assembler code and...
I have worked on this subject for the last 6 months. Rest assured I have downloaded, read and tested everything available on the web. This SDK was one of the first things I got.

I am sure that most of that code is faster
Than what? The standard floating point version or my 3DNow version? Mine is inlined, which removes the call overhead, so mine is necessarily faster, since the call costs roughly 8-10 cycles the way it's done in the AMD code sample.

Why would the guys at AMD bother to write it if it's slower
Because you never know until you have written it. Because most users do not even make the effort of testing it, and have the same prejudices as you. Because their version is faster than C with floating point on a K6, not on an Athlon. Because they wrote this 3DNow version and put it in the 3DNow file, so why remove it even if it brings nothing compared to the floating point version? Consistency dictated it.

Don't think anything coming from Intel, AMD or Microsoft is God-written stuff. It's made by guys like you and me (or rather me). I could explain in detail why you can nearly always do better. In the case of the AMD lib, it's mainly due to Visual related stuff, and the fact that this lib is a small one, a fraction of what I do in my library. But that would be long and boring for you unless you had tried coding math in C/asm as much as I have over the last 10 years.

[edited by - Charles B on June 9, 2004 7:39:57 AM]
"Coding math tricks in asm is more fun than Java"
quote:
Don't think anything coming from Intel, AMD, Microsoft, is god written stuff. It's made by guys like you and me, (or rather me ). I could explain in details why you can nearly always do better. In the case of the AMD lib, it's mainly due to Visual related stuff, and the fact that this lib is a small one, a fraction of what I do in my library. But that would be long and boring for you unless you had tried coding math in C/asm as much as I did over the last 10 years.


Ten years ago I was doing math like "1+1 = 2". I am 16.
And I saw your Horse power math lib, but I like C/ASM, especially SIMD, and I am not working on anything specific, just learning.
The reason I use the AMD lib is that it's pretty clear and easy to understand. It does not have to be the best, but it's good for learning.

All the things you posted above were done by you in C/ASM. This topic is about how compilers would do it, and you have proven that you can write better code than the compiler (at least it seems so).

And one last question:
In all your posts you say that 3DNow! is slower than your code. Does this mean 3DNow! is slow? (On an Athlon XP, not a K6.)
Red Drake
You are motivated to test such things as 3DNow at the age of 16. I started the way you are starting now... 18 years ago (and even earlier).

And one last question: In all your posts you say that 3DNow! is slower than your code. Does this mean 3DNow! is slow? (On an Athlon XP, not a K6.)

??? I certainly haven't said that! Either you misunderstood me or I wrote something unclear somewhere.

What I said in the last posts is that the 3DNow version of quaternion multiplication written by AMD is not faster than the equivalent C function with floats. But once the code is inlined with GCC, that stops being true and things are back to normal:

Proof:
- my 3DNow version of the quaternion mul: 18 cycles
- my 387 version of the quaternion mul: 28 cycles
I don't have an SSE machine at the moment to test on, but I suppose it would take about 12 cycles.
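For reference, a plain C sketch of the routine being timed: the standard Hamilton quaternion product written with floats and marked `static inline` so the compiler can remove the call overhead. The `Quat` layout and the function name are assumptions for illustration. This is not the poster's 387 or 3DNow code, and no cycle counts are claimed for it.

```c
/* Hypothetical layout: w is the scalar part, (x, y, z) the vector part. */
typedef struct { float w, x, y, z; } Quat;

/* Hamilton product r = a * b: 16 multiplies and 12 adds/subtracts.
   static inline lets the compiler paste the body at the call site,
   which avoids the cost of a function call. */
static inline Quat quat_mul(Quat a, Quat b)
{
    Quat r;
    r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
    r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
    r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
    r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
    return r;
}
```

The mix of signs and crossed components in the last three rows is the kind of swizzling that makes this routine awkward to map onto SIMD registers.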

In general, for any given routine, the best possible implementations in 387 (using floats), 3DNow and SSE should rank as follows, from slowest to fastest:

387 < 3DNow < SSE

SIMD wins by a factor of 2 on operations such as add, int to float conversions, or mixing logical ops and floating point ops, but once there is swizzling it loses compared to floating point. All in all, though, SIMD wins on average. It's really worth using it. I can give you some examples where SIMD code completely smashes traditional C code with floats.

Example: AABox/Plane distance or AABox/Frustum culling is awesome with SIMD instructions. Probably 40 times faster than what most people would write in C with the naive algo. And many times faster than what I could do with IEEE math tricks and floating point.
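To make the AABox/Plane case concrete, here is a scalar C sketch of a common alternative to the naive approach of testing all eight corners: project the box onto the plane normal using its center and half-extents. The type names, field layout and return convention are assumptions for illustration. This is not the poster's SIMD routine, and it makes no claim about the 40x figure.

```c
typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 center, half; } AABox;  /* half = half-extents, all >= 0 */
typedef struct { Vec3 n; float d; } Plane;    /* points p with dot(n, p) + d = 0 */

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Returns +1 if the box lies entirely on the positive side of the plane,
   -1 if entirely on the negative side, and 0 if it straddles the plane. */
static int aabox_vs_plane(const AABox *b, const Plane *p)
{
    /* Signed distance from the box center to the plane
       (in units of the normal's length). */
    float dist = p->n.x * b->center.x
               + p->n.y * b->center.y
               + p->n.z * b->center.z + p->d;

    /* Projected "radius" of the box along the plane normal. */
    float radius = absf(p->n.x) * b->half.x
                 + absf(p->n.y) * b->half.y
                 + absf(p->n.z) * b->half.z;

    if (dist > radius)  return 1;
    if (dist < -radius) return -1;
    return 0;
}
```

The two sums are three multiply-adds each with no data-dependent branches, which is why this form maps well onto 4-wide SIMD registers; frustum culling repeats the same test for each of the six planes.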


[edited by - Charles B on June 9, 2004 9:57:13 AM]
"Coding math tricks in asm is more fun than Java"
Sorry to ask so many questions, but I am curious.
Is SSE faster than 3DNow!, or is this another thing that I misunderstood?
I do have an SSE machine (AMD Athlon XP 1800+, SSE I), so let's say on my PC SSE would run faster. (If it would, I don't see the point of learning 3DNow! when I can use SSE, which is faster.)
Red Drake
quote:Original post by Red Drake
Sorry to ask so many questions, but I am curious.
Is SSE faster than 3DNow!, or is this another thing that I misunderstood?
I do have an SSE machine (AMD Athlon XP 1800+, SSE I), so let's say on my PC SSE would run faster. (If it would, I don't see the point of learning 3DNow! when I can use SSE, which is faster.)


SSE would certainly run faster for most functions.

A SSE capable machine is also MMX, 3DNow and 3DNowExt capable. So for image processing, MMX can still be useful. And if, for instance, you want to accelerate 2D vector processing, then 3DNow is still more relevant than SSE. So yes, it's useful to know 3DNow. Moreover, on many Athlon XP machines SSE is not enabled by the motherboard; you need a BIOS update for that. So for compatibility it's better to have several versions of your routines: one for 3DNow and one for SSE, for instance.
"Coding math tricks in asm is more fun than Java"
quote:Original post by Charles B
A SSE capable machine is also MMX, 3DNow and 3DNowExt capable

Correct me if I'm wrong, but Intel machines (P4, etc.) do not support 3DNow. They do support MMX (since the Pentium MMX), SSE (since the Pentium 3) and SSE2 (since the Pentium 4), and a few support SSE3 (since the Pentium 4E).



[edited by - AndyTX on June 9, 2004 11:15:29 AM]
You're probably right :/ I must admit that, as I don't have a P4 at hand, I am not 100% certain of what I wrote. But it remains valid in the context of the Athlon XP. It's a bit confusing because beyond 3DNow and SSE there are variations like 3DNowExt, and some instruction sets overlap on many machines. I'll have to study that more precisely when implementing my CPUID-based virtual table. So far I have studied the AMD docs more than the Intel ones.
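A minimal sketch of what a CPUID-based dispatch table could look like, assuming GCC or Clang on x86, where the `__builtin_cpu_supports` builtin wraps the CPUID query. The function names and the `vec_add_fn` type are hypothetical, and this is not the poster's library. A real build would typically compile each variant in its own file with the matching instruction-set flags, and could add a 3DNow! variant in the same way.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>               /* SSE intrinsics */
#endif

/* Hypothetical routine signature shared by all variants. */
typedef void (*vec_add_fn)(float *dst, const float *a, const float *b, size_t n);

/* Portable fallback. */
static void vec_add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

#if defined(__SSE__)
/* SSE variant: four floats per iteration, scalar loop for the tail. */
static void vec_add_sse(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i,
                      _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
#endif

/* Picks an implementation once, based on what the CPU reports at run time. */
static vec_add_fn select_vec_add(void)
{
#if defined(__SSE__) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        return vec_add_sse;
#endif
    return vec_add_scalar;
}
```

Calling `select_vec_add()` once at startup and storing the returned pointer keeps the feature check out of the hot path.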
"Coding math tricks in asm is more fun than Java"
quote:Original post by Tree Penguin
Set your display to a 60 Hz refresh rate: you will see it flickering and you will get a headache when you're not used to it. Movement however is fine at a steady 28 Hz (for some people maybe less). TV and such are all 28 or 30 Hz; their refresh rate is set to 80 or 100 Hz to get rid of the flickering.


It may give you a headache, but the point is you can't actually consciously see the dark down-time between the frames.

You can see problems with movement at 28 or 30 Hz.

At least you can when it's done the way a computer does it. A frame on a television blurs on fast motion because the shutter was open for a significant fraction of the frame. Computers don't do that.

When a hard edge moves from point A to point B on the TV (or in a movie) you see a frame where there's a blur between the two points. But if a hard edge moves from point A to point B on the computer you see the hard edge first at A and then at B and absolutely nothing in between. You can tell that's a problem. (You can even see this effect in cheesy computer animation on television.)

If computers blurred between frames like cameras do then yes, 30 Hz would be enough. But computer games don't work like that because it would be a processing nightmare: to do it well you'd have to calculate a significant number of 'in-between' frames and then blur them together. (Just blurring with the last frame does not help. That's just a gimmick. You still wind up with two edges and no hint that the edge was ever between the two.)
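The 'in-between frames' idea can be sketched on a single scanline: render the hard edge at several positions between A and B and average the results. Everything here (`WIDTH`, the function names, the 1D setup) is a hypothetical toy, not how any particular engine does it.

```c
#include <assert.h>

#define WIDTH 16                     /* hypothetical scanline width */

/* Renders a hard edge: pixels left of `edge` are lit (1.0), the rest dark. */
static void render_edge(float *row, float edge)
{
    for (int x = 0; x < WIDTH; ++x)
        row[x] = ((float)x < edge) ? 1.0f : 0.0f;
}

/* Averages `samples` sub-frames in which the edge moves linearly
   from x0 to x1, approximating an open camera shutter. */
static void render_blurred(float *row, float x0, float x1, int samples)
{
    assert(samples > 0);
    float sub[WIDTH];

    for (int x = 0; x < WIDTH; ++x)
        row[x] = 0.0f;

    for (int s = 0; s < samples; ++s) {
        float t = (samples > 1) ? (float)s / (float)(samples - 1) : 0.0f;
        render_edge(sub, x0 + t * (x1 - x0));
        for (int x = 0; x < WIDTH; ++x)
            row[x] += sub[x];
    }

    for (int x = 0; x < WIDTH; ++x)
        row[x] /= (float)samples;
}
```

The cost is one full render per sub-frame, which is the processing nightmare in question: nine samples means nine times the rendering work for a single displayed frame.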


Of course, pure rendering isn't necessarily the sort of thing that you'd want to make sure was optimized beyond all doubt with ASM.
My personal belief is that it won't be long before rendering itself is not the critical bottleneck. I'll bet games will start to feature very intensive physics engines and AIs that could bring even the fanciest modern computer to its knees. But what do I know? (Not much.)


[edited by - AndyL on June 9, 2004 5:36:25 PM]
Anyone... just check this recent article:

http://www.onlamp.com/pub/a/onlamp/2004/05/06/writegreatcode.html
Bruno B
No, the real reason assembly language programs tend to be more efficient than programs written in other languages is because assembly language forces the programmer to consider how the underlying hardware operates with each machine instruction they write.

That is the most interesting thing he points out. I already mentioned this point as crucial. Software engineering concepts taught in universities (many of them being very trivial compared to math, for instance) are nearly pointless without knowing how a (modern) computer works.

The typical prejudices against low level programming really piss me off. They are obviously ideas shared by guys who have never written any truly state-of-the-art code.
"Coding math tricks in asm is more fun than Java"

