Assembly : When is it worth your time?

Started by
101 comments, last by OpenGL_Guru 19 years, 11 months ago
That's the advantage of 3DNow! over SSE: it inherits the MMX instructions. In MMX there are quite a lot of shuffling functions (as far as I know).
Thanks for the vocabulary info.
Red Drake
quote:Original post by Charles B

I get 18 clock cycles (Athlon, gcc, my own intrinsics).


My own intrinsics?? Clarify, please.
Red Drake
SSE also inherits all of the MMX instructions (i.e. all MMX instructions work on xmm registers as well)... however, I'm unaware of any shuffling instructions in MMX. The ones that they added in SSE2 are something like "pshuf" (with suffixes for data size) as far as I remember (don't have the reference on me).
quote:Original post by AndyTX
SSE also inherits all of the MMX instructions (i.e. all MMX instructions work on xmm registers as well)... however, I'm unaware of any shuffling instructions in MMX. The ones that they added in SSE2 are something like "pshuf" (with suffixes for data size) as far as I remember (don't have the reference on me).


Aren't these what you mean?

Shift operations
psllw/d/q Parallel shift logical left words / dwords / qwords
psraw/d Parallel shift right signed words / dwords
psrlw/d/q Parallel shift right unsigned words / dwords / qwords

Or did I misunderstand the word "shuffling"?
Red Drake
Yeah, by "shuffling" I meant arbitrary moves. I.e. if you have a 4-element vector, you may want another vector made up of the values from it, except in a different order... perhaps {2, 1, 1, 4} instead of {1, 2, 3, 4}.

Bit shifting doesn't really do the same thing, as it won't shift "across" elements in the vector. If each of your elements is 8 bits wide, an 8-bit right shift will zero all of your elements, NOT shift them into the adjacent locations.

Look up the "pshuf" instructions in the Intel Instruction Set Reference for diagrams and a better explanation.
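
To make the shift-versus-shuffle difference concrete, here is a small scalar model in C. It is only a sketch: the type `vec4u32` and the helpers `shift_right_lanes` and `shuffle_lanes` are made-up names for illustration, standing in for what a per-lane shift such as psrld and a shuffle such as pshufd do on four 32-bit elements. It does not use real SIMD registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model type: four 32-bit lanes, like one 128-bit register. */
typedef struct { uint32_t lane[4]; } vec4u32;

/* Per-lane logical right shift: bits never cross from one lane into its
   neighbour. A count of 32 or more clears every lane. */
static vec4u32 shift_right_lanes(vec4u32 v, unsigned count)
{
    vec4u32 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (count < 32) ? (v.lane[i] >> count) : 0u;
    return r;
}

/* Arbitrary shuffle: output lane i is a copy of input lane sel[i]
   (selectors are 0-based). */
static vec4u32 shuffle_lanes(vec4u32 v, const int sel[4])
{
    vec4u32 r;
    for (int i = 0; i < 4; ++i) {
        assert(sel[i] >= 0 && sel[i] < 4);
        r.lane[i] = v.lane[sel[i]];
    }
    return r;
}
```

With the vector {1, 2, 3, 4} and the selectors {1, 0, 0, 3}, `shuffle_lanes` gives the {2, 1, 1, 4} ordering from the post, while shifting every lane by its full width just leaves zeros.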
Well, there is no problem with this in 3DNow! because you only have 2 numbers per MMX register, so it's a lot easier (at least it seems so).
Red Drake
quote:Original post by OmniBrain
Just have to say this:

Why spend 95% more time when I only get a 5% speed enhancement?


Think of it this way (this came up in the Carmack's SQRT thread in the Maths and Physics section of this forum).

The base sqrt function is 50 ticks.
The one they refer to as Carmack's inv sqrt on the first page is 30 ticks (before inverting it).
The one using the sub,add 3800000h trick is about 15 ticks.
The routine that I use, which can only be written in assembly without significantly increasing its time, is only 8 ticks.
Going from 50 ticks to 8 cuts the time by about 84% (roughly 6x faster), not just 5%.
That's not using SSE or 3DNow!
True, if you are only going to gain a 5% increase, rearrange your code. But let's say that in the example posted above I unrolled the code, then removed half the memory accesses in assembly. You can't remove all those extra memory accesses without asm (the compiler will whack them back in again).

The other advantage of assembly is using the 3DNow! instructions. I mean, a sqrt that is 50 times faster than the normal sqrt? Divides, adds, subtracts and multiplies that are the same speed (i.e. fast divides)? Even faster invsqrt approximations?
Assembly only, baby.
Beer - the love catalyst
good ol' homepage
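
For readers who have not seen the inverse square root trick mentioned in the post above, here is the widely circulated bit-level approximation written in portable C. This is a sketch, not the routine the poster timed: the function name `fast_rsqrt` is made up here, the magic constant 0x5f3759df is the commonly quoted one, and the code assumes `float` is a 32-bit IEEE 754 value. No tick counts are claimed for it.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for positive, finite x.
   Assumes float is a 32-bit IEEE 754 single. */
static float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;

    /* Reinterpret the float's bits as an integer (memcpy avoids
       the undefined behaviour of pointer casts). */
    memcpy(&bits, &x, sizeof bits);

    /* Halving the bit pattern and subtracting it from the constant
       gives a rough first guess at x^(-1/2). */
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);

    /* One Newton-Raphson step to tighten the guess. */
    return y * (1.5f - 0.5f * x * y * y);
}
```

Calling `fast_rsqrt(4.0f)` lands close to 0.5. Whether this beats the compiler's own sqrt on a given CPU is something to measure, not assume.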
IMHO ASM is only useful when you're writing device drivers or a library which needs to save every clock cycle it can.

ASM is like regex: it's ummm... easy??? :S to write, but hellishly tiresome and annoying to read and understand, even if it's your own code.

I've used assembly maybe 2 times: once when I was learning it and created a tic-tac-toe game, and the second time for a C++ code profiling class.

I personally feel ASM is now only an educational tool and should not really be considered a development language.

Cheers
quote:Original post by Red Drake
quote:Original post by Charles B

I get 18 clock cycles (Athlon, gcc, my own intrinsics).


My own intrinsics?? Clarify, please.


Examples: xor_2i, add_8u8, mul_2f, ...
I have a layer that wraps the Intel intrinsics and gcc builtins into a more standardized and compatible form. It has type checking. This is explained in my "Horse power math lib (2)" thread. I have also improved the MMX builtins of gcc by replacing them with inline asm.

This way I really code asm in C. This lets the compiler optimize the inlined routine within the given context, which gives the full range of optimizations; for instance, better register allocation across the routines.
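
A rough idea of what such a typed wrapper layer can look like, as a sketch only. The function names xor_2i, add_8u8 and mul_2f come from the post, but everything else is an assumption: the struct types `vec_2i`, `vec_8u8` and `vec_2f` are invented here, the bodies are plain scalar C instead of intrinsics or inline asm, and add_8u8 is assumed to wrap around on overflow instead of saturating.

```c
#include <stdint.h>

/* Hypothetical vector types, one distinct struct per element layout.
   Because they are different types, passing a vec_2f to xor_2i is a
   compile-time error, which is one way to get type checking. */
typedef struct { int32_t v[2]; } vec_2i;   /* 2 x 32-bit integers */
typedef struct { uint8_t v[8]; } vec_8u8;  /* 8 x unsigned bytes  */
typedef struct { float   v[2]; } vec_2f;   /* 2 x floats          */

/* Bitwise XOR of two integer pairs. */
static inline vec_2i xor_2i(vec_2i a, vec_2i b)
{
    vec_2i r;
    for (int i = 0; i < 2; ++i)
        r.v[i] = a.v[i] ^ b.v[i];
    return r;
}

/* Byte-wise add, ASSUMED to wrap modulo 256. */
static inline vec_8u8 add_8u8(vec_8u8 a, vec_8u8 b)
{
    vec_8u8 r;
    for (int i = 0; i < 8; ++i)
        r.v[i] = (uint8_t)(a.v[i] + b.v[i]);
    return r;
}

/* Element-wise multiply of two float pairs. */
static inline vec_2f mul_2f(vec_2f a, vec_2f b)
{
    vec_2f r;
    for (int i = 0; i < 2; ++i)
        r.v[i] = a.v[i] * b.v[i];
    return r;
}
```

Since the routines are small `static inline` functions, the compiler can inline them into the caller and allocate registers across several calls, which is the benefit described above. A real layer would swap each loop body for the matching intrinsic or inline asm.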
"Coding math tricks in asm is more fun than Java"
@AndyTX, Red Drake
quote:Original post by AndyTX
... however, I'm unaware of any shuffling instructions in MMX.


Unpack, shift and swap are in practice equivalent to the shuffle instructions of SSE. I call all these instructions swizzling instructions in the context of working on vectors of four floats. The most obvious case of swizzling in linear algebra is the cross product.
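
To show why the cross product is the obvious swizzling case, here is a scalar sketch in C that writes it as two swizzles per operand, a multiply and a subtract. The type `vec3` and the helpers `swizzle_yzx`, `swizzle_zxy`, `mul3`, `sub3` and `cross` are names invented for this illustration. A SIMD version would keep the same shape with a fourth, unused lane.

```c
/* Hypothetical 3-component vector for illustration. */
typedef struct { float x, y, z; } vec3;

/* Swizzles: reorder the components of one vector. */
static vec3 swizzle_yzx(vec3 v) { vec3 r = { v.y, v.z, v.x }; return r; }
static vec3 swizzle_zxy(vec3 v) { vec3 r = { v.z, v.x, v.y }; return r; }

/* Component-wise multiply and subtract. */
static vec3 mul3(vec3 a, vec3 b)
{
    vec3 r = { a.x * b.x, a.y * b.y, a.z * b.z };
    return r;
}

static vec3 sub3(vec3 a, vec3 b)
{
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

/* cross(a, b) = a.yzx * b.zxy - a.zxy * b.yzx
   e.g. the x component is a.y*b.z - a.z*b.y. */
static vec3 cross(vec3 a, vec3 b)
{
    return sub3(mul3(swizzle_yzx(a), swizzle_zxy(b)),
                mul3(swizzle_zxy(a), swizzle_yzx(b)));
}
```

Every arithmetic step here is element-wise, so the only part that needs special instructions on a SIMD unit is the reordering.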

Example (rearranging the elements of two MMX registers with unpack):

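; mm0 : x y   (low dword listed first)
; mm1 : z w
movq mm2, mm0      ; copy x y (a 64-bit MMX move needs movq, not mov)
punpckldq mm0, mm1 ; x z
punpckhdq mm1, mm2 ; w y (high dwords, destination's first)
punpckldq mm2, mm2 ; x x, might serve as a scalar later.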
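
The unpack sequence can be checked with a small scalar model in C. This is a sketch under assumptions: `mmx_reg` is a made-up struct holding the low and high dword of a 64-bit MMX register, and the two functions only model what the punpckldq and punpckhdq instructions do to dwords (destination operand first, as in Intel syntax). No real MMX code is executed.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit MMX register as two 32-bit dwords. */
typedef struct { uint32_t lo, hi; } mmx_reg;

/* punpckldq dst, src: interleave the LOW dwords.
   Result low = dst.lo, result high = src.lo. */
static mmx_reg punpckldq(mmx_reg dst, mmx_reg src)
{
    mmx_reg r = { dst.lo, src.lo };
    return r;
}

/* punpckhdq dst, src: interleave the HIGH dwords.
   Result low = dst.hi, result high = src.hi. */
static mmx_reg punpckhdq(mmx_reg dst, mmx_reg src)
{
    mmx_reg r = { dst.hi, src.hi };
    return r;
}
```

Starting from mm0 = (x, y) and mm1 = (z, w), written low dword first, replaying the three unpack steps on this model gives (x, z), (w, y) and (x, x). The destination operand always supplies the low half of the result, so swapping the operands of the high unpack swaps the order of that pair.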

"Coding math tricks in asm is more fun than Java"
