bpj1138

3D math lib (C and Asm)...


Here's a little program to test 3D Asm vs C functions. It's funny how it actually tells you the number of cycles a function takes to complete. Feel free to optimize any of this or use it in your code. It's in the public domain.

Some notes on the test:

  • rdtsc is not a reliable timer on multi-core CPUs.
  • The test is very artificial, with the entire data set fitting in L1 cache at all times (and no cache warmup/clearing between tests).
  • The results of your calculations aren't used, so a good optimiser is allowed to simply remove a lot of your test code.
  • Your C code isn't equivalent to the ASM code.

    • The C compiler has to account for possible aliasing in some of your functions. Reading your inputs into local variables before writing to your outputs would help the compiler write optimal assembly in these cases.
    • Some of the ASM functions use SIMD types, whereas none of the C functions do.
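The aliasing and "unused results" points above can be sketched in C. This is only an illustration: the `vec3` type and the function names `vec3_cross` and `cross_checksum` are hypothetical and are not taken from the original test program. The first function copies its inputs into locals before writing the output; the second folds every result into a value the caller has to consume, so the optimiser cannot discard the work being timed.

```c
#include <stddef.h>

/* Hypothetical 3-component vector type (not from the original test program). */
typedef struct { float x, y, z; } vec3;

/* Reads both inputs into locals before writing the output. The compiler no
   longer has to assume that a store to *out changes *a or *b, and the
   function stays correct when out points at one of the inputs. */
void vec3_cross(vec3 *out, const vec3 *a, const vec3 *b)
{
    const float ax = a->x, ay = a->y, az = a->z;
    const float bx = b->x, by = b->y, bz = b->z;

    out->x = ay * bz - az * by;
    out->y = az * bx - ax * bz;
    out->z = ax * by - ay * bx;
}

/* Benchmark helper: accumulates every result into the return value, so the
   loop has an observable effect and cannot be removed as dead code. */
float cross_checksum(const vec3 *a, const vec3 *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        vec3 r;
        vec3_cross(&r, &a[i], &b[i]);
        sum += r.x + r.y + r.z;
    }
    return sum;
}
```

A benchmark would print or otherwise use the value returned by `cross_checksum` after the timed region.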

A few remarks on top of Hodgman's:

- Your test is hardly fair, most of the matrix operations are implemented using MMX/SSE opcodes in assembler but C doesn't get that luxury (I wouldn't bet on the compiler figuring it out and using SIMD on its own). You could use intrinsics. As it is now you may as well pepper your C code with random sleep()'s because the results will be just as meaningless.

- Benchmarking the smaller functions like zero vector and copy vector is useless, as the processor will spend most of its time pushing/popping to/from the stack (especially in C, since the compiler usually leaves the standard push/pop's even when unnecessary just to be sure, and assuming their cost will be amortized over the function's effective instruction count - which obviously is a flawed assumption for trivial functions)

- In some of your C functions, you call other functions (like in matrixCamera, you call a bunch of vector* functions) - function calls are not free. However, in assembler - surprise - you inline the extra work instead... although I suspect a good compiler would inline too (but likely not as efficiently)

So, in conclusion, yes, assembler is faster, but you really aren't helping C to compete. Your test heavily favours the assembler implementation, and as such doesn't give very relevant measurements. Edited by Bacterius

I was pretty disappointed with the speed improvement you get from assembly language. If you don't believe me, the C compiler actually outperforms SSE code. This is the 3x3 determinant that I coded in SSE (the C version is shown here; you can find the SSE version in the dead.txt file):


float determinant(float a[3], float b[3], float c[3])
{
    return
        a[0] * (b[1] * c[2] - c[1] * b[2]) -
        b[0] * (a[1] * c[2] - c[1] * a[2]) +
        c[0] * (a[1] * b[2] - b[1] * a[2]);
}

In SSE, you can actually do all these ops in parallel, almost like looking at the equation in vertical columns instead of looking at it horizontally like this:


a[0] * (b[1] * c[2] - c[1] * b[2]) - b[0] * (a[1] * c[2] - c[1] * a[2]) + c[0] * (a[1] * b[2] - b[1] * a[2]);

But it turns out that the time it takes to rearrange the data with SHUFPS into different combinations of the a, b, c vectors is longer than the time gained from the parallel operations. Also, once you do things in parallel, it's hard to operate on the result vector "horizontally". I just ended up using SHUFPS to "rotate the components" and then scalar SSE ops to do the final additions/subtractions. Well, it's slower than just letting the C compiler use the FPU with none of this fancy shuffling!
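To make the shuffle overhead concrete, here is a rough sketch of a "vertical" 3x3 determinant written with SSE intrinsics. It is not the routine from dead.txt: it assumes the determinant is computed as the scalar triple product a · (b × c), which is mathematically equal to the expansion above, and the names `determinant_scalar`, `determinant_simd` and `USE_SSE` are made up for this example. Note that three of the instructions are shuffles that only move data, and the final sum still needs two more shuffles plus scalar adds.

```c
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define USE_SSE 1
#endif

/* Scalar reference: the expansion from the post above. */
float determinant_scalar(const float a[3], const float b[3], const float c[3])
{
    return a[0] * (b[1] * c[2] - c[1] * b[2])
         - b[0] * (a[1] * c[2] - c[1] * a[2])
         + c[0] * (a[1] * b[2] - b[1] * a[2]);
}

/* Sketch: determinant as a . (b x c). Falls back to the scalar version
   when SSE is not available at compile time. */
float determinant_simd(const float a[3], const float b[3], const float c[3])
{
#ifdef USE_SSE
    /* _mm_set_ps takes lanes from highest to lowest; lane 3 is padding. */
    __m128 va = _mm_set_ps(0.0f, a[2], a[1], a[0]);
    __m128 vb = _mm_set_ps(0.0f, b[2], b[1], b[0]);
    __m128 vc = _mm_set_ps(0.0f, c[2], c[1], c[0]);

    /* Rotate components: (x, y, z, w) -> (y, z, x, w). */
    __m128 b_yzx = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 c_yzx = _mm_shuffle_ps(vc, vc, _MM_SHUFFLE(3, 0, 2, 1));

    /* t = b * c.yzx - b.yzx * c = (cross.z, cross.x, cross.y, 0) */
    __m128 t = _mm_sub_ps(_mm_mul_ps(vb, c_yzx), _mm_mul_ps(b_yzx, vc));
    __m128 cross = _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 0, 2, 1));

    /* Horizontal part: broadcast lanes 1 and 2 and add them with scalar ops. */
    __m128 p = _mm_mul_ps(va, cross);
    __m128 s = _mm_add_ss(p, _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1)));
    s = _mm_add_ss(s, _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2)));
    return _mm_cvtss_f32(s);
#else
    return determinant_scalar(a, b, c);
#endif
}
```

The two functions can differ in the last bits for general inputs because the operations are grouped differently.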

This isn't even the half of it. I tried to code the dot product with SSE. You'd think this would give you a huge boost, but it suffers from the same "horizontal" op problem. Mind you this is fixed in SSE3.
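As an illustration of the "horizontal" problem, here is a minimal sketch of a 4-component dot product using only SSE1 intrinsics. The name `dot4` and the `USE_SSE` macro are assumptions for this example. The multiply is one instruction, but collapsing the four products into one number takes a shuffle, an add, a move and another add. SSE3 added a horizontal add (`_mm_hadd_ps`) and SSE4.1 a dot-product instruction (`_mm_dp_ps`), which is presumably the kind of fix meant above.

```c
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define USE_SSE 1
#endif

/* Sketch of a 4-wide dot product; falls back to scalar code without SSE. */
float dot4(const float a[4], const float b[4])
{
#ifdef USE_SSE
    /* Vertical step: four multiplies at once, p = (p0, p1, p2, p3). */
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* Horizontal step 1: swap neighbours and add,
       t = (p0+p1, p0+p1, p2+p3, p2+p3). */
    __m128 t = _mm_add_ps(p, _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1)));

    /* Horizontal step 2: bring lane 2 down to lane 0 and add. */
    __m128 s = _mm_add_ss(t, _mm_movehl_ps(t, t));
    return _mm_cvtss_f32(s);
#else
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```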

The timing should be good. I'm suspicious of figures like "9 cycles", but maybe this is because of instruction pipeline parallelism? Anyhow, these small functions should be inlined, but that would mean coding the asm with GAS syntax, which I wanted to avoid in the first place.

Yes, the C is coded a bit lazily.

Don't get discouraged --- persist. While many of the comments about your code are legit, don't let the current "racism" against assembly-language bother you. A lot of the motivation for this is simply jealousy (present company excluded, I'm sure). However, also don't be too surprised when you find the potential of many applications for assembly-language are "defeated" by some subtle issue, or "the way things are done". Also, don't be too surprised when you look at the disassembly of code your normal old GNU compiler (gcc/g++) generated... and find all sorts of complex SIMD gymnastics --- they are getting pretty good, those compiler writers.

However, if you persist, you'll find some places where SIMD/SSE/AVX/FMA will help a lot. They can moan and groan all they like, but my f64mat4x4 matrix multiply is considerably faster with SIMD/AVX assembly language, measured fairly against maximally optimized C. And since it gets called a LOT in some of my applications, that little chunk of code matters. Also the function that transforms vertices (which transforms the position as well as zenith, north, east vectors) is very much faster than even the spiffy SIMD code the C compilers generate. Since it processes all the vertices in one call of my assembly-language function, it does a lot of work amazingly quickly.

I upgrade my routines every time I get a new CPU. It was actually fun to watch how much more compact, and how much faster these routines got when I recently wrote my bulldozer version (latest AMD CPU with AVX/FMA/etc instructions). It helps a lot to have 256-bit wide registers and 16 of them instead of only 8. Enjoy those ymm registers and AVX instructions --- you'll love 'em!

My latest assembly-language project was a set of 4-wide trig functions, so you can compute the sine or cosine of four f64 values simultaneously (or certain combinations thereof, like 2 sine and 2 cosine). They work well enough too, though I have some very strange "precision issue" that I'm about to post a question about. But before the anti-assembly hordes jump on me, understand this "precision issue" exists in my C version of these functions too.

Incidentally, I'm happy when my assembly-language can't handsomely beat the compilers! That just means I can spend more time on more interesting problems. However, whenever I identify a seriously fundamental core routine that gets executed a zillion times by many of my applications, I suffer zero shame for creating an assembly-language function, no matter what the anti-assembly crowd says.

Of course, assembly-language is a "mid-level language" for me, which is true only for a small minority. What does that mean? Well, just try designing FPGA algorithms [at the LUT level] and writing microcode some time. Those are my "low-level languages". So assembly-language is my "mid-level language", and C is my high-level language. Anything so-called "higher" than C is just religion in my judgement (causes more trouble than any possible narrow benefits). Fortunately for me, I stay far away from religions, even including any assembly-language religion anyone may choose to start.

BTW, on the motherboards I have, the clock cycle counter instruction can be configured to be "accurate" (whichever way you want to consider "accurate": as in "actual CPU cycles no matter how fast the CPU is clocking", or as in "full-speed clock cycles no matter how fast or slow your CPU is clocking"). So you can perform reasonable speed ratings if you take the time to make sure you know what you're doing. Personally I only care about the number of CPU cycles, because that's all we can reasonably optimize for (and it doesn't change every 100 microseconds). Edited by maxgpgpu
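For anyone who wants to try this kind of cycle counting from C, here is a minimal sketch of a timing harness. It assumes GCC or Clang on x86, where `__rdtsc()` is available from `<x86intrin.h>`, and falls back to `clock()` elsewhere. The names `read_ticks`, `min_ticks`, `g_sink` and `sum3` are invented for this example. It does one warm-up call, keeps the minimum over several runs to reduce noise from interrupts and core migration, and writes each result to a `volatile` variable so the call cannot be optimised away. On many recent CPUs the counter advances at a constant rate, so the numbers are reference ticks rather than exact core cycles.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <x86intrin.h>
/* Time-stamp counter; note that rdtsc does not serialise the pipeline. */
uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Portable fallback with much coarser resolution. */
uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

/* Signature of the function under test (an assumption for this sketch). */
typedef float (*kernel_fn)(const float *data);

/* Results are stored here so the compiler must keep the calls. */
volatile float g_sink;

/* Returns the smallest tick count observed over `repeats` runs. */
uint64_t min_ticks(kernel_fn fn, const float *data, int repeats)
{
    uint64_t best = UINT64_MAX;
    g_sink = fn(data);                 /* warm-up: caches, branch predictors */
    for (int i = 0; i < repeats; ++i) {
        uint64_t t0 = read_ticks();
        g_sink = fn(data);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Trivial example kernel. */
float sum3(const float *v) { return v[0] + v[1] + v[2]; }
```

For functions as small as `sum3`, the measurement overhead itself dominates, which is one reason single-digit cycle figures deserve suspicion.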


But, it turns out that the time it takes to rearrange the data...
What if your input data had been "rearranged" to begin with? Then the SSE version would be fast and the C version would be slow.

This is the main lesson taught by DOD -- that the ideal data layout is different depending on the type of transformation/processing that needs to be done, showing that the OOD practice of binding all transforms to a single data layout is actually harmful.
e.g. if, when generating your input data, you know that it's going to be used to calculate determinants, then why not have the original data generator output 'rotated' data in the first place?
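A rough sketch of what "rearranged to begin with" could look like: a structure-of-arrays layout in which each component of every matrix lives in its own contiguous array. The type `mat3_batch` and the function `det3_batch` are hypothetical names for this illustration. Every iteration of the loop is independent and reads consecutive floats, so the same expression can be evaluated for four (or eight) matrices per SIMD instruction with no shuffling, either by hand or by an auto-vectorising compiler.

```c
#include <stddef.h>

/* Hypothetical structure-of-arrays layout: a[k][i] is component k of
   vector a of the i-th matrix, and likewise for b and c. */
typedef struct {
    const float *a[3];
    const float *b[3];
    const float *c[3];
} mat3_batch;

/* Computes n determinants, one per index i, using the same expansion as
   the scalar determinant() in the earlier post. */
void det3_batch(const mat3_batch *m, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        const float a0 = m->a[0][i], a1 = m->a[1][i], a2 = m->a[2][i];
        const float b0 = m->b[0][i], b1 = m->b[1][i], b2 = m->b[2][i];
        const float c0 = m->c[0][i], c1 = m->c[1][i], c2 = m->c[2][i];

        out[i] = a0 * (b1 * c2 - c1 * b2)
               - b0 * (a1 * c2 - c1 * a2)
               + c0 * (a1 * b2 - b1 * a2);
    }
}
```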

don't let the current "racism" against assembly-language bother you
N.B. a lot of the critics of assembly would only suggest that instead of writing asm by hand, you should still be writing C and using the equivalent compiler intrinsics for the asm commands that you want to use. This gives you the same functionality that asm gives you, but doesn't disable the optimising compiler (meaning your code can be inlined and rearranged as usual, instead of creating a big anti-optimisation wall around the points where you transition from C to asm and back) and allows your code to work on newer compilers that have dropped support for inline asm, etc...
e.g. where you'd use shufps in asm, you'd use _mm_shuffle_ps in C.
However, when writing in this style, you're trusting the compiler to make sensible use of the stack/registers/etc, which not all will do... The latest GCC and MSVC do a decent job, but older versions will still write some pretty silly asm for you. Edited by Hodgman


They work well enough too, though I have some very strange "precision issue" that I'm about to post a question about. But before the anti-assembly hordes jump on me, understand this "precision issue" exists in my C version of these functions too.


Same thing here, the FPU and SSE versions are different by about 0.0000001%. I assume this is due to the 80-bit internal precision of the FPU.

BTW, I coded an SSE audio mixer, which actually relies on a conditional statement like (x < 1.0?). With this kind of logic, precision really matters, and so the FPU and SSE versions return TOTALLY different results.
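Here is a small sketch of how a comparison against 1.0 can flip depending on intermediate precision. It does not reproduce the x87's 80-bit format: `double` stands in for "a wider intermediate", and the function names and input values are chosen purely for illustration. The exact sum of the two inputs is 1 - 2^-26, which is below 1.0 but rounds up to exactly 1.0 when stored as a 32-bit float.

```c
/* Sum rounded to single precision before the comparison, as scalar SSE
   code would do. The volatile store forces the rounding even on targets
   that keep intermediates in wider registers. */
int below_one_float(float a, float b)
{
    volatile float sum = a + b;
    return sum < 1.0f;
}

/* Same inputs, but the sum is kept in a wider format (double here, as a
   stand-in for the FPU's extended precision). */
int below_one_wide(float a, float b)
{
    volatile double sum = (double)a + (double)b;
    return sum < 1.0;
}

/* Example inputs as C99 hexadecimal float literals:
   0x1.fffffep-1f = 1 - 2^-24 (the largest float below 1.0)
   0x1.8p-25f     = 3 * 2^-26 */
static const float threshold_a = 0x1.fffffep-1f;
static const float threshold_b = 0x1.8p-25f;
```

With these inputs the float version sees exactly 1.0 and reports "not below one", while the wide version sees 1 - 2^-26 and reports "below one", so a branch on that test goes different ways.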

I'm still writing a paper about it.


(meaning your code can be inlined and rearranged as usual, instead of creating a big anti-optimisation wall around the points where you transition from C to asm and back)


That's a good point. I don't even know how the C compiler understands which registers I used in my NASM routines. I believe you can specify this in GAS.

I guess I didn't say it, but it is worth saying just to make sure everyone understands. Almost without exception I write and test a C version of what I eventually implement in assembly-language. That's almost necessary these days anyway, unless you have the luxury of only needing to support one CPU. Typically, of course, the C function can be set to "run on any [[semi]-recent] CPU", and your ultra-whizz-bang hand-optimized assembly-language function only runs on hardware that supports your favorite instruction set. For example, my newest assembly-language routines only run on CPUs that support AVX/FMA instructions (256-bit wide ymm registers). So when someone runs on an older CPU, they're stuck with the oldie, moldie, slowish C function.

If you're writing a super wizz-bang library that zillions of important programs will adopt, then you almost have to make your library interrogate the cpuid information, then have every call to your functions thereafter get routed to one of 2, 3, 5, several versions of your code based upon CPU capability. That is a royal pain in the butt, and a job no anti-assembly-language hack would ever consider in this world of 99-weeks of unemployment! Hahaha.

When it comes to precision, always watch out for subtracting two large, nearly equal numbers from each other. Somehow, "subtract" seems so simple and innocuous, but in fact, a great deal of trouble leads back to simple looking subtract. I first learned about this hassle when my optical design program (the first program I ever wrote, back in the dark ages) had precision problems, even with everything in f64 (double precision floating point). It was a royal pain in the butt to figure out how to re-formulate equations to eliminate that problem, but it sure taught me this important lesson! This problem tends to pop up in all sorts of "range reduction" issues too. Ouch! You think you're gonna get better precision by range reducing an angle near PI/2 to (PI/2 - angle), and... kaboom! Instead of getting better precision, you get your shorts handed to you. To understand this problem is very easy by considering that specific example. Say your angle is just short of PI/2 and you want to take the sine of that angle. But your Chebyshev routines for sine and cosine are only good from 0.0000 to PI/4, so you say "no problem", because cosine(PI/2-angle) == sine(angle) and sine(PI/2-angle) == cosine(angle). So whenever the angle gets over PI/4, you just do the subtraction and execute the "other" function. Piece of cake, right? But just look at the binary bit patterns for PI/2 and angles just short of PI/2 !!! Uh, oh. The bit patterns are almost identical down to the last few bits. Which means, when you subtract your angle from your PI/2 constant, the result only has a few significant bits! In some cases, this problem can royally bite you!
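The cancellation described above is easy to reproduce in C. The sketch below uses single precision to make the effect large, and the function names are made up for the example. It builds an angle just short of PI/2, then "range reduces" it by subtracting from PI/2 again. The subtraction itself is exact; the damage was done earlier, when the angle was rounded to the nearest float near 1.57, and the subtraction exposes that rounding error at full size relative to the small result.

```c
/* Stores angle = PI/2 - delta as a float, then recovers delta by
   subtracting the stored angle from PI/2. The volatile stores force
   rounding to single precision at each step. */
float recovered_offset(float delta)
{
    const float half_pi = 1.57079632679489661923f;
    volatile float angle = half_pi - delta;    /* rounded near 1.57 */
    volatile float back  = half_pi - angle;    /* only a few good bits left */
    return back;
}

/* Relative error of approx with respect to exact (exact must be nonzero). */
double relative_error(float approx, float exact)
{
    double e = ((double)approx - (double)exact) / (double)exact;
    return e < 0.0 ? -e : e;
}
```

For example, `relative_error(recovered_offset(1.0e-4f), 1.0e-4f)` comes out on the order of 1e-4, roughly three orders of magnitude worse than the 6e-8 or so that a float normally carries.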

Of course, the exact same problems exist in C too, this is not an assembly-language issue. Though you do have more control in assembly language than you do in C, which sometimes lets you find a way to minimize or eliminate the problem. So this is in fact one more legitimate reason to dip into assembly-language once in a while.

PS: My current precision issue is not what I mention above. In fact, my issue is finding a way to determine whether the existing library functions are more or less precise than each of my several assembly-language routines. There is something suspicious about the existing library functions too, even those in some of the super-fancy super-fast math libraries.

Your example of comparing to 1.0000 is probably the first, simplest and most classic introduction to floating-point precision problems! Since adding and subtracting tiny numbers from 1.0000 is a sure-fire way to lose precision, special attention must be paid, sometimes excruciating attention (like in my optical design software). This problem plagues high-level language programmers too, but gives us another justification to dip into assembly-language, this time to re-arrange operations to reduce/eliminate precision loss, not to speed up execution! Imagine that!

As for the "optimization" issue, I tend to write all my assembly-language as pure functions. Since I'm not personally much concerned about people with "older hardware" (except that my programs "execute correctly"), making completely separate assembly-language functions eliminates many (but not all) compiler optimization issues, especially in 64-bit mode code (which is all I care much about any more), where the function protocol is quite intelligent on the linux side (but royally stupid on the windoze side). However, on both OS you can write functions that have very little "efficiency overhead" as long as you only change a limited number of registers and don't need a stack frame (though honestly, the overhead of 1 subtract-immediate, 2 register-register moves, 1 push and 1 pop isn't that much). But I've written some mighty powerful (and not that short) assembly-language routines (so-called "leaf functions") that don't even need their own stack frame (which means we can even skip those 5 instructions if we wish).

One thing that saved my shorts recently is throwing VisualStudio into the trash-heap of history and doing everything with gnu C tools on both linux and windoze. There is still hassle in making portable code, but at least I don't need to create (and keep in sync) an entirely separate MASM file for every last freaking bit of assembly-language I write. And since I learned naming my assembly-language files with a .S suffix instead of .s suffix makes the pre-processor function on my assembly-language file (!!! thank you GNU !!!), I can take care of minor linux/windoze ABI differences all in the same file, which is especially welcome in 64-bit mode! BTW, I've heard (but can't confirm from personal knowledge) that some optimization is done by the GNU tools on the assembly-language generated by their compiler. Which means it can take note of what we do inside our assembly-language functions for purposes of optimization just as easily as its own. Thanks again GNU (if this is true). I mention in passing that when I wrote a compiler many years ago, I too did optimization on the assembly-language output. But even before they optimize, the assembly-language generated by the GNU tools is getting rather impressive - chock full of SIMD for one thing, sometimes including impressive parallelism.

PS: I certainly see the potential allure of intrinsics. However, I do not trust those who control those intrinsics to look after my interest. When I write assembly, I freaking know what is happening (to the extent possible)... and I freaking want to.

PS: One general comment about the "pain and hassle" of assembly-language. The detractors of assembly-language make a big deal about this "pain and hassle"... and they have a point. Getting up to speed on assembly-language does involve some pain and hassle. Point taken and fully admitted. However, because assembly-language is so clear and so essentially simple, the "pain and hassle" is extremely limited! Once you figure out the register-set, the instruction set, and the ABI requirements --- you pretty much know everything you need to know. In contrast, the "pain and hassle" of unlimited and ever-expanding messy interactions with higher-level packages never ends. This is also one important reason I switched from C++ to C, after learning C++ first. My life has been indescribably easier and less painful since that switch. I can deal with low-level where I "know everything". I can design whole CPUs with SSI gates or FPGA LUTs, no problem. Just keep me away from endless high-level whiz-bang packages with all sorts of fancy internal schemes designed to "make my life easier and simpler". In my experience, they do exactly the opposite! When "everything works the same", it is much easier to make everything work, and make sure it keeps working in the future! I mean, as far as I'm concerned, all I need is operators and functions. All the hidden "fancy" is just trouble waiting to happen. This is what the term "Murphy's Law" was invented to describe, I sometimes think. And boy, when anything is "hidden", it will blow up in your face in the worst way, in the most obscure and complicated ways, at the worst possible times. I prefer to avoid Murphy! Edited by maxgpgpu


If you're writing a super wizz-bang library that zillions of important programs will adopt, then you almost have to make your library interrogate the cpuid information, then have every call to your functions thereafter get routed to one of 2, 3, 5, several versions of your code based upon CPU capability. That is a royal pain in the butt, and a job no anti-assembly-language hack would ever consider in this world of 99-weeks of unemployment! Hahaha.


The best way I can figure out to do this is to first determine the SSE version, then copy the appropriate routines to fixed addresses so you don't have to use a jump table. This would mean that we'd have to stagger these functions in memory, so that all versions can fit into the memory spaces, wasting some memory. That way we don't add more function call overhead.
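For comparison, here is a sketch of the more conventional alternative to copying routines around: pick an implementation once at start-up and call it through a function pointer. This is a different technique from the fixed-address scheme described above, and it does cost one indirect call per use. The sketch assumes GCC or Clang on x86, where the built-ins `__builtin_cpu_init` and `__builtin_cpu_supports` wrap the cpuid query; on other targets it simply reports "no SSE2". All function names are hypothetical, and `dot_unrolled` is plain C standing in for a real SSE/AVX build.

```c
#include <stddef.h>

typedef float (*dot_fn)(const float *a, const float *b, size_t n);

/* Baseline version that runs on any CPU. */
float dot_portable(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* Stand-in for a hand-optimised build: four independent accumulators. */
float dot_unrolled(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)                 /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

/* Returns nonzero when the running CPU reports SSE2 support. */
int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;                          /* unknown target: use the baseline */
#endif
}

/* Call once at start-up and keep the returned pointer. */
dot_fn select_dot(void)
{
    return cpu_has_sse2() ? dot_unrolled : dot_portable;
}
```

The pointer returned by `select_dot()` would typically be stored in a global and used for every later call, so the cpuid check runs only once.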


But I've written some mighty powerful (and not that short) assembly-language routines (so-called "leaf functions") that don't even need their own stack frame (which means we can even skip those 5 instructions if we wish).


Right, if the function in question doesn't call any other functions, you don't need to make a stack frame for it. You'll see this in my code a lot. That's what all those "var" definitions are. They're just positive offsets from the stack pointer (instead of negative offsets which get you the parameters). This all works fine, unless you call a function, which will overwrite your variables (because we didn't actually allocate any stack space). But! If you know this and expect it, you can go around it in simple cases. If there were a lot of function calls, you'd be better off actually allocating the stack space.


In contrast, the "pain and hassle" of unlimited and ever-expanding messy interactions with higher-level packages never ends.


Tell me about it! This is why I just started writing x86 code. I'm tired of learning the fad of the day and then holding the bag when the fad disappears.

As usual, I'm talking to myself... Yes, you should also look at libSIMDx86 v0.4.0... http://simdx86.sourceforge.net/
