Quote:Original post by samoth
I cannot say much about the particular compiler you named (VC 2008), since I'm not using that one, but in general I would strongly recommend against "hand optimizing" code in assembler. If your compiler does not properly optimize intrinsics, then you either forgot to turn on optimizations or the compiler is broken.
What compiler do you use? I'm mainly familiar with Visual C++, GCC, and Intel (which costs money). I know there are others out there, but I haven't had the opportunity to work with them. I'd be interested in playing around with any that do intrinsics well.
Quote:In general, assembler code is more or less a black box for the compiler. Some (gcc, most notably) are a bit more intelligent and can still perform minor optimizations (scheduling and register coloring) even on inline assembly, but most compilers will just treat your code as-is and add some extra instructions around it.
Since assembler is a "black box", the compiler cannot prove certain things or even make assumptions, and thus is unable to do most (or all) optimizations inside and around it. On the other hand, a compiler will normally perform all valid optimizations that it is capable of doing with intrinsic functions just fine. A decent optimizer will also interleave your SSE code with non-SSE code when it is possible, which is just awesome, since that code will run "for free".
In theory... in practice I've found that to not be quite so true. And I'm not the only one:
http://www.virtualdub.org/blog/pivot/entry.php?id=162
It's a bit old, but sadly still true for the most part.
Also consider that a fair number of crucial intrinsics are just plain missing. One example is the 64-bit long divide, whose asm mnemonic is simply idiv: there's no intrinsic for it (or at least there wasn't when I checked a year ago). And I'm not even talking about the many obscure SSE "packed add with swizzled kittens" type instructions. I was just trying to do fixed-point math.
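To make the fixed-point case concrete, here's a minimal sketch of the pattern in plain C++. It assumes a 16.16 format so that the wide intermediate fits in an int64_t; the names (fixed16, fx_mul, fx_div) are made up for illustration. With a 32.32 format the same divide needs a 128-bit dividend over a 64-bit divisor, which is what the x64 idiv instruction does natively and what I couldn't get at through an intrinsic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16.16 fixed-point type: 16 integer bits, 16 fractional bits.
using fixed16 = std::int32_t;
constexpr int kFracBits = 16;

// Convert a floating-point value to 16.16 (truncates toward zero).
constexpr fixed16 to_fixed(double v) {
    return static_cast<fixed16>(v * (1 << kFracBits));
}

// Multiply: widen to 64 bits, then drop the extra fractional bits.
// Right-shifting a negative value is arithmetic on the usual compilers,
// but that is an assumption before C++20.
inline fixed16 fx_mul(fixed16 a, fixed16 b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return static_cast<fixed16>(wide >> kFracBits);
}

// Divide: pre-scale the dividend in 64 bits so the quotient keeps
// its 16 fractional bits. This is the "long divide" step.
inline fixed16 fx_div(fixed16 a, fixed16 b) {
    assert(b != 0);
    const std::int64_t wide =
        static_cast<std::int64_t>(a) * (std::int64_t{1} << kFracBits);
    return static_cast<fixed16>(wide / b);
}
```

For example, fx_div(to_fixed(3.0), to_fixed(2.0)) gives the 16.16 encoding of 1.5. Overflow of the final result is not checked in this sketch.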
The whole "optimize around it" idea never seemed to pan out in real code. Take a look at some disassembled intrinsics. I personally was surprised at just how badly VC++ 2008 mangled them. Most of my experience was with the 64-bit compiler, so maybe it's because that one was rather new. I don't know. But even simple vector math came out with all sorts of extra mov and pack instructions.
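If you want to check this on your own compiler, something like the following is the kind of "simple vector math" I mean. It's only a sketch: scale_add is a made-up name, and it assumes an x86 target with SSE and an element count that is a multiple of 4. Build it with optimizations on and ask for an assembly listing (/FA on Visual C++, -S on GCC). Ideally the loop body is little more than two loads, a mulps, an addps, and a store; anything much beyond that is the compiler's doing.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_*_ps)

// Computes out[i] = a[i] * scale + b[i], four floats at a time.
// Assumes n is a multiple of 4; unaligned loads/stores are used so
// the arrays need no special alignment.
inline void scale_add(const float* a, const float* b, float scale,
                      float* out, std::size_t n) {
    assert(n % 4 == 0);
    const __m128 s = _mm_set1_ps(scale);  // broadcast scale to all 4 lanes
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        const __m128 r  = _mm_add_ps(_mm_mul_ps(va, s), vb);
        _mm_storeu_ps(out + i, r);
    }
}
```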
On top of that, I'm no asm guru by any stretch; I just dabble in it from time to time. Sure, debugging asm can be a bit annoying, and there's very little documentation, but I've found it very easy to beat the VC++ 2008 compiler in performance.
In my experience, smaller functions or ones that will be inlined are best left as C++ code, while larger performance-critical ones are better off as hand-tuned assembly. Intrinsics so far have left me quite disappointed performance-wise, simply because they're not properly supported. I'd love to use them in, say, a vector library or what-not (where they would be perfect). I'm hoping VC++ 2010 fixes things. But if you're playing with intrinsics and getting slow code, take a look at the disassembly; the problem might not lie where you expect it.