Compiler VS Hand Tuned Assembly...

Started by
16 comments, last by outRider 16 years, 7 months ago
I find myself constantly having the argument over whether a compiler will produce faster code than hand-tuned assembly. For the sake of focusing the argument, I am generally referring to the current crop of C++ compilers, namely g++ and MSVC8. I always insist that compilers will almost always out-optimize hand-tuned assembly. However, I don't have any real assembly experience; I simply repeat words I have heard from programmers who happen to be smarter/more experienced than me. Is there ever a time when hand-tuned assembly can produce measurable results? At this point in time, is there a reason to inline asm into C/C++ code? One argument that was posed was that GPU shaders are regularly hand-tuned at the driver level to boost performance. Isn't this simply because shader compilers are very young and have to address a very wide array of chips? These compilers simply aren't very good at this point in time, but can get away with it because they run on very fast, highly parallel hardware. What do you guys think?
A person skilled in optimization and in a particular CPU's architecture and assembly language can probably write faster code than a compiler. Why? The person has the tools and the knowledge that the compiler has, but has more knowledge about the context.

So then, what are the drawbacks?
  1. While a person may be able to write faster code, it will take him ten times as long to write it in assembly and optimize it as someone who writes it in a high-level language and lets the compiler optimize it.
  2. The biggest savings from optimization come from improving the algorithm rather than tweaking the assembly language.
  3. Optimized code is hard to maintain, and furthermore, if the optimized code is changed, it must be re-optimized.
John Bolton | Locomotive Games (THQ) | Current Project: Destroy All Humans (Wii). IN STORES NOW!
Compilers will almost always produce decent optimised code, but a compiler does not know everything about the code being written, so it cannot perform certain optimisations (e.g. around memory access latencies, branch mispredictions, etc.). A good (I repeat, good) assembler programmer can optimise code to the point where instruction and memory access latencies are hidden (using instruction pairing, etc.) and branches are minimised.

I can tell you now I have seen code come out of some modern compilers that could be re-written in assembler and would perform noticeably faster, but yes, generally compilers do a pretty good job. If the code you're writing doesn't need that extra 0.2ms speed increase for every iteration, then the C/C++ compiled code will do. But in some situations, especially in the game development industry, hand-optimising some code in assembler is a must.
Quote:Original post by fpsgamer
For the sake of focusing the argument, I am generally referring to the current crop of C++ compilers, namely g++ and MSVC8.

ICC is a much more interesting subject than g++. GCC is focused on broad compatibility first, optimization a distant second at best. It remains quite conservative in its default optimizations, and tends to break things when pushed. It's definitely not going to take advantage of, say, the latest EM64T instructions like the Intel compilers can.

Quote:At this point in time, is there a reason to inline asm into C,C++ code?

Yes. Just not for speed.

The question of GPU assembly is a complete red herring, no more relevant than the fact that it's still sometimes appropriate in DSPs and other embedded systems.
I generally find it more useful to try to rewrite my C++ code in a way that encourages the compiler to produce efficient assembly rather than to try to hand-code assembly. This may mean using intrinsics, or simply finding alternative ways to express your intent that the compiler has an easier job mapping to efficient code. The benefit of this approach is that you are working with the compiler and optimizer rather than bypassing them, and your code will benefit from future compiler improvements, including future compiler versions that target processors whose exact performance characteristics you don't know when you write the code. The resulting code is also generally easier to read, maintain, and port to other platforms if necessary.

I've often found it necessary and beneficial to optimize in this way, but almost never find it worth the additional effort to resort to inline assembly. This kind of optimization requires a good knowledge of your platform's assembly language, and for maximum effect it involves a fair amount of staring at the resulting disassembly in the debugger, but it still leaves the really tedious parts of assembly programming, like register allocation and detailed scheduling, to the compiler, which usually does a better job than a human (unless you have near-infinite time and patience).

Game Programming Blog: www.mattnewport.com/blog

Quote:Original post by fpsgamer
I find myself constantly having the argument over whether a compiler will produce faster code than hand tuned assembly.


There is NOTHING a compiler can do that cannot be done in hand-tuned assembly language, but there are MANY THINGS that hand-tuned assembly language can do that a specific compiler will not do.

Isn't it obvious?

Quote:Original post by fpsgamer
However I don't have any
real assembly experience


The people you are repeating are making faith-based statements, probably because the faith they follow makes them feel good about not being fluent in assembly or computer architecture.

Clearly the faith is wrong:

It is illogical to think that the compiler can out-perform the complete unconstrained versatility that is offered to assembly language programmers.

There are (seldom) cases where you can only equal the performance of the compiler, although usually these are cases where there is a hard bottleneck in play, such as memory bandwidth. Often the algorithms that are banging on a hard bottleneck can be refactored into better algorithms, because more work can be done in those wasted CPU cycles. The problem then reverts to one where assembly again can be superior.

The extra performance comes at a cost, in both development time and maintainability.

A finely tuned assembly algorithm can approach the theoretical limits of instruction throughput and memory bandwidth on a given machine (three instructions per cycle on AMD64s, for instance).
Quote:Original post by Rockoon1
It is illogical to think that the compiler can out-perform the complete unconstrained versatility that is offered to assembly language programmers.
No, it isn't. A compiler can do detailed code analysis about the best registers to use, how to move things in and out of memory to minimize cache misses, how to order instructions to maximize instruction level parallelism and minimize stalls in multiple pipeline chips, and more. Most importantly, it can do all that work in the blink of an eye. While a human might be capable of the same (and we're talking a highly paid senior engineer with lots of years of experience here), doing it without mistakes for more than a few dozen instructions could take days or weeks. The time ratios involved are on the order of a million to one -- and if new behavior is required of that code, the wasted time is huge.

Coders typically beat compilers in awkward situations where the compiler is simply not equipped to do the analysis properly or suffers from a serious bug. Either way, this usually ends up as a miserable failure on the compiler's part, and beating it actually becomes trivial. Vectorized instructions are a good example: outrunning the compiler for something like a matrix invert or multiply is easy.

It wasn't always like this. Highly effective optimization technology is very recent. Even VC6 generates lousy code, so really it's only since 2003 that Windows coders have had a real optimizer; maybe even later than that on the GCC side. Chips didn't use to be so complex either. Quake II was the last id Software title to really feature assembly as a significant optimizing factor; the high-end target processor at the time would've been the Pentium Pro. It was a superscalar chip, but just barely. A modern processor now is a far more complex beast -- on PC. (I am deliberately omitting any consideration of consoles or mobile devices.)

[Edited by - Promit on October 12, 2007 2:58:28 AM]
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote:Original post by Promit
Quote:Original post by Rockoon1
It is illogical to think that the compiler can out-perform the complete unconstrained versatility that is offered to assembly language programmers.
No, it isn't. A compiler can do detailed code analysis about the best registers to use, how to move things in and out of memory to minimize cache misses, how to order instructions to maximize instruction level parallelism and minimize stalls in multiple pipeline chips, and more.


You have jumped to the faith-based conclusion that an assembly language programmer cannot do these things.

And in spite of that error on your part, it is not even necessary for the assembly language programmer to do them, because the assembly language programmer has the freedom to try different methodologies, as well as examine the methodologies employed by a compiler - the compiler is going to produce the exact same fixed strategy every time, and that is proof that you are wrong.

Sorry, this is not a religion. We can use logic to answer the question.

Quote:Original post by Promit
Most importantly, it can do all that work in the blink of an eye.


I did mention development time. Did you miss it or is the rest of your post not a reply to me?

Quote:Original post by Promit
While a human might be capable of the same (and we're talking a highly paid senior engineer with lots of years of experience here), doing it without mistakes for more than a few dozen instructions could take days or weeks.


The only way I can swallow your "days or weeks" theory is if you are arguing the straw-man stance that an entire program would be written in assembly. It's a straw man.

In practice, specific high-workload algorithms are optimised, and then only when there is a performance concern. The majority of most applications can be written in an interpreted scripting language without a noticeable performance penalty. It's the 80/20 rule all the way down the rabbit hole.
Quote:Original post by Rockoon1
You have jumped to the faith-based conclusion that an assembly language programmer cannot do these things.

And in spite of that error on your part, it is not even necessary for the assembly language programmer to do them, because the assembly language programmer has the freedom to try different methodologies, as well as examine the methodologies employed by a compiler - the compiler is going to produce the exact same fixed strategy every time, and that is proof that you are wrong.

Sorry, this is not a religion. We can use logic to answer the question.


You know, I have a special place in my heart for arrogant assholes who know they're always right -- and they *are* right. People like Ciaran McCreesh or Theo de Raadt. But your arrogance is completely unjustified. You're making sweeping assertions without the slightest shred of evidence.

Have *you* written any modern assembly? x86-64 or SPARC assembly, for example? This isn't MIPS32. The number of factors to handle is enormous. AMD published a five-volume, ~2000-page manual on AMD64. Yes, if your compiler is being utterly stupid in some situation, hand-tweaking the assembly code it generates makes sense. It was popular in the past because compilers generally sucked and the instruction set was much more limited (there was no such thing as SSE*). These days, in the vast, vast majority of cases, you're not going to outperform ICC.
Quote:Original post by drakostar
Quote:Original post by Rockoon1
You have jumped to the faith-based conclusion that an assembly language programmer cannot do these things.

And in spite of that error on your part, it is not even necessary for the assembly language programmer to do them, because the assembly language programmer has the freedom to try different methodologies, as well as examine the methodologies employed by a compiler - the compiler is going to produce the exact same fixed strategy every time, and that is proof that you are wrong.

Sorry, this is not a religion. We can use logic to answer the question.


You know, I have a special place in my heart for arrogant assholes who know they're always right -- and they *are* right. People like Ciaran McCreesh or Theo de Raadt. But your arrogance is completely unjustified. You're making sweeping assertions without the slightest shred of evidence.

Have *you* written any modern assembly? x86-64 or SPARC assembly, for example? This isn't MIPS32. The number of factors to handle is enormous. AMD published a five-volume, ~2000-page manual on AMD64. Yes, if your compiler is being utterly stupid in some situation, hand-tweaking the assembly code it generates makes sense. It was popular in the past because compilers generally sucked and the instruction set was much more limited (there was no such thing as SSE*). These days, in the vast, vast majority of cases, you're not going to outperform ICC.


I think the point Rockoon1 was making is that compilers are very linear; the code they produce is defined by the source code and a large set of predefined rules. They are incapable of producing optimisations that fall outside their rule set. This leaves a large number of ways of creating unusual code. For example, I recently, just for fun, optimised an RC64 decryption algorithm and managed to increase its speed by over 25 times. I did this by simultaneously processing two keys - one key in the integer unit and the other key in the SIMD unit (1). However, drakostar is right in that it did require a good understanding of the CPU and its instruction set.

Skizz

1) Using compiler intrinsic functions almost counts as assembly programming - in this case it wouldn't really help, as there's no way to easily pipeline the IA32 and SIMD instructions. Having said that, I've not analysed the output when interlacing integer and SIMD using intrinsics.

