Compiler VS Hand Tuned Assembly...

fpsgamer    856
I find myself constantly having the argument over whether a compiler will produce faster code than hand-tuned assembly. For the sake of focusing the argument, I am generally referring to the current crop of C++ compilers, namely g++ and MSVC8. I always insist that compilers will almost always out-optimize hand-tuned assembly. However, I don't have any real assembly experience; I simply repeat words I have heard from programmers who happen to be smarter/more experienced than me.

Is there ever a time when hand-tuned assembly can produce measurable results? At this point in time, is there a reason to inline asm into C/C++ code?

One argument that was posed was that GPU shaders are regularly hand-tuned at the driver level to boost performance. Isn't this simply because shader compilers are very young and have to address a very wide array of chips? These compilers simply aren't very good at this point in time, but can get away with it because they run on very fast, highly parallel hardware.

What do you guys think?

JohnBolton    1372
A person skilled in optimization and in a particular CPU's architecture and assembly language can probably write faster code than a compiler. Why? The person has the same tools and knowledge that the compiler has, but more knowledge about the context.

So then, what are the drawbacks?
  1. While a person may be able to write faster code, it will take him ten times as long to write and optimize it in assembly as it would take someone who writes it in a high-level language and lets the compiler optimize it.
  2. The biggest savings from optimization come from improving the algorithm rather than tweaking the assembly language.
  3. Optimized code is unmaintainable; furthermore, if the optimized code is changed, it must be re-optimized.

technomancer    199
Compilers will almost always produce decent optimised code, but a compiler does not know everything about the code being written, so it cannot perform certain optimisations (around memory access latencies, branch mispredictions, etc.). A good (I repeat, good) assembly programmer can optimise code to the point where memory access latencies are hidden (using instruction pairing, etc.) and branches are minimised.

I can tell you now, I have seen code come out of some modern compilers that could be rewritten in assembler and would perform noticeably faster; but yes, generally compilers do a pretty good job. Generally, if the code you're writing doesn't need that extra 0.2 ms speed increase for every iteration, then the compiled C/C++ code will do. But in some situations, especially in the game development industry, hand-optimising some code in assembler is a must.

drakostar    224
Quote:
Original post by fpsgamer
For the sake of focusing the argument, I am generally referring to the current crop of C++ compilers, namely g++ and MSVC8.

ICC is a much more interesting subject than g++. GCC is focused on broad compatibility first, optimization a distant second at best. It remains quite conservative in its default optimizations, and tends to break things when pushed. It's definitely not going to take advantage of, say, the latest EM64T instructions like the Intel compilers can.

Quote:
At this point in time, is there a reason to inline asm into C/C++ code?

Yes. Just not for speed.

The question of GPU assembly is a complete red herring, no more relevant than the fact that it's still sometimes appropriate in DSPs and other embedded systems.

mattnewport    1038
I generally find it more useful to try to rewrite my C++ code in a way that encourages the compiler to produce efficient assembly rather than to hand-code assembly. This may mean using intrinsics, or simply finding alternative ways to express your intent that the compiler has an easier time mapping to efficient code. The benefit of this approach is that you are working with the compiler and optimizer rather than bypassing them, and your code will benefit from future compiler improvements, including future compiler versions that target processors whose exact performance characteristics you don't know when you write the code. The resulting code is also generally easier to read, to maintain, and to port to other platforms if necessary.
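For example, here is a minimal sketch of the kind of intrinsics rewrite I mean (SSE on x86; the function and names are just illustrative, not from any real codebase):

    #include <xmmintrin.h>  // SSE intrinsics

    // Illustrative sketch only: add two float arrays four elements at a time.
    // The data parallelism is expressed explicitly, but the compiler still
    // handles register allocation and instruction scheduling.
    void AddArrays(const float* a, const float* b, float* out, int n)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4)
        {
            __m128 va = _mm_loadu_ps(a + i);            // load 4 floats (unaligned)
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // 4 adds at once
        }
        for (; i < n; ++i) // scalar tail for leftover elements
            out[i] = a[i] + b[i];
    }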

I've often found it necessary and beneficial to optimize in this way, but I almost never find it worth the additional effort to resort to inline assembly. This kind of optimization requires a good knowledge of your platform's assembly language, and for maximum effect it involves a fair amount of staring at the resulting disassembly in the debugger, but it still leaves the really tedious parts of assembly programming, like register allocation and detailed scheduling, to the compiler, which usually does a better job than a human (unless you have near-infinite time and patience).

Rockoon1    104
Quote:
Original post by fpsgamer
I find myself constantly having the argument over whether a compiler will produce faster code than hand-tuned assembly.


There is NOTHING a compiler can do that cannot be done in hand-tuned assembly language, but there are MANY THINGS that hand-tuned assembly language can do that a specific compiler will not do.

Isn't it obvious?

Quote:
Original post by fpsgamer
However, I don't have any real assembly experience


The people you are repeating are making faith-based statements, probably because the faith they follow makes them feel good about not being fluent in assembly or computer architecture.

Clearly the faith is wrong:

It is illogical to think that the compiler can outperform the complete unconstrained versatility that is offered to assembly language programmers.

There are rare cases where you can only equal the performance of the compiler, although usually these are cases where a hard bottleneck such as memory bandwidth is in play. Often the algorithms that are banging on a hard bottleneck can be refactored into better algorithms, because more work can be done in those otherwise wasted CPU cycles. The problem then reverts to one where assembly can again be superior.

The extra performance comes at a cost, in both development time and maintainability.

A finely tuned assembly algorithm can approach the theoretical limits of instruction throughput and memory bandwidth on a given machine (3 instructions per cycle on AMD64s, for instance).

Promit    13246
Quote:
Original post by Rockoon1
It is illogical to think that the compiler can outperform the complete unconstrained versatility that is offered to assembly language programmers.
No, it isn't. A compiler can do detailed code analysis to determine the best registers to use, how to move things in and out of memory to minimize cache misses, how to order instructions to maximize instruction-level parallelism and minimize stalls on multiple-pipeline chips, and more. Most importantly, it can do all that work in the blink of an eye. While a human might be capable of the same (and we're talking a highly paid senior engineer with lots of years of experience here), doing it without mistakes for more than a few dozen instructions could take days or weeks. The time ratios involved are on the order of a million to one -- and if new behavior is required of that code, the wasted time is huge.

Coders typically beat compilers in awkward situations where the compiler is simply not equipped to do the analysis properly or suffers from a serious bug. Either way, this usually ends up as a miserable failure on the compiler's part, and beating it actually becomes trivial. Vectorized instructions are a good example: outrunning the compiler for something like a matrix invert or multiply is easy.
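To give a rough idea of why (a sketch under assumptions, not production code: column-major matrix layout and 16-byte alignment), a hand-vectorized 4x4 matrix * vector multiply looks something like this:

    #include <xmmintrin.h>

    // Sketch: 4x4 matrix * 4-vector with SSE intrinsics. The matrix is
    // column-major and 16-byte aligned; each column is scaled by one
    // component of the vector and the four products are summed.
    void Mat4MulVec4(const float* m /* 16 floats */, const float* v, float* out)
    {
        __m128 x = _mm_set1_ps(v[0]); // broadcast v.x to all 4 lanes
        __m128 y = _mm_set1_ps(v[1]);
        __m128 z = _mm_set1_ps(v[2]);
        __m128 w = _mm_set1_ps(v[3]);

        __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, _mm_load_ps(m + 0)),
                       _mm_mul_ps(y, _mm_load_ps(m + 4))),
            _mm_add_ps(_mm_mul_ps(z, _mm_load_ps(m + 8)),
                       _mm_mul_ps(w, _mm_load_ps(m + 12))));

        _mm_store_ps(out, r);
    }

A compiler handed the equivalent scalar loop will often miss this vectorization entirely, which is exactly the kind of gap a human can close.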

It wasn't always like this. Highly effective optimization technology is very recent, and chips didn't use to be so complex either. Even VC6 generates lousy code, so really it's only since 2003 that Windows coders have had a real optimizer; maybe even later than that on the GCC side. Quake II was the last id Software title to really feature assembly as a significant optimizing factor; the high-end target processor at the time would've been the Pentium Pro. It was a superscalar chip, but just barely. A modern processor is a far more complex beast -- on PC, at least. (I am deliberately omitting any consideration of consoles or mobile devices.)

[Edited by - Promit on October 12, 2007 2:58:28 AM]

Rockoon1    104
Quote:
Original post by Promit
Quote:
Original post by Rockoon1
It is illogical to think that the compiler can outperform the complete unconstrained versatility that is offered to assembly language programmers.
No, it isn't. A compiler can do detailed code analysis to determine the best registers to use, how to move things in and out of memory to minimize cache misses, how to order instructions to maximize instruction-level parallelism and minimize stalls on multiple-pipeline chips, and more.


You have jumped to the faith-based conclusion that an assembly language programmer cannot do these things.

And in spite of that error on your part, it is not even necessary for the assembly language programmer to do them, because the assembly language programmer has the freedom to try different methodologies, as well as examine the methodologies employed by a compiler - the compiler is going to produce the exact same fixed strategy every time, and that is proof that you are wrong.

Sorry, this is not a religion. We can use logic to answer the question.

Quote:
Original post by Promit
Most importantly, it can do all that work in the blink of an eye.


I did mention development time. Did you miss it, or is the rest of your post not a reply to me?

Quote:
Original post by Promit
While a human might be capable of the same (and we're talking a highly paid senior engineer with lots of years of experience here), doing it without mistakes for more than a few dozen instructions could take days or weeks.


The only way I can swallow your "days or weeks" theory is if you are arguing the strawman stance that an entire program would be written in assembly. It's a strawman.

In practice, specific high-workload algorithms are optimised, and then only when there is a performance concern. The majority of most applications can be written in an interpreted scripting language without a noticeable performance penalty. It's the 80/20 rule all the way down the rabbit hole.

drakostar    224
Quote:
Original post by Rockoon1
You have jumped to the faith-based conclusion that an assembly language programmer cannot do these things.

And in spite of that error on your part, it is not even necessary for the assembly language programmer to do them, because the assembly language programmer has the freedom to try different methodologies, as well as examine the methodologies employed by a compiler - the compiler is going to produce the exact same fixed strategy every time, and that is proof that you are wrong.

Sorry, this is not a religion. We can use logic to answer the question.


You know, I have a special place in my heart for arrogant assholes who know they're always right -- and they *are* right. People like Ciaran McCreesh or Theo de Raadt. But your arrogance is completely unjustified. You're making sweeping assertions without the slightest shred of evidence.

Have *you* written any modern assembly? x86-64 or SPARC assembly, for example? This isn't MIPS32. The number of factors to handle is enormous. AMD published a five-volume, ~2000-page manual on AMD64. Yes, if your compiler is being utterly stupid in some situation, hand-tweaking the assembly code it generates makes sense. It was popular in the past because compilers generally sucked and the instruction set was much more limited (there was no such thing as SSE*). These days, in the vast, vast majority of cases, you're not going to outperform ICC.

Skizz    794
Quote:
Original post by drakostar
Quote:
Original post by Rockoon1
You have jumped to the faith-based conclusion that an assembly language programmer cannot do these things.

And in spite of that error on your part, it is not even necessary for the assembly language programmer to do them, because the assembly language programmer has the freedom to try different methodologies, as well as examine the methodologies employed by a compiler - the compiler is going to produce the exact same fixed strategy every time, and that is proof that you are wrong.

Sorry, this is not a religion. We can use logic to answer the question.


You know, I have a special place in my heart for arrogant assholes who know they're always right -- and they *are* right. People like Ciaran McCreesh or Theo de Raadt. But your arrogance is completely unjustified. You're making sweeping assertions without the slightest shred of evidence.

Have *you* written any modern assembly? x86-64 or SPARC assembly, for example? This isn't MIPS32. The number of factors to handle is enormous. AMD published a five-volume, ~2000-page manual on AMD64. Yes, if your compiler is being utterly stupid in some situation, hand-tweaking the assembly code it generates makes sense. It was popular in the past because compilers generally sucked and the instruction set was much more limited (there was no such thing as SSE*). These days, in the vast, vast majority of cases, you're not going to outperform ICC.


I think the point Rockoon1 was making is that compilers are very linear; the code they produce is defined by the source code and a large set of predefined rules. They are incapable of producing optimisations that fall outside their rule set. This leaves a large number of ways of creating unusual code. For example, I recently, just for fun, optimised an RC64 decryption algorithm and managed to increase its speed by over 25 times. I did this by simultaneously processing two keys - one key in the integer unit and the other key in the SIMD unit [1]. However, drakostar is right in that it did require a good understanding of the CPU and its instruction set.

Skizz

[1] Using compiler intrinsic functions almost counts as assembly programming - in this case it wouldn't really help, as there's no way to easily pipeline the IA32 and SIMD instructions. Having said that, I've not analysed the output when interleaving integer and SIMD using intrinsics.
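For the shape of the idea, here is a toy sketch using intrinsics (this is not the actual RC64 code, the constants and names are made up, and per the footnote the real interleaving was done in assembler):

    #include <emmintrin.h> // SSE2 integer intrinsics
    #include <stdint.h>

    // Toy sketch of interleaving: two independent streams are processed in
    // the same loop body, one with integer instructions and one with SIMD
    // instructions, so a superscalar CPU can execute them in parallel.
    void ProcessTwoStreams(uint32_t* intStream, __m128i* simdStream,
                           uint32_t intKey, __m128i simdKey, int n)
    {
        const __m128i step = _mm_set1_epi32((int)0x9E3779B9u); // made-up constant
        for (int i = 0; i < n; ++i)
        {
            // integer-unit work on stream A
            intStream[i] = (intStream[i] ^ intKey) + 0x9E3779B9u;

            // SIMD-unit work on stream B (four 32-bit lanes at once)
            simdStream[i] = _mm_add_epi32(_mm_xor_si128(simdStream[i], simdKey), step);
        }
    }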

silvermace    634
@the OP: Use your common sense; that involves reading material from respected sources about what compilers are good at and what they aren't good at. Improve your algorithms, again and again, then use the tools at your disposal to improve your chances, namely intrinsics and then blocks of inline asm.

And it's pretty much a given that when you're doing black-art stuff like this: http://www.codeproject.com/system/soviet_protector.asp you're going to need assembler. But this discussion is about speed, not functionality.

To that end....

@Promit: I'm with you all the way.

@Rockoon1: By the sounds of things you've never dealt with a code-base larger than a few hundred thousand lines, and you probably have never had a PWHC or E&Y risk audit team analyze your code and give a rating for all the things Promit mentioned (specifically the ones which have absolutely nothing to do with how fast your code is). If you provided them with massive listings of pure assembler and then told them it was to gain performance, you probably would have been blindfolded, chucked into a black GMC panel van, and never heard from again.
Quote:
the assembly language programmer has the freedom to try different methodologies
funny you should say that, because the C++ programmer from your direct market competitor is free to try out those same methodologies using a high-level language, a smart compiler, and some pretty pro tools like VTune, which will tell you cool stuff like how well you're utilizing the multilevel cache on your Intel chip... of course "the assembly programmer" could use these tools too, but, oh, wait... you need a C++ compiler to instrument your code for VTune.

FYI> I'd love to work where Rockoon1 works, because it sounds like the chap doesn't have any deadlines.

@Skizz: I see where your rational and considered argument is coming from (and I also appreciate the fact that you can present your point without the condescending tone which Rockoon1 insists on using). Interestingly, you managed to make good use of the SIMD and integer units in parallel; I think it's only a matter of time till we're having this conversation over "hand-written C++ or OpenMP 5.0", especially now that CPU architectures like CELL B/E and AMD's new quad cores are going to be the norm.

implicit    504
Quote:
@Rockoon1: By the sounds of things you've never dealt with a code-base larger than a few hundred thousand lines, and you probably have never had a PWHC or E&Y risk audit team analyze your code and give a rating for all the things Promit mentioned (specifically the ones which have absolutely nothing to do with how fast your code is). If you provided them with massive listings of pure assembler and then told them it was to gain performance, you probably would have been blindfolded, chucked into a black GMC panel van, and never heard from again.
Uh.. You're missing the point, really. He said it was possible, not practical. There's a bit of a difference between them, you know..

Quote:
funny you should say that, because the C++ programmer from your direct market competitor is free to try out those same methodologies using a high-level language, a smart compiler and some pretty Pro tools like VTune which will tell you cool stuff like how well you're utilizing the Multilevel cache on your Intel chip... of course "the assembly programmer" could use these tools too, but, oh, wait...you need a C++ Compiler to instrument your code for VTune.
Well, yes. But there are quite a few optimizations that modern C++ compilers just can't handle, or where they are simply impractical, and where you have to resort to assembly language. Consider the dynamic recompilers used in modern emulators, for instance. Or simply taking advantage of specific instructions which your current compiler just doesn't support: there's no way you're going to get your C compiler to generate a SIMD instruction it has no concept of.
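As a sketch of that last point (GCC-style inline asm on x86-64; the helper is hypothetical, and the same idea applies to new SIMD instructions): until a compiler grows a builtin or intrinsic for a new instruction such as POPCNT, the only way to use it from C is to emit it yourself:

    #include <stdint.h>

    // Hypothetical sketch: emitting an instruction the compiler has no
    // concept of. Assumes a CPU that actually supports POPCNT and
    // GCC-style extended asm syntax.
    static inline uint64_t popcount64(uint64_t x)
    {
        uint64_t r;
        __asm__ ("popcnt %1, %0" : "=r" (r) : "r" (x));
        return r;
    }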

Rockoon1    104
Quote:
Original post by drakostar
But your arrogance is completely unjustified. You're making sweeping assertions without the slightest shred of evidence.


What assertions are those? Do you understand them?

Quote:
Original post by drakostar
Have *you* written any modern assembly?


Yes.

Quote:
Original post by drakostar
x86-64 or SPARC assembly, for example?


x86-64, 80x86, 8088 ... need I go back farther?

Quote:
Original post by drakostar
This isn't MIPS32.


No kidding - what does that have to do with it?

It's fine and all that you can bring up irrelevant details.. but don't think for a second that I am not knowledgeable enough to know that they are irrelevant.

Quote:
Original post by drakostar
The amount of factors to handle is enormous.


Not true. Processors are deterministic and follow a set of simple rules. The person who told you otherwise was preaching a faith.

Quote:
Original post by drakostar
AMD published a five-volume, ~2000-page manual on AMD64.


Great.. I guess that's a link to tech specs.. did you read it?

Quote:
Original post by drakostar
Yes, if your compiler is being utterly stupid in some situation, hand-tweaking the assembly code it generates makes sense.


..or if the abstract machine the compiler is founded upon (such as the C "abstract machine") does not encompass the full feature set of the processor..

Does ICC have a rotate-through-carry intrinsic? No?

Then how on earth can you write any algorithm which is best performed using those instructions? You can't. It is simply not possible with ICC.

Why wouldn't ICC have such an intrinsic? It isn't because the instruction is useless. It is because the C abstract machine has no concept of a flags register.

Can you show me a C compiler that can generate a function that returns bits in the flags register, and then take advantage of it? They don't do that.

How about the reverse of that? A function that takes the flags register as input? Can't do that either.
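To make that concrete, here is a minimal sketch (GCC-style extended asm on x86; the helper name is made up, and ICC's inline asm syntax differs) of carry-flag code that has no C equivalent:

    #include <stdint.h>

    // Minimal sketch: rotate-through-carry has no C idiom because C has no
    // notion of the carry flag. The ADD sets CF on unsigned overflow, and
    // RCL then rotates that carry bit into x.
    static inline uint32_t add_then_rcl(uint32_t x, uint32_t a, uint32_t b)
    {
        __asm__ ("addl %2, %1\n\t" /* a += b; sets the carry flag */
                 "rcll $1, %0"     /* rotate x left one bit through carry */
                 : "+r" (x), "+r" (a)
                 : "r" (b)
                 : "cc");
        return x;
    }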

How about a C compiler that will correctly mix FPU and scalar SSE instructions in order to reduce register pressure when many floating point constants are in play? They don't do that. While that reload of a constant, or that swizzle of an SSE register, "seems" free, it isn't free at all: it's yet another execution unit wasting time.

Quote:
Original post by drakostar
It was popular in the past because compilers generally sucked, and the instruction set was much more limited


The compilers sucked *even though* CPUs were simpler. Think about it.

Quote:
Original post by drakostar
These days, in the vast vast majority of cases, you're not going to outperform ICC.


Speak for yourself.

Funny that every version of ICC brings greater speed, yet "in the vast majority of cases" it cannot be beaten.. your argument is faith-based. The facts speak for themselves.

Rockoon1    104
Quote:
Original post by silvermace
@Rockoon1: By the sounds of things you've never dealt with a code-base larger than a few hundred thousand lines


I'm wondering what the entire code base has to do with leveraging assembly language. Nobody suggested writing an entire program in assembler. It's the 80/20 rule all the way down the rabbit hole.

Quote:
Original post by silvermace
, and you probably have never had a PWHC or E&Y risk audit team analyze your code and give a rating for all the things Promit mentioned (specifically the ones which have absolutely nothing to do with how fast your code is).


That is certainly true. I don't even know what PWHC or E&Y stand for...

Quote:
Original post by silvermace
If you provided them with massive listings of pure assembler and then told them it was to gain performance, you probably would have been blindfolded, chucked into a black GMC panel van, and never heard from again.


There are certainly many reasons not to write something in assembler.. there are many reasons not to code in C or C++ as well..

Each language offers a unique feature set that is suited to the problem before you. This includes Visual Basic or other RAD languages, as well as domain-specific languages. No one of them nullifies another; each is a tool that has its place.

Now what do you do when performance is an issue, yet you are using an algorithm that is provably asymptotically minimal? Throw your hands in the air and claim that ICC is the best asm programmer ever, so it's impossible?

To quote Abrash: "There ain't no such thing as the fastest code"

Quote:
Quote:
the assembly language programmer has the freedom to try different methodologies
funny you should say that, because the C++ programmer from your direct market competitor is free to try out those same methodologies using a high-level language


No, he isn't. He is free to try different algorithms, but in each case he gets a single methodology for that algorithm out of the compiler.

Quote:

a smart compiler, and some pretty pro tools like VTune, which will tell you cool stuff like how well you're utilizing the multilevel cache on your Intel chip...


pssst... many of the feature sets of VTune and CodeAnalyst are specifically designed for someone who can fiddle with the instructions without interference from a compiler that thinks it knows best.

Quote:

I'd love to work where Rockoon1 works, because it sounds like the chap doesn't have any deadlines.


This has nothing to do with the OP's question. You can surely find many reasons not to code in assembler.. so what? You also come dangerously close to casting ad hominems.. do you think that makes you right?

If you don't like my attitude, you should check the moderators' first. My attitude was fine until he dismissed my comments and then flooded his reply with irrelevant detail after irrelevant detail. He didn't address my point at all; he went on and on with the assumption that what compilers do is otherwise impossible. Such facts are not in evidence and never will be, because the claim is not logically sound on its face.

I don't take kindly to people who try to come off as smart when really all they are doing is changing the subject or shifting the goal posts. I can't imagine a reason why he did it that doesn't deserve a negative response. Not a single point he made contradicted what he dismissed.

mattnewport    1038
Quote:
Original post by Rockoon1
Quote:
Quote:
the assembly language programmer has the freedom to try different methodologies
funny you should say that, because the C++ programmer from your direct market competitor is free to try out those same methodologies using a high-level language

No, he isn't. He is free to try different algorithms, but in each case he gets a single methodology for that algorithm out of the compiler.

That's not true. For any given algorithm there are many ways to implement it in C++. Once you introduce intrinsics for vector instructions etc., you open up many more possibilities for the detailed implementation of an algorithm. Relatively minor changes in the way you express the implementation of an algorithm can have significant effects on the quality and performance of the assembly the compiler generates. There are things that you can't make the compiler do and that you have to resort to assembly for, but in my experience it is almost never the best use of your time to do so; for a given amount of programmer time, more significant performance benefits can usually be had by moving on to lower-hanging fruit elsewhere.

outRider    852
Quote:
Original post by mattnewport
Relatively minor changes in the way you express the implementation of an algorithm can have significant effects on the quality and performance of the assembly the compiler generates.


Let's think about that for a moment. The optimizing quality of a compiler is inversely proportional to the veracity of that statement for the compiler in question.

mattnewport    1038
Quote:
Original post by outRider
Quote:
Original post by mattnewport
Relatively minor changes in the way you express the implementation of an algorithm can have significant effects on the quality and performance of the assembly the compiler generates.


Let's think about that for a moment. The optimizing quality of a compiler is inversely proportional to the veracity of that statement for the compiler in question.


What's your point? If optimizing compilers were perfect, then this wouldn't be as necessary. They're not perfect. Some modern optimizing compilers are very good, but even the best current optimizing compilers still need a helping hand from the programmer if they are to produce the most efficient code.

In some cases there are code changes to the implementation of an algorithm that preserve the results at a high level, but that would not be legal for the compiler to make because they change the detailed semantics of the code. Even a theoretically perfect optimizing compiler might not be able to make all such changes, because it doesn't know that, for example, changing a few of the least significant bits of the result of a floating point calculation is acceptable for your needs.
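A concrete example of what I mean about floating point (a minimal sketch; the values are chosen only to make the effect obvious):

    #include <cstdio>

    // Floating point addition is not associative, so a compiler may not
    // legally reorder it: the two groupings below give different answers.
    int main()
    {
        float big = 1.0e8f, small = 1.0f;
        float left  = (big + small) - big; // small is absorbed by rounding: 0
        float right = small + (big - big); // reassociated by hand: 1
        std::printf("%g vs %g\n", left, right);
        return 0;
    }

If those low-order differences don't matter to you, you can reassociate by hand (or opt in with a fast-math compiler switch), but the compiler can't assume that on its own.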

outRider    852
Quote:
Original post by mattnewport
Quote:
Original post by outRider
Quote:
Original post by mattnewport
Relatively minor changes in the way you express the implementation of an algorithm can have significant effects on the quality and performance of the assembly the compiler generates.


Let's think about that for a moment. The optimizing quality of a compiler is inversely proportional to the veracity of that statement for the compiler in question.


What's your point? If optimizing compilers were perfect, then this wouldn't be as necessary. They're not perfect. Some modern optimizing compilers are very good, but even the best current optimizing compilers still need a helping hand from the programmer if they are to produce the most efficient code.

In some cases there are code changes to the implementation of an algorithm that preserve the results at a high level, but that would not be legal for the compiler to make because they change the detailed semantics of the code. Even a theoretically perfect optimizing compiler might not be able to make all such changes, because it doesn't know that, for example, changing a few of the least significant bits of the result of a floating point calculation is acceptable for your needs.


My point is self-evident. I agree with you that compilers are not perfect and that they need a helping hand, so I'm not sure why you're justifying yourself to me, unless it is to concur. I think most who have commented here hold the opinion that compilers are not perfect.

But Rockoon's statement still holds: you cannot influence the decisions a compiler makes at the register level. Compilers play by static and rigid scheduling rules, suffer from idiosyncrasies, and do ship with bugs -- none of which can be remedied at the source level, no matter how much you fiddle with your code.

As for whether or not it's the best use of time, that entirely depends on how much you stand to gain, on your skill level, and on how much time you have.

