|Original post by Procyon Lotor|
Did you read my post? I said "I'm not an expert", and "it doesn't use any newer instruction sets".
I'm 99.9% sure the C# version doesn't use SSE instructions either. First, compilers just aren't smart enough to do that, and second, there's just nothing to be gained by it in your case.
Edit: Ok, they might use the scalar versions of SSE instructions, but that won't offer any big performance boost.
Edit2: Ok, just looked some of the latencies up. The SSE version of sqrt is actually a good deal faster than the x87 one. But the rest are pretty much the same. Try compiling your C++ program to use SSE, and see what happens.
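To make the SSE suggestion concrete, here is a minimal sketch of a square root that goes through the scalar SSE instruction instead of x87. The helper names sse_sqrt and length3 are made up for illustration and are not from the original program. On a non-x86 target the sketch simply falls back to std::sqrt.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Scalar square root using the SSE sqrtss instruction rather than x87 fsqrt.
inline float sse_sqrt(float x) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}
#else
// Fallback for targets without SSE: let the standard library decide.
inline float sse_sqrt(float x) {
    return std::sqrt(x);
}
#endif

// Hypothetical workload: length of a 3D vector.
inline float length3(float x, float y, float z) {
    return sse_sqrt(x * x + y * y + z * z);
}
```

Usually you don't need the intrinsics at all. Compiling the plain C++ with /arch:SSE2 (MSVC) or -msse2 -mfpmath=sse (GCC) should get the compiler to emit scalar SSE code for ordinary float math. Whether that closes the gap to the C# version is something you'd have to measure.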
I was looking for advice on why my C++/ASM code wasn't performing well, not an argument about which language is faster.
Well, you did start out by saying "I was interested in seeing what the performance different between C++, C# and ASM with a C++ wrapper would be".
Sounds like you were interested in which language was faster.
My point is simply that there's nothing too surprising in C# being fast.
But to answer your question, the main reason your ASM version isn't performing very well is that ASM is damn complex. You not only have to write the code, you also have to take latency, register pressure and other factors into account.
You need to reorder your instructions to hide the latency. The CPU can read up to 3 instructions per cycle, and a lot of operations have several cycles of latency (division and square root in particular are horrible). So you have to order your instructions so that each one doesn't depend on the result of the ones just before it, especially if those had high latency.
And branches are an entire research topic on their own. You want to avoid control hazards that stall the pipeline, and again, the order of instructions plays a huge role here as well.
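The dependency point is easier to see in C++ than in ASM. This is only a sketch with made-up function names, not code from the original program. The first loop is one long dependency chain: every add has to wait for the previous one. The second keeps four independent accumulators, so the CPU can have several adds in flight at once.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: each addition depends on the result of the previous one,
// so the loop runs at roughly one add per add-latency.
float sum_single_chain(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) {
        acc += x;
    }
    return acc;
}

// Four accumulators: the four additions in each iteration are independent
// of each other, which gives the CPU work to overlap with the latency.
float sum_four_chains(const std::vector<float>& v) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        a0 += v[i];
    }
    return (a0 + a1) + (a2 + a3);
}
```

Note that the two versions add the numbers in a different order, so the results can differ in the last bits for general floating-point input. That is also why a compiler won't normally do this transformation for you unless you allow it to relax floating-point rules. How much it helps, if at all, depends on the CPU and the compiler, so measure it.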
So basically, don't bother with the ASM version. It's not faster, and doesn't even have the potential to be faster, but it takes *a lot* more work.
The C++ version is more interesting. Not sure why that is slower (although the type conversions you mentioned might play a role. That, or the compiler just isn't as good at optimizing this particular program).
[Edited by - Spoonbender on May 18, 2006 8:50:29 PM]