# Assembly vs. C++ vs. C# finding primes

## Recommended Posts

raccoonone    158
I was interested in seeing what the performance different between C++, C# and ASM with a C++ wrapper would be. So far the results have surprised me since C# is the fastest, this must be because my C++/ASM isn't good enough so maybe you guys can point out where I'm going wrong. The program I'm testing with is quite simple all it does is: a) get a number from the user b) calculate all the primes between 1 and the number c) output the number of primes found and the time taken to find them. All 3 programs use the same algorithm: if x is 1, 2, or 3 return true if x devided by 2 has a remainder of zero return false otherwise devide by all odd numbers starting with 3 until the square root of x is reached or the remainder is zero (indicating a composite number) I compiled all of the programs with full optimizations and ran them on my computer (AthlonXP 2800+ with 1GB of RAM). The C++ version and my ASM version both took 19.5-20secs to find all the primes between 1 and 10,000,000. The C# version took 20-20.5secs when I was using ints, but after I changed it to use doubles to reduce the ammount of type conversions, it does it in 16-17secs! When I looked at the disassembly for the C# version I found that it's using almost all floating point processer instructions, but then I calls some functions that I couldn't step into (well I'm sure I could if I spent some time figuring out how). Here's a short part of the disassembly. I haven't used floating point instructions much and I don't really understand what it's doing in the beginning there. It looks to me like it's just moving data through the ALU by pushing and then poping it somewhere else.
                if (TestNumber % x == 0)
0000008c  fld         qword ptr [ebp+8]
0000008f  sub         esp,8
00000092  fstp        qword ptr [esp]
00000095  fld         qword ptr [esp+10h]
00000099  sub         esp,8
0000009c  fstp        qword ptr [esp]
0000009f  call        79235954
000000a4  fldz
000000a6  fcomip      st,st(1)
000000a8  fstp        st(0)
000000aa  jp          000000B9
000000ac  jne         000000B9


Here's the source of my C# version
using System;
using System.Collections.Generic;
using System.Text;

namespace Csharp_Test_console_application
{
class Program
{
static void Main(string[] args)
{
double x;
int PrimesFound = 0;
DateTime start, end;
Console.WriteLine("Find primes up to 'x'. Please enter x:");
start = System.DateTime.Now;
for (double y = 1; y <= x; y++)
{
if (IsPrime(y))
{
PrimesFound++;
}
}
end = System.DateTime.Now;
Console.WriteLine("Found {0} primes in {1}", PrimesFound, end - start);
}

static bool IsPrime(double TestNumber)
{
double limit = System.Math.Sqrt(TestNumber) + 1;
if (TestNumber == 2)
return true;
if(TestNumber % 2 == 0) //Check if TestNumber is even
return false;
double x = 3;
//Devide until the square root is reached
for (; x < limit; x += 2)
{
if (TestNumber % x == 0)
return false;
}
return true;
}
}
}


Source of my C++ version
// Cpp primes.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
using std::cin;
using std::cout;
bool IsPrime(int TestNumber);

int _tmain(int argc, _TCHAR* argv[])
{
int x, PrimesFound = 0;
DWORD start, end;
cout << "Find primes up to 'x'. Please enter x:";
cin >> x;
start = GetTickCount();
for (int y = 1; y <= x; y++)
{
if (IsPrime(y))
{
PrimesFound++;
}
}
end = GetTickCount();
cout << "Found " << PrimesFound << " primes in " << end - start;
while(!_kbhit());
return 0;
}

_inline bool IsPrime(int TestNumber)
{
int limit = (int)sqrt((double)TestNumber) + 1;
if (TestNumber == 2)
return true;
if(TestNumber % 2 == 0) //Check if TestNumber is even
return false;
int x = 3;
//Devide until the square root is reached
for (; x < limit; x += 2)
{
if (TestNumber % x == 0)
return false;
}
return true;
}


Assembly with C++ for the input and output
// asm primes.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

using std::cout;
using std::cin;

int _tmain(int argc, _TCHAR* argv[])
{
int x, PrimesFound = 0;
DWORD start, end;
cout << "Find primes up to 'x'. Please enter x:";
cin >> x;
start = GetTickCount();
int limit, temp;
_asm
{
pusha					//Save registers
mov	esi, 1				//Start checking numbers from 1
mov	edi, x				//Loop until X is reached
Forloop: cmp	esi, edi
ja		Done			//Exit if X has been reached

//Check if this number is one, two, or three
cmp		esi, 3
ja		Next
inc		PrimesFound
jmp		CheckDone
//Check if divisible by two
Next:	test	esi, 1						//If ESI is odd it's not divisible by 2
jnz		Next2						// so continuing checking for primality
jmp		CheckDone

Next2:	mov		temp, esi
fild	temp						//Convert
fsqrt								// to square
fistp	limit						//  root

mov		ebx, 3						//Set EBX to 3

mov		ecx, limit					//Set up the loop counter
shr		ecx, 1						//  devide by 2 since we'll be incrementing by 2

Check:	mov		eax, esi					//Restore test number
xor		edx, edx					// and clear remainder

div		ebx							//Divide and
or		edx, edx					// test for
jz		CheckDone					//  zero

inc		ebx							//increment divisor
inc		ebx							// ""
loop	Check						//Continue until the square root is reached

//Number is prime
inc		PrimesFound
CheckDone: inc	esi							//Increment ESI so the next number will be checked
jmp	Forloop							// for primality

Done: popa									//Restore registers
}
end = GetTickCount();
cout << "Found " << PrimesFound << " primes in " << end - start;
while(!_kbhit());
return 0;
}


stdafx.h for C++ versions
// stdafx.h : include file for standard system include files,
// or project specific include files that are used frequently, but
// are changed infrequently
//

#pragma once

#define WIN32_LEAN_AND_MEAN		// Exclude rarely-used stuff from Windows headers
#include <stdio.h>
#include <tchar.h>
#include <math.h>
#include <windows.h>
#include <conio.h>
#include <iostream>



##### Share on other sites
Washu    7829
None of your methods are particularily optimized at all. From number theory, you can show that you do not need to test against every number below the sqrt(n), but just the primes below the sqrt(n) to see if n is prime. As such, I would recommend changing your algorithm to remember each prime it finds.
Quote:
 All 3 programs use the same algorithm:if x is 1, 2, or 3 return true

1 is not prime.
A simple sieve i wrote a bit back for a similar post.
#include <iostream>#include <deque>#include <cmath>void FindPrimes(std::deque<std::size_t>& primes, std::size_t upperBound) {	std::size_t current = 15;	while(current < upperBound) {		current += 2;		bool isPrime = true;		std::size_t mod6 = current % 6;		//Every prime number p > 3 has the form 6q + 1 or 6q + 5.		if(!(mod6 == 1 || mod6 == 5)) {			continue;		}		std::size_t maxVal = static_cast<std::size_t>(std::ceil(std::sqrt(static_cast<double>(current))));		for(int i = 0; primes[i] <= maxVal; ++i) {			if(current % primes[i] == 0) {				isPrime = false;				break;			}		}		if(isPrime) {			primes.push_back(current);		}	}}int main() {	std::size_t knownPrimes[6] = {2, 3, 5, 7, 11, 13}; //seed the list.	std::deque<std::size_t> primes(&knownPrimes[0], &knownPrimes[6]);	std::size_t upperBound;	std::cout<<"Please enter the upper bound: ";	std::cin>> upperBound;	FindPrimes(primes, upperBound);	std::copy(primes.begin(), primes.end(), std::ostream_iterator<std::size_t>(std::cout, "\n"));}

##### Share on other sites
raccoonone    158
Ya, I realize that only the primes below sqrt(n) need to be checked. But to make the assembly version simpler I decided not to keep track of the primes that have been found. I'm more interested in a comparison of the languages than making it as fast as possible. Since the C# version uses the same algorithm I don't understand how it can be faster.

Oh ya, I forgot 1 isn't prime.

##### Share on other sites
rip-off    10979
Quote:
 Original post by Procyon LotorYa, I realize that only the primes below sqrt(n) need to be checked. But to make the assembly version simpler I decided not to keep track of the primes that have been found. I'm more interested in a comparison of the languages than making it as fast as possible. Since the C# version uses the same algorithm I don't understand how it can be faster.Oh ya, I forgot 1 isn't prime.

The c# code will be compiled and optimised on the fly. The actual code that will be executed will likely not look like yours.

Langugage comparisons like this ( i.e. a very narrow comparision ) are meaningless, they don't accurately represent how the languages will compete in a complex system.

##### Share on other sites
Zahlman    1682
0) 1 is not normally counted as a prime. I will not count it below.

1) Speed improvements are more likely to be found by improving the algorithm. In your case you can get rid of all the FP operations by doing it the other way around: gradually increment a limit value. You could also try making it only use primes that have been found so far. I will implement that below, but no guarantees that it will actually help (while you test against fewer primes, you add the overhead of tracking which primes to test against).

I will just show a function to test for primes up to a limit; I/O and timing should be kept separate and naturally aren't intended to be part of what is benchmarked.

#include <vector>using std::vector;int primesUpTo(int max) {  if (max < 2) return 0;  // Find primes < max (not <=)  vector<int> primes;  int num_primes_in_use = 1;  primes.push_back(2);  int limit = 4;  for (int testvalue = 3; testvalue < max; testvalue += 2) {    if (testvalue > limit) {      // Ran out of numbers that can be tested with the current set; add another      int last_prime_in_use = primes[num_primes_in_use++];      limit = last_prime_in_use * last_prime_in_use;    }    bool isPrime = true;    for (int testprime = 0; testprime < num_primes_in_use; ++testprime) {      if (!(testvalue % primes[testprime])) { isPrime = false; break; }    }    if (isPrime) { primes.push_back(testvalue); }  }  return primes.length();}

##### Share on other sites
Washu    7829
Of course, if an exact answer isn't needed...
#include <iostream>#include <cmath>int main() {    std::size_t upper_bound;    std::cout<<"Enter the upper bound: ";    std::cin>>upperBound;    std::cout<<"There are approximately "<<upperBound / std::log(static_cast<double>(upperBound))<<" primes below "<<upperBound<<std::endl;}

##### Share on other sites
raccoonone    158
Sorry I should have been more clear. I wasn't looking for help on a better algorithm. I was wondering why the C# version is faster than the other two, and if there was some specific C++/ASM optimizations that I missed (maybe floating point instructions are faster than the integer instructions?)

##### Share on other sites
Spoonbender    1258
Quote:
 Original post by Procyon LotorYa, I realize that only the primes below sqrt(n) need to be checked. But to make the assembly version simpler I decided not to keep track of the primes that have been found. I'm more interested in a comparison of the languages than making it as fast as possible. Since the C# version uses the same algorithm I don't understand how it can be faster.

Please, don't waste your time. You can't compare the "speed" of languages.
They all compile to ASM, so it depends entirely on how much the compiler can optimize the code *in your specific case*.
For any other code, you might get different results. Or when a new version of the compiler is released. It just doesn't make sense to test.

But I can easily tell you why your ASM version is slow. Because you're not a compiler. This goes back to my first point. The compiler can optimize. It's pretty hard for a human to keep track of the complexities of a modern CPU. A compiler can do it easily. That is why it pays to use high level languages. Low level is not "fast". it just means it's easier to screw up and slow things down. And that is why it doesn't make sense to talk about the "speed" of a language.

So basically, use whatever language you're most comfortable with, and use the best algorithm you can find. And don't bother searching for the "fastest" language. Please. [wink]

Oh yeah, and if you really want to compare, you should at least use the same algorithm in all three, shouldn't you? So either all versions should use floats/doubles, or none of them should.

##### Share on other sites
mnansgar    421
Try printing out the primes to a file and see if the C# output is equivalent to the C++ and assembly output. It seems that in this case you may be safe, but using the modulus operator with doubles is a recipe for disaster, as is testing whether doubles are exact values (e.g. == 2).

##### Share on other sites
nmi    978
Quote:
 Original post by Procyon LotorI was interested in seeing what the performance different between C++, C# and ASM with a C++ wrapper would be. So far the results have surprised me since C# is the fastest, this must be because my C++/ASM isn't good enough so maybe you guys can point out where I'm going wrong....When I looked at the disassembly for the C# version I found that it's using almost all floating point processer instructions, but then I calls some functions that I couldn't step into (well I'm sure I could if I spent some time figuring out how).

Maybe C# generates SSE instruction for floating point computations ? Could explain why its faster than your asm version.

BTW if you are comparing the speed of compiled programs, you should also mention the flags you used for compilation and the compiler you used (i.e. gcc has an option -mfpmath=sse besides -O3).

##### Share on other sites
raccoonone    158
Quote:
 Original post by SpoonbenderBut I can easily tell you why your ASM version is slow. Because you're not a compiler. This goes back to my first point. The compiler can optimize. It's pretty hard for a human to keep track of the complexities of a modern CPU. A compiler can do it easily. That is why it pays to use high level languages. Low level is not "fast". it just means it's easier to screw up and slow things down. And that is why it doesn't make sense to talk about the "speed" of a language.

low level is fast if you know what you're doing, and while I'm not an expert I know the code I wrote should be quite good (except that it doesn't use any newer instruction sets). My ASM version keeps everything in registers and only goes to memory in one place, because you can't load to the FPU from a register only from memory.
Quote:
 Oh yeah, and if you really want to compare, you should at least use the same algorithm in all three, shouldn't you? So either all versions should use floats/doubles, or none of them should.

I would have, but you can't do modulus division in C++ using floats/doubles. and floating point division in ASM doesn't give you a remainder.

##### Share on other sites
raccoonone    158
Quote:
 Original post by nmiMaybe C# generates SSE instruction for floating point computations ? Could explain why its faster than your asm version.BTW if you are comparing the speed of compiled programs, you should also mention the flags you used for compilation and the compiler you used (i.e. gcc has an option -mfpmath=sse besides -O3).

That could be, I didn't use any SSE instructions, thanks I'll look at those.

I used Visual Studio 2005 to compile all of them. For C# it just has the Optimize Code box checked. For the C++ versions they're built with /O2 /Ot and /GL

Quote:
 Original post by mnansgarTry printing out the primes to a file and see if the C# output is equivalent to the C++ and assembly output. It seems that in this case you may be safe, but using the modulus operator with doubles is a recipe for disaster, as is testing whether doubles are exact values (e.g. == 2).

I checked the C# version against the C++ version up to 101, because that was the first thing I thought of is that it must be skipping some numbers. But it's not making any mistakes, and on longer checks (find all the primes up to 10,000,000 for example) it gives the same answer.

##### Share on other sites
Spoonbender    1258
Quote:
 Original post by Procyon Lotorlow level is fast if you know what you're doing, and while I'm not an expert I know the code I wrote should be quite good (except that it doesn't use any newer instruction sets). My ASM version keeps everything in registers and only goes to memory in one place, because you can't load to the FPU from a register only from memory.

So you're saying that, apart from the fact that your code should be good, but is *still* outperformed by C#, ASM is faster?
You have completely and totally missed my point. Congratulations. You have in fact managed to disprove your own claim, if you hadn't noticed. You make a benchmark showing C# to be faster, and then you argue that it's not? Good show.

Let me try again. Both C# and C++ compile to ASM. That's asm just like the stuff you've written. That means there is *no* magic overhead that makes it slower than a hand-written program. There *might* be, yes, or there might not. There isn't a "Run C# program" instruction that suddenly lowers the CPU speed by 15%. A well written C# compiled with a good compiler will generate *the exact* same ASM code as the best ASM programmer will.
Low level is not faster, it just gives you more fine-grained control. That means, yes, you can occasionally avoid doing things the compiler would do, which means you can save a few cycles. But it also means you have to do *all* your optimizations yourself. And there are a lot the compiler can do, but which are practically impossible to keep track of yourself.
So in the best case, you'll get the *exact* same performance from all three versions. But it's a lot easier to get *near* that performance in higher level languages.

Quote:
Quote:
 Oh yeah, and if you really want to compare, you should at least use the same algorithm in all three, shouldn't you? So either all versions should use floats/doubles, or none of them should.

I would have, but you can't do modulus division in C++ using floats/doubles. and floating point division in ASM doesn't give you a remainder.

So you think C# can do things that can't be done in ASM? Again, C# code compiles *to* ASM. If C# can do it, so can ASM, because otherwise, C# would not be able to do it.

##### Share on other sites
raccoonone    158
Quote:
Original post by Spoonbender
Quote:
 Original post by Procyon Lotorlow level is fast if you know what you're doing, and while I'm not an expert I know the code I wrote should be quite good (except that it doesn't use any newer instruction sets). My ASM version keeps everything in registers and only goes to memory in one place, because you can't load to the FPU from a register only from memory.

So you're saying that, apart from the fact that your code should be good, but is *still* outperformed by C#, ASM is faster?
You have completely and totally missed my point. Congratulations. You have in fact managed to disprove your own claim, if you hadn't noticed. You make a benchmark showing C# to be faster, and then you argue that it's not? Good show.

Did you read my post? I said "I'm not an expert", and "it doesn't use any newer instruction sets".
and in my original post "this must be because my C++/ASM isn't good enough so maybe you guys can point out where I'm going wrong."

I was looking for advice on why my C++/ASM code wasn't performing well, not an argument about which language is faster.

##### Share on other sites
Quote:
 A well written C# compiled with a good compiler will generate *the exact* same ASM code as the best ASM programmer will.

*tired laugh*
Ah, an invocation of the fabled "any (good|modern) compiler". I give you this quote: "In theory, there is no difference between theory and practice. But, in practice, there is.".

As to the asm code: there is considerable room for improvement. On the non-algorithmic level, probably the biggest issue is branches. You have some questionable constructs:

jnz Next2// so continuing checking for primality
jmp CheckDone
> use jz instead for smaller code size and better static prediction; your code is only necessary if Next2 is out of range of a short jump.

ja Next
inc PrimesFound
jmp CheckDone
> same issue. try to organize code in a straight line, but at least this branch is well-predicted.

Also:

OR edx, edx
> prefer TEST instead for less pipeline dependencies.

loop Check
> as a general rule, never use LOOP or other CISC-style instructions. dec/jnz is much faster.

##### Share on other sites
Spoonbender    1258
Quote:
 Original post by Procyon LotorDid you read my post? I said "I'm not an expert", and "it doesn't use any newer instruction sets".

I'm 99.9% sure the C# version doesn't use SSE instructions either. First, compilers just aren't smart enough to do that, and second, there's just nothing to be gained by it in your case.
Edit: Ok, they might use the scalar versions of SSE instructions, but that won't offer any big performance boost.
Edit2: Ok, just looked some of the latencies up. The SSE version of sqrt is actually a good deal faster than the x87 one. But the rest are pretty much the same. Try compiling your C++ program to use SSE, and see what happens.

Quote:
 I was looking for advice on why my C++/ASM code wasn't performing well, not an argument about which language is faster.

Well, you did start out by saying "I was interested in seeing what the performance different between C++, C# and ASM with a C++ wrapper would be".
Sounds like you were interested in which language was faster.
My point is simply that there's nothing too surprising in C# being fast.

But to answer your question, the main reason your ASM version isn't performing very well is because ASM is damn complex. You not only have to write the code, you also have to take latency and register pressure and other factors into account. You need to reorder your instructions to hide the latency. The CPU can read up to 3 instructions per cycle, and a lot of operations have several cycles latency. (Division and square root in particular are horrible). You have to reorder your instructions so that the instructions aren't dependant on the previous ones (especially if the previous ones had high latency). And branches are an entire research topic on their own. You want to avoid control hazards that stall the pipeline, and again, the order of instructions plays a huge role here as well.
So basically, don't bother with the ASM version. It's not faster, and doesn't even have the potential to be faster, but it takes *a lot* more work.

The C++ version is more interesting. Not sure why that is slower (Although the type conversions as you mentioned might play a role. That, or the compiler just isn't as good at optimizing this particular program.

[Edited by - Spoonbender on May 18, 2006 8:50:29 PM]

##### Share on other sites
Spoonbender    1258
Quote:
Original post by Jan Wassenberg
Quote:
 A well written C# compiled with a good compiler will generate *the exact* same ASM code as the best ASM programmer will.

*tired laugh*
Ah, an invocation of the fabled "any (good|modern) compiler". I give you this quote: "In theory, there is no difference between theory and practice. But, in practice, there is.".

True. But in practice, a C# compiler will generate much better code than *most* ASM programmers. I was just trying to say that there isn't some native, unavoidable overhead when programming in high level languages. It depends on the compilers ability to optimize. I'm well aware there are no compilers that are able to generate optimal code for everything (or anything, probably).
but that doesn't make claims that "ASM is just faster" any more true.

##### Share on other sites
raccoonone    158
Quote:
Original post by Jan Wassenberg
Quote:
 A well written C# compiled with a good compiler will generate *the exact* same ASM code as the best ASM programmer will.

*tired laugh*
Ah, an invocation of the fabled "any (good|modern) compiler". I give you this quote: "In theory, there is no difference between theory and practice. But, in practice, there is.".

As to the asm code: there is considerable room for improvement. On the non-algorithmic level, probably the biggest issue is branches. You have some questionable constructs:

jnz Next2// so continuing checking for primality
jmp CheckDone
> use jz instead for smaller code size and better static prediction; your code is only necessary if Next2 is out of range of a short jump.

ja Next
inc PrimesFound
jmp CheckDone
> same issue. try to organize code in a straight line, but at least this branch is well-predicted.

Also:

OR edx, edx
> prefer TEST instead for less pipeline dependencies.

loop Check
> as a general rule, never use LOOP or other CISC-style instructions. dec/jnz is much faster.

Thanks! Exactly the kind of help I was looking for. I had no idea dec&jnz was faster than loop, doesn't loop do exactly the same thing but in one instruction?

##### Share on other sites
sjelkjd    171
Quote:
 Original post by Procyon LotorThanks! Exactly the kind of help I was looking for. I had no idea dec&jnz was faster than loop, doesn't loop do exactly the same thing but in one instruction?

Welcome to the x86 ISA, where different instructions have different latencies and many will compile to multiple micro-code instructions.

##### Share on other sites
Dmytry    1151
BTW, if you actually print all primes, it will be IO limited.
edit: as about assembler, there is certain overhead in use of C++ or C# in cases when compiler can't optimize code well (which is common), but this is not going to happen with such algorithm(even though it is backwards), especially that most of time is probably spent doing division anyway.
As about C# using doubles, did you check you're getting same number of primes ?
(Also, doubles often works faster.)

##### Share on other sites
Rockoon1    104
First, thats one great quote given by Jan Wassenberg:

"In theory, there is no difference between theory and practice. But, in practice, there is."

The SSE sqrt instruction is faster only because its a poor approximation and as such a compiler will (should) not be emitting it without being told explicitely to do so. We are talking about a severe difference in accuracy. The instruction is meant for cases where approximations are reasonable (such as normalizing a vector to a length "near" 1.0)

Scaler SSE is typically slower mainly because the instructions are so damn big relative to FPU code. Its not uncommon for a scaler SSE version of a routine to be twice as large (in bytes) as an FPU version, even though scaler SSE code doesnt require any stack management while FPU code does. There is a lot of pressure on the instruction prefetch and decode logic of todays processors when they execute the fastest SSE instructions. One of the cases where scaler SSE code is usualy prefered is when the FPU is already in MMX mode and switching back and forth will be too expensive. The other (obviously) is when you arent using scalers, but instead are performing SIMD work.

ASM optimisations are still fairly simple because processors are still fairly simple. You just need to know about it. Lack of knowledge will obviously prevent you from writing efficient ASM. Knowledge is power. The amount of knowledge needed wont even fill a smaller than average sized book. Not even close. Most of it is common sense based on architecture understanding, rather than special case memorization. It is important to understand/memorize which instructions have been sacrificed for the sake of the efficiency of others (such as: loop, xlat, and so on) but quite honestly, the list of these special "memorization required" cases is quite small.

Floating point division doesnt give a remainder in ANY language discussed here.

No... a "well written C# compiled with a good compiler" will NOT produce the exact same code as the best asm programmer unless by "good compiler" you mean the magical ideal one that doesnt actualy exist.

Other notes:

That hand written ASM code is fairly bad but I guess we all agree on that now.

Compilers are very very good these days... especialy the latest .NET JIT... and it still has lots of room to mature... we should all be very excited about that.

If you program in ASM with an HLL mindset then you are not going to make any big strides verses the .NET JIT although you can certainly make small gains. Keep in mind that any strides you make might not apply to the "other" processor (Intel vs AMD)

The folks who are fluent in optimisation understand WHY you should drop down to ASM. Its for ALGORITHMIC changes that simply CANNOT be accomplished in an HLL. Thats right, HLL's do not offer operations with the same functionality as x86 assembler instructions and as such, some instruction streams are all but IMPOSSIBLE to generate via any HLL that doesnt have inline assembler. No matter how good the compiler is and no matter how craftily you arrange your code, you cannot coerce all sequences of well optimised assembler sequences out of any traditional HLL.

You don't have to take my word for it..

##### Share on other sites
Quote:
 So basically, don't bother with the ASM version. It's not faster, and doesn't even have the potential to be faster, but it takes *a lot* more work.

hm, I read this and think: argh! But then you say:
Quote:
 True. But in practice, a C# compiler will generate much better code than *most* ASM programmers. [..] I'm well aware there are no compilers that are able to generate optimal code for everything (or anything, probably).

which is more like it :)

Quote:
 Thanks! Exactly the kind of help I was looking for. I had no idea dec&jnz was faster than loop, doesn't loop do exactly the same thing but in one instruction?

Glad to. In order to really do well though, you need a good understanding of the microarchitecture, which cannot just be passed on as hints.
About LOOP vs. DEC/JNZ: the former is implemented via microcode (small register-transfer "programs" in the CPU itself, not actual hardware in the normal sense), which is much slower. This is slower to decode and even execute, so definitely avoid all such instructions.

##### Share on other sites
raccoonone    158
Wow, there's a big difference between LOOP and DEC/JNZ. Changing that one instruction gave a 5% increase in speed. I cleaned up some of the other ASM also, and it's not about 7% faster than the pure C++ version, it's still a good 10% behind C# but it's closer.

Thanks for the help.

// asm primes.cpp : Defines the entry point for the console application.//#include "stdafx.h"using std::cout;using std::cin;using std::endl;int _tmain(int argc, _TCHAR* argv[]){	int x, PrimesFound = 0;    DWORD start, end;	cout << "Assembly with C++ wrapper" << endl;    cout << "Find primes up to 'x'. Please enter x:";    cin >> x;    start = GetTickCount();	int limit, temp;	_asm 	{		pusha					//Save registers		mov	esi, 1				//Start checking numbers from 1		mov	edi, x				//Loop until X is reached//Check if this number is one, two, or threeForloop:cmp		esi, 3		jbe		Prime		//Check if divisible by two		test	esi, 1						//If ESI is odd it's not divisible by 2		jz		CheckDone					// so continuing checking for primality		mov		temp, esi		fild	temp						//Convert		fsqrt								// to square		fistp	limit						//  root		mov		ebx, 3						//Set EBX to 3		mov		ecx, limit					//Set up the loop counter		shr		ecx, 1						//  devide by 2 since we'll be incrementing by 2Check:	mov		eax, esi					//Restore test number		xor		edx, edx					// and clear remainder		div		ebx							//Divide and		test	edx, edx					// test for		jz		CheckDone					//  zero		inc		ebx							//increment divisor		inc		ebx							// ""		dec		ecx		jnz		Check						//Continue until the square root is reached		//Number is primePrime:	inc		PrimesFoundCheckDone: inc	esi							//Increment ESI so the next number will be checked		dec		edi		jnz		Forloop							// for primality		popa									//Restore registers	}    end = GetTickCount();    cout << "Found " << PrimesFound << " primes in " << end - start;    while(!_kbhit());	return 0;}

##### Share on other sites
Guest Anonymous Poster
I'm sorry that Spoonfeeder considers it necessary to be impolite rather than helpful, so if you are looking for more polite and also more competent support concerning performance improvements of your assembly program, you are invited over to the forum at http://www.asmcommunity.net/board/ where people will happily disprove Spoonfeeder's point about compilers being able to outperform humans ;-)
If you have a 3D graphics card, we can even provide a program that runs on your GPU :-)

Regarding the compiler vs. human debate, make sure to also check out: http://webster.cs.ucr.edu/Articles/GreatDebate/index.html

##### Share on other sites
PRS102076    122
Is the C++ version of your program even using floating point operations at all? If not, you are putting your ASM program at a disadvantage.
Also, instead of repeatedly dumping and loading an integer into the FPU, and relying on the FPU to convert to and from integer, why not load a 1.0 outside of the main loop, and add that value to your working value each time?
You may alo find that your pusha and popa instructions would be better replaced by a series of specific push and pop instructions.
You can also replace your two INC EBX instructions with an ADD EBX,2.
Although a compiler cant outperform well written ASM code, if the compiler has itself been written by a good ASM programmer, it will often produce better code than a beginner assembler programmer.