• Create Account

Awesome job so far everyone! Please give us your feedback on how our article efforts are going. We still need more finished articles for our May contest theme: Remake the Classics

# Assembly vs. C++ vs. C# finding primes

Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

42 replies to this topic

### #1raccoonone  Members   -  Reputation: 154

Like
0Likes
Like

Posted 18 May 2006 - 11:27 AM

I was interested in seeing what the performance different between C++, C# and ASM with a C++ wrapper would be. So far the results have surprised me since C# is the fastest, this must be because my C++/ASM isn't good enough so maybe you guys can point out where I'm going wrong. The program I'm testing with is quite simple all it does is: a) get a number from the user b) calculate all the primes between 1 and the number c) output the number of primes found and the time taken to find them. All 3 programs use the same algorithm: if x is 1, 2, or 3 return true if x devided by 2 has a remainder of zero return false otherwise devide by all odd numbers starting with 3 until the square root of x is reached or the remainder is zero (indicating a composite number) I compiled all of the programs with full optimizations and ran them on my computer (AthlonXP 2800+ with 1GB of RAM). The C++ version and my ASM version both took 19.5-20secs to find all the primes between 1 and 10,000,000. The C# version took 20-20.5secs when I was using ints, but after I changed it to use doubles to reduce the ammount of type conversions, it does it in 16-17secs! When I looked at the disassembly for the C# version I found that it's using almost all floating point processer instructions, but then I calls some functions that I couldn't step into (well I'm sure I could if I spent some time figuring out how). Here's a short part of the disassembly. I haven't used floating point instructions much and I don't really understand what it's doing in the beginning there. It looks to me like it's just moving data through the ALU by pushing and then poping it somewhere else.
```                if (TestNumber % x == 0)
0000008c  fld         qword ptr [ebp+8]
0000008f  sub         esp,8
00000092  fstp        qword ptr [esp]
00000095  fld         qword ptr [esp+10h]
00000099  sub         esp,8
0000009c  fstp        qword ptr [esp]
0000009f  call        79235954
000000a4  fldz
000000a6  fcomip      st,st(1)
000000a8  fstp        st(0)
000000aa  jp          000000B9
000000ac  jne         000000B9

```
Here's the source of my C# version
```using System;
using System.Collections.Generic;
using System.Text;

namespace Csharp_Test_console_application
{
class Program
{
static void Main(string[] args)
{
double x;
int PrimesFound = 0;
DateTime start, end;
Console.WriteLine("Find primes up to 'x'. Please enter x:");
start = System.DateTime.Now;
for (double y = 1; y <= x; y++)
{
if (IsPrime(y))
{
PrimesFound++;
}
}
end = System.DateTime.Now;
Console.WriteLine("Found {0} primes in {1}", PrimesFound, end - start);
}

static bool IsPrime(double TestNumber)
{
double limit = System.Math.Sqrt(TestNumber) + 1;
if (TestNumber == 2)
return true;
if(TestNumber % 2 == 0) //Check if TestNumber is even
return false;
double x = 3;
//Devide until the square root is reached
for (; x < limit; x += 2)
{
if (TestNumber % x == 0)
return false;
}
return true;
}
}
}

```
Source of my C++ version
```// Cpp primes.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
using std::cin;
using std::cout;
bool IsPrime(int TestNumber);

int _tmain(int argc, _TCHAR* argv[])
{
int x, PrimesFound = 0;
DWORD start, end;
cout << "Find primes up to 'x'. Please enter x:";
cin >> x;
start = GetTickCount();
for (int y = 1; y <= x; y++)
{
if (IsPrime(y))
{
PrimesFound++;
}
}
end = GetTickCount();
cout << "Found " << PrimesFound << " primes in " << end - start;
while(!_kbhit());
return 0;
}

_inline bool IsPrime(int TestNumber)
{
int limit = (int)sqrt((double)TestNumber) + 1;
if (TestNumber == 2)
return true;
if(TestNumber % 2 == 0) //Check if TestNumber is even
return false;
int x = 3;
//Devide until the square root is reached
for (; x < limit; x += 2)
{
if (TestNumber % x == 0)
return false;
}
return true;
}

```
Assembly with C++ for the input and output
```// asm primes.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

using std::cout;
using std::cin;

int _tmain(int argc, _TCHAR* argv[])
{
int x, PrimesFound = 0;
DWORD start, end;
cout << "Find primes up to 'x'. Please enter x:";
cin >> x;
start = GetTickCount();
int limit, temp;
_asm
{
pusha					//Save registers
mov	esi, 1				//Start checking numbers from 1
mov	edi, x				//Loop until X is reached
Forloop: cmp	esi, edi
ja		Done			//Exit if X has been reached

//Check if this number is one, two, or three
cmp		esi, 3
ja		Next
inc		PrimesFound
jmp		CheckDone
//Check if divisible by two
Next:	test	esi, 1						//If ESI is odd it's not divisible by 2
jnz		Next2						// so continuing checking for primality
jmp		CheckDone

Next2:	mov		temp, esi
fild	temp						//Convert
fsqrt								// to square
fistp	limit						//  root

mov		ebx, 3						//Set EBX to 3

mov		ecx, limit					//Set up the loop counter
shr		ecx, 1						//  devide by 2 since we'll be incrementing by 2

Check:	mov		eax, esi					//Restore test number
xor		edx, edx					// and clear remainder

div		ebx							//Divide and
or		edx, edx					// test for
jz		CheckDone					//  zero

inc		ebx							//increment divisor
inc		ebx							// ""
loop	Check						//Continue until the square root is reached

//Number is prime
inc		PrimesFound
CheckDone: inc	esi							//Increment ESI so the next number will be checked
jmp	Forloop							// for primality

Done: popa									//Restore registers
}
end = GetTickCount();
cout << "Found " << PrimesFound << " primes in " << end - start;
while(!_kbhit());
return 0;
}

```
stdafx.h for C++ versions
```// stdafx.h : include file for standard system include files,
// or project specific include files that are used frequently, but
// are changed infrequently
//

#pragma once

#define WIN32_LEAN_AND_MEAN		// Exclude rarely-used stuff from Windows headers
#include <stdio.h>
#include <tchar.h>
#include <math.h>
#include <windows.h>
#include <conio.h>
#include <iostream>

```

### #2Washu  Senior Moderators   -  Reputation: 3114

Like
0Likes
Like

Posted 18 May 2006 - 11:42 AM

None of your methods are particularily optimized at all. From number theory, you can show that you do not need to test against every number below the sqrt(n), but just the primes below the sqrt(n) to see if n is prime. As such, I would recommend changing your algorithm to remember each prime it finds.
Quote:
 All 3 programs use the same algorithm: if x is 1, 2, or 3 return true

1 is not prime.
A simple sieve i wrote a bit back for a similar post.
```
#include <iostream>
#include <deque>
#include <cmath>

void FindPrimes(std::deque<std::size_t>& primes, std::size_t upperBound) {
std::size_t current = 15;

while(current < upperBound) {
current += 2;

bool isPrime = true;
std::size_t mod6 = current % 6;

//Every prime number p > 3 has the form 6q + 1 or 6q + 5.
if(!(mod6 == 1 || mod6 == 5)) {
continue;
}

std::size_t maxVal = static_cast<std::size_t>(std::ceil(std::sqrt(static_cast<double>(current))));

for(int i = 0; primes[i] <= maxVal; ++i) {
if(current % primes[i] == 0) {
isPrime = false;
break;
}
}

if(isPrime) {
primes.push_back(current);
}
}
}

int main() {
std::size_t knownPrimes[6] = {2, 3, 5, 7, 11, 13}; //seed the list.
std::deque<std::size_t> primes(&knownPrimes[0], &knownPrimes[6]);
std::size_t upperBound;

std::cout<<"Please enter the upper bound: ";
std::cin>> upperBound;

FindPrimes(primes, upperBound);

std::copy(primes.begin(), primes.end(), std::ostream_iterator<std::size_t>(std::cout, "\n"));
}

```

### #3raccoonone  Members   -  Reputation: 154

Like
0Likes
Like

Posted 18 May 2006 - 11:50 AM

Ya, I realize that only the primes below sqrt(n) need to be checked. But to make the assembly version simpler I decided not to keep track of the primes that have been found. I'm more interested in a comparison of the languages than making it as fast as possible. Since the C# version uses the same algorithm I don't understand how it can be faster.

Oh ya, I forgot 1 isn't prime.

### #4rip-off  Moderators   -  Reputation: 5061

Like
0Likes
Like

Posted 18 May 2006 - 11:56 AM

Quote:
 Original post by Procyon LotorYa, I realize that only the primes below sqrt(n) need to be checked. But to make the assembly version simpler I decided not to keep track of the primes that have been found. I'm more interested in a comparison of the languages than making it as fast as possible. Since the C# version uses the same algorithm I don't understand how it can be faster.Oh ya, I forgot 1 isn't prime.

The c# code will be compiled and optimised on the fly. The actual code that will be executed will likely not look like yours.

Langugage comparisons like this ( i.e. a very narrow comparision ) are meaningless, they don't accurately represent how the languages will compete in a complex system.

### #5Zahlman  Moderators   -  Reputation: 1666

Like
0Likes
Like

Posted 18 May 2006 - 12:06 PM

0) 1 is not normally counted as a prime. I will not count it below.

1) Speed improvements are more likely to be found by improving the algorithm. In your case you can get rid of all the FP operations by doing it the other way around: gradually increment a limit value. You could also try making it only use primes that have been found so far. I will implement that below, but no guarantees that it will actually help (while you test against fewer primes, you add the overhead of tracking which primes to test against).

I will just show a function to test for primes up to a limit; I/O and timing should be kept separate and naturally aren't intended to be part of what is benchmarked.

`#include <vector>using std::vector;int primesUpTo(int max) {  if (max < 2) return 0;  // Find primes < max (not <=)  vector<int> primes;  int num_primes_in_use = 1;  primes.push_back(2);  int limit = 4;  for (int testvalue = 3; testvalue < max; testvalue += 2) {    if (testvalue > limit) {      // Ran out of numbers that can be tested with the current set; add another      int last_prime_in_use = primes[num_primes_in_use++];      limit = last_prime_in_use * last_prime_in_use;    }    bool isPrime = true;    for (int testprime = 0; testprime < num_primes_in_use; ++testprime) {      if (!(testvalue % primes[testprime])) { isPrime = false; break; }    }    if (isPrime) { primes.push_back(testvalue); }  }  return primes.length();}`

### #6Washu  Senior Moderators   -  Reputation: 3114

Like
0Likes
Like

Posted 18 May 2006 - 12:12 PM

Of course, if an exact answer isn't needed...
`#include <iostream>#include <cmath>int main() {    std::size_t upper_bound;    std::cout<<"Enter the upper bound: ";    std::cin>>upperBound;    std::cout<<"There are approximately "<<upperBound / std::log(static_cast<double>(upperBound))<<" primes below "<<upperBound<<std::endl;}`

### #7raccoonone  Members   -  Reputation: 154

Like
0Likes
Like

Posted 18 May 2006 - 12:27 PM

Sorry I should have been more clear. I wasn't looking for help on a better algorithm. I was wondering why the C# version is faster than the other two, and if there was some specific C++/ASM optimizations that I missed (maybe floating point instructions are faster than the integer instructions?)

### #8Spoonbender  Members   -  Reputation: 1250

Like
0Likes
Like

Posted 18 May 2006 - 12:46 PM

Quote:
 Original post by Procyon LotorYa, I realize that only the primes below sqrt(n) need to be checked. But to make the assembly version simpler I decided not to keep track of the primes that have been found. I'm more interested in a comparison of the languages than making it as fast as possible. Since the C# version uses the same algorithm I don't understand how it can be faster.

Please, don't waste your time. You can't compare the "speed" of languages.
They all compile to ASM, so it depends entirely on how much the compiler can optimize the code *in your specific case*.
For any other code, you might get different results. Or when a new version of the compiler is released. It just doesn't make sense to test.

But I can easily tell you why your ASM version is slow. Because you're not a compiler. This goes back to my first point. The compiler can optimize. It's pretty hard for a human to keep track of the complexities of a modern CPU. A compiler can do it easily. That is why it pays to use high level languages. Low level is not "fast". it just means it's easier to screw up and slow things down. And that is why it doesn't make sense to talk about the "speed" of a language.

So basically, use whatever language you're most comfortable with, and use the best algorithm you can find. And don't bother searching for the "fastest" language. Please. [wink]

Oh yeah, and if you really want to compare, you should at least use the same algorithm in all three, shouldn't you? So either all versions should use floats/doubles, or none of them should.

### #9mnansgar  Members   -  Reputation: 421

Like
0Likes
Like

Posted 18 May 2006 - 12:49 PM

Try printing out the primes to a file and see if the C# output is equivalent to the C++ and assembly output. It seems that in this case you may be safe, but using the modulus operator with doubles is a recipe for disaster, as is testing whether doubles are exact values (e.g. == 2).

### #10nmi  Members   -  Reputation: 978

Like
0Likes
Like

Posted 18 May 2006 - 12:58 PM

Quote:
 Original post by Procyon LotorI was interested in seeing what the performance different between C++, C# and ASM with a C++ wrapper would be. So far the results have surprised me since C# is the fastest, this must be because my C++/ASM isn't good enough so maybe you guys can point out where I'm going wrong....When I looked at the disassembly for the C# version I found that it's using almost all floating point processer instructions, but then I calls some functions that I couldn't step into (well I'm sure I could if I spent some time figuring out how).

Maybe C# generates SSE instruction for floating point computations ? Could explain why its faster than your asm version.

BTW if you are comparing the speed of compiled programs, you should also mention the flags you used for compilation and the compiler you used (i.e. gcc has an option -mfpmath=sse besides -O3).

### #11raccoonone  Members   -  Reputation: 154

Like
0Likes
Like

Posted 18 May 2006 - 01:02 PM

Quote:
 Original post by SpoonbenderBut I can easily tell you why your ASM version is slow. Because you're not a compiler. This goes back to my first point. The compiler can optimize. It's pretty hard for a human to keep track of the complexities of a modern CPU. A compiler can do it easily. That is why it pays to use high level languages. Low level is not "fast". it just means it's easier to screw up and slow things down. And that is why it doesn't make sense to talk about the "speed" of a language.

low level is fast if you know what you're doing, and while I'm not an expert I know the code I wrote should be quite good (except that it doesn't use any newer instruction sets). My ASM version keeps everything in registers and only goes to memory in one place, because you can't load to the FPU from a register only from memory.
Quote:
 Oh yeah, and if you really want to compare, you should at least use the same algorithm in all three, shouldn't you? So either all versions should use floats/doubles, or none of them should.

I would have, but you can't do modulus division in C++ using floats/doubles. and floating point division in ASM doesn't give you a remainder.

### #12raccoonone  Members   -  Reputation: 154

Like
0Likes
Like

Posted 18 May 2006 - 01:08 PM

Quote:
 Original post by nmiMaybe C# generates SSE instruction for floating point computations ? Could explain why its faster than your asm version.BTW if you are comparing the speed of compiled programs, you should also mention the flags you used for compilation and the compiler you used (i.e. gcc has an option -mfpmath=sse besides -O3).

That could be, I didn't use any SSE instructions, thanks I'll look at those.

I used Visual Studio 2005 to compile all of them. For C# it just has the Optimize Code box checked. For the C++ versions they're built with /O2 /Ot and /GL

Quote:
 Original post by mnansgarTry printing out the primes to a file and see if the C# output is equivalent to the C++ and assembly output. It seems that in this case you may be safe, but using the modulus operator with doubles is a recipe for disaster, as is testing whether doubles are exact values (e.g. == 2).

I checked the C# version against the C++ version up to 101, because that was the first thing I thought of is that it must be skipping some numbers. But it's not making any mistakes, and on longer checks (find all the primes up to 10,000,000 for example) it gives the same answer.

### #13Spoonbender  Members   -  Reputation: 1250

Like
0Likes
Like

Posted 18 May 2006 - 01:18 PM

Quote:
 Original post by Procyon Lotorlow level is fast if you know what you're doing, and while I'm not an expert I know the code I wrote should be quite good (except that it doesn't use any newer instruction sets). My ASM version keeps everything in registers and only goes to memory in one place, because you can't load to the FPU from a register only from memory.

So you're saying that, apart from the fact that your code should be good, but is *still* outperformed by C#, ASM is faster?
You have completely and totally missed my point. Congratulations. You have in fact managed to disprove your own claim, if you hadn't noticed. You make a benchmark showing C# to be faster, and then you argue that it's not? Good show.

Let me try again. Both C# and C++ compile to ASM. That's asm just like the stuff you've written. That means there is *no* magic overhead that makes it slower than a hand-written program. There *might* be, yes, or there might not. There isn't a "Run C# program" instruction that suddenly lowers the CPU speed by 15%. A well written C# compiled with a good compiler will generate *the exact* same ASM code as the best ASM programmer will.
Low level is not faster, it just gives you more fine-grained control. That means, yes, you can occasionally avoid doing things the compiler would do, which means you can save a few cycles. But it also means you have to do *all* your optimizations yourself. And there are a lot the compiler can do, but which are practically impossible to keep track of yourself.
So in the best case, you'll get the *exact* same performance from all three versions. But it's a lot easier to get *near* that performance in higher level languages.

Quote:
Quote:
 Oh yeah, and if you really want to compare, you should at least use the same algorithm in all three, shouldn't you? So either all versions should use floats/doubles, or none of them should.

I would have, but you can't do modulus division in C++ using floats/doubles. and floating point division in ASM doesn't give you a remainder.

So you think C# can do things that can't be done in ASM? Again, C# code compiles *to* ASM. If C# can do it, so can ASM, because otherwise, C# would not be able to do it.

### #14raccoonone  Members   -  Reputation: 154

Like
0Likes
Like

Posted 18 May 2006 - 01:34 PM

Quote:
Original post by Spoonbender
Quote:
 Original post by Procyon Lotorlow level is fast if you know what you're doing, and while I'm not an expert I know the code I wrote should be quite good (except that it doesn't use any newer instruction sets). My ASM version keeps everything in registers and only goes to memory in one place, because you can't load to the FPU from a register only from memory.

So you're saying that, apart from the fact that your code should be good, but is *still* outperformed by C#, ASM is faster?
You have completely and totally missed my point. Congratulations. You have in fact managed to disprove your own claim, if you hadn't noticed. You make a benchmark showing C# to be faster, and then you argue that it's not? Good show.

Did you read my post? I said "I'm not an expert", and "it doesn't use any newer instruction sets".
and in my original post "this must be because my C++/ASM isn't good enough so maybe you guys can point out where I'm going wrong."

I was looking for advice on why my C++/ASM code wasn't performing well, not an argument about which language is faster.

### #15Jan Wassenberg  Members   -  Reputation: 982

Like
0Likes
Like

Posted 18 May 2006 - 01:43 PM

Quote:
 A well written C# compiled with a good compiler will generate *the exact* same ASM code as the best ASM programmer will.

*tired laugh*
Ah, an invocation of the fabled "any (good|modern) compiler". I give you this quote: "In theory, there is no difference between theory and practice. But, in practice, there is.".

As to the asm code: there is considerable room for improvement. On the non-algorithmic level, probably the biggest issue is branches. You have some questionable constructs:

jnz Next2// so continuing checking for primality
jmp CheckDone
> use jz instead for smaller code size and better static prediction; your code is only necessary if Next2 is out of range of a short jump.

ja Next
inc PrimesFound
jmp CheckDone
> same issue. try to organize code in a straight line, but at least this branch is well-predicted.

Also:

OR edx, edx
> prefer TEST instead for less pipeline dependencies.

loop Check
> as a general rule, never use LOOP or other CISC-style instructions. dec/jnz is much faster.

### #16Spoonbender  Members   -  Reputation: 1250

Like
0Likes
Like

Posted 18 May 2006 - 01:50 PM

Quote:
 Original post by Procyon LotorDid you read my post? I said "I'm not an expert", and "it doesn't use any newer instruction sets".

I'm 99.9% sure the C# version doesn't use SSE instructions either. First, compilers just aren't smart enough to do that, and second, there's just nothing to be gained by it in your case.
Edit: Ok, they might use the scalar versions of SSE instructions, but that won't offer any big performance boost.
Edit2: Ok, just looked some of the latencies up. The SSE version of sqrt is actually a good deal faster than the x87 one. But the rest are pretty much the same. Try compiling your C++ program to use SSE, and see what happens.

Quote:
 I was looking for advice on why my C++/ASM code wasn't performing well, not an argument about which language is faster.

Well, you did start out by saying "I was interested in seeing what the performance different between C++, C# and ASM with a C++ wrapper would be".
Sounds like you were interested in which language was faster.
My point is simply that there's nothing too surprising in C# being fast.

But to answer your question, the main reason your ASM version isn't performing very well is because ASM is damn complex. You not only have to write the code, you also have to take latency and register pressure and other factors into account. You need to reorder your instructions to hide the latency. The CPU can read up to 3 instructions per cycle, and a lot of operations have several cycles latency. (Division and square root in particular are horrible). You have to reorder your instructions so that the instructions aren't dependant on the previous ones (especially if the previous ones had high latency). And branches are an entire research topic on their own. You want to avoid control hazards that stall the pipeline, and again, the order of instructions plays a huge role here as well.
So basically, don't bother with the ASM version. It's not faster, and doesn't even have the potential to be faster, but it takes *a lot* more work.

The C++ version is more interesting. Not sure why that is slower (Although the type conversions as you mentioned might play a role. That, or the compiler just isn't as good at optimizing this particular program.

[Edited by - Spoonbender on May 18, 2006 8:50:29 PM]

### #17Spoonbender  Members   -  Reputation: 1250

Like
0Likes
Like

Posted 18 May 2006 - 01:53 PM

Quote:
Original post by Jan Wassenberg
Quote:
 A well written C# compiled with a good compiler will generate *the exact* same ASM code as the best ASM programmer will.

*tired laugh*
Ah, an invocation of the fabled "any (good|modern) compiler". I give you this quote: "In theory, there is no difference between theory and practice. But, in practice, there is.".

True. But in practice, a C# compiler will generate much better code than *most* ASM programmers. I was just trying to say that there isn't some native, unavoidable overhead when programming in high level languages. It depends on the compilers ability to optimize. I'm well aware there are no compilers that are able to generate optimal code for everything (or anything, probably).
but that doesn't make claims that "ASM is just faster" any more true.

### #18raccoonone  Members   -  Reputation: 154

Like
0Likes
Like

Posted 18 May 2006 - 02:09 PM

Quote:
Original post by Jan Wassenberg
Quote:
 A well written C# compiled with a good compiler will generate *the exact* same ASM code as the best ASM programmer will.

*tired laugh*
Ah, an invocation of the fabled "any (good|modern) compiler". I give you this quote: "In theory, there is no difference between theory and practice. But, in practice, there is.".

As to the asm code: there is considerable room for improvement. On the non-algorithmic level, probably the biggest issue is branches. You have some questionable constructs:

jnz Next2// so continuing checking for primality
jmp CheckDone
> use jz instead for smaller code size and better static prediction; your code is only necessary if Next2 is out of range of a short jump.

ja Next
inc PrimesFound
jmp CheckDone
> same issue. try to organize code in a straight line, but at least this branch is well-predicted.

Also:

OR edx, edx
> prefer TEST instead for less pipeline dependencies.

loop Check
> as a general rule, never use LOOP or other CISC-style instructions. dec/jnz is much faster.

Thanks! Exactly the kind of help I was looking for. I had no idea dec&jnz was faster than loop, doesn't loop do exactly the same thing but in one instruction?

### #19sjelkjd  Members   -  Reputation: 168

Like
0Likes
Like

Posted 18 May 2006 - 03:09 PM

Quote:
 Original post by Procyon LotorThanks! Exactly the kind of help I was looking for. I had no idea dec&jnz was faster than loop, doesn't loop do exactly the same thing but in one instruction?

Welcome to the x86 ISA, where different instructions have different latencies and many will compile to multiple micro-code instructions.

### #20Dmytry  Members   -  Reputation: 1140

Like
0Likes
Like

Posted 18 May 2006 - 09:01 PM

BTW, if you actually print all primes, it will be IO limited.
edit: as about assembler, there is certain overhead in use of C++ or C# in cases when compiler can't optimize code well (which is common), but this is not going to happen with such algorithm(even though it is backwards), especially that most of time is probably spent doing division anyway.
As about C# using doubles, did you check you're getting same number of primes ?
(Also, doubles often works faster.)

Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

PARTNERS