My C++ / D / C# benchmark!

70 comments, last by Raghar 17 years, 6 months ago
O'Caml : 61.98 for your version, 51.29 for mine. Also, by disabling array bounds-checking: 59.71 for your version, 50.04 for mine.

My code is given below. The numbers in the comments are execution counts for each line when computing 1000 digits. The modifications relative to your code mainly consisted of removing the unnecessary indirections (references) you added. The original C code doesn't use pointers to variables, so why should we?

let args = Sys.argv

let q =
  if Array.length args = 2
  then int_of_string args.(1) + 1
  else 101

let p = Array.make (q + 1) 0
let t = Array.make (q + 1) 0

let main2 () =
  (* 2 *)
  let add () =
    (* 1486 *)
    for j = q downto 0 do
      if t.(j) + p.(j) > 9
      then (
        p.(j) <- p.(j) + t.(j) - 10;
        p.(j-1) <- p.(j-1) + 1
      )
      else
        p.(j) <- p.(j) + t.(j)
    done
  and sub () =
    (* 1486 *)
    for j = q downto 0 do
      if p.(j) < t.(j)
      then (
        p.(j) <- p.(j) - t.(j) + 10;
        p.(j - 1) <- p.(j - 1) - 1
      )
      else
        p.(j) <- p.(j) - t.(j)
    done
  and mul multiplier =
    (* 2968 *)
    let rec mul i carry =
      (* 2737504 *)
      if i >= 0 then
        let b = t.(i) * multiplier + carry in
        let carry = b / 10 in
        (t.(i) <- b mod 10; mul (i-1) carry)
    in mul q 0
  and div divisor =
    (* 5940 *)
    let rec div i remainder =
      (* 5477220 *)
      if i <= q then
        let b = 10 * remainder + t.(i) in
        let remainder = b mod divisor in
        (t.(i) <- b / divisor; div (i+1) remainder)
    in div 0 0
  and mul4 () =
    (* 2 *)
    let rec mul i c =
      (* 1106 *)
      if i >= 0 then
        let a = p.(i) * 4 + c in
        let d = a mod 10 in
        let c = a / 10 in
        (p.(i) <- d; mul (i-1) c)
    in mul q 0
  and tiszero () =
    (* 2968 *)
    let rec check i =
      (* 1375939 *)
      (i > q) || (t.(i) = 0 && check (i+1))
    in check 0
  in
  let arctan s =
    (* 4 *)
    let sqs = s * s in
    let rec loop n =
      (* 2968 *)
      mul n;
      div sqs;
      let n = n + 2 in
      div n;
      if ((n - 1) / 2) mod 2 = 0 then add () else sub ();
      if not (tiszero ()) then loop n;
    in
    t.(0) <- 1;
    div s;
    add ();
    loop 1
  in
  let show array =
    (* 2 *)
    Array.iter print_int array
  and run () =
    (* 2 *)
    arctan 2;
    arctan 3;
    mul4 ()
  in
  let time () =
    (* 2 *)
    let before = Sys.time () in
    let _ = run () in
    let after = Sys.time () in
    show p;
    print_newline ();
    Printf.printf "Time measured: %f" (after -. before)
  in
  Printf.printf "Computing pi (%d digits)..." (q - 1);
  print_newline ();
  time ();;

(* The snippet as posted never invokes main2; an entry-point call is assumed here. *)
let () = main2 ()
Quote:Original post by doctorsixstring
What version of .NET was used for the tests?
.NET 2.0


I have tested gumpy macdrunken's Java code; here is the result:
15.578000 seconds to compute pi with a precision of 10000 digits.
Promit's C# version:
19.443 seconds to compute pi with a precision of 10000 digits.

I think many people here misunderstood me:
1) I'm not searching for a fast way to compute pi.
2) I'm not interested in optimizing the original program.

I just saw the pi.d sample and noticed that I did not need to change any of the important functions (add, sub, mul, ...) to make the program run in C/C++ and C#. (I just added some casts.)
D, C++, and C# have similar syntax, so I thought: give them the same code and see what happens!

Why can the MS C++ compiler optimize those mul and div functions, but the C# compiler cannot?

[Edited by - Kambiz on October 7, 2006 5:29:26 PM]
Good to see some discussion -- I was hoping this thread would pick up some steam.

First of all, with regards to the Java version -- how do they compare on client and server VMs? I don't really want to set up a full java environment right now. Also, how does Java do with the original version of the code, without my modifications?

Second, I intentionally withheld discussion of why I made the changes I did, and what the implications were. It was mainly to see what others would say first, instead of me jumping full into the discussion from the beginning. So here's my perspective on the matter...

I started things off with NProf, just so that I didn't actually have to understand how this program works. Div and mul obviously account for the vast majority of CPU time. Notice that we're working with the C code's ideal case here -- it is reasonable to assume that it is impossible to beat the C version. (By the way, the C code is really tight. I couldn't find any modifications that made any dent in the execution time without falling back to MMX.) Looking at the functions, it's all completely trivial operations, with the exception of the divide and modulus. Now on x86, the idiv instruction computes both the division and modulus results at the same time. Looking at the generated assembly listing from VC makes it clear that the compiler has leveraged this optimization.
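To illustrate the point (a sketch of the general pattern, not code taken from the benchmark): when a quotient and a remainder are computed from the same operands, an x86 compiler can fold the pair into a single idiv, which leaves the quotient in EAX and the remainder in EDX.

// Sketch of the /% pattern a good x86 compiler folds into one idiv
// (illustrative only -- not the benchmark's exact code).
#include <cstdio>

int main()
{
    int b = 1234, divisor = 7;

    int quotient  = b / divisor;  // idiv produces the quotient in EAX...
    int remainder = b % divisor;  // ...and the remainder in EDX, in one go

    std::printf("%d / %d = %d remainder %d\n", b, divisor, quotient, remainder);
    return 0;
}

Whether the back end actually recognizes the pair is exactly what separates the compilers discussed in this thread.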

Some quick googling about C# math performance issues followed, which revealed a suggestion that the C# compiler and CLR do not optimize division constructs particularly well. So at that point, I began to suspect that replacing the division and modulus would make an impact. As you can see, it did. Note that the replacement is kind of crude -- it involves the FPU to do its work, when there's really no reason to involve the FPU. More importantly, super-micro-benchmarks would most likely reveal that the conversion from floating point to int is a performance problem. In fact, I expected that bringing the C# optimizations back to C++ would hurt the C++ times, but it didn't.
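For the curious, here is my guess at the kind of FPU-based replacement being described (a sketch under that assumption -- the modified C# isn't reproduced in this post): compute the quotient of the fixed /10 with a floating-point multiply, then recover the remainder with a multiply and subtract, so only one division-like operation remains.

// Hypothetical sketch of an FPU workaround for the /10 and %10 pair in mul()
// (my guess at the trick, not the actual change from the thread).
// For the non-negative ints this program produces, (int)(b * 0.1) matches b / 10.
#include <cstdio>

int main()
{
    for (int b = 0; b <= 1000; ++b)
    {
        int carry = static_cast<int>(b * 0.1);  // quotient via an FPU multiply
        int digit = b - carry * 10;             // remainder without a second division

        if (carry != b / 10 || digit != b % 10)
            std::printf("mismatch at %d\n", b);
    }
    std::printf("checked 0..1000\n");
    return 0;
}

The float-to-int conversion is the part that would hurt in a micro-benchmark, which is why this is a workaround rather than a real optimization.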

On later reflection, I decided that DMD probably botches this somewhat machine specific optimization in exactly the same way. The results posted later that are built on DM C++, combined with the substantially better DMD results using my modifications, support this conclusion. I strongly suspect that JVM will screw this up too.

In other words, all I did was a crude workaround of a more fundamental optimization problem. I didn't utilize special C# functionality or features, I just wrote a hack that basically dodges the bug. This is a problem that we can expect to vanish in the next...oh, I don't know, let's say something vague like the next 5-10 years. It's pretty obvious that math performance is hardly ideal in the managed world -- and as I said at the beginning, we're playing on C's home turf. We can't win here.

Lastly, realize that if someone were to break out the MMX intrinsics, everything except C would be seriously screwed. You'd be looking at times for the C code that were about 1/4 to 1/3 what they are now. DMD, C#, and Java wouldn't be able to touch that -- and we can't even do autovectorization like that statically, let alone in a JITter.

Let's be frank. The non-C languages here all manage to make a positively lame turnout. DMD, C#, and Java are all very respectable after my modifications, and certainly adequate for nearly all real world applications, but it took extra work to get there. In my case it was only 15 minutes because I knew what I was doing -- for a less experienced developer, or more complex code, the story could get much worse. There are lots of situations where C# and Java can outperform C/C++ code, but tight math and array accesses are obviously not one of them.

[Edited by - Promit on October 7, 2006 8:18:19 PM]
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote:Original post by deathkrush
The fastest way to calculate PI in C is of course:

*** Source Snippet Removed ***


Hey, I loved the snippet!
Though, it wouldn't compile as given (F_00() is used before it is defined), so I rearranged the code blocks.

Then it crunches out the number '0.250'.

So I might add: it may be a great way to calculate pi, compact and cool looking, but it lacks one nicety we'd like in a program that calculates pi: producing the correct result.
On all other accounts it's cool though! :)

- Jacob
"1 is equal to 2 for significantly large quantities of 1" - Anonymous
Quote:Original post by Promit
On later reflection, I decided that DMD probably botches this somewhat machine specific optimization in exactly the same way.

Examining the assembler output of DMD shows that what you suspect is true. Strangely, in other cases DMD does do that optimization. So it looks like a compiler problem that can be easily corrected.

Quote:This is a problem that we can expect to vanish in the next...oh, I don't know, let's say something vague like the next 5-10 years.
Or maybe in the next update, if it is an easily correctible problem?

Quote:It's pretty obvious that math performance is hardly ideal in the managed world -- and as I said at the beginning, we're playing on C's home turf. We can't win here.

C is at a disadvantage with heavy numerical work because of the aliasing problem. It's why Fortran is still preferred (and to be sure, Fortran compilers have gotten the benefit of decades of optimization work by legions of very smart programmers).

Quote:Lastly, realize that if someone were to break out the MMX intrinsics, everything except C would be seriously screwed. You'd be looking at times for the C code that were about 1/4 to 1/3 what they are now. DMD, C#, and Java wouldn't be able to touch that -- and we can't even do autovectorization like that statically, let alone in a JITter.


I don't know much about the intrinsics to support MMX, but since DMD already has built in intrinsics for some functions (like sin(), cos(), etc.), then adding more doesn't seem like it would be any more of a problem than it would be to add intrinsics to C. D has an inline assembler with support for MMX, so those instructions can be gotten at that way.

Quote:Let's be frank. The non-C languages here all manage to make a positively lame turnout. DMD, C#, and Java are all very respectable after my modifications, and certainly adequate for nearly all real world applications, but it took extra work to get there. In my case it was only 15 minutes because I knew what I was doing -- for a less experienced developer, or more complex code, the story could get much worse. There are lots of situations where C# and Java can outperform C/C++ code, but tight math and array accesses are obviously not one of them.


The substantial difference here is little more than an optimization folding /% into one instruction rather than two. I don't see any reason why C, C++, D, Java, or C# cannot do that, nor does it seem like a particularly hard optimization to do. All you're seeing is the result of a special case put into the optimizer, not any fundamental language difference.
Quote:Original post by Stachel
Quote:This is a problem that we can expect to vanish in the next...oh, I don't know, let's say something vague like the next 5-10 years.
Or maybe in the next update, if it is an easily correctible problem?
It's a damn lucky day when an optimizer problem is easily corrected -- optimization is dangerous business. Though I was actually referring more to the JITted languages.
Quote:C is at a disadvantage with heavy numerical work because of the aliasing problem. It's why Fortran is still preferred (and to be sure, Fortran compilers have gotten the benefit of decades of optimization work by legions of very smart programmers).
Looking at the assembly output of the C version, I don't see anything that can be fixed here -- what you're saying may be true for more complex mathematical work, but for this program, the C version is as fast as it gets without hitting machine specific code.
Quote:I don't know much about the intrinsics to support MMX, but since DMD already has built in intrinsics for some functions (like sin(), cos(), etc.), then adding more doesn't seem like it would be any more of a problem than it would be to add intrinsics to C. D has an inline assembler with support for MMX, so those instructions can be gotten at that way.
Intrinsic support itself is relatively easy to do. The real benefit of intrinsic support is that the optimizer can be conscious of your "assembly" when you use them, and that's a more difficult challenge. It's not an insurmountable challenge -- VC and GCC do it, after all. I'd put my hopes on GDC rather than DMD here.
Quote:The substantial difference here is little more than an optimization folding /% into one instruction rather than two. I don't see any reason why C, C++, D, Java, or C# cannot do that, nor does it seem like a particularly hard optimization to do. All you're seeing is the result of a special case put into the optimizer, not any fundamental language difference.
Sure (with the exception of support for vectorized instruction sets, which Java and C# lack). The problem is that introducing a lot of these features into the compiler is difficult to do well, and requires extensive testing. I can't claim to know why CLR can't do this optimization. There's no reason D can't do it, but it doesn't, which leads back into the original objection I expressed in the other thread -- D lacks developer support and maturity that the other languages, C and C++ in particular, have. Without that, they're screwed.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote:Original post by Promit
Quote:Original post by Stachel
Quote:This is a problem that we can expect to vanish in the next...oh, I don't know, let's say something vague like the next 5-10 years.
Or maybe in the next update, if it is an easily correctible problem?
It's a damn lucky day when an optimizer problem is easily corrected -- optimization is dangerous business. Though I was actually referring more to the JITted languages.


I would agree that just sticking a programmer on an optimizer and expecting him to make non-breaking improvements is unreasonable. But for the experts who do work on them, they may very well be easily corrected.

Quote:
Quote:C is at a disadvantage with heavy numerical work because of the aliasing problem. It's why Fortran is still preferred (and to be sure, Fortran compilers have gotten the benefit of decades of optimization work by legions of very smart programmers).
Looking at the assembly output of the C version, I don't see anything that can be fixed here -- what you're saying may be true for more complex mathematical work, but for this program, the C version is as fast as it gets without hitting machine specific code.


Where Fortran shines is its ability to replace entire loops with parallel vectorized code. With C's aliasing problem, such transformations are very nearly impossible, though the new restrict pointer type may help.
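To make the aliasing point concrete (a hypothetical example, not code from the benchmark): with plain pointers the compiler must assume the two arrays may overlap, which blocks vectorizing the loop; the restrict qualifier (spelled __restrict as an extension in VC and g++) promises they don't.

// Hypothetical sketch of how restrict-qualified pointers remove the aliasing barrier.
// Without the qualifier the compiler must assume dst and src might overlap and
// stay conservative; with it, the loop is free to be unrolled and vectorized.
void scale_noalias(float* __restrict dst, const float* __restrict src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * 4.0f;   // no overlap possible: vectorization is safe
}

void scale_maybe_alias(float* dst, const float* src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * 4.0f;   // dst could alias src: compiler must be careful
}

Fortran gets this guarantee for free, because its array arguments are not allowed to alias in the first place.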

Quote:
Quote:I don't know much about the intrinsics to support MMX, but since DMD already has built in intrinsics for some functions (like sin(), cos(), etc.), then adding more doesn't seem like it would be any more of a problem than it would be to add intrinsics to C. D has an inline assembler with support for MMX, so those instructions can be gotten at that way.
Intrinsic support itself is relatively easy to do. The real benefit of intrinsic support is that the optimizer can be conscious of your "assembly" when you use them, and that's a more difficult challenge. It's not an insurmountable challenge -- VC and GCC do it, after all. I'd put my hopes on GDC rather than DMD here.


That makes sense. GDC does leverage all the ongoing work done with gcc. There's no technical reason D should be at a disadvantage relative to C.

Quote:
Quote:The substantial difference here is little more than an optimization folding /% into one instruction rather than two. I don't see any reason why C, C++, D, Java, or C# cannot do that, nor does it seem like a particularly hard optimization to do. All you're seeing is the result of a special case put into the optimizer, not any fundamental language difference.
Sure (with the exception of support for vectorized instruction sets, which Java and C# lack). The problem is that introducing a lot of these features into the compiler is difficult to do well, and requires extensive testing. I can't claim to know why CLR can't do this optimization. There's no reason D can't do it, but it doesn't, which leads back into the original objection I expressed in the other thread -- D lacks developer support and maturity that the other languages, C and C++ in particular, have. Without that, they're screwed.


This is why D is obviously built to leverage existing technology. Both DMD and GDC are built on top of well-developed C compilers. C libraries are directly accessible from D. C header files can be mechanically converted to D. I think you put too much store in the /% optimization; in my experiments DMD will do that optimization in other circumstances.
Quote:Original post by Promit
First of all, with regards to the Java version -- how do they compare on client and server VMs? I don't really want to set up a full java environment right now. Also, how does Java do with the original version of the code, without my modifications?

For the Java version I just made a new project in NetBeans and ran the program on JRE 1.6.0 (beta 2). (I also tested it from the command line.)
I replaced the modified functions with the original ones and tested again:
package test;

import java.io.IOException;

public class Main
{
    public static void main( String[] args )
    {
        Main pi = new Main();
        pi.run( args );
    }

    private final int LONG_TIME = 4000;

    byte[] p;
    byte[] t;
    int q;

    private void run( String[] args )
    {
        int i;

        if( args.length == 1 )
        {
            q = Integer.parseInt( args[0] );
        }
        else
        {
            System.out.println( "Usage: pi [precision]" );
            return;
        }

        if( q < 0 )
        {
            System.out.println( "Precision was too low, running with precision of 0." );
            q = 0;
        }

        if( q > LONG_TIME )
        {
            System.out.println( "Be prepared to wait a while..." );
        }

        // Compute one more digit than we display to compensate for rounding
        q++;

        p = new byte[q + 1];
        t = new byte[q + 1];

        /* compute pi */
        long start = System.currentTimeMillis();
        arctan( 2 );
        arctan( 3 );
        mul4();
        long time = System.currentTimeMillis() - start;

        // Return to the number of digits we want to display
        q--;

        /* print pi */
        System.out.printf( "pi = %d.", p[0] );
        for( i = 1; i <= q; i++ )
            System.out.print( p[i] );
        System.out.println();

        System.out.printf( "%f seconds to compute pi with a precision of %d digits.", time / 1000.0, q );
        return;
    }

    void arctan( int s )
    {
        int n;

        t[0] = 1;
        div( s ); /* t[] = 1/s */
        add();
        n = 1;
        do
        {
            mul( n );
            div( s * s );
            div( n += 2 );
            if( ( ( n - 1 ) / 2 ) % 2 == 0 )
                add();
            else
                sub();
        } while( !tiszero() );
    }

    void add()
    {
        int j;
        for( j = q; j >= 0; j-- )
        {
            if( t[j] + p[j] > 9 )
            {
                p[j] += (byte) ( t[j] - 10 );
                p[j - 1] += 1;
            }
            else
                p[j] += t[j];
        }
    }

    void sub()
    {
        int j;
        for( j = q; j >= 0; j-- )
            if( p[j] < t[j] )
            {
                p[j] -= (byte) ( t[j] - 10 );
                p[j - 1] -= 1;
            }
            else
                p[j] -= t[j];
    }

    void mul( int multiplier )
    {
        int b;
        int i;
        int carry = 0, digit = 0;

        for( i = q; i >= 0; i-- )
        {
            b = t[i] * multiplier + carry;
            digit = b % 10;
            carry = b / 10;
            t[i] = (byte) digit;
        }
    }

    /* t[] /= l */
    void div( int divisor )
    {
        int i, b;
        int quotient, remainder = 0;

        for( i = 0; i <= q; i++ )
        {
            b = 10 * remainder + t[i];
            quotient = b / divisor;
            remainder = b % divisor;
            t[i] = (byte) quotient;
        }
    }

    void div4()
    {
        int i, c, d = 0;
        for( i = 0; i <= q; i++ )
        {
            c = ( 10 * d + p[i] ) / 4;
            d = ( 10 * d + p[i] ) % 4;
            p[i] = (byte) c;
        }
    }

    void mul4()
    {
        int i, c, d;
        d = c = 0;
        for( i = q; i >= 0; i-- )
        {
            d = ( p[i] * 4 + c ) % 10;
            c = ( p[i] * 4 + c ) / 10;
            p[i] = (byte) d;
        }
    }

    boolean tiszero()
    {
        int k;
        for( k = 0; k <= q; k++ )
            if( t[k] != 0 )
                return false;
        return true;
    }
}


Result:
39.000000 seconds to compute pi with a precision of 10000 digits.
The equivalent C# test:
C#: 34.745 seconds to compute pi with a precision of 10000 digits.

I thought that JIT compilation could produce more efficient code because it can optimize for the target machine. Maybe I should just wait for the next version of those compilers. C/C++ is a mature language, and it is not surprising that there are excellent compilers available for it.

I do not understand why people are rating me down for this thread(?)
I just downloaded DMD 0.169, released half an hour ago. The change log contains:
"- Improved optimization of div/mod combinations."

pi.d compiled with DMD 0.168 took 20 seconds to run on my machine. When compiled with DMD 0.169, 14 seconds is all it takes. That's now only 2 seconds behind GDC and g++.

DMD.score++;
I downloaded the new D compiler:
DMD 0.169: 17 seconds to compute pi with a precision of 10000 digits.

DMD 0.168: 26 seconds to compute pi with a precision of 10000 digits.
MS C++ 2005: 15 seconds to compute pi with a precision of 10000 digits.

Interesting, D is now much faster.
