
my c++ d c# benchmark!



#21   Members   


Posted 07 October 2006 - 03:02 AM

Looks like DMD is missing some optimization opportunities in the div and mul functions. I tried using a modification similar to what Promit had posted:

import std.c.stdio;
import std.c.stdlib;
import std.c.time;

const int LONG_TIME = 4000;

byte[] p;
byte[] t;
int q;

int main(char[][] args)
{
    int startime, endtime;
    int i;

    if (args.length == 2) {
        sscanf(&args[1][0], "%d", &q);
    } else {
        printf("Usage: pi [precision]\n");
        exit(55);
    }

    if (q < 0)
    {
        printf("Precision was too low, running with precision of 0.\n");
        q = 0;
    }

    if (q > LONG_TIME)
    {
        printf("Be prepared to wait a while...\n");
    }

    // Compute one more digit than we display to compensate for rounding
    q++;

    p.length = q + 1;
    t.length = q + 1;

    /* compute pi */

    std.c.time.time(&startime);
    arctan(2);
    arctan(3);
    mul4();
    std.c.time.time(&endtime);

    // Return to the number of digits we want to display
    q--;

    /* print pi */

    printf("pi = %d.", cast(int)(p[0]));
    for (i = 1; i <= q; i++)
        printf("%d", cast(int)(p[i]));
    printf("\n");
    printf("%ld seconds to compute pi with a precision of %d digits.\n", endtime - startime, q);

    return 0;
}

void arctan(int s)
{
    int n;

    t[0] = 1;
    div(s); /* t[] = 1/s */
    add();
    n = 1;
    do {
        mul(n);
        div(s * s);
        div(n += 2);
        if (((n - 1) / 2) % 2 == 0)
            add();
        else
            sub();
    } while (!tiszero());
}

void add()
{
    int j;

    for (j = q; j >= 0; j--)
    {
        if (t[j] + p[j] > 9) {
            p[j] += t[j] - 10;
            p[j-1] += 1;
        } else
            p[j] += t[j];
    }
}

void sub()
{
    int j;

    for (j = q; j >= 0; j--)
        if (p[j] < t[j]) {
            p[j] -= t[j] - 10;
            p[j-1] -= 1;
        } else
            p[j] -= t[j];
}

void mul(int multiplier)
{
    int b;
    int i;
    int carry = 0, digit = 0;

    for (i = q; i >= 0; i--) {
        b = (t[i] * multiplier + carry);
        carry = b / 10;
        digit = b - carry * 10;
        t[i] = digit;
    }
}

/* t[] /= l */

void div(int divisor)
{
    int i, b;
    int quotient, remainder = 0;

    for (i = 0; i <= q; i++) {
        b = (10 * remainder + t[i]);
        quotient = b / divisor;
        remainder = b - divisor * quotient;
        t[i] = quotient;
    }
}

void div4()
{
    int i, c, d = 0;

    for (i = 0; i <= q; i++) {
        c = (10 * d + p[i]) / 4;
        d = (10 * d + p[i]) % 4;
        p[i] = c;
    }
}

void mul4()
{
    int i, c, d;

    d = c = 0;

    for (i = q; i >= 0; i--) {
        d = (p[i] * 4 + c) % 10;
        c = (p[i] * 4 + c) / 10;
        p[i] = d;
    }
}

int tiszero()
{
    int k;

    for (k = 0; k <= q; k++)
        if (t[k] != 0)
            return false;
    return true;
}


Then I tested the original and modified programs on two systems: a Linux machine at my university and my laptop running Windows. Here are my results:

Linux 2.6.17-1.2142_FC4smp

Digital Mars D Compiler v0.168
dmd, no modifications: 50 seconds
dmd, modifications: 33 seconds
dmd flags: -release -inline -O

g++ (GCC) 4.0.2 20051125 (Red Hat 4.0.2-8)
g++, no modifications: 29 seconds
g++, modifications: 32 seconds
g++ flags: -O3



Windows XP SP2 Pro

Digital Mars D Compiler v0.168
dmd, no modifications: 20 seconds
dmd, modifications: 15 seconds
dmd flags: -release -inline -O

g++ (GCC) 3.4.2 (mingw-special)
g++, no modifications: 12 seconds
g++, modifications: 14 seconds
g++ flags: -O3

gdc (GCC) 3.4.2 (mingw-special) (gdc 0.19, using dmd 0.162)
gdc, no modifications: 12 seconds
gdc, modifications: 14 seconds
gdc flags: -frelease -finline -O3



I ran each test twice and took the better result. I used a Windows build of GDC (the GNU D Compiler) available from the following page: http://gdcwin.sourceforge.net/

#22   Members   


Posted 07 October 2006 - 03:06 AM

Quote:
Original post by Anonymous Poster
I think this was fair. It's not like Promit's code is more idiomatic C#; it's just a tweak that happens to make .NET run faster. Similar tweaks could probably be found in the C++ and D versions as well, and we could spend weeks optimizing each one. Why not compare versions that are nearly identical in code and save the trouble? It even looks to me like his change would make the algorithm wrong, since floating point math isn't as accurate, but I didn't check this.

Comparing identical versions is pointless unless you're comparing different compilers of the same language. In other words, it's only fair if you're comparing VC++ to g++. When comparing different languages, you have to make changes for each language in order to keep the test fair... a tightly optimized program for language X may well run like shit in language Y. You should keep the algorithm the same, and that's it.

CM

#23   Anonymous Poster   Guests

Posted 07 October 2006 - 03:17 AM

Quote:
Original post by doctorsixstring
Different languages work in different ways.
It's more of a comparison of compilers than languages here. As you can see, the code is nearly identical. What we can conclude from Promit's tests is that the C++ compiler relieves the writer of the code from micro-optimizations more than the .NET compiler does. He made the same fix for C++ and the speed was unaffected. So when writing code casually for C++, without regard to micro-optimizations, you'd get 34/15 times faster code than with the .NET compiler. With .NET you'd have to spend extra effort doing the compiler's job to get down to 22/20. I think this already tells us something useful.
Quote:
What I care about is the performance I'll achieve using those languages in the way they're designed.
Can you claim with a straight face that Promit's optimization is how you would've written the code in the first place in C#, and that you would've used the modulo operator in C++ instead? It's not like C# was "designed" to perform worse with the integer modulo/div combo than C++. It's just a quirk.

#24   Anonymous Poster   Guests

Posted 07 October 2006 - 03:28 AM

Quote:
Original post by Conner McCloud
Comparing identical versions is pointless unless you're comparing different compilers of the same language. In other words, it's only fair if you're comparing VC++ to g++. When comparing different languages, you have to make changes for each language in order to keep the test fair... a tightly optimized program for language X may well run like shit in language Y.
Even if we were comparing different compilers of the same language, one could use the same reasoning you do. One would want to optimize one's code for the particular compiler's quirks, no? And code tightly optimized for compiler X could run like shit when compiled with compiler Y.

For code like this, all three are essentially identical. None of them provides some super-speed construct, lacking from the other languages, that should be exploited for this code. Promit's changes still work for D and C++ directly. The original D code worked for C++ and C# (essentially) directly. I know there are tons of actual cases where language differences matter for speed; this just isn't one of them. Promit's change was only to make the compiler happy, not to make the code "better C#".

#25   Members   


Posted 07 October 2006 - 03:47 AM

Quote:
Original post by Anonymous Poster
It's more of a comparison of compilers than languages here.


Comparing a C# compiler to a C or C++ compiler to a D compiler is a bit like comparing apples to oranges, don't you think?

Quote:
Original post by Anonymous Poster
As you can see, the codes are near identical.


As I mentioned earlier, I think this is the wrong way to run a benchmark. What if the original example is written in a non-C language, and then ported to C in a way that is beneficial to the original language? Obviously that would be unfair.

Quote:
Original post by Anonymous Poster
What we could conclude from Promit's tests is that the C++ compiler relieves the writer of the code from micro-optimizations more than the .NET compiler does. He did the same fix for C++ and the speed was unaffected. So when writing code casually for C++ without regard to micro-optimizations, you'd get 34/15 times faster code than with the .NET compiler. With .NET you'd have to spend extra effort doing the compiler's job to get down to 22/20. I think this already tells us something useful.


I agree that C++ has the edge over C# in this case. It is interesting that C#'s modulus operator is so much slower than C++'s. What version of .NET was used for the tests? Has anyone compared it to non-MS implementations like Mono?

Quote:
Original post by Anonymous Poster
Can you claim with a straight face that Promit's opitimization is how you would've written the code in the first place with C#, and you would've used the modulo operator on C++ instead?


No, I honestly would never have thought to use Promit's code instead of the modulus.

Quote:
Original post by Anonymous Poster
It's not like C# was "designed" to perform worse with the integer modulo/div combo than C++. It's just a quirk.


I agree. It would be interesting to see if this quirk is common across all versions and implementations of .NET.

- Mike

#26   Members   


Posted 07 October 2006 - 04:54 AM

I ported Promit's C# code to Java. I also just woke up, so beware silly mistakes. This runs in about 22.7 seconds on an X2 3800+ @ 2GHz, to give a general idea. I don't have any C/C++/C# compilers installed right now, so I can't give any comparisons.

public class Pi
{
    public static void main( String[] args )
    {
        Pi pi = new Pi();
        pi.run( args );
    }

    private final int LONG_TIME = 4000;
    byte[] p;
    byte[] t;
    int q;

    private void run( String[] args )
    {
        int i;

        if( args.length == 1 )
        {
            q = Integer.parseInt( args[0] );
        }
        else
        {
            System.out.println( "Usage: pi [precision]" );
            return;
        }

        if( q < 0 )
        {
            System.out.println( "Precision was too low, running with precision of 0." );
            q = 0;
        }

        if( q > LONG_TIME )
        {
            System.out.println( "Be prepared to wait a while..." );
        }

        // Compute one more digit than we display to compensate for rounding
        q++;

        p = new byte[q + 1];
        t = new byte[q + 1];

        /* compute pi */

        long start = System.currentTimeMillis();
        arctan( 2 );
        arctan( 3 );
        mul4();
        long time = System.currentTimeMillis() - start;

        // Return to the number of digits we want to display
        q--;

        /* print pi */

        System.out.printf( "pi = %d.", p[0] );
        for( i = 1; i <= q; i++ )
            System.out.print( p[i] );
        System.out.println();
        System.out.printf( "%f seconds to compute pi with a precision of %d digits.", time / 1000.0, q );

        return;
    }

    void arctan( int s )
    {
        int n;

        t[0] = 1;
        div( s ); /* t[] = 1/s */
        add();
        n = 1;
        do
        {
            mul( n );
            div( s * s );
            div( n += 2 );
            if( ( ( n - 1 ) / 2 ) % 2 == 0 )
                add();
            else
                sub();
        } while( !tiszero() );
    }

    void add()
    {
        int j;

        for( j = q; j >= 0; j-- )
        {
            if( t[j] + p[j] > 9 )
            {
                p[j] += (byte) ( t[j] - 10 );
                p[j - 1] += 1;
            }
            else
                p[j] += t[j];
        }
    }

    void sub()
    {
        int j;

        for( j = q; j >= 0; j-- )
            if( p[j] < t[j] )
            {
                p[j] -= (byte) ( t[j] - 10 );
                p[j - 1] -= 1;
            }
            else
                p[j] -= t[j];
    }

    void mul( int multiplier )
    {
        int b;
        int i;
        int carry = 0, digit;

        for( i = q; i >= 0; i-- )
        {
            b = ( t[i] * multiplier + carry );
            carry = (int) ( b * 0.1f );
            digit = b - 10 * carry;
            t[i] = (byte) digit;
        }
    }

    /* t[] /= l */

    void div( int divisor )
    {
        int i, b;
        int quotient, remainder = 0;
        float fdiv = 1.0f / divisor;

        for( i = 0; i <= q; i++ )
        {
            b = ( 10 * remainder + t[i] );
            quotient = (int) ( b * fdiv );
            remainder = b - divisor * quotient;
            t[i] = (byte) quotient;
        }
    }

    void div4()
    {
        int i, c, d = 0;

        for( i = 0; i <= q; i++ )
        {
            c = ( 10 * d + p[i] ) / 4;
            d = ( 10 * d + p[i] ) % 4;
            p[i] = (byte) c;
        }
    }

    void mul4()
    {
        int i, c, d;

        d = c = 0;

        for( i = q; i >= 0; i-- )
        {
            d = ( p[i] * 4 + c ) % 10;
            c = ( p[i] * 4 + c ) / 10;
            p[i] = (byte) d;
        }
    }

    boolean tiszero()
    {
        int k;

        for( k = 0; k <= q; k++ )
            if( t[k] != 0 )
                return false;
        return true;
    }
}



This space for rent.

#27   Members   


Posted 07 October 2006 - 07:28 AM

Quote:
Original post by h3r3tic
Looks like DMD is missing some optimization opportunities in the div and mul functions.
I did a little experimenting and found out that the performance hinges upon whether the / and % operations are done together with one divide instruction, or separately with two. Doing them separately nearly doubles the benchmark time. The divide instruction is one expensive operation. All the benchmark does is test whether that optimization is being done or not. The DMD and C# compilers are deficient in not doing it, but that isn't a deficiency in the languages, just their compiler implementations.
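To make that concrete, here is a minimal C++ sketch of the pattern in question (the function names are illustrative, not from any post in this thread). The quotient and remainder of the same operands are computed side by side, which is exactly the pair a single x86 idiv can service, since idiv leaves the quotient in EAX and the remainder in EDX:

#include <cstdio>

// Both results come from the same division: a compiler that spots
// the adjacent pair can compute them with a single idiv, since that
// instruction produces the quotient (EAX) and remainder (EDX)
// together. A compiler that misses it emits two divides.
void split_digit(int b, int &quotient, int &remainder)
{
    quotient  = b / 10;
    remainder = b % 10;
}

int main()
{
    int q, r;
    split_digit(1234, q, r);
    std::printf("%d remainder %d\n", q, r); // prints "123 remainder 4"
    return 0;
}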

#28   Anonymous Poster   Guests

Posted 07 October 2006 - 08:41 AM

I'm a little off topic here, but it took 1.672 seconds to calculate pi to 10,000 places in Lisp ^-^ (I didn't write the code).

(defun pi-atan (k n)
  (do* ((a 0)
        (w (* n k))
        (k2 (* k k))
        (i -1))
       ((= w 0) a)
    (setq w (truncate w k2))
    (incf i 2)
    (incf a (truncate w i))
    (setq w (truncate w k2))
    (incf i 2)
    (decf a (truncate w i))))

(defun calc-pi (digits)
  (let* ((n digits)
         (m (+ n 3))
         (tenpower (expt 10 m)))
    (values
     (truncate
      (- (+ (pi-atan 18 (* tenpower 48))
            (pi-atan 57 (* tenpower 32)))
         (pi-atan 239 (* tenpower 20)))
      1000))))


#29   Anonymous Poster   Guests

Posted 07 October 2006 - 08:42 AM

Pants, I forgot the code tags.

#30   Members   


Posted 07 October 2006 - 09:13 AM

...and Mathematica does it in 0.05 seconds.


O'Caml version (based on the OP's version):

let main args =
  let q =
    if Array.length args = 2
    then int_of_string args.(1) + 1
    else 101
  in
  let p = Array.make (q + 1) 0
  and t = Array.make (q + 1) 0
  in
  let add () =
    for j = q downto 0 do
      if t.(j) + p.(j) > 9
      then
        (
          p.(j) <- p.(j) + t.(j) - 10;
          p.(j-1) <- p.(j-1) + 1
        )
      else
        p.(j) <- p.(j) + t.(j)
    done
  and sub () =
    for j = q downto 0 do
      if p.(j) < t.(j)
      then
        (
          p.(j) <- p.(j) - t.(j) + 10;
          p.(j - 1) <- p.(j - 1) - 1
        )
      else
        p.(j) <- p.(j) - t.(j)
    done
  and mul multiplier =
    let carry = ref 0
    and digit = ref 0
    and b = ref 0
    in
    for i = q downto 0 do
      b := t.(i) * multiplier + !carry;
      digit := !b mod 10;
      carry := !b / 10;
      t.(i) <- !digit
    done
  and div divisor =
    let quotient = ref 0
    and remainder = ref 0
    and b = ref 0
    in
    for i = 0 to q do
      b := 10 * !remainder + t.(i);
      quotient := !b / divisor;
      remainder := !b mod divisor;
      t.(i) <- !quotient
    done
  and mul4 () =
    let c = ref 0
    and d = ref 0
    in
    for i = q downto 0 do
      d := (p.(i) * 4 + !c) mod 10;
      c := (p.(i) * 4 + !c) / 10;
      p.(i) <- !d
    done
  and tiszero () =
    let k = ref 0
    and r = ref true
    in
    while (!k <= q) && !r do
      r := t.(!k) = 0;
      k := !k + 1
    done;
    !r
  in
  let arctan s =
    let n = ref 1
    in
    let loop () =
      mul !n;
      div (s * s);
      n := !n + 2;
      div !n;
      if ((!n - 1) / 2) mod 2 = 0
      then add ()
      else sub ()
    in
    t.(0) <- 1;
    div s;
    add ();
    loop ();
    while not (tiszero ()) do
      loop ()
    done
  in
  let show array =
    Array.iter print_int array
  and run () =
    arctan 2;
    arctan 3;
    mul4 ()
  in
  let time () =
    let before = Sys.time ()
    in
    run ();
    let after = Sys.time ()
    in
    show p;
    print_newline();
    Printf.printf "Time measured: %f" (after -. before)
  in
  Printf.printf "Computing pi (%d digits)..." (q - 1);
  time ();;

main Sys.argv;;



C++: 27s
C#: 53.6s
O'Caml: 60.7s

#31   Members   


Posted 07 October 2006 - 10:19 AM

O'Caml: 61.98 seconds for your version, 51.29 for mine. Also, with array bounds-checking disabled: 59.71 for your version, 50.04 for mine.

My code is given below. The numbers in comments are the number of executions of each line of code when computing 1000 digits. The modifications with regard to your code mainly consisted of removing the unnecessary indirections (references) you added. The original C code doesn't use pointers-to-variables, so why should we?


let args = Sys.argv
let q =
  if Array.length args = 2
  then int_of_string args.(1) + 1
  else 101

let p = Array.make (q + 1) 0
let t = Array.make (q + 1) 0

let main2 () =
  (* 2 *) let add () =
    (* 1486 *) for j = q downto 0 do
      if t.(j) + p.(j) > 9
      then
        (
          p.(j) <- p.(j) + t.(j) - 10;
          p.(j-1) <- p.(j-1) + 1
        )
      else
        p.(j) <- p.(j) + t.(j)
    done
  and sub () =
    (* 1486 *) for j = q downto 0 do
      if p.(j) < t.(j)
      then
        (
          p.(j) <- p.(j) - t.(j) + 10;
          p.(j - 1) <- p.(j - 1) - 1
        )
      else
        p.(j) <- p.(j) - t.(j)
    done

  and mul multiplier =
    (* 2968 *) let rec mul i carry =
      (* 2737504 *) if i >= 0 then
        let b = t.(i) * multiplier + carry in
        let carry = b / 10 in
        (t.(i) <- b mod 10; mul (i-1) carry)
    in mul q 0

  and div divisor =
    (* 5940 *) let rec div i remainder =
      (* 5477220 *) if i <= q then
        let b = 10 * remainder + t.(i) in
        let remainder = b mod divisor in
        (t.(i) <- b / divisor; div (i+1) remainder)
    in div 0 0

  and mul4 () =
    (* 2 *) let rec mul i c =
      (* 1106 *) if i >= 0 then
        let a = p.(i) * 4 + c in
        let d = a mod 10 in
        let c = a / 10 in
        (p.(i) <- d; mul (i-1) c)
    in mul q 0

  and tiszero () =
    (* 2968 *) let rec check i =
      (* 1375939 *) (i > q) || (t.(i) = 0 && check (i+1))
    in check 0

  in let arctan s =
    (* 4 *) let sqs = s * s in
    let rec loop n =
      (* 2968 *) mul n;
      div sqs;
      let n = n + 2 in
      div n;
      if ((n - 1) / 2) mod 2 = 0 then add () else sub ();
      if not (tiszero ()) then loop n;
    in
    t.(0) <- 1;
    div s;
    add ();
    loop 1
  in
  let show array =
    (* 2 *) Array.iter print_int array
  and run () =
    (* 2 *) arctan 2;
    arctan 3;
    mul4 ()
  in
  let time () =
    (* 2 *) let before = Sys.time () in
    let _ = run () in
    let after = Sys.time () in
    show p;
    print_newline();
    Printf.printf "Time measured: %f" (after -. before)
  in
  Printf.printf "Computing pi (%d digits)..." (q - 1);
  print_newline();
  time ();;

main2 ();;




#32   Members   


Posted 07 October 2006 - 10:29 AM

Quote:
Original post by doctorsixstring
What version of .NET was used for the tests?
.NET 2.0


I have tested gumpy macdrunken's Java code; here is the result:
15.578000 seconds to compute pi with a precision of 10000 digits.
Promit's C# version:
19.443 seconds to compute pi with a precision of 10000 digits.

I think many people here misunderstood me:
1) I'm not searching for a fast way to compute pi.
2) I'm not interested in optimizing the original program.

I just saw the pi.d sample and noticed that I did not need to change any of the important functions (add, sub, mul, ...) to make the program run in C/C++ and C#. (I just added some casts.)
D, C++ and C# have a similar syntax, so I thought "give them the same code and see what happens!".

Why can the MS C++ compiler optimize those mul and div functions but the C# compiler cannot?

[Edited by - Kambiz on October 7, 2006 5:29:26 PM]

#33   Senior Moderators   


Posted 07 October 2006 - 02:18 PM

Good to see some discussion -- I was hoping this thread would pick up some steam.

First of all, with regards to the Java version -- how do they compare on client and server VMs? I don't really want to set up a full Java environment right now. Also, how does Java do with the original version of the code, without my modifications?

Second, I intentionally withheld discussion of why I made the changes I did, and what the implications were. It was mainly to see what others would say first, instead of me jumping full into the discussion from the beginning. So here's my perspective on the matter...

I started things off with NProf, just so that I didn't actually have to understand how this program works. Div and mul obviously account for the vast majority of CPU time. Notice that we're working with C code's ideal case here -- it is reasonable to assume that it is impossible to beat the C version. (By the way, the C code is really tight. I couldn't find any modifications that made any dent in the execution time without falling back to MMX.) So looking at the functions, they consist of completely trivial operations, with the exception of the divide and modulus. Now on x86, the idiv instruction computes both the division and modulus results at the same time. Looking at the generated assembly listing from VC makes it clear that the compiler has leveraged this optimization.

Some quick googling about C# math performance issues followed, which revealed a suggestion that the C# compiler and CLR do not optimize division constructs particularly well. So at that point, I began to suspect that replacing the division and modulus would make an impact. As you can see, it did. Note that the replacement is kind of crude -- it involves the FPU to do its work, when there's really no reason to involve the FPU. More importantly, super-micro-benchmarks would most likely reveal that conversion from floating point to int is a performance problem. In fact, I expected that bringing the C# optimizations back to C++ would negatively impact performance, which it didn't.
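Spelled out, the workaround amounts to the following transformation (a minimal C++ sketch with made-up helper names, mirroring the float-reciprocal trick visible in the ports above). It is only valid while the dividend stays small enough that the rounding error in the reciprocal cannot drag the truncation across an integer boundary, which holds for the digit-sized intermediates this benchmark produces:

#include <cstdio>

// Exact integer form: one idiv can produce both results.
static inline void digit_exact(int b, int &carry, int &digit)
{
    carry = b / 10;
    digit = b % 10;
}

// The workaround: multiply by a precomputed reciprocal on the FPU,
// then truncate back to int and recover the remainder by hand.
static inline void digit_fpu(int b, int &carry, int &digit)
{
    carry = (int)(b * 0.1f);
    digit = b - 10 * carry;
}

int main()
{
    // Sanity check that the two forms agree over the range used here.
    for (int b = 0; b < 1000; b++) {
        int c1, d1, c2, d2;
        digit_exact(b, c1, d1);
        digit_fpu(b, c2, d2);
        if (c1 != c2 || d1 != d2)
            std::printf("mismatch at %d\n", b);
    }
    std::puts("done");
    return 0;
}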

On later reflection, I decided that DMD probably botches this somewhat machine specific optimization in exactly the same way. The results posted later that are built on DM C++, combined with the substantially better DMD results using my modifications, support this conclusion. I strongly suspect that JVM will screw this up too.

In other words, all I did was a crude workaround of a more fundamental optimization problem. I didn't utilize special C# functionality or features, I just wrote a hack that basically dodges the bug. This is a problem that we can expect to vanish in the next...oh, I don't know, let's say something vague like the next 5-10 years. It's pretty obvious that math performance is hardly ideal in the managed world -- and as I said at the beginning, we're playing on C's home turf. We can't win here.

Lastly, realize that if someone were to break out the MMX intrinsics, everything except C would be seriously screwed. You'd be looking at times for the C code that were about 1/4 to 1/3 what they are now. DMD, C#, and Java wouldn't be able to touch that -- and we can't even do autovectorization like that statically, let alone in a JITter.

Let's be frank. The non-C languages here all manage to make a positively lame turnout. DMD, C#, and Java are all very respectable after my modifications, and certainly adequate for nearly all real world applications, but it took extra work to get there. In my case it was only 15 minutes because I knew what I was doing -- for a less experienced developer, or more complex code, the story could get much worse. There are lots of situations where C# and Java can outperform C/C++ code, but tight math and array accesses are obviously not one of them.

[Edited by - Promit on October 7, 2006 8:18:19 PM]

#34   Members   


Posted 07 October 2006 - 02:44 PM

Quote:
Original post by deathkrush
The fastest way to calculate PI in C is of course:

*** Source Snippet Removed ***


Hey, I loved the snippet!
Though, it wouldn't compile as given (F_00() is used before being defined), so I rearranged the code blocks.

Then it crunches out the number '0.250'.

So I might add: it may be a great way to calculate pi, compact and cool-looking, but it lacks one extra nicety we would like in a program that calculates pi: determining the correct result.
On all other accounts it's cool though! :)

- Jacob

#35   Members   


Posted 07 October 2006 - 03:46 PM

Quote:
Original post by Promit
On later reflection, I decided that DMD probably botches this somewhat machine specific optimization in exactly the same way.

Examining the assembler output of DMD shows what you suspect is true. Strangely, in other cases, DMD does do that optimization. So it looks like a compiler problem that can be easily corrected.

Quote:
This is a problem that we can expect to vanish in the next...oh, I don't know, let's say something vague like the next 5-10 years.
Or maybe in the next update, if it is an easily correctable problem?

Quote:
It's pretty obvious that math performance is hardly ideal in the managed world -- and as I said at the beginning, we're playing on C's home turf. We can't win here.

C is at a disadvantage with heavy numerical work because of the aliasing problem. It's why Fortran is still preferred (and to be sure, Fortran compilers have gotten the benefit of decades of optimization work by legions of very smart programmers).

Quote:
Lastly, realize that if someone were to break out the MMX intrinsics, everything except C would be seriously screwed. You'd be looking at times for the C code that were about 1/4 to 1/3 what they are now. DMD, C#, and Java wouldn't be able to touch that -- and we can't even do autovectorization like that statically, let alone in a JITter.


I don't know much about the intrinsics to support MMX, but since DMD already has built in intrinsics for some functions (like sin(), cos(), etc.), then adding more doesn't seem like it would be any more of a problem than it would be to add intrinsics to C. D has an inline assembler with support for MMX, so those instructions can be gotten at that way.

Quote:
Let's be frank. The non-C languages here all manage to make a positively lame turnout. DMD, C#, and Java are all very respectable after my modifications, and certainly adequate for nearly all real world applications, but it took extra work to get there. In my case it was only 15 minutes because I knew what I was doing -- for a less experienced developer, or more complex code, the story could get much worse. There are lots of situations where C# and Java can outperform C/C++ code, but tight math and array accesses are obviously not one of them.


The substantial difference here is little more than an optimization folding /% into one instruction rather than two. I don't see any reason why C, C++, D, Java, or C# cannot do that, nor does it seem like a particularly hard optimization to do. All you're seeing is the result of a special case put into the optimizer, not any fundamental language difference.

#36   Senior Moderators   


Posted 07 October 2006 - 04:04 PM

Quote:
Original post by Stachel
Quote:
This is a problem that we can expect to vanish in the next...oh, I don't know, let's say something vague like the next 5-10 years.
Or maybe in the next update, if it is an easily correctable problem?
It's a damn lucky day when an optimizer problem is easily corrected -- optimization is dangerous business. Though I was actually referring more to the JITted languages.
Quote:
C is at a disadvantage with heavy numerical work because of the aliasing problem. It's why Fortran is still preferred (and to be sure, Fortran compilers have gotten the benefit of decades of optimization work by legions of very smart programmers).
Looking at the assembly output of the C version, I don't see anything that can be fixed here -- what you're saying may be true for more complex mathematical work, but for this program, the C version is as fast as it gets without hitting machine specific code.
Quote:
I don't know much about the intrinsics to support MMX, but since DMD already has built in intrinsics for some functions (like sin(), cos(), etc.), then adding more doesn't seem like it would be any more of a problem than it would be to add intrinsics to C. D has an inline assembler with support for MMX, so those instructions can be gotten at that way.
Intrinsic support itself is relatively easy to do. The real benefit of intrinsic support is that the optimizer can be conscious of your "assembly" when you use them, and that's a more difficult challenge. It's not an insurmountable challenge -- VC and GCC do it, after all. I'd put my hopes on GDC rather than DMD here.
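For illustration, here is roughly what the MMX intrinsic route looks like on the C/C++ side (a hedged sketch: add8 is a made-up example, while <mmintrin.h>, __m64, _mm_add_pi8, and _mm_empty are the MMX intrinsics that VC and GCC both ship):

#include <mmintrin.h> // MMX intrinsics (x86)

// Adds eight packed bytes with a single paddb. Because _mm_add_pi8
// is an intrinsic rather than inline asm, the compiler sees through
// the call and can still allocate registers and schedule code
// around it -- the advantage over a raw inline assembler block.
void add8(unsigned char *dst, const unsigned char *src)
{
    __m64 a = *(const __m64 *)dst;
    __m64 b = *(const __m64 *)src;
    *(__m64 *)dst = _mm_add_pi8(a, b);
    _mm_empty(); // clear MMX state before any FPU code runs
}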
Quote:
The substantial difference here is little more than an optimization folding /% into one instruction rather than two. I don't see any reason why C, C++, D, Java, or C# cannot do that, nor does it seem like a particularly hard optimization to do. All you're seeing is the result of a special case put into the optimizer, not any fundamental language difference.
Sure (with the exception of support for vectorized instruction sets, which Java and C# lack). The problem is that introducing a lot of these features into the compiler is difficult to do well, and requires extensive testing. I can't claim to know why CLR can't do this optimization. There's no reason D can't do it, but it doesn't, which leads back into the original objection I expressed in the other thread -- D lacks developer support and maturity that the other languages, C and C++ in particular, have. Without that, they're screwed.

#37   Members   


Posted 07 October 2006 - 04:55 PM

Quote:
Original post by Promit
Quote:
Original post by Stachel
Quote:
This is a problem that we can expect to vanish in the next...oh, I don't know, let's say something vague like the next 5-10 years.
Or maybe in the next update, if it is an easily correctable problem?
It's a damn lucky day when an optimizer problem is easily corrected -- optimization is dangerous business. Though I was actually referring more to the JITted languages.


I would agree that just sticking a programmer on an optimizer and expecting him to make non-breaking improvements is unreasonable. But for the experts who do work on them, they may very well be easily corrected.

Quote:
Quote:
C is at a disadvantage with heavy numerical work because of the aliasing problem. It's why Fortran is still preferred (and to be sure, Fortran compilers have gotten the benefit of decades of optimization work by legions of very smart programmers).
Looking at the assembly output of the C version, I don't see anything that can be fixed here -- what you're saying may be true for more complex mathematical work, but for this program, the C version is as fast as it gets without hitting machine specific code.


Where Fortran shines is its ability to replace entire loops with parallel vectorized code. With C's aliasing problem, such transformations are very nearly impossible, though the new restrict pointer type may help.
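A small C++ sketch of that aliasing point (hypothetical functions; C99 spells the qualifier restrict, and most C++ compilers accept __restrict as an extension):

// Without the qualifier the compiler must assume dst and src may
// overlap, so it has to preserve the exact load/store order and
// generally cannot vectorize the loop.
void scale(float *dst, const float *src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

// The programmer promises the arrays don't alias -- the no-alias
// guarantee Fortran gets for free -- so the compiler is free to
// unroll and vectorize.
void scale_noalias(float * __restrict dst,
                   const float * __restrict src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}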

Quote:
Quote:
I don't know much about the intrinsics to support MMX, but since DMD already has built in intrinsics for some functions (like sin(), cos(), etc.), then adding more doesn't seem like it would be any more of a problem than it would be to add intrinsics to C. D has an inline assembler with support for MMX, so those instructions can be gotten at that way.
Intrinsic support itself is relatively easy to do. The real benefit of intrinsic support is that the optimizer can be conscious of your "assembly" when you use them, and that's a more difficult challenge. It's not an insurmountable challenge -- VC and GCC do it, after all. I'd put my hopes on GDC rather than DMD here.


That makes sense. GDC does leverage all the ongoing work done with gcc. There's no technical reason D should be at a disadvantage relative to C.

Quote:
Quote:
The substantial difference here is little more than an optimization folding /% into one instruction rather than two. I don't see any reason why C, C++, D, Java, or C# cannot do that, nor does it seem like a particularly hard optimization to do. All you're seeing is the result of a special case put into the optimizer, not any fundamental language difference.
Sure (with the exception of support for vectorized instruction sets, which Java and C# lack). The problem is that introducing a lot of these features into the compiler is difficult to do well, and requires extensive testing. I can't claim to know why CLR can't do this optimization. There's no reason D can't do it, but it doesn't, which leads back into the original objection I expressed in the other thread -- D lacks developer support and maturity that the other languages, C and C++ in particular, have. Without that, they're screwed.


This is why D is obviously built to leverage existing technology. Both DMD and GDC are built on top of well-developed C compilers. C libraries are directly accessible from D. C header files can be mechanically converted to D. I think you set too much store by the /% optimization; in my experiments DMD will do that optimization in other circumstances.


#38   Members   


Posted 07 October 2006 - 09:03 PM

Quote:
Original post by Promit
First of all, with regards to the Java version -- how do they compare on client and server VMs? I don't really want to set up a full Java environment right now. Also, how does Java do with the original version of the code, without my modifications?

For the Java version I just made a new project in NetBeans and ran the program on JRE 1.6.0 (beta 2). (I also tested it from the command line.)
I replaced the modified functions with the original ones and tested again:

package test;

import java.io.IOException;

public class Main
{
    public static void main( String[] args )
    {
        Main pi = new Main();
        pi.run( args );
    }

    private final int LONG_TIME = 4000;
    byte[] p;
    byte[] t;
    int q;

    private void run( String[] args )
    {
        int i;

        if( args.length == 1 )
        {
            q = Integer.parseInt( args[0] );
        }
        else
        {
            System.out.println( "Usage: pi [precision]" );
            return;
        }

        if( q < 0 )
        {
            System.out.println( "Precision was too low, running with precision of 0." );
            q = 0;
        }

        if( q > LONG_TIME )
        {
            System.out.println( "Be prepared to wait a while..." );
        }

        // Compute one more digit than we display to compensate for rounding
        q++;

        p = new byte[q + 1];
        t = new byte[q + 1];

        /* compute pi */

        long start = System.currentTimeMillis();
        arctan( 2 );
        arctan( 3 );
        mul4();
        long time = System.currentTimeMillis() - start;

        // Return to the number of digits we want to display
        q--;

        /* print pi */

        System.out.printf( "pi = %d.", p[0] );
        for( i = 1; i <= q; i++ )
            System.out.print( p[i] );
        System.out.println();
        System.out.printf( "%f seconds to compute pi with a precision of %d digits.", time / 1000.0, q );

        return;
    }

    void arctan( int s )
    {
        int n;

        t[0] = 1;
        div( s ); /* t[] = 1/s */
        add();
        n = 1;
        do
        {
            mul( n );
            div( s * s );
            div( n += 2 );
            if( ( ( n - 1 ) / 2 ) % 2 == 0 )
                add();
            else
                sub();
        } while( !tiszero() );
    }

    void add()
    {
        int j;

        for (j = q; j >= 0; j--)
        {
            if (t[j] + p[j] > 9)
            {
                p[j] += (byte)(t[j] - 10);
                p[j - 1] += 1;
            }
            else
                p[j] += t[j];
        }
    }

    void sub()
    {
        int j;

        for (j = q; j >= 0; j--)
            if (p[j] < t[j])
            {
                p[j] -= (byte)(t[j] - 10);
                p[j - 1] -= 1;
            }
            else
                p[j] -= t[j];
    }

    void mul(int multiplier)
    {
        int b;
        int i;
        int carry = 0, digit = 0;

        for (i = q; i >= 0; i--)
        {
            b = (t[i] * multiplier + carry);
            digit = b % 10;
            carry = b / 10;
            t[i] = (byte)digit;
        }
    }

    /* t[] /= l */

    void div(int divisor)
    {
        int i, b;
        int quotient, remainder = 0;

        for (i = 0; i <= q; i++)
        {
            b = (10 * remainder + t[i]);
            quotient = b / divisor;
            remainder = b % divisor;
            t[i] = (byte)quotient;
        }
    }

    void div4()
    {
        int i, c, d = 0;

        for (i = 0; i <= q; i++)
        {
            c = (10 * d + p[i]) / 4;
            d = (10 * d + p[i]) % 4;
            p[i] = (byte)c;
        }
    }

    void mul4()
    {
        int i, c, d;

        d = c = 0;

        for (i = q; i >= 0; i--)
        {
            d = (p[i] * 4 + c) % 10;
            c = (p[i] * 4 + c) / 10;
            p[i] = (byte)d;
        }
    }

    boolean tiszero()
    {
        int k;

        for (k = 0; k <= q; k++)
            if (t[k] != 0)
                return false;
        return true;
    }
}



Result:
39.000000 seconds to compute pi with a precision of 10000 digits.
Equivalent C# test:
C#: 34.745 seconds to compute pi with a precision of 10000 digits.

I thought that JIT compilation could produce more efficient code because it can optimize for the target machine. Maybe I should just wait for the next versions of those compilers. C and C++ are mature languages, and it is not surprising that excellent compilers are available for them.

I do not understand why people are rating me down for this thread.

#39   Members   


Posted 07 October 2006 - 09:59 PM

I just downloaded DMD 0.169, released half an hour ago. The change log contains:
"- Improved optimization of div/mod combinations."

pi.d compiled with DMD 0.168 took 20 seconds to run on my machine. When compiled with DMD 0.169, 14 seconds is all it takes. That's now only 2 seconds behind GDC and g++.

DMD.score++;

#40   Members   


Posted 07 October 2006 - 10:29 PM

I downloaded the new D compiler:
DMD 0.169: 17 seconds to compute pi with a precision of 10000 digits.

DMD 0.168: 26 seconds to compute pi with a precision of 10000 digits.
MS C++ 2005: 15 seconds to compute pi with a precision of 10000 digits.

Interesting, D is now much faster.



