[java] Perfomance vs. improved readability (Escape analysis does not work!?)

Started by
17 comments, last by Frederick 14 years ago
Hey!

Thanks for your answers!

Quote:
mm, interesting.
How much difference in performance?


Frankly I can´t tell, but I don´t like to have an allocation hell going on in my project. Maybe its not a big deal in this example, but it has a potential to sum up quickly. The JMonkeyEngine guys seem to be fine - maybe I should be too...but I don´t like to run allocations out of control.


Quote:
Java, due to the design of its memory model, is worst possible case for this type of operations and will always remain as such. The biggest problems do not come from temporaries, but from lack of in-place array allocations.


Actually that is not quite right... I use ByteBuffer objects to store all my vertices (same for matrices) and pass them to the jni c++ layer. ;-)
But you are right. This is an annoying hack... but I got around with that.

So no big performance hit here...

I primarily chose java because I hoped for more productivity. That has been mostly right up to now, because of the great refactoring tool support, lack of header files and so on. I was also annoyed of all the small flaws of c++ one has to work around... its simply not a clean language.
It would be a strange insight in my programming career that c++ might actually allow MORE intuitive programming concerning algebra than java.

Frankly I like java and its dynamic nature... There are just a few flaws. EA could solve the stack allocation problem.

Quote:
Escape analysis under 1.6 has been shown to provide around factor of 15 performance improvement in best case


How the heck do I reliably switch that thing on (Yes I tried the -XX:+UseEscapeAnalysis flag). If I am right it is as equally capable than the C++ stack scope auto-deallocation thing, plus may be more capable in future.
The analysis phase takes startup time, but that doesn´t matter.

I have come to believe that EA never worked in any example (That one I wrote about, was most likely a change of scala in the profiler, which seemed like
lower allocation level).

Has anybody perceived real results with EA and can tell me a method to reproduce that, how to enable and verify that it kicks in ?

Lets take this example again:
Quote:
public void rotate(float angle, Vector3 axis) {
Matrix3 rotationMatrix = new Matrix3();

rotationMatrix.loadAxisAngleRotation(angle, axis);
rotationMatrix.transformOnlyOrientation(this);
}


The matrix allocation should be clearly optimized, or am I wrong about that ?

Quote:
But that implies that the escape analysis has to be performed through n-level indirection. I am not sure about that. I cannot deduct that from the link. Can you give me any hint?


Ok you refer to my first more complicated example, keep in mind, that the simpler also does not work for me. But... I have no exact source, but I figured if that HotSpot is capable of:

Quote:
The method makes a copy to prevent modification of the original object by the caller. If the compiler determines that the getPerson() method is being invoked in a loop, it will inline that method. In addition to this, by escape analysis, if the compiler determines that the original object is never modified, it may optimize and eliminate the call to make a copy.


Lets see:

1.)multiply is in a loop and gets inlined.
2.)transform() is in multiply and thus in a loop (the indirection you mentioned)
3.)dx is read only accessed and its pointer is not stored in a collection

So the case is at least similar... why would hotspot only inline one call level.
Why wouldn´t it bake the second level also into one big soup !? ,-)

Not sure about that, but keep in mind, that the more obvious example I provided before doesn´t work either.

Quote:
Java, due to the design of its memory model, is worst possible case for this type of operations and will always remain as such.


Quote:
Until Java introduces C# struct-like type, it's simply not economically viable for this type of tasks.


Don´t be so hard to java. Its actually a good language and HotSpot is magic. Also run-time code loading and reflection gonna be cool in gamedev. Netbeans
is also superior to VS. You don´t believe ? It is... I know both ;-)

There are some pieces missing. But... if an "array"-pool allocation mechanism gets added to OpenJDK and EA does work reliably the language would be there.
I am too busy with game development (-:, but some compiler freaks could kick it...

Till then use ByteBuffers and wrapper classes to solve the pool problem.

C# is not so cross platform and I simply won´t learn another system.

I like java and I don´t feel like going to c++, at least not yet. Please help
me to get that damn thing going !!!

Thank you for your kind attention,
Frederik
Advertisement
Quote:Original post by ppgamedev
http://hal.inria.fr/docs/00/31/20/39/PDF/RT-0353.pdf

Well, in this report the difference with Fortran is not so dramatic.

From the benchmark:
  private static final int datasizes_M[] = {50000,100000,500000};  private static final int datasizes_N[] = {50000,100000,500000};  private static final int datasizes_nz[] = {250000,500000,2500000};  private static final int SPARSE_NUM_ITER = 200;  Random R = new Random(RANDOM_SEED);  double [] x;   double [] y;   double [] val;   int [] col;  int [] row;  int [] lowsum;  int [] highsum;


Like I said - contiguous data. Java only allows it for primitive types.

This is relevant to OP - if you want numeric performance in Java, the bottleneck is not bytecode or CPU, it's memory layout, which means throwing out the OO way. Something which is incredibly unreadable.

As soon as things are wrapped in classes, the instances get scattered all over the memory, incurring huge cache access penalties.

A compromise is SoA layout upon which bulk operations are performed. That way code readability can be preserved, but is only effective if there are large numbers of elements - which is orthogonal to OP's design.


And I just love this type of academic articles. First they reject a test they know performs sub-optimally, then they run a few tutorial-grade benchmarks and draw a pretty picture. Whatever happened to actually interpreting results and analyzing the causes, testing hypothesis and similar.

Quote:How the heck do I reliably switch that thing on (Yes I tried the -XX:+UseEscapeAnalysis flag). If I am right it is as equally capable than the C++ stack scope auto-deallocation thing, plus may be more capable in future.

It was added in 1.6.14 and disabled in 1.6.18. It is currently available experimentally under 1.7, but it may vary depending on the build - I don't follow beta versions. It might have been disabled or crippled again due to problems with G1 GC.

It is also inevitably up to VM to decide when to kick in. So it might be in, but might not be working in this example.

Quote:Don´t be so hard to java. Its actually a good language and HotSpot is magic. Also run-time code loading and reflection gonna be cool in gamedev.

There are three alternatives:
- If readability is vital - go for it, forget about performance until profiler shows it's a problem (should be fine for hundreds, or perhaps even thousands of objects). At some point in the future, escape analysis may help - but there is no guarantee. Readability remains constant, VM will only improve.
- If consistent performance is important today, then array-centric approach or using in/out parameters is the way to go. It doesn't rely on 'magic', but works.
- To really push the boundaries, you simply need a language that allows flat memory layout.

[Edited by - Antheus on May 3, 2010 6:55:39 PM]
Quote:Original post by Frederick
Quote:Original post by ppgamedev
Quote:Original post by Frederick
Thats a thing I really dislike about java... code like this is simply not feasible from a performance standpoint (at least it was), but is perfectly fine in c++

[irony]Hmm, interesting.
How much difference in performance?[/irony]

Frankly I can't tell, but I don't like to have an allocation hell going on in my project.

But unless you change the code, how would you avoid that allocation hell with C++?

Quote:Original post by Antheus
It is also inevitably up to VM to decide when to kick in. So it might be in, but might not be working in this example.

Not really sure about this because:
Quote:Original from Frederick link to sun
Escape analysis is a technique by which the Java™ Hotspot Server Compiler can analyze the scope of an object and decide whether to allocate memory on the heap or not.

so it is not a JVM improvement but a compiler one.
Or maybe not?

Quote:Original post by Antheus
Like I said - contiguous data. Java only allows it for primitive types.

What do you mean with that?
Do you mean that you can create an array of contiguous objects in C++?

Quote:Original post by Antheus
Until Java introduces C# struct-like type, it's simply not economically viable for this type of tasks.

I am not an expert in C# but I think struct has value semantics, and in C# that means pass-by-value.
Wouldn't it be a heavy penalty for performance?

It's curiosity.
Can you try something like this?:
    public Point3 multiply(Point3 p) {        Point3 r = new Point3(w);        Vector3 dx = x.multiply(p.getX());        Vector3 dy = x.multiply(p.getY());        Vector3 dz = x.multiply(p.getZ());        r.translate(dx.getX(), dx.getY(), dx.getZ());		r.translate(dy.getX(), dy.getY(), dy.getZ());		r.translate(dz.getX(), dz.getY(), dz.getZ());        return r;    }

and this
    public Point3 multiply(Point3 p) {        Point3 r = new Point3(w);		float xx = p.getX();		float yy = p.getY();		float zz = p.getZ();        Vector3 dx = new Vector3(xx*x.getX(), xx*x.getY(), xx*x.getZ());        Vector3 dy = new Vector3(yy*x.getX(), yy*x.getY(), yy*x.getZ());        Vector3 dz = new Vector3(zz*x.getX(), zz*x.getY(), zz*x.getZ());        r.translate(dx.getX(), dx.getY(), dx.getZ());		r.translate(dy.getX(), dy.getY(), dy.getZ());		r.translate(dz.getX(), dz.getY(), dz.getZ());        return r;    }

By the way is it dy = x.multiply(p.getY()); or dy = y.multiply(p.getY()); ???

If there are no positive results, can you try the sample from the sun page and see if you can get results with that?

Could it be that the profiler is interfering.
Why don't you generate two compiled versions with and without the optimization and then checked the total time of maybe 100,000 thousands of operations.
Hey...


Quote:
But unless you change the code, how would you avoid that allocation hell with C++?


The cool answer is, you just have to remove the "new" keyword and the object is allocated on the stack, not the heap memory. Thats really a cool feature of c++.

This code is perfectly valid (and performant) c++ code

public void rotate(float angle, Vector3 axis) {Matrix3 rotationMatrix = Matrix3();rotationMatrix.loadAxisAngleRotation(angle, axis);rotationMatrix.transformOnlyOrientation(this);}


When the function is called, all local variables are "bundled" together and put
on the stack. Allocating objects this way is a matter of moving the stack pointer a little bit farther. So allocation cost is truly zero. Only initialization matters. When the function returns the stack pointer is moving back and all objects get auto-deallocated. Very easy, clever & powerful.
But also tricky for learners because you can´t return local objects. You have to know what the language does in the background to program in c++. But... its the same with java, if you ask me.

I am not an expert in C# but I think struct has value semantics, and in C# that means pass-by-value.Wouldn't it be a heavy penalty for performance?


Yes it would, if you would pass the whole struct-array by value, but that is a thing you wouldn´t do. You have to copy each struct individual into its flat-memory array cell and thats in principle the same as initializing an object. Potentially it could be faster because you have the memory where the object will take place directly at hand, you preallocated it in one chunk.

Think of an array, where each cell has the size of a whole object, not just the pointer as in java and you copy the objects properties directly into to the cells, not at the location of the pointer you find in the cell

Quote:
Can you try something like this:


Yes sure. But the thing that matters most is abstraction in programming.
It makes the job really many times easier.

In this simple case, it might be bearable to pass only primitive types, but
this really bloats the code and if you ask me, the right abstraction makes
all the difference.

Could it be that the profiler is interfering.Why don't you generate two compiled versions with and without the optimization and then checked the total time of maybe 100,000 thousands of operations.


Yeah... wahhooo. That showed some real results. I switched back to jdk.1.6.0_16
and did a simple time-benchmark with this code:

        Point3 p = new Point3();        for(int i = 0;i <= 24000000;i++) {            matrix.rotate(0.1f * i, new Vector3(1,0,0));            p.translate(matrix.x());        }


It took 18 seconds without the EA flag turned on and 16 seconds with the flag.
Thats quite a difference. Maybe I should do a C++ speed comparison.
I would call this quite a good result. 2 seconds sounds reasonable as a saving when allocations are omitted.

Its true, maybe the profiler interferes. The longer i think about it, the more reasonable it sounds. I believe the profiler does some kind of bytecode instrumentation and maybe replaces every allocation with a call to the profiler.

Just guessing...

Thank you.
Quote:
The cool answer is, you just have to remove the "new" keyword and the object is allocated on the stack, not the heap memory.

I know about stack vs heap. (at least I knew many time ago)
But how are we going to avoid "new" with this code:
Vector3 dx = x.multiply(p.getX());
in this case "new" is hidden inside multiply but it is still there.
has to be there, am I right?
It is more than ten years that I was forced to changed C++ by Java.
(pretty angry at the beginning but very happy now)
That's the reason I am not sure about all those things.

Quote:
Yes it would, if you would pass the whole struct-array by value, but that is a thing you wouldn´t do.

But as soon as the code is a bit complicated performance is hit.
I mean, I know it is very fast to create an array of C#-structs, but after that,
imagine that every element has to be processed been passed to a couple of functions,
and maybe calling another one inside these...
then performance is hit or you are forced to small structs...
or maybe you can use some C# trick I am not aware of.
Any thoughts?

About the code, Oh I didn't want you to change the code permanently, it was just an attempt to discover the behaviour of the escape analysis performance enhancement.

And, I am really happy we catch the bug.
It has been a funny hunting.

Also, I have another curiosity: what's the difference in performance between the client and the server JVM?
Quote:Original post by ppgamedev
I mean, I know it is very fast to create an array of C#-structs, but after that,
imagine that every element has to be processed been passed to a couple of functions,
and maybe calling another one inside these...
then performance is hit or you are forced to small structs...
or maybe you can use some C# trick I am not aware of.


Arrays are passed by reference. Structs are inside array. In Java, this isn't possible, unless dealing with primitive types.

Comparison.

Quote:what's the difference in performance between the client and the server JVM?

Last I checked, server is tuned for longer running tasks, and takes longer to invoke certain optimizations, so it gets a better "big picture". In general, server is slightly faster, but nothing drastic. The GC might also be tuned slightly differently. Escape analysis might exist only in server VM for now.
Quote:Original post by Antheus
Quote:Original post by ppgamedev
I mean, I know it is very fast to create an array of C#-structs, but after that,
imagine that every element has to be processed been passed to a couple of functions,
and maybe calling another one inside these...
then performance is hit or you are forced to small structs...
or maybe you can use some C# trick I am not aware of.

Arrays are passed by reference. Structs are inside array. In Java, this isn't possible, unless dealing with primitive types.

Are C#-structs your golden hammer?
Hi Antheus, I don't know you personally and maybe you are brilliant.
But it seems you just ignore what I say and just keep praising C#-structs.

Anyway, this could be another problem
	struct Foo { 		public int x; 	}	int modifyFoo(Foo foo) {		int previousX = foo.x;		foo.x = 2 * foo.x;		return previousX;	}	Foo[] foos = new Foo[10];	foos[1].x = 10;	Console.WriteLine("before reset: {0}", foos[1].x);	modifyFoo(foos[1]);	Console.WriteLine("after reset: {0}", foos[1].x);

What's the result from this?
I guess it is 10 and 10.
Value objects are nice, but entities are also.

Quote:Original post by Antheus
Quote:Original post by ppgamedev
what's the difference in performance between the client and the server JVM?

Last I checked, server is tuned for longer running tasks, and takes longer to invoke certain optimizations, so it gets a better "big picture". In general, server is slightly faster, but nothing drastic. The GC might also be tuned slightly differently. Escape analysis might exist only in server VM for now.

I was (obviously?) referring to Frederick's code. That particular case.
Also, what I understand from Sun docs is that "escape analysis" is a compile time improvement, completely independent of the VM.



Hey Frederick

Have you done a C++ comparison?

If you do it don't forget to post the results!!
Quote:
Hey Frederick

Have you done a C++ comparison?

If you do it don't forget to post the results!!


Hey !!!

Ah, nooo, maybe that comes later, but i did another interesting test:

Remember my function looked like this:

public void rotate(float angle, Vector3 axis) {   Matrix3 rotationMatrix = Matrix3();   rotationMatrix.loadAxisAngleRotation(angle, axis);   rotationMatrix.transformOnlyOrientation(this);}



which was called like this:
        Matrix3 matrix = new Matrix3();        Point3 p = new Point3();        for(int i = 0;i <= 24000000;i++) {            matrix.rotate(0.1f * i, new Vector3(1,0,0));            p.translate(matrix.x());        }



So basically, we have an allocation of Matrix3 (which has four subobjects x,y,z,w by the way) per loop iteration - right ?

Running time without EA 59 seconds - with EA enabled 52 seconds. I would call that a good improvement, but the exciting thing comes now:

I tried this also:

        Matrix3 matrix = new Matrix3();        Point3 p = new Point3();        Vector3 axis = new Vector3(1,0,0); //Moved all allocations out of the loop        Matrix3 rotationMatrix = new Matrix3();        for(int i = 0;i <= 48000000;i++) {            //Only assignments in here            rotationMatrix.loadAxisAngleRotation(0.1f * i, axis);            rotationMatrix.transformOnlyOrientation(matrix);            p.translate(matrix.x());        }


Now guess the running time !? 55 seconds with EA disabled and 52 seconds with EA enabled. Well I have only one word for this: MAGIC. Hotspot is real magic.
I am happy and confident now that I can walk far with java. It is a super advanced modern system by now and should get credit for that.

C++ test will follow If I am not lazy. I would guess 42 seconds or something, lets see ;-)

Yes, Antheus is right a flat memory layout would be a great addition to handle massive collections of objects, like vertices, indices and so on. I hope some game and media oriented people will add this to the jdk. It would also make communication with c++ easier. By now I stuff everything into ByteBuffers, which are wrapped into objects on the c++ and the java side, that is a hacky workaround, but I hardly notice, what is going on behind, when I use the wrappers.


So I am really happy I have no need to port my project and can happily stay. The only thing is I would like to have a reliable way to find out when allocations are omitted and when not, to get more control over that, so if anybody has a good idea I am happy to hear about.

Thank you all for your great participation, I don`t think I would have drawn this conclusions by myself (-;

Frederick

This topic is closed to new replies.

Advertisement