Optimization Puzzle

Started by
56 comments, last by MaulingMonkey 18 years, 7 months ago
Thx MrRowl.

Don't worry... I know what I'm doing with regards to my programming model.

I have to use double precision here as I'm doing a physics project based on star cluster formations.... So the numbers can get pretty big pretty quickly and I want to do the simulation using 'real' quantities.

I've been playing with SIMD for a few years now and your right, most of the time the complier does such a fine job at optimizations, that using SIMD yourself is a waste of time in MOST... But not all... cases. (Nowt wrong in experimenting of course... That’s how coders are supposed to come up with new ideas)

I already have a C / C++ version of the central time critical code up and running and its pretty fast... But because its time critical (i.e. the code needs to update the positions and velocities of 1000's of stellar objects each frame)... It does not hurt to see if you CAN make the code run just a little bit faster.

Having read SEVERAL books on the apparent slowness of 'sqrt', I thought I'd try a simple SIMD routine to see if I could beat it. The results remain inconclusive as the complier has a habit of totally altering the code you write.

There are #pragmas, directives and other coding constructs which can alter the behaviour of the complier... Which is why I approached this forum, to see if anyone knew how to stop the complier performing Loop Invariant Computations (the details of which are described above).

My central problem is not one of optimization per se... But more of a complier issue.
Advertisement
Quote:Original post by TerrorFLOP
If you've nothing constructive to say man... THEN DON'T SAY!!!


I don't like the way you talk to people here. Everyone is just trying to help you in solving your problem, mate. ;)

___Quote:Know where basis goes, know where rest goes.
If that help ain't constructive... Then I don't wanna hear it.

Clapton... You read the moderators guidelines?
Firstly, insulting people like Dmytry who have ratings of 1597 is silly, because they're obviously helpful people and are trying to be constructive. They don't get those ratings by being jerks.

For testing, why not just use your actual code? You have two functions to test, so either use a simple search and replace, or create another function like this to use everywhere:

inline float TestSqrt(float x){  // Uncomment to test either one  // return SSE2Sqrt(x);  // return sqrt(x);}


This way you'll know how it performs in your actual code.

tj963

[Edited by - tj963 on September 13, 2005 8:17:23 AM]
tj963
Quote:Original post by TerrorFLOP
Thx MrRowl.

Don't worry... I know what I'm doing with regards to my programming model.

I have to use double precision here as I'm doing a physics project based on star cluster formations.... So the numbers can get pretty big pretty quickly and I want to do the simulation using 'real' quantities.


i dont see what the size of your numbers has to do with using floats or not. do you really expect your floats to overflow? i doubt it.

also i do not really think the mantissa needs to be of double size. if anything i would consider using fixed point numbers for recording your positions, as they give uniform precision, instead of floats which concentrate their precision around the origin, which is quite undesirable for this application.

anyway, while im fine with you experimenting with your code, i think it is a waste of time. sqrt is slow, but that doesnt mean it is a bottleneck or that you can improve it.

optimisation of algorithms is probably much more important here. tell me, what approach do you wish to take for your simulation? just a naive one, which scales quadraticly with the amount of masses you want to simulate wont produce very interesting resuls given the amount of masses youll be able to get on screen, unless you have some sort of supercomputer. if i recall correctly there are methods, not too hard to implement, that can reduce timecomplexity to n*log(n)
As I said, if you want to benchmark something, test your actual code. Anything else is useless, especially if you don't know how optimizations works. Speed of code _always_ depends to context.
You can speed up square root with SSE only if you use low precision square root (so you lose precision). There probably is library function doing fast square root, and intrinsics for that. Any decent compiler can generate SSE code, and for square root aswell.

BTW, to moderators: move such stuff to general programming please. Better yet create forum "premature optimization". It had nothing to do with math or physics, it's just "intel compiler for x86" or "MSVC 7.0 compiler for x86", something entirely platform and compiler-specific.

Anyway: ask on-topic questions to get constructive answer(!). Of course i don't know what exactly you are doing, 'cause you haven't said it(and even if you did, it is too painful to read the text with double linebreaks). If you'd be doing collision detection, you coulda use squared radius.

You should ask: how do I speed up the n-body problem . That would really be related to math and physics.
Simplest optimization: instead of normalizing if compiler is bad it _might_ be faster to use this:
F=(D2-D1)*(G*m1*m2/(squared_length(D1-D2)*length(D1-D2)));// totally equivalent.
so you get one division less (and division is often as slow as square root, for bad compiler, effect of above is same as if square root would be instant. If normalizing code were writted badly and on bad compiler, you might even get 3 divisions less (if you divided x,y,z)). I'm assuming you already are reusing forces so you compute force only n*(n-1)/2 times for n bodies.

There's really many acceleration techniques, most are based on not computing so many forces. For example, you do not really need to compute force between every pair of stars. Usually, grid-based methods is used. General idea is to replace bunch of faraway stars with one point placed at their center of mass. That really improves speed. It is not ideally accurate, but there's anyway a lot of inaccuracies there and it is good enough for practical cases.

more about that.

[Edited by - Dmytry on September 13, 2005 10:46:49 AM]
Eelcho...

As I said I'm doing a simulation based on star clusters... I.e. galactic formations.

The numbers DO get big believe me... Especially in terms of MASS and DISTANCE.

I did write my code using floats, but anomalies crept in due to the floats 'limited' range:

1.2e-38 to 3.4e38

Doubles on the other hand were designed for high precision maths, especially for use in scientific applications. Here I can let my simulation run for LONG periods of time as the range and accuracy of calculations are vastly improved:

Range of doubles: (in addition to double precision)

2.2e-308 to 1.8e308

The application runs fine using doubles, able to simulate the gravitational (or electrical if one so wishes) interactions of up to 10,000 objects simultaneously.

One CAN go beyond 10,000... But even on a release build, the render rate slows down drastically... Not surprising really, since 10,000 objects equates to around 99,990,000 calculations per render (when self interactions are excluded).

The simulation runs fine... I'm using point pixels to simulate the objects for now (and for speed tests) and, depending on the objects initial conditions, it can simulate spiral galactic formations, star system formations in which lighter objects orbit a heavier one... And, if you really wanna crank up the mass on a particular object, you can even simulate crude accretion disks of black holes.

Don't wanna drag on... But I plan to add further features such as bounding the area the objects move in... Thus simulating bounded particles in a box (assuming the objects masses are set to the atomic scale and the force constant is altered accordingly)

One can even change the polarity of the force...

Anyway... getting ahead of myself here...

It’s all done in DirectX for speed considerations, but I'm a greedy programmer who’s always seeking more speed if possible (speed is good).

So, in order to test speeds, I need to time code. In order to time code, I need my complier to behave during the timing process, so accurate readings can be recorded.

I’ve timed plenty of code using timing algorithms I wrote years ago, but in my interest to test just how slow ‘sqrt’ is, I get erroneous results, even though plenty of books... Perhaps you may have read some of them... Perhaps not, say that 'sqrt' IS a slow function...

It’s NOT an emergency... The code works (despite the fact that I don't have a supercomputer)... I'm just in the stages of tweaking the math so that it runs optimal.

In order to TEST optimizations... I need to suppress certain features on my complier.

I have a few books on algorithms which is a topic unto itself… But as yet I am unsure as to turn a O(N^2) algorithm into a much desired O[N*log(N)] algorithm. All objects MUST interact with each other in order to realistically simulate the effects of a force field…

Well... That’s my physicist brain talking... But if you have any ideas… Plz let me know.
tj963 thx for your input.

Simplistic perhaps... But thx for your CONSTRUCTIVE input.

Not sure exactly how your code snippet helps me.

My question is really about the MS VC++ 6.0 complier itself, which can be altered if you wish to tweak your code or try some experimentation... Which is exactly what I am trying to do...

Thx Anyway... Be civil to others and you shall receive in kind!
Quote:Original post by TerrorFLOP
Dmytry...

I find your lack of IQ... Disturbing to say the least.

No wonder there has been posts from the moderators about peeps like you.

If I wish to time my code... Then that is entirely my business.

I've timed code before... And I shall do so... AGAIN.

There’s even code in the online MSDN (and several other books) which shows you how to implement code timing logic and I've implemented some of that code into my own timing routines. If you have a problem with other coders trying to time and experiment with their code... Then you have a serious problem dude.

With regards to the 'radius squared' issue... You have NO idea why I would choose to norm vectors do you?

Go check out Newton's Law Of Gravitation... The VECTOR version (since force IS a vector)... And you'll realize that the famous law utilizes norm cubes.

Anyway, WHY I'm I trying to explain myself to you.

If you've nothing constructive to say man... THEN DON'T SAY!!!

I'm sure the rest of the civilized world will get by just fine without your 'insights' (if indeed you can call it that).

Cut it out with the flaming or I will ban you. Hugs!

Quote:Original post by TerrorFLOP
As I said I'm doing a simulation based on star clusters... I.e. galactic formations.


What's the purpose of the simulation? It's not so uncommon for simulations to run for days. Usually, you don't render and simulate at the same time, but record data and use that to do the rendering later.

Quote:
I’ve timed plenty of code using timing algorithms I wrote years ago, but in my interest to test just how slow ‘sqrt’ is, I get erroneous results, even though plenty of books... Perhaps you may have read some of them... Perhaps not, say that 'sqrt' IS a slow function...


The compiler is probably doing the same optimization you're trying to do. You know ASM, so your best bet is to check the compiler's output. If it is doing your optimization (or something similar), then you can use the more readable sqrt() and have no speed concerns.

If you want to know, for your own knowledge, just how much slower sqrt() the function (as opposed to sqrt() what the compiler actually does), gcc has a -fno-builtin option. I'd guess that your compiler would have something similar. However, I'd still recommend going with sqrt() if it's just as fast in a release (a.k.a. optimized) build.

Quote:
I have a few books on algorithms which is a topic unto itself… But as yet I am unsure as to turn a O(N^2) algorithm into a much desired O[N*log(N)] algorithm. All objects MUST interact with each other in order to realistically simulate the effects of a force field…


I'm fairly new to the subject, but it's not necessary for all objects to interact in order to realistically simulate a force field. Using a truncated potential and constructing a neighbor list could help with computation times while still providing results within tolerances.

This topic is closed to new replies.

Advertisement