TerrorFLOP

Optimization Puzzle


Recommended Posts

Ah, sorry, linked to wrong FAQ. Try this one.

And no, testing two things against each other in debug is not a fair test. If your SSE code is written using inline assembler then the release build will contain the inline assembler you specified, after some possible register optimisations. In debug build it will contain the inline assembler you specified possibly minus some register optimisations. A sqrt call on the other hand in release can be inlined, can have common subexpression optimisations performed, can have those same register optimisations and probably more that I can't think of right now. In debug mode you lose most if not all of those optimisation possibilities. The results really do tell you nothing about the relative speeds even if you profile both items in debug mode.
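To make the common-subexpression point concrete, here is a hypothetical illustration (not from the original post): in a release build the optimizer can inline `std::sqrt` and fold the two identical calls into one; in a debug build each stays a genuine library call.

```cpp
#include <cmath>

// Hypothetical example: with optimisations on, the compiler may inline
// std::sqrt and compute the repeated subexpression only once (common
// subexpression elimination). In a debug build both calls remain real
// function calls, so timing this in debug says nothing about release.
float twice_root(float a) {
    return std::sqrt(a) + std::sqrt(a);
}
```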

Enigma

Quote:
Original post by TerrorFLOP
ZQJ… SSE3 can do horizontal calculations, but alas I am stuck with SSE2 (I don't have the latest processor pack at the moment). So you have a valid point, however, SSE2... with a little shuffle, can work out norms pretty quickly.


I'm sure SSE2 can do a pretty quick calculation with that ASM. But my point is unless the shuffle instruction is free (I really don't know anything about x86 cycle counts), I don't see why it would be any faster than the normal FP version because you only save one calculation instruction and you gain the shuffle.

I also see no reason why sqrt should be so dreaded - if you've got to do a square root you've got to do it and I don't see why that particular function should be slower than any other (unless it's badly implemented). Can anyone clear this up?

Guest Anonymous Poster
Extrarius Wrote:

Inline assembly code is never optimized by VS6 (or any other compiler as far as I'm aware, though you can let gcc choose some aspects of your inline code for optimal performance), so you're testing your assembly code exactly as it is vs a slow sqrt in debug mode and vs a regular sqrt in release mode.

Agreed. I know the actual asm code is untouched by the compiler; I was really referring to the fact that a function call is made in the debug test, despite it being flagged as inline (which is the usual debug behaviour).

I've taken your points, but still believe in optimizing at a low level, simply to see which time-critical algorithm works best. One way of doing that is to time that code and compare with other possible algorithms.

Here is the MSDN webpage that inspired me (as well as several books) on optimization and profiling in general. Towards the end, benchmarking code is supplied which can be used in your own programs. My own timing routines were based, in part, on this code.

http://msdn.microsoft.com/library/en-us/dnvc60/html/optcode.asp?frame=true

With regard to caching and other issues during the main program run, I don't dispute that either. However, having a simple test to see how fast small sections of your code run can't be a bad idea in my opinion. It simply gives you an idea of which code is best.
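A minimal modern equivalent of that kind of micro-timing might look like this (a sketch using `std::chrono` in place of the article's QueryPerformanceCounter; as discussed above, such loops only hint at relative speed and are no substitute for profiling the real program):

```cpp
#include <chrono>
#include <cmath>

// Time one million sqrt calls and return the elapsed microseconds.
// The volatile sink stops the optimizer from deleting the whole loop.
double time_sqrts() {
    volatile float sink = 0.0f;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 1; i <= 1000000; ++i)
        sink += std::sqrt(static_cast<float>(i));
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```

Comparing two candidate routines means calling a harness like this once per routine, on the same machine, in the same build configuration.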

Hey TFlop. Just ran into this thread. Oddly enough, it seems I've written exactly the same program you're trying to write. The part in question looks like this using SSE intrinsics:

// this is the hot-spot
// rsqrt with one iteration of newton raphson approximation
const F32vec4 half(0.5f);
const F32vec4 three(3.0f);
F32vec4 tmp = _mm_rsqrt_ps(r);
r = (half * tmp) * (three - (r * tmp) * tmp);
r *= r * r;

// non-SSE is pretty damn slow
//r[0] = 1.0/sqrt((double)r[0]);
//r[1] = 1.0/sqrt((double)r[1]);
//r[2] = 1.0/sqrt((double)r[2]);
//r[3] = 1.0/sqrt((double)r[3]);
//r *= r*r;

Anyway, I've been able to do 8k particles at 10-15Hz on my 2.0GHz P4 laptop (I'd have to check again). I use intrinsics and Verlet integration (which works well because you need to keep the positions around one frame anyway). I didn't use a Barnes-Hut octree since I did it as an SSE exercise. Let's hear it for O(n^2).

The biggest help came in using an array of hybrid SoA/AoS:

struct Vertex4
{
F32vec4 x;
F32vec4 y;
F32vec4 z;
//F32vec4 mass; // this adds a lot of muls in the inner loop
};

...where each Vertex4 holds four point masses. Access to the arrays of Vertex4 is almost perfectly regular, so I haven't been able to get any improvement beyond the automatic prefetching. Looking at the disassembly, the biggest instruction-scheduling problem is that there are big clumps of multiplies (and rsqrt) between clumps of adds. It would be better if these were interleaved, but I don't think there are enough registers to unroll the loop (with 64-bit, there ought to be). Actually, the compiler doesn't do a particularly good job of register allocation with intrinsics, so maybe you can do it with assembly. There are probably tricks you can do to squirrel things away into store buffers and load them back cheaply, but I don't know asm too well. The Intel optimization guides are free, though.
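For readers without the Intel `F32vec4` headers, a scalar stand-in for the inner loop over one Vertex4 might look like this (a hypothetical sketch: `accumulate` and the softening constant are illustrative, not from the post, and unit masses are assumed):

```cpp
#include <cmath>

// Scalar stand-in for F32vec4: four floats processed together (the
// hybrid SoA layout described above).
struct Vertex4 {
    float x[4], y[4], z[4];
};

// Accumulate the inverse-square attraction of the four point masses in
// `src` (unit mass each) on the point (px, py, pz). The 1/r^3 factor
// mirrors the rsqrt-then-cube step (r *= r * r) in the SSE version.
void accumulate(const Vertex4& src, float px, float py, float pz,
                float& ax, float& ay, float& az) {
    for (int i = 0; i < 4; ++i) {
        float dx = src.x[i] - px;
        float dy = src.y[i] - py;
        float dz = src.z[i] - pz;
        float r2 = dx * dx + dy * dy + dz * dz + 1e-6f; // softening, avoids 1/0
        float inv = 1.0f / std::sqrt(r2);
        float inv3 = inv * inv * inv;                   // 1/r^3
        ax += dx * inv3;
        ay += dy * inv3;
        az += dz * inv3;
    }
}
```

The SoA layout pays off because all four `x` components sit contiguously, so the SSE version can load them with one aligned 128-bit read.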

Anyway, enough unstructured babbling. The beer is wearing off, and with it the novelty of looking at my old messy code. Hope you go further than I did with this.

-W

Quote:
Original post by TerrorFLOP
In response to iMalc (soz guys… newbie on this site so I don’t know how to include your actual quotes…. yet!)
Oh all of the things like that are just commands in square brackets e.g.
[quote]Stuff to quote goes here[/quote]
[source]Source code goes here[/source]
etc.
Quote:


BTW In Debug builds, my SIMD routine beats ‘sqrt’ handsdown… Only in Release builds do I get ‘sqrt’ running as fast as the null statement ‘;’

Question… Did you run the code you sent me in Release?
Quote:
Yup that was in release mode.


Errrr… Harsh?.... I didn’t mean to come across like that… I’m a nice guy….honest…. NEW (which seems to be a problem here)… but nice.
Sorry, I was only trying to say that I wasn't being harsh, and not meaning to imply you were. It's all in the past now anyway.

Good luck parallelising your code to do multiple sqrts at once, if you can.

Quote:
Original post by TerrorFLOP
the Win32 API function 'sqrt'

As already mentioned, this is a standard library function, not Win32. This is an important distinction: the Win32 API is not (completely) available on *nix platforms, nor on consoles.
Quote:
is slow, Slow, SLOW...

What makes you think you can do it any faster?
Quote:
So of course I opted to utilize SSE2 technology

Which your compiler (if modern) can probably do just as well as, if not better than, you, should you enable compilation for something more advanced than a base 386 target... (read: enable more optimization flags).
Quote:
in Release builds using MS Visual C++ 6.0.

Of course, there's the problem that MSVC++ 6 isn't exactly modern anymore. But if you've actually run into speed problems (real, experienced ones that you've profiled, not potentially imaginary nonexistent ones) then I'd suggest upgrading rather than replacing code willy-nilly on the basis of "I think it might be slow".
Quote:
But I have a feeling the damn complier is running some optimization routines that somehow beat SSE2 code.

Again, as others have mentioned, this is not a bad thing. To be meaningful you need to test performance gains and losses in a real situation with profiling, rather than in contrived test cases such as your for loop. A modern compiler will likely _still_ beat you by writing the SSE2 assembly code _better_ than you, since that's its entire job (writing assembly code).

Quote:
Something is wrong here, 'cos I be dammed if 'sqrt' can beat SSE2.

Unless of course it is better written in SSE2. Or perhaps using some different assembly extension in the future that you've never even heard of. You don't have to modify your own code if others modify this routine for you.
Quote:
Otherwise, I gotta stick to using the 'sqrt' version to norm all my vectors in a Release build... Suspecting that its faster than SSE2, only 'cos I couldn't time the damn thing properly :-(

You can time "the damn thing" properly in an actual usage scenario. Since you've got this routine in one place, you can use, say, the sqrt version to develop and test with. When you've got something actually running, and if it (as a whole) is too slow, you can break out a profiler to see where the most time is being spent, and optimize the algorithm which calls that section. Once you've done that enough that it isn't helping, you can then optimize the assembly. If that happens to be the sqrt function in your normalization routine, you can swap it out for the assembly version. If there's no noticeable difference, you'll know it has no noticeable impact on your program which version you use - and thus doesn't matter.
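In code terms, that workflow just means keeping one routine behind a fixed signature and only swapping its body if the profiler says so (a hedged sketch; `Vec3` and `normalize` are illustrative names, not from the thread):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// The plain sqrt version you develop and test against. Only if a
// profile of the whole program shows this function to be hot is it
// worth swapping an SSE body in behind the same signature - callers
// never need to change either way.
Vec3 normalize(const Vec3& v) {
    float inv = 1.0f / std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return Vec3{v.x * inv, v.y * inv, v.z * inv};
}
```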
