Quote:Original post by TerrorFLOP It uses doubles and I know converting doubles to floats can be not only inefficient but precision losses out too.
In C++, sqrt is overloaded for floats and long doubles as well.
In C99 there are float and long double versions called sqrtf and sqrtl, which some compilers that don't support C99 provide as standard library extensions.
As has been mentioned twice already, you shouldn't be using the VC++ 6.0 compiler anymore. It's officially a defunct product with poor C++ standard compliance, among a whole heap of other things wrong with it. Time to upgrade; if it must be an MS compiler, go for 7.1 or above, nothing below that. You have no excuse, as the VC++ 7.1 compiler is free now anyway.
Especially on the VC++ 7.1 compiler... and it's FREE you say???
Kool!!!
In response to Extrarius:
Timing sections of code does NOT tell you what you need to change in order to increase its speed... but it can give you a vital clue. If a timed section of code seems a little slow, perhaps because a certain function call has been made ('sqrt' in my case, say), then at least it gives a clue as to what could be changed in order to speed up that particular section further.
Obviously, to pinpoint where possible bottlenecks may occur, I usually only test very SMALL pieces of code. Testing for bottlenecks in complex code is of course a job for the profiler.
As for overall increases in speed... Well, time-critical means just that. If you can re-write VITAL sections of code to make them run much faster, then that too will speed up the program as it executes the critical loop.
There are times when speed in the code is important, like, say, rendering X amount of 3D objects to the screen per frame. And there are times when it's not so important, like, say, loading data objects off the hard drive (hard drive access is inherently slow anyway).
So testing small portions of time-critical code CAN make a bit of difference to the performance of the overall program, if those portions are replaced by code which runs a little faster. Well, that's my opinion, and that of dozens upon dozens of authors whose books I've read.
Don’t get me wrong however; profilers are certainly useful tools too, especially in large-scale projects… which mine is not.
In response to iMalc (soz guys… newbie on this site so I don’t know how to include your actual quotes…. yet!)
Thx for the code snippet…
Not sure it’s gonna work…. But I’ll run it myself.
BTW In Debug builds, my SIMD routine beats ‘sqrt’ hands down… Only in Release builds does ‘sqrt’ run as fast as the null statement ‘;’
Question… Did you run the code you sent me in Release?
Errrr… Harsh?.... I didn’t mean to come across like that… I’m a nice guy….honest…. NEW (which seems to be a problem here)… but nice.
But sure, let’s NOT patronise…
We’re a community and thus we ain’t always gonna see eye to eye… Human nature I guess, but sure, being civil and professional… even when we disagree… is what I’m all about. So it’s kool.
Finally in response to snk_kid.
Thx man!!! Going to download that sucker right now LOL!
Regarding your original attempt to speed up the square root, Intel SSE2 instructions aren't designed (at least as far as I can see) to do horizontal calculations (i.e. like adding the two values in xmm0 together) efficiently. I'd be amazed if the normal square root instruction is slower than the SSE2 one since internally they probably use the same circuitry (I'm no expert...). However you are calculating the length twice - the squared length ends up in both parts of xmm0 and then you take the square root of each which is an obvious waste. Your best chance of a speedup with SSE2 would be to rewrite your code to do the same operations on two vectors at a time (that's what SSE was designed to do), which should give you around a 2x speedup if your code is perfectly suited to it (which it probably isn't). That's a lot of work though because it means changing your data structures all the way through the program.
Quote:Original post by TerrorFLOP [...]So testing small portions of time-critical code CAN make a bit of difference to the performance of the overall program, if those portions are replaced by code which runs a little faster. Well that’s the opinion of myself and dozens upon dozens of authors whose books I’ve read.[...]
You're wrong here, as are whichever authors you're talking about (or, more likely, they would be wrong today, but whenever they wrote so many years ago they weren't as wrong). Without proper whole-program profiling, you won't know what bottlenecks there are. If your program does more than square roots in a loop, you should be profiling it with a proper profiler. Note that custom timing code around made-for-benchmarking code is NOT proper profiling. Profiling is _NOT_ only for million-line programs. Even a decent 100-line program is probably complex enough that you can't accurately predict which parts are the bottleneck and which aren't.
[Edited by - Extrarius on September 14, 2005 1:25:20 PM]
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Maybe I didn't explain myself very clearly. You have your source code and you have two normalize functions, Norm and FastNorm. I assume you're currently using Norm everywhere, so just do a search and replace (with match whole word on) so you use FastNorm everywhere. Compare. Then without having to use a profiler or having to trick VS into not optimizing, you can find out which one is faster.
Well I didn't say timing code WAS proper profiling.
As I said earlier, I tend to test SMALL, i.e. no more than say 30 lines of code. My critical code for my project spans about 19 lines, and has NO other function calls (well, non-inlined ones anyway) apart from 'sqrt'.
I guess I tend to do a 'bottom-up' approach to the code I write. I write it to the best of my ability. Then once I have the bulk of it done, I take out and test very small sections of code which I deem to be time critical. Once I feel that code is the best I can get it, I'll re-introduce it back into the main bulk. Once this stage is complete, then I profile.
As I have a Pentium 4 CPU, I did download Intel's manuals on optimization, as well as its manuals on the General, FPU, MMX and SIMD instruction sets, and sure, optimization is a VERY complex subject due to the disparity between CPU and core memory speeds. So I've read up on caching, instruction re-ordering, data re-ordering and a whole host of other things... It's a complex subject, as are the semantics of compiler technology.
So I certainly don't claim to think that the way I optimize is the best way... I let my compiler handle the bulk of that... But in terms of what algorithms to use, this is how I go about it: timing VERY small sections of code, seeing if they can be improved upon, then going to the next section of time-critical code. After that, profile the whole thing.
We're on the same level Extrarius. I certainly don't claim profiling is a waste of time... I guess I just prefer to optimize at a low level first and then profile after... It’s the way I prefer and since programming is also a matter of style (and there’s plenty of different programming styles out there), then I guess there’s no harm done LOL.
Thx for your input however.
ZQJ… SSE3 can do horizontal calculations, but alas I am stuck with SSE2 (I don't have the latest processor pack at the moment). So you have a valid point; however, SSE2, with a little shuffling, can work out norms pretty quickly.
In Debug builds, SSE2 beats the compiler in calculating vector norms hands down. Don't believe me? I can send you the code if you wish.
P.S. Can someone tell me how to include member's quotes?
I have my Norm function in one place... and one place only....
Mmmm... Perhaps I need to show this forum the code. That would clear up a HELL of a lot of confusion.
Anyway, when I come to test Norm and FastNorm... I run them in an empty test program... Totally isolated from the rest of the main program.... Simply to see which one runs best... As I told Extrarius, I only ever test SMALL (no more than 30 lines) of code.
The CPU has to do a lot of caching of instructions and data, so I only use small sections of code to ease the work of the CPU and (hopefully) obtain an accurate code time.
Quote:Original post by TerrorFLOP In Debug builds, SSE2 beats the complier in calculating vector norms handsdown. Don't believe me? I can send you the code if you wish.
I don't think anybody would seriously doubt you on that. But profiling in debug mode tells you absolutely nothing. So sqrt is slow when it's not inlined and you take away all the compiler's optimisation possibilities? So what? That's not how it's used! It's like picking an Olympic running team based on their swimming times.
Quote:P.S. Can someone tell me how to include member's quotes?
I only used the Debug mode argument to make a point that SSE2, despite a lack of horizontal operations (save shuffling) can do vector norms pretty fast.
And since I timed both SSE2 norms and sqrt norms BOTH in debug builds (so that the SSE2 function is also non-optimized), I reckon that’s a fair test.
Checked the FAQ... Nowt there mate...
So perhaps you could tell me how to include members quotes?
Quote:Original post by TerrorFLOP [...]Anyway, when I come to test Norm and FastNorm... I run them in an empty test program... Totally isolated from the rest of the main program.... Simply to see which one runs best... As I told Extrarius, I only ever test SMALL (no more than 30 lines) of code.[...]
And we're saying that is the wrong way to time code. When you use the code for real, you have things like the cache and hundreds of other factors that affect the time the code takes to run. When you make fake code specially for benchmarking, you get rid of all the important factors and test the ones that don't really matter.
Quote:Original post by TerrorFLOP [...]And since I timed both SSE2 norms and sqrt norms BOTH in debug builds (so that the SSE2 function is also non-optimized), I reckon that’s a fair test.[...]
Inline assembly code is never optimized by VS6 (or any other compiler as far as I'm aware, though you can let gcc choose some aspects of your inline code for optimal performance), so you're testing your assembly code exactly as it is vs a slow sqrt in debug mode and vs a regular sqrt in release mode.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk