Double-precision SSE instructions were introduced later, in SSE2, so have you tried generating code for plain old SSE instead? If I know Microsoft right, there's probably some obscure pragma for setting it temporarily...
I'm not saying it's a good solution or anything, in fact you're almost certainly better off using intrinsics, but it would be interesting to see what the compiler does with it.
Not that it should matter but perhaps it might have something to do with the floating point precision mode, so try /fp:fast if you're not using it already.
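For the record, MSVC does expose a pragma for the floating-point model (though, as far as I know, nothing that switches /arch:SSE vs /arch:SSE2 per function). A sketch, MSVC-specific; other compilers will just ignore the unknown pragma:

```cpp
// MSVC-specific sketch: relax the floating-point model for one region,
// roughly what compiling it with /fp:fast would do. The function name
// is made up for this example.
#pragma float_control(precise, off, push)
inline float fmin2(float a, float b)
{
    return (a < b) ? a : b;  // hoping for comiss (or even minss) here
}
#pragma float_control(pop)
```

Whether it actually changes the emitted comparison instruction is another question, but it at least scopes the experiment to one function instead of the whole translation unit.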
For the most part getting compilers to generate good code reliably is an exercise in frustration, especially when you know just what you want and you're only trying to massage the code into yielding the right instructions. Personally I trust the compiler to do a decent enough job with the bulk of the code, check the assembly listings once in a while to see what patterns result in good code, and hand-code anything critical in assembly language.
After all it's the mundane branchy logic-code you want help with, if you know enough to get the compiler to generate fast code for SIMD innerloops then you shouldn't have any trouble writing them in assembly language either (or intrinsics, but I view those as assembly language with automatic register allocation).
Visual C++ 2008 producing very strange code
Quote:Original post by implicit
Double-precision SSE instructions were introduced later, in SSE2, so have you tried generating code for plain old SSE instead? If I know Microsoft right, there's probably some obscure pragma for setting it temporarily...
I'm not saying it's a good solution or anything, in fact you're almost certainly better off using intrinsics, but it would be interesting to see what the compiler does with it.
Not that it should matter but perhaps it might have something to do with the floating point precision mode, so try /fp:fast if you're not using it already.
Ok, first off, I tried the std::min and std::max templates. These produce the exact same code. Interestingly though, the comparison (a < 0) ? -a : a; produces a simple comiss, as I would expect it in all cases. Anyways, using SSE instead of SSE2 produces plain old x87 FPU code. Except for the aforementioned (a < 0) ? -a : a; for which - again - comiss is used! /fp:fast with SSE2 actually does cause the compiler to emit comiss for all three. Still not optimal as minss/maxss would be better still.
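If you want minss/maxss for certain rather than hoping the optimizer gets there, you can spell them out with intrinsics. A sketch (the helper names are made up for this example):

```cpp
#include <xmmintrin.h>  // SSE intrinsics, available in VC++ and GCC alike

// Force minss/maxss by hand instead of relying on std::min/std::max
// optimizing down to them.
inline float min_ss(float a, float b)
{
    // _mm_set_ss loads the scalar into the low lane, _mm_min_ss emits minss
    return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
}

inline float max_ss(float a, float b)
{
    return _mm_cvtss_f32(_mm_max_ss(_mm_set_ss(a), _mm_set_ss(b)));
}
```

Note that minss/maxss pick the second operand when either input is a NaN, so these are not IEEE-strict, which is presumably exactly why the compiler won't emit them without /fp:fast.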
Quote:For the most part getting compilers to generate good code reliably is an exercise in frustration, especially when you know just what you want and you're only trying to massage the code into yielding the right instructions. Personally I trust the compiler to do a decent enough job with the bulk of the code, check the assembly listings once in a while to see what patterns result in good code, and hand-code anything critical in assembly language.
After all it's the mundane branchy logic-code you want help with, if you know enough to get the compiler to generate fast code for SIMD innerloops then you shouldn't have any trouble writing them in assembly language either (or intrinsics, but I view those as assembly language with automatic register allocation).
Usually I don't look at asm code too much; I only noticed this by chance while looking for something else. And yeah, intrinsics are definitely better than inline assembler (also, they are usually compiler-independent, albeit not platform-independent).
I wonder what the Intel C++ compiler produces in this situation. Too bad my trial license expired last week.
Quote:Original post by l0calh05t
Ok, first off, I tried the std::min and std::max templates. These produce the exact same code. Interestingly though, the comparison (a < 0) ? -a : a; produces a simple comiss, as I would expect it in all cases. Anyways, using SSE instead of SSE2 produces plain old x87 FPU code. Except for the aforementioned (a < 0) ? -a : a; for which - again - comiss is used! /fp:fast with SSE2 actually does cause the compiler to emit comiss for all three. Still not optimal as minss/maxss would be better still.
Ah, the last bit *might* be the explanation. Have you read the documentation for the different /fp modes? The standard setting is pretty conservative, so it's possible that it forces doubles to be used for intermediate results.
It does look odd though, I agree.
Quote:Original post by Spoonbender
Ah, the last bit *might* be the explanation. Have you read the documentation for the different /fp modes? The standard setting is pretty conservative, so it's possible that it forces doubles to be used for intermediate results.
Agreed, if a and b were intermediate results (which they are not). And even if that were the case, why would they be expanded to double precision directly after being loaded as single-precision variables? Anyway, it doesn't really matter too much, since it probably won't be much of a bottleneck, but I would have expected better from a modern optimizing compiler.
EDIT:
Just noticed some other, unrelated strangeness... I have a distance function in my own namespace, but when I try to use it, VC++ tries to instantiate the std::distance template, although I used neither using namespace std; nor using std::distance; anywhere in my code... This is annoying.
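That one is most likely argument-dependent lookup, not a compiler bug: if the arguments are iterator class types from namespace std, std is searched even without any using-directive. A minimal sketch (the namespace and names are made up, standing in for the real code):

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

namespace mylib {  // hypothetical namespace standing in for the poster's own
    struct Point { float x, y; };

    inline float distance(const Point& a, const Point& b)
    {
        const float dx = a.x - b.x;
        const float dy = a.y - b.y;
        return dx * dx + dy * dy;  // squared distance is enough for a sketch
    }

    inline std::ptrdiff_t how_far(std::vector<int>& v)
    {
        // Unqualified call. The iterators' types involve std::vector, so
        // argument-dependent lookup adds namespace std to the search and
        // finds std::distance -- no using-directive or using-declaration
        // required. mylib::distance isn't viable for iterators, so the
        // std template gets instantiated instead.
        return distance(v.begin(), v.end());
    }
}
```

So even a fully qualified-looking setup can pull in the std template; qualifying the call (mylib::distance) or renaming the function sidesteps it.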
[Edited by - l0calh05t on August 3, 2008 5:03:15 AM]
Quote:Original post by l0calh05t
Quote:Original post by Spoonbender
Ah, the last bit *might* be the explanation. Have you read the documentation for the different /fp modes? The standard setting is pretty conservative, so it's possible that it forces doubles to be used for intermediate results.
Agreed, if a and b were intermediate results (which they are not). And even if that were the case, why would they be expanded to double precision directly after being loaded as single-precision variables? Anyway, it doesn't really matter too much, since it probably won't be much of a bottleneck, but I would have expected better from a modern optimizing compiler.
True. File a bug on it?
It could be something as simple as you are running x86 code in backwards-compatibility mode on an x64 chip, and the compiler is actually generating optimal code for the target you chose.
Quote:Original post by thedustbustr
it could be something as simple as you are running x86 code in backwards compatibility mode on an x64 chip, and the compiler is actually generating optimal code for the target you chose.
No, and I already showed that it isn't optimal.
Quote:Original post by Deyja
Your hardware uses 80-bit floats anyway. It's got to be converted somewhere.
If the x87 FPU is used, yes, but when SSE is enabled VC++ uses SSE for most floating-point ops, so the hardware uses 32-bit floats in this case anyway. And as I already pointed out, there is no reason to load two floating-point variables in single precision, convert them to double precision (not 80-bit), and only then compare them.
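Incidentally, the (a < 0) ? -a : a pattern doesn't need a comparison at all in the SSE domain; it can be done by clearing the sign bit. A sketch with intrinsics (the helper name is made up for this example):

```cpp
#include <emmintrin.h>  // SSE2, for the integer/float bit-cast

// Branchless fabs for a single float: AND away the sign bit with andps
// instead of emitting comiss plus a branch or a negate.
inline float abs_ss(float a)
{
    // 0x7fffffff keeps the exponent and mantissa, drops the sign bit
    const __m128 sign_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    return _mm_cvtss_f32(_mm_and_ps(_mm_set_ss(a), sign_mask));
}
```

This is the kind of thing a good optimizer will often do on its own for fabsf, but spelling it out guarantees it.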
Quote:Original post by l0calh05t
Quote:Original post by implicit
Double-precision SSE instructions were introduced later, in SSE2, so have you tried generating code for plain old SSE instead? If I know Microsoft right, there's probably some obscure pragma for setting it temporarily...
I'm not saying it's a good solution or anything, in fact you're almost certainly better off using intrinsics, but it would be interesting to see what the compiler does with it.
Not that it should matter but perhaps it might have something to do with the floating point precision mode, so try /fp:fast if you're not using it already.
Ok, first off, I tried the std::min and std::max templates. These produce the exact same code. Interestingly though, the comparison (a < 0) ? -a : a; produces a simple comiss, as I would expect it in all cases. Anyways, using SSE instead of SSE2 produces plain old x87 FPU code. Except for the aforementioned (a < 0) ? -a : a; for which - again - comiss is used! /fp:fast with SSE2 actually does cause the compiler to emit comiss for all three. Still not optimal as minss/maxss would be better still.
Quote:For the most part getting compilers to generate good code reliably is an exercise in frustration, especially when you know just what you want and you're only trying to massage the code into yielding the right instructions. Personally I trust the compiler to do a decent enough job with the bulk of the code, check the assembly listings once in a while to see what patterns result in good code, and hand-code anything critical in assembly language.
After all it's the mundane branchy logic-code you want help with, if you know enough to get the compiler to generate fast code for SIMD innerloops then you shouldn't have any trouble writing them in assembly language either (or intrinsics, but I view those as assembly language with automatic register allocation).
Usually I don't look at asm code too much; I only noticed this by chance while looking for something else. And yeah, intrinsics are definitely better than inline assembler (also, they are usually compiler-independent, albeit not platform-independent).
I wonder what the Intel C++ compiler produces in this situation. Too bad my trial license expired last week.
If you use Linux, you can continue using the Intel C++ compiler for non-commercial use.
Quote:Original post by Deyja
Your hardware uses 80-bit floats anyway. It's got to be converted somewhere.
No, it isn't. SSE computes at 64-bit precision, and x64 code knows nothing besides the GP and SSE registers. Even on legacy x87 it wasn't really a conversion: the hardware loaded the 64-bit number into a register and computed intermediate results at greater precision, because that allowed simple algorithms for sin/cos. It never extended the stored number itself.
BTW, his assembly listing showed SSE registers.
Internally the hardware does use a few extra guard bits while computing each SSE operation, but every result is rounded straight back to 64 bits; the SSE2 register stays 128 bits wide and the numbers are not expanded.
IIRC Goldberg and people from the standards organizations wrote articles about this topic (look for guard bits and similar material).