
Visual C++ 2008 producing very strange code


I was looking at some assembler output and noticed the following. This line of code (a and b are floats):
return a < b ? a : b;
creates the following output:
movss	xmm0, DWORD PTR _a$[esp+8]
movss	xmm1, DWORD PTR _b$[esp+8]
cvtps2pd xmm0, xmm0
cvtps2pd xmm1, xmm1
comisd	xmm1, xmm0
lea	eax, DWORD PTR _a$[esp+8]
ja	SHORT $LN6@main
lea	eax, DWORD PTR _b$[esp+8]
$LN6@main:

I find it very strange that two single-precision floats are first expanded to double precision before being compared. Why isn't comiss used instead of comisd? That would save two instructions and would generally seem far more efficient.

Compilers cannot understand a programmer's context and will rarely produce the best code for every situation.

Obviously, but in this case the compiler knows a and b are floats. Why does it expand them to doubles?

If it understood the intent, it would produce something along the lines of:

movss xmm0, a
movss xmm1, b
minss xmm0, xmm1
movss retval, xmm0
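
For reference, the same thing can be forced from C++ with SSE intrinsics; a minimal sketch (min_sse is just my name for it, and it assumes <xmmintrin.h> and SSE being enabled):

#include <xmmintrin.h>

// _mm_min_ss(x, y) computes x < y ? x : y (returning y when either
// input is NaN), which matches the ternary above exactly.
float min_sse(float a, float b)
{
    float r;
    _mm_store_ss(&r, _mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
    return r;
}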

Perhaps it is willing to trade space for execution performance. In other words, the conversion to double may perform better in hardware than operating on floats.

Why would it? The instruction set supports the exact same instruction in a single-precision variant, and using it would need two fewer instructions (the conversions to double precision).

Most likely the compiler recognizes the pattern and just provides the one solution it knows. The developers might have reasoned that the expansion is practically "free".

Well, it isn't:


#include <iostream>
#include <SFML/System.hpp> // sf::Clock

int main()
{
    float a;
    float b;
    float res;

    std::cin >> a;
    std::cin >> b;

    sf::Clock clock;

    // Version 1: what the compiler emits - expand both floats to
    // double precision and compare with comisd.
    clock.Reset();
    for (unsigned int i = 0; i < 10000000; ++i)
    {
        __asm
        {
            movss xmm0, a
            movss xmm1, b
            movss xmm2, xmm0
            movss xmm3, xmm1
            cvtps2pd xmm0, xmm0 ; expand both operands to double
            cvtps2pd xmm1, xmm1
            comisd xmm1, xmm0   ; double-precision compare
            movss res, xmm2
            ja min_is_a
            movss res, xmm3
        min_is_a:
        }
    }
    float time1 = clock.GetElapsedTime();

    // Version 2: compare the floats directly with comiss.
    clock.Reset();
    for (unsigned int i = 0; i < 10000000; ++i)
    {
        __asm
        {
            movss xmm0, a
            movss xmm1, b
            movss xmm2, xmm0
            movss xmm3, xmm1
            comiss xmm1, xmm0   ; single-precision compare
            movss res, xmm2
            ja min_is_a2
            movss res, xmm3
        min_is_a2:
        }
    }
    float time2 = clock.GetElapsedTime();

    std::cout << res << "\n";
    std::cout << time1 << "\n";
    std::cout << time2 << "\n";
}




produces

0.0515853
0.0381895

on my PC. So the expansion is anything but free (about 35% slower).

It probably doesn't matter, because it's only a few cycles lost and that code won't normally run a billion times per second, but have you tried std::min or an intrinsic min() function?
I'm not using Visual C++, so no idea about that one, but under gcc such functions usually offer an optimal implementation, which is about as good as you can get with (and sometimes better than) hand-written assembly.

Having said that, I gave up going anywhere near assembler quite a while ago, because it isn't really worth it any more. For anything longer than 3-4 isolated instructions, compiler output with full optimization is rarely more than a few cycles slower than what you could code in assembler, and usually as fast or even faster. Also, writing the C++ takes only about 5% of the time, and the code is a lot easier to manage and debug.
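
In case it helps, this is the sort of thing I'd compile and inspect (min_std is just a placeholder name; needs <algorithm>):

#include <algorithm>

// std::min(a, b) is specified as (b < a) ? b : a, so it should also
// boil down to a single comparison (or a minss) in the listing.
float min_std(float a, float b)
{
    return std::min(a, b);
}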

Run that benchmark 10k times and tell us the averages. I trust that you told your compiler to generate optimized code.

Quote:
Original post by thedustbustr
Run that benchmark 10k times and tell us the averages. I trust that you told your compiler to generate optimized code.


Right... and yes.

I ran it 5 times and the differences were in the 5th decimal place. Good enough for me.

Double-precision SSE instructions were only introduced with SSE2, so have you tried generating code for plain old SSE instead? If I know Microsoft right, there's probably some obscure pragma for setting it temporarily...
I'm not saying it's a good solution or anything; in fact, you're almost certainly better off using intrinsics, but it would be interesting to see what the compiler does with it.
Not that it should matter, but it might have something to do with the floating-point precision mode, so try /fp:fast if you're not using it already.
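
The command-line switches at least are documented; something along these lines should give you listings to compare (a sketch - /FA writes an assembly listing next to the object file):

rem plain SSE vs. SSE2 with the relaxed floating-point model
cl /O2 /arch:SSE /FA test.cpp
cl /O2 /arch:SSE2 /fp:fast /FA test.cpp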

For the most part, getting compilers to generate good code reliably is an exercise in frustration, especially when you know just what you want and you're only trying to massage the code into yielding the right instructions. Personally, I trust the compiler to do a decent enough job with the bulk of the code, check the assembly listings once in a while to see what patterns result in good code, and hand-code anything critical in assembly language.
After all, it's the mundane branchy logic code you want help with; if you know enough to get the compiler to generate fast code for SIMD inner loops, then you shouldn't have any trouble writing them in assembly language either (or intrinsics, but I view those as assembly language with automatic register allocation).

Quote:
Original post by implicit
Double-precision SSE instructions were only introduced with SSE2, so have you tried generating code for plain old SSE instead? If I know Microsoft right, there's probably some obscure pragma for setting it temporarily...
I'm not saying it's a good solution or anything; in fact, you're almost certainly better off using intrinsics, but it would be interesting to see what the compiler does with it.
Not that it should matter, but it might have something to do with the floating-point precision mode, so try /fp:fast if you're not using it already.


Ok, first off, I tried the std::min and std::max templates. These produce the exact same code. Interestingly, though, the comparison (a < 0) ? -a : a produces a simple comiss, as I would expect in all cases. Anyway, using SSE instead of SSE2 produces plain old x87 FPU code, except for the aforementioned (a < 0) ? -a : a, for which - again - comiss is used! /fp:fast with SSE2 actually does cause the compiler to emit comiss for all three. Still not optimal, as minss/maxss would be better still.
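
Incidentally, for the abs case the comparison can be avoided entirely by masking off the sign bit; a quick intrinsics sketch (abs_sse is my own name for it, and it differs from (a < 0) ? -a : a only for -0.0f and the sign of NaNs):

#include <xmmintrin.h>

// Branch-free fabs: andnot against -0.0f clears just the sign bit,
// so no comparison is needed at all.
float abs_sse(float a)
{
    float r;
    _mm_store_ss(&r, _mm_andnot_ps(_mm_set_ss(-0.0f), _mm_set_ss(a)));
    return r;
}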

Quote:
For the most part, getting compilers to generate good code reliably is an exercise in frustration, especially when you know just what you want and you're only trying to massage the code into yielding the right instructions. Personally, I trust the compiler to do a decent enough job with the bulk of the code, check the assembly listings once in a while to see what patterns result in good code, and hand-code anything critical in assembly language.
After all, it's the mundane branchy logic code you want help with; if you know enough to get the compiler to generate fast code for SIMD inner loops, then you shouldn't have any trouble writing them in assembly language either (or intrinsics, but I view those as assembly language with automatic register allocation).


Usually I don't look at asm code too much, I only noticed this by chance while looking for something else. And yeah, intrinsics are definitely better than inline assembler (also, they are usually compiler independent, albeit not platform-independent)

I wonder what the Intel C++ compiler produces in this situation. Too bad my trial license expired last week.

Quote:
Original post by l0calh05t
[...] /fp:fast with SSE2 actually does cause the compiler to emit comiss for all three. Still not optimal, as minss/maxss would be better still.

Ah, the last bit *might* be the explanation. Have you read the documentation for the different /fp modes? The standard setting is pretty conservative, so it's possible that it forces doubles to be used for intermediate results.

It does look odd though, I agree.

Quote:
Original post by Spoonbender
Ah, the last bit *might* be the explanation. Have you read the documentation for the different /fp modes? The standard setting is pretty conservative, so it's possible that it forces doubles to be used for intermediate results.


Agreed, if a and b were intermediate results (which they are not). And even if that were the case, why do they get expanded to double precision directly after being loaded as single-precision variables? Anyway, it doesn't really matter too much, as it probably won't be much of a bottleneck, but I would have expected better from a modern optimizing compiler.

EDIT:
Just noticed some other, unrelated strangeness... I have a distance function in my own namespace, but when I try to use it, VC++ tries to instantiate the std::distance template, even though I used neither using namespace std; nor using std::distance; anywhere in my code... this is annoying.
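
Presumably it's argument-dependent lookup: if the arguments are types that live in namespace std (iterators, for example), an unqualified call considers std::distance too, no using-directive required. A minimal sketch of the effect (the geo namespace and its signature are made up):

#include <iterator>
#include <vector>

namespace geo
{
    // hypothetical stand-in for my own distance function
    double distance(double ax, double ay, double bx, double by)
    {
        double dx = bx - ax, dy = by - ay;
        return dx * dx + dy * dy; // squared distance, illustration only
    }
}

int main()
{
    std::vector<int> v(3);
    distance(v.begin(), v.end()); // resolves to std::distance via ADL
    geo::distance(0, 0, 3, 4);    // qualifying the call picks mine
}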

[Edited by - l0calh05t on August 3, 2008 5:03:15 AM]

Quote:
Original post by l0calh05t
[...] Anyway, it doesn't really matter too much, as it probably won't be much of a bottleneck, but I would have expected better from a modern optimizing compiler.

True. File a bug on it?

It could be something as simple as you running x86 code in backwards-compatibility mode on an x64 chip, with the compiler actually generating optimal code for the target you chose.

Your hardware uses 80-bit floats anyway. It's got to be converted somewhere.

Quote:
Original post by thedustbustr
It could be something as simple as you running x86 code in backwards-compatibility mode on an x64 chip, with the compiler actually generating optimal code for the target you chose.


No, and I already showed that it isn't optimal.

Quote:
Original post by Deyja
Your hardware uses 80-bit floats anyway. It's got to be converted somewhere.


If the FPU is used, yes. But if SSE is enabled, VC++ uses SSE for most floating-point ops, so the hardware uses 32-bit floats in this case anyway. And as I already pointed out, there is no reason to load two floating-point variables in single precision, convert them to double precision (not 80-bit), and only then compare them.

Quote:
Original post by l0calh05t
[...] I wonder what the Intel C++ compiler produces in this situation. Too bad my trial license expired last week.

If you use Linux you can continue using the Intel C++ compiler for non-commercial use.

Quote:
Original post by Deyja
Your hardware uses 80-bit floats anyway. It's got to be converted somewhere.

No, it doesn't. SSE computes with 64-bit accuracy at most, and x64 code knows nothing beyond the GP and SSE registers. And even in legacy x87 code it wasn't a conversion as such: the 64-bit number was placed in a register and computed at greater accuracy, because they were lazy / wanted a simple algorithm for sin/cos. The number itself was never extended.

BTW, his assembly listing showed SSE registers.

In reality, current hardware may well compute 64-bit floating-point operations with extra internal precision, but it does so behind the scenes: an SSE2 register stays 128 bits wide and the numbers themselves are not expanded.

IIRC Goldberg and people from the standards organization have articles/materials about this topic (look for guard bits and similar stuff).

Quote:
If you use Linux you can continue using the Intel C++ compiler for non-commercial use.


Thanks for the hint, but when I'm not using Windows, I'm using FreeBSD. I've been thinking about buying a student license (much cheaper), but I couldn't find any details on the license terms...
