# Performance hit from primitive casts?

This topic is 3720 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hello all. I wouldn't think this would be an uncommon question, but I can't find any information on it in google or in the forums archives, so either this really is a non-issue or I'm just not using the right search terms. Basically, I'm curious about casting performance, specifically from integer types to floating types and back. What exactly happens when you do a cast from one primitive type to another (assume low level languages, like C or D)? Do modern x86 processors have instructions that can convert register values natively, or does the compiler add in a few extra opcodes to do the conversion first? How about for lower-level CPUs such as ARM? Either way, is there any performance hit from casting? If so, does anybody know any ballpark figures for extra cycles required before the cast instruction(s) is/are retired? Any and all information would be greatly appreciated. This is a topic I've long been curious about. Anytime performance would be an issue I try to plan my data structures in such a way as to prevent as much casting at runtime as possible, but I'm not even sure if this is even a big enough deal to worry about, as I can find so little information on the subject.

##### Share on other sites
In general, if you're interested in what your compiler is doing under the hood, you can ask it. For example, if you compile this in MSVC 7.1 with the /FA switch:
```
int f(float b) {
    return b;
}

float g(int b) {
    return b;
}
```

You get:
```
_TEXT	SEGMENT
_b$ = 8						; size = 4
?f@@YAHM@Z PROC NEAR					; f, COMDAT
; Line 2
	fld	DWORD PTR _b$[esp-4]
	jmp	__ftol2
?f@@YAHM@Z ENDP						; f
_TEXT	ENDS
PUBLIC	?g@@YAMH@Z					; g
EXTRN	__fltused:NEAR
; Function compile flags: /Ogtpy
;	COMDAT ?g@@YAMH@Z
_TEXT	SEGMENT
tv65 = 8						; size = 4
_b$ = 8						; size = 4
?g@@YAMH@Z PROC NEAR					; g, COMDAT
; Line 6
	fild	DWORD PTR _b$[esp-4]
	fstp	DWORD PTR tv65[esp-4]
	fld	DWORD PTR tv65[esp-4]
; Line 7
	ret	0
?g@@YAMH@Z ENDP						; g
_TEXT	ENDS
END
```

Other compilers have similar switches for generating assembly output; with gcc, for example, you can use the -S switch.

(Of course, since these are function calls rather than inline casts, the results will be different than what happens if you do a cast inside a function. However, you can generate the code for that and look yourself what happens in specific cases.)
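For example, a throwaway test file along these lines (the function names here are made up for illustration) can be compiled with /FA or -S to see what an inline cast turns into, as opposed to one at a call boundary:

```c
/* Hypothetical test file; compile with `cl /FA /O2` or `gcc -S -O2`
   and inspect the generated assembly. */
float int_to_float(int i)   { return (float)i; }
int   float_to_int(float f) { return (int)f;  }

int use_inline(float f) {
    /* a cast in the middle of other work may be optimized differently
       than one that feeds straight into a return value */
    int n = (int)f;
    return n * 2;
}
```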

##### Share on other sites
Thanks for the help. So it appears that on your computer at least (I'm assuming it's a fairly recent x86), integer to float is practically free, but float to integer requires a library routine to do the conversion. I'm not really sure about the second one, though; my asm reading skills aren't the best.

##### Share on other sites
Most ARMs don't have FPUs, so anything with floating point is done in software. I've never used an ARM that did have one, so I don't know what they support; look in the manual.

On processors that do have FPUs many have instructions to convert between fp/int, but it's not a guarantee, look in the manual. Many also have instructions to convert between single/double fp.

Most processors that support more than one int type also have instructions to sign or zero-extend an int type into a larger int type, but it's not a guarantee, look in the manual.

As for the performance/latency/throughput of such instructions... look in the manual.

##### Share on other sites
Quote:
 Original post by kuroioranda
Thanks for the help. So it appears that on your computer at least (I'm assuming it's a fairly recent x86), integer to float is practically free, but float to integer requires a library routine to do the conversion. I'm not really sure about the second one, though; my asm reading skills aren't the best.

The compiler is being either obtuse or pedantic in that case; x86 has the fist[p] instruction to convert float to int, but that was probably compiled without optimization and/or with strict float semantics, so it generates a call to a library function like _ftol. I don't remember if fist[p] completely adheres to the standard's float-to-int semantics, so that might be why.

##### Share on other sites
Release build, optimized for speed (/O2), with default floating point consistency and intrinsics enabled. Don't ask me why it does it either. All I know is that's what it says it does.

##### Share on other sites
Outrider>
Thanks, that was exactly the sort of info I was looking for!

SiCrane>
What opcode set are you compiling to? I haven't used Visual C++ since version 6, but I know that with gcc you can specify which processor instruction sets to include. For example, I think the default is 386 compatibility, in which case it won't include instructions specific to the 486 and above. It's possible that the conversion instructions only showed up in later CPUs (the 486 would be the earliest possible, since the 386 didn't have an on-chip FPU). I'll try compiling that example later when I get home, targeted at a higher x86 CPU.

##### Share on other sites
The default for processor sets for MSVC is /GB, which is equivalent to /G6 for MSVC 7.0 and 7.1. /G6 targets the Pentium Pro, Pentium II, Pentium III, and Pentium 4. Explicitly using either /GB or /G6 generates the same code as posted originally. If you crank it up to /G7, it generates:
```
?f@@YAHM@Z PROC NEAR					; f, COMDAT
; Line 1
	push	ecx
; Line 2
	fld	DWORD PTR _b$[esp]
	fnstcw	WORD PTR tv66[esp]
	movzx	eax, WORD PTR tv66[esp]
	or	ah, 12					; 0000000cH
	mov	DWORD PTR tv69[esp+4], eax
	fldcw	WORD PTR tv69[esp+4]
	fistp	DWORD PTR tv71[esp+4]
	mov	eax, DWORD PTR tv71[esp+4]
	fldcw	WORD PTR tv66[esp]
; Line 3
	pop	ecx
	ret	0
?f@@YAHM@Z ENDP						; f
```

Which is still a bit more than just a FISTP, though you can see the op in there.

##### Share on other sites
I did a search for the FIST and FISTP instructions, and they appear to be implemented on the Pentium-class processors (I even found some tests for the 486, but it was unclear whether they were emulated or not). So I honestly have no idea why /G6-optimized code wouldn't be using it.

The only thing I can think of is that because the value is being converted for use as a return value, the overhead of putting it in an FPU register for conversion and then pulling it back out onto the program stack is high enough that it's faster just to do the whole thing in the integer units with magic numbers. Whereas if it were going to be used for arithmetic, putting it on the FPU stack might be worth the cost of pulling it back out again.

##### Share on other sites
Hence the caveat I added to my original post:
Quote:
 (Of course, since these are function calls rather than inline casts, the results will be different than what happens if you do a cast inside a function. However, you can generate the code for that and look yourself what happens in specific cases.)

##### Share on other sites
Haha, yeah, I read that. I was just musing out loud about WHY that would be. Best I can do when I'm at work without access to a disassembler.
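For reference, the "magic numbers" approach mentioned above usually refers to a trick along these lines (a sketch only: the helper name is made up, and it assumes IEEE-754 doubles, a little-endian integer layout, the default round-to-nearest mode, and inputs well within 32-bit range; note it rounds to nearest rather than truncating like a C cast):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper. Adding 2^52 + 2^51 pushes the integer part of x
   into the low mantissa bits of the double, so those bits can be read
   back directly without any FPU->integer conversion instruction. */
static int32_t magic_ftoi(double x) {
    double shifted = x + 6755399441055744.0;  /* 2^52 + 2^51 */
    int64_t bits;
    memcpy(&bits, &shifted, sizeof bits);     /* reinterpret the double's bits */
    return (int32_t)bits;                     /* low 32 bits hold the result */
}
```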
##### Share on other sites
Quote:
 Original post by SiCrane
The default for processor sets for MSVC is /GB, which is equivalent to /G6 for MSVC 7.0 and 7.1. /G6 targets the Pentium Pro, Pentium II, Pentium III, and Pentium 4. Explicitly using either /GB or /G6 generates the same code as posted originally. If you crank it up to /G7, it generates:

```
?f@@YAHM@Z PROC NEAR					; f, COMDAT
; Line 1
	push	ecx
; Line 2
	fld	DWORD PTR _b$[esp]
	fnstcw	WORD PTR tv66[esp]
	movzx	eax, WORD PTR tv66[esp]
	or	ah, 12					; 0000000cH
	mov	DWORD PTR tv69[esp+4], eax
	fldcw	WORD PTR tv69[esp+4]
	fistp	DWORD PTR tv71[esp+4]
	mov	eax, DWORD PTR tv71[esp+4]
	fldcw	WORD PTR tv66[esp]
; Line 3
	pop	ecx
	ret	0
?f@@YAHM@Z ENDP						; f
```

Which is still a bit more than just a FISTP, though you can see the op in there.

Incidentally, the above first masks overflow and zero-divide FPU exceptions in the control word, then does fistp, then restores the previous FPU control word. So maybe you have FPU exceptions enabled.

Quote:
 Original post by kuroioranda
I did a search for the FIST and FISTP instructions, and they appear to be implemented on the Pentium-class processors (I even found some tests for the 486, but it was unclear whether they were emulated or not). So I honestly have no idea why /G6-optimized code wouldn't be using it.
The only thing I can think of is that because the value is being converted for use as a return value, the overhead of putting it in an FPU register for conversion and then pulling it back out onto the program stack is high enough that it's faster just to do the whole thing in the integer units with magic numbers. Whereas if it were going to be used for arithmetic, putting it on the FPU stack might be worth the cost of pulling it back out again.

If that were the case, it wouldn't convert int->float through the FPU either, but as you can see it does exactly that with the fild instruction.

But then again it immediately stores the float back on the stack and then loads it again for no reason. So like I said earlier, sometimes compilers are just dumb.

##### Share on other sites
Quote:
 If that was the case it wouldn't convert int->float using that method, but as you can see it does just that with the fild instruction.

I agree that's odd, but you're assuming (or maybe you know :) ) that fild and fist take a comparable number of cycles to do their thing. If that's not the case, is it possible that the compiler takes that into account when optimizing?

Bear with me, I know very little about compiler design, but am always eager to learn when I can :).

##### Share on other sites
The reason _ftol, _ftol2, and equivalent inlined code fiddle with the FPU control word is to save, change, and restore the FP rounding mode.

This isn't because of exceptions; it's because ANSI C requires a specific rounding mode (truncate toward zero) for float->int, but the FPU may be in a different mode (and it is by default, IIRC).

The /QIfist compile option will persuade the (MSVC) compiler to just use the plain [and much cheaper] fld/fistp sequence, but obviously it won't do the 'correct' ANSI C thing if the FPCW rounding mode is set to something other than truncate.

##### Share on other sites
You're right, it's setting bits 10,11 through AH for rounding. For some reason I read his example as setting bits 2,3 through AL for overflow and zero-div.

##### Share on other sites
Quote:
 Original post by S1CA
The reason _ftol, _ftol2, and equivalent inlined code fiddle with the FPU control word is to save, change, and restore the FP rounding mode. This isn't because of exceptions; it's because ANSI C requires a specific rounding mode (truncate) for float->int, but the FPU may be in a different mode (and it is by default, IIRC). The /QIfist compile option will persuade the (MSVC) compiler to just use the plain [and much cheaper] fld/fistp sequence, but obviously it won't do the 'correct' ANSI C thing if the FPCW rounding mode is set to something other than truncate.
Yep that's it. It's all about the rounding mode.
btw /QIfist is deprecated in VS2005, though it still works if you add it to the C/C++ Command Line additional options. I use it in my software 3D renderer.

I don't think anyone has said this explicitly, but float to int is always relatively slow, even if you only use fistp. Float-to-int conversions are best avoided when you can.

##### Share on other sites
This seems to fit into this topic:

I once tried to get my executable's size down as much as possible and removed all kinds of default libraries.

In the end the helper function _ftol2 was an unresolved external, and I linked its .obj into the executable.

That's when I noticed that the compiler simply inserts a call to that function when I do a float->int cast.

Is there actually a way to turn this off or is this fixed standard behaviour? Just wondering.
