Performance hit from primitive casts?

Hello all. I wouldn't think this would be an uncommon question, but I can't find any information on it on Google or in the forum archives, so either it really is a non-issue or I'm just not using the right search terms.

Basically, I'm curious about casting performance, specifically conversions from integer types to floating-point types and back. What exactly happens when you cast from one primitive type to another (assume low-level languages like C or D)? Do modern x86 processors have instructions that convert register values natively, or does the compiler add a few extra opcodes to do the conversion first? How about lower-end CPUs such as ARM? Either way, is there any performance hit from casting? If so, does anybody have ballpark figures for how many extra cycles it takes before the conversion instruction(s) retire?

Any and all information would be greatly appreciated. This is a topic I've long been curious about. Whenever performance matters, I try to plan my data structures so as to avoid as much casting at runtime as possible, but I'm not sure this is even a big enough deal to worry about, since I can find so little information on the subject.
In general, if you're interested in what your compiler is doing under the hood, you can ask it. For example, if you compile this in MSVC 7.1 with the /FA switch:
int f(float b) {
    return b;
}

float g(int b) {
    return b;
}

You get:
_TEXT	SEGMENT
_b$ = 8							; size = 4
?f@@YAHM@Z PROC NEAR					; f, COMDAT
; Line 2
	fld	DWORD PTR _b$[esp-4]
	jmp	__ftol2
?f@@YAHM@Z ENDP						; f
_TEXT	ENDS
PUBLIC	?g@@YAMH@Z					; g
EXTRN	__fltused:NEAR
; Function compile flags: /Ogtpy
;	COMDAT ?g@@YAMH@Z
_TEXT	SEGMENT
tv65 = 8						; size = 4
_b$ = 8							; size = 4
?g@@YAMH@Z PROC NEAR					; g, COMDAT
; Line 6
	fild	DWORD PTR _b$[esp-4]
	fstp	DWORD PTR tv65[esp-4]
	fld	DWORD PTR tv65[esp-4]
; Line 7
	ret	0
?g@@YAMH@Z ENDP						; g
_TEXT	ENDS
END

Other compilers have similar switches for generating assembly output; with gcc, for example, you can use the -S switch.
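For instance, a minimal sketch of dumping the assembly with gcc (assuming the two functions above are saved in a file called cast.c, which is just a name made up for this example):

gcc -O2 -S cast.c -o cast.s

The -S switch stops after compilation proper and writes the generated assembly to cast.s instead of producing an object file, and -O2 is there so you see optimized code rather than debug-level output.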

(Of course, since these are function calls rather than inline casts, the results will be different from what happens when you do a cast inside a function. However, you can generate the code for that yourself and see what happens in specific cases.)
Thanks for the help. So it appears that, on your computer at least (I'm assuming it's a fairly recent x86), integer to float is practically free, but float to integer requires a software library call to do the conversion. I'm not really sure about the second one, though; my asm-reading skills aren't the best.
Most ARMs don't have FPUs, so anything involving floating point is done in software. I've never used an ARM that had one, so I don't know what they support; look in the manual.

Of the processors that do have FPUs, many have instructions to convert between floating point and integer, but it's not guaranteed; look in the manual. Many also have instructions to convert between single and double precision.

Most processors that support more than one integer type also have instructions to sign- or zero-extend a smaller integer type into a larger one, but again it's not guaranteed; look in the manual.

As for the performance/latency/throughput of such instructions... look in the manual.
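If you want to see what your particular target does for each of these cases, here's a minimal test file you can feed through the assembly-output switches mentioned earlier (casts.c and the function names are made up for the example):

/* casts.c -- hypothetical test file; compile with /FA (MSVC) or -S (gcc)
   and inspect which instructions your target actually gets. */

float  int_to_float(int i)       { return (float)i;  }   /* int -> float      */
int    float_to_int(float f)     { return (int)f;    }   /* float -> int      */
double float_to_double(float f)  { return (double)f; }   /* single -> double  */
long   short_to_long(short s)    { return (long)s;   }   /* sign extension    */
unsigned uchar_to_uint(unsigned char c) { return c;  }   /* zero extension    */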
Quote: Original post by kuroioranda
Thanks for the help. So it appears that, on your computer at least (I'm assuming it's a fairly recent x86), integer to float is practically free, but float to integer requires a software library call to do the conversion. I'm not really sure about the second one, though; my asm-reading skills aren't the best.


The compiler is being either obtuse or pedantic in that case; x86 has the FIST instruction to convert float to int, but that was probably compiled without optimization and/or with strict float semantics, so it generates a call to some library function, ftol. I don't remember whether FIST completely adheres to the standard's requirements for float-to-int conversion, so that might be why.
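For what it's worth on that last point: C requires float-to-int conversions to truncate toward zero, whereas FIST/FISTP round according to the FPU's current rounding mode, which defaults to round-to-nearest, so a bare FISTP on its own doesn't match the language semantics. A tiny example of where the two would disagree:

#include <stdio.h>

int main(void)
{
    /* The casts must truncate toward zero (C semantics), so this prints
       1 and -1.  A raw FISTP in the default round-to-nearest mode would
       give 2 and -2, which is why compilers either switch the rounding
       mode around the conversion or call a helper routine instead. */
    printf("%d\n", (int)1.7f);
    printf("%d\n", (int)-1.7f);
    return 0;
}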

Release build, optimized for speed (/O2), with default floating point consistency and intrinsics enabled. Don't ask me why it does it either. All I know is that's what it says it does.
Outrider>
Thanks, that was exactly the sort of info I was looking for!

SiCrane>
What instruction set are you compiling for? I haven't used MSVC since version 6, but I know that with gcc you can specify which processor instruction sets to include. For example, I think the default is 386 compatibility, in which case it won't use instructions specific to the 486 and above. It's possible that the conversion instructions only showed up in later CPUs (the 486 would be the earliest candidate, since the 386 didn't have an on-chip FPU). I'll try compiling that example targeted at a newer x86 CPU later when I get home.
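For reference, the kind of command lines involved would look roughly like this (cast.cpp and cast.c are made-up file names; the MSVC /Gx architecture switches are covered in the next reply):

cl /O2 /FA /G7 cast.cpp
gcc -O2 -S -march=pentium4 cast.c -o cast.s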

The default processor target for MSVC is /GB, which is equivalent to /G6 for MSVC 7.0 and 7.1. /G6 targets the Pentium Pro, Pentium II, Pentium III, and Pentium 4. Explicitly using either /GB or /G6 generates the same code as posted above. If you crank it up to /G7, it generates:
?f@@YAHM@Z PROC NEAR					; f, COMDAT
; Line 1
	push	ecx
; Line 2
	fld	DWORD PTR _b$[esp]
	fnstcw	WORD PTR tv66[esp]
	movzx	eax, WORD PTR tv66[esp]
	or	ah, 12					; 0000000cH
	mov	DWORD PTR tv69[esp+4], eax
	fldcw	WORD PTR tv69[esp+4]
	fistp	DWORD PTR tv71[esp+4]
	mov	eax, DWORD PTR tv71[esp+4]
	fldcw	WORD PTR tv66[esp]
; Line 3
	pop	ecx
	ret	0
?f@@YAHM@Z ENDP						; f

Which is still a bit more than just a FISTP, though you can see the op in there. (The FNSTCW/FLDCW pair around it saves the FPU control word, forces the rounding mode to truncation as the cast requires, and then restores the original mode afterwards.)
I did a search for the FIST and FISTP instructions, and they appear to be implemented on Pentium-class processors (I even found some tests for the 486, but it was unclear whether they were emulated or not). So I honestly have no idea why /G6-optimized code wouldn't be using them.

The only thing I can think of is that it's because the value is being converted for use as a return value, and the overhead of running it through the FPU for the conversion and then pulling the result back out into an integer register is high enough that it's faster to just do the whole thing in the integer units with magic numbers. Whereas if it were going to be used for further arithmetic, putting it on the FPU stack might be worth the cost of pulling it back out again.
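(If "magic numbers" is unfamiliar: here's a minimal sketch of that style of conversion, purely as an illustration and not a claim about what __ftol2 actually does. It rounds to nearest rather than truncating, and it assumes the default round-to-nearest mode and a little-endian x86 target.)

#include <stdint.h>
#include <string.h>

/* Illustrative "magic number" double-to-int conversion (hypothetical helper).
   Adding 1.5 * 2^52 lines the value up so that its rounded integer part
   lands in the low 32 bits of the double's mantissa, which can then be
   read back with plain integer loads -- no float-to-int instruction. */
static int32_t double_to_int_magic(double d)
{
    double biased = d + 6755399441055744.0;   /* 1.5 * 2^52 */
    int32_t result;
    memcpy(&result, &biased, sizeof(result)); /* low 32 bits on little-endian */
    return result;
}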
Hence the caveat I added to my original post:
Quote:
(Of course, since these are function calls rather than inline casts, the results will be different from what happens when you do a cast inside a function. However, you can generate the code for that yourself and see what happens in specific cases.)

