fast float number multiplication by 2 and 0.5

71 comments, last by DigitalDelusion 19 years, 10 months ago
Thanks for the explanation.
Looks like another case of KISS: Keep It Simple, Stupid. Write good code and let the compiler optimize it. The following code uses srand(..) as a function with a side effect to prevent the compiler from optimizing away the variable f. All code was compiled with Visual C++ .NET with all optimizations enabled.

-benchmark code-
	Timer t;
	int iterations = 100000000;
	double result1, result2, result3;

	srand(0);
	t.RestartTimer();
	for( int i = 0; i < iterations; i++)
	{
		float f = (float)i;
		*(int*)&f=(*(int*)&f)-(1L<<23L);
		srand( (int)f );
	}
	result1 = t.GetSecondsSinceTimerRestarted();

	srand(0);
	t.RestartTimer();
	for( int i = 0; i < iterations; i++)
	{
		float f = (float)i;
		f /= 2.0f;
		srand( (int)f );
	}
	result2 = t.GetSecondsSinceTimerRestarted();

	srand(0);
	t.RestartTimer();
	for( int i = 0; i < iterations; i++)
	{
		float f = (float)i;
		f *= 0.5f;
		srand( (int)f );
	}
	result3 = t.GetSecondsSinceTimerRestarted();

	printf( "Results: \nProc1: %g seconds.\nProc2: %g seconds.\nProc3: %g seconds.\n", result1, result2, result3 );


Yes, the casting most likely takes the majority of the time in the benchmark, but the important thing is that it takes an effectively equal amount of time in each test, so it can be disregarded. The only thing that differs between the three tests is the way the float value is divided by 2:

proc1 = charles' way
proc2 = obvious way, just divide by 2
proc3 = multiply by 0.5

Results in seconds:

proc1 = 4.781 4.782 4.748 4.764
proc2 = 3.843 3.888 3.884 3.880
proc3 = 3.854 3.850 3.866 3.854

Conclusion: Charles' way is by far the slowest and the most complex. The other two methods compiled into essentially the same code as one another and were significantly faster. K.I.S.S.

[Edited by - haro on June 25, 2004 12:32:34 AM]
Assembly generation if anybody is curious:

proc1:

; 83   : 		float f = (float)i;
	fild	DWORD PTR _i$19490[esp+72]
	fstp	DWORD PTR _f$19494[esp+72]
; 84   : 		*(int*)&f=(*(int*)&f)-(1L<<23L);
	mov	edx, DWORD PTR _f$19494[esp+72]
	sub	edx, 8388608				; 00800000H
	mov	DWORD PTR _f$19494[esp+72], edx
; 85   : 		srand( (int)f );
	fld	DWORD PTR _f$19494[esp+72]
	call	__ftol2
	push	eax
	call	_srand
	add	esp, 4
	inc	esi
	cmp	esi, 100000000				; 05f5e100H
	mov	DWORD PTR _i$19490[esp+72], esi
	jl	SHORT $L19491


proc2:

; 92   : 	{
; 93   : 		float f = (float)i;
; 94   : 		f /= 2.0f;
; 95   : 		srand( (int)f );
	fild	DWORD PTR _i$19499[esp+72]
	fmul	DWORD PTR __real@3f000000
	call	__ftol2
	push	eax
	call	_srand
	add	esp, 4
	inc	esi
	cmp	esi, 100000000				; 05f5e100H
	mov	DWORD PTR _i$19499[esp+72], esi
	jl	SHORT $L19500


proc3:

; 102  : 	{
; 103  : 		float f = (float)i;
; 104  : 		f *= 0.5f;
; 105  : 		srand( (int)f );
	fild	DWORD PTR _i$19506[esp+72]
	fmul	DWORD PTR __real@3f000000
	call	__ftol2
	push	eax
	call	_srand
	add	esp, 4
	inc	esi
	cmp	esi, 100000000				; 05f5e100H
	mov	DWORD PTR _i$19506[esp+72], esi
	jl	SHORT $L19507

Quote:Original post by haro
Conclusion: Charles' way is by far the slowest and most complex. The other two methods compiled into essentially the same code as one another and were significantly faster. K.I.S.S.


Keep It Simple Stupid
[ego on]
I agree; this applies to the average 'stupid' coder not able to write a full 3D game in asm ;) If I have been stupid, it was in writing crazy 3D software engines with lots of asm 10 years ago, which ruined my (mental) health. But now you'd better not fool yourself into imagining I say stupid things on my favorite terrain: math and asm.
[/ego off]

Write good code and let the compiler optimize it.
C'mon, I am not a noob. I am your math/asm God! ;) Believe me, I know what I am doing; just read what follows carefully. I said nothing else: let the compiler work, unless you know exactly what you are doing ... which you did not in your benchmark.

I warned clearly:
- I hope you use floats already in memory, as I mentioned. Otherwise it's pointless.

This is easy to see here :
float f=(float)i;
	fild	DWORD PTR _i$19490[esp+72]
	fstp	DWORD PTR _f$19494[esp+72]
	mov	edx, DWORD PTR _f$19494[esp+72]

I already mentioned: beware of such tricks, because CPUs do not like store-to-load forwarding dependencies, especially when the store (fstp) and the load (mov) are not of the same type. If you had used an array of floats, those two instructions, fild and fstp, would not be here.

Here is a valid sample loop:

for( int i = 0; i < iterations; i++)
{
	*(int*)&Output[i] = (*(int*)&Input[i]) - (1L<<23L);
}

I'll leave you the pleasure of running the benchmark, because I already know asm and the answer. But you can correct yours easily. You'll see that if your input is in an array, 'my' solution will be the fastest by far. I bet it is at least 200% faster, with a good security margin.

Next, your benchmark system is also completely invalid, for several reasons I also mentioned and warned about. Understand the asm output to detect any flaw.

I wrote: beware of the surrounding code.

Two things are prohibited:
- call _srand is by far the most time-consuming part; it's complete noise in the inner loop.
- same thing for call __ftol2.

Last, this:

	mov	DWORD PTR _i$19499[esp+72], esi

is a print of non-optimized debug code. There is no register pressure, so esi, the loop counter, has no reason to be stored.

Remember this: you don't compare the speeds of two Formula One cars by attaching a 5-ton van full of measurement equipment behind each one!
(Could anyone correct my English on that last sentence so that it could become a useful rule of thumb?)

I think I will propose a paper that discusses benchmarking of small inline functions. I see that nearly no one at GameDev is able to make a really correct one (although Intel and co. provide a few valid pieces of information).

PS: thanks for providing the asm, Haro; it makes it easier to see the flaws.
"Coding math tricks in asm is more fun than Java"
Charles, I meant nothing personal via KISS. I hope you did not take it that way.

Anyhow, if you care to share, I'd love to see a benchmark in a similar format that ends up with your code going as fast as the basic "f *= 0.5". Also, as far as the srand(..) and the cast taking the most time, I am well aware. However, this overhead is added to each 'proc' equally. Using your analogy: if I race two Formula One cars and then add 5 hours to each of their final times, the faster car is still the faster car.
Quote:Original post by haro
Charles, I meant nothing personal via KISS. I hope you did not take it that way.

I did not take it so. It was just a humorous way of trying to say something about code solidity and code speed. I also agree in general: even though I have enough experience to cope with such odd code efficiently, I'd surely use the normal way for productivity reasons. It's just the subject of this thread that led me to show this odd way.

I can only blame you for not accurately considering all the warnings I added showing under which conditions this might be used. Very rarely. I would write such code, for instance, in some weird CPU rendering paths where the input data has a very particular form.

Also as far as the srand(..) and the cast taking the most time, I am well aware. However, this overhead is added to each 'proc' equally.

Using your analogy, if I race two formula one cars and then add 5 hours to their final times, the faster car is still the faster car.

This analogy is not really valid, because a Formula One with more power but more mass would be advantaged once the van is attached. In this case the disadvantage is the fild/fstp sequence, a store-to-load dependency. You cannot argue I did not warn about it clearly: USEFUL FOR DATA ALREADY IN MEMORY. That is also what I meant by 'noise': you lose precision in the signal you want to measure because of the benchmark 'equipment'. To exaggerate: if both solutions cost 1 cycle and the overhead costs 100, it is certain that the <1 cycle difference will be totally lost in small details of compilation that do not depend on the function being measured, but on the way the benchmark-only code interacts (schedules) with the function.


Anyhow, if you care to share I'd love to see a benchmark in a similar format that ends up with your code going as fast as the basic "f *= 0.5".


Absolutely no problem; I already have the stuff used in the infamous next-higher-power-of-2 thread :)

Now, to be clear, the comparison benchmark I'll use will simply be something like this:

for( i = 0; i < n; i++ )
	pOut[i] = myFunc(pInp[i]);

where myFunc is one of the three 'solutions', as an inline function, a macro, or directly expressed. It changes nothing anyway.

I'll post the full code and results if you wish. Gimme a few minutes. Anyway, I already gave a prediction of it earlier in my reply to d000hg. I doubt I'll be surprised here; or else I deserve the guillotine for my self-attributed crown of math/asm god.
"Coding math tricks in asm is more fun than Java"
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>

unsigned rdtsc(void)
{
	unsigned timer;
#if defined(_MSC_VER)
	/* Visual C++ inline assembly code */
	__asm{
		rdtsc
		mov timer, eax
	};
#elif defined(__GNUC__)
	/* GCC inline assembly code */
	unsigned lo, hi;
	__asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
	timer = lo;
#endif
	return timer;
}

#define n 128
#define nOuterLoops 128

static float pInp[n];
static float pOut[n];

#define div2_v0(r, s) r=s*0.5f
#define div2_v1(r, s) *(int*)&(r)=(*(int*)&(s))-(1<<23)

#define test(f) \
{ \
	int i, j; \
	unsigned tmin = n*100000UL; \
	unsigned timing; \
	for(j=0; j<nOuterLoops; j++) \
	{ \
		unsigned t=rdtsc(); \
		for(i=-n; i; i+=4) \
		{ \
			f(pOut[i+n+0], pInp[i+n+0]); \
			f(pOut[i+n+1], pInp[i+n+1]); \
			f(pOut[i+n+2], pInp[i+n+2]); \
			f(pOut[i+n+3], pInp[i+n+3]); \
		} \
		t=rdtsc()-t; \
		tmin = (t < tmin) ? t : tmin; \
	} \
	timing = (100*tmin)/n; \
	printf("%-20s : %3d/100 clock cycles\n", #f, timing); \
}

int main(void)
{
	/*
		Init the input array.
		Just in case we compare the outputs: div2_v1 is undefined at 0,
		thus the +1.
	*/
	{ int i; for(i=0; i<n; i++) pInp[i] = (float)(1+rand()); }

	test(div2_v0);
	test(div2_v1);

	getch();
}


With full optimizations on :
I get 2.28 cycles for v0 (float)
I get 1.97 cycles for v1 (int)

So I win, but it's a victory à la Zidane: a last-minute one. Only a 10% difference isn't worth the pain, plus there is the zero problem. At least on an Athlon. I am a bit surprised by the effects of code reordering, which reduce the float latencies. My 200% prediction was thus quite exaggerated. Conclusion: this optimization is really pointless unless it's grouped with other IEEE-representation tricks, as in Carmack's sqrt routine. Where is my head? :)

[Edited by - Charles B on June 26, 2004 5:00:40 AM]
"Coding math tricks in asm is more fun than Java"
You are arguing that he is wrong. He is someone with a good deal of experience that you lack. He is someone who can pretty well tell you how long each of those two instructions takes without measuring it. That isn't good enough for you, though; you want to set up a test to prove whether he is right or wrong and conclude he is wrong. That is because your test is flawed. You don't care enough to go to the effort of understanding what is flawed about your test, but that doesn't keep you from still arguing he is wrong.
Keys to success: Ability, ambition and opportunity.
Hey, who are you talking to? In either case, that's pointless flaming. Haro and I are giving arguments; you are just posting nonsense bashing. Where is your point?
"Coding math tricks in asm is more fun than Java"
I also agree with Charles -- that was pointless flaming, LilBudyWizer. I don't think anybody is arguing that somebody is just "wrong". I doubted Charles' optimization and created a little benchmark program to support my doubt. Charles proved my benchmark was probably flawed with his counter-benchmark. And it continues onward while everybody hopefully learns a few things in the process. I already have, for sure.

Anyhow, back on topic. Out of curiosity I rewrote your benchmark Charles, to make it a bit easier for people to read and to use a more realistic data set size. Here is my version of your benchmark:

#include "timer.h"
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>

#define n 33554432
#define nOuterLoops 10

static float pInp[n];
static float pOut[n];

Timer timer;

inline void test1()
{
	double result;
	double bestTime = 999999.0;

	for(int j=0; j < nOuterLoops; j++)
	{
		timer.RestartTimer();
		for(int i = -n; i; i += 4 )
		{
			pOut[i+n+0] = pInp[i+n+0] * 0.5f;
			pOut[i+n+1] = pInp[i+n+1] * 0.5f;
			pOut[i+n+2] = pInp[i+n+2] * 0.5f;
			pOut[i+n+3] = pInp[i+n+3] * 0.5f;
		}
		result = timer.GetSecondsSinceTimerRestarted();
		if( result < bestTime )
			bestTime = result;
	}
	printf("Best time for test1 = %g seconds.\n", bestTime );
}

inline void test2()
{
	double result;
	double bestTime = 999999.0;

	for(int j=0; j<nOuterLoops; j++)
	{
		timer.RestartTimer();
		for(int i = -n; i; i+=4)
		{
			*(int*)&(pOut[i+n+0])=(*(int*)&(pInp[i+n+0]))-(1<<23);
			*(int*)&(pOut[i+n+1])=(*(int*)&(pInp[i+n+1]))-(1<<23);
			*(int*)&(pOut[i+n+2])=(*(int*)&(pInp[i+n+2]))-(1<<23);
			*(int*)&(pOut[i+n+3])=(*(int*)&(pInp[i+n+3]))-(1<<23);
		}
		result = timer.GetSecondsSinceTimerRestarted();
		if( result < bestTime )
			bestTime = result;
	}
	printf("Best time for test2 = %g seconds.\n", bestTime );
}

int main(void)
{
	srand(0);
	for( int i=0; i<n; i++ )
		pInp[i] = (float)(1+rand());
	test1();

	srand(0);
	for( int i=0; i<n; i++ )
		pInp[i] = (float)(1+rand());
	test2();
}



Assembly for test1, which simply multiplies by 0.5f:

; 209  : 		{
; 210  : 			pOut[i+n+0] = pInp[i+n+0] * 0.5f;
	fld	DWORD PTR _pInp[eax]
	add	eax, 16					; 00000010H
	cmp	eax, 134217728				; 08000000H
	fmul	DWORD PTR __real@3f000000
	fstp	DWORD PTR _pOut[eax-16]
; 211  : 			pOut[i+n+1] = pInp[i+n+1] * 0.5f;
	fld	DWORD PTR _pInp[eax-12]
	fmul	DWORD PTR __real@3f000000
	fstp	DWORD PTR _pOut[eax-12]
; 212  : 			pOut[i+n+2] = pInp[i+n+2] * 0.5f;
	fld	DWORD PTR _pInp[eax-8]
	fmul	DWORD PTR __real@3f000000
	fstp	DWORD PTR _pOut[eax-8]
; 213  : 			pOut[i+n+3] = pInp[i+n+3] * 0.5f;
	fld	DWORD PTR _pInp[eax-4]
	fmul	DWORD PTR __real@3f000000
	fstp	DWORD PTR _pOut[eax-4]
	jne	SHORT $L19583


Assembly for test2, the more complex method.

; 234  : 		{
; 235  : 			*(int*)&(pOut[i+n+0])=(*(int*)&(pInp[i+n+0]))-(1<<23);
	mov	edx, DWORD PTR _pInp[eax]
; 236  : 			*(int*)&(pOut[i+n+1])=(*(int*)&(pInp[i+n+1]))-(1<<23);
	mov	ecx, DWORD PTR _pInp[eax+4]
	sub	edx, 8388608				; 00800000H
	sub	ecx, 8388608				; 00800000H
	mov	DWORD PTR _pOut[eax], edx
; 237  : 			*(int*)&(pOut[i+n+2])=(*(int*)&(pInp[i+n+2]))-(1<<23);
	mov	edx, DWORD PTR _pInp[eax+8]
	mov	DWORD PTR _pOut[eax+4], ecx
; 238  : 			*(int*)&(pOut[i+n+3])=(*(int*)&(pInp[i+n+3]))-(1<<23);
	mov	ecx, DWORD PTR _pInp[eax+12]
	sub	edx, 8388608				; 00800000H
	sub	ecx, 8388608				; 00800000H
	mov	DWORD PTR _pOut[eax+8], edx
	mov	DWORD PTR _pOut[eax+12], ecx
	add	eax, 16					; 00000010H
	cmp	eax, 134217728				; 08000000H
	jne	SHORT $L19597


Results in Seconds for best time with 2^25 Inputs:

test1(simple method ): 0.240184 0.239053 0.239093
test2(complex method): 0.241958 0.241791 0.241830

Even using your method of taking only the best times, the complex method still does not pull ahead unless I use an extremely small data set, as you did (128 elements), in which case it generally pulled ahead by a small fraction of one percent. I am not completely sure why this is.

Do you have any objections to this benchmark?

Even using your exact program, as you increase the number of values, the "simple" method "becomes" faster and begins to achieve better and better comparative results.

