Archived

This topic is now archived and is closed to further replies.

nosajghoul

assembly is slower ?! (source included)

nosajghoul    100
OK, I'm new to assembly and just probing the territory. I made two functions that do exactly the same thing (add two numbers together), but one uses assembly and the other doesn't. (I'm coding in MSVC++ 6.0 on a P3 450 this evening, so your numbers will probably be better.) I record the start time and finish time around a loop that calls each function ten million times (literally). I then output the total times, which should (shouldn't it?) be a benchmark of sorts for which is faster. I thought assembly would beat straight C, but nope. My results were: assembly function = 1702, normal function = 1562. Normal was about .14 seconds faster. Why? And now for the code:
#include <iostream.h>
#include <windows.h> //for GetTickCount()

int x = 0xFFFF;
int y = 0xFFFF;
int r = 0; 

void __fastcall AssemAdd()
{
	_asm
	{
		mov eax, x
		mov ebx, y
		add eax, ebx
		mov r, eax
	}
}

void NormAdd()
{	
	r=x+y;
}

const int loops = 10000000; //10 million

void main()
{
	DWORD StartAssem = 0.0f;
	DWORD FinishAssem = 0.0f;
	DWORD TotalAssem = 0.0f;

	DWORD StartNorm = 0.0f;
	DWORD FinishNorm = 0.0f;
	DWORD TotalNorm = 0.0f;

	int x = 0; 
	
	//log starttime
	StartAssem = GetTickCount();
	
	for(x=0; x < loops; x++) 
		AssemAdd();
	
	//log finishtime
	FinishAssem = GetTickCount();
	//calculate totaltime
	TotalAssem = FinishAssem - StartAssem;

	//now for not assembly
	
	//log starttime
	StartNorm = GetTickCount();

	for(x=0; x < loops; x++)
		NormAdd();

	//log finishtime
	FinishNorm = GetTickCount();
	//calculate totaltime
	TotalNorm = FinishNorm - StartNorm;

	//output the times
	cout << endl << "Assem = " << TotalAssem;
	cout << endl << "Norm = " << TotalNorm;
	cout << endl;

}
 
Should work fine as is, in MSVC++ 6.0 at least. It's a console app, and when run (on my computer anyway) it seems to hang for a second (it's just doing 20 million things, that's all) and then outputs the time in milliseconds each loop took. Oh, and could someone tell me what __fastcall is? What's the technical term for those double-underscored commands so I can look it up? -Jason

Shannon Barber    1681
Inline assembly defeats the compiler's optimizer. You're saying "I know better" when you type asm, and it does everything exactly how you enter it. With C & C++ code, it just has to guarantee the result.

(PS iostream.h is deprecated, use iostream, no .h)

Guest Anonymous Poster
Plus, that's not even the most efficient you can make that code. More efficient would be:

_asm
{
mov eax, x
add eax, y
mov r, eax
}

If I remember my basic x86 asm right.

Are you using compiler optimizations?

nosajghoul    100
Anonymous :

Compiler optimizations? Looking into that now. As I said, I'm new at this as of today. I get it so far; haven't hit the wall yet.

I'm trying your suggestion (add eax, y then mov r, eax ...). Makes sense. I'll post the times in a few minutes.

-Jason

quote:
Original post by nosajghoul
Oh, and could someone tell me what __fastcall is? What's the technical term for those double-underscored commands so I can look it up?



__fastcall, __cdecl, __pascal, etc. are all calling conventions. IIRC in __fastcall, the first two DWORD or smaller sized arguments are passed in ecx and edx, and the rest are pushed on the stack right to left. Since you have no parameters, that's really not giving you a performance boost.
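
To make the register-passing point concrete, here is a minimal sketch. FASTCALL, add_two, no_args and check_add are hypothetical names made up for this illustration. The macro is assumed to expand to __fastcall only on 32-bit x86 MSVC builds and to nothing elsewhere, so the snippet still compiles on other compilers.

```cpp
#include <cassert>

// FASTCALL is a hypothetical helper macro: __fastcall on 32-bit x86 MSVC,
// empty everywhere else so the code stays portable.
#if defined(_MSC_VER) && defined(_M_IX86)
#define FASTCALL __fastcall
#else
#define FASTCALL
#endif

// With arguments, the convention has something to put in registers:
// on 32-bit x86 MSVC, a arrives in ecx and b arrives in edx.
int FASTCALL add_two(int a, int b)
{
    return a + b;
}

// With no arguments there is nothing to pass in registers, so the
// convention changes nothing about how the work gets done.
int FASTCALL no_args()
{
    return 0xFFFF + 0xFFFF;
}

// Sanity check: both versions produce the same sum.
void check_add()
{
    assert(add_two(0xFFFF, 0xFFFF) == no_args());
}
```

Calling check_add() from main is enough to exercise both functions; the /FAs listing mentioned later in the thread would show where the arguments actually end up.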

nosajghoul    100
lol, I did what Anonymous said, plus I switched from debug to release. (doh!) Huge difference. Normal function = 200, assembly = 201.

Closer, and much faster, but I was really looking for a simple way to show off the speed of assembly to myself. While I've learnt something, I have opened more doors than I've shut.

There's a way (I hear) to compile and see C++ code alongside the assembly code in MSVC++, right? I looked in settings and don't think it's there.... help?

-Jason

SiCrane    11839
Try looking at the listing files section of the output tab in your project settings. Or just look up the /FAs compiler switch.

Deebo    128
DWORD StartAssem = 0.0f;
The ".0" and "f" are for floating point numbers; use "L":
DWORD StartAssem = 0L;

It doesn't make a difference, I think, but it looks weird assigning floating point values to a long.

__fastcall is a Microsoft extension to C++.
It simply passes the first two args in registers and pushes the rest on the stack from right to left.
Since you are passing no args, it is pointless to use __fastcall (other than for name decoration reasons).

To make the asm code faster, simply put it in the loop, not in a separate function. You MIGHT get a better time if you change "void AssemAdd(void)" to
int __fastcall AssemAdd(const int x, const int y)
{
// this function returns the answer in eax (x arrives in ecx, y in edx)
_asm
{
xor eax, eax
add eax, edx
add eax, ecx
}
}
-OR-
int __fastcall AssemAdd(const int x, const int y)
{
_asm
{
mov eax, ecx
add eax, edx
}
}

-OR-

void __fastcall AssemAdd(const int x, const int y)
{
// this will place the result in the variable "r"
_asm
{
mov r, ecx
add r, edx
}
}

There are A LOT of ways to do things in assembly.
Like Magmai Kai Holmlor said, when using inline assembly, you are saying you know more than the compiler, so you must be right. Assembly itself isn't faster than C++. It's what you do with it that makes it faster.

Intro Engine

Guest Anonymous Poster
The problem with learning asm for that purpose is that compilers keep getting better and better at translating C++ to assembly. Example: I used every optimization trick I knew for one block of code in a program I worked on. Result? It ran faster in debug mode (which doesn't use optimizations), but in release mode my "clever" tricks short-circuited the compiler's own optimizations, which were superior.

Nevertheless, you can almost always still find ways to squeeze more speed out of a block of code by converting to asm, because no compiler is perfect and you know exactly what you're trying to do...it's just a lot harder than if you're working on an obscure platform with a poor compiler.

Melekor    379
The normal function is faster because your assembly is worse than the compiler's assembly.

This is actually a very poor test, since it is possible to optimize away the loop entirely by simply replacing it with
r = x + y;

GetTickCount() is only accurate to about 10-16 ms. Consider using QueryPerformanceCounter().

BTW, you might think ebx is the 2nd register (a = 0, b = 1, etc.),
but this is false. ecx is 1.
The order of the registers is
EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI.

The first four are general purpose; the last two are meant for pointers (but you can use them for anything).
esp and ebp are used by the compiler, so you'll have to restore them if you use them in an asm block.

My suggestions for a better test:

1) Find a better test (something where you can't optimize it all away to nothing).

2) Write the loop yourself, and think about the algorithm. Does everything inside the loop need to be inside the loop? Or could it be done once, outside of the loop?
Most compilers take advantage of this, so if you don't in your assembly, of course it will be slower than the compiler's code.

3) Compile it in release mode with all optimizations on.
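
One common way to get a loop the optimizer cannot throw away is to make the operands volatile. The sketch below is only an illustration of suggestion 1, not the original poster's code: run_adds is a hypothetical name, and the use of volatile and the fixed-width types is an assumption about how you might set the test up on a current compiler.

```cpp
#include <cassert>
#include <cstdint>

// volatile is the assumption that makes this work: every read and write
// must really happen, so the optimizer cannot fold x + y or drop the loop.
volatile std::uint32_t x = 0xFFFF;
volatile std::uint32_t y = 0xFFFF;
volatile std::uint32_t r = 0;

// Performs the add `loops` times and returns the last stored result.
std::uint32_t run_adds(int loops)
{
    assert(loops > 0);
    for (int i = 0; i < loops; ++i)
        r = x + y;   // two loads, one add and one store on every iteration
    return r;
}
```

Timing run_adds(10000000) in a release build then measures real memory traffic plus the add, instead of a loop the compiler was free to delete.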

[edited by - Melekor on March 21, 2004 6:51:15 PM]

nosajghoul    100
I think the best reason to learn assembly isn't the possible boost in speed, but learning in gory technical detail what's really going on under the hood. Man, I've been at this for only 2 hours tonight, and I already have a much better appreciation of C++!! I think I'll learn it, but will probably never make anything useful with it.

Thanks all,
-Jason

Deebo    128
Assembly is the most powerful code next to machine code. In assembly you control what's going on and how it goes on. You can literally reprogram any program on your computer by using a disassembler and hex editor. Big software companies do this to hack their own programs, and then recode them so they can't be hacked that particular way. Anyone crazy enough to take the time to learn and memorize opcodes can reprogram any program using only a hex editor (or even notepad). Although it is very hard to learn assembly and use it efficiently, it is definitely worth it.

Intro Engine

NickB    146
One thing I can't remember is to what level the VC6 optimiser works, but if it works globally, or inlines the function, the optimiser should analyse the data flow and notice that in fact 'r' is only ever assigned one value through the loops, with no other side effects. So the r = x + y calculation can be taken out of the loop and done only once, leaving erm...nothing in the loop, so that can be eliminated too. I don't think the optimisers analyse the asm code, though, meaning that to be safe the compiler must run the asm version 'loops' times. In fact, through DFA (data flow analysis) you can reduce the r = x + y to a constant load (on an x86 architecture that's 5 bytes into eax...that's the same as a call instruction...).

Basically it can get to a situation where the compiled 'C' function = 1 CPU cycle (constant load) and the asm function is a heck of a lot more cycles. This isn't fantasy either; I had this happening under the MSVC++ 7.1 compiler in a similar situation. I was writing a small expression compiler that needed to take a mathematical expression and compile it so that it could be run efficiently many times. Obviously I was benchmarking against some compiler-generated code; because of the restricted problem domain I was able to beat the compiler quite often, especially when it came to using functions [sin, cos, exp, ln, log10 et al.]. Unfortunately, I initially wrote the comparative C function and then called it directly 100000 times, and spent a little time trying to figure out how the heck the compiled version could execute 100000 iterations in, as close as damn it, 1 CPU cycle...then I looked at the assembly listing and all became clear!
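
The reduction described above can be written out by hand. This is a rough sketch of what an optimizer is allowed to do, not a claim about what any particular compiler emits; add_in_loop, add_once, folded and check_equivalence are hypothetical names, and constexpr assumes a C++11 or later compiler.

```cpp
#include <cassert>

const int loops = 10000000; // 10 million, as in the original test

// What the source says: store x + y into r on every pass.
int add_in_loop(int x, int y)
{
    int r = 0;
    for (int i = 0; i < loops; ++i)
        r = x + y;   // same value every time, no other side effects
    return r;
}

// Step 1: the assignment is loop-invariant, so it can be done once
// and the now-empty loop removed.
int add_once(int x, int y)
{
    return x + y;
}

// Step 2: with constant inputs, the add itself can be done at compile
// time, leaving only a constant load.
constexpr int folded = 0xFFFF + 0xFFFF;
static_assert(folded == 0x1FFFE, "0xFFFF + 0xFFFF folds to 0x1FFFE");

// All three forms give the same answer; only the amount of work differs.
void check_equivalence()
{
    assert(add_in_loop(0xFFFF, 0xFFFF) == add_once(0xFFFF, 0xFFFF));
    assert(add_once(0xFFFF, 0xFFFF) == folded);
}
```

Since all three produce the same observable result, a compiler may substitute the cheapest one, which is why the C side of such a benchmark can appear to take almost no time.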

[edited by - NickB on March 21, 2004 8:50:48 PM]

Guest Anonymous Poster
nosajghoul,

Consider yourself lucky... you figured out on your first try what many kids still can't get through their heads after a few years of C... that assembly isn't faster than C.

The truth is that assembly *can be* faster than C, but 95 times out of 100 is not. For all intents and purposes, learning assembly is a much worse use of your time than learning C or C++.

Guest Anonymous Poster
Asm isn't faster than C/C++ and C/C++ isn't faster than asm.

C/C++, Pascal, Ada, BASIC ... are only LANGUAGES that are later converted to asm. Asm (x86, z80, m6800 or any other platform's asm) is the base of every program and the result of every compiled program, so you can't compare the two in terms like "asm isn't faster than C/C++".

hplus0603    11347
Performance on modern CPUs is quite complex.

First of all, most real algorithms will have their performance determined by how many cache line flushes you generate, as well as how many cache line fetches you generate (i.e., memory traffic and locality).

Second of all, an add is an add. A compiler will easily generate optimal code for something as simple as addition. If your goal is to learn assembly, you should focus on getting your assembly *CORRECT* rather than fast, until you have something that really will be faster in hand-coded assembly (SSE based mesh skinning or LCP solving comes to mind).

Third of all, to measure micro-performance like this, you should measure using RDTSC, which reads a counter that increments once per clock tick in the CPU. Most other timers are not precise enough. You also should run the timing pass several times, and choose the *BEST* timing (not the average). That way, you're not measuring random interrupts.
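
The "run it several times and keep the best" idea can be sketched as below. To keep the example portable it uses std::chrono::steady_clock as a stand-in for RDTSC, which is an assumption on my part and gives coarser numbers than a cycle counter. sample_work, time_once and best_of are hypothetical names for the illustration.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Stand-in workload (hypothetical): ten thousand volatile adds, so the
// optimizer cannot remove the loop.
void sample_work()
{
    volatile unsigned sum = 0;
    for (unsigned i = 0; i < 10000; ++i)
        sum = sum + i;
}

// Times a single call of `work` and returns the elapsed nanoseconds.
long long time_once(void (*work)())
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto finish = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
}

// Runs `passes` timing passes and keeps the fastest one. Interrupts and
// task switches only ever add time, so the minimum is the cleanest sample.
long long best_of(void (*work)(), int passes)
{
    assert(passes > 0);
    long long best = std::numeric_limits<long long>::max();
    for (int i = 0; i < passes; ++i) {
        const long long t = time_once(work);
        if (t < best)
            best = t;
    }
    return best;
}
```

For example, best_of(sample_work, 10) returns the fastest of ten passes; comparing two implementations this way is less noisy than comparing averages.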

antareus    576
One thing I've noticed is that assembly is FASTER at getting yourself respect around here. Nobody cares if you can write clean code, but post a black-magic platform-dependent hack that makes dangerous assumptions about target architectures and successfully ekes a cycle out of C++ code, and all of a sudden everyone takes notice.

DevLiquidKnight
also change
int x = 0xFFFF;
int y = 0xFFFF;
int r = 0;

to:
DWORD x = 0xFFFF;
DWORD y = 0xFFFF;
DWORD r = 0;

its faster..
Also consider that even if that does work, the other one always ends up converted to asm too, so obviously your asm isn't optimized enough.

Also, your asm loop is in C++, and that's NOT good :/
Try this and you shall find the asm code is MUCH faster.
There might be a few bugs, I'm not sure.

#include <iostream.h>
#include <windows.h> //for GetTickCount()

DWORD x = 0xFFFF;
DWORD y = 0xFFFF;
DWORD r = 0;
void __fastcall AssemAdd()
{
_asm
{
mov eax, x;
add eax, y;
mov r, eax;
}
}
void NormAdd()
{
r=x+y;
}
const int loops = 10000000; //10 million


void main()
{
DWORD StartAssem = 0L;
DWORD FinishAssem = 0L;
DWORD TotalAssem = 0L;
DWORD StartNorm = 0L;
DWORD FinishNorm = 0L;
DWORD TotalNorm = 0L;
int x = 0;
void *loc = (void*)AssemAdd;
Sleep(100);
//log starttime

StartAssem = GetTickCount();
_asm
{
push ebx; // save ebx

mov ebx, loops; // load the loop count into ebx

start:
call loc; // call AssemAdd through the pointer

dec ebx; // decrease the counter

jnz start; // loop again until the counter reaches 0

pop ebx; // restore ebx

}
//log finishtime

FinishAssem = GetTickCount();
//calculate totaltime

TotalAssem = FinishAssem - StartAssem;
//now for not assembly

Sleep(100);
//log starttime

StartNorm = GetTickCount();
for(x=0; x < loops; x++)
{
NormAdd();
}
//log finishtime

FinishNorm = GetTickCount();
//calculate totaltime

TotalNorm = FinishNorm - StartNorm;
//output the times

cout << endl << "Assembly = " << TotalAssem << " ticks.";
cout << endl << "Normal = " << TotalNorm << " ticks.";
cout << endl;

}

Assembly = 0-16 ticks.
Normal = 32+ ticks.

[edited by - DevLiquidKnight on March 21, 2004 11:34:11 PM]

ironfroggy    122
It's possible the compiler is looking at this and saying "x and y never, ever change, so I'll add them directly instead of moving them from memory to a register!" Or possibly it's even adding them together at compile time and just moving the value directly into the result variable. It's a possible optimization according to your code, know what I mean?

Guest Anonymous Poster
quote:
Original post by DevLiquidKnight
also change
int x = 0xFFFF;
int y = 0xFFFF;
int r = 0;

to:
DWORD x = 0xFFFF;
DWORD y = 0xFFFF;
DWORD r = 0;

its faster..



No, it's not faster. With this compiler int is 32 bits, and so is long. DWORD is an unsigned long. Sign is irrelevant to the operation of adding (when using two's complement), so adding ints is the same as adding longs is the same as adding unsigned longs.

Therefore, since int and DWORD wind up being the same thing to the compiler, "int" is 40% more efficient (saves on typing).
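
The two's complement point is easy to check for yourself. The sketch below assumes a compiler that provides the fixed-width types from <cstdint>; add_signed, add_unsigned and same_bits are hypothetical names, and the inputs are kept in a range where the signed sum does not overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Signed add. Callers must keep the sum in range, since signed overflow
// is undefined; the assert documents that precondition.
std::int32_t add_signed(std::int32_t a, std::int32_t b)
{
    const long long wide = static_cast<long long>(a) + b;
    assert(wide >= INT32_MIN && wide <= INT32_MAX);
    return static_cast<std::int32_t>(wide);
}

// Unsigned add, which wraps modulo 2^32.
std::uint32_t add_unsigned(std::uint32_t a, std::uint32_t b)
{
    return a + b;
}

// Returns true when the signed and unsigned sums have identical bit patterns.
bool same_bits(std::int32_t a, std::int32_t b)
{
    const std::int32_t s = add_signed(a, b);
    const std::uint32_t u = add_unsigned(static_cast<std::uint32_t>(a),
                                         static_cast<std::uint32_t>(b));
    static_assert(sizeof s == sizeof u, "both results are 32 bits wide");
    return std::memcmp(&s, &u, sizeof s) == 0;
}
```

same_bits(0xFFFF, 0xFFFF) and same_bits(-1, 5) both hold, which is why the hardware needs only one add instruction for both signed and unsigned operands.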

Deebo    128
int varies on different systems.
The standard only guarantees that it is at least 16 bits.
On a 32 bit system, int is usually 32 bits (4 bytes).
On most 64 bit systems, int is still 32 bits; it is long that may grow to 64 bits (8 bytes).
If you only need to hold 0xffff, DWORD x = 0xffff could be changed to WORD x = 0xffff.
WORD x = 0xffff could be changed to unsigned short x = 0xffff (for portability); a plain signed 16 bit short cannot hold 0xffff.
This is because 0xffff is 2 bytes long. A short is at least 2 bytes long on any system, and a long is at least 4 bytes.
Or you could just add 4 more f's and forget it.
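
Rather than relying on rules of thumb about type sizes, the compiler can be asked directly. A small sketch, assuming C++11 or later; fits_signed_short, fits_uint16 and check_sizes are hypothetical names for the illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// The language only promises minimum widths; exact sizes are up to the platform.
static_assert(sizeof(short) * CHAR_BIT >= 16, "short has at least 16 bits");
static_assert(sizeof(int) * CHAR_BIT >= 16, "int has at least 16 bits");
static_assert(sizeof(long) * CHAR_BIT >= 32, "long has at least 32 bits");

// True if v can be stored in a plain (signed) short on this platform.
constexpr bool fits_signed_short(long v)
{
    return v >= SHRT_MIN && v <= SHRT_MAX;
}

// True if v can be stored in an unsigned 16-bit integer.
constexpr bool fits_uint16(long v)
{
    return v >= 0 && v <= 0xFFFF;
}

// 0xFFFF needs all 16 bits, so it fits an unsigned 16-bit type exactly.
void check_sizes()
{
    assert(fits_uint16(0xFFFF));
    assert(sizeof(std::uint16_t) * CHAR_BIT == 16);
}
```

When an exact width matters, the fixed-width names such as std::uint16_t and std::uint32_t say so directly, which avoids guessing what short, int or long mean on a given system.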

Intro Engine
