Sign in to follow this  

Why when I try to do this in asm it gives an access violation but not in C++?

This topic is 3300 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Well basically I'm trying to flip the byte order of an array of DWORDs I made two versions of this function, one in C++ and the other in asm but the ASM version gives me this error "First-chance exception at 0x004499bc in Engine.exe: 0xC0000005: Access violation reading location 0x002ff5fc." I'd really like to get the asm one to work since it's double as fast as the other one. The C++ version:
	for( UINT i = 0; i < 10000; i++)
	{
		unsigned long Temp = pSrcData[i];
		pSrcData[i] =  (Temp >> 24) & 0x000000FF;
		pSrcData[i] += (Temp >>  8) & 0x0000FF00;
		pSrcData[i] += (Temp <<  8) & 0x00FF0000;
		pSrcData[i] += (Temp << 24) & 0xFF000000;
	}



The ASM version:
	for( UINT i = 0; i < 10000; i++)
	{
		__asm
		{
			mov ebx, pSrcData[i];
			bswap ebx;
			mov pSrcData[i], ebx;
		}
	}



I don't really know asm, I found the code on a forum and tried to edit it to work how I need it to.

Share this post


Link to post
Share on other sites
I'm not so familiar with inline x86 asm, but ages ago I used to write stuff in 100% asm.

As a note, the array is long: is made of 64-bit data. You are accessing it as if it were 32-bits wide.

An hint is to try to use registers (at least in 100% asm environment), as that stuff can happen when you think you are using the adress to point at something but you are actually using the content of the first item to do it.

Share this post


Link to post
Share on other sites
Quote:
Original post by undead
I'm not so familiar with inline x86 asm, but ages ago I used to write stuff in 100% asm.

As a note, the array is long: is made of 64-bit data. You are accessing it as if it were 32-bits wide.

An hint is to try to use registers (at least in 100% asm environment), as that stuff can happen when you think you are using the adress to point at something but you are actually using the content of the first item to do it.


Is there a way you can show me what your talking about or at least link me to a page that explains it a bit more?

Share this post


Link to post
Share on other sites
Try this, works for me, not sure about the speed I haven't benchmarked it.


__asm
{
mov ecx,dword ptr [i]
mov eax,dword ptr pSrcData[ecx*4]
bswap eax;
mov ecx,dword ptr [i]
mov dword ptr pDstData[ecx*4],eax
}



The modifications come from reading the disassembly of the C++ solution.

Quote:

As a note, the array is long: is made of 64-bit data. You are accessing it as if it were 32-bits wide.

On x86 a long and an unsigned long are both 32-bits wide, so its perfectly valid to access them like this

Share this post


Link to post
Share on other sites
Quote:
Original post by undead
As a note, the array is long: is made of 64-bit data. You are accessing it as if
it were 32-bits wide.


No. Neither is the array very long (10,000 elements is nothing in game/graphics/a.i./etc.-programming, nor can you know by that code how big a single element is (long is defined to be at least as big as an int), and looking at the exception-message let's me guess CodaKiller is on a 32bit box, as the addresses are 4 byte wide.

CodaKilla:
Where exactly does it break? For which loop element?

Share this post


Link to post
Share on other sites
Quote:
Original post by CodaKiller

I'd really like to get the asm one to work since it's double as fast as the other one.


Calling it individually has little purpose if you need to modify entire array.

	int n = 10000;
unsigned long * baz = new unsigned long[n];

Timer t1;
for( int i = 0; i < n; i++)
{
unsigned long Temp = baz[i];
baz[i] = (Temp >> 24) & 0x000000FF;
baz[i] += (Temp >> 8) & 0x0000FF00;
baz[i] += (Temp << 8) & 0x00FF0000;
baz[i] += (Temp << 24) & 0xFF000000;
}
std::cout << t1.elapsed_ms() << "ms\n";
std::cout << std::hex << baz[2] << std::endl;

Timer t2;
__asm {
mov edx, [baz]
mov ecx, 0
bsloop:
mov eax, dword ptr [edx+ecx*4]
bswap eax
mov dword ptr[edx+ecx*4], eax
inc ecx
cmp ecx, n
jl bsloop
}
std::cout << t2.elapsed_ms() << "ms\n";
std::cout << std::hex << baz[2] << std::endl;



Output:
Quote:
0.0519619ms
0
0.0187175ms
0

I actually expected there to be less difference, but it's quite a nice improvement.

Share this post


Link to post
Share on other sites
You may find that the intrinsic function _byteswap_ulong is quicker - it should end up as a bswap instruction, but doesn't have the downsides of hand written assembly getting in the way of the optimizer.

Share this post


Link to post
Share on other sites
Quote:
Original post by Antheus
cmp ecx, n

Shouldn't that be cmp ecx, [n]?

And here is how I would do it, completely untested:

mov edi, [pSrcData]
mov ecx, 10000
swaploop:
mov eax, [edi]
bswap eax
stosd
loop swaploop

Or without the string instructions, might be faster on current architectures:

mov edi, [pSrcData]
mov ecx, 10000
swaploop:
mov eax, [edi]
bswap eax
mov [edi], eax
add edi, 4
dec ecx
jnz swaploop

Share this post


Link to post
Share on other sites
To me the C++ version nedds 12ms for 1'000'000 cycles, while the asm needs 4ms. The following needs 8ms: Still worse than the asm, but a good trade-off between performant and comprehensible/maintainable code (you really care about that few ms?):


unsigned char * b = reinterpret_cast<unsigned char *>(baz);
unsigned char Temp;
for( int i = 0; i < n; i++)
{
Temp = *b; *b = b[3]; b[3] = Temp; // swap
Temp = b[1]; b[1] = b[2]; b[2] = Temp; // swap
b += 4;
}

Share this post


Link to post
Share on other sites
Quote:
Original post by DevFred

mov edi, [pSrcData]
mov ecx, 10000
swaploop:
mov eax, [edi]
bswap eax
mov [edi], eax
add edi, 4
dec ecx
jnz swaploop


Of course you can also do manual loop unrolling to make it even faster:

mov edi, [pSrcData]
mov ecx, 5000 // only half as many iterations now!
swaploop:
mov eax, [edi]
mov edx, [edi + 4]
bswap eax
bswap edx
mov [edi], eax
mov [edi + 4], edx
add edi, 8
dec ecx
jnz swaploop

Can someone benchmark if this makes a difference?

Share this post


Link to post
Share on other sites
Wow, it seems there is a huge difference. On my pc, 10k loops on a 10k elements array last after 140-330 ms (most of the runs needed about 220 +/- 20 ms), while the unrolled version needed 90-110ms (not mentioning the nearly constant 400ms of the C++ version).

Share this post


Link to post
Share on other sites
What about a further unroll?

mov edi, [pSrcData]
mov ecx, 2500
swaploop:
mov eax, [edi]
mov edx, [edi + 4]
mov ebx, [edi + 8]
mov esi, [edi + 12]
bswap eax
bswap edx
bswap ebx
bswap esi
mov [edi], eax
mov [edi + 4], edx
mov [edi + 8], ebx
mov [edi + 12], esi
add edi, 16
dec ecx
jnz swaploop

Share this post


Link to post
Share on other sites
Quote:
Original post by Nyarlath
you really care about that few ms?


Those few milliseconds mean a lot if you are running it many times a frame, though I'm not saying it will run it that many times but I have no idea what someone will do with it since I am writing a programmable engine.

Share this post


Link to post
Share on other sites
Quote:
Original post by CodaKiller
Quote:
Original post by phresnel
CodaKilla:
Where exactly does it break? For which loop element?

First.

Let's take a look at what you do then:

mov ebx, pSrcData[i];

Both pSrcData and i are lvalues.

In x86 assembly language (intel syntax), i means "the address of i", while [i] means "the value of i".

You cannot combine pSrcData[i] and expect that to behave just like it would in C. I don't know for sure what the processor does, but I suppose he just adds the addresses of pSrcData and i which doesn't make any sense at all (because we don't own that random address).

Instead, you must load the pointer that is stored at pSrcData and then go [i] beyond that, something like

mov ebx, [pSrcData] // load the pointer
add ebx, [i] // go to the i-th position
mov eax, [ebx] // load the i-th element
bswap eax
mov [ebx], eax

But writing the entire loop in assembly is much wiser than writing the loop logic in C and the loop body in assembly. See my posts on loop unrolling for a lightning-fast solution ;)

Share this post


Link to post
Share on other sites
Quote:
Original post by DevFred
Quote:
Original post by CodaKiller
I have no idea what someone will do with it since I am writing a programmable engine.

Hmmm...


Seriously you guys spam that way to much, I am not a sheep, I have the right to chose my own path and you can not generalize this as something that every living person on earth should follow.

Would you tell the maker of infinity to "make games not engines", no because you know he is a competent programmer who does not need your input on the matter. You don't even really know what I'm doing and your already telling me what I have to do.

This is good advice that I agree with but it does not apply to everyone and it does not apply to my project.

Share this post


Link to post
Share on other sites
Quote:
Original post by DevFred
Quote:
Original post by CodaKiller
Quote:
Original post by phresnel
CodaKilla:
Where exactly does it break? For which loop element?

First.

Let's take a look at what you do then:

mov ebx, pSrcData[i];

Both pSrcData and i are lvalues.

In x86 assembly language (intel syntax), i means "the address of i", while [i] means "the value of i".

You cannot combine pSrcData[i] and expect that to behave just like it would in C. I don't know for sure what the processor does, but I suppose he just adds the addresses of pSrcData and i which doesn't make any sense at all (because we don't own that random address).

Instead, you must load the pointer that is stored at pSrcData and then go [i] beyond that, something like

mov ebx, [pSrcData] // load the pointer
add ebx, [i] // go to the i-th position
mov eax, [ebx] // load the i-th element
bswap eax
mov [ebx], eax

But writing the entire loop in assembly is much wiser than writing the loop logic in C and the loop body in assembly. See my posts on loop unrolling for a lightning-fast solution ;)


This is flipping only 2 bytes, though it does work but not quite what I need. Also I better add that my system is x64 but the code is in x86.

Share this post


Link to post
Share on other sites
Someone should write this using SSE.

Also, prefetching should help a bit with unrolled or SIMD versions:

swaploop:
mov esi, [edi + 12]
mov eax, [edi]
mov edx, [edi + 4]
mov ebx, [edi + 8]

Share this post


Link to post
Share on other sites
Quote:
Original post by CodaKiller
Quote:
Original post by DevFred
Quote:
Original post by CodaKiller
I have no idea what someone will do with it since I am writing a programmable engine.

Hmmm...

Would you tell the maker of infinity to "make games not engines", no because you know he is a competent programmer who does not need your input on the matter.
Actually, you might be surprised how many GDnet members have told ysaneya exactly that - there are several long threads around about how he "is a great engine programmer" but "couldn't actually write a game" (check the IOTD threads if you doubt me).

Share this post


Link to post
Share on other sites
Quote:
Original post by CodaKiller
Quote:
Original post by DevFred
Quote:
Original post by CodaKiller
I have no idea what someone will do with it since I am writing a programmable engine.

Hmmm...


Seriously you guys spam that way to much, I am not a sheep, I have the right to chose my own path and you can not generalize this as something that every living person on earth should follow.

Would you tell the maker of infinity to "make games not engines", no because you know he is a competent programmer who does not need your input on the matter. You don't even really know what I'm doing and your already telling me what I have to do.

This is good advice that I agree with but it does not apply to everyone and it does not apply to my project.


Do you mean this Infinity http://www.infinity-universe.com/Infinity/index.php because 'that guy' is making a game, so it doesn't apply anyway.

The whole point of the article is not that it saying "No one can ever make an engine" but that "Making a game is more important". Its all well and good making an engine if you make a game with it, otherwise you run the risk of creating an engine that isn't useful/practical. It seems like you've missed this point since your putting such much effort into micro optimizing a tiny piece of code you don't know will even be time critical or not.

Share this post


Link to post
Share on other sites
Right, I didn't think about prefetching. I suggest already prefetching the last value of the next iteration to make sure everything is in the cache by then:

mov edi, [pSrcData]
mov ecx, 2499
swaploop:
mov esi, [edi + 28] // we don't actually care about esi, just fill the cache
swaploop_last:
mov eax, [edi]
mov edx, [edi + 4]
mov ebx, [edi + 8]
mov esi, [edi + 12]
bswap eax
bswap edx
bswap ebx
bswap esi
mov [edi], eax
mov [edi + 4], edx
mov [edi + 8], ebx
mov [edi + 12], esi
add edi, 16
dec ecx
jg swaploop
jz swaploop_last // no prefetching on last iteration
// to prevent possible segfault

Does that make a difference? Maybe for larger arrays? Come on benchmark people :)

Share this post


Link to post
Share on other sites
Quote:
Original post by grekster
Quote:
Original post by CodaKiller
Quote:
Original post by DevFred
Quote:
Original post by CodaKiller
I have no idea what someone will do with it since I am writing a programmable engine.

Hmmm...


Seriously you guys spam that way to much, I am not a sheep, I have the right to chose my own path and you can not generalize this as something that every living person on earth should follow.

Would you tell the maker of infinity to "make games not engines", no because you know he is a competent programmer who does not need your input on the matter. You don't even really know what I'm doing and your already telling me what I have to do.

This is good advice that I agree with but it does not apply to everyone and it does not apply to my project.


Do you mean this Infinity http://www.infinity-universe.com/Infinity/index.php because 'that guy' is making a game, so it doesn't apply anyway.

The whole point of the article is not that it saying "No one can ever make an engine" but that "Making a game is more important". Its all well and good making an engine if you make a game with it, otherwise you run the risk of creating an engine that isn't useful/practical. It seems like you've missed this point since your putting such much effort into micro optimizing a tiny piece of code you don't know will even be time critical or not.


I am making a game and this is a very key function that controls text for the interface, I need to pass text from the GDI to a DirectX texture but I need it to be in the alpha channel and since the GDI knows nothing about alpha I plan to flip the red channel over to the alpha channel.

This thread is not about my engine it is about flipping bytes in an array of DWORDs.

Share this post


Link to post
Share on other sites

This topic is 3300 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this