CodaKiller

Why when I try to do this in asm it gives an access violation but not in C++?


Basically, I'm trying to flip the byte order of an array of DWORDs. I made two versions of this function, one in C++ and the other in asm, but the asm version gives me this error: "First-chance exception at 0x004499bc in Engine.exe: 0xC0000005: Access violation reading location 0x002ff5fc." I'd really like to get the asm one to work, since it's twice as fast as the other one.

The C++ version:
	for( UINT i = 0; i < 10000; i++)
	{
		// Move each byte of element i to its mirrored position.
		unsigned long Temp = pSrcData[i];
		pSrcData[i] =  (Temp >> 24) & 0x000000FF;
		pSrcData[i] += (Temp >>  8) & 0x0000FF00;
		pSrcData[i] += (Temp <<  8) & 0x00FF0000;
		pSrcData[i] += (Temp << 24) & 0xFF000000;
	}



The ASM version:
	for( UINT i = 0; i < 10000; i++)
	{
		__asm
		{
			mov ebx, pSrcData;
			bswap ebx;
			mov pSrcData, ebx;
		}
	}



I don't really know asm, I found the code on a forum and tried to edit it to work how I need it to.
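
For reference, the shift-and-mask approach from the C++ version can be written as a small standalone function. This is only a minimal sketch: the names swap_bytes32 and swap_array32 are made up for illustration, and std::uint32_t is assumed to stand in for DWORD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reverse the byte order of one 32-bit value using shifts and masks.
inline std::uint32_t swap_bytes32(std::uint32_t v)
{
    return ((v >> 24) & 0x000000FFu) |
           ((v >>  8) & 0x0000FF00u) |
           ((v <<  8) & 0x00FF0000u) |
           ((v << 24) & 0xFF000000u);
}

// Swap every element of an array in place (hypothetical helper).
inline void swap_array32(std::uint32_t* data, std::size_t count)
{
    assert(data != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i)
        data[i] = swap_bytes32(data[i]);
}
```

Calling swap_array32 twice on the same array should give back the original contents, which makes for an easy sanity check.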

I'm not so familiar with inline x86 asm, but ages ago I used to write stuff in 100% asm.

As a note, the array is long: it is made of 64-bit data. You are accessing it as if it were 32 bits wide.

A hint is to try to use registers (at least in a 100% asm environment), as that kind of thing can happen when you think you are using the address to point at something but you are actually using the content of the first item to do it.

Quote:
Original post by undead
I'm not so familiar with inline x86 asm, but ages ago I used to write stuff in 100% asm.

As a note, the array is long: it is made of 64-bit data. You are accessing it as if it were 32 bits wide.

A hint is to try to use registers (at least in a 100% asm environment), as that kind of thing can happen when you think you are using the address to point at something but you are actually using the content of the first item to do it.


Is there a way you can show me what you're talking about, or at least link me to a page that explains it a bit more?

Try this. It works for me, but I'm not sure about the speed since I haven't benchmarked it.


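__asm
{
	mov ecx, dword ptr [i]                // load the loop index
	mov eax, dword ptr pSrcData[ecx*4]    // read element i
	bswap eax                             // reverse its byte order
	mov ecx, dword ptr [i]
	mov dword ptr pDstData[ecx*4], eax    // write it to the destination
}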



The modifications come from reading the disassembly of the C++ solution.

Quote:

As a note, the array is long: it is made of 64-bit data. You are accessing it as if it were 32 bits wide.

On x86, long and unsigned long are both 32 bits wide, so it's perfectly valid to access them like this.
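
If you would rather not rely on the width of long at all, a sketch like the following checks it at compile time and converts to a fixed-width type. The names kUnsignedLongBits and to_u32 are made up for illustration; the only guarantee assumed here is the standard one that long has at least 32 bits.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Width of unsigned long, in bits, on whatever compiler builds this.
constexpr unsigned kUnsignedLongBits =
    static_cast<unsigned>(sizeof(unsigned long) * CHAR_BIT);

// The language only promises *at least* 32 bits for long.
static_assert(kUnsignedLongBits >= 32, "long must have at least 32 bits");

// Hypothetical helper: narrow to exactly 32 bits, checking nothing is lost.
inline std::uint32_t to_u32(unsigned long v)
{
    assert(v <= 0xFFFFFFFFul);
    return static_cast<std::uint32_t>(v);
}
```

With std::uint32_t as the element type, a 32-bit bswap matches the data no matter how wide long happens to be.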

Quote:
Original post by undead
As a note, the array is long: it is made of 64-bit data. You are accessing it as if it were 32 bits wide.


No. The array is not very long (10,000 elements is nothing in game, graphics, or AI programming), nor can you tell from that code how big a single element is (long is only defined to be at least as big as an int). Looking at the exception message, I'd guess CodaKiller is on a 32-bit box, as the addresses are 4 bytes wide.

CodaKiller:
Where exactly does it break? For which loop element?

Quote:
Original post by CodaKiller

I'd really like to get the asm one to work since it's double as fast as the other one.


Running the asm on one element at a time has little purpose if you need to modify the entire array.

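	int n = 10000;
	// Value-initialize so the loops below do not read indeterminate values.
	unsigned long * baz = new unsigned long[n]();

	// C++ version: shift-and-mask swap of every element.
	// (Timer is a timing helper that is not shown in this post.)
	Timer t1;
	for( int i = 0; i < n; i++)
	{
		unsigned long Temp = baz[i];
		baz[i] =  (Temp >> 24) & 0x000000FF;
		baz[i] += (Temp >>  8) & 0x0000FF00;
		baz[i] += (Temp <<  8) & 0x00FF0000;
		baz[i] += (Temp << 24) & 0xFF000000;
	}
	std::cout << t1.elapsed_ms() << "ms\n";
	std::cout << std::hex << baz[2] << std::endl;

	// asm version: the whole loop lives inside a single __asm block.
	Timer t2;
	__asm {
		mov edx, [baz]                    // edx = base address of the array
		mov ecx, 0                        // ecx = element index
	bsloop:
		mov eax, dword ptr [edx+ecx*4]    // load element
		bswap eax                         // reverse its byte order
		mov dword ptr [edx+ecx*4], eax    // store it back
		inc ecx
		cmp ecx, n
		jl bsloop
	}
	std::cout << t2.elapsed_ms() << "ms\n";
	std::cout << std::hex << baz[2] << std::endl;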



Output:
Quote:
0.0519619ms
0
0.0187175ms
0

I actually expected there to be less difference, but it's quite a nice improvement.
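
If you want to repeat this kind of measurement without a custom Timer class, here is a rough sketch using std::chrono::steady_clock. The names time_ms, shift_swap32 and time_swap_pass are made up for illustration, and this is not the timer used above. Filling the array with a known nonzero pattern also lets the printed element confirm that the swap really happened, which a result of 0 cannot.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Run `work` once and return the elapsed wall-clock time in milliseconds.
template <typename Work>
double time_ms(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Shift-and-mask byte swap of one 32-bit value.
inline std::uint32_t shift_swap32(std::uint32_t v)
{
    return ((v >> 24) & 0x000000FFu) | ((v >>  8) & 0x0000FF00u) |
           ((v <<  8) & 0x00FF0000u) | ((v << 24) & 0xFF000000u);
}

// Time one in-place swap pass over the whole vector (hypothetical helper).
inline double time_swap_pass(std::vector<std::uint32_t>& data)
{
    assert(!data.empty());  // nothing worth timing otherwise
    return time_ms([&data] {
        for (std::uint32_t& v : data)
            v = shift_swap32(v);
    });
}
```

A single pass over 10,000 elements is very short, so in practice you would probably repeat the pass many times and average before trusting the numbers.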

You may find that the intrinsic function _byteswap_ulong is quicker. It should end up as a bswap instruction, but doesn't have the downside of hand-written assembly getting in the way of the optimizer.
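
As a rough sketch of what that could look like: _byteswap_ulong is the MSVC intrinsic declared in <stdlib.h>. The __builtin_bswap32 branch for GCC/Clang and the shift-based fallback are additions for portability, and the wrapper names bswap32 and bswap_array32 are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h>  // _byteswap_ulong
#endif

// Byte-swap one 32-bit value, preferring a compiler intrinsic.
inline std::uint32_t bswap32(std::uint32_t v)
{
#if defined(_MSC_VER)
    return _byteswap_ulong(v);      // MSVC intrinsic
#elif defined(__GNUC__)
    return __builtin_bswap32(v);    // GCC/Clang builtin
#else
    // Plain shift-and-mask fallback for other compilers.
    return ((v >> 24) & 0x000000FFu) | ((v >>  8) & 0x0000FF00u) |
           ((v <<  8) & 0x00FF0000u) | ((v << 24) & 0xFF000000u);
#endif
}

// Swap a whole array in place using the wrapper above.
inline void bswap_array32(std::uint32_t* data, std::size_t count)
{
    assert(data != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i)
        data[i] = bswap32(data[i]);
}
```

Since this is ordinary C++ as far as the compiler is concerned, it is free to inline and schedule the loop however it likes.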

Quote:
Original post by Antheus
cmp ecx, n

Shouldn't that be cmp ecx, [n]?

And here is how I would do it, completely untested:

mov edi, [pSrcData]
mov ecx, 10000
swaploop:
mov eax, [edi]
bswap eax
stosd
loop swaploop

Or without the string instructions, which might be faster on current architectures:

mov edi, [pSrcData]
mov ecx, 10000
swaploop:
mov eax, [edi]
bswap eax
mov [edi], eax
add edi, 4
dec ecx
jnz swaploop

For me, the C++ version needs 12 ms for 1,000,000 cycles, while the asm needs 4 ms. The following needs 8 ms. That is still worse than the asm, but it is a good trade-off between performant and comprehensible/maintainable code (do you really care about those few ms?):


unsigned char * b = reinterpret_cast<unsigned char *>(baz);
unsigned char Temp;
for( int i = 0; i < n; i++)
{
	Temp = *b; *b = b[3]; b[3] = Temp; // swap outer bytes
	Temp = b[1]; b[1] = b[2]; b[2] = Temp; // swap inner bytes
	b += 4;
}
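
The same byte-wise idea can also be written with std::reverse, which saves the manual temporaries. This is just a sketch: the name reverse_bytes_in_place is made up, and std::uint32_t is assumed as the element type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reverse the bytes of each 32-bit element by treating it as raw bytes.
inline void reverse_bytes_in_place(std::uint32_t* data, std::size_t count)
{
    assert(data != nullptr || count == 0);
    unsigned char* b = reinterpret_cast<unsigned char*>(data);
    for (std::size_t i = 0; i < count; ++i, b += sizeof(std::uint32_t))
        std::reverse(b, b + sizeof(std::uint32_t));
}
```

Accessing the elements through unsigned char is allowed by the aliasing rules, so this stays well-defined C++.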

Quote:
Original post by DevFred

mov edi, [pSrcData]
mov ecx, 10000
swaploop:
mov eax, [edi]
bswap eax
mov [edi], eax
add edi, 4
dec ecx
jnz swaploop


Of course you can also do manual loop unrolling to make it even faster:

mov edi, [pSrcData]
mov ecx, 5000 // only half as many iterations now!
swaploop:
mov eax, [edi]
mov edx, [edi + 4]
bswap eax
bswap edx
mov [edi], eax
mov [edi + 4], edx
add edi, 8
dec ecx
jnz swaploop

Can someone benchmark if this makes a difference?
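
Note that the unrolled asm assumes an even element count (5000 pairs). For comparison, here is a sketch of the same two-at-a-time idea in C++ with a tail step for odd counts. The names rev32 and swap_array_unrolled2 are made up for illustration, and whether it beats the plain loop is something you would have to measure, since compilers often unroll loops on their own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Shift-and-mask byte swap of one 32-bit value.
inline std::uint32_t rev32(std::uint32_t v)
{
    return ((v >> 24) & 0x000000FFu) | ((v >>  8) & 0x0000FF00u) |
           ((v <<  8) & 0x00FF0000u) | ((v << 24) & 0xFF000000u);
}

// Swap an array in place, two elements per loop iteration.
inline void swap_array_unrolled2(std::uint32_t* data, std::size_t count)
{
    assert(data != nullptr || count == 0);
    std::size_t i = 0;
    for (; i + 1 < count; i += 2) {
        const std::uint32_t a = data[i];      // load both elements first,
        const std::uint32_t b = data[i + 1];  // like eax/edx in the asm
        data[i]     = rev32(a);
        data[i + 1] = rev32(b);
    }
    if (i < count)                            // leftover element if count is odd
        data[i] = rev32(data[i]);
}
```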
