
# Why when I try to do this in asm it gives an access violation but not in C++?


## Recommended Posts

Basically, I'm trying to flip the byte order of an array of DWORDs. I made two versions of this function, one in C++ and the other in asm, but the asm version gives me this error: "First-chance exception at 0x004499bc in Engine.exe: 0xC0000005: Access violation reading location 0x002ff5fc." I'd really like to get the asm one to work since it's twice as fast as the other one. The C++ version:
	for( UINT i = 0; i < 10000; i++)
	{
		unsigned long Temp = pSrcData[i];
		pSrcData[i] =  (Temp >> 24) & 0x000000FF;
		pSrcData[i] += (Temp >>  8) & 0x0000FF00;
		pSrcData[i] += (Temp <<  8) & 0x00FF0000;
		pSrcData[i] += (Temp << 24) & 0xFF000000;
	}
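
For reference, the same shift-and-mask swap can be wrapped in a small function so it can be checked in isolation. This is only a sketch: the names `swap_bytes_32` and `swap_array` are made up for illustration, and `std::uint32_t` is assumed in place of `unsigned long` so the element width is fixed at 32 bits.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: reverse the byte order of one 32-bit value
// using the same shifts and masks as the loop above.
inline std::uint32_t swap_bytes_32(std::uint32_t v)
{
    return ((v >> 24) & 0x000000FFu) |
           ((v >>  8) & 0x0000FF00u) |
           ((v <<  8) & 0x00FF0000u) |
           ((v << 24) & 0xFF000000u);
}

// Hypothetical helper: apply the swap in place to every element.
inline void swap_array(std::uint32_t* data, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        data[i] = swap_bytes_32(data[i]);
}
```

Calling `swap_array(pSrcData, 10000)` would then correspond to the loop in the post, assuming `pSrcData` points at 32-bit elements.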


The ASM version:
	for( UINT i = 0; i < 10000; i++)
	{
		__asm
		{
			mov ebx, pSrcData[i];
			bswap ebx;
			mov pSrcData[i], ebx;
		}
	}


I don't really know asm; I found the code on a forum and tried to edit it to work the way I need it to.

##### Share on other sites
I'm not so familiar with inline x86 asm, but ages ago I used to write stuff in 100% asm.

As a note, the array is long: it is made of 64-bit data. You are accessing it as if it were 32 bits wide.

A hint: try to use registers (at least in a 100% asm environment). This kind of thing can happen when you think you are using the address to point at something, but you are actually using the content of the first item to do it.
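
One possible reading of this hint, sketched in C++ rather than asm: if `pSrcData` is a pointer variable, reaching an element takes two loads, first the address stored in the pointer and then the element at that address. The function name `load_element` is hypothetical, and whether the original code actually skipped the first load is an assumption, not something the thread confirms.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration of the two separate loads.
// pointer_variable is the location of the pointer itself (like the
// stack slot holding pSrcData), not the array it points to.
inline std::uint32_t load_element(std::uint32_t* const* pointer_variable,
                                  std::size_t i)
{
    // Load 1: fetch the address held by the pointer into a "register".
    std::uint32_t* base = *pointer_variable;
    // Load 2: fetch element i relative to that address.
    return base[i];
}
```

Indexing relative to the pointer variable's own location instead of `base` would read whatever happens to sit next to the pointer in memory, not the array.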

##### Share on other sites
Quote:
 Original post by undead
 I'm not so familiar with inline x86 asm, but ages ago I used to write stuff in 100% asm.
 As a note, the array is long: it is made of 64-bit data. You are accessing it as if it were 32 bits wide.
 A hint: try to use registers (at least in a 100% asm environment). This kind of thing can happen when you think you are using the address to point at something, but you are actually using the content of the first item to do it.

Is there a way you can show me what you're talking about, or at least link me to a page that explains it a bit more?

##### Share on other sites
Try this. It works for me, but I'm not sure about the speed, as I haven't benchmarked it.

	__asm
	{
		mov         ecx,dword ptr [i]
		mov         eax,dword ptr pSrcData[ecx*4]
		bswap       eax
		mov         ecx,dword ptr [i]
		mov         dword ptr pDstData[ecx*4],eax
	}

The modifications come from reading the disassembly of the C++ solution.

Quote:
 As a note, the array is long: it is made of 64-bit data. You are accessing it as if it were 32 bits wide.

On x86, long and unsigned long are both 32 bits wide, so it's perfectly valid to access them like this.
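
Since the width of `long` depends on the platform and compiler, a compile-time check can make the assumption explicit. The sketch below is not from the thread; `ulong_bits` is a hypothetical helper, and using `std::uint32_t` for the array elements is one way to sidestep the question entirely.

```cpp
#include <climits>
#include <cstdint>

// Hypothetical helper: number of bits in unsigned long on this platform.
constexpr unsigned ulong_bits()
{
    return static_cast<unsigned>(sizeof(unsigned long) * CHAR_BIT);
}

// The C++ standard only guarantees that long has at least 32 bits.
static_assert(sizeof(unsigned long) * CHAR_BIT >= 32,
              "unsigned long must have at least 32 bits");

// std::uint32_t, where available, is exactly 32 bits wide, so a
// 32-bit bswap always matches the element size.
static_assert(sizeof(std::uint32_t) * CHAR_BIT == 32,
              "std::uint32_t must be exactly 32 bits");
```

On a platform where `ulong_bits()` returns 64, a 32-bit swap over an `unsigned long` array would only touch half of each element.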

##### Share on other sites
Quote:
 Original post by undead
 As a note, the array is long: it is made of 64-bit data. You are accessing it as if it were 32 bits wide.

No. The array is not very long (10,000 elements is nothing in game, graphics, or AI programming), nor can you tell from that code how big a single element is (long is defined to be at least as big as an int). Looking at the exception message, my guess is that CodaKiller is on a 32-bit box, as the addresses are 4 bytes wide.

CodaKiller:
Where exactly does it break? For which loop element?

##### Share on other sites
Quote:
 Original post by CodaKiller
 I'd really like to get the asm one to work since it's twice as fast as the other one.

Calling it for each element individually has little purpose if you need to modify the entire array.

	int n = 10000;
	unsigned long * baz = new unsigned long[n];

	Timer t1;
	for( int i = 0; i < n; i++)
	{
		unsigned long Temp = baz[i];
		baz[i] =  (Temp >> 24) & 0x000000FF;
		baz[i] += (Temp >>  8) & 0x0000FF00;
		baz[i] += (Temp <<  8) & 0x00FF0000;
		baz[i] += (Temp << 24) & 0xFF000000;
	}
	std::cout << t1.elapsed_ms() << "ms\n";
	std::cout << std::hex << baz[2] << std::endl;

	Timer t2;
	__asm {
		mov edx, [baz]
		mov ecx, 0
	bsloop:
		mov eax, dword ptr [edx+ecx*4]
		bswap eax
		mov dword ptr [edx+ecx*4], eax
		inc ecx
		cmp ecx, n
		jl bsloop
	}
	std::cout << t2.elapsed_ms() << "ms\n";
	std::cout << std::hex << baz[2] << std::endl;

Output:
Quote:
 0.0519619ms
 0
 0.0187175ms
 0

I actually expected there to be less difference, but it's quite a nice improvement.

##### Share on other sites
You may find that the intrinsic function _byteswap_ulong is quicker. It should end up as a bswap instruction, but it doesn't have the downside of hand-written assembly getting in the way of the optimizer.
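
A minimal sketch of how the intrinsic could be wrapped so the same call also builds outside MSVC. The wrapper name `byteswap32` is made up for illustration; `_byteswap_ulong` is the MSVC intrinsic mentioned above, `__builtin_bswap32` is the GCC/Clang builtin, and the last branch is a plain shift-and-mask fallback for other compilers.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h>   // declares _byteswap_ulong on MSVC
#endif

// Hypothetical wrapper: picks a compiler intrinsic when one is known,
// otherwise falls back to shifts and masks.
inline std::uint32_t byteswap32(std::uint32_t v)
{
#if defined(_MSC_VER)
    return _byteswap_ulong(v);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(v);
#else
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
#endif
}
```

Because the compiler understands what the intrinsic does, it stays free to inline and schedule the surrounding loop, which an `__asm` block tends to prevent.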

##### Share on other sites
Quote:
 Original post by Antheus
 cmp ecx, n

Shouldn't that be cmp ecx, [n]?

And here is how I would do it, completely untested:
    mov edi, [pSrcData]
    mov ecx, 10000
swaploop:
    mov eax, [edi]
    bswap eax
    stosd
    loop swaploop

Or without the string instructions, might be faster on current architectures:
    mov edi, [pSrcData]
    mov ecx, 10000
swaploop:
    mov eax, [edi]
    bswap eax
    mov [edi], eax
    add edi, 4
    dec ecx
    jnz swaploop

##### Share on other sites
For me, the C++ version needs 12 ms for 1,000,000 cycles, while the asm needs 4 ms. The following needs 8 ms. That is still worse than the asm, but it's a good trade-off between performance and comprehensible, maintainable code (do you really care about those few ms?):

  unsigned char * b = reinterpret_cast<unsigned char *>(baz);
  unsigned char Temp;
  for( int i = 0; i < n; i++)
  {
    Temp = *b; *b = b[3]; b[3] = Temp; // swap
    Temp = b[1]; b[1] = b[2]; b[2] = Temp; // swap
    b += 4;
  }
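
The same byte-wise idea can also be written with `std::reverse` instead of manual swaps. This is a sketch under the assumption that the elements are exactly 32 bits wide (`std::uint32_t`); the function name `reverse_bytes_in_place` is hypothetical, and no claim is made here about how its speed compares to the versions above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: reverse the four bytes of every element in place.
inline void reverse_bytes_in_place(std::uint32_t* data, std::size_t count)
{
    // Accessing object bytes through unsigned char* is permitted.
    unsigned char* b = reinterpret_cast<unsigned char*>(data);
    for (std::size_t i = 0; i < count; ++i, b += sizeof(std::uint32_t))
        std::reverse(b, b + sizeof(std::uint32_t));
}
```

Reversing the bytes of the in-memory representation gives the byte-swapped value regardless of the machine's endianness.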

##### Share on other sites
Quote:
 Original post by DevFred
     mov edi, [pSrcData]
     mov ecx, 10000
 swaploop:
     mov eax, [edi]
     bswap eax
     mov [edi], eax
     add edi, 4
     dec ecx
     jnz swaploop

Of course you can also do manual loop unrolling to make it even faster:
    mov edi, [pSrcData]
    mov ecx, 5000   // only half as many iterations now!
swaploop:
    mov eax, [edi]
    mov edx, [edi + 4]
    bswap eax
    bswap edx
    mov [edi], eax
    mov [edi + 4], edx
    add edi, 8
    dec ecx
    jnz swaploop

Can someone benchmark if this makes a difference?
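
A rough way to try this without inline asm is sketched below in portable C++ with `std::chrono`. Everything here is an assumption for illustration: the names `swap32`, `swap_loop`, `swap_loop_unrolled` and `time_ms` are made up, the unrolled loop only mirrors the structure of the assembly above, and an optimizing compiler may unroll or vectorize both loops on its own, so the measured difference can vanish.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: shift-and-mask swap of one 32-bit value.
inline std::uint32_t swap32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// One element per iteration.
inline void swap_loop(std::uint32_t* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        p[i] = swap32(p[i]);
}

// Two elements per iteration, like the unrolled assembly; the final
// branch handles an odd element count.
inline void swap_loop_unrolled(std::uint32_t* p, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        std::uint32_t a = p[i];
        std::uint32_t b = p[i + 1];
        p[i]     = swap32(a);
        p[i + 1] = swap32(b);
    }
    if (i < n)
        p[i] = swap32(p[i]);
}

// Time a single call of fn on its own copy of the data, in milliseconds.
template <typename Fn>
double time_ms(Fn fn, std::vector<std::uint32_t> data)
{
    auto start = std::chrono::steady_clock::now();
    fn(data.data(), data.size());
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_ms(swap_loop, data)` and `time_ms(swap_loop_unrolled, data)` several times on a large vector and comparing the best runs gives a first impression; a single short run is mostly noise.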