MinGW DLL used in MinGW and VC++ app (memory alignment)

10 comments, last by kek_miyu 16 years, 3 months ago
Hi, I'm in a bit over my head and would appreciate any ideas concerning my problem.

Why do I get memory alignment crashes if I load and use a MinGW DLL in an application compiled with VC++ (Express 2005), but not if I compile the same application (.cpp file) with MinGW?

I'm developing a DLL that wraps a 3rd party library (FFMPEG, used for video encoding). My DLL has a simple interface: basically I just pass in an FFMPEG command line (char*) and all the heavy processing is done in the DLL function (which in turn uses functionality from the FFMPEG DLLs). I've compiled my DLL and FFMPEG using MinGW (the only Windows compiler supported by FFMPEG).

I've created a simple test application which encodes a bunch of videos (30 of them) in a loop, and compiled it with both MinGW and VC++. The MinGW version runs fine, with or without the SSE-optimized routines. The VC++ version only works if I disable FFMPEG's SSE/MMX-optimized routines at run time (via a command line parameter). If I don't, I get a SEGFAULT in an SSE2 DCT (Discrete Cosine Transform) routine. This seems to be related to memory not being aligned properly (16-byte alignment is required for SSE2, I believe).

I don't pass any memory across the DLL boundary that's used directly by any of the FFMPEG routines. The DLL is pretty self-contained, so it's not my test application that supplies unaligned memory and causes the crash. Any ideas what this problem might be? Why would loading the DLL in a VC++ app mess up memory alignment?

I can add that in the MinGW application the SSE2 code isn't much faster than the non-SSE-optimized code: the 30 videos are encoded in 40 vs 42 seconds. So I could disable it, if it weren't for the fact that in the VC++ application the non-SSE-optimized code suddenly takes 53 seconds, roughly 25% longer. The application loop has no real processing overhead, so this is just the DLL function running 25% slower. My guess is that this could also be due to memory misalignment: it works, but at reduced performance.

I'm not sure whether the misaligned memory is on the heap or on the stack, and debugging with GDB doesn't make me much wiser. I do wonder why the block1 pointer, which is set to point to some (hopefully) aligned stack memory, is reported as 0x3.

SOME GDB INFO:
--------------
Program received signal SIGSEGV, Segmentation fault.
0x6875b33e in ff_fdct_sse2 (block=0x3f1e0) at i386/fdct_mmx.c:369
(gdb) print block
$4 = (int16_t *) 0x3f1e0
(gdb) print block1
$5 = (int16_t * const) 0x3
(gdb) print align_tmp
$6 = {0 <repeats 16 times>}

SOURCE CODE:
------------

void ff_fdct_sse2(int16_t *block)
{
    int64_t align_tmp[16] ATTR_ALIGN(16);
    int16_t * const block1= (int16_t*)align_tmp;

    fdct_col_sse2(block, block1, 0); // <---- SEGFAULT calling this
    fdct_row_sse2(block1, block);
}

Might be a calling convention incompatibility that leads to stack corruption.

You haven't specified how you load the DLL into your application's address space. Do you load it manually through LoadLibrary/GetProcAddress and friends, or do you have the application runtime do it implicitly via an import lib? The latter can sometimes create problems if the DLL and host app are compiled with different compilers.

Usually, you're fine mixing DLLs and host apps from different compilers if you:

* Keep the interfaces pure C
* do explicit dynamic linking
* have your entry points and calling convention well behaved (ie. extern C, _cdecl, etc)
* don't do weird things in DllMain
* are careful about passing around pointers to heap memory managed by different runtimes
* look at padding and structure packing if you pass around pointers to structs (you don't seem to do that, though)
* are very careful about compatible code generation and register usage (stack frame pointers, EBP/ESP usage, register state preservation, etc)
Thanks for the feedback Yann.

>> * Keep the interfaces pure C

I have a very simple interface, with no complex data type passing. The URLProtocol structure (below in source) just hold a number of function pointers and the structure is 36 bytes in both the MinGW and VC++ application.

extern "C"
{
	typedef void (*external_log_func_ptr)(int, const char*);

//	typedef void (_cdecl *FFMPEG_INIT_FUNC)();
//	typedef int (_cdecl *FFMPEG_REGISTER_PROTOCOL_FUNC)(URLProtocol*);
//	typedef void (_cdecl *FFMPEG_CLOSE_FUNC)();
//	typedef int (_cdecl *FFMPEG_MAIN_FUNC)(char*);
//	typedef void (_cdecl *REGISTER_EXTERNAL_LOG_FUNC)(external_log_func_ptr);

	extern void ffmpeg_init();
	extern int ffmpeg_register_protocol(URLProtocol*);
	extern void ffmpeg_close();
	extern int ffmpeg_main(char*);
	extern void register_external_log(external_log_func_ptr);
}


>> * do explicit dynamic linking

I originally just used GetProcAddress (you see the function pointer typedefs commented out in the above source code). I tried linking against a VC .lib as well, generated from the MinGW .def file. Both crash equally.

>> * have your entry points and calling convention well behaved (ie. extern C, _cdecl, etc)

The DLL is compiled with the MinGW C compiler, so it should produce _cdecl functions. VC++ complains if I try the wrong calling convention (i.e. __stdcall). My functions are entered fine; I'm just not sure why only the SSE path would suffer from any sort of stack corruption.

>> * don't do weird things in DllMain

The DLL has no DllMain.

>> * are careful about passing around pointers to heap memory managed by different runtimes

There's no malloc/free/new/delete across the DLL boundary. There are only some memcpy calls that copy the results in callback functions, but this is fine.

>> * look at padding and structure packing if you pass around pointers to structs (you don't seem to do that, though)

Only the URLProtocol is anything but a standard type. It's the same size in both compilers and its function pointers seem to work fine, generating callbacks up until the crash.
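A matching sizeof doesn't by itself prove that every member sits at the same offset under both compilers. One cheap way to pin that down would be compile-time checks that get built on both the MinGW and the VC++ side. The sketch below uses a made-up struct, ExampleProtocol, with hypothetical member names; it is not the real URLProtocol definition, and it assumes function pointers are the same size as void* (true on the usual x86 targets).

```c
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical stand-in for a struct of callbacks shared across a DLL boundary. */
typedef struct ExampleProtocol {
    const char *name;
    int (*open_cb)(void *ctx, const char *filename, int flags);
    int (*read_cb)(void *ctx, unsigned char *buf, int size);
    int (*write_cb)(void *ctx, unsigned char *buf, int size);
    int (*close_cb)(void *ctx);
} ExampleProtocol;

/* C89-style compile-time assertion: the array size is -1 (an error) if cond is false. */
#define LAYOUT_CHECK(tag, cond) typedef char layout_check_##tag[(cond) ? 1 : -1]

/* Five pointer-sized members, no padding expected between them. */
LAYOUT_CHECK(total_size, sizeof(ExampleProtocol) == 5 * sizeof(void *));
LAYOUT_CHECK(name_offset, offsetof(ExampleProtocol, name) == 0);
LAYOUT_CHECK(close_offset, offsetof(ExampleProtocol, close_cb) == 4 * sizeof(void *));

/* Runtime helper so both builds can log the same number for comparison. */
size_t example_protocol_size(void)
{
    return sizeof(ExampleProtocol);
}
```

If either compiler packed the struct differently, that build would fail at the LAYOUT_CHECK lines instead of misbehaving at run time.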

>> * are very careful about compatible code generation and register usage (stack frame pointers, EBP/ESP usage, register state preservation, etc)

Well, perhaps this is the biggest question mark to me. Much of FFMPEG is compiled using "omit-frame-pointer". Thus, I've also compiled the DLL source files with this switch. Omitting the frame pointer in VC++ makes no difference. I also tried compiling my DLL (not FFMPEG) without omitting the frame pointer, but no improvement.

Are there any other sort of switches I perhaps should consider? I can't really see any other interesting switches in the VC project properties to tweak. Do I seem consistent enough with the calling conventions?
Well, the VC++ debugger shows this in the disassembler:

6875B334  psubsw      xmm3,xmmword ptr [ecx+edx*2+50h]
6875B33A  psubsw      xmm4,xmm6
6875B33E  movdqa      xmmword ptr [esp+20h],xmm1       // <--- ACCESS VIOLATION
6875B344  paddsw      xmm7,xmm6
6875B348  movdqa      xmm1,xmmword ptr [ecx+edx*2+30h]
6875B34E  psllw       xmm3,4

EAX = 08F22EE0 EBX = 00000003 ECX = 08F22EE0 EDX = 00000000
ESI = 08E60D90 EDI = 08F25360 EIP = 6875B33E ESP = 001334AC
EBP = 08E60D90 EFL = 00200216

XMM0DL = +1.3905014071E-309#DEN           XMM0DH = -1.#QNAN000000000E+000
XMM1DL = +1.3906647968E-308#DEN           XMM1DH = -1.#QNAN000000000E+000
XMM2DL = +9.45642809092505E-308           XMM2DH = +4.45048724255787E-308
XMM3DL = +4.1720134853E-309#DEN           XMM3DH = +4.1720771459E-309#DEN
XMM4DL = +1.1125539052E-308#DEN           XMM4DH = +1.3903528544E-309#DEN
XMM5DL = +4.1720134889E-309#DEN           XMM5DH = -1.#QNAN000000000E+000
XMM6DL = +1.91059194736570E-211           XMM6DH = +3.49222795425766E-215
XMM7DL = +2.86115223469088E-211           XMM7DH = +3.49219254285733E-215

001334CC = 00000000000000000000000000000000


It seems like it's trying to store the 128-bit xmm1 register with movdqa (move aligned double quadword) to memory address [esp+20h]. With ESP = 001334AC that address is 001334CC, which is not 16-byte aligned (the lowest nibble isn't 0).
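The address arithmetic behind that observation can be written out as a small sketch. The two helper names are made up for illustration; the constants come straight from the register dump and the GDB output earlier in the thread.

```c
#include <stdint.h>

/* Returns 1 if addr is a multiple of 16, i.e. its lowest hex digit is 0. */
int is_aligned_16(uint32_t addr)
{
    return (addr & 0xFu) == 0;
}

/* Address written by the faulting instruction: movdqa xmmword ptr [esp+20h],xmm1 */
uint32_t faulting_address(uint32_t esp)
{
    return esp + 0x20u;
}
```

Feeding in ESP = 0x001334AC gives 0x001334CC, which fails the check, while the block pointer 0x3f1e0 from GDB passes it. Since 0x20 is itself a multiple of 16, the store can only be aligned if ESP is.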

Due to the fact it's related to the ESP register as well, I feel like it might be connected to the omit-frame-pointer options, but changing this in VC++ doesn't make a difference.

Any more ideas? I know it's quite tricky..
Madness. Well, in my simple VC++ test application, I can get the DLL function to not crash if I swap to another CRT (i.e. not using the Multithreaded DLL one). I don't know if this is a fluke or not. Should I presume that my MinGW DLL uses the Multithreaded CRT and that using the same CRT in VC++ blows things up? I don't understand this, but something is definitely not right.

It's not an option in my real application (as opposed to the test application) to swap to a different CRT, as that links against other DLL libraries as well (built in VC++).

Any ideas on this?
Sorry for replying to my own post at this frequency.

As I suspected, it was a fluke that it worked with the other CRTs. If I create some dummy variables on the stack before calling my DLL function it works or crashes, presumably based on how things end up being aligned on the stack somewhere down in the library calls. I.e. I can get it to work with the proper Multithreaded DLL CRT by declaring an integer array "int a[3]" on the stack before calling the function, and it won't crash:

for(i = 0; i < 10*NUM_CMD_LINES; i++)
{
	int a[3];
	int iCmd = i%NUM_CMD_LINES;
	sprintf_s(buf, 512, "ffmpeg.exe -dct 3 %s mem:out%04d.%s", cmdLines[iCmd].options, i, cmdLines[iCmd].ext);
	buf[511] = 0;
	ffmpeg_main(buf);
}


The FFMPEG source code seems to align whatever needs aligning using ATTR_ALIGN(16) in the relevant places. Why would this kind of stack alignment stop working? I assume the stack alignment is something resolved at run time. Is there any reason the alignment wouldn't work when using a MinGW DLL from a VC++ application?
OK, last post ;) On 32-bit x86, MSVC++ only keeps the stack aligned to 4-byte boundaries, while GCC keeps it aligned to 16-byte boundaries and assumes its callers do the same. If the stack is not 16-byte aligned when the GCC function is called, it stays misaligned inside the DLL and all kinds of SSE code will blow up. I haven't decided how to handle this yet; some sort of dirty hack to serve my purpose, I guess.
Oh yeah, now that you mention it, I vaguely remember something about GCC not adjusting stack alignment before a SIMD block, because it would just 'assume' the stack pointer passed from the calling function was aligned. That's obviously faster, but fails if the function is called from a misaligned stack. I thought they would've fixed that by now.

Anyway, this obviously explains the strange behaviour you got. If you are certain that GCC generates correct alignment on all function calls internally in the DLL (which I would hope to be the case), then the problem is your interface functions to MSVC. Since you don't have too many, and they're not called at high frequency, you could just realign the stack when you enter them. There should be a GCC intrinsic to do that.
I've been looking for a way to align the stack in my DLL, but I haven't found anything that will work yet.

Am I correct to assume that inserting x number of byte allocations on the stack, until it "works" on my machine, might still break on another machine/environment if the allocated thread stack gets a different lowest address nibble? Does it consistently work on my machine because the VC++ environment keeps allocating the same address for the thread stack?

It seems a newer GCC 4.x branch will support __attribute__ ((force_align_arg_pointer)), which forces the stack to be aligned to 16 bytes on entry/exit of a function by inserting some extra ASM prologue/epilogue. However, I'm using the MinGW version recommended for FFMPEG (GCC 3.x something), so I don't have this feature. There's talk of other things as well:
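For anyone who does have a GCC with that attribute, here is a minimal sketch of how it might be applied to an exported entry point. The function example_entry and its body are hypothetical stand-ins, not part of FFMPEG or of my DLL; the macro is only there so the file still compiles on compilers or targets where the attribute does not apply.

```c
#include <stdint.h>

/* The attribute is specific to GCC-compatible compilers targeting x86;
   elsewhere the macro expands to nothing. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define FORCE_ALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define FORCE_ALIGN_STACK
#endif

/* Hypothetical exported entry point. With the attribute, GCC emits a prologue
   that realigns the stack pointer, so a caller that only guarantees 4-byte
   stack alignment should no longer matter to this function or its callees. */
FORCE_ALIGN_STACK
int example_entry(const char *cmdline)
{
    /* 16-byte aligned scratch buffer, similar in spirit to align_tmp in ff_fdct_sse2. */
    int64_t scratch[16] __attribute__((aligned(16)));
    int i;
    int len = 0;

    for (i = 0; i < 16; i++)
        scratch[i] = 0;

    /* Placeholder work: count the characters in the command line. */
    while (cmdline[len] != '\0')
        len++;
    scratch[0] = len;

    return (int)scratch[0];
}
```

Only the functions that are called directly from the VC++ side would need the attribute, on the assumption that GCC keeps the stack aligned for every call it generates inside the DLL.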

http://readlist.com/lists/gcc.gnu.org/gcc/3/17895.html

I've googled my way all over the place, but coming up short.

I'm not that good at ASM, so I cannot tell if it'd be possible to make some ASM hack to align the stack, similar to the prologue/epilogue code in the above link. Anyone have any suggestion how I might do the alignment?
Okay, this is the only thing I've managed. The function doesn't crash now, even if I create new stack objects in the VC++ calling function.

	// RELEASE: Hack-hack.. Allocate some additional memory on stack to get the stack aligned when calling GCC DLL
	void* align1 = _malloca(16);
	void* align2 = _malloca(4);
	mainOK = m_ffmpegAPI->main(cmdLine);


I can't say exactly why I need these two dynamic allocations, but it seems as if the 16-byte allocation forces an alignment to 16 bytes and the 4-byte allocation compensates for some parameter passing to the DLL function. That theory might be complete BS though ;)
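The "compensates for parameter passing" part can at least be checked on paper. The sketch below assumes that the GCC-built function expects ESP to be a multiple of 16 at the moment of the call instruction, after the arguments have been pushed; that is my understanding of GCC's convention, not something I have verified for this MinGW build. The helper padding_for_call is made up, and the ESP values in the note after it are examples only. It does not model whatever _malloca does internally.

```c
#include <stdint.h>

/*
 * Number of bytes a caller would have to subtract from ESP before pushing
 * arg_bytes of arguments, so that ESP is a multiple of 16 at the call.
 * Assumption: the callee expects 16-byte alignment at the call instruction.
 */
uint32_t padding_for_call(uint32_t esp, uint32_t arg_bytes)
{
    /* We need (esp - pad - arg_bytes) % 16 == 0, so pad = (esp - arg_bytes) mod 16. */
    return (esp - arg_bytes) & 0xFu;
}
```

For a single 4-byte argument, an ESP of 0x0012FF44 needs no padding, 0x0012FF48 needs 4 bytes and 0x0012FF40 needs 12. The required amount depends on the low nibble of ESP at the call site, so a fixed amount of padding is only right for as long as that nibble stays the same.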

Does anyone know if what I've written makes any sense and if it's stable? Or better yet, a nicer way to solve this?
