MinGW DLL used in MinGW and VC++ app (memory alignment)


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

11 replies to this topic

#1 kek_miyu   Members   -  Reputation: 122

Posted 09 February 2008 - 12:24 AM

Hi, I'm in a bit over my head and would appreciate any ideas about my problem. Why do I get memory alignment crashes if I load and use a MinGW DLL in an application compiled with VC++ (Express 2005), but not if I compile the same application (.cpp file) with MinGW?

I'm developing a DLL that wraps a 3rd party library (FFMPEG, used for video encoding). The interface to my DLL is simple: I just pass in an FFMPEG command line (char*) and all the heavy processing is done in the DLL function (which in turn uses functionality from the FFMPEG DLLs). I've compiled my DLL and FFMPEG using MinGW (the only Windows compiler supported by FFMPEG).

I've created a simple test application which encodes a bunch of videos (30 of them) in a loop, and compiled it with both MinGW and VC++. The MinGW version runs fine, with or without the SSE optimized routines. The VC++ version only works if I disable FFMPEG's SSE/MMX optimized routines at run time (via a command line parameter). If I don't, I get a SEGFAULT in an SSE2 DCT (Discrete Cosine Transform) routine. This seems to be related to memory not being aligned properly (16 byte alignment is required for SSE2, I believe).

I don't pass any memory across the DLL boundary that's used directly by any of the FFMPEG routines. The DLL is pretty self contained, so it's not my test application that supplies unaligned memory and causes the crash. Any ideas what this problem might be? Why would the DLL loaded in a VC++ app mess up memory alignment?

I can add that in the MinGW application the SSE2 code isn't much faster than the non-SSE-optimized code: the 30 videos are encoded in 40 vs 42 seconds. So I could disable it, if it weren't for the fact that in the VC++ compiled application the non-SSE-optimized code suddenly takes 53 seconds, about 25% longer. The application loop has no real processing overhead, so this is just the DLL function running 25% slower. My guess is that this could also be due to memory misalignment: it works, but at reduced performance.

I'm not quite sure if it's misaligned memory on the heap or on the stack; debugging with GDB doesn't make me much wiser. Although I wonder why the block1 pointer, which is set to point to some (hopefully) aligned stack memory, is reported to be set to 0x3.

SOME GDB INFO:
--------------
Program received signal SIGSEGV, Segmentation fault.
0x6875b33e in ff_fdct_sse2 (block=0x3f1e0) at i386/fdct_mmx.c:369
(gdb) print block
$4 = (int16_t *) 0x3f1e0
(gdb) print block1
$5 = (int16_t * const) 0x3
(gdb) print align_tmp
$6 = {0 <repeats 16 times>}

SOURCE CODE:
------------
void ff_fdct_sse2(int16_t *block)
{
    int64_t align_tmp[16] ATTR_ALIGN(16);
    int16_t * const block1= (int16_t*)align_tmp;

    fdct_col_sse2(block, block1, 0); // <---- SEGFAULT calling this
    fdct_row_sse2(block1, block);
}



#2 Yann L   Moderators   -  Reputation: 1798

Posted 09 February 2008 - 02:59 AM

Might be a calling convention incompatibility that leads to stack corruption.

You haven't specified how you load the DLL into your application's address space. Do you load it manually through LoadLibrary/GetProcAddress and friends, or do you have the application runtime do it implicitly via an import lib? The latter can sometimes create problems if the DLL and host app are compiled with different compilers.

Usually, you're fine mixing DLLs and host apps from different compilers if you:

* Keep the interfaces pure C
* do explicit dynamic linking
* have your entry points and calling convention well behaved (ie. extern C, _cdecl, etc)
* don't do weird things in DllMain
* are careful about passing around pointers to heap memory managed by different runtimes
* look at padding and structure packing if you pass around pointers to structs (you don't seem to do that, though)
* are very careful about compatible code generation and register usage (stack frame pointers, EBP/ESP usage, register state preservation, etc)

#3 kek_miyu   Members   -  Reputation: 122

Posted 09 February 2008 - 05:33 AM

Thanks for the feedback Yann.

>> * Keep the interfaces pure C

I have a very simple interface, with no complex data types being passed. The URLProtocol structure (used in the source below) just holds a number of function pointers, and the structure is 36 bytes in both the MinGW and the VC++ application.


extern "C"
{
typedef void (*external_log_func_ptr)(int, const char*);

// typedef void (_cdecl *FFMPEG_INIT_FUNC)();
// typedef int (_cdecl *FFMPEG_REGISTER_PROTOCOL_FUNC)(URLProtocol*);
// typedef void (_cdecl *FFMPEG_CLOSE_FUNC)();
// typedef int (_cdecl *FFMPEG_MAIN_FUNC)(char*);
// typedef void (_cdecl *REGISTER_EXTERNAL_LOG_FUNC)(external_log_func_ptr);

extern void ffmpeg_init();
extern int ffmpeg_register_protocol(URLProtocol*);
extern void ffmpeg_close();
extern int ffmpeg_main(char*);
extern void register_external_log(external_log_func_ptr);
}



>> * do explicit dynamic linking

I originally just used GetProcAddress (you see the function pointer typedefs commented out in the above source code). I tried linking against a VC .lib as well, generated from the MinGW .def file. Both crash equally.

>> * have your entry points and calling convention well behaved (ie. extern C, _cdecl, etc)

The DLL is compiled with the MinGW C compiler, so it should produce _cdecl functions. VC++ complains if I try the wrong calling convention (i.e. __stdcall). My functions are entered fine; I'm just not sure why only the SSE path would suffer from any sort of stack corruption.

>> * don't do weird things in DllMain

The DLL has no DllMain.

>> * are careful about passing around pointers to heap memory managed by different runtimes

There's no malloc/free/new/delete across the DLL boundary. Only some memcpy:s of the results in callback functions, but this is fine.

>> * look at padding and structure packing if you pass around pointers to structs (you don't seem to do that, though)

Only the URLProtocol is anything but a standard type. It's the same size in both compilers and its function pointers seem to work fine, generating callbacks up until the crash.
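The "same size in both compilers" check can also be written into the code, so it runs on both sides of the DLL boundary instead of being compared by hand. This is only a sketch: DemoProtocol and demo_protocol_checked_size are made-up names, and the struct is a stand-in, not the real FFMPEG URLProtocol definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a struct of callbacks shared across the DLL
   boundary. This is NOT the real FFMPEG URLProtocol layout. */
typedef struct DemoProtocol {
    const char *name;
    int (*open_fn)(void *ctx, const char *url, int flags);
    int (*read_fn)(void *ctx, unsigned char *buf, int size);
    int (*close_fn)(void *ctx);
} DemoProtocol;

/* Checks the offsets of the first two members and returns the struct size.
   Built into both the DLL and the host application, the two results can be
   compared to catch packing or padding differences between compilers. */
static inline size_t demo_protocol_checked_size(void)
{
    assert(offsetof(DemoProtocol, name) == 0);
    assert(offsetof(DemoProtocol, open_fn) == sizeof(const char *));
    return sizeof(DemoProtocol);
}
```

A matching size and matching offsets only rule out a layout mismatch; they say nothing about how the stack is aligned when the callbacks or the DLL functions are entered.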

>> * are very careful about compatible code generation and register usage (stack frame pointers, EBP/ESP usage, register state preservation, etc)

Well, perhaps this is the biggest question mark to me. Much of FFMPEG is compiled using "omit-frame-pointer". Thus, I've also compiled the DLL source files with this switch. Omitting the frame pointer in VC++ makes no difference. I also tried compiling my DLL (not FFMPEG) without omitting the frame pointer, but no improvement.

Are there any other sort of switches I perhaps should consider? I can't really see any other interesting switches in the VC project properties to tweak. Do I seem consistent enough with the calling conventions?

#4 kek_miyu   Members   -  Reputation: 122

Posted 10 February 2008 - 12:04 AM

Well, the VC++ debugger shows this in the disassembler:


6875B334 psubsw xmm3,xmmword ptr [ecx+edx*2+50h]
6875B33A psubsw xmm4,xmm6
6875B33E movdqa xmmword ptr [esp+20h],xmm1 // <--- ACCESS VIOLATION
6875B344 paddsw xmm7,xmm6
6875B348 movdqa xmm1,xmmword ptr [ecx+edx*2+30h]
6875B34E psllw xmm3,4


EAX = 08F22EE0 EBX = 00000003 ECX = 08F22EE0 EDX = 00000000 ESI = 08E60D90 EDI = 08F25360 EIP = 6875B33E ESP = 001334AC EBP = 08E60D90 EFL = 00200216

XMM0DL = +1.3905014071E-309#DEN XMM0DH = -1.#QNAN000000000E+000 XMM1DL = +1.3906647968E-308#DEN XMM1DH = -1.#QNAN000000000E+000
XMM2DL = +9.45642809092505E-308 XMM2DH = +4.45048724255787E-308 XMM3DL = +4.1720134853E-309#DEN XMM3DH = +4.1720771459E-309#DEN
XMM4DL = +1.1125539052E-308#DEN XMM4DH = +1.3903528544E-309#DEN XMM5DL = +4.1720134889E-309#DEN XMM5DH = -1.#QNAN000000000E+000
XMM6DL = +1.91059194736570E-211 XMM6DH = +3.49222795425766E-215 XMM7DL = +2.86115223469088E-211 XMM7DH = +3.49219254285733E-215

001334CC = 00000000000000000000000000000000



It seems like it's trying to store the 128-bit xmm1 register with movdqa (move aligned double quadword) to the memory address [esp+20h]. With ESP = 001334AC that address is 001334CC, which is not 16-byte aligned (the lowest nibble is C, not 0).
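The arithmetic behind that observation can be sketched in a few lines of C. The helper names here are made up for illustration; the numbers are the ESP value and the 20h displacement from the register dump and disassembly above.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of a [base+displacement] operand such as [esp+20h]. */
static inline uintptr_t effective_address(uintptr_t base, uintptr_t displacement)
{
    return base + displacement;
}

/* Returns 1 if addr is a multiple of alignment, 0 otherwise.
   alignment must be a power of two (16 for movdqa). */
static inline int is_aligned(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}
```

Calling effective_address(0x001334AC, 0x20) gives 0x001334CC, and is_aligned(0x001334CC, 16) returns 0 while is_aligned(0x001334CC, 4) returns 1. In other words the address satisfies a 4-byte alignment but not the 16-byte alignment that movdqa needs.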

Due to the fact it's related to the ESP register as well, I feel like it might be connected to the omit-frame-pointer options, but changing this in VC++ doesn't make a difference.

Any more ideas? I know it's quite tricky..

#5 kek_miyu   Members   -  Reputation: 122

Posted 10 February 2008 - 12:29 AM

Madness. Well, in my simple VC++ test application, I can get the DLL function to not crash if I swap to another CRT (i.e. not using the Multithreaded DLL one). I don't know if this is a fluke or not. Shall I presume that my MinGW DLL uses the Multithreaded CRT and that using the same CRT in VC++ blows things up? I don't understand this, but something is definitely not right.

It's not an option in my real application (as opposed to the test application) to swap to a different CRT, as that links against other DLL libraries as well (built in VC++).

Any ideas on this?

#6 kek_miyu   Members   -  Reputation: 122

Posted 10 February 2008 - 02:05 AM

Sorry for replying to my own post at this frequency.

As I suspected, it was a fluke that it worked with the other CRTs. If I create some dummy variables on the stack before calling my DLL function it works or crashes, presumably based on how things end up being aligned on the stack somewhere down in the library calls. I.e. I can get it to work with the proper Multithreaded DLL CRT by declaring an integer array "int a[3]" on the stack before calling the function, and it won't crash:


for(i = 0; i < 10*NUM_CMD_LINES; i++)
{
    int a[3]; // dummy array: only here to shift the stack layout before the DLL call
    int iCmd = i%NUM_CMD_LINES;

    sprintf_s(buf, 512, "ffmpeg.exe -dct 3 %s mem:out%04d.%s", cmdLines[iCmd].options, i, cmdLines[iCmd].ext);
    buf[511] = 0;

    ffmpeg_main(buf);
}




The FFMPEG source code seems to align whatever needs it using ATTR_ALIGN(16) in the relevant places. Why would this kind of stack alignment stop working? I assume the stack alignment is something resolved at run-time. Any reason the alignment wouldn't work when using a MinGW DLL from a VC++ application?


#7 kek_miyu   Members   -  Reputation: 122

Posted 10 February 2008 - 03:01 AM

OK, last post ;) MSVC++ only keeps the stack aligned at 4-byte boundaries. GCC keeps the stack aligned at 16-byte boundaries and expects it to be aligned that way when its functions are called. If the stack is not 16-byte aligned when calling the GCC function, it stays misaligned inside the DLL and all kinds of SSE code will blow up. I haven't decided how to handle this yet; some sort of dirty hack to serve my purpose, I guess.

#8 Yann L   Moderators   -  Reputation: 1798

Posted 10 February 2008 - 03:57 AM

Oh yeah, now that you mention it, I vaguely remember something about GCC not adjusting stack alignment before a SIMD block, because it would just 'assume' the stack pointer passed from the calling function was aligned. That's obviously faster, but fails if the function is called from a misaligned stack. I thought they would've fixed that by now.

Anyway, this obviously explains the strange behaviour you got. If you are certain that GCC generates correct alignment on all function calls internally in the DLL (which I would hope to be the case), then the problem is your interface functions to MSVC. Since you don't have too many, and they're not called at high frequency, you could just realign the stack when you enter them. There should be a GCC intrinsic to do that.


#9 kek_miyu   Members   -  Reputation: 122

Posted 10 February 2008 - 06:35 AM

I've been looking for a way to align the stack in my DLL, but I haven't found anything that will work yet.

Am I correct to assume that inserting x number of byte allocations on the stack, until it "works" on my machine, might still break on another machine/environment if the allocated thread stack gets a different lowest address nibble? Does it consistently work on my machine because the VC++ environment keeps allocating the same address for the thread stack?

Seems like a new GCC 4+ branch will support __attribute__ ((force_align_arg_pointer)), which will force the stack to be aligned to 16 bytes on entry to the function by inserting some extra ASM prologue/epilogue. However, I'm using the MinGW version recommended for FFMPEG (GCC 3.x something), so I don't have this feature. There's talk of other things as well:

http://readlist.com/lists/gcc.gnu.org/gcc/3/17895.html
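For a toolchain that does have the attribute, the idea would look roughly like the sketch below: only the exported entry point is marked, so the stack is realigned once on the way into the DLL. The names demo_main, demo_main_impl and REALIGN_ON_ENTRY are made up for this example, and demo_main_impl is only a stand-in that reports whether a 16-byte aligned local (like align_tmp in ff_fdct_sse2) really landed on a 16-byte boundary. The macro is left empty on anything other than GCC-compatible 32-bit x86 builds, which is an assumption about where the attribute is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only apply the attribute where it is known to exist and to matter:
   GCC-compatible compilers targeting 32-bit x86. Elsewhere it expands to nothing. */
#if defined(__GNUC__) && defined(__i386__)
#define REALIGN_ON_ENTRY __attribute__((force_align_arg_pointer))
#else
#define REALIGN_ON_ENTRY
#endif

/* Hypothetical stand-in for the real work done inside the DLL.
   Returns 1 if its 16-byte aligned scratch buffer sits on a 16-byte boundary. */
static int demo_main_impl(const char *cmdline)
{
    _Alignas(16) int64_t align_tmp[16] = {0};
    assert(cmdline != NULL);
    return ((uintptr_t)align_tmp & 15u) == 0;
}

/* Exported entry point: the attribute asks the compiler to realign the stack
   in the prologue, so the caller's alignment no longer matters. */
REALIGN_ON_ENTRY int demo_main(const char *cmdline)
{
    return demo_main_impl(cmdline);
}
```

Every exported function and every callback that can be entered from MSVC-compiled code would need the same treatment, since each of them is a place where a 4-byte aligned stack can leak into the GCC code.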

I've googled my way all over the place, but coming up short.

I'm not that good at ASM, so I cannot tell if it'd be possible to make some ASM hack to align the stack, similar to the prologue/epilogue code in the above link. Anyone have any suggestion how I might do the alignment?

#10 kek_miyu   Members   -  Reputation: 122

Posted 10 February 2008 - 07:27 AM

Okay, this is the only thing I've managed. The function doesn't crash now, even if I create new stack objects in the VC++ calling function.


// RELEASE: Hack-hack.. Allocate some additional memory on stack to get the stack aligned when calling GCC DLL
void* align1 = _malloca(16);
void* align2 = _malloca(4);

mainOK = m_ffmpegAPI->main(cmdLine);



I can't say exactly why I need these two dynamic allocations, but it seems as if the 16 byte allocation forces an alignment to 16 bytes and the 4 byte allocation compensates for some parameter passing to the DLL function. That theory might be complete BS though ;)

Anyone know if what I've written makes any sense and if it's stable? Or better yet, a nicer way to solve this?


#11 Evil Steve   Members   -  Reputation: 1987

Posted 10 February 2008 - 08:20 AM

Quote:
Original post by kek_miyu
Anyone know if what I've written makes any sense and if it's stable? Or better yet, a nicer way to solve this?

Personally, I'd use some inline asm to adjust ESP before calling the function. The code you have isn't really stable, because if you allocate more or less stack in a function that calls that function, it'll blow up again. Adjusting ESP with inline asm would be a lot more stable (and could be done in the DLL, which is what GCC should really be doing anyway).
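Why a fixed amount of padding is fragile can be shown with plain arithmetic, without any asm. The sketch below uses made-up helper names, and the ESP value from the earlier register dump is used purely as a sample number. Since the stack grows downwards, aligning means rounding the stack pointer down.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds a stack pointer value down to the given power-of-two alignment. */
static inline uintptr_t align_stack_down(uintptr_t sp, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return sp & ~(alignment - 1);
}

/* Number of extra bytes that must be reserved to reach that boundary. */
static inline uintptr_t padding_to_align(uintptr_t sp, uintptr_t alignment)
{
    return sp - align_stack_down(sp, alignment);
}
```

For sp = 0x001334AC the padding is 12 bytes, but if a caller further up adds one more 4-byte local, so that sp becomes 0x001334A8, the padding needed is 8 bytes. A hard-coded amount only fits one particular call chain, whereas masking ESP at run time (which is what an asm prologue would do) adapts to whatever value comes in.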

#12 kek_miyu   Members   -  Reputation: 122

Posted 10 February 2008 - 08:57 AM

I'm not really comfortable with ASM, especially not with inline ASM where it's tricky to know what's allowed. Is such an inline ASM 16 byte stack alignment a trivial thing? I'm not exactly certain what's required for the stack to be aligned. Is it aligned if the first function parameter passed on the stack is on a 16 byte boundary? If so, I cannot really align the stack after the function is called, because the function parameters are already pushed onto the stack. Any enlightenment available?

I assume the alignment has to cater for the number of function arguments and things.. If anyone can write me such inline ASM alignment code (presumably to be run in VC++ before calling the "int ffmpeg_main(char*)" function) I'd be very grateful ;)

[Edited by - kek_miyu on February 10, 2008 3:57:57 PM]



