Sign in to follow this  
Finalspace

Weird behavior for a function with optimizations turned on

Recommended Posts

I have a function which returns the name of the processor for X64 based on __cpuid() intrinsic. When i run that function without optimizations its fine and works correct, but when i compile and run it with O2 for example then the compiler does weird things. The branch for detecting if "destBuffer" is null is totally changed, no comparison will be made and "destLen" will never be set.

The result is that PushMemory_Internal() gets a totally garbage size and may return nullptr. What did i wrong? The compiler definitily rearranged it a lot, so its hard to follow.

Basically the function has two modes, one with a given destBuffer and its max length and one without any dest buffer at all - so in this case we stack allocate the memory we need for worst case size.

	static char *GetProcessorName(char *destBuffer = nullptr, const uint32_t maxDestBufferLen = 0);		

	static char *GetProcessorName(char *destBuffer, const uint32_t maxDestBufferLen) {
			constexpr uint32_t cpuBrandBufferSize = 0x40;

			uint32_t destLen = maxDestBufferLen;
			if (destBuffer == nullptr) {
				destLen = cpuBrandBufferSize + 1;
				destBuffer = (char *)malloca(destLen);
			}
			assert(destBuffer != nullptr);
			assert(destLen >= (cpuBrandBufferSize + 1));

			int cpuInfo[4] = { -1 };
			char cpuBrandBuffer[cpuBrandBufferSize] = {};
			__cpuid(cpuInfo, 0x80000000);
			uint32_t extendedIds = cpuInfo[0];

			// Get the information associated with each extended ID.
			// Interpret CPU brand string.
			for (uint32_t i = 0x80000000; i <= extendedIds; ++i) {
				__cpuid(cpuInfo, i);
				if (i == 0x80000002) {
					CopyMemory(cpuInfo, cpuBrandBuffer, sizeof(cpuInfo));
				} else if (i == 0x80000003) {
					CopyMemory(cpuInfo, cpuBrandBuffer + 16, sizeof(cpuInfo));
				} else if (i == 0x80000004) {
					CopyMemory(cpuInfo, cpuBrandBuffer + 32, sizeof(cpuInfo));
				}
			}

			// Copy result back to the dest buffer
			uint32_t sourceLen = strlen(cpuBrandBuffer);
          	strncpy(destBuffer, cpuBrandBuffer, sourceLen);

			return(destBuffer);
		}

Thanks in advance,

Final

Share this post


Link to post
Share on other sites

There's... Multiple things wrong with that code.

Let's start with this: https://msdn.microsoft.com/en-us/library/5471dc8s.aspx

And move on to this: https://en.m.wikipedia.org/wiki/Dangling_pointer

 

Hopefully you'll see how your approach is fundamentally flawed. (As to why the optimizer scrambles your code - welcome to undefined behavior.)

Design a better approach and we can get nitpicky on the rest of that snippet :-)

Share this post


Link to post
Share on other sites
2 hours ago, ApochPiQ said:

There's... Multiple things wrong with that code.

Let's start with this: https://msdn.microsoft.com/en-us/library/5471dc8s.aspx

And move on to this: https://en.m.wikipedia.org/wiki/Dangling_pointer

 

Hopefully you'll see how your approach is fundamentally flawed. (As to why the optimizer scrambles your code - welcome to undefined behavior.)

Design a better approach and we can get nitpicky on the rest of that snippet :-)

Okay sometimes i should not touch any keyboard at all. Its obvious that alloca allocates memory on the CURRENT stack frame, not on any other one O_o. Changing to malloc solves that particular problem, but the removal of the condition (destBuffer == nullptr) was still there. So i rearranged the condition a bit and now it works: Also the memcpy order in this code i posted was wrong, but in my actual code its correct.

Here is a full working main.cpp for x86 and x64 with and without optimizations:

#include <intrin.h>
#include <inttypes.h>
#include <assert.h>
#include <malloc.h>
#include <stdio.h>
#include <string>

static char *GetProcessorName(char *destBuffer = nullptr, const size_t maxDestBufferLen = 0);

static char *GetProcessorName(char *destBuffer, const size_t maxDestBufferLen) {
	constexpr uint32_t cpuBrandBufferSize = 0x40;

	char *dest;
	size_t destLen;
	if ((destBuffer != nullptr) && (maxDestBufferLen > 0)) {
		dest = destBuffer;
		destLen = maxDestBufferLen;
	} else {
		destLen = cpuBrandBufferSize + 1;
		dest = (char *)malloc(destLen);
	}
	assert(dest != nullptr);
	assert(destLen >= (cpuBrandBufferSize + 1));

	int cpuInfo[4] = { -1 };
	char cpuBrandBuffer[cpuBrandBufferSize] = {};
	__cpuid(cpuInfo, 0x80000000);
	uint32_t extendedIds = cpuInfo[0];

	// Get the information associated with each extended ID.
	// Interpret CPU brand string.
	for (uint32_t i = 0x80000000; i <= extendedIds; ++i) {
		__cpuid(cpuInfo, i);
		if (i == 0x80000002) {
			memcpy(cpuBrandBuffer, cpuInfo, sizeof(cpuInfo));
		} else if (i == 0x80000003) {
			memcpy(cpuBrandBuffer + 16, cpuInfo, sizeof(cpuInfo));
		} else if (i == 0x80000004) {
			memcpy(cpuBrandBuffer + 32, cpuInfo, sizeof(cpuInfo));
		}
	}

	// Copy result back to the dest buffer
	size_t sourceLen = strlen(cpuBrandBuffer);
	strncpy_s(dest, destLen, cpuBrandBuffer, sourceLen);

	return(dest);
}

int main(int argc, char **args) {
    char *cpuName = GetProcessorName();
	printf("CPU Name: %s\n", cpuName);
    free(cpuName); // Really i want a defer free(); function...
	getchar();
	return 0;
}

 

Edited by Finalspace

Share this post


Link to post
Share on other sites
1 minute ago, KidsLoveSatan said:

Is that a problem? The function many have been inlined.

It might be, but there are no guarantee for that - even if it would use the "inline" keyword. Maybe with a forced inline, but this would require to change the api signature...

Share this post


Link to post
Share on other sites
Just now, Finalspace said:

It might be, but there are no guarantee for that - even if it would use the "inline" keyword. Maybe with a forced inline, but this would require to change the api signature...

That doesn't answer why is it a problem if the condition is optimized away? 

Share this post


Link to post
Share on other sites
Just now, Finalspace said:

I have no idea why the condition is optimized away... maybe because of the default argument of nullptr and zero for the destination length?

I compiled your code with /O2 and the function is being inlined, if I turn off function inlining a number of variables are optimized away. I don't see why you have a problem with it, to me it's working as it should.

Share this post


Link to post
Share on other sites

In your original code, "destLen" isn't used outside of assignment and the assert though, so the compiler will totally optimize that away. It should still enter the branch and assign the destBuffer though, are you sure that didn't happen as well?

Share this post


Link to post
Share on other sites
1 hour ago, Juliean said:

In your original code, "destLen" isn't used outside of assignment and the assert though, so the compiler will totally optimize that away. It should still enter the branch and assign the destBuffer though, are you sure that didn't happen as well?

I debugged it in visual studio 2015 and it fully jumps to the allocation directly. Its not inlined, i havent enabled aggresive inline rules.

I think i know what happened, the branch is not compiled out - but rather moved/rearranged, so the comparison is not on the same line anymore. Looking at the disassembly output should make that clear.

So the problem was definitly the alloca() allocating memory on the current stack frame, then leaving the function and then throwing it away. Depending on the inlining of a function is not a good idea i think. Inlining should simply help make your code faster by removing function calls - but it should never change the actual result.

I hate switching between low level and high level languages all the time, so you forget such fundamental things.

Edited by Finalspace

Share this post


Link to post
Share on other sites
3 hours ago, Finalspace said:

So the problem was definitly the alloca() allocating memory on the current stack frame, then leaving the function and then throwing it away. Depending on the inlining of a function is not a good idea i think. Inlining should simply help make your code faster by removing function calls - but it should never change the actual result.

Ah, I didn't even notice the alloca. This perfectly explains whats happening here, and I'm even wondering that the compiler isn't optimizing it more aggressively than it already is. Lets recap what you are telling the compiler here:

1: You want some memory off the stack. Unlike _malloca or equivalents, alloca will always be stack based, so the compiler can be sure about that.

2: You are returning a pointer to this memory, which is only valid inside this function. So that tells the compiler that the return-statement, while being able to return the current pointer of the allocation as-is would be valid, the caller of the function would then point to invalid memory, so this should be undefined behaviour.

3: Furthermore, you are not using the destBuffer outside of strncpy-ing to it in the end. Since the compiler can assume that the value of destBuffer is not going to be used after strncpy for return being undefined behaviour, I belive it can optimize it away (at least thats the case for memcpy, so don't hang me if I'm wrong about that).

4. So that means that the compiler is left with an allocation that is literally unused, which means it can probably just remove it (though from what you describe thats not happening, so maybe it is pretenting to allocate, but just generates the address that the allocation would be at, so that i can return *something* at the end?)

So maybe thats whats already happening (hard to tell w/o the assembly), but a compiler could just turn that whole code into:
 

If(destBuf != nullptr)
 	// execute function code;
  	return destBuf;
else
    // do nothing
	return WHATEVER_VALUE. // probably eigther garbage or a pointer along the stack where the allocation would start

I'm no expert on compilers but I'm pretty sure thats exactly whats happening. The compiler usually is not allowed to alter the visible behaviour of the program, unless you are, as in your case, relying on undefined behaviour - so the lesson is, never invoke undefined behaviour :)

Share this post


Link to post
Share on other sites
On 8.11.2017 at 7:11 PM, ApochPiQ said:

I'ma just leave this here: https://godbolt.org

 

 

Also, your code has a lot of improvement to be made.

  1. Your API for GetProcessorName() is bad. Please don't ever write a function like that. You should either return a string object or fill a string buffer, but not both in one function. As you are discovering this is a recipe for confusion and unhappiness.
  2. Please don't use C-strings. You have a number of issues in this code that indicate (1) you are not familiar with how C-strings work, (2) you aren't thinking carefully enough about how you manipulate C-strings, or (3) both. For example, you confuse allocation size with string length in a couple places, and your attempts to account for NULL terminators look wrong to me.
  3. The loop style invocation of __cpuid is overly complex and needlessly busy if the CPU returns a huge number of capabilities/extended IDs. You can write this simply as a single if check and 3 successive __cpuid calls with no loop.

1/2. I already have the signature of passing in a buffer in all functions, but i (browser crashed in the middle of the sentence) thought i might be a good idea to support passing nullptr as the destination buffer. But you are right it was a bad idea, i should support returning a object instead, string memory allocated on the heap - which is the same std::string does under the hood. But overall my goal was to let the user decide how to handle the memory - not the library. So using a buffer always and never allocate memory would be more sane for that goal.

3. I copied it from stackoverflow, i just wanted a quick result for the moment. I have not thought this through by any means but i will when i feature complete the library, which will take a while because i want linux support as well.

Edited by Finalspace

Share this post


Link to post
Share on other sites

If you wanted to let the user decide how to handle the memory, then you should also let them decide how to allocate it.  You're really just letting the user decide when to free the memory, but not how.  A user could have a string with a custom allocator, so if you templatized this function to accept any string, then the user could truly manage it themselves.  You'd end up using their allocator inside your function, and they could manage everything themselves.

Share this post


Link to post
Share on other sites

Templates are a powerful tool but probably not the right one for this job.

Accepting a buffer and size is the right thing for the API to do unless the API can stipulate the usage of std::string or whatever.

But your main point is spot on - mixing the responsibility for allocation and deallocation across API boundaries is bad juju.

Share this post


Link to post
Share on other sites

It's already been touched on, but this style of code:

    char* cpuName = GetProcessorName();
	//do somthing
    free(cpuName);

...should be avoided at all costs. Manual memory management is very error-prone easy to get wrong. As such, care should be taken to be extremely clear at all times as to who the owner of each resource is, and to ensure that all allocation functions are paired up with a matching deallocation function.

In the above example, it is not clear the 'GetProcessorName':
* Allocates memory,
* Re-assigns ownership of that memory (responsibility for not leaking it) to the caller,
* Or that the appropriate de-allocation function for the caller to use is 'free'.

That's three booby traps created to make it hard for users of this function to write correct code. So:

* Name functions that create new resources appropriately. e.g. 'MallocProcessorName' would be a stronger hint as to what's going on.
* Match up allocation and deallocation functions -- the user should be able to see a malloc/free pair, not a GetProcessorName/free pair. In that case, forcing the user to call both malloc and free themselves makes this clear. If you pass a buffer into 'GetProcessorName', then that also allows the user to get memory from other places (such as alloca) correctly :)
* Alternatively, if it returned a std::string then this is also clear -- the user knows that std::strings contain allocated char buffers, and that the string owns the buffer and will clean it up automatically.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this