KodeWarrior

Non heap memory slow???!!!


Using MS VS C++ 2005, release build with all the optimizations I could find.

Input 1000000000 for the first cin and 4 for the second.

Uncomment the line with heap memory for the faster version.

Replacing the inner loop with memcpy also results in faster code. Is Windows doing stack checks behind my back that memcpy bypasses?
#include <iostream>

int main()
{
	//int *arr1 = new int[1024], *arr2 = new int[1024];
	int arr1[1024], arr2[1024];

	int asdf; // outer iteration count
	std::cin >> asdf;

	int cnt; // number of ints to copy per iteration
	std::cin >> cnt;

	for(int i = 0; i < asdf; ++i)
	{
		for(int j = 0; j < cnt; ++j)
		{
			arr1[j] = arr2[j];
		}
	}

	// print something so the copying isn't optimized away entirely
	for(int i = 0; i < cnt; ++i)
	{
		std::cout << arr1[0];
	}

	return 0;
}
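For reference, the memcpy version just replaces the inner loop; a sketch of what that looks like (my reconstruction - the exact code wasn't posted):

	// requires #include <cstring>
	for(int i = 0; i < asdf; ++i)
	{
		memcpy(arr1, arr2, cnt * sizeof(int));
	}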



Edit:

I just count to get timings. If I can't see the difference, then there is no difference.
4 sec with loop and heap
11 sec with loop and local
6 sec with memcpy heap/local

[Edited by - KodeWarrior on March 17, 2007 12:25:52 PM]

My guess is that the compiler implements the outer and inner loops as a test-do-jump scheme, whereas memcpy substitutes the inner loop with a single "rep movsd" instruction. The latter can improve execution speed a great deal.

It could also be that accessing the stack for large-ish arrays results in cache misses.

To know for certain, try producing the ASM output for both versions and see what the difference is underneath.
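With VC++ you don't even need a disassembler for that - the /FA family of switches emits an .asm listing per translation unit, e.g.:

cl /O2 /FAs main.cpp

/FAs interleaves the source with the generated assembly, which makes diffing the two versions easier.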



edit: it's "rep movsd", not "rep stosd" - oops

[Edited by - Tachikoma on March 17, 2007 5:04:14 PM]

Interesting. It seems the difference in generated code does make an impact.

Test loop, separated into its own routine:

void mtest(int arr1[], int arr2[], int cnt1, int cnt2)
{
	for(int i = 0; i < cnt1; ++i)
	{
		for(int j = 0; j < cnt2; ++j)
		{
			arr1[j] = arr2[j];
		}
	}
}




With inlining set to /Ob1 or /Ob2, the above function is inlined into main() with the following code (the slower version):

loc_401070:
test edx, edx
jle short loc_401080
mov ecx, edx
lea esi, [esp+88h+var_28]
lea edi, [esp+88h+var_18]
rep movsd

loc_401080:
sub eax, 1
jnz short loc_401070





Calling it with int *arr parameters properly generates this (the faster version):

var_C = dword ptr -0Ch
var_4 = dword ptr -4
arg_0 = dword ptr 4

push ecx
test eax, eax
push ebx
mov ebx, [esp+8+arg_0]
jle short loc_401035
mov [esp+8+var_4], eax
push ebp
nop

loc_401010:
test edi, edi
jle short loc_40102D
mov ecx, ebx
mov eax, esi
sub ecx, esi
mov edx, edi
lea esp, [esp+0]

loc_401020:
mov ebp, [ecx+eax]
mov [eax], ebp
add eax, 4
sub edx, 1
jnz short loc_401020

loc_40102D:
sub [esp+0Ch+var_4], 1
jnz short loc_401010
pop ebp

loc_401035:
pop ebx
pop ecx
retn
mtest endp




The slower version takes 12.4 seconds, the faster one 5.9.

But by disabling inlining with /Ob0, the second form is used, and both cases take 5.9 seconds.

This is on an AMD processor.

The timer class I used is below; the measurement results are consistent, and also well outside the margin of error, given the 2x difference in running time:

#include <windows.h>

class Timer
{
	double frequency; // milliseconds per QPC tick

public:
	Timer();
	double time(); // current time, in milliseconds
};

Timer::Timer(void)
{
	LARGE_INTEGER iFrequency;
	QueryPerformanceFrequency(&iFrequency);
	frequency = 1000.0 / iFrequency.QuadPart;
}

double Timer::time(void)
{
	LARGE_INTEGER counter;
	QueryPerformanceCounter(&counter);

	return counter.QuadPart * frequency;
}
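Usage is just the difference of two calls (the 1000.0 scale above means the results come back in milliseconds):

Timer timer;
double start = timer.time();
mtest(arr1, arr2, cnt1, cnt2);
double elapsedMs = timer.time() - start;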

You get an even faster version (in VS2005) if you change that to void mtest(char* __restrict arr1, char* __restrict arr2, int cnt1, int cnt2) - it compiles into an outer loop that just calls memcpy. Not sure why the compiler doesn't handle copy-loops of non-chars in the same way, though.
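That is, something along these lines (sketch, same body as the earlier mtest):

void mtest(char* __restrict arr1, char* __restrict arr2, int cnt1, int cnt2)
{
	for(int i = 0; i < cnt1; ++i)
	{
		for(int j = 0; j < cnt2; ++j)
		{
			arr1[j] = arr2[j];
		}
	}
}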

memcpy uses various tricks to copy quickly, like SSE2 and aligned dword copying, so it's going to be faster than a plain compiler-generated copy loop in most cases.
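Roughly the idea, heavily simplified (sketch only - assumes 4-byte unsigned ints and suitably aligned pointers; real CRT implementations add alignment fix-up prologues and SSE2 paths):

#include <cstddef>

void copy_fast(void* dst, const void* src, size_t bytes)
{
	// copy as many whole dwords as possible...
	unsigned* d = static_cast<unsigned*>(dst);
	const unsigned* s = static_cast<const unsigned*>(src);
	size_t words = bytes / sizeof(unsigned);
	for(size_t i = 0; i < words; ++i)
		d[i] = s[i];

	// ...then mop up the leftover tail bytes one at a time
	char* dc = reinterpret_cast<char*>(d + words);
	const char* sc = reinterpret_cast<const char*>(s + words);
	for(size_t i = 0; i < bytes % sizeof(unsigned); ++i)
		dc[i] = sc[i];
}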

Thanks for the replies. I guess what I'm trying to do is test the overhead of memcpy versus an inlined movsd in a somewhat realistic application. I think the nested loop confused the compiler.

Yes, SIMD instructions can be fast, but only on huge arrays.
memcpy has calling overhead (I looked at the asm; the intrinsic was not inlined for some reason).
memcpy assumes bytes, so it can't just do a movsd.
The __movsd intrinsic is good, but no SIMD. I find my int-copying for loops perform the same as __movsd (example below).
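For anyone wanting to try it, the intrinsic lives in <intrin.h> and copies dwords directly (sketch, using the arrays and cnt from my first post):

#include <intrin.h>

// equivalent to the inner copy loop: moves cnt 32-bit values
__movsd(reinterpret_cast<unsigned long*>(arr1),
        reinterpret_cast<unsigned long*>(arr2), cnt);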

Maybe a better way to test which is faster is to use a more realistic application. Mergesort on 32-bit ints? It has a couple of huge arrays (good for SIMD) and a great number of small copies (movsd might be better there).

Does anyone have a mergesort program they could test this on?
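If it helps, here's roughly what I have in mind - an untested sketch of a bottom-up mergesort on 32-bit ints, where the merge step is the place to swap between a plain loop, memcpy and __movsd:

#include <algorithm>
#include <cstddef>
#include <vector>

void mergesort(std::vector<int>& a)
{
	std::vector<int> tmp(a.size()); // scratch array (the second huge array)
	for(size_t width = 1; width < a.size(); width *= 2)
	{
		for(size_t lo = 0; lo + width < a.size(); lo += 2 * width)
		{
			size_t mid = lo + width;
			size_t hi = std::min(lo + 2 * width, a.size());
			size_t i = lo, j = mid, k = lo;
			while(i < mid && j < hi)
				tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
			while(i < mid) tmp[k++] = a[i++]; // small tail copies
			while(j < hi) tmp[k++] = a[j++];
			for(size_t m = lo; m < hi; ++m)   // copy the merged run back
				a[m] = tmp[m];
		}
	}
}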

Thanks in advance.

Quote:
Original post by KodeWarrior
Maybe a better way to test which is faster is to use a more realistic application.


First rule of optimization: Don't do it.

Realistic application is this: Write your application. Profile it. Look for bottlenecks. Resolve them.

If one of these bottlenecks happens to be copying large amounts of memory, and it happens to be due to the copying routine, and only because it's not inlined, then you should try to fix it.

While these results are quite interesting, I went to check an application I'm working on. Despite pretty heavy memory traffic, copying is the last of my problems, and even in the places where I need to do it, the compiler already did a good job at selecting the seemingly optimal solution.

But especially since this deals with somewhat unexpected compiler issues, profiling the final application is the only thing that will produce meaningful results in the end.

Guest Anonymous Poster
Again, show us your timing code. You likely have issues in it too, either with resolution or with not taking known issues into account. Are you using a whole-program profiler? Why are you working on code that is likely not to be your game's bottleneck? If this isn't a game... well... there are other sites :)

Quote:
Original post by KodeWarrior
Using MS VS C++ 2005, release build with all the optimizations I could find.

Input 1000000000 for the first cin and 4 for the second.

Uncomment the line with heap memory for the faster version.

Replacing the inner loop with memcpy also results in faster code. Is Windows doing stack checks behind my back that memcpy bypasses?

*** Source Snippet Removed ***


Edit:

I just count to get timings. If I can't see the difference, then there is no difference.
4 sec with loop and heap
11 sec with loop and local
6 sec with memcpy heap/local
I'm sorry, but that's an invalid benchmark.

Rule 1 of benchmarking is that it must not be possible for the compiler to remove any lines of code and still generate the same answer. As it stands, a clever compiler could remove the outer for loop and still produce the same result, because the same data is simply copied over multiple times and doing it once has the same effect. What the particular compiler actually did in this scenario is not as meaningful as you would think.[totally]

Secondly, you're copying data from uninitialised arrays, and hence the compiler could theoretically omit both loops and the copying and just spit random values out instead, since that is what the input is and it simply gets copied around anyway.

Thirdly, whilst you're declaring 8KB of arrays, you're only actually performing a tiny copy of just four integers (you said to input 4).
In reality, if you were copying such a small array many times over, the copies would all come from and/or go to different places in RAM, meaning that the bottleneck would be main memory. The slight difference in speed between the loops would then be negligible compared to the latency of fetching memory.
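For what it's worth, a version the compiler can't legally gut would initialise the source and consume every result - a sketch, reusing the variables from the original snippet:

	// give the source defined contents
	for(int j = 0; j < cnt; ++j)
		arr2[j] = j;

	long long checksum = 0;
	for(int i = 0; i < asdf; ++i)
	{
		for(int j = 0; j < cnt; ++j)
			arr1[j] = arr2[j];

		arr2[i % cnt] ^= 1;        // make each iteration's copy distinct
		checksum += arr1[i % cnt]; // consume each copy's result
	}
	std::cout << checksum << '\n'; // keep the whole computation live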

The moral of the story there is: don't test anything but real-world situations. Your mergesort benchmark is a great idea. It should confirm the above, that memory fetching becomes the bottleneck and the approaches will be about the same speed. I can donate an implementation of this if desired. [cool]
Alternatively, try entering 16 and 10000000 instead and see what difference there is then. You'll get a buffer overflow with the stack version unless you make those statically allocated arrays larger and increase the stack to a ridiculous size.

As for the difference in code generation between the stack and heap versions, that seems to be due to aliasing - the compiler can tell that two stack variables are not overlapping, but it can't tell that the two pointers returned by new are not overlapping, and so it generates different code (rep movsd vs a read-write loop). If you add __restrict to the heap-allocated pointers, then it generates the same code as for the stack-allocated version.
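That is, something like this (sketch - __restrict being the promise to the compiler that the two blocks never overlap):

int* __restrict arr1 = new int[1024];
int* __restrict arr2 = new int[1024];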

Quote:
Original post by Excors
As for the difference in code generation between the stack and heap versions, that seems to be due to aliasing - the compiler can tell that two stack variables are not overlapping, but it can't tell that the two pointers returned by new are not overlapping, and so it generates different code (rep movsd vs a read-write loop). If you add __restrict to the heap-allocated pointers, then it generates the same code as for the stack-allocated version.
You know, my post was initially going to mention the aliasing issue, as I had noticed exactly what you state above. The only reason I didn't is that the one where it can't assume no aliasing is the one that appeared to be faster! (go figure) [totally]
It's really interesting to see what a difference __restrict makes.

That seems to be just because this is a pathological case - I guess the compiler assumes you're going to be copying a lot of data, so it emits all the setup for rep movsd, and then the loop only copies four dwords because of the input parameters. Presumably the straight 'read, write, decrement, loop until zero' results in much happier pipelining or prediction or whatever it is that's affecting the timings. If you run the program with a larger second parameter (the number of ints to copy), then the alias-free version (stack/__restrict, using rep movsd) becomes faster instead, which makes more sense.

I don't actually see why the compiler cares about aliasing in this case, since rep movsd ought to operate exactly as expected even when you've got overlapping input (unless I'm missing some difference). I'd assume it's part of the mysterious internals of the optimiser, which just happens to have a special case for copying unaliased memory and not for aliased memory, which it uses when it guesses it's going to be copying quite a lot of memory (more than four bytes). But I really don't know, so I'm just guessing [smile]

Anyway, this kind of benchmark doesn't give useful results, but it does give useful information like "the optimiser isn't perfect and sometimes it'll generate worse code when you don't think it should (e.g. when you add __restrict, or when you use stack allocation)", so you know what to look out for when you're profiling a real application, and if it's significantly slowed by memory copying then you can test whether the compiler happens to be generating significantly sub-optimal code in that case. (It also shows that you really do need to profile real code, because it's impossible to anticipate what the compiler's optimiser will do.)

Going back to the original question, "Non heap memory slow?", the answer is definitely no - the code does demonstrate real differences between the two programs, but you need to be very careful when analysing the results, because heap vs stack isn't the relevant issue here.

