Non heap memory slow???!!!

10 comments, last by Excors 17 years, 1 month ago
Using MS VS C++ 2005, release build, with all the optimizations I could find. Input 1000000000 for the first cin and 4 for the second. Uncomment the line with heap memory for the faster version. Replacing the inner loop with memcpy also results in faster code. Is Windows doing stack checks behind my back that memcpy bypasses?

#include <iostream>

int main(){
	//int *arr1 = new int[1024], *arr2 = new int[1024];
	int arr1[1024], arr2[1024];

	int asdf;
	std::cin >> asdf;

	int cnt;
	std::cin >> cnt;

	for(int i = 0; i < asdf; ++i)
	{
		for(int j = 0; j < cnt; ++j)
		{
			arr1[j] = arr2[j];
		}
	}

	for(int i = 0; i < cnt; i++)
	{
		std::cout << arr1[0];
	}
	return 0;
}



Edit: I just count to get timings. If I can't see the difference, then there is no difference.
4 sec with loop and heap
11 sec with loop and local
6 sec with memcpy, heap or local
[Edited by - KodeWarrior on March 17, 2007 12:25:52 PM]
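For reference, the memcpy replacement the post mentions would look something like this (a minimal sketch; copy_ints is a hypothetical helper name, not from the original post):

```cpp
#include <cstring>

// Hypothetical helper: copies cnt ints from src to dst,
// replacing the hand-written inner loop with a memcpy call.
void copy_ints(int* dst, const int* src, int cnt)
{
	std::memcpy(dst, src, cnt * sizeof(int));
}
```

Called in place of the inner loop, i.e. once per iteration of the outer loop.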
My guess is that the compiler implements the outer and inner loops as a test-do-jump scheme, whereas memcpy substitutes the inner loop with a single "rep movsd" instruction. The latter can improve execution speed a great deal.

It could also be that accessing the stack for large-ish arrays results in cache misses.

To know for certain, try producing ASM output for both versions and see what the difference is underneath.



edit: it's "rep movsd", not "rep stosd" - oops

[Edited by - Tachikoma on March 17, 2007 5:04:14 PM]
Show us your timing code
Interesting. It seems that the difference in generated code does make an impact.

Test loop, separated into its own routine:

void mtest(int arr1[], int arr2[], int cnt1, int cnt2)
{
	for(int i = 0; i < cnt1; ++i)
	{
		for(int j = 0; j < cnt2; ++j)
		{
			arr1[j] = arr2[j];
		}
	}
}


With inlining set to /Ob1 or /Ob2, the above function is inlined into main() with the following code (the slower version):
loc_401070:
  test    edx, edx
  jle     short loc_401080
  mov     ecx, edx
  lea     esi, [esp+88h+var_28]
  lea     edi, [esp+88h+var_18]
  rep movsd
loc_401080:
  sub     eax, 1
  jnz     short loc_401070


Calling it with int *arr properly generates this (the faster version):
var_C           = dword ptr -0Ch
var_4           = dword ptr -4
arg_0           = dword ptr  4

                push    ecx
                test    eax, eax
                push    ebx
                mov     ebx, [esp+8+arg_0]
                jle     short loc_401035
                mov     [esp+8+var_4], eax
                push    ebp
                nop
loc_401010:
                test    edi, edi
                jle     short loc_40102D
                mov     ecx, ebx
                mov     eax, esi
                sub     ecx, esi
                mov     edx, edi
                lea     esp, [esp+0]
loc_401020:
                mov     ebp, [ecx+eax]
                mov     [eax], ebp
                add     eax, 4
                sub     edx, 1
                jnz     short loc_401020
loc_40102D:
                sub     [esp+0Ch+var_4], 1
                jnz     short loc_401010
                pop     ebp
loc_401035:
                pop     ebx
                pop     ecx
                retn
mtest           endp


The slower version takes 12.4 seconds, the faster one 5.9.

But, by disabling inlining with /Ob0, the second form is used, and takes 5.9 seconds in both cases.

This is on an AMD processor.

The timer class I used is this; the measurement results are consistent, and well outside the margin of error given the 2x difference in running time:

Timer::Timer(void)
{
	LARGE_INTEGER iFrequency;
	QueryPerformanceFrequency(&iFrequency);
	frequency = 1000.0 / iFrequency.QuadPart;
}

double Timer::time(void)
{
	LARGE_INTEGER counter;
	QueryPerformanceCounter(&counter);
	return counter.QuadPart * frequency;
}
You get an even faster version (in VS2005) if you change that to void mtest(char* __restrict arr1, char* __restrict arr2, int cnt1, int cnt2) - it compiles into an outer loop that just calls memcpy. Not sure why the compiler doesn't handle copy-loops of non-chars in the same way, though.

memcpy uses various tricks to copy quickly, like SSE2 and aligned-dword copying, so it's going to be faster than a non-hand-written implementation in most cases.
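The __restrict variant described above would look roughly like this (a sketch; __restrict is the MSVC spelling, which GCC also accepts, and whether the loop actually compiles into a memcpy call depends on the compiler and flags):

```cpp
// With __restrict the compiler may assume the two buffers do not
// overlap, which lets it substitute memcpy for the inner loop.
void mtest(char* __restrict arr1, char* __restrict arr2,
           int cnt1, int cnt2)
{
	for(int i = 0; i < cnt1; ++i)
	{
		for(int j = 0; j < cnt2; ++j)
		{
			arr1[j] = arr2[j];
		}
	}
}
```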
Thanks for the replies. I guess what I'm trying to do is test the overhead of memcpy versus an inlined movsd in a somewhat realistic application. I think the nested loop confused the compiler.

Yes, SIMD instructions can be fast, but only on huge arrays.
memcpy has calling overhead (I looked at the asm; the intrinsic was not inlined for some reason).
memcpy assumes bytes and can't just do a movsd.
The __movsd intrinsic is good, but has no SIMD. I find my int-copying for loops perform the same as __movsd.

Maybe a better way to test which is faster is to use a more realistic application. Mergesort on 32-bit ints? It has a couple of huge arrays (good for SIMD) and a great number of small copies (movsd might be better there).

Does anyone have a mergesort program they could test this on?

Thanks in advance.
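For anyone who wants to try, a minimal bottom-up mergesort on 32-bit ints might look like this (a sketch, not a tuned implementation; merge_runs is a hypothetical name, and the tail copies in it are the small copies in question, so that is where memcpy, __movsd, or a plain loop could be swapped in and timed):

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Merge the sorted runs [lo, mid) and [mid, hi) of src into dst.
static void merge_runs(const int* src, int* dst, int lo, int mid, int hi)
{
	int i = lo, j = mid, k = lo;
	while(i < mid && j < hi)
		dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
	// Tail copies: swap memcpy for a hand-written loop here
	// and compare timings.
	if(i < mid) std::memcpy(dst + k, src + i, (mid - i) * sizeof(int));
	if(j < hi)  std::memcpy(dst + k, src + j, (hi - j) * sizeof(int));
}

// Bottom-up mergesort, ping-ponging between the input array
// and one temporary buffer.
void mergesort(int* arr, int n)
{
	if(n <= 1) return;
	std::vector<int> tmp(n);
	int* src = arr;
	int* dst = &tmp[0];
	for(int width = 1; width < n; width *= 2)
	{
		for(int lo = 0; lo < n; lo += 2 * width)
		{
			int mid = (lo + width < n) ? lo + width : n;
			int hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
			merge_runs(src, dst, lo, mid, hi);
		}
		std::swap(src, dst);
	}
	// After the final pass the sorted data is in src.
	if(src != arr) std::memcpy(arr, src, n * sizeof(int));
}
```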
Quote:Original post by KodeWarrior
Maybe a better way to test which is faster is to use a more realistic application.


First rule of optimization: Don't do it.

A realistic application is this: Write your application. Profile it. Look for bottlenecks. Resolve them.

If one of these bottlenecks happens to be copying large amounts of memory, and it happens to be due to the copying routine, and only because it's not inlined, then you should try to fix it.

While these results are quite interesting, I went to check an application I'm working on. Despite pretty heavy memory traffic, copying is the last of my problems, and even in the places where I need to do it, the compiler already did a good job at selecting the seemingly optimal solution.

But especially since this deals with somewhat unexpected compiler issues, profiling the final application will be the only thing that produces results in the end.
Again, show us your timing code. You likely have issues in it too, either with resolution or with not taking known issues into account. Are you using a whole-program profiler? Why are you working on code that is unlikely to be your game's bottleneck? If this isn't a game... Well... There are other sites :)
Quote:Original post by KodeWarrior
Using MS VS C++ 2005 release with all optimizations I could find

Input 1000000000 for first cin and 4 for second.

Uncomment line with heap memory for faster version.

Replacing inner loop with memcpy results in faster code. Windows doing stack checks behind my back but memcpy bypasses?

*** Source Snippet Removed ***


Edit:

I just count to get timings. If I can't see the difference, then there is no difference.
4 sec with loop and heap
11 sec with loop and local
6 sec with memcpy heap/local
I'm sorry, but that's an invalid benchmark.

Rule 1 of benchmarking is that it must not be possible for the compiler to remove any lines of code and still generate the same answer. As it stands, a clever compiler could remove the outer for loop and still produce the same result, because the same data is simply copied over multiple times and doing it once has the same effect. What the particular compiler actually did in this scenario is not as meaningful as you would think.[totally]

Secondly, you're copying data from uninitialised arrays, and hence the compiler could theoretically omit both loops and the copying and just spit random values out instead, since that is what the input is and it simply gets copied around anyway.
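A sketch of how to address both points (hypothetical, not from the thread): initialise the source array and consume the copied data every iteration, so the compiler cannot legally delete either loop as dead code:

```cpp
// Hypothetical benchmark body: returns a checksum derived from
// the destination array so the copy work stays observable.
long long copy_benchmark(int iterations, int cnt)
{
	int arr1[1024], arr2[1024];

	// Initialise the source so defined values are copied.
	for(int j = 0; j < 1024; ++j)
		arr2[j] = j;

	long long checksum = 0;
	for(int i = 0; i < iterations; ++i)
	{
		for(int j = 0; j < cnt; ++j)
			arr1[j] = arr2[j];
		// Reading arr1 each iteration keeps the copy from
		// being hoisted out of or removed from the loop.
		checksum += arr1[i % cnt];
	}
	return checksum;  // print this so the work cannot be optimised away
}
```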

Thirdly, whilst you're declaring 8K of arrays, you're only actually performing a tiny copy of just four integers (you said to input 4).
In reality if you were copying such a small array many times over, then they would all come-from and/or go-to different places in RAM, meaning that the bottleneck would be on the main memory. This would mean that the slight difference in speed between the loops would be negligible compared to the latency of fetching memory etc.

The moral of the story there is don't test anything but real-world situations. The mergesort benchmark idea you have is a great idea. It should confirm the above, that memory fetching etc becomes the bottleneck, and they will be about the same speed. I can donate an implementation of this if desired.[cool]
Alternatively, try entering 16 and 10000000 instead and see what difference there is then. You'll get a buffer overflow with the stack version, unless you make those statically allocated arrays larger and increase the stack to a ridiculous size.
As for the difference in code generation between the stack and heap versions, that seems to be due to aliasing - the compiler can tell that two stack variables are not overlapping, but it can't tell that the two pointers returned by new are not overlapping, and so it generates different code (rep movsd vs a read-write loop). If you add __restrict to the heap-allocated pointers, then it generates the same code as for the stack-allocated version.
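In code, that fix would look something like this (a hypothetical example; copy_restrict is not a name from the thread):

```cpp
// __restrict promises the compiler that the two heap blocks
// cannot overlap, matching what it already knows about the
// stack-allocated case, so it can emit the rep movsd form.
void copy_restrict(int* __restrict dst, const int* __restrict src, int cnt)
{
	for(int j = 0; j < cnt; ++j)
		dst[j] = src[j];
}
```

Called on the two `new int[1024]` blocks from the original post in place of the inner loop.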

