KodeWarrior

Non heap memory slow???!!!


Using MS VS C++ 2005, release build, with all the optimizations I could find. Input 1000000000 for the first cin and 4 for the second. Uncomment the line with heap memory for the faster version. Replacing the inner loop with memcpy also results in faster code. Is Windows doing stack checks behind my back that memcpy bypasses?
#include <iostream>

int main(){
	//int *arr1 = new int[1024], *arr2 = new int[1024];
	int arr1[1024], arr2[1024];

	int asdf;
	std::cin >> asdf;

	int cnt;
	std::cin >> cnt;

	for(int i = 0; i < asdf; ++i)
	{
		for(int j = 0; j < cnt; ++j)
		{
			arr1[j] = arr2[j];
		}
	}

	for(int i = 0; i < cnt; i ++)
	{
		std::cout << arr1[0];
	}
}



Edit: I just count to get timings. If I can't see the difference, then there is no difference.
4 sec with loop and heap
11 sec with loop and local
6 sec with memcpy heap/local
[Edited by - KodeWarrior on March 17, 2007 12:25:52 PM]

My guess is that the compiler implements the outer and inner loops as a test-do-jump scheme, whereas memcpy substitutes the inner loop with a single "rep movsd" instruction. The latter can favour execution speed a great deal.

It could also be that accessing the stack for large-ish arrays results in cache misses.

To know for certain, try producing an ASM listing for both versions and see what the difference is underneath.



edit: it's "rep movsd", not "rep stosd" - oops

[Edited by - Tachikoma on March 17, 2007 5:04:14 PM]

Guest Anonymous Poster
Show us your timing code

Interesting. It seems the different generated code does make an impact.

Test loop, separated into its own routine.

void mtest(int arr1[], int arr2[], int cnt1, int cnt2)
{
	for(int i = 0; i < cnt1; ++i) {
		for(int j = 0; j < cnt2; ++j) {
			arr1[j] = arr2[j];
		}
	}
}




With inlining set to /Ob1 or /Ob2, the above function is inlined into main() with the following code (the slower version):

loc_401070:
test edx, edx
jle short loc_401080
mov ecx, edx
lea esi, [esp+88h+var_28]
lea edi, [esp+88h+var_18]
rep movsd

loc_401080:
sub eax, 1
jnz short loc_401070





Calling it with int *arr properly generates this (the faster version):

var_C = dword ptr -0Ch
var_4 = dword ptr -4
arg_0 = dword ptr 4

push ecx
test eax, eax
push ebx
mov ebx, [esp+8+arg_0]
jle short loc_401035
mov [esp+8+var_4], eax
push ebp
nop

loc_401010:
test edi, edi
jle short loc_40102D
mov ecx, ebx
mov eax, esi
sub ecx, esi
mov edx, edi
lea esp, [esp+0]

loc_401020:
mov ebp, [ecx+eax]
mov [eax], ebp
add eax, 4
sub edx, 1
jnz short loc_401020

loc_40102D:
sub [esp+0Ch+var_4], 1
jnz short loc_401010
pop ebp

loc_401035:
pop ebx
pop ecx
retn
mtest endp




The slower version takes 12.4 seconds, the faster one 5.9.

But by disabling inlining with /Ob0, the second form is used, and it takes 5.9 seconds in both cases.

This is on an AMD processor.

The timer class I used is this; the measurement results are consistent, and also well out of the margin of error given the 2x difference in running time:

#include <windows.h>

class Timer
{
public:
	Timer(void);
	double time(void);

private:
	double frequency;
};

Timer::Timer(void)
{
	LARGE_INTEGER iFrequency;
	QueryPerformanceFrequency(&iFrequency);
	frequency = 1000.0 / iFrequency.QuadPart;
}

double Timer::time(void)
{
	LARGE_INTEGER counter;
	QueryPerformanceCounter(&counter);

	return counter.QuadPart * frequency;
}
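As an aside, a portable sketch of the same idea using std::chrono (which is C++11, so newer than this thread) might look like the following; it mirrors the Win32 version by returning milliseconds:

```cpp
#include <chrono>

// Portable high-resolution timer sketch; returns milliseconds elapsed
// since construction, mirroring the QueryPerformanceCounter-based Timer.
class ChronoTimer {
public:
    ChronoTimer() : start_(std::chrono::steady_clock::now()) {}

    // Milliseconds elapsed since construction.
    double time() const {
        std::chrono::duration<double, std::milli> d =
            std::chrono::steady_clock::now() - start_;
        return d.count();
    }

private:
    std::chrono::steady_clock::time_point start_;
};
```

steady_clock is preferred over system_clock for benchmarking because it is monotonic and unaffected by clock adjustments.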

You get an even faster version (in VS2005) if you change that to void mtest(char* __restrict arr1, char* __restrict arr2, int cnt1, int cnt2) - it compiles into an outer loop that just calls memcpy. Not sure why the compiler doesn't handle copy-loops of non-chars in the same way, though.

memcpy uses various tricks to copy quickly, like SSE2 and aligned-dword copying, so it's going to be faster than a straightforward hand-written loop in most cases.

Thanks for the replies. I guess what I'm trying to do is test the overhead of memcpy versus an inlined movsd in a somewhat realistic application. I think the nested loop confused the compiler.

Yes, SIMD instructions can be fast, but only on huge arrays.
memcpy has calling overhead (I looked at the asm; the intrinsic was not inlined for some reason).
memcpy assumes bytes, so it can't just do a movsd.
The __movsd intrinsic is good, but no SIMD. I find my int-copy for loops == __movsd.

Maybe a better way to test which is faster is to use a more realistic application. Mergesort on 32-bit ints? It has a couple of huge arrays (good for SIMD) and a great number of small copies (movsd might be better there).

Does anyone have a mergesort program they could test this on?

Thanks in advance.

Quote:
Original post by KodeWarrior
Maybe a better way to test which is faster is to use a more realistic application.


First rule of optimization: Don't do it.

A realistic application is this: Write your application. Profile it. Look for bottlenecks. Resolve them.

If one of these bottlenecks happens to be copying large amounts of memory, and it happens to be due to the copying routine, and only because it's not inlined, then you should try to fix it.

While these results are quite interesting, I went to check an application I'm working on. Despite pretty heavy memory traffic, copying is the last of my problems, and even in the places where I need to do it, the compiler has already done a good job of selecting the seemingly optimal solution.

But especially since this deals with somewhat unexpected compiler issues, profiling the final application will be the only thing that produces results in the end.

Guest Anonymous Poster
Again, show us your timing code. You likely have issues in it too, either with resolution or with not taking known pitfalls into account. Are you using a whole-program profiler? Why are you working on code that is likely not to be your game's bottleneck? If this isn't a game... well... there are other sites :)

Quote:
Original post by KodeWarrior
Using MS VS C++ 2005 release with all optimizations I could find

Input 1000000000 for first cin and 4 for second.

Uncomment line with heap memory for faster version.

Replacing inner loop with memcpy results in faster code. Windows doing stack checks behind my back but memcpy bypasses?

*** Source Snippet Removed ***


Edit:

I just count to get timings. If I can't see the difference, then there is no difference.
4 sec with loop and heap
11 sec with loop and local
6 sec with memcpy heap/local
I'm sorry, but that's an invalid benchmark.

Rule 1 of benchmarking is that it must not be possible for the compiler to remove any lines of code and still generate the same answer. As it stands, a clever compiler could remove the outer for loop and still produce the same result, because the same data is simply copied over multiple times and doing it once has the same effect. What the particular compiler actually did in this scenario is not as meaningful as you would think.

Secondly, you're copying data from uninitialised arrays, and hence the compiler could theoretically omit both loops and the copying and just spit out random values instead, since that is what the input is and it simply gets copied around anyway.

Thirdly, while you're declaring 8K of arrays, you're only actually performing a tiny copy of just four integers (you said to input 4).
In reality if you were copying such a small array many times over, then they would all come-from and/or go-to different places in RAM, meaning that the bottleneck would be on the main memory. This would mean that the slight difference in speed between the loops would be negligible compared to the latency of fetching memory etc.

The moral of the story is: don't test anything but real-world situations. Your mergesort benchmark idea is a great one. It should confirm the above, that memory fetching etc. becomes the bottleneck, and they will be about the same speed. I can donate an implementation of this if desired.
Alternatively, try entering 16 and 10000000 instead and see what difference there is then. You'll get a buffer overflow with the stack version, unless you make those statically allocated arrays larger and increase the stack to a ridiculous size.

As for the difference in code generation between the stack and heap versions, that seems to be due to aliasing - the compiler can tell that two stack variables are not overlapping, but it can't tell that the two pointers returned by new are not overlapping, and so it generates different code (rep movsd vs a read-write loop). If you add __restrict to the heap-allocated pointers, then it generates the same code as for the stack-allocated version.
