DMINATOR

native char* slower on VS2005?


OK, I am not making any assumptions here; this is just one test. I wanted to see what benefits VS2005 gives. The test is a simple check: it runs each function in a loop and measures the time taken. Here is the modified source:
#include <iostream>
#include <cstdio>   // sprintf
#include <cstring>  // strcpy
#include <windows.h>
//#include "tests.h"


//5 million times !
#define BIG_TESTS 5000000



#include <string>

using namespace std;

//a new copy is created - so original string is not changed
void String1(string str)
{
	str = "changed";
}

//a reference is used
void String2(string& str)
{
	str = "changed";

}


void String3(string* str)
{
	*str = "changed";
}


//char* buffer written through the pointer with a plain sprintf
void String4(char* str)
{

	sprintf(str,"changed");

}

//direct strcpy
void String5(char* str)
{
	strcpy(str, "changed");
}





LARGE_INTEGER before;
LARGE_INTEGER difference;
LARGE_INTEGER curtime;
LARGE_INTEGER freq;
unsigned long timepassed;
unsigned int fps;




//Store the milliseconds elapsed since the previous call in timepassed,
//then restart the measurement
void TimePassed()
{
	

	
	QueryPerformanceFrequency( &freq);
	
	double TimeScale = (1.0/freq.QuadPart)*1000.0;
	
	
	//Add the function in here
	
	QueryPerformanceCounter( &curtime);//end measure

	timepassed = (curtime.QuadPart-before.QuadPart)*TimeScale;


	QueryPerformanceCounter( &before);//begin measure

}



int main()
{
	cout << " Starting test. Number of loops= "<< BIG_TESTS  << endl << endl;

    //get cur time
	QueryPerformanceCounter(&before);
	QueryPerformanceCounter(&curtime);

	string temp = "testing";

	char temp2[50];
	sprintf(temp2,"testing");


	// Regular function with copy
	cout << "1 - (string str)"<< endl;
	TimePassed();

	for(int i = 0; i < BIG_TESTS; i++)
	{
		String1(temp);
	}

	TimePassed();
	cout << " -- passed "<< timepassed << " ms" << endl;



	//function with a reference to string
	cout << "2 - (string& str)"<< endl;
	TimePassed();

	for(int i = 0; i < BIG_TESTS; i++)
	{
		String2(temp);
	}

	TimePassed();
	cout << " -- passed "<< timepassed << " ms" << endl;



    // function with a pointer to string
	cout << "3 - (string* str)"<< endl;
	TimePassed(); 

	for(int i = 0; i < BIG_TESTS; i++)
	{
		String3(&temp);
	}

	TimePassed();
	cout << " -- passed "<< timepassed << " ms" << endl;



	//function with a pointer to char
	cout << "4 - (char* str)"<< endl;
	TimePassed(); 

	for(int i = 0; i < BIG_TESTS; i++)
	{
		String4(temp2);
	}

	TimePassed();
	cout << " -- passed "<< timepassed << " ms" << endl;


	//function with a pointer to char using strcpy
	cout << "5 - strcpy (char* str)"<< endl;
	TimePassed(); 

	for(int i = 0; i < BIG_TESTS; i++)
	{
		String5(temp2);
	}

	TimePassed();
	cout << " -- passed "<< timepassed << " ms" << endl;

	int a;
	cin >> a;
	return 0;
}

Well running it on VS6 gave:
Quote:
//Default optimisations:
1 - 3428
2 - 573
3 - 573
4 - 1997
//And using inline:
1 - 3429
2 - 568
3 - 524
4 - 1888
The results seem pretty good and logical to me. I also ran the test on the latest free BuilderX and got the following results:
Quote:
//Standard optimisations
1 - 2486
2 - 452
3 - 459
4 - 1741
5 - 93
Here are the results from VS2005 Express with the Multithreaded DLL runtime:
Quote:
//Standard optimisations, inline, or no optimisations: doesn't make
//much difference
1 - 1627
2 - 882
3 - 882
4 - 2859
5 - 93
It looks a little strange: the only speed gain is when passing a copy to the function; everywhere else there is a speed decrease of about 20-45%. Then I selected just Multithreaded and got even more interesting results:
Quote:
1 - 790
2 - 374
3 - 369
4 - 2778
5 - 13 (!)
Now this is an impressive improvement. So why is the difference between "Multithreaded" and "Multithreaded DLL" that big? What do you think? [Edited by - DMINATOR on December 4, 2005 9:36:00 AM]

That test is hardly fair; sprintf is not an efficient (or common) way to copy strings. At least use a "%s" format if you're going to do it. Try strcpy(str, "changed"); or memcpy(str, "changed", sizeof("changed")); instead.
Also, make sure you're in release mode with full optimizations and intrinsic functions enabled. It depends a bit on what you're actually measuring, but it may be a good idea to split the functions off into a separate module to prevent the compiler from optimizing away the entire loop.
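
A rough sketch of those two suggestions, not the original poster's code: three ways to write the same literal into a char buffer, plus a timing helper that reads the buffer back through a volatile variable on every iteration so the optimizer has a reason to keep the loop. The names `time_copies_ms` and `g_sink` are made up for illustration, and std::chrono (C++11, newer than the compilers discussed here) stands in for QueryPerformanceCounter so the sketch is not tied to Windows.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstring>

// Three ways to write "changed" into a caller-supplied buffer of at least 8 bytes.
void copy_sprintf(char* dst) { std::sprintf(dst, "%s", "changed"); }
void copy_strcpy(char* dst)  { std::strcpy(dst, "changed"); }
void copy_memcpy(char* dst)  { std::memcpy(dst, "changed", sizeof("changed")); }

// Hypothetical sink: a volatile write is an observable side effect,
// so the compiler has to keep the loop that feeds it.
volatile char g_sink;

// Runs `copy` `iterations` times and returns the elapsed wall time in milliseconds.
double time_copies_ms(void (*copy)(char*), int iterations)
{
    assert(iterations > 0);
    char buffer[50];
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        copy(buffer);
        g_sink = buffer[0];   // consume the result on every iteration
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Something like `time_copies_ms(copy_strcpy, 5000000)` would then give one number per variant. Going through a function pointer may also discourage inlining, which is similar in spirit to moving the functions into a separate module, though an optimizer is still free to see through it.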

Also, you should really test some string operations that are a bit more complicated than just copying strings around. std::string keeps track of its length directly, which can be a huge advantage for some operations.
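
To illustrate the length point, here is a small sketch with hypothetical helper names (`c_length`, `cpp_length`, `c_append`, `cpp_append`). With a C string the length has to be rediscovered by scanning for the terminator each time, while std::string reads a stored value, so operations such as appending in a loop tend to favour std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// C string: the length is not stored, so strlen walks to the terminating null.
std::size_t c_length(const char* s) { return std::strlen(s); }

// std::string: size() returns the length the object already tracks.
std::size_t cpp_length(const std::string& s) { return s.size(); }

// Appending to a C string: strcat must first scan dst to find its end.
void c_append(char* dst, std::size_t capacity, const char* suffix)
{
    assert(std::strlen(dst) + std::strlen(suffix) < capacity);
    std::strcat(dst, suffix);
}

// Appending to a std::string: the end position is already known.
void cpp_append(std::string& dst, const std::string& suffix) { dst += suffix; }
```

A benchmark built on these would show the gap growing with string length, which a fixed 7-character copy cannot reveal.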

Besides a malformed test, the second and third most likely contributors are that VS2005 doesn't include the single-threaded runtime and that it adds extra checks by default to help prevent and/or detect bugs such as buffer overflows.

OK, thank you. I modified the code a bit, and strcpy or memcpy does make a big difference.

But the most impressive effect I noticed was when changing the setting to just Multithreaded. Does anyone have any ideas about it? In VS6 it didn't make any difference at all.

Quote:
Original post by DMINATOR
OK, thank you. I modified the code a bit, and strcpy or memcpy does make a big difference.

But the most impressive effect I noticed was when changing the setting to just Multithreaded. Does anyone have any ideas about it? In VS6 it didn't make any difference at all.


When you use a DLL, the functions have an extra level of indirection due to the dynamic linking. Use your debugger to step through the code at the disassembly level and you'll see what I mean.
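
A minimal sketch of that indirection in portable C++, as an imitation rather than what the linker actually emits: `import_slot` is a hypothetical stand-in for an import-table entry, and `runtime_copy` stands in for a routine that lives in the runtime library.

```cpp
#include <cassert>
#include <cstring>

// Stand-in for a routine provided by the runtime library.
char* runtime_copy(char* dst, const char* src) { return std::strcpy(dst, src); }

// Static runtime: the target is known at link time, so the call is direct
// (and may even be inlined).
char* call_static(char* dst, const char* src) { return runtime_copy(dst, src); }

// DLL runtime, imitated: the address sits in a slot that is filled in when
// the DLL is loaded, and every call has to load it from there first.
typedef char* (*copy_fn)(char*, const char*);
copy_fn volatile import_slot = &runtime_copy;  // volatile: reloaded on each call

char* call_via_import(char* dst, const char* src) { return import_slot(dst, src); }
```

Both functions produce the same result; the second simply pays for one extra load per call and cannot be inlined, which matters when the call is made five million times in a tight loop.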

Well, about VC2005 performance: I have been doing some tests too, mainly with math functions (vector and matrix ops), and VC2005 produces SLOWER code than VC2003, not to mention SLOWER than Intel C++ 9.0. That's why I am still using VC2003 [smile]

There are some posts about VC2005 being slower than VC2003 on the MSDN forums:
http://forums.microsoft.com/MSDN/ShowPost.aspx?PageIndex=2&SiteID=1&PostID=128085&PageID=1

Oscar

Quote:
Original post by ogracian
Well, about VC2005 performance: I have been doing some tests too, mainly with math functions (vector and matrix ops), and VC2005 produces SLOWER code than VC2003, not to mention SLOWER than Intel C++ 9.0. That's why I am still using VC2003 [smile]

There are some posts about VC2005 being slower than VC2003 on the MSDN forums:
http://forums.microsoft.com/MSDN/ShowPost.aspx?PageIndex=2&SiteID=1&PostID=128085&PageID=1

Oscar
As is pointed out in that thread, you should examine and isolate the areas where the code is slower, and talk to someone at MS about it. That way, they can make the necessary changes for SP1 to handle these corner cases more effectively.

As for the OP's benchmarks, I'm seeing the following behaviors:

VS 8--------------------------------------------------

* Multithreaded DLL
Starting test. Number of loops= 5000000

1 - (string str)
-- passed 837.778 ms
2 - (string& str)
-- passed 333.692 ms
3 - (string* str)
-- passed 326.002 ms
4 - (char* str)
-- passed 1576.6 ms
5 - strcpy (char* str)
-- passed 7.51576 ms

*Multithreaded
Starting test. Number of loops= 5000000

1 - (string str)
-- passed 366.745 ms
2 - (string& str)
-- passed 182.604 ms
3 - (string* str)
-- passed 187.33 ms
4 - (char* str)
-- passed 1620.86 ms
5 - strcpy (char* str)
-- passed 5.05036 ms

VS 7--------------------------------------------------

*Singlethreaded
Starting test. Number of loops= 5000000

1 - (string str)
-- passed 329.72 ms
2 - (string& str)
-- passed 149.223 ms
3 - (string* str)
-- passed 157.478 ms
4 - (char* str)
-- passed 1000.62 ms
5 - strcpy (char* str)
-- passed 5.03248 ms

*Multithreaded
Starting test. Number of loops= 5000000

1 - (string str)
-- passed 349.87 ms
2 - (string& str)
-- passed 198.168 ms
3 - (string* str)
-- passed 159.792 ms
4 - (char* str)
-- passed 734.929 ms
5 - strcpy (char* str)
-- passed 5.14032 ms

*Multithreaded DLL
Starting test. Number of loops= 5000000

1 - (string str)
-- passed 572.189 ms
2 - (string& str)
-- passed 283.535 ms
3 - (string* str)
-- passed 274.563 ms
4 - (char* str)
-- passed 739.255 ms
5 - strcpy (char* str)
-- passed 5.00734 ms



VS7 flags: /Ox /Og /Ob2 /Oi /Ot /G7 /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /GF /FD /EHsc /arch:SSE2 /Fo"Release/" /Fd"Release/vc70.pdb" /W3 /nologo /c /Wp64 /Zi /TP /D_SECURE_SCL=0

VS8 flags: /Ox /Ob2 /Oi /Ot /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /GF /FD /EHsc /GS- /arch:SSE2 /fp:fast /GR- /Fo"Release\\" /Fd"Release\vc80.pdb" /W3 /nologo /c /Wp64 /Zi /TP /errorReport:prompt /D_SECURE_SCL=0

I've been looking at the assembly for the two, and it seems to be pretty much identical in all cases. Take a look at Test 5:

mov ecx, DWORD PTR ??_C@_07HADGPIEN@changed?$AA@+4
mov edx, DWORD PTR ??_C@_07HADGPIEN@changed?$AA@
mov eax, 5000000 ; 004c4b40H
$LL3@main:

; 159 :
; 160 : for(int i = 0; i < BIG_TESTS; i++)

sub eax, 1

; 161 : {
; 162 : String5(temp2);

mov DWORD PTR _temp2$[esp+256], edx
mov DWORD PTR _temp2$[esp+260], ecx
jne SHORT $LL3@main

; 163 : }
; 164 :



The only difference in the VS7 version is a call to npad 8 just before the loop starts. (What the hell is npad, by the way?) Notice that it simply assigns the string over and over. This is highly suspicious to me, since the optimizer should have dropped that loop completely.

Differences in the first four tests are almost certainly due to library implementation differences. I'm a little confused about 5 though.
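
Read back into C++, the loop in the Test 5 listing does roughly the following. This is a sketch for illustration only: `store_changed` is a hypothetical name, and `static_assert` and `<cstdint>` are C++11 conveniences that the compilers in this thread did not have.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof("changed") == 8, "7 characters plus the terminating null");

// Mirrors the listing: the literal is loaded once as two 32-bit words, and
// each iteration only stores those two words into the buffer again.
void store_changed(char* dst, int iterations)
{
    assert(iterations > 0);
    std::uint32_t low;
    std::uint32_t high;
    std::memcpy(&low,  "changed",     4);   // edx in the listing
    std::memcpy(&high, "changed" + 4, 4);   // ecx in the listing
    for (int i = iterations; i != 0; --i) {
        std::memcpy(dst,     &low,  4);     // mov DWORD PTR _temp2$[...], edx
        std::memcpy(dst + 4, &high, 4);     // mov DWORD PTR _temp2$[...+4], ecx
    }
}
```

Every iteration writes the same 8 bytes to the same place, so only the last pass has any visible effect, which is why the surviving loop looks like a missed optimization.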

Promit: One thing I notice in your command lines is that VS8 appears to be using unicode while VS7 is not. That small difference could cause significantly different string function implementation since unicode characters have different byte lengths (unless it's using UTF-32, which seems unlikely) which makes copying a string more complex than just searching for a 0 byte and copying the bytes up to that point.

The way the loop was only partially optimized is very strange, and you should probably send it off to MS so they can analyze it and maybe find the problem. Any explanation I can think of for a problem (such as the optimizer becoming confused about aliasing since pointers are everywhere despite the local reference graph being rather simple) should cause much less optimization than actually occurred.

Right, well, I actually disabled Unicode, and you can see that the copied string is 8 bytes long (7 chars and the null). It's done using two DWORD moves. I posted on the MSDN forums, and I know softies from the VS team roam there, so I'm hoping that somebody will have some insight tomorrow. Considering that the optimizer managed to inline the string assignment and replace the string copy with intrinsics, I'm amazed that it didn't manage to pull off the single most obvious optimization. Of course, I don't know that much about optimization theory, so maybe it's more difficult than I realized.

Hrmm, I guess VS8 sets the character-encoding defines internally, unlike previous versions.

You probably should have posted it in the "Visual C++ General" (or Language) forum, but it will probably be moved =-)

Quote:
Original post by Promit
It IS in the Visual C++ General forum.
Weird, could have sworn it was in VS General =-/

This is interesting; GCC doesn't seem to be able to remove dead loops either. I'd say it's some kind of safety "feature" to preserve delay loops and performance tests like these, except that sounds exceedingly misguided given today's heavily templated code.

Here's the best I could do (by enabling loop unrolling and setting the maximum number of iterations) without spending an eternity fiddling with all of GCC's optimization settings:
mov eax,4999999
.loop:
sub eax,32
jns .loop
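
If the goal really is to keep a delay loop or a benchmark loop alive, relying on the optimizer leaving it alone is fragile. One portable way, sketched here with hypothetical function names, is to make the counter volatile so that each access counts as observable behaviour.

```cpp
#include <cassert>

// An empty counting loop has no observable effect, so an optimizer is
// allowed to replace it with its final value.
int plain_loop(int iterations)
{
    assert(iterations >= 0);
    int i = 0;
    for (; i < iterations; ++i) { }
    return i;
}

// With a volatile counter every read and write must actually happen,
// so the loop has to execute all of its iterations.
int delay_loop(int iterations)
{
    assert(iterations >= 0);
    volatile int i = 0;
    for (; i < iterations; i = i + 1) { }
    return i;
}
```

Both return the same count; the difference only shows up in the generated code and in how long the call takes.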

An MS guy has graciously explained:
Quote:
Post by Jonathan Caves [MSFT]
Hi: I've talked to the optimizer guys here and they explained why this code is being generated.

1) If you remove the statement:

cout << temp2;

then by applying dead-store elimination the optimizer can remove all the preceding code, as there is no real use of temp2.

2) The reason why the loop is not being completely removed is that for most of the optimization process the compiler does not know what strcpy does: it cannot assume that you are calling the Standard version of strcpy (after all, at the link phase you may provide your own version of strcpy with completely different semantics). It is only when the compiler is generating the assembly code for the target platform that it replaces the call to strcpy with a platform-specific intrinsic, and at this stage it is too late to go back and re-run the global optimizer over the generated x86 (or x64, or IPF) assembly code.

We agree that this situation is not ideal, and the optimizer team is currently working on a new architecture which they hope will enable them to fix issues like this.
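
A minimal sketch of point 1, not the compiler team's test case: two hypothetical functions that differ only in whether the buffer is read after the loop. Whether a given compiler actually removes anything depends on its version and flags.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// The buffer is never read, so every store into it is dead. An optimizer
// that understands strcpy is free to drop the stores and then the loop.
void copies_never_read(int iterations)
{
    char buffer[50];
    for (int i = 0; i < iterations; ++i)
        std::strcpy(buffer, "changed");
}

// The buffer is returned to the caller, so at least the final store is
// live, much like the `cout << temp2;` statement mentioned in the quote.
std::string copies_then_read(int iterations)
{
    assert(iterations > 0);
    char buffer[50];
    for (int i = 0; i < iterations; ++i)
        std::strcpy(buffer, "changed");
    return std::string(buffer);
}
```

Comparing the generated assembly for the two functions is a quick way to see how far a particular compiler gets with this kind of cleanup.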

Yes, this seems pretty logical to me. I am glad this topic raised such a positive discussion.

But what about the speed difference I am getting with VS2005 vs VS6? Shouldn't the performance be equal to or better than the previous version? Or am I missing some compiler option? 882 is certainly not faster than the 573 I am getting with VS6.
