pifor106

Some comparisons of Generic/Native performances on various hardware

Hello,

Yesterday and today were the days I feared most in my integration work. Up to two days ago, I had only compiled and tested my code for the Win32/Win64 builds. Today I got to test on various hardware, making sure that everything built and ran correctly.

Aside from the PS3 compiler, which is really picky when it comes to templates, compiling was a breeze and everything in that area went pretty smoothly on the 3 platforms I have verified so far. I still have a few more to go, but I decided to do some speed comparisons between the native calling convention and the generic implementation.

I got some interesting results, which I haven't had time to fully analyze :

So, the test I ran is quite simple, and I totally agree that the test itself is bullshit and in no way a valid representation of what we actually do with scripts. What I wanted to stress was the cost of going back and forth between the script and the engine code, in both the native and generic configurations.

The script :

MyObject myObject;
for (int i = 0; i < 10000; ++i)
    myObject.MyMethod(1, 2, 3, 4);

I have a variant of the method for each argument count from 0 to 4. I wanted to test the impact of passing multiple arguments to the function.

The C++ code does nothing. I'm only calling a stub.


( All times in Microseconds )

PS3 - Generic
Run_Method_0_Arg_Benchmark 3219 4193 4279 3809 3841 4595
Run_Method_1_Arg_Benchmark 6219 4458 4253 4217 4258 4301
Run_Method_2_Arg_Benchmark 5236 9313 7844 9364 7836 7222
Run_Method_3_Arg_Benchmark 8617 6803 8987 6749 8985 9442
Run_Method_4_Arg_Benchmark 11033 8538 8185 8385 8301 8324

PS3 - Native
Run_Method_0_Arg_Benchmark 4467 4473 4735 7071 4556 4706
Run_Method_1_Arg_Benchmark 6600 9985 6824 9259 11174 6821
Run_Method_2_Arg_Benchmark 10554 10666 9207 8902 9115 9456
Run_Method_3_Arg_Benchmark 14863 11168 11513 11270 11395 11611
Run_Method_4_Arg_Benchmark 13036 21163 13558 13315 13410 13729

Well, judging from that, I feel inclined to force the generic calling convention on my setup for the PS3. I haven't investigated why the numbers vary so much, but that's something I'll need to have a look at.

I'm not entirely sure that we can execute the script code on the SPUs, unless we use them in a really restricted context. So I guess that the main scripting code for everything AI related will always be executed on the PPU. But anyway, that's another debate to have.

Xenon - Native
Run_Method_0_Arg_Benchmark 4355 4356 4352 4354 4355
Run_Method_1_Arg_Benchmark 5041 5021 5025 5045 5024
Run_Method_2_Arg_Benchmark 5776 5773 5770 5773 5773
Run_Method_3_Arg_Benchmark 6431 6430 6428 6425 6425
Run_Method_4_Arg_Benchmark 7029 7026 7026 7025 7026

Xenon - Generic
Run_Method_0_Arg_Benchmark 2570 2569 2570 2571 2566
Run_Method_1_Arg_Benchmark 3586 3586 3586 3587 3590
Run_Method_2_Arg_Benchmark 5107 5100 5095 5094 5114
Run_Method_3_Arg_Benchmark 6979 6976 6980 6976 6991
Run_Method_4_Arg_Benchmark 9373 9336 9332 9339 9372

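Now that's consistency. Once again, the generic calling convention seems to be a winner here, but with a catch: the more parameters, the smaller the gain. With 3 and 4 arguments, generic is actually slower than native (about 6980 vs 6430 and about 9350 vs 7026). Any idea as to what might cause that?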
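One way to look at the Xenon question is the cost added by each extra argument. The sketch below takes the first run of each Xenon table and computes the difference between consecutive argument counts. The array names and the Increments helper are made up for this illustration.

```cpp
#include <array>

// First-run timings in microseconds, copied from the Xenon tables above,
// indexed by argument count 0..4.
const std::array<int, 5> kXenonNative  = {{4355, 5041, 5776, 6431, 7029}};
const std::array<int, 5> kXenonGeneric = {{2570, 3586, 5107, 6979, 9373}};

// Hypothetical helper: time added by going from i to i + 1 arguments.
inline std::array<int, 4> Increments(const std::array<int, 5>& t)
{
    std::array<int, 4> d = {{t[1] - t[0], t[2] - t[1], t[3] - t[2], t[4] - t[3]}};
    return d;
}
```

On these numbers the native increments stay roughly flat (686, 735, 655, 598), while the generic ones keep growing (1016, 1521, 1872, 2394). That pattern would fit a per-argument cost that depends on the argument's position, though this is only a reading of the table, not a measurement.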

Wii - Generic
Run_Method_0_Arg_Benchmark 4088 4092 4090 4087 4089
Run_Method_1_Arg_Benchmark 6555 6557 6551 6551 6551
Run_Method_2_Arg_Benchmark 9447 9454 9452 9451 9446
Run_Method_3_Arg_Benchmark 13307 13294 13304 13298 13297
Run_Method_4_Arg_Benchmark 17695 17698 17703 17700 17703

Ok. I guess, don't use more than 2 parameters :) That's actually a good thing ;) Now I have something to show to those darn programmers passing a trillion parameters to their functions!

Anyway, that's the Wii. So yeah, that's the wii...

Finally Win32 on my war machine :

Native
Run_Method_0_Arg_Benchmark 691 649 640 670 649
Run_Method_1_Arg_Benchmark 722 666 691 665 664
Run_Method_2_Arg_Benchmark 757 691 691 691 692
Run_Method_3_Arg_Benchmark 801 787 727 789 728
Run_Method_4_Arg_Benchmark 973 889 898 949 894

Generic
Run_Method_0_Arg_Benchmark 733 685 716 653 709
Run_Method_1_Arg_Benchmark 863 797 794 795 853
Run_Method_2_Arg_Benchmark 1206 1019 1098 1097 1098
Run_Method_3_Arg_Benchmark 1598 1546 1444 1585 1445
Run_Method_4_Arg_Benchmark 2197 2021 1996 2098 1995

Here the native calling convention really shines. I wondered why, so I hooked up a profiler to have a look at what was going on underneath the generic calling convention. I would really like to attach a screenshot here to show the call graph, but it seems that it's not possible, so you'll have to picture it.

In all of my tests combined, I made exactly 50049 calls to asCContext::CallGeneric, spending 38779.59 microseconds in that code. In turn, those calls made 100055 calls to GetAddressOfArg, in which I spent 14202.64 microseconds.

Therefore, I spent about 37% of the CallGeneric time (14202.64 / 38779.59) in this function, which is quite a lot.

If I dig further into GetAddressOfArg, I spend 75.63% of my time in GetSizeOnStackDWords, which is called a shit load of times as well. Now, I really wonder whether something could be done there to save some time.
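For anyone who wants to redo the arithmetic, here is a small sketch using only the profiler figures quoted above. It assumes the reported times are inclusive (time in CallGeneric includes time spent in GetAddressOfArg). The constant and function names are invented for the example.

```cpp
// Figures quoted from the profiler run above.
const double kCallGenericCalls       = 50049.0;
const double kCallGenericUs          = 38779.59;
const double kGetAddressOfArgCalls   = 100055.0;
const double kGetAddressOfArgUs      = 14202.64;
const double kSizeShareInsideGetAddr = 0.7563; // GetSizeOnStackDWords within GetAddressOfArg

// Fraction of CallGeneric time spent in GetAddressOfArg.
inline double GetAddressOfArgShare()
{
    return kGetAddressOfArgUs / kCallGenericUs;
}

// Fraction of CallGeneric time attributable to GetSizeOnStackDWords,
// assuming the 75.63% applies to the whole GetAddressOfArg time.
inline double SizeLookupShareOfCallGeneric()
{
    return kSizeShareInsideGetAddr * GetAddressOfArgShare();
}

// Average number of GetAddressOfArg calls per CallGeneric call.
inline double ArgLookupsPerCall()
{
    return kGetAddressOfArgCalls / kCallGenericCalls;
}
```

Under that assumption, GetSizeOnStackDWords alone would account for a bit over a quarter of the CallGeneric time, and the average of about two argument lookups per call matches a benchmark that runs 0 to 4 arguments equally often.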

I suspect two things :

1) Virtual function calls inducing indirection and cache misses. In my case cache misses are unlikely, since the function pointer should be in cache, but there is still an indirection in the assembly code to look up the final function pointer.

2) The offset computation loop. I think that's the major problem.

Here are some changes that I'll try tomorrow :

First, I think that instead of always calling GetAddressOfArg, the following code could be executed in the functions generated by auto-wrapper :

...

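void* args[PARAM_COUNT];
// Proposed batch accessor (not in the library yet): one virtual call fills all addresses.
gen->GetAddressOfArgs(args, PARAM_COUNT);

// For by-value arguments, args[i] points at the value, so cast to a pointer and dereference.
(*func)(*static_cast<Arg0*>(args[0]), ...);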

...

Here the nice thing is that we only make a single call to a virtual function, and we can simply walk all the parameters from 0 to n, accumulating the offset and storing each pointer into args. We can also ditch the if that checks the argument index for overflow. Since we know for a fact that we are going to get ALL arguments, it doesn't make sense to call GetSizeOnStackDWords 10 times over.

I.e.:

If you have 3 parameters, then you call GetSizeOnStackDWords 1 + 2 + 3 = 6 times, when we could actually call it 3 times. The funny thing is that it's actually easy to get an O(N) algorithm here.
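The difference between the two strategies can be sketched with a toy model. ParamType, SizeOnStackDWords, OffsetOfArg and OffsetsOfAllArgs are hypothetical names for this illustration, not the engine's own code, and the exact lookup counts depend on how the real loop is written.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a parameter type: only its size on the
// argument stack (in 32-bit words) matters here.
struct ParamType {
    int sizeOnStackDWords;
};

// Counts size lookups so the two strategies can be compared.
static int g_sizeLookups = 0;

inline int SizeOnStackDWords(const ParamType& p)
{
    ++g_sizeLookups;
    return p.sizeOnStackDWords;
}

// Per-argument lookup: re-walks all preceding parameters on every call.
inline int OffsetOfArg(const ParamType* params, std::size_t index)
{
    int offset = 0;
    for (std::size_t i = 0; i < index; ++i)
        offset += SizeOnStackDWords(params[i]);
    return offset;
}

// Batch lookup: a single pass that carries a running offset.
inline void OffsetsOfAllArgs(const ParamType* params, std::size_t count, int* outOffsets)
{
    assert(params != nullptr && outOffsets != nullptr);
    int offset = 0;
    for (std::size_t i = 0; i < count; ++i) {
        outOffsets[i] = offset;
        offset += SizeOnStackDWords(params[i]);
    }
}
```

In this model, fetching four arguments one by one costs 0 + 1 + 2 + 3 = 6 size lookups, while the batch pass costs 4, and the gap widens quadratically as the parameter count grows.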

I was also wondering if the offsets could be cached in sysFunction? I haven't looked at the code that much, but my guess is that it could also be a possibility. Correct?
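A rough sketch of what caching the offsets could look like, assuming they can be computed once when the function is registered. FunctionDesc, CacheOffsets and AddressOfArg are hypothetical names, and the stack is modelled as an array of 32-bit unsigned words.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the engine's per-function description.
struct FunctionDesc {
    std::vector<int> paramSizesDWords;   // size of each parameter on the stack
    std::vector<int> paramOffsetsDWords; // filled once by CacheOffsets()

    // Meant to run once at registration time, not on every call.
    void CacheOffsets()
    {
        paramOffsetsDWords.resize(paramSizesDWords.size());
        int offset = 0;
        for (std::size_t i = 0; i < paramSizesDWords.size(); ++i) {
            paramOffsetsDWords[i] = offset;
            offset += paramSizesDWords[i];
        }
    }
};

// The per-call lookup becomes a single indexed read.
inline void* AddressOfArg(const FunctionDesc& desc, unsigned* stackPointer, std::size_t index)
{
    assert(index < desc.paramOffsetsDWords.size());
    return &stackPointer[desc.paramOffsetsDWords[index]];
}
```

The trade-off is a small amount of extra memory per registered function in exchange for removing the loop from the call path entirely.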

Anyway, it's getting late, and this is already a pretty long post. The thing to remember is, using the generic calling convention along with the auto-wrapper can really speed up stuff in some cases, and I see lots of areas that can easily be optimized.

Also, I know that my test is a bit bogus and really not that representative of a real-world scenario.

As always, I'll send the modifications once they're made and tested on my end.

Have a nice evening,

Pierre

I always suspected that the generic calling convention would be faster than the native calling convention on some systems, given the sheer number of dynamic decisions that need to be made to juggle the arguments into different CPU registers. On Win32 it is a different matter, as all arguments are pushed on the stack, which is just a simple copy. It's good to finally get some numbers on this.

There is definitely a lot of room for optimizations, both on the generic calling convention and the native calling conventions. I'll gladly accept contributions in that regard.

Hi again,

I made some optimizations today, around 30 to 45% better in most cases, and I have some ideas as to what could be done to further optimize the generic calling convention, when using it through the auto-wrapper add-on.

Basically, I did implement the
I have some pretty decent numbers now :

PS3 (Quite good improvements)

Run_Method_0_Arg_Benchmark 3261 3370 3110 3330
Run_Method_1_Arg_Benchmark 4203 4235 4221 6577
Run_Method_2_Arg_Benchmark 5671 6005 5713 6320
Run_Method_3_Arg_Benchmark 6427 7395 10290 6431
Run_Method_4_Arg_Benchmark 11124 10818 7092 11230

Xenon (Most improvements were in the 3/4 arguments)

Run_Method_0_Arg_Benchmark 2568 2570 2570 2568
Run_Method_1_Arg_Benchmark 3521 3523 3517 3523
Run_Method_2_Arg_Benchmark 5273 5275 5275 5277
Run_Method_3_Arg_Benchmark 6324 6324 6327 6365
Run_Method_4_Arg_Benchmark 7350 7307 7303 7303

Wii (Dramatic improvements)

Run_Method_0_Arg_Benchmark 4036 4034
Run_Method_1_Arg_Benchmark 6904 6868
Run_Method_2_Arg_Benchmark 8940 8943
Run_Method_3_Arg_Benchmark 10975 10968
Run_Method_4_Arg_Benchmark 12953 12959

Win32 (Improved against generic, but still slower than native convention)

Run_Method_0_Arg_Benchmark 502 444
Run_Method_1_Arg_Benchmark 698 694
Run_Method_2_Arg_Benchmark 936 945
Run_Method_3_Arg_Benchmark 1285 1193
Run_Method_4_Arg_Benchmark 1448 1492

I'll send you the changes, probably tonight.

I have some left to do, but I need your opinion on them.

There are 2 major offenders left :

- asCDataType::GetMemorySizeInBytes() :

I have optimized it a bit by returning early if the data type is a float or an int, as I suspect that most values are passed as those data types. In any case, people nowadays should really use ints or floats, as they are faster than their short or byte counterparts. This saves quite a bit.

Actually, I would like to have the size of the type cached in asCDataType and returned through an inline function. Currently, calling asCDataType::GetStackSizeInDWords() takes about 20% of the overall benchmarking time. My guess is that we can bring that down to around 0.1% by caching the value and returning it through an inlined method. I want to try that as well, but since the tokenType value passed to the object is set from various places, I fear that I'll forget one and screw things up :)
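A minimal sketch of the caching idea, on a reduced model rather than the real asCDataType. The DataType class, the PrimitiveToken enum and the size rule are all assumptions for illustration. Routing every type change through one setter is one way to avoid the "forgot one place" problem.

```cpp
// Hypothetical, reduced set of primitive tokens.
enum PrimitiveToken { ptInt8, ptInt16, ptInt, ptFloat, ptInt64, ptDouble };

// Hypothetical, reduced model of a data-type descriptor.
class DataType {
public:
    explicit DataType(PrimitiveToken t) { SetTokenType(t); }

    // Single choke point for changing the type, so the cache cannot go stale.
    void SetTokenType(PrimitiveToken t)
    {
        tokenType = t;
        sizeOnStackDWords = ComputeSizeOnStackDWords(t);
    }

    // Hot path: an inlined read of the cached value.
    int GetSizeOnStackDWords() const { return sizeOnStackDWords; }

private:
    static int ComputeSizeOnStackDWords(PrimitiveToken t)
    {
        // Assumes 64-bit primitives take two dwords and everything else one.
        return (t == ptInt64 || t == ptDouble) ? 2 : 1;
    }

    PrimitiveToken tokenType;
    int sizeOnStackDWords;
};
```

If the real class exposes the token as a public member that is assigned directly, making it private and adding the setter would turn every missed update into a compile error rather than a silent stale cache.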

- asCDataType::IsObject()

I have reduced the time taken by removing the call to IsEnumType and contracting the expression into a single return objectType && !(objectType->flags & asOBJ_ENUM); statement. This saved quite a lot, but the non-inlined call itself takes precious cycles away (about 5% in my benchmark). I wanted to move this call inline, but I failed to do so because of circular dependencies in the .h files. Ok, I'll admit, I'm a lazy bastard and I didn't try that hard to fix the circular dependencies because I was quite in a hurry :(
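To check that contracting the expression keeps the same behaviour, here is a small stand-alone model. ObjectInfo, TypeRef and kObjEnumFlag are hypothetical, and the two-step form is only a guess at the shape of the code before the change.

```cpp
// Hypothetical flag value, standing in for the engine's enum flag.
const unsigned kObjEnumFlag = 0x1000;

struct ObjectInfo {
    unsigned flags;
};

struct TypeRef {
    const ObjectInfo* objectType; // null for primitives

    bool IsEnumType() const
    {
        return objectType != nullptr && (objectType->flags & kObjEnumFlag) != 0;
    }

    // Assumed two-step form: a separate call to IsEnumType().
    bool IsObjectTwoStep() const
    {
        if (IsEnumType())
            return false;
        return objectType != nullptr;
    }

    // Contracted form: logical && short-circuits, so flags is only read
    // when objectType is non-null.
    bool IsObjectContracted() const
    {
        return objectType != nullptr && !(objectType->flags & kObjEnumFlag);
    }
};
```

Note that the contraction relies on the logical && operator. A bitwise & would not short-circuit and would dereference a null objectType for primitive types.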

Apart from that, virtual function calls hurt, especially on PS3, but I don't really see what can be done about that. I'll try to cast the generic interface pointer to the asCGeneric type and implement non-virtual function calls for GetArg**** and GetReturn**** to see if that improves things. Frankly, I think that solution sucks, but since I'm in generated code ( aswrappedcall.h ), I'll live with it if I save an extra 5-10% there... but I won't tell anyone I did so ;)
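The cast trick could look roughly like this. IGenericCall and GenericCallImpl are hypothetical stand-ins for the public generic interface and its concrete class, and GetAddressOfArgFast is an invented name for the non-virtual twin.

```cpp
// Hypothetical interface, standing in for the public generic interface.
class IGenericCall {
public:
    virtual ~IGenericCall() {}
    virtual void* GetAddressOfArg(unsigned index) = 0;
};

// Hypothetical concrete class, standing in for the engine's implementation.
class GenericCallImpl : public IGenericCall {
public:
    explicit GenericCallImpl(void** argAddresses) : args(argAddresses) {}

    // Virtual entry point kept for normal users of the interface.
    virtual void* GetAddressOfArg(unsigned index) { return GetAddressOfArgFast(index); }

    // Non-virtual twin: no vtable lookup, and the compiler can inline it.
    void* GetAddressOfArgFast(unsigned index) { return args[index]; }

private:
    void** args;
};

// What a generated wrapper could do. The static_cast is only valid if the
// object behind the interface really is a GenericCallImpl.
inline void* FastArg(IGenericCall* gen, unsigned index)
{
    return static_cast<GenericCallImpl*>(gen)->GetAddressOfArgFast(index);
}
```

The cost is that the wrapper now depends on the concrete class, so it breaks if another implementation of the interface is ever passed in.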

Interesting stuff! Hopefully this will eventually get merged into AS along with http://www.gamedev.net/community/forums/topic.asp?topic_id=583171
The necessity to provide AS with methods and functions for both calling conventions is quite burdensome.

Hi Vicious,

Thanks. I'm figuring out the best way to send all of that to WitchLord. Hopefully I'll do it in the near future, but I'm in a bit of a rush this week, as I learned that I must leave my studio for two weeks to give some training at another studio halfway around the world.

In any case, implementing the auto-registration system is pretty straightforward, so if you can't wait I guess it'll take you 2 or 3 days to do so.

Pierre
